ForesightEval Protocol
When AI writes a scenario analysis for your board, how do you know it's any good? ForesightEval is the protocol we built to answer that question — seven measurable dimensions that separate foresight you can stake a decision on from analysis that merely reads well.
The problem
Models produce authoritative prose that reads like strategy. But fluency is a surface property — it tells you nothing about whether the causal reasoning holds up.
Existing benchmarks score isolated predictions. Foresight is a different discipline — its value lies in stress-testing strategy against multiple futures, not calculating the probability of one.
Modern AI models are trained to be helpful. That training teaches them to agree, avoid discomfort, and default to consensus. For risk management, where the entire point is naming uncomfortable truths, this is a structural failure.
Our approach
It is simple to score a model’s probability estimate against what actually happened. It is hard to score whether a scenario is coherent, whether it surfaces the disruption a board hasn’t considered, or whether it translates into action within ninety days. ForesightEval does the hard version, because the easy version is not what strategy teams actually need.
The most dangerous AI foresight is the kind that quietly agrees with the strategy already on the table. ForesightEval explicitly scores whether a model named the uncomfortable scenario, challenged the assumption, or blinked. Analysis that only confirms what leadership already believes does not pass the bar.
A quality metric you cannot audit is not a quality metric. Every ForesightEval score breaks down into its seven dimensions, each dimension into its evidence, and each piece of evidence to its source. Scenarios inherit the same discipline through Bayesian anchoring (Tetlock, Shell, IPCC): probabilities move only when a signpost is triggered or a materially new claim arrives, never because the model was simply run again.
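To make the anchoring rule concrete, here is a minimal sketch of a signpost-gated update. It uses the textbook Bayes rule; the page does not specify ForesightEval's actual update formula, so the `Signpost` type, its `likelihoods` field, and both function names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Signpost:
    """A hypothetical observable event that discriminates between scenarios."""
    name: str
    # Assumed input: P(signpost observed | scenario) for each scenario name.
    likelihoods: Dict[str, float]


def update_on_signpost(priors: Dict[str, float], signpost: Signpost) -> Dict[str, float]:
    """Apply Bayes' rule once: posterior is proportional to prior * likelihood."""
    unnormalized = {
        scenario: p * signpost.likelihoods.get(scenario, 1.0)
        for scenario, p in priors.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("signpost is impossible under every scenario")
    return {scenario: v / total for scenario, v in unnormalized.items()}


def revise(priors: Dict[str, float], triggered: List[Signpost]) -> Dict[str, float]:
    """Move probabilities only for triggered signposts.

    With no triggered signposts the priors are returned unchanged, which is
    the point of anchoring: re-running the analysis alone changes nothing.
    """
    posterior = dict(priors)
    for signpost in triggered:
        posterior = update_on_signpost(posterior, signpost)
    return posterior
```

Calling `revise(priors, [])` returns the priors untouched, while each triggered signpost shifts weight toward the scenarios under which it was more likely, and the probabilities still sum to 1.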
In practice
ForesightEval currently runs as the internal quality layer on every Future Space DSGHT.ai publishes. The score is calculated before release, visible on the analysis page, and decomposable to the per-dimension level — so the quality claim can be audited against the evidence.
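One way to picture a score that decomposes from dimension to evidence to source is the sketch below. The class and field names are assumptions chosen for illustration, not the DSGHT.ai schema, which this page does not publish.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    """One claim backing a dimension score (hypothetical structure)."""
    claim: str
    source: Optional[str]  # None marks a claim with no traceable source


@dataclass
class DimensionScore:
    """A single dimension with the evidence its score rests on."""
    name: str
    score: float
    evidence: List[Evidence] = field(default_factory=list)


def unsourced_claims(dimensions: List[DimensionScore]) -> List[Tuple[str, str]]:
    """Return (dimension, claim) pairs that cannot be traced to a source."""
    return [
        (dim.name, ev.claim)
        for dim in dimensions
        for ev in dim.evidence
        if not ev.source
    ]
```

An audit then amounts to walking this structure: any pair returned by `unsourced_claims` is a point where the quality claim cannot be checked against its evidence.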
This is not yet a cross-model benchmark — that track opens with the first retrospective backtests later in 2026. What follows is the standard DSGHT.ai holds its own production work to, published openly rather than kept internal.
Strategic Anticipation Quotient
8.6 / 10
| Dimension | Score | Note |
|---|---|---|
| Scenario Quality | 9.0 | Structurally distinct 2×2 matrix, probabilities sum to 100% |
| Epistemic Grounding | 10.0 | Historical analogies, complete structural consistency |
| Unpalatable Truths | 10.0 | Sovereign Algocracy scenario directly challenges comfort zone |
| Weak Signal Detection | 7.8 | Relies on well-publicised cases; fringe signals underrepresented |
| Actionability | 9.0 | Tension-linked recommendations tied to regulatory milestones |
| Living Foresight | 7.5 | Static probabilities; no temporal metadata or signpost tracking |
| Explainability | 7.0 | Claims metadata missing from artifact; citations unverifiable |
Scored by the DSGHT.ai internal pipeline. Cross-model scoring, human-vs-AI comparison, and retrospective backtests are on the 2026 roadmap.
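The headline figure can be checked against the table. The page does not state how the seven dimensions are weighted, so the sketch below assumes a plain unweighted mean; under that assumption the dimension scores reproduce the published 8.6. The function name is illustrative.

```python
from typing import Dict

# Dimension scores copied from the table above.
DIMENSION_SCORES: Dict[str, float] = {
    "Scenario Quality": 9.0,
    "Epistemic Grounding": 10.0,
    "Unpalatable Truths": 10.0,
    "Weak Signal Detection": 7.8,
    "Actionability": 9.0,
    "Living Foresight": 7.5,
    "Explainability": 7.0,
}


def strategic_anticipation_quotient(scores: Dict[str, float]) -> float:
    """Aggregate dimension scores, assuming equal weights (an assumption)."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    return sum(scores.values()) / len(scores)


saq = strategic_anticipation_quotient(DIMENSION_SCORES)
print(f"{saq:.1f} / 10")  # prints "8.6 / 10"
```

If the real pipeline weights dimensions differently, only the body of `strategic_anticipation_quotient` would change; the per-dimension inputs stay auditable either way.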