Predicting the Future: Harnessing the Power of Probabilistic Judgements Through Forecasting Tournaments

April 29, 2025

From the threat of nuclear war to rogue superintelligent AI to future global pandemics to climate catastrophes, the world is riddled with threats that are both urgent and hard to accurately anticipate. These high-stakes questions are where our traditional data-driven models fail. There is often no historical precedent, no baseline data to go off of, and no obvious way to simulate this future world. So what can we do in the pressing cases when data fails but we still need to be able to anticipate future events?

While many domains still rely heavily on the opinions of credentialed experts – pundits, analysts, and consultants – an alternative solution has gained traction: engineering wisdom from crowds. Scholars of collective intelligence have long posited that the aggregation of diverse, independent opinions can often outperform even the most sophisticated and knowledgeable individual experts, particularly in highly uncertain domains. Moreover, the fallacies and biases of expertise have become increasingly apparent across a wide range of fields. Recent work in political science, for example, has shown that experts often overestimate the likelihood of democratic erosion and struggle to predict which political messages will resonate with the public, revealing a potential ceiling to expert intuition (or human intuition, more broadly).

 "Probability"

Figure 1. Experts consistently overestimated the likelihood of democratic breakdown in the U.S. across all surveyed periods, suggesting potential biases in elite judgment under uncertainty. Source: Andrew Little and Bright Line Watch (2024)

Figure 2. Each point in the plot represents a single political message, with the true persuasive effect of the message on the x-axis (based on experimental data) and the average predicted effectiveness on the y-axis. The lack of correlation within each group indicates that both the mass public and political practitioners (experts) failed to predict how persuasive these messages would be. Source: Broockman et al. (2024)

But crowds have biases too. They are susceptible to groupthink, where conformity suppresses non-dominant opinions; preference falsification, where individuals misrepresent their true beliefs to align with perceived norms; and herding, where individuals lean on cues from others rather than their own information.

One way around these biases, however, is a structured and disciplined approach to eliciting and aggregating judgments from a crowd: forecasting tournaments.

Forecasting tournaments are contests among individuals or teams in which participants make specific, quantifiable probabilistic predictions about objectively resolvable questions, are evaluated with rigorous scoring metrics, and are typically financially incentivized to be accurate.

By asking questions like “How many job openings in the US will the Bureau of Labor Statistics (BLS) report for July 2025?” or “What percentage of Arizona will be facing severe drought or worse on July 5th, 2025?” and providing structured environments for participants to reason and deliberate about how these questions might resolve, these tournaments incentivize accuracy and belief updating, and they enable us to tap into the strengths of collective intelligence while minimizing its well-known pitfalls. In doing so, they don't just identify who, in the aggregate, is right more often; they help build intelligence systems that make better judgments possible.

History of Forecasting Tournaments

The origins of modern forecasting tournaments trace back to the Delphi method, also known as the Estimate-Talk-Estimate method. This forecasting technique was developed by the RAND Corporation during the Cold War as a structured way to elicit and aggregate expert predictions about the potential impact of novel technology on global conflict, and it has since been applied to many other policy-making domains. Designed to counter the shortcomings of traditional forecasting methods that rely on statistical modeling or public deliberation, the Delphi method has an anonymous panel of domain experts go through four steps: (1) each expert provides an individual, anonymous forecast; (2) a facilitator aggregates the responses and distills the information back to participants; (3) the experts update their forecasts in light of that information; and (4) the results (which usually converge) are aggregated once more to produce an “expert group consensus” that can then be acted on.
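The loop below is a minimal Python sketch of those four steps. The choice of the median as the facilitator's summary statistic and the simple rule for how experts move toward it are illustrative assumptions; in the actual Delphi procedure, how each expert revises is left to their own judgment.

```python
# A minimal sketch of the Delphi (Estimate-Talk-Estimate) loop described above.
# The median summary and the 70/30 update rule are assumptions for illustration.
import statistics

def delphi_round(forecasts, summary):
    """One round: each expert sees the group summary and nudges their estimate toward it."""
    return [0.7 * f + 0.3 * summary for f in forecasts]

def run_delphi(initial_forecasts, n_rounds=3):
    forecasts = list(initial_forecasts)
    for _ in range(n_rounds):
        summary = statistics.median(forecasts)   # facilitator distills responses
        forecasts = delphi_round(forecasts, summary)
    return statistics.median(forecasts)          # final "expert group consensus"

# Example: five experts give initial probability estimates for some event.
print(run_delphi([0.10, 0.25, 0.30, 0.60, 0.80]))
```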

Soon after this, forecasting tournaments became formalized in academic circles as they made their way into the toolkits of disciplines like management, psychology, decision science, and intelligence. The work of psychologists Amos Tversky and Daniel Kahneman drew heavily on the intuitions and frameworks underlying the Delphi method, and forecasting tournaments more broadly. 

In 2011, the U.S. Office of Incisive Analysis, a branch of the Intelligence Advanced Research Projects Activity (IARPA), launched a project titled Aggregative Contingent Estimation (ACE) aimed at increasing “the accuracy, precision, and timeliness of intelligence forecasts for a broad range of event types, through the development of advanced techniques that elicit, weight, and combine the judgments of many intelligence analysts." The project consisted of a forecasting tournament, open to the U.S. intelligence community, to identify cutting-edge methods for forecasting geopolitical events. A team out of the University of Pennsylvania, led by forecasting-research pioneers Philip Tetlock and Barbara Mellers and dubbed the Good Judgment Project, developed a method that won the competition, beating even teams of expert analysts with access to classified information. The team recruited thousands of participants to make forecasts, used interventions grounded in key social-science findings to identify and train exceptional forecasters, and experimented with different ways to aggregate forecasts. This team sparked a resurgence of interest in forecasting tournaments by demonstrating that well-calibrated crowds could yield important insights and predictions about high-stakes future events.

As forecasting made its way into economics, so did the now popular concept of the “prediction market,” with Robin Hanson among the pioneering figures in conceptualizing markets as sociotechnical systems of information aggregation. With the advancement of communication technology, contemporary platforms like Manifold, Metaculus, and Kalshi have been able to scale up the idea of forecasting tournaments and prediction markets by harnessing real-time participation from thousands of forecasters around the world across domains. In these markets, participants can wager real money on the outcomes of future events, and they have often beaten experts and analysts in many domains, including predicting the winners of elections, COVID rates, and innovations in AI.

Scoring Forecasts

Scoring metrics are central to determining the accuracy of probabilistic forecasts. The most common way of assessing predictive accuracy (or rather, error) is through the calculation of a Brier Score. The Brier Score measures the mean squared difference between predicted probabilities and actual outcomes and is defined for multiple binary events as:

BS = (1/N) * ∑^{N}_{t=1} (f_t - o_t)^2

where f_t is the forecasted probability (between 0 and 1), o_t is the binary outcome (1 if the event happened, 0 if it did not), and N is the number of forecasts. The more accurate a forecaster is, the lower their Brier Score will be across a series of forecasts.
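To make the scoring rule concrete, here is a minimal Python sketch of the Brier Score as defined above; the forecasts and outcomes in the example are invented for illustration.

```python
# Brier Score: mean squared difference between probabilistic forecasts and binary outcomes.
def brier_score(forecasts, outcomes):
    """Lower is better; 0 means perfect foresight, 0.25 is what always guessing 50% earns."""
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A forecaster who said 90%, 20%, and 70% for three events, of which the
# first and third actually occurred:
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047 -- lower is better
```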

While the Brier Score remains the standard for many forecasting tournaments, other metrics include the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and more novel elicitation metrics like reciprocal scoring – all of which emphasize distinct aspects of what makes an accurate forecast and what might constitute “good judgment.”

Determinants of Accurate Forecasters and Forecasts

There have been many attempts to unpack why certain forecasters and forecast aggregations are consistently more accurate than others. Two main explanations concern the individual-level psychological profiles of certain forecasters and the setup of forecasting tournaments themselves. Tetlock and Mellers' Good Judgment Project was, in part, famous for identifying a group of “superforecasters” – individuals who consistently outperformed others in forecasting tournaments – and for characterizing their traits. They found that these forecasters scored high on fluid intelligence, political knowledge, and actively open-minded thinking; treated beliefs as hypotheses to be tested, not identities to defend; and were better at inductive reasoning, pattern detection, and cognitive flexibility. They're more granular and precise in their probability estimates, more sensitive to scope and logic, and less prone to cognitive biases. They also update forecasts often, read widely, and refine their judgments through discussion. It's the combination of a particular cognitive profile, rigorous habits, and collaborative environments that sets them apart. Using these findings, the GJP and others have since created “debiasing programs” aimed at training people to become better forecasters.

Beyond individual-level traits, the design of the forecasting tournament also plays a key role in how accurate aggregate forecasts are. Different tournament structures, like adversarial collaboration that encourages healthy debate, deliberative processes that surface diverse viewpoints, and sophisticated aggregation algorithms that weight forecasters based on past performance, can significantly enhance aggregate accuracy.
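As one concrete illustration of performance-based weighting, here is a minimal sketch in which forecasters with lower historical Brier Scores receive more weight in the crowd forecast. The inverse-Brier weighting scheme and the example numbers are assumptions for illustration, not a description of any particular tournament's algorithm.

```python
# Performance-weighted aggregation: more accurate forecasters (lower past
# Brier Scores) count for more in the combined probability.
def weighted_crowd_forecast(forecasts, past_brier_scores):
    """Combine individual probabilities, weighting each by past accuracy."""
    # Lower Brier Score -> higher weight; the small constant avoids dividing
    # by zero for a (so far) perfect forecaster.
    weights = [1.0 / (b + 0.01) for b in past_brier_scores]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total

# Three forecasters predict the same event; the historically most accurate one
# (Brier 0.05) pulls the crowd forecast toward their 80% estimate.
print(weighted_crowd_forecast([0.8, 0.5, 0.3], [0.05, 0.20, 0.30]))
```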

The Future of Forecasting

While forecasting tournaments act as an important tool for policy-making in an increasingly uncertain world, the method is not exempt from critique. Scholars such as Ruha Benjamin have argued that technocratic approaches to knowledge and policy-making, including forecasting, often reinforce and reproduce social inequalities, validating them behind veils of objectivity and rationality. Prediction markets can be manipulated, and public-facing forecasts are subject to performativity: they cannot remain exogenous from the world they seek to predict. These considerations sit squarely in debates around the politics of prediction. Who gets to forecast? Who gets to be legitimated as an “expert”? Who gets to design the systems used to forecast?

As we look towards the future of forecasting, developments in artificial intelligence—such as hybrid human–LLM forecasting and fully autonomous LLM-agent forecasting—will only deepen these concerns, raising additional questions about transparency, accountability, and what biases are embedded within such systems. 

But these concerns also point toward an underexplored application: community-based and participatory forecasting. Forecasting has historically ignored certain knowledge systems, but incorporating localized knowledge from the communities whose futures we seek to forecast can help fill in blind spots where traditional forecasting tournaments continue to fall short, and can help democratize all that forecasting has to offer.

References

  1. https://brightlinewatch.org/the-accuracy-of-expert-forecasts-of-negative-democratic-events/

  2. https://www.pnas.org/doi/10.1073/pnas.2400076121

  3. https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/expert-bias-and-democratic-erosion-assessing-expert-perceptions-of-contemporary-american-democracy/0620719DD8C8A4FE1055AF268A3732E0

  4. https://strathprints.strath.ac.uk/60003/

  5. https://wifpr.wharton.upenn.edu/blog/a-primer-on-prediction-markets/

  6. https://pubmed.ncbi.nlm.nih.gov/24659192/

  7. https://www.penguinrandomhouse.com/books/227815/superforecasting-by-philip-e-tetlock-and-dan-gardner/

  8. https://journals.sagepub.com/doi/abs/10.1177/1745691615577794

  9. https://insights.som.yale.edu/insights/dont-trust-the-political-prediction-markets

  10. https://journals.sagepub.com/doi/10.1177/0956797614524255

  11. https://ssir.org/articles/entry/disrupting_the_gospel_of_tech_solutionism_to_build_tech_justice

  12. https://www.undrr.org/news/how-ancestral-insights-can-strengthen-early-warnings