# A new approach to event-driven forecasting

Published: May 28, 2026
Authors: Astrid Atle & David Perntoft
Canonical: https://predli.com/blog/a-new-approach-to-event-driven-forecasting

> A practical look at where large language models add value to time series forecasting - and where they do not. We share findings from a Lund University × Predli master’s thesis on event-driven prediction with agentic LLM orchestration.

## Introduction

Every year, on the Monday after Black Friday, logistics operators face a forecasting problem that their models were never designed to solve.

A system trained on years of regular weekly cycles must suddenly predict what happens when four overlapping events collide within a few weeks: a Singles Day pulse, a Black Week build-up, the Black Friday spike itself, and a delivery after-wave into early December. The model will reconstruct the underlying weekly rhythm with precision. And then it will miss the spike entirely.

Not because the model is weak. Because the information that would explain the spike is not in the data. It lives in a marketing calendar, a planning document, a news feed - just not in a form any numerical model can consume.

This is the problem we set out to solve in our master’s thesis at Lund University.

## The obvious solution has a well-known flaw

Large language models are an obvious candidate for bridging this gap. They read text. They reason about events. Feed them a time series alongside a description of an upcoming campaign and ask them to predict the effect - problem solved.

Except it is not. Language models tokenise numbers in ways that distort ordinal relationships, so a raw numerical sequence is a poor input for them. They still answer with authority, and the resulting forecasts are wrong in ways that are hard to detect.

The naive approach - one model, all inputs, one output - inherits the worst of both worlds.

![Classical statistical forecasting compared with LLM-integrated forecasting](/blog/forecasting-paradigm.jpg)

  Two forecasting paradigms compared. Classical statistical models see only the numerical history and miss event-driven spikes. An LLM-integrated system reasons jointly over numbers and textual context, captures the spike, and emits an auditable reasoning trace.

## A strict division of labour

Our system is built on one design constraint that runs through every component: **the language model never produces a number.**

All numerical computation is delegated to validated statistical implementations - SARIMA, state-space models, STL decomposition. The language model’s contribution is restricted to what it actually does reliably: interpreting natural language, retrieving relevant historical analogues, and translating qualitative event descriptions into structured adjustments to a statistical baseline.

Every numerical output in the system can be traced back either to a statistical procedure or to a piece of measured historical evidence. Nothing comes from free-form generation.
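One way to enforce this constraint at the boundary is to validate the model's response before anything downstream sees it. The sketch below is a minimal illustration, not the thesis implementation: the JSON layout, the `shape` vocabulary, and the function names are all assumptions made for this example. It rejects any response that carries a JSON number, so magnitudes can only enter the system from the statistical side. Note that it checks JSON number types only; a number hidden inside a string would need a stricter check.

```python
import json

ALLOWED_SHAPES = {"spike", "ramp", "plateau"}  # hypothetical shape vocabulary


def contains_number(value):
    """Return True if an int or float appears anywhere in a parsed JSON value."""
    if isinstance(value, bool):
        return False  # JSON true/false are flags, not magnitudes
    if isinstance(value, (int, float)):
        return True
    if isinstance(value, dict):
        return any(contains_number(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_number(v) for v in value)
    return False


def parse_interpretation(raw_json):
    """Accept a response that names a shape and analogue ids, and nothing numeric."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    if contains_number(payload):
        raise ValueError("model output contains a numeric value")
    if payload.get("shape") not in ALLOWED_SHAPES:
        raise ValueError("unknown shape")
    analogue_ids = payload.get("analogue_ids", [])
    if not all(isinstance(a, str) for a in analogue_ids):
        raise ValueError("analogue ids must be strings")
    return {"shape": payload["shape"], "analogue_ids": list(analogue_ids)}
```

A response such as `{"shape": "spike", "analogue_ids": ["bf-2023"]}` passes (the id is invented for the example), while `{"shape": "spike", "uplift": 1.8}` raises a `ValueError`.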

## How the pipeline works

The architecture is a sequence of specialised agents, each with a narrow responsibility.

![The agentic forecasting pipeline](/blog/forecasting-pipeline.jpg)

  The agentic forecasting pipeline. Statistical descriptors and domain context feed the Hypothesis Generator. Pruned hypotheses are dropped; survivors enter the Forecaster–Evaluator refinement loop. The Aggregator selects the best-performing hypothesis as the statistical baseline, adjusted by the Scenario Generator using future event information.
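The prune-then-select logic can be sketched in a few lines. This is a simplified stand-in under stated assumptions, not the thesis code: each "hypothesis" is reduced to a candidate seasonal period, the forecaster is a seasonal-naive model rather than SARIMA or a state-space model, the evaluator is mean absolute error on a holdout window, and the iterative refinement loop is omitted. All function names and the `prune_above` threshold are hypothetical.

```python
def seasonal_naive(history, period, horizon):
    """Forecast by repeating the last full seasonal cycle."""
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_baseline(series, candidate_periods, holdout, prune_above):
    """Score each hypothesis on a holdout window, prune weak ones, keep the best."""
    train, test = series[:-holdout], series[-holdout:]
    scores = {}
    for period in candidate_periods:
        if period > len(train):
            continue  # not enough history to test this hypothesis
        error = mean_absolute_error(test, seasonal_naive(train, period, holdout))
        if error <= prune_above:  # hypotheses above the threshold are dropped
            scores[period] = error
    if not scores:
        raise ValueError("every hypothesis was pruned")
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic series with a clean weekly rhythm, for illustration only.
weekly_pattern = [10, 12, 14, 13, 15, 8, 6]
series = weekly_pattern * 6
best_period, surviving = select_baseline(series, [3, 5, 7], holdout=7, prune_above=5.0)
```

On this synthetic series the three-day hypothesis is pruned, and the seven-day hypothesis is selected over the five-day one because it reproduces the holdout week exactly.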

The most consequential component is the scenario generator. When a future event is described - a product launch, a price reduction, a public holiday - the system searches for historical analogies, both within the series’ own history and in an external knowledge bank of precedent cases. From those analogies it constructs three scenario specifications: optimistic, expected, and conservative. These are applied as deterministic multiplicative adjustments to the statistical baseline. The language model selects the shape and the analogues. The magnitudes come from empirical quantiles of historical impacts.
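The adjustment step can be illustrated with a small sketch. It assumes that historical impacts are stored as ratios of actual to baseline values, that the three scenarios map to the lower quartile, median, and upper quartile, and that the shape is a list of weights between 0 and 1 over the event window. None of these specifics are stated in the post; they are choices made for this example, as are the function names.

```python
from statistics import quantiles


def scenario_multipliers(historical_impacts):
    """Turn measured impact ratios (actual / baseline) into three scenario levels."""
    if len(historical_impacts) < 2:
        raise ValueError("need at least two analogous events")
    q1, median, q3 = quantiles(historical_impacts, n=4, method="inclusive")
    return {"conservative": q1, "expected": median, "optimistic": q3}


def apply_scenario(baseline, multiplier, shape_weights, start):
    """Scale the baseline over the event window; weights in [0, 1] set the shape."""
    adjusted = list(baseline)
    for offset, weight in enumerate(shape_weights):
        index = start + offset
        if index < len(adjusted):
            # weight 0 leaves the baseline untouched, weight 1 applies the full multiplier
            adjusted[index] = baseline[index] * (1 + (multiplier - 1) * weight)
    return adjusted


# Illustrative values only: impact ratios measured on five past analogous events.
impacts = [1.2, 1.4, 1.5, 1.8, 2.0]
levels = scenario_multipliers(impacts)
baseline = [100.0] * 5
expected = apply_scenario(baseline, levels["expected"], [0.5, 1.0, 0.5], start=1)
```

With these inputs the three levels are 1.4, 1.5, and 1.8, and the expected scenario lifts the flat baseline of 100 to 125, 150, 125 across the three-step event window. In this sketch the only inputs a language model could supply are the shape weights' pattern and the choice of which past events go into `impacts`; the multipliers themselves are computed from measurements.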

## The results - and the failure that mattered most

The system was evaluated on three simulated datasets with controlled data-generating processes: a primary care centre, a logistics hub, and a music streaming catalogue. Simulated data was a deliberate choice - real-world series rarely come with ground truth for event effects.

  - **66%** reduction in forecast error - logistics scenario
  - **59%** reduction in forecast error - primary care scenario
  - **2–2.5×** higher error for Chronos-Bolt on event-driven windows

Chronos-Bolt, a state-of-the-art numerical foundation model, produced errors 2.0 to 2.5 times higher on event-driven test windows - not because it is a poor model, but because it has no access to the information that drives the level shifts in question.
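For readers who want to map these headline figures onto raw error values, the arithmetic is as follows. The post does not state which error metric underlies the figures, so the sketch assumes any non-negative error measure such as MAE, and the input values are made up for illustration; they are not thesis results.

```python
def relative_reduction(reference_error, new_error):
    """Fractional reduction in error relative to the reference."""
    return (reference_error - new_error) / reference_error


def error_ratio(model_error, reference_error):
    """How many times larger one model's error is than the reference."""
    return model_error / reference_error


# Made-up error values, chosen only to show the calculation.
reduction = relative_reduction(reference_error=50.0, new_error=17.0)
ratio = error_ratio(model_error=25.0, reference_error=10.0)
```

Here a drop from 50 to 17 corresponds to a 66% reduction, and an error of 25 against a reference of 10 corresponds to a ratio of 2.5.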

The more revealing result came from deliberately breaking the system.

  **The music streaming case**
  When we removed all historical analogies but kept the future event description - “a new single will be released on June 21st” - performance degraded by **276%** relative to the full pipeline. With no precedent to draw on, the language model had no basis for calibrating the magnitude of a release. It produced a narrow interval that missed the actual spike entirely.

  This looks like a failure. We think it is evidence the system is working as intended.

  A less carefully designed system might have produced a wide interval to nominally cover the outcome, or assigned a plausible-sounding magnitude with no supporting evidence. Our system instead signalled that it had no evidence on which to base an adjustment. In an operational setting, a system that can identify the absence of usable evidence is considerably more valuable than one that reports unjustified confidence.
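The behaviour described above, declining to adjust rather than guessing, can be sketched as an explicit branch. How the thesis system surfaces the missing evidence is not described in this post, so the `ScenarioDecision` structure, the function name, and the choice to return the unadjusted baseline are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScenarioDecision:
    adjusted: bool   # False means the baseline was returned untouched
    forecast: list
    reason: str      # human-readable explanation for the audit trail


def adjust_or_abstain(baseline, analogue_impacts):
    """Adjust only when measured precedent exists; otherwise flag the gap."""
    if not analogue_impacts:
        return ScenarioDecision(False, list(baseline), "no historical analogue found")
    ordered = sorted(analogue_impacts)
    middle = ordered[len(ordered) // 2]  # upper median of the measured impact ratios
    return ScenarioDecision(True, [v * middle for v in baseline], "grounded in analogues")


# Illustrative calls: one with no precedent, one with three measured impact ratios.
ungrounded = adjust_or_abstain([100.0, 110.0], [])
grounded = adjust_or_abstain([100.0], [1.2, 1.5, 2.0])
```

The point of the design is that the missing-evidence case is a distinct, inspectable outcome: a planner reading `ungrounded.reason` knows the forecast carries no event adjustment, instead of receiving a number with no provenance.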

## When this approach helps - and when it does not

LLM event reasoning is not a general-purpose improvement to forecasting accuracy. It produces measurable signal only when three conditions hold simultaneously.

  - **The event must be material.** On an ordinary day with no events, the system adds nothing over a well-tuned statistical baseline and the additional computation is wasted.

  - **A grounding source must exist.** At least one analogous event must be present in the knowledge bank. Genuine novelty defeats the retrieval step - and the language model correctly declines to assign a magnitude it cannot justify.

  - **The description must be specific.** “A marketing campaign” is too vague to be useful. “A 30% price reduction on athletic footwear, three-day duration, social media driven” gives the system something concrete to match against.

When all three conditions hold, the gains are substantial. When any one fails, the system degrades - but gradually, not catastrophically.
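The three conditions can be read as a gate in front of the event-reasoning step. The sketch below is one hypothetical way to express that gate: the `EventBrief` fields, the idea of counting filled-in attributes as a proxy for specificity, and the threshold of three attributes are all assumptions for this example, not rules from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class EventBrief:
    is_material: bool      # judged upstream, e.g. by a planner
    analogue_count: int    # matches found in the knowledge bank
    attributes: dict = field(default_factory=dict)  # e.g. discount, duration, channel


def failed_conditions(event, min_attributes=3):
    """List which of the three conditions an event fails; empty means proceed."""
    failures = []
    if not event.is_material:
        failures.append("not material")
    if event.analogue_count < 1:
        failures.append("no grounding source")
    filled = [key for key, value in event.attributes.items() if value]
    if len(filled) < min_attributes:
        failures.append("description too vague")
    return failures


vague = EventBrief(True, 2, {"type": "marketing campaign"})
specific = EventBrief(True, 2, {
    "discount": "30%",
    "category": "athletic footwear",
    "duration": "three days",
    "channel": "social media",
})
```

Here `failed_conditions(vague)` reports that the description is too vague, while `failed_conditions(specific)` returns an empty list. Returning the list of failures, rather than a single boolean, keeps the reason for skipping event reasoning visible.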

## What this means for forecasting in practice

The question is not whether language models belong in forecasting pipelines. For event-driven series - logistics, retail, healthcare, media - the information gap between what drives outcomes and what numerical models can see is real and consequential.

The question is how to use them without inheriting their failure modes. A model constrained to retrieve, rank, and translate - while a statistical engine handles the arithmetic - adds genuine signal and fails in ways that are visible and interpretable.

The capability curve for foundation models is not flattening. The next generation of numerical forecasting models will be more capable. But the structural gap - between what drives a spike and what appears in a time series - will not close on its own. The information exists. The question is whether the system is designed to use it.

*This post summarises the master’s thesis “Integrating Natural Language Events into Time Series Forecasting through Agentic LLM Orchestration” by Astrid Atle and David Perntoft, Department of Mathematical Statistics and Industrial Engineering and Management at Lund University, conducted in collaboration with Predli. The full thesis is available on request.*
