Data & research · June 29, 2026

Options Flow Historical Data: How to Access, Backtest, and Evaluate Signal Accuracy

Historical options flow data is the foundation for any serious evaluation of whether unusual activity has real predictive value or is just noise that gets retroactively explained. Here is what historical flow data contains, how to access it, how to build a backtest, and the accountability gap that most flow tools quietly ignore.

What historical options flow data contains

A live options flow feed shows prints as they happen. Historical flow data is the same feed archived by session. Each record captures the state of the print at execution time rather than an after-the-fact reconstruction.

A quality historical dataset includes, per print:

Field | Why it matters for research
--- | ---
Timestamp | Maps the print to a market regime, session, and underlying spot price
Ticker & underlying spot | Required to compute forward returns
Contract type (CALL/PUT) | Defines the implied directional bet
Strike & expiry (DTE) | Separates urgent short-dated bets from long-dated hedges
Volume & open interest | The Vol/OI ratio, the strongest single factor in unusualness scoring
Premium paid | Conviction proxy: size filters out noise
Aggressor side (bid/ask/mid) | Ask-side indicates urgency; bid-side is more ambiguous
Execution type (sweep/block) | Sweeps cross multiple venues instantly and are strongly directional
Unusualness score | The composite signal quality metric, needed to segment outcomes by tier
Sector/market cap | Allows sector-level aggregation in the backtest

What historical flow data does not contain is the underlying's forward price; that must be joined from a separate OHLCV source (Polygon, Finnhub, yfinance) based on the snapshot timestamp.
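
One way to sketch that join is pandas' `merge_asof`, which matches each print to the most recent price bar at or before its timestamp. The column names (`bar_time`, `close`) and all values below are hypothetical, and the sketch assumes `bar_time` marks when each bar closed, so a backward match cannot leak future prices into the entry reference.

```python
import pandas as pd

# Hypothetical flow prints (column names are assumptions, not a provider schema)
flow = pd.DataFrame({
    "ticker": ["NVDA", "AAPL", "NVDA"],
    "timestamp": pd.to_datetime([
        "2026-06-15 10:23:41", "2026-06-15 11:05:00", "2026-06-16 14:30:10"]),
})

# Hypothetical minute bars from a separate OHLCV source (made-up prices)
bars = pd.DataFrame({
    "ticker": ["NVDA", "NVDA", "AAPL", "NVDA"],
    "bar_time": pd.to_datetime([
        "2026-06-15 10:23:00", "2026-06-15 10:24:00",
        "2026-06-15 11:05:00", "2026-06-16 14:30:00"]),
    "close": [131.20, 131.35, 210.10, 133.00],
})

# merge_asof requires both frames to be sorted by their time keys
flow = flow.sort_values("timestamp")
bars = bars.sort_values("bar_time")

joined = pd.merge_asof(
    flow, bars,
    left_on="timestamp", right_on="bar_time",
    by="ticker",                     # only match bars for the same ticker
    direction="backward",            # last bar at or before the print
    tolerance=pd.Timedelta("5min"),  # leave NaN rather than match a stale bar
)
print(joined[["ticker", "timestamp", "bar_time", "close"]])
```

The `tolerance` argument is a design choice: a print with no bar inside the window gets a missing price and can be dropped explicitly, instead of silently inheriting a price from hours earlier.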

How far back providers store it

Depth of history varies widely and is rarely disclosed upfront:

For evaluating whether high-scoring flow signals carry real directional information, 12 months is the practical minimum. You need enough samples per sector and score tier to generate confidence intervals below ±10 percentage points.
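
To make the ±10 percentage point target concrete, here is a rough sample-size check. It assumes a normal-approximation 95% interval, the worst-case hit-rate of 50%, and independent signals; real flow signals tend to cluster by day and ticker, so the effective sample is likely smaller than the raw count.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation 95% interval for a hit-rate."""
    return z * math.sqrt(p * (1 - p) / n)

def samples_needed(margin, p=0.5, z=1.96):
    """Smallest n whose interval half-width is at or below `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Half-width in percentage points for a few cell sizes
for n in (30, 100, 400):
    print(n, round(margin_of_error(n) * 100, 1))

# Samples per cell needed for a half-width of 10 percentage points
print(samples_needed(0.10))
```

Under these assumptions a cell needs on the order of a hundred signals before its interval drops below ±10 points, which is why thin sector × tier cells take many months to fill.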

How to backtest options flow signals

A rigorous options flow backtest follows five steps:

  1. Define your signal universe. What qualifies as a signal? For example: score ≥ 70, premium ≥ $100,000, aggressor side = ASK, execution = SWEEP. This becomes your filter. Apply it identically to every session in your dataset: no look-ahead, no manual selection.
  2. Record the snapshot. For each qualifying signal, record the underlying's closing price on the signal date (or the midpoint price closest to the signal timestamp for intraday precision). This is your entry reference price.
  3. Measure forward returns. At your chosen horizon (1 trading day, 3 days, 5 days), record the underlying's closing price. The forward return = (exit price − entry price) / entry price × 100%.
  4. Classify the outcome. A CALL signal is a directional hit if the forward return is positive; a PUT signal is a hit if the return is negative. This directional classification, not options P&L, is the cleanest signal quality metric (it removes IV, decay, and bid-ask noise from the equation).
  5. Aggregate by tier, type, and sector. Calculate hit-rate (% correct directional calls), average forward return, and standard deviation by score tier (EXTREME / ELEVATED / NOTABLE), option type (CALL / PUT), and sector (Technology, Biotech, Energy, etc.). The tier × sector cross-tab is where the most actionable patterns appear.

Common backtesting mistakes

What hit-rates to expect (honestly)

Honest published research on options flow signal accuracy is sparse, because most providers have no incentive to publish numbers that might disappoint users. The figures that do appear in academic literature and independent analysis suggest:

A few important calibrations:

Python workflow for signal analysis

A minimal backtest on exported options flow data using pandas and yfinance:

import pandas as pd
import numpy as np
import yfinance as yf
from datetime import timedelta

# Load your historical flow export (CSV from RadarPulse or API response)
flow = pd.read_csv("flow_history.csv", parse_dates=["timestamp"])

# 1. Filter for your signal universe
signals = flow[
    (flow["score"] >= 70) &
    (flow["premium"] >= 100_000) &
    (flow["side"] == "ASK") &
    (flow["kind"] == "SWEEP")
].copy()

# 2. For each signal, fetch the underlying's price at signal date + forward horizons
def forward_return(ticker, signal_date, days):
    start = signal_date
    end = signal_date + timedelta(days=days + 5)  # buffer for market closures
    hist = yf.Ticker(ticker).history(start=start, end=end)
    if hist.empty or len(hist) <= days:
        return None  # not enough trading days yet to measure the full horizon
    entry = hist["Close"].iloc[0]
    # Close on the Nth trading day after the signal
    exit_ = hist["Close"].iloc[days]
    return (exit_ - entry) / entry * 100

results = []
for _, row in signals.iterrows():
    ret_3d = forward_return(row["ticker"], row["timestamp"].date(), 3)
    if ret_3d is None:
        continue
    # Directional hit: CALL needs positive return, PUT needs negative
    is_hit = (row["type"] == "CALL" and ret_3d > 0) or \
             (row["type"] == "PUT" and ret_3d < 0)
    results.append({
        "ticker": row["ticker"],
        "type": row["type"],
        "score": row["score"],
        "sector": row.get("sector", "Other"),
        "ret_3d": ret_3d,
        "hit": is_hit,
        "flag": "EXTREME" if row["score"] >= 85 else "ELEVATED" if row["score"] >= 70 else "NOTABLE"
    })

df = pd.DataFrame(results)

# 3. Aggregate by flag and option type
summary = (
    df.groupby(["flag", "type"])
    .agg(
        count=("hit", "size"),
        hit_rate=("hit", "mean"),
        avg_ret=("ret_3d", "mean"),
        std_ret=("ret_3d", "std")
    )
    .round(3)
)
print(summary[summary["count"] >= 30])  # only report cells with enough data

Key notes: use yfinance for quick backtests but Polygon or CBOE data for production research (yfinance data is not adjusted for splits in all edge cases, and its intraday bars cover only a short recent window). The forward_return function uses closing prices; for intraday precision, join to the minute-level OHLCV bar nearest the signal timestamp.

Sector-level heatmap

import seaborn as sns
import matplotlib.pyplot as plt

pivot = df.pivot_table(
    values="hit", index="sector", columns="flag",
    # NaN (not None) keeps the pivot numeric so sparse cells render as blanks
    aggfunc=lambda x: x.mean() if len(x) >= 10 else float("nan")
)
sns.heatmap(pivot, annot=True, fmt=".0%", cmap="RdYlGn",
            center=0.5, vmin=0.4, vmax=0.7)
plt.title("Directional hit-rate by sector × score tier (3-day)")
plt.tight_layout()
plt.show()

This heatmap often reveals that a few sector × tier combinations drive most of the edge. For example, EXTREME call flow in Technology and Healthcare outperforms EXTREME call flow in Consumer Staples, which is structurally less catalyst-driven.

Evaluating provider data quality

Not all historical flow datasets are equivalent. Evaluate a provider on these dimensions before building research on their data:

Question to ask | Why it matters
--- | ---
How far back does the data go? | Less than 12 months is insufficient for multi-regime analysis
Is it raw OPRA tape or pre-filtered? | Pre-filtered data may exclude prints your criteria would catch; raw tape includes noise you'd need to filter yourself
Are Vol/OI ratios computed correctly? | Some providers use total OI across all strikes; correct is same-strike OI at trade time, a significant difference for short-dated prints
Is aggressor side included? | Without bid/ask classification, you can't filter for urgency, the most important execution quality signal
Are multi-leg (spread) prints separated from single-leg? | Multi-leg prints may look like unusual directional flow but are often synthetic positions, hedges, or risk reversals with no strong directional bias
Does the provider publish their own outcome data? | A provider confident in their signal quality should track outcomes. If they don't, ask why.
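
The Vol/OI question is easy to check on a sample of a provider's data. The sketch below uses made-up numbers and illustrative column names; it assumes the open-interest snapshot is the figure known at trade time and contrasts the same-strike ratio with the pooled-across-strikes shortcut.

```python
import pandas as pd

# Hypothetical open-interest snapshot per contract (names are illustrative)
oi = pd.DataFrame({
    "ticker": ["NVDA", "NVDA", "NVDA"],
    "expiry": ["2026-06-26"] * 3,
    "strike": [130, 135, 140],
    "type":   ["CALL", "CALL", "CALL"],
    "open_interest": [8_000, 500, 1_500],
})

# One print: 6,200 contracts traded on the 135 call
trade = {"ticker": "NVDA", "expiry": "2026-06-26", "strike": 135,
         "type": "CALL", "volume": 6_200}

# Same-strike OI: match the exact contract the print traded in
same = oi[(oi["ticker"] == trade["ticker"]) & (oi["expiry"] == trade["expiry"]) &
          (oi["strike"] == trade["strike"]) & (oi["type"] == trade["type"])]
vol_oi_same_strike = trade["volume"] / same["open_interest"].iloc[0]

# Pooled OI across every strike: the shortcut that understates unusualness
vol_oi_all_strikes = trade["volume"] / oi["open_interest"].sum()

print(round(vol_oi_same_strike, 2), round(vol_oi_all_strikes, 2))
```

In this toy case the same print scores 12.4 against its own strike but only 0.62 against pooled open interest, so a provider using the pooled figure would never flag it.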

The accountability gap: why most tools hide track records

The options flow tool market has a systematic accountability problem. Because past prints are abundant, it's trivially easy to find examples that look prescient: EXTREME calls on a name three days before an earnings beat, large put sweeps ahead of a sector selloff. Social media amplifies these examples because they're compelling stories.

What you almost never see is the denominator: out of all the EXTREME calls in the same period, how many actually preceded upside? That number is available from the same data. It just doesn't get posted because it's usually closer to 55–60% than 90%, and "55% directional accuracy on high-scoring signals" is harder to tweet than a screenshot of a 100x put.

The correct standard is a prospective, systematic track record:

RadarPulse's Smart-Money Scorecard is built on this standard. Every EXTREME and ELEVATED print scored from a live session is logged with the underlying spot price, and the forward move is measured automatically as the session data accumulates. The track record builds prospectively, without cherry-picking, and the methodology is documented. The numbers that emerge are honest ones, useful for calibrating how much weight to put on any given signal, not a marketing claim about performance.

Accessing historical data via API

Most quality options flow tools expose historical data through a dedicated endpoint alongside their live feed. The typical pattern:

# RadarPulse historical flow endpoint (Elite tier, staged for next release)
GET /api/v1/flow/historical?from=2026-06-01&to=2026-06-29&score_min=70&type=CALL&limit=500

Authorization: x-api-key YOUR_KEY_HERE

# Response: paginated list of prints from the specified date range
{
  "prints": [
    {
      "ticker": "NVDA",
      "type": "CALL",
      "strike": 135,
      "dte": 7,
      "premium": 2450000,
      "volOI": 12.4,
      "side": "ASK",
      "kind": "SWEEP",
      "score": 91,
      "flag": "EXTREME",
      "spot": 131.20,
      "timestamp": "2026-06-15T10:23:41Z",
      "sector": "Technology"
    }
    // ...
  ],
  "total": 842,
  "next_cursor": "eyJ0cyI6MTc1MDAwMDAwMH0="
}
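
A paginated response like this is usually consumed by following `next_cursor` until it comes back empty. The sketch below is an assumption-heavy illustration: the request parameter name `cursor`, the null cursor on the last page, and the `fetch_page` callable (a stand-in for whatever HTTP client you use) are all hypothetical, since the endpoint above is still staged for release.

```python
from typing import Callable, Iterator, Optional

def iter_prints(fetch_page: Callable[[dict], dict], params: dict) -> Iterator[dict]:
    """Yield every print across pages by following `next_cursor`."""
    cursor: Optional[str] = None
    while True:
        query = dict(params)
        if cursor:
            query["cursor"] = cursor  # assumed request parameter name
        page = fetch_page(query)
        yield from page.get("prints", [])
        cursor = page.get("next_cursor")
        if not cursor:  # missing or null cursor is treated as the last page
            break

# Stand-in for an HTTP call so the sketch runs offline
def fake_fetch(query: dict) -> dict:
    pages = {
        None: {"prints": [{"ticker": "NVDA", "score": 91}], "next_cursor": "p2"},
        "p2": {"prints": [{"ticker": "AAPL", "score": 74}], "next_cursor": None},
    }
    return pages[query.get("cursor")]

params = {"from": "2026-06-01", "to": "2026-06-29", "score_min": 70, "limit": 500}
all_prints = list(iter_prints(fake_fetch, params))
print([p["ticker"] for p in all_prints])
```

Passing the fetch function in as an argument keeps the pagination logic testable without network access and independent of any particular authentication scheme.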

Key parameters to look for in a historical endpoint:

For building a research pipeline, fetch historical data in batches of 30-day windows, cache to local Parquet files, and join with yfinance / Polygon for forward prices. Avoid re-fetching the same date ranges repeatedly; most historical endpoints count against rate limits even for repeated identical queries.
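
The batching and caching pattern might look like the following sketch. The `fetch` callable, the `flow_cache` directory, and the file naming are hypothetical, and writing Parquet from pandas requires `pyarrow` or `fastparquet` to be installed.

```python
from datetime import date, timedelta
from pathlib import Path

import pandas as pd

def date_windows(start, end, days=30):
    """Split [start, end] into consecutive inclusive windows of at most `days` days."""
    windows = []
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=days - 1), end)
        windows.append((cur, stop))
        cur = stop + timedelta(days=1)
    return windows

def load_window(fetch, start, stop, cache_dir=Path("flow_cache")):
    """Return one window of prints, reading the Parquet cache when it exists."""
    path = cache_dir / f"flow_{start}_{stop}.parquet"
    if path.exists():
        return pd.read_parquet(path)       # cache hit: no API call
    df = pd.DataFrame(fetch(start, stop))  # `fetch` is a hypothetical API wrapper
    cache_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)       # needs pyarrow or fastparquet
    return df

windows = date_windows(date(2026, 1, 1), date(2026, 3, 15))
print(windows)
```

One caveat with this design: a window that includes the current session would be cached while still incomplete, so it is safer to cache only windows that have fully closed.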

See the options flow API guide for authentication patterns, rate limit management, and WebSocket vs REST trade-offs in more detail.

What to do with the data once you have it

Beyond backtesting hit-rates, historical options flow data supports several other research workflows:

Pre-earnings pattern analysis

Filter historical prints to the 5 trading days before each company's earnings report and aggregate by score tier and direction. The question: do EXTREME call sweeps in the week before earnings show a higher directional hit-rate than the session average? If yes, earnings-window flow deserves higher conviction weight in your live workflow.
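
A minimal sketch of the earnings-window tag, using placeholder tickers and made-up dates. It assumes an earnings calendar with one row per report, and it counts weekdays with NumPy's `busday_count`, which ignores exchange holidays unless you pass a holiday list.

```python
import numpy as np
import pandas as pd

# Hypothetical backtest rows and earnings calendar (placeholder tickers and dates)
prints = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB"],
    "date": pd.to_datetime(["2026-05-18", "2026-05-13", "2026-05-01", "2026-05-20"]),
    "hit": [True, True, False, True],
})
earnings = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "earnings_date": pd.to_datetime(["2026-05-20", "2026-07-30"]),
})

# Attach the next earnings date strictly after each print, per ticker
tagged = pd.merge_asof(
    prints.sort_values("date"), earnings.sort_values("earnings_date"),
    left_on="date", right_on="earnings_date", by="ticker",
    direction="forward", allow_exact_matches=False,
)
tagged = tagged.dropna(subset=["earnings_date"]).copy()  # no upcoming report known

# Weekdays from the print date up to (not including) the earnings date
tagged["bdays_to_earnings"] = np.busday_count(
    tagged["date"].values.astype("datetime64[D]"),
    tagged["earnings_date"].values.astype("datetime64[D]"),
)
tagged["pre_earnings"] = tagged["bdays_to_earnings"] <= 5

print(tagged.groupby("pre_earnings")["hit"].agg(["size", "mean"]))
```

The final group-by gives the comparison the paragraph asks for: hit-rate inside the earnings window versus outside it, with the sample count alongside so thin groups are visible.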

Congress × flow confluence scoring

Tag each historical print against the congressional disclosure data for the same ticker (available from RadarPulse's Congress tracker and the STOCK Act disclosure database). Compute hit-rates for prints where Congress was also active in the same name vs. prints without congressional overlap. This cross-domain validation is one of the most differentiated research questions available from public data.

Sector rotation timing

Aggregate historical EXTREME flow by sector per week. Build a time series of sector-level unusual activity premium. Identify weeks where a sector saw concentrated unusual flow, then measure the sector ETF's forward return at 5 and 10 trading days. This builds a signal for sector rotation timing, based not on price action (which is after-the-fact) but on real-money options positioning.
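
The weekly aggregation step can be sketched with a pandas `Grouper`. The data below is made up, and anchoring weeks to end on Friday (`W-FRI`) is an assumption chosen to line up with the trading week.

```python
import pandas as pd

# Hypothetical prints; values are illustrative only
prints = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-06-01 10:00", "2026-06-03 11:30", "2026-06-04 15:10",
        "2026-06-09 09:45", "2026-06-10 13:20"]),
    "sector": ["Energy", "Energy", "Technology", "Technology", "Energy"],
    "flag": ["EXTREME", "EXTREME", "EXTREME", "ELEVATED", "EXTREME"],
    "premium": [400_000, 600_000, 250_000, 900_000, 150_000],
})

extreme = prints[prints["flag"] == "EXTREME"]

# Total EXTREME premium per sector per week (weeks labeled by their Friday)
weekly = (
    extreme.groupby([pd.Grouper(key="timestamp", freq="W-FRI"), "sector"])["premium"]
    .sum()
    .unstack(fill_value=0)
)

# Each sector's share of that week's EXTREME premium
share = weekly.div(weekly.sum(axis=1), axis=0)

print(weekly)
print(share.round(2))
```

Using the share of weekly premium rather than the raw total is one way to flag concentration that is not simply a busy week across the whole market; the forward-return step would then join each flagged week to the relevant sector ETF's prices.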

Score calibration

Run the backtest across multiple score thresholds (65, 70, 75, 80, 85, 90) and plot hit-rate vs. threshold. The inflection point, where hit-rates start improving meaningfully, is the empirically supported threshold for your specific universe and time period. This calibration is more reliable than using a threshold that was set arbitrarily at product launch.
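
A threshold sweep of this kind is a short loop over the backtest output. The scores and outcomes below are toy values invented for illustration, not measured results.

```python
import pandas as pd

# Hypothetical backtest output: one row per signal with its score and outcome
df = pd.DataFrame({
    "score": [66, 68, 71, 73, 76, 78, 81, 84, 86, 88, 91, 95],
    "hit":   [0,  1,  0,  1,  0,  1,  1,  0,  1,  1,  1,  1],
})

rows = []
for threshold in (65, 70, 75, 80, 85, 90):
    subset = df[df["score"] >= threshold]
    rows.append({
        "threshold": threshold,
        "n": len(subset),                 # sample size shrinks as the bar rises
        "hit_rate": subset["hit"].mean(),
    })

curve = pd.DataFrame(rows).set_index("threshold")
print(curve.round(3))
```

Keeping `n` next to each hit-rate matters: the highest thresholds always have the fewest samples, so an apparent jump at the top of the curve may be noise rather than an inflection point.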

Frequently asked questions

Is historical options flow data the same as options chain history?

No. Options chain history (historical OHLCV per contract, open interest per strike per day) is widely available from CBOE, Nasdaq, and data vendors. Historical options flow data is both narrower and richer: it captures only the unusual prints (sweeps, large blocks, high Vol/OI trades), along with the execution context (aggressor side, sweep vs block) that options chain history doesn't include. You cannot derive flow data from options chain history because the chain history shows end-of-day snapshots, not intraday execution details.

Can I access CBOE options flow data directly?

CBOE distributes options market data through its DataShop product. The raw OPRA tape (all options prints) is available for institutional subscribers, typically via FTP or SFTP in large daily flat files. It includes every trade but without the scoring, filtering, or execution-side tagging that flow tools provide. Building a flow tool from raw OPRA tape requires significant engineering: parsing 1–5GB daily files, computing Vol/OI ratios at the time of each trade (not end-of-day), identifying sweeps across multiple exchanges, and computing unusualness scores.

How many samples do I need for a meaningful backtest?

Per cell (each tier × type × sector combination), 30 samples is the bare minimum for reporting a hit-rate at all; at that size a 95% confidence interval is still roughly ±18 percentage points. For a standard backtest with EXTREME / ELEVATED tiers, CALL / PUT types, and 10 sectors, that is 40 cells, or roughly 1,200 samples if signals were spread evenly across them. The coarser tier × sector cut (20 cells) needs roughly 600, achievable with 6–12 months of data from a tool with a reasonable premium floor filter.

What's the difference between backtesting and forward testing?

A backtest runs on historical data: it tells you how a strategy would have performed if you'd followed it in the past. Forward testing (also called paper trading or out-of-sample testing) applies the same strategy to live data and records the actual outcomes as they happen. Forward tests are more credible because they can't be unconsciously biased by the analyst seeing the outcomes before building the rules. RadarPulse's Scorecard is a forward test: signals are locked in at execution time and outcomes are measured prospectively.

Do options flow signals work differently in bear markets?

Put flow signals show stronger hit-rates in bear markets than call flow signals, for the intuitive reason that the underlying trend reinforces bearish directional bets. But aggregate put flow hit-rates in bear markets can be misleadingly high because any put signal benefits from the downtrend regardless of whether it was informational. The signal quality metric that survives regime changes better is not raw hit-rate but the excess hit-rate above the baseline for put flow in the prevailing regime: measuring the signal against a regime-appropriate null, not a fixed 50%.
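
As a rough illustration of the excess hit-rate idea, the sketch below compares PUT signals against a baseline built from every ticker-day in the same universe and regime. That choice of baseline is one reasonable null among several, and all returns are made-up values.

```python
import pandas as pd

# Hypothetical 3-day forward returns (%) for PUT signals in one regime
signal_returns = pd.Series([-2.1, -0.4, 1.3, -3.0, 0.8, -1.2, -0.6, 2.4])

# Hypothetical 3-day forward returns (%) for every ticker-day in the same
# universe and regime, signal or not: the regime-appropriate null
baseline_returns = pd.Series([-1.0, 0.5, -0.7, -2.2, 1.1, -0.3, 0.9, -1.5, -0.2, 0.4])

put_hit_rate = (signal_returns < 0).mean()     # raw hit-rate of PUT signals
baseline_rate = (baseline_returns < 0).mean()  # how often "down" happened anyway
excess = put_hit_rate - baseline_rate

print(f"raw={put_hit_rate:.3f} baseline={baseline_rate:.3f} excess={excess:+.3f}")
```

In this toy regime a raw 62.5% put hit-rate shrinks to an excess of 2.5 points once the 60% base rate of down moves is subtracted, which is the distinction the answer above draws.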

RadarPulse Scorecard: the only transparent, prospective track record for unusual options flow. Every EXTREME and ELEVATED signal scored from a live session is recorded and measured forward. No cherry-picking, no retroactive selection.

RadarPulse is currently in its pre-launch phase. Historical data, API access, and the live Scorecard are building with every session.
