Options Flow Historical Data: How to Access, Backtest, and Evaluate Signal Accuracy
Historical options flow data is the foundation for any serious evaluation of whether unusual activity has real predictive value or is just noise that gets retroactively explained. Here is what historical flow data contains, how to access it, how to build a backtest, and the accountability gap that most flow tools quietly ignore.
What historical options flow data contains
A live options flow feed shows prints as they happen. Historical flow data is the same feed archived by session. Each record captures the state of the print at execution time, not an after-the-fact reconstruction.
A quality historical dataset includes, per print:
| Field | Why it matters for research |
|---|---|
| Timestamp | Maps the print to a market regime, session, and underlying spot price |
| Ticker & underlying spot | Required to compute forward returns |
| Contract type (CALL/PUT) | Defines the implied directional bet |
| Strike & expiry (DTE) | Separates urgent short-dated bets from long-dated hedges |
| Volume & open interest | Together they give the Vol/OI ratio, the strongest single factor in unusualness scoring |
| Premium paid | Conviction proxy: size filters out noise |
| Aggressor side (bid/ask/mid) | Ask-side indicates urgency; bid-side is more ambiguous |
| Execution type (sweep/block) | Sweeps cross multiple venues instantly and are strongly directional |
| Unusualness score | The composite signal quality metric, needed to segment outcomes by tier |
| Sector/market cap | Allows sector-level aggregation in the backtest |
What historical flow data does not contain is the underlying's forward price; that must be joined from a separate OHLCV source (Polygon, Finnhub, yfinance) based on the snapshot timestamp.
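As a rough sketch of that join, assuming a flow export with `ticker` and `timestamp` columns and a minute-bar table with `ticker`, `timestamp`, and `close` (all column names and sample values here are hypothetical, not any provider's schema), `pandas.merge_asof` can attach the most recent bar at or before each print:

```python
import pandas as pd

# Hypothetical flow prints (column names are assumptions, not a provider schema)
flow = pd.DataFrame({
    "ticker": ["NVDA", "AAPL"],
    "timestamp": pd.to_datetime(["2024-03-01 15:23:41", "2024-03-01 15:40:05"]),
    "type": ["CALL", "PUT"],
})

# Hypothetical minute bars from a separate OHLCV source (prices are invented)
bars = pd.DataFrame({
    "ticker": ["NVDA", "NVDA", "AAPL", "AAPL"],
    "timestamp": pd.to_datetime([
        "2024-03-01 15:23:00", "2024-03-01 15:24:00",
        "2024-03-01 15:39:00", "2024-03-01 15:40:00",
    ]),
    "close": [100.0, 101.0, 50.0, 51.0],
})

# merge_asof requires both frames to be sorted by the time key
joined = pd.merge_asof(
    flow.sort_values("timestamp"),
    bars.sort_values("timestamp"),
    on="timestamp",
    by="ticker",                      # only match bars for the same ticker
    direction="backward",             # last bar at or before the print: no look-ahead
    tolerance=pd.Timedelta("5min"),   # leave the price as NaN if the nearest bar is stale
).rename(columns={"close": "entry_price"})

print(joined)
```

This assumes bars are stamped at their close time. If your price source stamps bars at the open, shift the bar timestamps first, otherwise the "backward" match can still leak a few seconds of future price into the entry.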
How far back providers store it
Depth of history varies widely and is rarely disclosed upfront:
- 30–90 days: most common in entry-tier plans. Enough for pattern lookups, not enough for multi-regime backtesting.
- 6–12 months: spans two to four earnings seasons, but usually sits inside a single market regime and may miss major volatility events (2020 COVID crash, 2022 rate hike cycle).
- 2–5 years: necessary for statistically meaningful sector-level analysis and regime comparison. Usually a paid premium tier.
- Full OPRA tape (all options prints): available from CBOE and Nasdaq directly, extremely large, priced for institutions. Most retail flow tools pre-filter this tape for unusual prints only.
For evaluating whether high-scoring flow signals carry real directional information, 12 months is the practical minimum. You need enough samples per sector and score tier to generate confidence intervals below ±10 percentage points.
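To see what that target implies for sample size, here is a minimal sketch using the normal approximation to a binomial proportion. The function names are illustrative, and for small cells or hit-rates near 0% or 100% a Wilson interval would be a safer choice:

```python
import math

def ci_half_width(hit_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(hit_rate * (1 - hit_rate) / n)

def samples_needed(hit_rate: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose interval half-width is at or below the target."""
    return math.ceil(z**2 * hit_rate * (1 - hit_rate) / half_width**2)

# How wide is a 95% interval around a 60% hit-rate at different sample sizes?
for n in (30, 100, 400):
    print(n, f"±{ci_half_width(0.60, n) * 100:.1f} pp")

# Samples required per cell for a ±10 pp interval at a 60% hit-rate
print(samples_needed(0.60, 0.10))
```

At a 60% hit-rate, 30 samples gives an interval of roughly ±17.5 percentage points, and about 93 samples per cell are needed to get down to ±10.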
How to backtest options flow signals
A rigorous options flow backtest follows five steps:
- Define your signal universe. What qualifies as a signal? For example: score ≥ 70, premium ≥ $100,000, aggressor side = ASK, execution = SWEEP. This becomes your filter. Apply it identically to every session in your dataset: no look-ahead, no manual selection.
- Record the snapshot. For each qualifying signal, record the underlying's closing price on the signal date (or the midpoint price closest to the signal timestamp for intraday precision). This is your entry reference price.
- Measure forward returns. At your chosen horizon (1 trading day, 3 days, 5 days), record the underlying's closing price. The forward return = (exit price − entry price) / entry price × 100%.
- Classify the outcome. A CALL signal is a directional hit if the forward return is positive; a PUT signal is a hit if the return is negative. This directional classification, not options P&L, is the cleanest signal quality metric (it removes IV, decay, and bid-ask noise from the equation).
- Aggregate by tier, type, and sector. Calculate hit-rate (% correct directional calls), average forward return, and standard deviation by score tier (EXTREME / ELEVATED / NOTABLE), option type (CALL / PUT), and sector (Technology, Biotech, Energy, etc.). The tier × sector cross-tab is where the most actionable patterns appear.
Common backtesting mistakes
- Survivorship sampling: only including signals that were followed by large moves. This is the most common form of options flow cherry-picking and produces meaninglessly inflated hit-rates.
- Intraday timing games: using the low of day as the entry price for call signals and the high of day as entry for put signals. A fair backtest uses the same objective price (closing price or signal-time midpoint) for every signal.
- Ignoring small samples: reporting a 100% hit-rate from 4 signals in biotech is worse than reporting a 60% hit-rate from 80 signals. Require a minimum sample (30+ per cell) before reporting a rate.
- Mixing market regimes: a backtest that runs across 2020 (crash + recovery), 2021 (gamma squeeze), and 2022 (rate hikes) is averaging over very different environments. Segment by regime to find where signals are strongest.
- Forgetting the denominator: reporting 10 winning examples without disclosing how many total signals were evaluated in the same period.
What hit-rates to expect (honestly)
Honest published research on options flow signal accuracy is sparse, because most providers have no incentive to publish numbers that might disappoint users. The figures that do appear in academic literature and independent analysis suggest:
- EXTREME tier (score 85+, Vol/OI 10×+, premium $250K+, ask-side sweep): 3-day directional hit-rates in liquid large-cap names of 58–65% across bull-market regimes. Weaker in high-volatility environments (VIX above 25).
- ELEVATED tier (score 70–84): 3-day hit-rates of 53–59%: meaningful outperformance over random, but with considerably more variance than EXTREME.
- NOTABLE tier (score 55–69): 3-day hit-rates of 50–54%. Individually noisy, but sector aggregates still show actionable patterns, especially in healthcare and energy where catalyst-driven flow is more concentrated.
A few important calibrations:
- These are directional hit-rates on the underlying, not options P&L. A 60% directional hit-rate on the stock does not mean 60% of options positions profit. Theta decay, IV changes, and bid-ask costs often turn correctly-directed trades into losses.
- Hit-rates degrade meaningfully when DTE is short and the signal is in biotech ahead of a binary catalyst (FDA decision, earnings); the signal may be about volatility magnitude, not direction.
- Congress + flow confluence on the same ticker, where available data exists, tends to show higher hit-rates than standalone flow; a cross-domain signal from two distinct data sources is harder to explain as coincidence.
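The first calibration, that a directional hit is not the same as a profitable option, can be illustrated with a small Black-Scholes sketch. Every input here is hypothetical (a 7-DTE call bought 5% out of the money, zero rates, no dividends), and Black-Scholes is only a convenient stand-in for a real pricing model:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(spot: float, strike: float, years: float, vol: float, rate: float = 0.0) -> float:
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * years) / (vol * math.sqrt(years))
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Hypothetical entry: 7 days to expiry, 60% implied volatility
entry = bs_call(spot=100.0, strike=105.0, years=7 / 365, vol=0.60)

# Three days later: the stock is up 1% (a directional "hit"),
# but three days of theta have passed and implied volatility has dropped to 45%
exit_ = bs_call(spot=101.0, strike=105.0, years=4 / 365, vol=0.45)

print(f"entry {entry:.2f}, exit {exit_:.2f}, option P&L {exit_ / entry - 1:+.0%}")
```

Under these assumptions the underlying moved the right way and the call still lost value, which is why the hit-rates above should not be read as option win-rates.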
Python workflow for signal analysis
A minimal backtest on exported options flow data using pandas and yfinance:
```python
import pandas as pd
import yfinance as yf
from datetime import timedelta

# Load your historical flow export (CSV from RadarPulse or API response)
flow = pd.read_csv("flow_history.csv", parse_dates=["timestamp"])

# 1. Filter for your signal universe
signals = flow[
    (flow["score"] >= 70) &
    (flow["premium"] >= 100_000) &
    (flow["side"] == "ASK") &
    (flow["kind"] == "SWEEP")
].copy()

# 2. For each signal, fetch the underlying's price at signal date + forward horizons
def forward_return(ticker, signal_date, days):
    start = signal_date
    end = signal_date + timedelta(days=days + 5)  # buffer for market closures
    hist = yf.Ticker(ticker).history(start=start, end=end)
    # Need the entry bar plus `days` later bars; skip the signal rather than
    # silently measuring a shorter horizon
    if len(hist) <= days:
        return None
    entry = hist["Close"].iloc[0]
    # Close on the Nth trading day after the signal
    exit_ = hist["Close"].iloc[days]
    return (exit_ - entry) / entry * 100

results = []
for _, row in signals.iterrows():
    ret_3d = forward_return(row["ticker"], row["timestamp"].date(), 3)
    if ret_3d is None:
        continue
    # Directional hit: CALL needs positive return, PUT needs negative
    is_hit = bool(
        (row["type"] == "CALL" and ret_3d > 0) or
        (row["type"] == "PUT" and ret_3d < 0)
    )
    results.append({
        "ticker": row["ticker"],
        "type": row["type"],
        "score": row["score"],
        "sector": row.get("sector", "Other"),
        "ret_3d": ret_3d,
        "hit": is_hit,
        "flag": "EXTREME" if row["score"] >= 85 else "ELEVATED" if row["score"] >= 70 else "NOTABLE"
    })

df = pd.DataFrame(results)

# 3. Aggregate by flag and option type
summary = (
    df.groupby(["flag", "type"])
    .agg(
        count=("hit", "size"),
        hit_rate=("hit", "mean"),
        avg_ret=("ret_3d", "mean"),
        std_ret=("ret_3d", "std")
    )
    .round(3)
)
print(summary[summary["count"] >= 30])  # only report cells with enough data
```
Key notes: use yfinance for quick backtests but Polygon or CBOE data for production research (yfinance data is not adjusted for splits in all edge cases, and it serves intraday bars only for a short recent window). The forward_return function uses closing prices; for intraday precision, join to the minute-level OHLCV nearest the signal timestamp.
Sector-level heatmap
```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pivot = df.pivot_table(
    values="hit", index="sector", columns="flag",
    # Return NaN (not None) for thin cells so the pivot stays numeric for the heatmap
    aggfunc=lambda x: x.mean() if len(x) >= 10 else np.nan
)
sns.heatmap(pivot, annot=True, fmt=".0%", cmap="RdYlGn",
            center=0.5, vmin=0.4, vmax=0.7)
plt.title("Directional hit-rate by sector × score tier (3-day)")
plt.tight_layout()
plt.show()
```
This heatmap often reveals that a few sector × tier combinations drive most of the edge. For example, EXTREME call flow in Technology and Healthcare outperforms EXTREME call flow in Consumer Staples, which is structurally less catalyst-driven.
Evaluating provider data quality
Not all historical flow datasets are equivalent. Evaluate a provider on these dimensions before building research on their data:
| Question to ask | Why it matters |
|---|---|
| How far back does the data go? | Less than 12 months is insufficient for multi-regime analysis |
| Is it raw OPRA tape or pre-filtered? | Pre-filtered data may exclude prints your criteria would catch; raw tape includes noise you'd need to filter yourself |
| Are Vol/OI ratios computed correctly? | Some providers use total OI across all strikes; correct is same-strike OI at trade time, a significant difference for short-dated prints |
| Is aggressor side included? | Without bid/ask classification, you can't filter for urgency, the most important execution quality signal |
| Are multi-leg (spread) prints separated from single-leg? | Multi-leg prints may look like unusual directional flow but are often synthetic positions, hedges, or risk reversals with no strong directional bias |
| Does the provider publish their own outcome data? | A provider confident in their signal quality should track outcomes. If they don't, ask why. |
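
To make the Vol/OI row concrete, here is a small arithmetic sketch with invented numbers showing how far the two denominators can diverge for the same print:

```python
# Hypothetical open interest by strike for a single expiry, as of the trade time
open_interest_by_strike = {130: 25_000, 135: 800, 140: 12_000}

print_strike = 135
print_volume = 4_000

# Same-strike OI: the print is five times the existing position in that contract
same_strike_ratio = print_volume / open_interest_by_strike[print_strike]

# OI summed across all strikes: the same print looks unremarkable
chain_wide_ratio = print_volume / sum(open_interest_by_strike.values())

print(f"same-strike Vol/OI: {same_strike_ratio:.2f}")
print(f"chain-wide Vol/OI:  {chain_wide_ratio:.2f}")
```

A provider using the chain-wide denominator would score this print far lower, so it is worth spot-checking a few records against the chain before trusting a vendor's ratio.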
The accountability gap: why most tools hide track records
The options flow tool market has a systematic accountability problem. Because past prints are abundant, it's trivially easy to find examples that look prescient: EXTREME calls on a name three days before an earnings beat, large put sweeps ahead of a sector selloff. Social media amplifies these examples because they're compelling stories.
What you almost never see is the denominator: out of all the EXTREME calls in the same period, how many actually preceded upside? That number is available from the same data. It just doesn't get posted because it's usually closer to 55–60% than 90%, and "55% directional accuracy on high-scoring signals" is harder to tweet than a screenshot of a 100x put.
The correct standard is a prospective, systematic track record:
- Every signal meeting the criteria is recorded at the time of the signal, not retroactively
- Outcome is measured at a fixed, pre-declared horizon (1d, 3d, 5d)
- Results are published once the sample is large enough to be meaningful (30+ outcomes)
- The methodology (what counts as a signal, what counts as a hit) is published alongside the numbers
RadarPulse's Smart-Money Scorecard is built on this standard. Every EXTREME and ELEVATED print scored from a live session is logged with the underlying spot price, and the forward move is measured automatically as the session data accumulates. The track record builds prospectively, without cherry-picking, and the methodology is documented. The numbers that emerge are honest ones, useful for calibrating how much weight to put on any given signal, not a marketing claim about performance.
Accessing historical data via API
Most quality options flow tools expose historical data through a dedicated endpoint alongside their live feed. The typical pattern:
```
# RadarPulse historical flow endpoint (Elite tier, staged for next release)
GET /api/v1/flow/historical?from=2026-06-01&to=2026-06-29&score_min=70&type=CALL&limit=500
Authorization: x-api-key YOUR_KEY_HERE

# Response: paginated list of prints from the specified date range
{
  "prints": [
    {
      "ticker": "NVDA",
      "type": "CALL",
      "strike": 135,
      "dte": 7,
      "premium": 2450000,
      "volOI": 12.4,
      "side": "ASK",
      "kind": "SWEEP",
      "score": 91,
      "flag": "EXTREME",
      "spot": 131.20,
      "timestamp": "2026-06-15T10:23:41Z",
      "sector": "Technology"
    }
    // ...
  ],
  "total": 842,
  "next_cursor": "eyJ0cyI6MTc1MDAwMDAwMH0="
}
```
Key parameters to look for in a historical endpoint:
- Date range (from/to): ISO 8601 dates or Unix timestamps
- Score filter (score_min): pre-filter server-side to reduce payload size
- Pagination cursor: essential for large date ranges; avoid offset-based pagination (offset becomes slow on large tables)
- Underlying spot price: must be included in the historical record, not looked up later, to ensure the entry reference is the price at signal time, not the price when you fetch
For building a research pipeline, fetch historical data in batches of 30-day windows, cache to local Parquet files, and join with yfinance / Polygon for forward prices. Avoid re-fetching the same date ranges repeatedly; most historical endpoints count against rate limits even for repeated identical queries.
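A minimal sketch of the batching and cursor-following logic is below. The `fetch_page` callable is hypothetical: it stands in for whatever HTTP call your provider requires and is assumed to return a dict with `prints` and `next_cursor` keys, as in the example response above. A fake implementation is used so the sketch runs offline:

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Optional

def date_windows(start: date, end: date, days: int = 30) -> Iterator[tuple[date, date]]:
    """Yield consecutive, non-overlapping (from, to) windows covering start..end inclusive."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)

# Hypothetical signature: (from, to, cursor) -> {"prints": [...], "next_cursor": str or None}
FetchPage = Callable[[date, date, Optional[str]], dict]

def fetch_window(fetch_page: FetchPage, start: date, end: date) -> list[dict]:
    """Follow next_cursor until the server stops returning one."""
    prints: list[dict] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(start, end, cursor)
        prints.extend(page["prints"])
        cursor = page.get("next_cursor")
        if not cursor:
            return prints

# Stand-in for a real HTTP call: two pages per window, one print per page
def fake_fetch_page(start: date, end: date, cursor: Optional[str]) -> dict:
    if cursor is None:
        return {"prints": [{"ticker": "NVDA"}], "next_cursor": "page2"}
    return {"prints": [{"ticker": "AAPL"}], "next_cursor": None}

windows = list(date_windows(date(2024, 1, 1), date(2024, 3, 15)))
all_prints = [p for w in windows for p in fetch_window(fake_fetch_page, *w)]
print(len(windows), "windows,", len(all_prints), "prints")
```

In a real pipeline, each window's result would be written to its own Parquet file (for example with `DataFrame.to_parquet`), and windows whose file already exists would be skipped rather than re-fetched.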
See the options flow API guide for authentication patterns, rate limit management, and WebSocket vs REST trade-offs in more detail.
What to do with the data once you have it
Beyond backtesting hit-rates, historical options flow data supports several other research workflows:
Pre-earnings pattern analysis
Filter historical prints to the 5 trading days before each company's earnings report and aggregate by score tier and direction. The question: do EXTREME call sweeps in the week before earnings show a higher directional hit-rate than the session average? If yes, earnings-window flow deserves higher conviction weight in your live workflow.
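One way to tag the earnings window, sketched with invented tickers and dates. It assumes one earnings date per ticker (a real calendar has one per quarter, so you would match each print to its next report) and uses `numpy.busday_count`, which skips weekends but not market holidays:

```python
import numpy as np
import pandas as pd

# Hypothetical evaluated prints: one row per signal with its backtest outcome
prints = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB"],
    "date": pd.to_datetime(["2024-05-20", "2024-05-01", "2024-05-01"]),
    "hit": [True, False, True],
})

# Hypothetical earnings calendar
earnings = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "earnings_date": pd.to_datetime(["2024-05-22", "2024-05-02"]),
})

merged = prints.merge(earnings, on="ticker")

# Business days from the print date up to (not including) the earnings date
merged["bdays_to_earnings"] = np.busday_count(
    merged["date"].values.astype("datetime64[D]"),
    merged["earnings_date"].values.astype("datetime64[D]"),
)

# In the window if the report is 1 to 5 business days away
merged["pre_earnings"] = merged["bdays_to_earnings"].between(1, 5)

summary = merged.groupby("pre_earnings")["hit"].agg(["size", "mean"])
print(summary)
```

Comparing the `mean` column across the two groups, once each has enough samples, answers the question posed above.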
Congress × flow confluence scoring
Tag each historical print against the congressional disclosure data for the same ticker (available from RadarPulse's Congress tracker and the STOCK Act disclosure database). Compute hit-rates for prints where Congress was also active in the same name vs. prints without congressional overlap. This cross-domain validation is one of the most differentiated research questions available from public data.
Sector rotation timing
Aggregate historical EXTREME flow by sector per week. Build a time series of sector-level unusual activity premium. Identify weeks where a sector saw concentrated unusual flow, then measure the sector ETF's forward return at 5 and 10 trading days. This builds a signal for sector rotation timing, based not on price action (which is after-the-fact) but on real-money options positioning.
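The weekly aggregation step might look like the following sketch. The prints are invented, and the column names (`timestamp`, `sector`, `premium`) follow the example export used earlier rather than any guaranteed schema:

```python
import pandas as pd

# Hypothetical EXTREME prints (values invented for illustration)
prints = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-04", "2024-03-05", "2024-03-06", "2024-03-12", "2024-03-13",
    ]),
    "sector": ["Energy", "Energy", "Technology", "Energy", "Technology"],
    "premium": [300_000, 500_000, 250_000, 400_000, 900_000],
})

# Total unusual premium per sector per week (weeks ending Friday)
weekly = (
    prints
    .groupby([pd.Grouper(key="timestamp", freq="W-FRI"), "sector"])["premium"]
    .sum()
    .unstack(fill_value=0)
)

# Each sector's share of that week's unusual premium
share = weekly.div(weekly.sum(axis=1), axis=0)
print(share.round(2))
```

The `share` frame is the time series to align against sector ETF forward returns. Using shares rather than raw premium keeps busy weeks from dominating the comparison.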
Score calibration
Run the backtest across multiple score thresholds (65, 70, 75, 80, 85, 90) and plot hit-rate vs. threshold. The inflection point, where hit-rates start improving meaningfully, is the empirically supported threshold for your specific universe and time period. This calibration is more reliable than using a threshold that was set arbitrarily at product launch.
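A minimal sketch of that sweep, assuming a results frame with `score` and `hit` columns like the one built in the backtest. The helper name and the tiny demo frame are illustrative only:

```python
import pandas as pd

def hit_rate_by_threshold(df: pd.DataFrame, thresholds, min_count: int = 30) -> pd.DataFrame:
    """Hit-rate of all signals scoring at or above each threshold."""
    rows = []
    for t in thresholds:
        subset = df[df["score"] >= t]
        rows.append({
            "threshold": t,
            "count": len(subset),
            # Suppress the rate when the sample is too thin to be meaningful
            "hit_rate": subset["hit"].mean() if len(subset) >= min_count else float("nan"),
        })
    return pd.DataFrame(rows).set_index("threshold")

# Tiny synthetic frame just to show the output shape (min_count lowered accordingly)
demo = pd.DataFrame({
    "score": [66, 72, 78, 83, 88, 93],
    "hit": [False, True, False, True, True, True],
})
curve = hit_rate_by_threshold(demo, [65, 70, 75, 80, 85, 90], min_count=1)
print(curve)
```

Because each threshold's sample contains all the higher-threshold samples, the curve is smoothed by construction. Bucketing by score band (70–74, 75–79, and so on) gives a noisier but less correlated view.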
Frequently asked questions
Is historical options flow data the same as options chain history?
No. Options chain history (historical OHLCV per contract, open interest per strike per day) is widely available from CBOE, Nasdaq, and data vendors. Options flow historical data is a different cut of the market: a filtered subset of the intraday trade tape that keeps only the unusual prints (sweeps, large blocks, high Vol/OI trades), along with the execution context (aggressor side, sweep vs block) that options chain history doesn't include. You cannot derive flow data from options chain history because the chain history shows end-of-day snapshots, not intraday execution details.
Can I access CBOE options flow data directly?
CBOE distributes options market data through its DataShop product. The raw OPRA tape (all options prints) is available for institutional subscribers, typically via FTP or SFTP in large daily flat files. It includes every trade but without the scoring, filtering, or execution-side tagging that flow tools provide. Building a flow tool from raw OPRA tape requires significant engineering: parsing 1–5GB daily files, computing Vol/OI ratios at the time of each trade (not end-of-day), identifying sweeps across multiple exchanges, and computing unusualness scores.
How many samples do I need for a meaningful backtest?
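Per cell (each tier × type × sector combination), 30 samples is a workable floor for reporting a rate at all, though at a 60% hit-rate it still leaves a 95% confidence interval of roughly ±18 percentage points; reaching ±10 takes about 90 or more per cell. For a standard backtest with EXTREME / ELEVATED tiers, CALL / PUT types, and 10 sectors, that is 2 × 2 × 10 = 40 cells, so you need roughly 1,200 samples in the most granular cuts at the 30-per-cell floor. Whether 6–12 months of data gets you there depends on how evenly signals spread across sectors and on the tool's premium floor filter.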
What's the difference between backtesting and forward testing?
A backtest runs on historical data: it tells you how a strategy would have performed if you'd followed it in the past. Forward testing (also called paper trading or out-of-sample testing) applies the same strategy to live data and records the actual outcomes as they happen. Forward tests are more credible because they can't be unconsciously biased by the analyst seeing the outcomes before building the rules. RadarPulse's Scorecard is a forward test: signals are locked in at execution time and outcomes are measured prospectively.
Do options flow signals work differently in bear markets?
Put flow signals show stronger hit-rates in bear markets than call flow signals, for the intuitive reason that the underlying trend reinforces bearish directional bets. But aggregate put flow hit-rates in bear markets can be misleadingly high because any put signal benefits from the downtrend regardless of whether it was informational. The signal quality metric that survives regime changes better is not raw hit-rate but the excess hit-rate above the baseline for put flow in the prevailing regime: measuring the signal against a regime-appropriate null, not a fixed 50%.
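The excess hit-rate can be computed as a simple difference, sketched here with invented counts. The baseline is assumed to be the share of all comparable forward windows in the same regime that closed lower, whether or not a signal fired:

```python
def excess_hit_rate(signal_hits: list[bool], baseline_down_moves: list[bool]) -> float:
    """Put-signal hit-rate minus the share of all comparable windows that closed lower."""
    signal_rate = sum(signal_hits) / len(signal_hits)
    baseline_rate = sum(baseline_down_moves) / len(baseline_down_moves)
    return signal_rate - baseline_rate

# Hypothetical bear-market sample
put_hits = [True] * 62 + [False] * 38          # 62% of put signals preceded a decline
all_windows_down = [True] * 58 + [False] * 42  # 58% of all 3-day windows declined anyway

print(f"excess hit-rate: {excess_hit_rate(put_hits, all_windows_down):+.2f}")
```

In this made-up case a 62% raw hit-rate carries only about four percentage points of information beyond the prevailing downtrend.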
RadarPulse Scorecard: the only transparent, prospective track record for unusual options flow. Every EXTREME and ELEVATED signal scored from a live session is recorded and measured forward. No cherry-picking, no retroactive selection.
RadarPulse is currently in its pre-launch phase. Historical data, API access, and the live Scorecard are building with every session.