HF Data Library

Free, research-grade, high-frequency U.S. equity data for academics and researchers. Documented, version-controlled, and updated weekly.

1,391 Tickers
1.53B 1-Minute Bars
23+ Years of Data
27 Academic Variables

What is this?

One-minute OHLCV bars for 1,391 U.S. equities and ETFs, from December 2002 through the present. Sourced from the consolidated tape (CTA/UTP) via PiTrading and IEX Exchange HIST.

Three cleaning versions so you can choose the level of processing appropriate for your research. Twenty-seven pre-computed academic variables per ticker per day. Full methodology documentation.

Updated every week. No subscription. No paywall. Licensed under CC BY 4.0.

# Python — load any ticker in seconds
import pandas as pd

df = pd.read_parquet("AAPL_clean.parquet")
print(df.head())

# datetime              Open    High    Low     Close   Volume
# 2002-12-30 09:30:00   0.98    0.99    0.98    0.98    842900
# 2002-12-30 09:31:00   0.98    0.99    0.98    0.99    521400
# ...
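
Lower-frequency bars can be derived from the 1-minute data. The sketch below assumes the column layout shown in the sample output above (a datetime index with Open, High, Low, Close, Volume). The helper `to_five_minute` and the synthetic frame are illustrative only; they are not part of the library.

```python
import pandas as pd

def to_five_minute(bars: pd.DataFrame) -> pd.DataFrame:
    """Aggregate 1-minute OHLCV bars into 5-minute bars (hypothetical helper)."""
    rules = {"Open": "first", "High": "max", "Low": "min",
             "Close": "last", "Volume": "sum"}
    out = bars.resample("5min").agg(rules)
    # Windows that contained no 1-minute bars come back with NaN prices; drop them.
    return out.dropna(subset=["Open"])

# Synthetic stand-in for pd.read_parquet("AAPL_clean.parquet").
idx = pd.date_range("2002-12-30 09:30", periods=10, freq="min")
bars = pd.DataFrame({
    "Open":   [0.98, 0.98, 0.99, 0.99, 0.98, 0.98, 0.97, 0.97, 0.98, 0.99],
    "High":   [0.99, 0.99, 1.00, 0.99, 0.99, 0.98, 0.98, 0.98, 0.99, 1.00],
    "Low":    [0.98, 0.98, 0.98, 0.98, 0.97, 0.97, 0.96, 0.97, 0.97, 0.98],
    "Close":  [0.98, 0.99, 0.99, 0.98, 0.98, 0.97, 0.97, 0.98, 0.99, 0.99],
    "Volume": [100, 200, 300, 400, 500, 100, 100, 100, 100, 100],
}, index=idx)

five = to_five_minute(bars)
print(five)
```

Each 5-minute bar takes the first open, highest high, lowest low, last close, and summed volume of the 1-minute bars it covers.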

Three cleaning versions. You choose.


Version 1: Raw

Data as received from the source. No outlier removal, no gap-filling. Prices are split- and dividend-adjusted. 1,533,403,126 bars.

Best for: Market microstructure research, missingness analysis, studying the data itself.


Version 2: Clean

Nine-step cleaning pipeline applied, including outside-hours removal, removal of non-positive prices, OHLC-violation checks, duplicate-bar removal, and the Brownlees-Gallo outlier filter. Gaps preserved. 1,533,014,567 bars.

Best for: Volatility estimation, spread measurement, jump detection — most empirical finance.
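
To make the first few cleaning steps concrete, here is a minimal sketch of four of the filters named above. It is an illustration under assumptions, not the library's actual pipeline: `basic_clean` is a hypothetical helper, the column names follow the sample output, and the Brownlees-Gallo filter and remaining steps are omitted.

```python
import pandas as pd

def basic_clean(bars: pd.DataFrame) -> pd.DataFrame:
    """Apply four simple bar filters (hypothetical, simplified)."""
    px = ["Open", "High", "Low", "Close"]
    # 1. Keep regular-session bars only (09:30 to 15:59, both ends inclusive).
    out = bars.between_time("09:30", "15:59")
    # 2. Drop bars where any price is zero or negative.
    out = out.loc[(out[px] > 0).all(axis=1).to_numpy()]
    # 3. Drop OHLC violations: High must be the largest price, Low the smallest.
    ok = (out["High"] >= out[px].max(axis=1)) & (out["Low"] <= out[px].min(axis=1))
    out = out.loc[ok.to_numpy()]
    # 4. Drop duplicate timestamps, keeping the first occurrence.
    return out.loc[~out.index.duplicated(keep="first")]

# Synthetic bars: one pre-open bar, one duplicate, one negative price,
# one OHLC violation (High below Open), one good bar, one post-close bar.
idx = pd.to_datetime([
    "2024-01-02 09:29", "2024-01-02 09:30", "2024-01-02 09:30",
    "2024-01-02 09:31", "2024-01-02 09:32", "2024-01-02 09:33",
    "2024-01-02 16:00",
])
raw = pd.DataFrame({
    "Open":   [10.0, 10.0, 10.0, -1.0, 10.0, 10.0, 10.0],
    "High":   [10.1, 10.1, 10.1, 10.1,  9.9, 10.2, 10.1],
    "Low":    [ 9.9,  9.9,  9.9,  9.9,  9.8,  9.9,  9.9],
    "Close":  [10.0, 10.0, 10.0, 10.0, 10.0, 10.1, 10.0],
    "Volume": [100, 200, 200, 300, 400, 500, 600],
}, index=idx)

cleaned = basic_clean(raw)
print(cleaned)
```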


Version 3: Filled

Clean data with LOCF gap-filling to produce a regular 390-bar daily grid (09:30–15:59 ET). Every bar flagged as original or filled. 2,342,519,726 bars.

Best for: Machine learning, backtesting systems, time-series models requiring regular grids.
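
The 390-bar grid and the original/filled flag can be sketched as follows. How the library prices a filled bar is not stated here, so this example assumes a flat bar at the last observed close with zero volume; `fill_session` and the `filled` column name are hypothetical.

```python
import pandas as pd

def fill_session(bars: pd.DataFrame, day: str) -> pd.DataFrame:
    """Reindex one day of bars onto a 390-minute grid with LOCF (hypothetical)."""
    grid = pd.date_range(f"{day} 09:30", f"{day} 15:59", freq="min")  # 390 stamps
    out = bars.reindex(grid)
    # Flag bars that were not present in the input.
    out["filled"] = out["Close"].isna()
    # Assumption: a filled bar is flat at the last observed close, zero volume.
    last_close = out["Close"].ffill()
    for col in ["Open", "High", "Low", "Close"]:
        out[col] = out[col].fillna(last_close)
    out["Volume"] = out["Volume"].fillna(0)
    return out

# Three observed bars with a two-minute gap after 09:31.
idx = pd.to_datetime(["2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:34"])
bars = pd.DataFrame({
    "Open":   [10.0, 10.1, 10.2],
    "High":   [10.1, 10.2, 10.3],
    "Low":    [ 9.9, 10.0, 10.1],
    "Close":  [10.1, 10.2, 10.2],
    "Volume": [500, 300, 400],
}, index=idx)

filled = fill_session(bars, "2024-01-02")
print(filled.head())
```

In this sketch, any minutes before the first observed bar of the day would stay empty, since there is no earlier value to carry forward.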

27 pre-computed academic variables

Computed daily for each ticker in each cleaning version. Ready to use in your research.


Volatility

Realized variance (1-min, 5-min), bipower variation, Parkinson range, Yang-Zhang OHLC
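
For readers who want to see what these measures look like in code, the sketch below gives textbook forms of realized variance, bipower variation, and the Parkinson range estimator for one day of bars. The function names are hypothetical, and the library's exact conventions (sampling offsets, scaling, finite-sample corrections) may differ. Yang-Zhang is omitted because it needs overnight returns across several days.

```python
import numpy as np

def realized_variance(close, step=1):
    """Sum of squared log returns, sampling every `step` bars."""
    r = np.diff(np.log(np.asarray(close, dtype=float)[::step]))
    return float(np.sum(r ** 2))

def bipower_variation(close):
    """(pi/2) times the sum of products of adjacent absolute log returns."""
    r = np.abs(np.diff(np.log(np.asarray(close, dtype=float))))
    return float((np.pi / 2) * np.sum(r[1:] * r[:-1]))

def parkinson_variance(high, low):
    """Parkinson estimator from the day's highest high and lowest low."""
    hl = np.log(np.max(high) / np.min(low))
    return float(hl ** 2 / (4 * np.log(2)))

# Tiny synthetic example: a price that zigzags between 100 and 101.
close = [100.0, 101.0, 100.0, 101.0, 100.0, 101.0]
rv1 = realized_variance(close)            # 1-bar sampling
rv5 = realized_variance(close, step=5)    # every 5th bar
bv = bipower_variation(close)
pk = parkinson_variance(high=[101.0, 101.5], low=[99.5, 100.0])
print(rv1, rv5, bv, pk)
```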

Spreads

Roll (1984) implied spread, Corwin-Schultz (2012) high-low spread
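
As an illustration of the Roll measure, the sketch below estimates the spread as twice the square root of the negated first-order autocovariance of price changes. `roll_spread` is a hypothetical helper; returning NaN when the autocovariance is non-negative is one common convention and may not match the library's handling. Corwin-Schultz is not shown.

```python
import numpy as np

def roll_spread(prices):
    """Roll (1984) implied spread from a series of trade or close prices."""
    dp = np.diff(np.asarray(prices, dtype=float))
    cov = np.cov(dp[1:], dp[:-1])[0, 1]   # first-order autocovariance
    # The estimator is undefined when the autocovariance is not negative.
    return float(2 * np.sqrt(-cov)) if cov < 0 else float("nan")

# Synthetic check: a constant mid price of 100 with trades hitting the
# bid or ask at random, so the true spread is 0.10.
rng = np.random.default_rng(0)
side = rng.choice([-1.0, 1.0], size=20_000)
trades = 100.0 + 0.05 * side
est = roll_spread(trades)
print(est)
```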

Autocorrelation

First-order return AC(1), variance ratio VR(5), VR(10)
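
A minimal sketch of both statistics, assuming plain sample estimators on a day's 1-minute returns. The function names are hypothetical, and this variance ratio uses overlapping q-bar returns without the bias corrections a production implementation might apply.

```python
import numpy as np

def ac1(returns):
    """First-order autocorrelation of a return series."""
    r = np.asarray(returns, dtype=float)
    return float(np.corrcoef(r[1:], r[:-1])[0, 1])

def variance_ratio(returns, q):
    """Var(q-bar returns) / (q * Var(1-bar returns)), overlapping windows."""
    r = np.asarray(returns, dtype=float)
    rq = np.convolve(r, np.ones(q), mode="valid")  # rolling q-bar sums
    return float(np.var(rq, ddof=1) / (q * np.var(r, ddof=1)))

# Perfectly mean-reverting returns: AC(1) near -1, VR(2) near 0.
alt = np.tile([0.01, -0.01], 50)
# Independent returns: the variance ratio should sit close to 1.
noise = np.random.default_rng(0).normal(0.0, 0.001, size=20_000)

print(ac1(alt), variance_ratio(alt, 2), variance_ratio(noise, 5))
```

A variance ratio below 1 points to mean reversion (for example, bid-ask bounce), while a value above 1 points to positive serial correlation.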

Jump Detection

BNS z-statistic, jump indicators at 1% and 5% significance
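
The sketch below shows one common ratio form of the Barndorff-Nielsen and Shephard jump statistic, with tripower quarticity in the denominator and a one-sided normal critical value. Which variant the library computes is not stated here, so treat `bns_z` and its finite-sample scalings as assumptions.

```python
import math
from statistics import NormalDist
import numpy as np

def bns_z(returns):
    """Ratio-form BNS jump z-statistic for one day of intraday returns."""
    r = np.abs(np.asarray(returns, dtype=float))
    n = r.size
    rv = np.sum(r ** 2)                                        # realized variance
    bv = (np.pi / 2) * (n / (n - 1)) * np.sum(r[1:] * r[:-1])  # bipower variation
    mu43 = 2 ** (2 / 3) * math.gamma(7 / 6) / math.gamma(1 / 2)
    tq = (n * mu43 ** -3 * (n / (n - 2))
          * np.sum((r[2:] * r[1:-1] * r[:-2]) ** (4 / 3)))     # tripower quarticity
    theta = np.pi ** 2 / 4 + np.pi - 5
    return float((1 - bv / rv) / math.sqrt(theta * max(1.0, tq / bv ** 2) / n))

# Synthetic day: Gaussian returns, then the same day with one large jump.
rng = np.random.default_rng(0)
quiet = rng.normal(0.0, 0.001, size=390)
jumpy = quiet.copy()
jumpy[200] += 0.05

z_quiet, z_jump = bns_z(quiet), bns_z(jumpy)
crit_1pct = NormalDist().inv_cdf(0.99)   # one-sided 1% critical value
crit_5pct = NormalDist().inv_cdf(0.95)   # one-sided 5% critical value
print(z_quiet, z_jump, z_jump > crit_1pct)
```

A day is flagged as containing a jump at a given significance level when the statistic exceeds the matching critical value.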


Liquidity

Amihud illiquidity, daily dollar volume, share volume, observed trade count
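
A minimal sketch of a daily Amihud ratio built from 1-minute bars. Two assumptions are made here and may not match the library: the daily return is measured from the first open to the last close, and dollar volume is approximated as the sum of close times volume over the day's bars. Both function names are hypothetical.

```python
import numpy as np

def dollar_volume(close, volume):
    """Approximate daily dollar volume as the sum of close * volume per bar."""
    return float(np.sum(np.asarray(close, float) * np.asarray(volume, float)))

def amihud(open_, close, volume):
    """Absolute daily log return divided by daily dollar volume."""
    open_ = np.asarray(open_, float)
    close = np.asarray(close, float)
    ret = abs(np.log(close[-1] / open_[0]))
    dv = dollar_volume(close, volume)
    # Undefined on days with no traded value.
    return float(ret / dv) if dv > 0 else float("nan")

# Two synthetic bars: price moves from 10.00 to 10.10 on 2,000 shares.
dv = dollar_volume([10.0, 10.1], [1000, 1000])
illiq = amihud([10.0, 10.0], [10.0, 10.1], [1000, 1000])
print(dv, illiq)
```

Higher values mean a larger price move per dollar traded, that is, a less liquid day.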

Data Quality

Gap rate, observed/filled bar counts, longest gap, bars since last trade
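
These quality measures can be approximated from the timestamps alone. The sketch below assumes the gap rate is the share of the 390 regular-session minutes with no observed bar, and the longest gap is the longest run of consecutive missing minutes; `gap_stats` and its output keys are hypothetical.

```python
import pandas as pd

def gap_stats(observed: pd.DatetimeIndex, day: str) -> dict:
    """Count observed and missing minutes on the 09:30 to 15:59 grid."""
    grid = pd.date_range(f"{day} 09:30", f"{day} 15:59", freq="min")
    missing = ~grid.isin(observed)
    # Walk the grid once to find the longest run of missing minutes.
    longest = run = 0
    for is_missing in missing:
        run = run + 1 if is_missing else 0
        longest = max(longest, run)
    return {
        "observed": int((~missing).sum()),
        "filled": int(missing.sum()),
        "gap_rate": float(missing.mean()),
        "longest_gap": longest,
    }

# Synthetic day: a three-minute gap starting 09:32 and a single gap at 10:00.
full = pd.date_range("2024-01-02 09:30", "2024-01-02 15:59", freq="min")
dropped = pd.to_datetime(["2024-01-02 09:32", "2024-01-02 09:33",
                          "2024-01-02 09:34", "2024-01-02 10:00"])
stats = gap_stats(full.difference(dropped), "2024-01-02")
print(stats)
```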

Multiple ways to access the data

Browser Download

Download individual tickers or pre-packaged bundles (S&P 500, Nasdaq 100, by sector). Click and go — no account needed for basic downloads.


REST API

Programmatic access to any ticker, date range, and version. JSON, CSV, or parquet. Free API key with 300 requests/minute. Python, R, and Stata examples provided.
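
Scripts that loop over many tickers may want to stay under the 300 requests/minute limit on the client side. The sketch below is a generic sliding-window throttle, not part of the library or its API; `MinuteThrottle` is a hypothetical name, and the clock and sleep functions are injectable so the demo runs instantly with a fake clock.

```python
import time
from collections import deque

class MinuteThrottle:
    """Block until another call fits within `max_calls` per `period` seconds."""

    def __init__(self, max_calls=300, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls, self.period = max_calls, period
        self.clock, self.sleep = clock, sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def _expire(self, now):
        # Forget calls that have left the sliding window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        now = self.clock()
        self._expire(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._expire(now)
        self.calls.append(now)

# Demo with a fake clock: 301 back-to-back calls force exactly one wait.
fake_now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

throttle = MinuteThrottle(clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(301):
    throttle.wait()   # call this before each API request
print(slept)
```

In real use, construct `MinuteThrottle()` with the defaults and call `wait()` immediately before each request.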


Bulk Download

Full dataset dump — all 1,391 tickers, all versions, all timeframes. Parquet format. Updated weekly.


How this compares

| Feature | HF Data Library | CRSP/TAQ | Yahoo Finance | Polygon.io |
|---|---|---|---|---|
| Price | Free | $25,000+/yr | Free | $199+/mo |
| Frequency | 1-minute bars | Tick-level | Daily only | 1-minute bars |
| Cleaning versions | 3 versions | 1 version | None | None |
| Cleaning documentation | Full pipeline | Minimal | None | None |
| Academic variables | 27 measures | None | None | None |
| Data quality scores | Per-ticker | No | No | No |
| REST API | Free | No | Unofficial | Paid |
| DOI / Citable | Zenodo DOI | No | No | No |
| License | CC BY 4.0 | Restrictive | ToS restricted | Commercial |
| Updated | Weekly (automated) | Quarterly | Daily | Real-time |