HF Data Library

Free, research-grade, high-frequency U.S. equity data for academics and researchers. Documented, version-controlled, and updated weekly.

1,391 Tickers
1.53B 1-Minute Bars
23+ Years of Data
27 Academic Variables

What is this?

One-minute OHLCV bars for 1,391 U.S. equities and ETFs, from December 2002 through the present.

Three cleaning versions so you can choose the level of processing appropriate for your research. Twenty-seven pre-computed academic variables per ticker per day. Full methodology documentation.

Updated every week. No subscription. No paywall. Licensed under CC BY 4.0.

# Python — load any ticker in seconds
import pandas as pd

df = pd.read_parquet("AAPL_clean.parquet")
print(df.head())

# datetime              Open    High    Low     Close   Volume
# 2002-12-30 09:30:00   0.98    0.99    0.98    0.98    842900
# 2002-12-30 09:31:00   0.98    0.99    0.98    0.99    521400
# ...

Three cleaning versions. You choose.

Version 1: Raw

Data as received from the source. No outlier removal, no gap-filling. Prices are split/dividend adjusted. 1,533,403,126 bars.

Best for: Market microstructure research, missingness analysis, studying the data itself.

Version 2: Clean

A nine-step cleaning pipeline is applied, including outside-hours removal, removal of bars with non-positive prices, OHLC violations, or duplicate timestamps, and a Brownlees-Gallo outlier filter. Gaps are preserved. 1,533,014,567 bars.

Best for: Volatility estimation, spread measurement, jump detection — most empirical finance.
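As a rough illustration of four of the filters named above (outside-hours removal, non-positive prices, OHLC violations, duplicate bars), here is a minimal pandas sketch. It is not the library's actual pipeline: the helper name `basic_clean`, the 09:30 to 15:59 session window, and the column names (borrowed from the sample output earlier on this page) are assumptions, and the Brownlees-Gallo outlier filter is left out.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: four simple bar-level filters.

    Assumes a DatetimeIndex in Eastern Time and the columns
    Open, High, Low, Close, Volume.
    """
    # Drop duplicate timestamps, keeping the first bar seen.
    out = df[~df.index.duplicated(keep="first")]
    # Keep regular-session bars only (assumed window: 09:30 to 15:59).
    out = out.between_time("09:30", "15:59")
    prices = out[["Open", "High", "Low", "Close"]]
    # Every price field must be strictly positive.
    positive = (prices > 0).all(axis=1)
    # High must be the largest and Low the smallest of the four prices.
    consistent = (out["High"] >= prices.max(axis=1)) & (out["Low"] <= prices.min(axis=1))
    return out[positive & consistent]

# Toy input with one example of each problem.
idx = pd.to_datetime([
    "2024-01-02 09:29:00",  # outside regular hours
    "2024-01-02 09:30:00",
    "2024-01-02 09:30:00",  # duplicate timestamp
    "2024-01-02 09:31:00",  # High below Close
    "2024-01-02 09:32:00",  # non-positive price
    "2024-01-02 09:33:00",
])
bars = pd.DataFrame({
    "Open":   [10.0, 10.0, 10.0, 10.0, 0.0, 10.2],
    "High":   [10.1, 10.1, 10.1, 10.0, 10.1, 10.3],
    "Low":    [9.9, 9.9, 9.9, 9.9, 0.0, 10.1],
    "Close":  [10.0, 10.05, 10.05, 10.2, 10.0, 10.25],
    "Volume": [100, 200, 200, 150, 120, 90],
}, index=idx)
cleaned = basic_clean(bars)
print(cleaned)
```

Because gaps are preserved in this version, the sketch only drops rows and never inserts any.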

Version 3: Filled

Clean data with last-observation-carried-forward (LOCF) gap-filling, producing a regular 390-bar daily grid (09:30–15:59 ET). Every bar is flagged as original or filled. 2,342,519,726 bars.

Best for: Machine learning, backtesting systems, time-series models requiring regular grids.
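The grid construction can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the library's code: the helper name `fill_session` and the `filled` flag column are hypothetical, and the choices to set a filled bar's Open, High, and Low to the carried-forward close and its volume to zero are assumptions.

```python
import pandas as pd

def fill_session(day: pd.DataFrame, date: str) -> pd.DataFrame:
    """Hypothetical helper: put one day's bars on a 390-minute grid with LOCF."""
    # 09:30 through 15:59 inclusive is 390 one-minute bars.
    grid = pd.date_range(f"{date} 09:30", f"{date} 15:59", freq="min")
    out = day.reindex(grid)
    # Flag missing bars before filling so originals stay distinguishable.
    out["filled"] = out["Close"].isna()
    # Carry the last observed close forward.
    out["Close"] = out["Close"].ffill()
    # Assumption: a filled bar is flat at the carried-forward close.
    for col in ["Open", "High", "Low"]:
        out[col] = out[col].fillna(out["Close"])
    # Assumption: no trading happened in a filled bar.
    out["Volume"] = out["Volume"].fillna(0)
    return out

idx = pd.to_datetime(["2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:34"])
day = pd.DataFrame({
    "Open":   [10.0, 10.1, 10.3],
    "High":   [10.1, 10.2, 10.4],
    "Low":    [9.9, 10.0, 10.2],
    "Close":  [10.1, 10.2, 10.3],
    "Volume": [500, 300, 200],
}, index=idx)
filled = fill_session(day, "2024-01-02")
print(filled.head())
```

Bars before the first observed trade of the day have nothing to carry forward, so this sketch leaves them as NaN.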

27 pre-computed academic variables

Computed daily for each ticker in each cleaning version. Ready to use in your research.

Volatility

Realized variance (1-min, 5-min), bipower variation, Parkinson range, Yang-Zhang OHLC
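For readers who want to see what these estimators compute, here is a minimal sketch of three of them using textbook definitions. The function names are illustrative, the library's exact scaling conventions (for example small-sample corrections to bipower variation) are not stated here and may differ, and Yang-Zhang is omitted.

```python
import math

def log_returns(closes):
    """Log returns between consecutive closing prices."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]

def realized_variance(returns):
    """Sum of squared intraday returns."""
    return sum(r * r for r in returns)

def bipower_variation(returns):
    """(pi / 2) times the sum of products of adjacent absolute returns."""
    return (math.pi / 2) * sum(abs(a) * abs(b) for a, b in zip(returns, returns[1:]))

def parkinson_variance(high, low):
    """Parkinson range estimator from one period's high and low."""
    return math.log(high / low) ** 2 / (4 * math.log(2))

# Toy series of 1-minute closes; every fifth bar gives a 5-minute series.
closes = [100.0, 100.1, 100.05, 100.2, 100.15, 100.3,
          100.25, 100.4, 100.35, 100.5, 100.45]
rv_1min = realized_variance(log_returns(closes))
rv_5min = realized_variance(log_returns(closes[::5]))
bv_1min = bipower_variation(log_returns(closes))
print(rv_1min, rv_5min, bv_1min, parkinson_variance(max(closes), min(closes)))
```

Sampling every fifth close is the simplest way to form 5-minute returns; it assumes the input has no gaps, which holds for the Filled version but not for Raw or Clean.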

Spreads

Roll (1984) implied spread, Corwin-Schultz (2012) high-low spread
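The Roll measure infers the spread from the negative serial covariance of price changes that bid-ask bounce induces. A minimal sketch follows, using the standard formula 2 * sqrt(-cov); the function name and the choice to return NaN when the covariance is non-negative are assumptions, and the Corwin-Schultz estimator is not shown.

```python
import math

def roll_spread(prices):
    """Roll (1984) implied spread: 2 * sqrt(-cov(dp_t, dp_{t-1}))."""
    dp = [b - a for a, b in zip(prices, prices[1:])]
    x, y = dp[1:], dp[:-1]  # each change paired with the one before it
    n = len(x)
    if n < 2:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Assumption: report NaN when the serial covariance is not negative,
    # since the square root is then undefined.
    return 2 * math.sqrt(-cov) if cov < 0 else float("nan")

# Trades alternating between a 99.95 bid and a 100.05 ask.
bounce = [100.05, 99.95, 100.05, 99.95, 100.05, 99.95]
print(roll_spread(bounce))
```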

Autocorrelation

First-order return AC(1), variance ratio VR(5), VR(10)
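Both statistics can be sketched from a list of returns. This version uses plain sample moments and overlapping q-period sums, without the small-sample bias corrections found in some variance-ratio tests; the function names are illustrative and the library's exact estimator is not stated here.

```python
def autocorr1(returns):
    """First-order sample autocorrelation of a return series."""
    n = len(returns)
    mean = sum(returns) / n
    den = sum((r - mean) ** 2 for r in returns)
    num = sum((a - mean) * (b - mean) for a, b in zip(returns[1:], returns))
    return num / den

def variance_ratio(returns, q):
    """Variance of overlapping q-period returns over q times the 1-period variance."""
    n = len(returns)
    mean = sum(returns) / n
    var1 = sum((r - mean) ** 2 for r in returns) / n
    # Overlapping q-period returns are rolling sums of q one-period returns.
    rq = [sum(returns[i:i + q]) for i in range(n - q + 1)]
    varq = sum((x - q * mean) ** 2 for x in rq) / len(rq)
    return varq / (q * var1)

# Perfectly mean-reverting toy series: strong negative AC(1), VR(2) of zero.
zigzag = [0.01, -0.01] * 10
print(autocorr1(zigzag), variance_ratio(zigzag, 2))
```

A variance ratio near 1 is consistent with uncorrelated returns; values below 1 point to mean reversion and values above 1 to momentum.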

Jump Detection

BNS z-statistic, jump indicators at 1% and 5% significance
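The page does not say which form of the Barndorff-Nielsen and Shephard (BNS) test is used. As one possibility, the sketch below implements the ratio statistic with tri-power quarticity, as in Huang and Tauchen (2005), and flags a jump when z exceeds the one-sided normal critical value. The function names and the one-sided convention are assumptions.

```python
import math

def bns_z(returns):
    """Ratio-form BNS jump statistic with tri-power quarticity (needs n >= 3)."""
    n = len(returns)
    a = [abs(r) for r in returns]
    rv = sum(r * r for r in returns)
    # Bipower variation with the n / (n - 1) small-sample scaling.
    bv = (math.pi / 2) * (n / (n - 1)) * sum(x * y for x, y in zip(a, a[1:]))
    # mu_{4/3} = E|Z|^{4/3} for a standard normal Z.
    mu43 = 2 ** (2 / 3) * math.gamma(7 / 6) / math.gamma(0.5)
    tq = n * mu43 ** -3 * (n / (n - 2)) * sum(
        (x * y * z) ** (4 / 3) for x, y, z in zip(a, a[1:], a[2:])
    )
    theta = math.pi ** 2 / 4 + math.pi - 5
    return ((rv - bv) / rv) / math.sqrt(theta / n * max(1.0, tq / bv ** 2))

def jump_flags(z):
    """Assumed convention: one-sided test against standard normal quantiles."""
    return {"jump_5pct": z > 1.645, "jump_1pct": z > 2.326}

# Toy day: small returns with a single large move in the middle.
quiet = [0.001, -0.001] * 50
jumpy = list(quiet)
jumpy[50] = 0.05
print(bns_z(quiet), bns_z(jumpy), jump_flags(bns_z(jumpy)))
```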

Liquidity

Amihud illiquidity, daily dollar volume, share volume, observed trade count
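Amihud illiquidity is the absolute return per dollar traded. A minimal sketch, assuming daily dollar volume is approximated by summing close times volume over the day's bars (the library's exact construction is not stated here, and the function names are illustrative):

```python
def dollar_volume(closes, volumes):
    """Approximate dollar volume as the sum of close * volume over bars."""
    return sum(c * v for c, v in zip(closes, volumes))

def amihud_illiquidity(daily_return, dollar_vol):
    """Absolute daily return divided by daily dollar volume."""
    # Assumption: report NaN for days with no dollar volume.
    return abs(daily_return) / dollar_vol if dollar_vol > 0 else float("nan")

dv = dollar_volume([10.0, 10.5], [100, 200])
print(dv, amihud_illiquidity(-0.02, dv))
```

The raw ratio is tiny for liquid names, so it is often rescaled (for example by 10^6) before use.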

Data Quality

Gap rate, observed/filled bar counts, longest gap, bars since last trade
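Several of these measures follow directly from the original-or-filled flag carried by the Filled version. The sketch below assumes a day is represented as a list of booleans (True for a filled bar) and that gap rate means filled bars divided by total grid bars; both the representation and that definition are assumptions, and the function name is illustrative.

```python
from itertools import groupby

def quality_metrics(filled_flags):
    """Counts, gap rate, and longest run of consecutive filled bars."""
    n = len(filled_flags)
    n_filled = sum(filled_flags)
    # Longest gap is the longest run of consecutive True flags.
    longest = max(
        (sum(1 for _ in run) for is_filled, run in groupby(filled_flags) if is_filled),
        default=0,
    )
    return {
        "observed": n - n_filled,
        "filled": n_filled,
        "gap_rate": n_filled / n,  # assumed definition: filled / total grid bars
        "longest_gap": longest,
    }

flags = [False, True, True, False, True, False]
print(quality_metrics(flags))
```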

Multiple ways to access the data

Browser Download

Download individual tickers or pre-packaged bundles (S&P 500, Nasdaq 100, by sector). Click and go — no account needed for basic downloads.

Browse Downloads

REST API

Programmatic access to any ticker, date range, and version. Responses in JSON, CSV, or Parquet. Free API key with a limit of 300 requests/minute. Python, R, and Stata examples provided.

API Docs

Bulk Download

Full dataset dump — all 1,391 tickers, all versions, all timeframes. Parquet format. Updated weekly.

Full Dataset

How this compares

| Feature | HF Data Library | CRSP/TAQ | Yahoo Finance | Polygon.io |
|---|---|---|---|---|
| Price | Free | $25,000+/yr | Free | $199+/mo |
| Frequency | 1-minute bars | Tick-level | Daily only | 1-minute bars |
| Cleaning versions | 3 versions | 1 version | None | None |
| Cleaning documentation | Full pipeline | Minimal | None | None |
| Academic variables | 27 measures | None | None | None |
| Data quality scores | Per-ticker | No | No | No |
| REST API | Free | No | Unofficial | Paid |
| DOI / Citable | Zenodo DOI | No | No | No |
| License | CC BY 4.0 | Restrictive | ToS restricted | Commercial |
| Updated | Weekly (automated) | Quarterly | Daily | Real-time |