Data Versions

Two cleaning levels. Choose the one that matches your research design.

Every ticker is available in two versions: the same underlying trade data, processed to different levels. Choose the one that is appropriate for your research question.

Version 1

Raw

Data as received from the source, with no outlier removal and no gap-filling. The only modification is a splice-boundary adjustment at the PiTrading/IEX transition (March 2022). Prices are split- and dividend-adjusted by the source.

What's in it

  • One-minute OHLCV bars where trades occurred
  • Gaps where no trades happened (no bars for those minutes)
  • Possible duplicates, outside-hours bars, OHLC violations
  • 1,533,403,126 total bars (about 1.53 billion)
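
The issues listed above can be counted before deciding how to handle them. The following is a minimal sketch, not the library's own audit code. It assumes one ticker per DataFrame, the column names `datetime`, `open`, `high`, `low`, `close` (only `datetime` appears on this page; the others are assumptions), timestamps in exchange local time labelled by bar start, and a regular session of 09:30 through 15:59. The function name `audit_raw_bars` is hypothetical. Weekends and holidays are not checked.

```python
import pandas as pd

def audit_raw_bars(df: pd.DataFrame) -> dict:
    """Count duplicates, outside-hours bars, and OHLC violations in raw bars."""
    ts = pd.to_datetime(df["datetime"])

    # Duplicate timestamps: every repeat after the first occurrence.
    duplicates = int(ts.duplicated().sum())

    # Assumed regular session: 09:30 (inclusive) to 16:00 (exclusive).
    minute_of_day = ts.dt.hour * 60 + ts.dt.minute
    outside = int(((minute_of_day < 9 * 60 + 30) | (minute_of_day >= 16 * 60)).sum())

    # OHLC consistency: high must be the bar's maximum, low its minimum.
    high_too_low = df["high"] < df[["open", "close", "low"]].max(axis=1)
    low_too_high = df["low"] > df[["open", "close", "high"]].min(axis=1)
    ohlc_violations = int((high_too_low | low_too_high).sum())

    return {
        "duplicates": duplicates,
        "outside_hours": outside,
        "ohlc_violations": ohlc_violations,
    }
```

Running this on Raw and again on Clean gives a quick check that the Clean version contains none of the three issue types.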

Best for

  • Market microstructure research
  • Studying data quality and missingness patterns
  • Comparing with your own cleaning pipeline
  • Auditing the cleaning steps applied in Version 2

Version 2

Clean

A nine-step cleaning pipeline is applied. It includes removal of outside-hours bars, OHLC violations, and duplicates, followed by a Brownlees-Gallo outlier filter. Gaps are preserved: if no trade occurred, there is no bar.
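
For readers unfamiliar with the Brownlees-Gallo rule, the sketch below illustrates the idea: an observation is kept if it lies within three trimmed standard deviations, plus a granularity term, of the trimmed mean of its neighbours. This is an illustration only, not the pipeline's implementation. The parameter values `k`, `delta`, and `gamma` are placeholders (this page does not state the ones used), the function name is hypothetical, and the sketch assumes it is applied to one ticker's prices within a single trading day, with the observation itself excluded from its neighbourhood.

```python
import numpy as np

def brownlees_gallo_mask(prices, k=60, delta=0.10, gamma=0.02):
    """Return a boolean array that is True where the observation is kept.

    k     : neighbourhood size (placeholder value)
    delta : total trimming fraction, split between both tails (placeholder)
    gamma : granularity parameter in price units (placeholder)
    """
    p = np.asarray(prices, dtype=float)
    n = len(p)
    keep = np.ones(n, dtype=bool)
    half = k // 2
    for i in range(n):
        # Window of up to k neighbours around i, shifted inward at the edges.
        lo = max(0, min(i - half, n - k - 1))
        hi = min(n, lo + k + 1)
        window = np.delete(p[lo:hi], i - lo)  # exclude observation i itself
        if window.size < 3:
            continue  # too few neighbours to judge; keep the observation
        window.sort()
        cut = int(window.size * delta / 2)
        trimmed = window[cut:window.size - cut]
        mean = trimmed.mean()
        std = trimmed.std(ddof=1)
        # Keep if |p_i - trimmed mean| < 3 * trimmed std + gamma.
        keep[i] = abs(p[i] - mean) < 3 * std + gamma
    return keep
```

The trimming keeps a single spike from inflating the mean and standard deviation of its neighbours' windows, and `gamma` prevents rejections when prices in the window are nearly constant.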

What's in it

  • One-minute OHLCV bars, cleaned
  • 388,559 bars removed (0.025% of total)
  • Gaps preserved as-is (irregular time series)
  • 1,533,014,567 total bars (about 1.53 billion)

Best for

  • Volatility estimation (realized variance, bipower variation)
  • Spread measurement (Roll, Corwin-Schultz)
  • Jump detection (BNS test)
  • Most empirical finance research

Recommended default. Unless your research specifically requires raw data, start here.
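
As an example of working directly with the irregular Clean series, the sketch below computes daily realized variance (the sum of squared intraday log returns) and bipower variation ((π/2) times the sum of products of adjacent absolute returns) from close prices. It assumes the column names `datetime` and `close` and one ticker per DataFrame; the function name `daily_rv_bv` is hypothetical. Returns are taken between consecutive observed bars, so a return can span more than one minute where a gap exists, and overnight returns are excluded.

```python
import numpy as np
import pandas as pd

def daily_rv_bv(df: pd.DataFrame) -> pd.DataFrame:
    """Per-day realized variance (rv) and bipower variation (bv)."""
    close = df.set_index(pd.to_datetime(df["datetime"]))["close"].sort_index()
    log_p = np.log(close)

    # Log returns between consecutive observed bars within the same day.
    # The first bar of each day has no intraday predecessor and is dropped.
    r = log_p.groupby(log_p.index.normalize()).diff().dropna()
    day = r.index.normalize()

    # Realized variance: sum of squared intraday returns.
    rv = (r ** 2).groupby(day).sum()

    # Bipower variation: (pi / 2) * sum of |r_i| * |r_{i-1}| within the day.
    abs_r = r.abs()
    lagged = abs_r.groupby(day).shift(1)
    bv = (np.pi / 2) * (abs_r * lagged).groupby(day).sum()

    return pd.DataFrame({"rv": rv, "bv": bv})
```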

A note on gap-filled data

Some researchers use LOCF (Last Observation Carried Forward) gap-filling to produce a regular 390-bar daily grid. This library does not distribute a gap-filled version.

The accompanying paper shows that LOCF introduces systematic biases: it suppresses realized volatility, compresses bid-ask spreads, inflates return autocorrelation, and reduces jump detection power. These effects scale with the gap rate and are most severe for less liquid tickers.

Researchers who need a regular grid can apply LOCF to the Clean version in one line of pandas:

df = df.set_index('datetime').resample('1min').ffill()

We recommend documenting this step and its implications in any paper that uses filled data.
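
Note that the one-liner resamples the whole time range, so it also creates bars for overnight and weekend minutes and carries every column forward, including volume. If a strict 390-bar daily grid is wanted, a session-aware variant along the following lines may be closer to the intent. This is a sketch under assumptions, not a distributed tool: the column names `open`, `high`, `low`, `close`, `volume` and the 09:30 to 15:59 session are assumed, and `locf_regular_grid` is a hypothetical name. Filled bars take the last observed close for all four prices and zero volume, and a `filled` flag records which bars were synthesized.

```python
import pandas as pd

def locf_regular_grid(df: pd.DataFrame) -> pd.DataFrame:
    """Reindex each trading day to a 390-minute grid and fill gaps by LOCF."""
    bars = (
        df.assign(datetime=pd.to_datetime(df["datetime"]))
        .set_index("datetime")
        .sort_index()
    )
    filled_days = []
    for day, day_bars in bars.groupby(bars.index.normalize()):
        # Assumed regular session: 390 one-minute bars starting at 09:30.
        grid = pd.date_range(day + pd.Timedelta(hours=9, minutes=30),
                             periods=390, freq="1min")
        out = day_bars.reindex(grid)
        was_filled = out["close"].isna()

        # Carry the last observed close forward within the day only.
        out["close"] = out["close"].ffill()
        # A filled bar has no trades: flat prices at the carried close.
        for col in ["open", "high", "low"]:
            out[col] = out[col].fillna(out["close"])
        out["volume"] = out["volume"].fillna(0)
        out["filled"] = was_filled
        filled_days.append(out)
    return pd.concat(filled_days).rename_axis("datetime")
```

Minutes before a day's first trade are left as missing prices because nothing is carried across the overnight break; whether to fill them from the previous close is a separate design choice that should also be documented.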

Side-by-side comparison

Property                  Raw            Clean
Total bars                1,533,403,126  1,533,014,567
Bars removed              0              388,559
Gaps preserved?           Yes            Yes
Outliers removed?         No             Yes
Ready for most research?  No             Yes