Data Versions

Three cleaning levels. Choose the one that matches your research design.

Every ticker is available in three versions, all built from the same underlying trade data but processed to different levels.

Version 1

Raw

Data as received from the source. No outlier removal, no gap-filling. The only modification is splice-boundary adjustment at the PiTrading/IEX transition (March 2022). Prices are split- and dividend-adjusted by the source.

What's in it

  • One-minute OHLCV bars where trades occurred
  • Gaps where no trades happened (no bars for those minutes)
  • Possible duplicates, outside-hours bars, OHLC violations
  • 1,533,403,126 total bars
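The issues listed above can be counted before deciding whether Raw suits your work. The following is a minimal sketch, not the dataset's own tooling. The column names (timestamp, open, high, low, close) are assumptions about the file layout, and the 09:30 to 15:59 ET session window is taken from the grid described for the Filled version.

```python
import pandas as pd

def flag_raw_issues(bars: pd.DataFrame) -> pd.DataFrame:
    """Flag duplicates, outside-hours bars, and OHLC violations.

    Assumes `bars` has a datetime column `timestamp` (Eastern Time) and
    numeric columns open/high/low/close. These names are hypothetical.
    """
    ts = bars["timestamp"]
    minute_of_day = ts.dt.hour * 60 + ts.dt.minute

    flags = pd.DataFrame(index=bars.index)
    # A repeated timestamp (after its first occurrence) counts as a duplicate.
    flags["duplicate"] = ts.duplicated(keep="first")
    # Regular session assumed to run 09:30 through 15:59 inclusive.
    flags["outside_hours"] = (minute_of_day < 9 * 60 + 30) | (minute_of_day > 15 * 60 + 59)
    # High must be the largest and low the smallest of the four prices.
    flags["ohlc_violation"] = (
        (bars["high"] < bars[["open", "close", "low"]].max(axis=1))
        | (bars["low"] > bars[["open", "close", "high"]].min(axis=1))
    )
    return flags
```

Summing each column of the result gives a per-ticker count. The sketch does not check weekends or market holidays.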

Best for

  • Market microstructure research
  • Studying data quality and missingness patterns
  • Comparing with your own cleaning pipeline
  • Auditing the cleaning steps applied in Version 2

Version 2

Clean

A nine-step cleaning pipeline is applied. Its steps include removing outside-hours bars, OHLC violations, and duplicates, and applying the Brownlees-Gallo outlier filter. Gaps are preserved: if no trade occurred, there is no bar.
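For readers unfamiliar with the Brownlees-Gallo rule, here is a minimal sketch of the idea. A price is kept when it lies within three trimmed standard deviations, plus a granularity term, of the trimmed mean of its neighbours. This is not the pipeline's code. The window size k, trimming fraction delta, and granularity gamma are illustrative assumptions, since the parameters used for this dataset are not stated here.

```python
import numpy as np

def brownlees_gallo_mask(prices, k=20, delta=0.1, gamma=0.02):
    """Return a boolean array: True where the price passes the filter.

    k, delta and gamma are hypothetical defaults, not the dataset's settings.
    """
    p = np.asarray(prices, dtype=float)
    n = len(p)
    keep = np.ones(n, dtype=bool)
    half = k // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Neighbourhood around i, excluding observation i itself.
        neigh = np.sort(np.concatenate([p[lo:i], p[i + 1:hi]]))
        if neigh.size < 3:
            continue  # too few neighbours to judge; keep the bar
        # Trim a fraction delta/2 from each tail before computing moments.
        cut = int(np.floor(delta * neigh.size / 2))
        trimmed = neigh[cut:neigh.size - cut] if cut > 0 else neigh
        mean, std = trimmed.mean(), trimmed.std(ddof=1)
        keep[i] = abs(p[i] - mean) < 3 * std + gamma
    return keep
```

The granularity term gamma prevents a run of identical prices (zero variance) from rejecting every small tick move.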

What's in it

  • One-minute OHLCV bars, cleaned
  • 388,559 bars removed (0.025% of total)
  • Gaps preserved as-is (irregular time series)
  • 1,533,014,567 total bars

Best for

  • Volatility estimation (realized variance, bipower variation)
  • Spread measurement (Roll, Corwin-Schultz)
  • Jump detection (BNS test)
  • Most empirical finance research

Recommended default. Unless your research specifically requires filled data or raw data, start here.
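As an illustration of the volatility use case, the sketch below computes realized variance and bipower variation from one day's close prices using the standard textbook definitions. It assumes the closes are passed in time order, and it applies no finite-sample scaling to bipower variation. Because the Clean version keeps gaps, consecutive returns may span more than one minute.

```python
import numpy as np

def realized_variance(close):
    """Sum of squared log returns between consecutive observed bars."""
    r = np.diff(np.log(np.asarray(close, dtype=float)))
    return float(np.sum(r ** 2))

def bipower_variation(close):
    """(pi/2) times the sum of products of adjacent absolute log returns."""
    r = np.abs(np.diff(np.log(np.asarray(close, dtype=float))))
    return float((np.pi / 2) * np.sum(r[1:] * r[:-1]))
```

The difference between the two is the usual starting point for jump tests such as BNS.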

Version 3

Filled

Clean data with LOCF (Last Observation Carried Forward) gap-filling. Every trading day has exactly 390 bars (09:30–15:59 ET). Missing bars are filled with the last observed close price; volume is set to zero. Every filled bar is flagged.

What's in it

  • Regular 390-bar daily grid
  • Filled bars flagged with is_filled = True
  • Volume = 0 for filled bars
  • 2,342,519,726 total bars
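The fill described above can be sketched for a single trading day as follows. This is an illustration under stated assumptions rather than the code that produced the dataset. It assumes the bars are indexed by timezone-naive Eastern Time timestamps with columns open, high, low, close, and volume, and that open, high, and low of a filled bar are all set to the carried-forward close. The function name is hypothetical.

```python
import pandas as pd

def fill_day_locf(bars: pd.DataFrame, day: str) -> pd.DataFrame:
    """Reindex one day's bars to a 390-bar grid and fill gaps by LOCF."""
    # 390 one-minute stamps: 09:30 through 15:59.
    grid = pd.date_range(f"{day} 09:30", periods=390, freq="min")
    out = bars.reindex(grid)
    # A bar is "filled" when its minute was absent from the input.
    out["is_filled"] = ~out.index.isin(bars.index)
    # Carry the last observed close forward into the gaps.
    out["close"] = out["close"].ffill()
    # Assumption: a filled bar is flat at the carried-forward close.
    for col in ("open", "high", "low"):
        out[col] = out[col].fillna(out["close"])
    out["volume"] = out["volume"].fillna(0)
    # Gaps before the first trade of the day stay NaN in this sketch;
    # how the dataset handles them is not stated here.
    return out
```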

Best for

  • Machine learning models requiring regular grids
  • Backtesting systems
  • Time-series models (GARCH, HAR, etc.)
  • Any analysis requiring equally-spaced observations

Known biases. LOCF introduces stale prices that suppress realized volatility, compress spreads, inflate autocorrelation, and reduce jump detection power. These effects are systematic and scale with gap rate. The accompanying paper quantifies these biases across all 1,391 tickers. Use this version with awareness of these effects.
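Since the biases scale with gap rate, it can help to measure that rate for your own sample before using filled data. The sketch below relies only on the is_filled flag. The function names are hypothetical, and dropping filled rows is assumed to give back the observed bars of the Clean version.

```python
import pandas as pd

def gap_rate(filled: pd.DataFrame) -> float:
    """Share of bars in a Filled-version frame that were created by LOCF."""
    return float(filled["is_filled"].mean())

def drop_filled(filled: pd.DataFrame) -> pd.DataFrame:
    """Keep only bars where a trade was actually observed."""
    return filled.loc[~filled["is_filled"]]
```

A ticker or period with a high gap rate is where stale-price effects will be strongest, so results there are worth re-checking against the Clean version.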

Side-by-side comparison

Property            Raw             Clean           Filled
Total bars          1,533,403,126   1,533,014,567   2,342,519,726
Bars removed        0               388,559         388,559
Bars filled         0               0               809,505,159
Regular grid?       No              No              Yes (390/day)
Gaps preserved?     Yes             Yes             No (filled)
Outliers removed?   No              Yes             Yes
Filled flag?        N/A             N/A             Yes
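The bar counts in the comparison are related by simple arithmetic, which the short check below reproduces using only the figures quoted on this page. It also shows that roughly a third of all bars in the Filled version are LOCF-filled.

```python
raw_bars = 1_533_403_126
removed_bars = 388_559
filled_bars = 809_505_159

# Clean = Raw minus removed bars; Filled = Clean plus LOCF-filled bars.
clean_bars = raw_bars - removed_bars
filled_total = clean_bars + filled_bars

removed_pct = 100 * removed_bars / raw_bars      # share of Raw removed by cleaning
filled_pct = 100 * filled_bars / filled_total    # share of Filled that is LOCF

print(f"Clean total:  {clean_bars:,}")
print(f"Filled total: {filled_total:,}")
print(f"Removed: {removed_pct:.3f}% of raw bars")
print(f"LOCF-filled: {filled_pct:.1f}% of filled-version bars")
```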