Three cleaning levels. Choose the one that matches your research design.
Every ticker is built from the same underlying trade data, processed to different levels. Raw and Clean are distributed; a gap-filled version is not distributed, but can be derived from Clean.
Raw: data as received from the source. No outlier removal, no gap-filling. The only modification is a splice-boundary adjustment at the PiTrading/IEX transition (March 2022). Prices are split- and dividend-adjusted by the source.
Clean: the nine-step cleaning pipeline is applied. It removes outside-hours bars, OHLC violations, and duplicates, and applies the Brownlees-Gallo outlier filter. Gaps are preserved: if no trade occurred, there is no bar.
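As a rough illustration of three of the steps named above (outside-hours, OHLC violations, duplicates), here is a minimal pandas sketch. It is not the library's actual pipeline. The function name `basic_bar_filters`, the `open`/`high`/`low`/`close` column names, the 09:30 to 15:59 session window, and the definition of an OHLC violation are all assumptions made for illustration. The Brownlees-Gallo filter and the remaining steps are not shown.

```python
import pandas as pd

# Assumed regular session: bars stamped 09:30 through 15:59 exchange-local time.
SESSION_OPEN_MIN = 9 * 60 + 30   # 09:30 as minutes since midnight
SESSION_LAST_MIN = 15 * 60 + 59  # 15:59 as minutes since midnight


def basic_bar_filters(bars: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop outside-hours bars, OHLC violations, duplicates."""
    minute_of_day = bars["datetime"].dt.hour * 60 + bars["datetime"].dt.minute
    in_session = minute_of_day.between(SESSION_OPEN_MIN, SESSION_LAST_MIN)

    # Assumed definition of a valid bar: high is at or above open and close,
    # low is at or below open and close (which also implies high >= low).
    body_high = bars[["open", "close"]].max(axis=1)
    body_low = bars[["open", "close"]].min(axis=1)
    ohlc_ok = (bars["high"] >= body_high) & (bars["low"] <= body_low)

    kept = bars[in_session & ohlc_ok]
    # Keep the first bar for any repeated timestamp.
    kept = kept.drop_duplicates(subset="datetime", keep="first")
    return kept.sort_values("datetime").reset_index(drop=True)
```

No gap-filling happens here: minutes with no bar in the input stay absent from the output.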
Some researchers use LOCF (Last Observation Carried Forward) gap-filling to produce a regular 390-bar daily grid. This library does not distribute a gap-filled version.
The accompanying paper shows that LOCF introduces systematic biases: it suppresses realized volatility, compresses bid-ask spreads, inflates return autocorrelation, and reduces jump detection power. These effects scale with the gap rate and are most severe for less liquid tickers.
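To judge how exposed a ticker is, it can help to measure its gap rate before deciding whether to fill. The sketch below assumes "gap rate" means the share of the 390 regular-session minutes with no bar, that the input is the Clean version (session-only, no duplicate timestamps), and that every day is a full session. The paper may define the quantity differently, and `daily_gap_rate` is a hypothetical helper name.

```python
import pandas as pd

BARS_PER_SESSION = 390  # one-minute bars in a full regular session


def daily_gap_rate(bars: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: fraction of session minutes with no bar, per day."""
    days = bars["datetime"].dt.normalize()
    # Count distinct timestamps observed on each trading day.
    bars_per_day = bars["datetime"].groupby(days).nunique()
    return 1.0 - bars_per_day / BARS_PER_SESSION
```

Days with no bars at all do not appear in the result, and shortened sessions would be overstated as gaps under the full-session assumption.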
Researchers who need a regular grid can apply LOCF to the Clean version in one line of pandas:
df = df.set_index('datetime').resample('1min').ffill()
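Note that this call fills every minute between the first and last bar, including overnight and weekend minutes, so restrict the result to session hours to obtain the 390-bar daily grid. We recommend documenting this step and its implications in any paper that uses filled data.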
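For a grid limited to the regular session, one possible approach is to reindex each day onto its own 390-minute grid and fill within the day only. This is a sketch under stated assumptions, not a distributed tool. It assumes timestamps are exchange-local and mark the bar start (09:30 through 15:59), that columns are named `open`, `high`, `low`, `close`, `volume`, and that timestamps are unique, as in the Clean version. `locf_session_grid` and the `filled` flag are hypothetical names.

```python
import pandas as pd


def locf_session_grid(bars: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: LOCF onto a 390-bar grid for each trading day."""
    bars = bars.set_index("datetime").sort_index()
    filled_days = []
    for day, day_bars in bars.groupby(bars.index.normalize()):
        # 390 one-minute stamps starting at 09:30 on this day.
        grid = pd.date_range(
            day + pd.Timedelta(hours=9, minutes=30), periods=390, freq="1min"
        )
        out = day_bars.reindex(grid)
        out["filled"] = out["close"].isna()  # marks synthetic bars
        # Carry the last close forward within the day only.
        out["close"] = out["close"].ffill()
        # A synthetic bar is flat at the carried close and has no volume.
        for col in ("open", "high", "low"):
            out[col] = out[col].fillna(out["close"])
        out["volume"] = out["volume"].fillna(0)
        filled_days.append(out)
    return pd.concat(filled_days).rename_axis("datetime")
```

Two design choices differ from a plain forward fill: volume is set to zero on synthetic bars instead of being copied from the previous bar, and minutes before the first trade of a day are left as NaN, with no carry-over from the prior close. The `filled` column makes it straightforward to report how many bars are synthetic.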
| Property | Raw | Clean |
|---|---|---|
| Total bars | 1,533,403,126 | 1,533,014,567 |
| Bars removed | 0 | 388,559 |
| Gaps preserved? | Yes | Yes |
| Outliers removed? | No | Yes |
| Ready for most research? | No | Yes |