Module 2 — Python for Quants

Loading and Cleaning Market Data

Pandas, NumPy, and the data-wrangling muscle memory you'll use every day.

Learning objectives

  • Parse a CSV with proper date handling.
  • Detect and handle missing bars and bad ticks.
  • Split data into train, validation, and test windows.

The boring-but-critical loader

import pandas as pd

# Parse the 'date' column as datetimes and use it as a sorted index.
df = pd.read_csv(
    'AAPL.csv',
    parse_dates=['date'],
    index_col='date',
).sort_index()

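# Reindex to business days, then forward-fill at most 2 missing bars
# so that longer gaps (halts, corporate actions) stay NaN.
df = df.asfreq('B').ffill(limit=2)

# Flag candidate bad ticks: one-day moves larger than 40%.
# fill_method=None stops pct_change from filling the remaining NaNs.
ret = df['close'].pct_change(fill_method=None)
bad = ret.abs() > 0.4
print(df.index[bad])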
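
The loader above detects suspicious bars but does not yet handle them. One possible way to do that is sketched below, on synthetic prices rather than AAPL.csv. The spike rule (a large move immediately reversed by a large move in the opposite direction) is an assumption of this sketch, not a standard definition, and the 0.4 threshold is simply carried over from the loader. The names n_missing, spike, and clean are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic closes (hypothetical values) with one bad tick at position 3.
idx = pd.bdate_range("2024-01-01", periods=10)
prices = [100.0, 101.0, 102.0, 10.2, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0]
df = pd.DataFrame({"close": prices}, index=idx)

# Drop four consecutive bars to simulate a gap longer than the ffill limit.
df = df.drop(idx[5:9])

# Reindex to business days; missing bars show up as NaN.
df = df.asfreq("B")
n_missing = int(df["close"].isna().sum())

# Assumed spike rule: a move above 40% that is followed by a move above 40%
# in the opposite direction. This flags the bad bar but not the recovery bar.
ret = df["close"].pct_change(fill_method=None)
nxt = ret.shift(-1)
spike = (ret.abs() > 0.4) & (nxt.abs() > 0.4) & (np.sign(ret) != np.sign(nxt))

# Treat flagged bars as missing, then forward-fill at most 2 bars.
clean = df["close"].mask(spike).ffill(limit=2)

print("missing bars:", n_missing)
print("bad ticks:", list(df.index[spike]))
print("still NaN after fill:", int(clean.isna().sum()))
```

Masking before filling means a bad tick is replaced by the last good close instead of being propagated forward, and bars beyond the fill limit stay NaN so the gap remains visible downstream.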

Train / validation / test

Never tune parameters on the same window you report performance on. A standard split for daily data:

  • 2010–2018 → train (fit models)
  • 2019–2020 → validation (pick hyperparameters)
  • 2021–today → out-of-sample test (single evaluation)

Using test data for parameter selection is the single most common source of fake backtests.
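
A minimal sketch of that split, assuming a DataFrame with a sorted DatetimeIndex like the one the loader produces. The data here is synthetic, and the names train, valid, and test are illustrative. It relies on pandas partial-string indexing, where a year string in a .loc slice covers the whole year and both endpoints are inclusive.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical values) covering 2010 through 2023.
idx = pd.bdate_range("2010-01-01", "2023-12-29")
df = pd.DataFrame({"close": np.linspace(100.0, 200.0, len(idx))}, index=idx)

# Year strings slice inclusively on a sorted DatetimeIndex.
train = df.loc["2010":"2018"]   # fit models
valid = df.loc["2019":"2020"]   # pick hyperparameters
test = df.loc["2021":]          # single out-of-sample evaluation

# Sanity checks: the windows cover every row once and are in time order.
assert len(train) + len(valid) + len(test) == len(df)
assert train.index.max() < valid.index.min()
assert valid.index.max() < test.index.min()
```

Splitting by date rather than by random sampling keeps every test observation strictly later than anything the model was fitted or tuned on.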