SkillHub

time-series-analysis

v1.0.0

Comprehensive time series data science skill covering feature engineering, model training, and competition-winning strategies for forecasting and prediction problems.

Sourced from ClawHub, Authored by dubnium0

Installation

Please help me install the skill `time-series-analysis` from SkillHub official store. npx skills add dubnium0/time-series-analysis

Time Series Data Science - Complete Guide

Expert time series data scientist specializing in forecasting, sequential prediction, and competition-winning strategies. This skill covers the complete pipeline from EDA to production-ready models.

Core Principles

Key Lessons from Winning Solutions

  1. Feature Engineering > Model Complexity
  2. Focus on 5-10 most predictive features, not all available
  3. Lag, rolling, and EWM features are often more valuable than the raw data
  4. Interaction features between top predictors can be game-changers

  5. Time-Based Validation is Critical

  6. NEVER use random splits for time series
  7. Train on past, validate on future (e.g., ts_index <= threshold)
  8. Leakage from future data will destroy real-world performance

  9. Weights Matter in Scoring

  10. If weights are provided, use them directly in training
  11. High-weight samples disproportionately affect score
  12. Sample weighting in model.fit() is better than custom loss

  13. Multi-Seed Ensemble for Robustness

  14. Train same model with different random seeds
  15. Average predictions reduces variance
  16. Common seeds: 42, 2024, or any fixed set

Feature Engineering Toolkit

1. Lag Features

GROUP_COLS = ['entity_id', 'category', 'horizon']

for lag in [1, 3, 5, 10]:
    df[f'{col}_lag{lag}'] = df.groupby(GROUP_COLS)[col].shift(lag)

2. Rolling Statistics

for window in [5, 10, 20]:
    df[f'{col}_roll_mean{window}'] = df.groupby(GROUP_COLS)[col].transform(
        lambda x: x.rolling(window, min_periods=1).mean()
    )
    df[f'{col}_roll_std{window}'] = df.groupby(GROUP_COLS)[col].transform(
        lambda x: x.rolling(window, min_periods=1).std()
    )

3. Exponential Weighted Mean (EWM)

for span in [5, 10]:
    df[f'{col}_ewm{span}'] = df.groupby(GROUP_COLS)[col].transform(
        lambda x: x.ewm(span=span, adjust=False).mean()
    )

4. Difference Features

df[f'{col}_diff1'] = df.groupby(GROUP_COLS)[col].diff(1)
df[f'{col}_diff_pct'] = df.groupby(GROUP_COLS)[col].pct_change(1)

5. Interaction Features

# Difference between related features
df['feat_diff'] = df['feature_a'] - df['feature_b']

# Ratio between features
df['feat_ratio'] = df['feature_a'] / (df['feature_b'] + 1e-7)

# Product interactions
df['feat_product'] = df['feature_a'] * df['feature_b']

6. Target Encoding (for categories)

# Compute on training data only (ts_index <= threshold)
train_only = df[df.ts_index <= VAL_THRESHOLD]

enc_stats = {
    'category': train_only.groupby('category')['target'].mean().to_dict(),
    'global_mean': train_only['target'].mean()
}

# Apply to all data
df['category_enc'] = df['category'].map(enc_stats['category']).fillna(enc_stats['global_mean'])

7. Temporal Signals

# Cyclical encoding for periodicity
df['t_cycle'] = np.sin(2 * np.pi * df['ts_index'] / period)
df['t_cycle_cos'] = np.cos(2 * np.pi * df['ts_index'] / period)

# Normalized time position
df['ts_normalized'] = df['ts_index'] / df['ts_index'].max()

# Time bins
df['ts_bin'] = pd.cut(df['ts_index'], bins=10, labels=False)

Model Training Patterns

LightGBM Configuration (Competition-Tested)

lgb_cfg = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.015,
    'n_estimators': 4000,
    'num_leaves': 80,
    'min_child_samples': 200,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.7,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 10.0,
    'verbosity': -1
}

Multi-Seed Ensemble Training

val_pred = np.zeros(len(y_val))
test_pred = np.zeros(len(X_test))

for seed in [42, 2024]:
    model = lgb.LGBMRegressor(**lgb_cfg, random_state=seed)

    model.fit(
        X_train, y_train,
        sample_weight=w_train,  # Use weights directly
        eval_set=[(X_val, y_val)],
        eval_sample_weight=[w_val],
        callbacks=[lgb.early_stopping(200, verbose=False)]
    )

    val_pred += model.predict(X_val) / 2
    test_pred += model.predict(X_test) / 2

Horizon-Specific Models

# Train separate model per forecast horizon
for horizon in [1, 3, 10, 25]:
    train_h = df[df.horizon == horizon]
    test_h = test_df[test_df.horizon == horizon]

    # Build features, train model
    model = train_model(train_h, test_h)
    predictions[horizon] = model.predict(test_h)

Validation Strategies

Time-Based Split

VAL_THRESHOLD = int(df['ts_index'].max() * 0.85)

train_mask = df['ts_index'] <= VAL_THRESHOLD
val_mask = df['ts_index'] > VAL_THRESHOLD

X_train = df.loc[train_mask, feature_cols]
X_val = df.loc[val_mask, feature_cols]

Expanding Window Cross-Validation

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(df):
    # Train on expanding window
    pass

Custom Metrics

def weighted_rmse_score(y_true, y_pred, weights):
    """Weighted RMSE skill score (higher is better)"""
    denom = np.sum(weights * y_true**2)
    if denom <= 0:
        return 0.0
    numer = np.sum(weights * (y_true - y_pred)**2)
    ratio = numer / denom
    return float(np.sqrt(1.0 - np.clip(ratio, 0.0, 1.0)))

EDA Checklist

  1. Target Analysis
  2. Distribution by time period
  3. Distribution by category/horizon
  4. Trend and seasonality detection

  5. Missing Values

  6. Pattern analysis (random vs systematic)
  7. Group-based imputation strategy

  8. Weight Distribution

  9. Concentration analysis
  10. Impact on scoring metric

  11. Feature Correlations

  12. Correlation with target
  13. Multicollinearity between features

  14. Temporal Patterns

  15. Stationarity tests
  16. Rolling statistics visualization

Common Pitfalls to Avoid

Pitfall Solution
Random train/test split Use time-based split
Using future data for encoding Compute stats on train only
Ignoring sample weights Use sample_weight in fit()
Too many features Focus on top 5-10 predictors
Single model Multi-seed ensemble
Overfitting validation Large early stopping patience

Competition Workflow

graph TD
    A[Load Data] --> B[Compute Encoding Stats on Train]
    B --> C[Build Features]
    C --> D[Time-Based Split]
    D --> E{For Each Horizon}
    E --> F[Train Multi-Seed Ensemble]
    F --> G[Validate & Score]
    G --> H[Generate Predictions]
    H --> I[Aggregate & Submit]

Quick Reference Commands

# Run complete pipeline
python train_winning.py

# Generate submission
python generate_submission.py

# Validate submission format
python -c "
import pandas as pd
sub = pd.read_csv('submission.csv')
print(f'Rows: {len(sub)}, Cols: {list(sub.columns)}')
print(sub.head())
"

Integration with Other Workflows

  • Use with /data-analyst for comprehensive EDA
  • Use with /data-scientist for advanced feature engineering
  • Use with /fintech-engineer for financial risk analysis
  • Combine predictions with /quant-analyst for portfolio strategies