A binary classification model predicting whether a company will go bankrupt, using XGBoost with custom preprocessing on 64 financial attributes. Placed 3rd out of 53 teams in a graduate-level ML competition.
Predict corporate bankruptcy from financial statement data
Given 64 anonymized financial attributes (Attr1–Attr64) for 10,000 companies, predict which will go bankrupt (class = 1) vs. survive (class = 0). An additional 8,000 unlabeled companies form the test set.
Bankruptcy prediction is critical for credit risk assessment, lending decisions, and investment analysis. Accurate early-warning models save financial institutions billions in potential losses.
A three-stage pipeline: Winsorization to handle extreme outliers, KNN Imputation to fill missing values, and a tuned XGBoost classifier evaluated with 5-fold stratified CV.
Anonymized financial attributes from corporate balance sheets
Two critical steps before modeling: outlier control and imputation
Financial data often contains extreme outliers — a single company reporting wildly different ratios can distort the entire model. Winsorization clips values to the 0.5th and 99.5th percentiles, preserving signal while removing noise at the tails.
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip each feature to its [lower, upper] percentile range."""

    def __init__(self, lower=0.5, upper=99.5):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.nanpercentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanpercentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)
```
Why not just drop outliers? Dropping rows loses data; simple clipping at fixed values is arbitrary. Percentile-based winsorization adapts to each feature’s distribution while keeping the row count intact.
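A minimal illustration of the same percentile clipping the Winsorizer applies per feature, on a synthetic ratio column (the toy values here are made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy ratio column: 995 ordinary values plus 5 extreme outliers
x = np.concatenate([rng.normal(1.0, 0.2, 995),
                    [50.0, -40.0, 80.0, -60.0, 100.0]])

# Clip to the 0.5th / 99.5th percentiles, exactly as the Winsorizer does
lo, hi = np.nanpercentile(x, [0.5, 99.5])
x_win = np.clip(x, lo, hi)

print(x.min(), x.max())              # -60.0 100.0
print(x_win.min(), x_win.max())      # both pulled back near the bulk of the distribution
```

All 1,000 rows survive; only the five extreme values are pulled in to the percentile bounds.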
Many financial attributes have missing values — companies may not report every ratio. KNN Imputation (k=5) fills gaps by averaging values from the 5 most similar companies, preserving inter-feature relationships better than mean/median imputation.
```python
from sklearn.impute import KNNImputer

# Fit on winsorized training data only (no data leakage)
knn_imputer = KNNImputer(n_neighbors=5, weights="uniform")
knn_imputer.fit(X_full_win)

# Transform both train and test
X_full_imp = knn_imputer.transform(X_full_win)
X_test_imp = knn_imputer.transform(X_test_win)

# Result: zero NaNs remaining
print(np.isnan(X_full_imp).any())  # False
print(np.isnan(X_test_imp).any())  # False
```
Why KNN over mean/median? Financial ratios are correlated. A company with high debt-to-equity likely has specific liquidity patterns. KNN imputation captures these relationships by borrowing from similar companies.
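This difference is easy to see on synthetic data. The sketch below (toy values, not the competition data) builds two correlated "ratios" and hides one entry: KNN imputation borrows from companies with a similar first ratio, while mean imputation ignores the correlation entirely:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 200)
b = 0.9 * a + rng.normal(0.0, 0.1, 200)   # b tracks a closely
X = np.column_stack([a, b])
X[0] = [3.0, 2.7]                          # an atypical company far from the mean

true_val = X[0, 1]
X_missing = X.copy()
X_missing[0, 1] = np.nan                   # hide its second ratio

knn_est = KNNImputer(n_neighbors=5).fit_transform(X_missing)[0, 1]
mean_est = SimpleImputer(strategy="mean").fit_transform(X_missing)[0, 1]

# KNN lands near the true value; the column mean lands near 0
print(abs(knn_est - true_val), abs(mean_est - true_val))
```

Because the company's first ratio is far above average, its neighbors share that trait, and their second ratios are far better guesses than the global mean.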
Gradient-boosted trees with careful hyperparameter tuning
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.02,
    subsample=0.9,
    colsample_bytree=0.8,
    reg_lambda=1.0,   # L2 regularization
    reg_alpha=0.1,    # L1 regularization
    tree_method="exact",
    n_jobs=-1,
)

# 5-fold stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb, X, y, cv=skf, scoring="roc_auc")
print(f"Mean CV AUC: {cv_scores.mean():.4f}")  # 0.9073
```
A learning rate of 0.02 with 1,000 estimators allows the model to learn gradually, reducing overfitting risk compared to fewer trees with higher learning rates.
L1 (alpha=0.1) encourages sparsity in feature selection while L2 (lambda=1.0) prevents any single feature from dominating — critical with 64 features.
subsample=0.9 and colsample_bytree=0.8 train each tree on a slightly different random slice of rows and features, improving ensemble diversity and generalization.
Using exact split finding (vs. histogram approximation) on this 10K-row dataset yields precise splits without a significant speed penalty.
5-fold stratified cross-validation performance
| Fold | AUC Score | vs. Mean | Assessment |
|---|---|---|---|
| Fold 1 | 0.9179 | +0.011 | Above average |
| Fold 2 | 0.9359 | +0.029 | Best fold |
| Fold 3 | 0.8963 | −0.011 | Slightly below |
| Fold 4 | 0.9116 | +0.004 | Near average |
| Fold 5 | 0.8746 | −0.033 | Hardest split |
| Mean | 0.9073 | — | — |
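As a quick stability check, the fold scores in the table reproduce the reported mean; the sample standard deviation (computed here from the table, not reported separately) quantifies the split-to-split variance:

```python
import numpy as np

fold_aucs = np.array([0.9179, 0.9359, 0.8963, 0.9116, 0.8746])
print(f"mean={fold_aucs.mean():.4f}  std={fold_aucs.std(ddof=1):.4f}")
# mean=0.9073  std=0.0231
```

A spread of roughly ±0.02 AUC across folds is modest, suggesting the model generalizes consistently rather than riding one lucky split.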