MGMT 571 — Machine Learning Competition

Corporate Bankruptcy
Prediction

A binary classification model predicting whether a company will go bankrupt, using XGBoost with custom preprocessing on 64 financial attributes. Placed 3rd out of 53 teams in a graduate-level ML competition.

  • 0.907 — mean CV AUC
  • 3rd out of 53 teams
  • 64 financial features
  • 18,000 total records

Project Overview

Predict corporate bankruptcy from financial statement data


The Challenge

Given 64 anonymized financial attributes (Attr1–Attr64) for 10,000 companies, predict which will go bankrupt (class = 1) vs. survive (class = 0). An additional 8,000 unlabeled companies form the test set.


Why It Matters

Bankruptcy prediction is central to credit risk assessment, lending decisions, and investment analysis. Accurate early-warning models let financial institutions limit their exposure before defaults materialize.

The Approach

A three-stage pipeline: Winsorization to handle extreme outliers, KNN Imputation to fill missing values, and a tuned XGBoost classifier evaluated with 5-fold stratified CV.

End-to-End Pipeline

  1. Raw data (10K train · 8K test)
  2. Winsorization (0.5th–99.5th percentile)
  3. KNN imputation (k=5, uniform weights)
  4. XGBoost (1,000 trees)
  5. Submission (P(bankrupt))

Data Overview

Anonymized financial attributes from corporate balance sheets

  • 10,000 training samples
  • 8,000 test samples
  • 64 features (Attr1–Attr64)
  • Binary target (0 or 1)

Class Distribution (Training Set)

Missing Values per Feature

Preprocessing Pipeline

Two critical steps before modeling: outlier control and imputation

Step 1: Winsorization

Financial data often contains extreme outliers — a single company reporting wildly different ratios can distort the entire model. Winsorization clips values to the 0.5th and 99.5th percentiles, preserving signal while removing noise at the tails.

Winsorizer (Custom sklearn Transformer)
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip each feature to percentile bounds learned from the training data."""

    def __init__(self, lower=0.5, upper=99.5):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column bounds; nanpercentile ignores missing values
        self.lower_bounds_ = np.nanpercentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanpercentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # NaNs pass through unchanged and are handled by imputation later
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)

Why not just drop outliers? Dropping rows loses data; simple clipping at fixed values is arbitrary. Percentile-based winsorization adapts to each feature’s distribution while keeping the row count intact.
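A quick sanity check on toy data (not the competition set) shows the key property: the bounds are learned in fit and reused in transform, so unseen data is clipped at the training-set percentiles and its own tails never influence the bounds.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Winsorizer(BaseEstimator, TransformerMixin):
    # Same transformer as above, repeated so this snippet runs standalone.
    def __init__(self, lower=0.5, upper=99.5):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.nanpercentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanpercentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float),
                       self.lower_bounds_, self.upper_bounds_)

rng = np.random.default_rng(42)
train = rng.normal(size=(1000, 1))
train[0, 0] = 1e6                      # one absurd outlier in training
test = rng.normal(size=(200, 1)) * 10  # heavier-tailed unseen data

w = Winsorizer().fit(train)            # bounds come from train only
train_clip = w.transform(train)
test_clip = w.transform(test)

# Both sets are capped at the training percentiles; the 1e6 outlier
# cannot survive because it sits beyond the 99.5th percentile.
print(train_clip.max(), test_clip.max())
```

With 1,000 training rows, a single extreme value lies above the 99.5th percentile, so it is clipped back to a plausible range while every other row is left essentially untouched.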

Step 2: KNN Imputation

Many financial attributes have missing values — companies may not report every ratio. KNN Imputation (k=5) fills gaps by averaging values from the 5 most similar companies, preserving inter-feature relationships better than mean/median imputation.

KNN Imputation Pipeline
from sklearn.impute import KNNImputer

# Fit on winsorized training data only (no data leakage)
knn_imputer = KNNImputer(n_neighbors=5, weights="uniform")
knn_imputer.fit(X_full_win)

# Transform both train and test
X_full_imp = knn_imputer.transform(X_full_win)
X_test_imp = knn_imputer.transform(X_test_win)

# Result: zero NaNs remaining
print(np.isnan(X_full_imp).any())  # False
print(np.isnan(X_test_imp).any())  # False

Why KNN over mean/median? Financial ratios are correlated. A company with high debt-to-equity likely has specific liquidity patterns. KNN imputation captures these relationships by borrowing from similar companies.
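The advantage is easy to demonstrate on synthetic correlated data (a toy illustration, not the competition set): when two columns move together, KNN reconstructs hidden values from similar rows far more accurately than a global median can.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)   # b is strongly correlated with a
X_true = np.column_stack([a, b])

# Hide 20% of column b
X_miss = X_true.copy()
mask = rng.random(n) < 0.2
X_miss[mask, 1] = np.nan

knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)
med = SimpleImputer(strategy="median").fit_transform(X_miss)

# Mean absolute error of each imputation against the true hidden values
knn_err = np.abs(knn[mask, 1] - X_true[mask, 1]).mean()
med_err = np.abs(med[mask, 1] - X_true[mask, 1]).mean()
print(f"KNN MAE {knn_err:.3f} vs median MAE {med_err:.3f}")
```

KNN borrows column b from the five rows with the closest column a, so its error tracks the residual noise scale, while the median imputer pays the full spread of the distribution.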

XGBoost Model

Gradient-boosted trees with careful hyperparameter tuning

Parameter          Value    Purpose
n_estimators       1,000    Number of boosting rounds
max_depth          6        Tree depth limit
learning_rate      0.02     Slow learning for generalization
subsample          0.9      Row sampling per tree
colsample_bytree   0.8      Feature sampling per tree
reg_lambda (L2)    1.0      Ridge regularization
reg_alpha (L1)     0.1      Lasso regularization
tree_method        exact    Exact split finding
XGBoost Configuration
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.02,
    subsample=0.9,
    colsample_bytree=0.8,
    reg_lambda=1.0,       # L2 regularization
    reg_alpha=0.1,        # L1 regularization
    tree_method="exact",
    n_jobs=-1,
)

# 5-fold stratified cross-validation on the imputed training data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb, X_full_imp, y, cv=skf, scoring="roc_auc")
print(f"Mean CV AUC: {cv_scores.mean():.4f}")  # 0.9073

Design Decisions

Low Learning Rate + High Trees

A learning rate of 0.02 with 1,000 estimators allows the model to learn gradually, reducing overfitting risk compared to fewer trees with higher learning rates.

Dual Regularization

L1 (alpha=0.1) encourages sparsity in feature selection while L2 (lambda=1.0) prevents any single feature from dominating — critical with 64 features.

Row & Column Sampling

Subsample=0.9 and colsample=0.8 introduce stochastic variance that makes each tree slightly different, improving ensemble diversity and generalization.

Exact Tree Method

Using exact split finding (vs. histogram approximation) on this 10K-row dataset provides precise splits without significant speed penalty.

Results

5-fold stratified cross-validation performance

  • 0.907 — mean CV AUC
  • 0.936 — best fold AUC
  • 0.875 — worst fold AUC
  • 3rd / 53 — competition rank

AUC by Fold

AUC Distribution

Fold-by-Fold Breakdown

Fold     AUC Score   vs. Mean   Assessment
Fold 1   0.9179      +0.011     Above average
Fold 2   0.9359      +0.029     Best fold
Fold 3   0.8963      −0.011     Slightly below
Fold 4   0.9116      +0.004     Near average
Fold 5   0.8746      −0.033     Hardest split
Mean     0.9073

Key Takeaways

  • Consistent performance: AUC ranges from 0.875 to 0.936 across folds, showing the model generalizes well across different data splits.
  • Winsorization was critical: Without outlier capping, extreme financial ratios caused individual trees to overfit to anomalous companies.
  • KNN > mean imputation: KNN imputation improved AUC by ~0.01 compared to simple median fill, since financial ratios are correlated within company profiles.
  • 3rd place finish: The combination of careful preprocessing and well-tuned XGBoost proved more effective than complex multi-model ensembles used by many teams.