A binary classification model predicting whether a company will go bankrupt, using XGBoost with custom preprocessing on 64 financial attributes. Placed 3rd out of 53 teams in a graduate-level ML competition.
Predict corporate bankruptcy from financial statement data
Given 64 anonymized financial attributes (Attr1–Attr64) for 10,000 companies, predict which will go bankrupt (class = 1) vs. survive (class = 0). An additional 8,000 unlabeled companies form the test set.
Bankruptcy prediction is critical for credit risk assessment, lending decisions, and investment analysis. Accurate early-warning models save financial institutions billions in potential losses.
A three-stage pipeline: Winsorization to handle extreme outliers, KNN Imputation to fill missing values, and a tuned XGBoost classifier evaluated with 5-fold stratified CV.
Anonymized financial attributes from corporate balance sheets
Two critical steps before modeling: outlier control and imputation
Financial data often contains extreme outliers — a single company reporting wildly different ratios can distort the entire model. Winsorization clips values to the 0.5th and 99.5th percentiles, preserving signal while removing noise at the tails.
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip each feature to its [lower, upper] percentile range."""

    def __init__(self, lower=0.5, upper=99.5):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.nanpercentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanpercentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)
```
Why not just drop outliers? Dropping rows loses data; simple clipping at fixed values is arbitrary. Percentile-based winsorization adapts to each feature’s distribution while keeping the row count intact.
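A minimal illustration of the same percentile clipping the Winsorizer applies per feature, on a synthetic ratio column (the toy values here are made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy ratio column: 995 ordinary values plus 5 extreme outliers
x = np.concatenate([rng.normal(1.0, 0.2, 995),
                    [50.0, -40.0, 80.0, -60.0, 100.0]])

# Clip to the 0.5th / 99.5th percentiles, exactly as the Winsorizer does
lo, hi = np.nanpercentile(x, [0.5, 99.5])
x_win = np.clip(x, lo, hi)

print(x.min(), x.max())              # -60.0 100.0
print(x_win.min(), x_win.max())      # both pulled back near the bulk of the distribution
```

All 1,000 rows survive; only the five extreme values are pulled in to the percentile bounds.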
Many financial attributes have missing values — companies may not report every ratio. KNN Imputation (k=5) fills gaps by averaging values from the 5 most similar companies, preserving inter-feature relationships better than mean/median imputation.
```python
from sklearn.impute import KNNImputer

# Fit on winsorized training data only (no data leakage)
knn_imputer = KNNImputer(n_neighbors=5, weights="uniform")
knn_imputer.fit(X_full_win)

# Transform both train and test
X_full_imp = knn_imputer.transform(X_full_win)
X_test_imp = knn_imputer.transform(X_test_win)

# Result: zero NaNs remaining
print(np.isnan(X_full_imp).any())  # False
print(np.isnan(X_test_imp).any())  # False
```
Why KNN over mean/median? Financial ratios are correlated. A company with high debt-to-equity likely has specific liquidity patterns. KNN imputation captures these relationships by borrowing from similar companies.
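This difference is easy to see on synthetic data. The sketch below (toy values, not the competition data) builds two correlated "ratios" and hides one entry: KNN imputation borrows from companies with a similar first ratio, while mean imputation ignores the correlation entirely:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 200)
b = 0.9 * a + rng.normal(0.0, 0.1, 200)   # b tracks a closely
X = np.column_stack([a, b])
X[0] = [3.0, 2.7]                          # an atypical company far from the mean

true_val = X[0, 1]
X_missing = X.copy()
X_missing[0, 1] = np.nan                   # hide its second ratio

knn_est = KNNImputer(n_neighbors=5).fit_transform(X_missing)[0, 1]
mean_est = SimpleImputer(strategy="mean").fit_transform(X_missing)[0, 1]

# KNN lands near the true value; the column mean lands near 0
print(abs(knn_est - true_val), abs(mean_est - true_val))
```

Because the company's first ratio is far above average, its neighbors share that trait, and their second ratios are far better guesses than the global mean.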
Gradient-boosted trees with careful hyperparameter tuning
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.02,
    subsample=0.9,
    colsample_bytree=0.8,
    reg_lambda=1.0,   # L2 regularization
    reg_alpha=0.1,    # L1 regularization
    tree_method="exact",
    n_jobs=-1,
)

# 5-fold stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb, X, y, cv=skf, scoring="roc_auc")
print(f"Mean CV AUC: {cv_scores.mean():.4f}")  # 0.9073
```
A learning rate of 0.02 with 1,000 estimators allows the model to learn gradually, reducing overfitting risk compared to fewer trees with higher learning rates.
L1 (alpha=0.1) encourages sparsity in feature selection while L2 (lambda=1.0) prevents any single feature from dominating — critical with 64 features.
subsample=0.9 and colsample_bytree=0.8 train each tree on a slightly different random slice of rows and features, improving ensemble diversity and generalization.
Using exact split finding (vs. histogram approximation) on this 10K-row dataset yields precise splits without a significant speed penalty.
5-fold stratified cross-validation performance
| Fold | AUC Score | vs. Mean | Assessment |
|---|---|---|---|
| Fold 1 | 0.9179 | +0.011 | Above average |
| Fold 2 | 0.9359 | +0.029 | Best fold |
| Fold 3 | 0.8963 | −0.011 | Slightly below |
| Fold 4 | 0.9116 | +0.004 | Near average |
| Fold 5 | 0.8746 | −0.033 | Hardest split |
| Mean | 0.9073 | — | — |
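As a quick stability check, the fold scores in the table reproduce the reported mean; the sample standard deviation (computed here from the table, not reported separately) quantifies the split-to-split variance:

```python
import numpy as np

fold_aucs = np.array([0.9179, 0.9359, 0.8963, 0.9116, 0.8746])
print(f"mean={fold_aucs.mean():.4f}  std={fold_aucs.std(ddof=1):.4f}")
# mean=0.9073  std=0.0231
```

A spread of roughly ±0.02 AUC across folds is modest, suggesting the model generalizes consistently rather than riding one lucky split.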