Research · June 15, 2026 · 2 min read

A LightGBM-Based Macrosomia Prediction Model: The Path to an AUC of 0.920

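This post documents the full experimental workflow, from data cleaning and W-KNN imputation to Optuna hyperparameter tuning. The final model reaches an AUC of 0.920 and an F1 score of 0.567 on the validation set.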

LightGBM · Machine Learning · Medical AI · Macrosomia · Optuna

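Background

Macrosomia (birth weight ≥ 4000 g) is a common high-risk condition in the perinatal period and is closely associated with complications such as dystocia, postpartum hemorrhage, and neonatal hypoglycemia. Accurate prenatal prediction is therefore important for clinical decision-making.

This study builds a LightGBM-based macrosomia prediction model from hospital electronic medical record data.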

Dataset

Metric	Value
Total samples	2,847 cases
Macrosomia rate	18.3%
Number of features	15
Missing rate	6.2% on average

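Core Features

After feature selection, 15 clinical features are retained, including:

Pre-pregnancy BMI, gestational weight gain
Blood glucose at 28 weeks of gestation, fasting blood glucose
Fetal abdominal circumference (AC), femur length (FL)
Parity, gestational age, maternal height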

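Data Preprocessing

Missing Value Handling

Missing values are imputed with W-KNN (weighted K-nearest neighbors), which preserves more of the correlation between features than traditional mean imputation.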

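from sklearn.impute import KNNImputer

# Distance-weighted KNN imputation: closer neighbours contribute more to each imputed value.
# X is the feature matrix. To avoid leakage, fit on the training split only and
# reuse the fitted imputer to transform the validation data.
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_imputed = imputer.fit_transform(X)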
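Because the imputer learns its neighbours from whatever data it is fitted on, one way to keep validation information out of the imputed values is to split first and fit on the training portion only. The sketch below illustrates this on synthetic data. The array shapes, the roughly 6% missing rate, and the names `X_demo`, `X_tr`, and `X_val` are assumptions for illustration, not the study's data or exact procedure.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical feature matrix (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

# Knock out roughly 6% of the entries at random to mimic missing values.
mask = rng.random(X_demo.shape) < 0.06
X_demo[mask] = np.nan

# Split first, so the validation rows never influence the imputer.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit on the training split only, then reuse the fitted imputer on validation data.
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_tr_imputed = imputer.fit_transform(X_tr)
X_val_imputed = imputer.transform(X_val)
```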

Class Imbalance

SMOTE oversampling is used to bring the positive-to-negative sample ratio to 1:2.

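from imblearn.over_sampling import SMOTE

# sampling_strategy=0.5 targets minority/majority = 0.5, i.e. positives:negatives = 1:2.
# Resample training data only; validation data should keep the original class balance.
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_imputed, y)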
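To make the 1:2 target concrete, the arithmetic below works out how many synthetic positives `sampling_strategy=0.5` would imply. It assumes, purely for illustration, that SMOTE is applied to all 2,847 cases at the 18.3% rate from the dataset table. The counts on an actual training split would be smaller. The function name `smote_target_counts` is hypothetical.

```python
def smote_target_counts(n_total, positive_rate, sampling_strategy):
    """Return (n_pos, n_neg, n_synthetic) for a minority/majority target ratio."""
    n_pos = round(n_total * positive_rate)
    n_neg = n_total - n_pos
    # SMOTE tops the minority class up to sampling_strategy * majority size.
    target_pos = int(n_neg * sampling_strategy)
    n_synthetic = max(target_pos - n_pos, 0)
    return n_pos, n_neg, n_synthetic

n_pos, n_neg, n_synthetic = smote_target_counts(2847, 0.183, 0.5)
print(n_pos, n_neg, n_synthetic)       # 521 2326 642
print((n_pos + n_synthetic) / n_neg)   # 0.5, i.e. positives:negatives = 1:2
```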

Model Development

Optuna Hyperparameter Optimization

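import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# X_train / y_train: the preprocessed training split (assumed to be defined earlier).
def objective(trial):
    # Search space for the LightGBM hyperparameters.
    params = {
        'n_estimators':     trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate':    trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth':        trial.suggest_int('max_depth', 3, 10),
        'num_leaves':       trial.suggest_int('num_leaves', 20, 150),
        'min_child_samples':trial.suggest_int('min_child_samples', 10, 100),
        # Note: LightGBM ignores subsample unless subsample_freq > 0 (default 0).
        'subsample':        trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = lgb.LGBMClassifier(**params, random_state=42)
    # Mean ROC AUC over 5-fold cross-validation is the value Optuna maximizes.
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    return score

# Run 100 trials and keep the configuration with the highest mean AUC.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)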

Best Parameters

best_params = {
    'n_estimators':      487,
    'learning_rate':     0.0523,
    'max_depth':         6,
    'num_leaves':        63,
    'min_child_samples': 28,
    'subsample':         0.823,
    'colsample_bytree':  0.791,
}
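As a quick sanity check on these values: a binary tree of depth d has at most 2^d leaves, so `max_depth` places an upper bound on how many leaves `num_leaves` can actually deliver. The helper below is a hypothetical illustration, not part of LightGBM's API.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on leaf count for a binary tree limited to max_depth levels."""
    return 2 ** max_depth

best_max_depth, best_num_leaves = 6, 63  # values from the tuned parameters above
bound = max_leaves_for_depth(best_max_depth)
print(bound)                      # 64
print(best_num_leaves <= bound)   # True: 63 leaves fit within depth 6

# At the low end of the search space, the depth limit is the binding constraint:
print(max_leaves_for_depth(3))    # 8, below the num_leaves lower bound of 20
```

In other words, the tuned pair (6, 63) sits just inside the depth limit, while trials with a small `max_depth` and a large `num_leaves` are effectively governed by the depth alone.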

Results

Validation Set Performance

Metric	Value
AUC	0.920
F1 Score	0.567
Sensitivity	0.731
Specificity	0.884
PPV	0.462
NPV	0.954
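These metrics are linked: F1 is the harmonic mean of precision (PPV) and recall (sensitivity), so it can be recomputed from the table as a consistency check. A minimal sketch; the function name `f1_from_ppv_sensitivity` is hypothetical.

```python
def f1_from_ppv_sensitivity(ppv, sensitivity):
    """F1 is the harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)

f1 = f1_from_ppv_sensitivity(ppv=0.462, sensitivity=0.731)
print(round(f1, 3))  # 0.566, in line with the reported 0.567 given rounded inputs
```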

Model Comparison

Model	AUC	F1
LightGBM (this study)	0.920	0.567
XGBoost	0.903	0.541
Random Forest	0.891	0.523
Logistic Regression	0.847	0.478
Decision Tree	0.812	0.501

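Conclusion

The LightGBM-based macrosomia prediction model achieves an AUC of 0.920 on the validation set, ahead of the four baseline models in the comparison table, and shows promise for clinical use.

The high NPV (0.954) means the model can reliably rule out non-macrosomic cases, which may reduce unnecessary interventions.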
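NPV is not a fixed property of a model: for a given sensitivity and specificity, it shifts with the prevalence of macrosomia in the population being screened. The sketch below applies Bayes' rule to the reported sensitivity (0.731) and specificity (0.884). The validation-set prevalence is not stated in this post, so the prevalences used here are illustrative assumptions and the outputs are not expected to reproduce the table exactly. The function name `npv_at_prevalence` is hypothetical.

```python
def npv_at_prevalence(sensitivity, specificity, prevalence):
    """Negative predictive value via Bayes' rule for a given prevalence."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

# Illustrative prevalences; 0.183 is the overall macrosomia rate in the dataset.
prevalences = [0.10, 0.183, 0.30]
npvs = [npv_at_prevalence(0.731, 0.884, p) for p in prevalences]
for p, npv in zip(prevalences, npvs):
    print(f"prevalence={p:.3f}  NPV={npv:.3f}")
```

The same calculation can be rerun with the prevalence of an external cohort to see how much of the rule-out performance is likely to carry over.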

The next steps are external validation and writing up the paper.