Hyperparameter tuning based on death discrimination

This notebook tunes and validates an XGBoost survival model (survival:cox) for all-cause mortality prediction in a substance use disorder (SUD) treatment cohort. It uses five multiply imputed datasets and ~56 predictors (demographic, clinical, socioeconomic, and treatment features).

Hyperparameter optimization is performed with Optuna using 5-fold cross-validation and dual stratification (treatment plan type + death event status), with fallback to simpler stratification when rare strata make 5-fold splitting infeasible.
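The fallback logic described above can be sketched as follows. `stratification_labels` is a hypothetical helper, not the notebook's exact implementation: it builds combined plan-type × event labels and falls back to event-only labels whenever some combined stratum is too small to appear in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratification_labels(plan_type, death_event, n_splits=5):
    """Labels for StratifiedKFold: plan type x event status when every
    combined stratum can populate all folds, otherwise event status alone.
    (Hypothetical helper sketching the fallback, not the notebook's code.)"""
    combined = np.array([f"{p}|{e}" for p, e in zip(plan_type, death_event)])
    _, counts = np.unique(combined, return_counts=True)
    if counts.min() >= n_splits:
        return combined
    return np.asarray(death_event).astype(str)

# Toy example: only 4 deaths per plan type, so 5-fold dual stratification
# is infeasible and the labels fall back to event status alone
plan = np.array(["pg-pai", "m-pr"] * 100)
event = np.array([1] * 8 + [0] * 192)
labels = stratification_labels(plan, event, n_splits=5)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2125)
folds = list(cv.split(np.zeros(len(labels)), labels))
```

With event-only labels, every test fold still receives at least one death, which keeps the fold-level metrics computable.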

The search is bi-objective (Pareto optimization), maximizing multi-horizon IPCW Uno’s C-index and minimizing Integrated Brier Score (IBS). Metrics are evaluated at 3, 6, 12, 36, and 60 months. For IBS, XGBoost risk scores are converted to absolute survival probabilities using Breslow baseline estimation.

The workflow uses stochastic multi-imputation tuning (one imputation per trial) followed by rescoring top Pareto candidates across all imputations to select a final robust configuration. Final performance is reported with bootstrap optimism correction and 95% confidence intervals.
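The optimism-correction step can be sketched generically as follows. Ordinary least squares and R² stand in for the survival model and its metrics; `optimism_corrected` is an illustrative helper, not the notebook's implementation.

```python
import numpy as np

def optimism_corrected(metric, fit, X, y, n_boot=200, seed=2125):
    """Harrell-style bootstrap optimism correction (generic sketch):
    corrected = apparent - mean(boot_apparent - boot_tested_on_original).
    `fit(X, y)` returns a prediction function; `metric(pred, y)` a scalar."""
    rng = np.random.default_rng(seed)
    apparent = metric(fit(X, y)(X), y)
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        model = fit(X[idx], y[idx])  # refit on the bootstrap resample
        optimism.append(metric(model(X[idx]), y[idx]) - metric(model(X), y))
    return apparent - float(np.mean(optimism))

# Toy stand-ins for the survival model and metric
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([0.5, -0.3]) + rng.normal(size=300)

def ols_fit(Xtr, ytr):
    Z = np.c_[np.ones(len(Xtr)), Xtr]
    beta = np.linalg.lstsq(Z, ytr, rcond=None)[0]
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ beta

def r2(pred, ytrue):
    return 1.0 - np.sum((ytrue - pred) ** 2) / np.sum((ytrue - ytrue.mean()) ** 2)

apparent = r2(ols_fit(X, y)(X), y)
corrected = optimism_corrected(r2, ols_fit, X, y, n_boot=100)
```

The corrected estimate sits below the apparent one because each bootstrap model looks better on its own resample than on the original data; the mean of that gap is the optimism subtracted at the end.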

Runs are CPU-only and use seed 2125 for reproducibility.

Author

ags

Published

March 5, 2026

Hyperparameter tuning for XGBoost (death as the reference outcome)

0. Package loading and installation

Automatically generated by Colab.

Original file is located at https://colab.research.google.com/drive/1FMHIud9Hi0rIxnMqRfRFdzBQpKEzI796

Code
# Reset the interactive namespace and free memory (Jupyter/Colab only)
%reset -f
import gc
gc.collect()

import numpy as np
import pandas as pd
import time

#conda install -c conda-forge \
#    numpy \
#    scipy \
#    pandas \
#    pyarrow \
#    scikit-survival \
#    spyder \
#    lifelines

# conda install -c conda-forge fastparquet
# conda install -c conda-forge xgboost
# conda install -c conda-forge pytorch cpuonly
# conda install -c pytorch pytorch cpuonly
# conda install -c conda-forge matplotlib
# conda install -c conda-forge seaborn
# conda install spyder-notebook -c spyder-ide
# conda install notebook nbformat nbconvert
# conda install -c conda-forge xlsxwriter
# conda install -c conda-forge shap

# import subprocess, sys

# subprocess.check_call([
#     sys.executable,
#     "-m",
#     "pip",
#     "install",
#     "matplotlib"
# ])

# subprocess.check_call([
#     sys.executable,
#     "-m",
#     "pip",
#     "install",
#     "seaborn"
# ])

print("numpy:", np.__version__)


from sksurv.metrics import (
    concordance_index_ipcw,
    brier_score,
    integrated_brier_score
)
from sksurv.util import Surv

# R-style dput: dump a DataFrame as a dict literal
def dput_df(df, digits=6):
    data = {
        "columns": list(df.columns),
        "data": [
            [round(x, digits) if isinstance(x, (float, np.floating)) else x
             for x in row]
            for row in df.to_numpy()
        ]
    }
    print(data)


# dplyr-style glimpse of a DataFrame
def glimpse(df, max_width=80):
    print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}")
    for col in df.columns:
        dtype = df[col].dtype
        preview = df[col].astype(str).head(5).tolist()
        preview_str = ", ".join(preview)
        if len(preview_str) > max_width:
            preview_str = preview_str[:max_width] + "..."
        print(f"{col:<30} {str(dtype):<15} {preview_str}")
# janitor-style tabyl: counts and proportions
def tabyl(series):
    counts = series.value_counts(dropna=False)
    props = series.value_counts(normalize=True, dropna=False)
    return pd.DataFrame({"value": counts.index,
                         "n": counts.values,
                         "percent": props.values})
# janitor-style clean_names
import re

def clean_names(df):
    """
    Mimic janitor::clean_names for pandas DataFrames.
    - Lowercase
    - Replace spaces and special chars with underscores
    - Remove non-alphanumeric/underscore
    """
    new_cols = []
    for col in df.columns:
        # lowercase
        col = col.lower()
        # replace spaces and special chars with underscore
        col = re.sub(r"[^\w]+", "_", col)
        # strip leading/trailing underscores
        col = col.strip("_")
        new_cols.append(col)
    df.columns = new_cols
    return df
numpy: 2.0.1

Load data

Code

from pathlib import Path

BASE_DIR = Path(
    r"G:\My Drive\Alvacast\SISTRAT 2023\data\20241015_out\pred1"
)

import pickle

with open(BASE_DIR / "imputations_list_jan26.pkl", "rb") as f:
    imputations_list_jan26 = pickle.load(f)

imputation_1 = pd.read_parquet(
    BASE_DIR / "imputation_1.parquet",
    engine="fastparquet"
)
Code

import pandas as pd

for i in range(1, 6):
    globals()[f"imputation_nodum_{i}"] = pd.read_parquet(
        BASE_DIR / f"imputation_nondum_{i}.parquet",
        engine="fastparquet"
    )
Code
from IPython.display import display, HTML
import io
import sys

def fold_output(title, func):
    # Capture stdout while func runs, restoring it even if func raises
    buffer = io.StringIO()
    sys.stdout = buffer
    try:
        func()
    finally:
        sys.stdout = sys.__stdout__

    html = f"""
    <details>
      <summary>{title}</summary>
      <pre>{buffer.getvalue()}</pre>
    </details>
    """
    display(HTML(html))


fold_output(
    "Show imputation_nodum_1 structure",
    lambda: imputation_nodum_1.info()
)

fold_output(
    "Show imputation_1 structure",
    lambda: imputation_1.info()
)
Show imputation_nodum_1 structure

RangeIndex: 88504 entries, 0 to 88503
Data columns (total 43 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   readmit_time_from_adm_m        88504 non-null  float64
 1   death_time_from_adm_m          88504 non-null  float64
 2   adm_age_rec3                   88504 non-null  float64
 3   porc_pobr                      88504 non-null  float64
 4   dit_m                          88504 non-null  float64
 5   sex_rec                        88504 non-null  object 
 6   tenure_status_household        88504 non-null  object 
 7   cohabitation                   88504 non-null  object 
 8   sub_dep_icd10_status           88504 non-null  object 
 9   any_violence                   88504 non-null  object 
 10  prim_sub_freq_rec              88504 non-null  object 
 11  tr_outcome                     88504 non-null  object 
 12  adm_motive                     88504 non-null  object 
 13  first_sub_used                 88504 non-null  object 
 14  primary_sub_mod                88504 non-null  object 
 15  tipo_de_vivienda_rec2          88504 non-null  object 
 16  national_foreign               88504 non-null  int32  
 17  plan_type_corr                 88504 non-null  object 
 18  occupation_condition_corr24    88504 non-null  object 
 19  marital_status_rec             88504 non-null  object 
 20  urbanicity_cat                 88504 non-null  object 
 21  ed_attainment_corr             88504 non-null  object 
 22  evaluacindelprocesoteraputico  88504 non-null  object 
 23  eva_consumo                    88504 non-null  object 
 24  eva_fam                        88504 non-null  object 
 25  eva_relinterp                  88504 non-null  object 
 26  eva_ocupacion                  88504 non-null  object 
 27  eva_sm                         88504 non-null  object 
 28  eva_fisica                     88504 non-null  object 
 29  eva_transgnorma                88504 non-null  object 
 30  ethnicity                      88504 non-null  float64
 31  dg_psiq_cie_10_instudy         88504 non-null  bool   
 32  dg_psiq_cie_10_dg              88504 non-null  bool   
 33  dx_f3_mood                     88504 non-null  int32  
 34  dx_f6_personality              88504 non-null  int32  
 35  dx_f_any_severe_mental         88504 non-null  bool   
 36  any_phys_dx                    88504 non-null  bool   
 37  polysubstance_strict           88504 non-null  int32  
 38  readmit_event                  88504 non-null  float64
 39  death_event                    88504 non-null  int32  
 40  readmit_time_from_disch_m      88504 non-null  float64
 41  death_time_from_disch_m        88504 non-null  float64
 42  center_id                      88475 non-null  object 
dtypes: bool(4), float64(9), int32(5), object(25)
memory usage: 25.0+ MB
Show imputation_1 structure

RangeIndex: 88504 entries, 0 to 88503
Data columns (total 78 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   readmit_time_from_adm_m                                             88504 non-null  float64
 1   death_time_from_adm_m                                               88504 non-null  float64
 2   adm_age_rec3                                                        88504 non-null  float64
 3   porc_pobr                                                           88504 non-null  float64
 4   dit_m                                                               88504 non-null  float64
 5   national_foreign                                                    88504 non-null  int32  
 6   ethnicity                                                           88504 non-null  float64
 7   dg_psiq_cie_10_instudy                                              88504 non-null  bool   
 8   dg_psiq_cie_10_dg                                                   88504 non-null  bool   
 9   dx_f3_mood                                                          88504 non-null  int32  
 10  dx_f6_personality                                                   88504 non-null  int32  
 11  dx_f_any_severe_mental                                              88504 non-null  bool   
 12  any_phys_dx                                                         88504 non-null  bool   
 13  polysubstance_strict                                                88504 non-null  int32  
 14  readmit_time_from_disch_m                                           88504 non-null  float64
 15  readmit_event                                                       88504 non-null  float64
 16  death_time_from_disch_m                                             88504 non-null  float64
 17  death_event                                                         88504 non-null  int32  
 18  sex_rec_woman                                                       88504 non-null  float64
 19  tenure_status_household_illegal_settlement                          88504 non-null  float64
 20  tenure_status_household_owner_transferred_dwellings_pays_dividends  88504 non-null  float64
 21  tenure_status_household_renting                                     88504 non-null  float64
 22  tenure_status_household_stays_temporarily_with_a_relative           88504 non-null  float64
 23  cohabitation_alone                                                  88504 non-null  float64
 24  cohabitation_with_couple_children                                   88504 non-null  float64
 25  cohabitation_family_of_origin                                       88504 non-null  float64
 26  sub_dep_icd10_status_drug_dependence                                88504 non-null  float64
 27  any_violence_1_domestic_violence_sex_abuse                          88504 non-null  float64
 28  prim_sub_freq_rec_2_2_6_days_wk                                     88504 non-null  float64
 29  prim_sub_freq_rec_3_daily                                           88504 non-null  float64
 30  tr_outcome_adm_discharge_adm_reasons                                88504 non-null  float64
 31  tr_outcome_adm_discharge_rule_violation_undet                       88504 non-null  float64
 32  tr_outcome_completion                                               88504 non-null  float64
 33  tr_outcome_dropout                                                  88504 non-null  float64
 34  tr_outcome_referral                                                 88504 non-null  float64
 35  adm_motive_another_sud_facility_fonodrogas_senda_previene           88504 non-null  float64
 36  adm_motive_justice_sector                                           88504 non-null  float64
 37  adm_motive_sanitary_sector                                          88504 non-null  float64
 38  adm_motive_spontaneous_consultation                                 88504 non-null  float64
 39  first_sub_used_alcohol                                              88504 non-null  float64
 40  first_sub_used_cocaine_paste                                        88504 non-null  float64
 41  first_sub_used_cocaine_powder                                       88504 non-null  float64
 42  first_sub_used_marijuana                                            88504 non-null  float64
 43  first_sub_used_opioids                                              88504 non-null  float64
 44  first_sub_used_tranquilizers_hypnotics                              88504 non-null  float64
 45  primary_sub_mod_cocaine_paste                                       88504 non-null  float64
 46  primary_sub_mod_cocaine_powder                                      88504 non-null  float64
 47  primary_sub_mod_alcohol                                             88504 non-null  float64
 48  primary_sub_mod_marijuana                                           88504 non-null  float64
 49  tipo_de_vivienda_rec2_other_unknown                                 88504 non-null  float64
 50  plan_type_corr_m_pai                                                88504 non-null  float64
 51  plan_type_corr_m_pr                                                 88504 non-null  float64
 52  plan_type_corr_pg_pai                                               88504 non-null  float64
 53  plan_type_corr_pg_pr                                                88504 non-null  float64
 54  occupation_condition_corr24_inactive                                88504 non-null  float64
 55  occupation_condition_corr24_unemployed                              88504 non-null  float64
 56  marital_status_rec_separated_divorced_annulled_widowed              88504 non-null  float64
 57  marital_status_rec_single                                           88504 non-null  float64
 58  urbanicity_cat_1_rural                                              88504 non-null  float64
 59  urbanicity_cat_2_mixed                                              88504 non-null  float64
 60  ed_attainment_corr_2_completed_high_school_or_less                  88504 non-null  float64
 61  ed_attainment_corr_3_completed_primary_school_or_less               88504 non-null  float64
 62  evaluacindelprocesoteraputico_logro_intermedio                      88504 non-null  float64
 63  evaluacindelprocesoteraputico_logro_minimo                          88504 non-null  float64
 64  eva_consumo_logro_intermedio                                        88504 non-null  float64
 65  eva_consumo_logro_minimo                                            88504 non-null  float64
 66  eva_fam_logro_intermedio                                            88504 non-null  float64
 67  eva_fam_logro_minimo                                                88504 non-null  float64
 68  eva_relinterp_logro_intermedio                                      88504 non-null  float64
 69  eva_relinterp_logro_minimo                                          88504 non-null  float64
 70  eva_ocupacion_logro_intermedio                                      88504 non-null  float64
 71  eva_ocupacion_logro_minimo                                          88504 non-null  float64
 72  eva_sm_logro_intermedio                                             88504 non-null  float64
 73  eva_sm_logro_minimo                                                 88504 non-null  float64
 74  eva_fisica_logro_intermedio                                         88504 non-null  float64
 75  eva_fisica_logro_minimo                                             88504 non-null  float64
 76  eva_transgnorma_logro_intermedio                                    88504 non-null  float64
 77  eva_transgnorma_logro_minimo                                        88504 non-null  float64
dtypes: bool(4), float64(69), int32(5)
memory usage: 48.6 MB
Code
from IPython.display import display, Markdown

if isinstance(imputations_list_jan26, list) and len(imputations_list_jan26) > 0:
    display(Markdown(f"**First element type:** `{type(imputations_list_jan26[0])}`"))

    if isinstance(imputations_list_jan26[0], dict):
        display(Markdown(f"**First element keys:** `{list(imputations_list_jan26[0].keys())}`"))

    elif isinstance(imputations_list_jan26[0], (pd.DataFrame, np.ndarray)):
        display(Markdown(f"**First element shape:** `{imputations_list_jan26[0].shape}`"))

First element type: <class 'pandas.DataFrame'>

First element shape: (88504, 56)

This code block:

  1. Imports the pickle library, which implements binary protocols for serializing and de-serializing Python object structures.
  2. Builds the file path from BASE_DIR, pointing at the target .pkl file.
  3. Opens the file in binary read mode ('rb'), as required for pickle files.
  4. Loads the object: pickle.load(f) reads the serialized object from the file and reconstructs it in memory.
  5. Prints confirmation and basic information: the type of the loaded object and, when it is a list of common data structures, the keys or shape of its first element.

Format data

Due to inconsistencies and structural heterogeneity across previously merged datasets, we decided not to proceed with a direct inspection and comparison of column names between the first imputed dataset from imputations_list_jan26 (which likely included dummy-encoded variables) and imputation_nodum_1 (which likely retained non–dummy-encoded variables).

Instead, we reconstructed the analytic datasets de novo using the most recent source files available in the original directory (BASE_DIR). Time-to-event variables were re-derived to ensure internal consistency. Variables that could introduce information leakage (e.g., time from admission) were excluded, and the center identifier variable was removed prior to modeling.

Code
# 1.2. Build Surv objects for each imputed dataset
from IPython.display import display, Markdown
from sksurv.util import Surv

for i in range(1, 6):
    # Get the DataFrame
    df = globals()[f"imputation_nodum_{i}"]

    # Extract time and event arrays
    time_readm  = df["readmit_time_from_disch_m"].to_numpy()
    event_readm = (df["readmit_event"].to_numpy() == 1)
    time_death  = df["death_time_from_disch_m"].to_numpy()
    event_death = (df["death_event"].to_numpy() == 1)

    # Create survival objects
    y_surv_readm = Surv.from_arrays(event=event_readm, time=time_readm)
    y_surv_death = Surv.from_arrays(event=event_death, time=time_death)

    # Store in global variables for later use
    globals()[f"y_surv_readm_{i}"]  = y_surv_readm
    globals()[f"y_surv_death_{i}"]  = y_surv_death

    # Print info
    display(Markdown(f"\n--- Imputation {i} ---"))
    display(Markdown(
    f"**y_surv_readm dtype:** {y_surv_readm.dtype}  \n"
    f"**shape:** {y_surv_readm.shape}"
    ))
    display(Markdown(
    f"**y_surv_death dtype:** {y_surv_death.dtype}  \n"
    f"**shape:** {y_surv_death.shape}"
    ))

— Imputation 1 —

y_surv_readm dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

y_surv_death dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

— Imputation 2 —

y_surv_readm dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

y_surv_death dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

— Imputation 3 —

y_surv_readm dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

y_surv_death dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

— Imputation 4 —

y_surv_readm dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

y_surv_death dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

— Imputation 5 —

y_surv_readm dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

y_surv_death dtype: [(‘event’, ‘?’), (‘time’, ‘<f8’)]
shape: (88504,)

Code
fold_output(
    "Show imputation_nodum_1 (newer database) glimpse",
    lambda: glimpse(imputation_nodum_1)
)
fold_output(
    "Show first db of imputations_list_jan26 (older) glimpse",
    lambda: glimpse(imputations_list_jan26[0])
)
Show imputation_nodum_1 (newer database) glimpse
Rows: 88504 | Columns: 43
readmit_time_from_adm_m        float64         84.93548387096774, 12.833333333333334, 13.733333333333333, 11.966666666666667, 1...
death_time_from_adm_m          float64         84.93548387096774, 87.16129032258064, 117.2258064516129, 98.93548387096774, 37.9...
adm_age_rec3                   float64         31.53, 20.61, 42.52, 60.61, 45.08
porc_pobr                      float64         0.175679117441177, 0.187835901975632, 0.130412444472313, 0.133759185671806, 0.08...
dit_m                          float64         15.967741935483872, 5.833333333333334, 0.4752688172043005, 6.966666666666667, 6....
sex_rec                        object          man, man, man, woman, man
tenure_status_household        object          stays temporarily with a relative, owner/transferred dwellings/pays dividends, s...
cohabitation                   object          alone, family of origin, with couple/children, with couple/children, family of o...
sub_dep_icd10_status           object          drug dependence, hazardous consumption, drug dependence, drug dependence, drug d...
any_violence                   object          0.No domestic violence/sex abuse, 0.No domestic violence/sex abuse, 0.No domesti...
prim_sub_freq_rec              object          2.2–6 days/wk, 3.Daily, 3.Daily, 3.Daily, 3.Daily
tr_outcome                     object          referral, dropout, adm discharge - rule violation/undet, dropout, completion
adm_motive                     object          sanitary sector, spontaneous consultation, sanitary sector, sanitary sector, spo...
first_sub_used                 object          alcohol, alcohol, alcohol, cocaine paste, alcohol
primary_sub_mod                object          alcohol, cocaine paste, cocaine paste, cocaine paste, cocaine paste
tipo_de_vivienda_rec2          object          other/unknown, formal housing, formal housing, formal housing, formal housing
national_foreign               int32           0, 0, 0, 0, 0
plan_type_corr                 object          pg-pab, pg-pab, pg-pr, m-pr, pg-pai
occupation_condition_corr24    object          unemployed, employed, employed, inactive, unemployed
marital_status_rec             object          single, single, single, married/cohabiting, single
urbanicity_cat                 object          3.Urban, 3.Urban, 3.Urban, 3.Urban, 3.Urban
ed_attainment_corr             object          2-Completed high school or less, 3-Completed primary school or less, 2-Completed...
evaluacindelprocesoteraputico  object          logro alto, logro minimo, logro minimo, logro minimo, logro alto
eva_consumo                    object          logro alto, logro minimo, logro minimo, logro intermedio, logro alto
eva_fam                        object          logro intermedio, logro minimo, logro minimo, logro intermedio, logro alto
eva_relinterp                  object          logro alto, logro minimo, logro minimo, logro intermedio, logro alto
eva_ocupacion                  object          logro alto, logro minimo, logro minimo, logro minimo, logro intermedio
eva_sm                         object          logro intermedio, logro minimo, logro minimo, logro intermedio, logro minimo
eva_fisica                     object          logro alto, logro minimo, logro intermedio, logro intermedio, logro alto
eva_transgnorma                object          logro alto, logro minimo, logro minimo, logro minimo, logro intermedio
ethnicity                      float64         0.0, 0.0, 0.0, 0.0, 0.0
dg_psiq_cie_10_instudy         bool            False, False, True, False, False
dg_psiq_cie_10_dg              bool            True, False, False, True, False
dx_f3_mood                     int32           0, 0, 0, 0, 0
dx_f6_personality              int32           0, 0, 0, 1, 0
dx_f_any_severe_mental         bool            True, False, False, False, False
any_phys_dx                    bool            False, False, False, True, False
polysubstance_strict           int32           0, 1, 1, 1, 1
readmit_event                  float64         0.0, 1.0, 1.0, 1.0, 1.0
death_event                    int32           0, 0, 0, 0, 0
readmit_time_from_disch_m      float64         68.96774193548387, 7.0, 13.258064516129032, 5.0, 7.354838709677419
death_time_from_disch_m        float64         68.96774193548387, 81.3225806451613, 116.74193548387096, 91.96774193548387, 31.0...
center_id                      object          330, 489, 341, 411, 176
Show first db of imputations_list_jan26 (older) glimpse
Rows: 88504 | Columns: 56
adm_age_rec3                   float64         31.53, 20.61, 42.52, 60.61, 45.08
porc_pobr                      float64         0.175679117441177, 0.187835901975632, 0.130412444472313, 0.133759185671806, 0.08...
dit_m                          float64         15.967741935483872, 5.833333333333334, 0.4752688172043005, 6.966666666666667, 6....
tenure_status_household        int64           3, 0, 3, 0, 3
prim_sub_freq_rec              int64           1, 2, 2, 2, 2
national_foreign               int32           0, 0, 0, 0, 0
urbanicity_cat                 int64           0, 0, 0, 0, 0
ed_attainment_corr             float64         1.0, 2.0, 1.0, 1.0, 2.0
evaluacindelprocesoteraputico  int64           0, 2, 2, 2, 0
eva_consumo                    int64           0, 2, 2, 1, 0
eva_fam                        int64           1, 2, 2, 1, 0
eva_relinterp                  int64           0, 2, 2, 1, 0
eva_ocupacion                  int64           0, 2, 2, 2, 1
eva_sm                         int64           1, 2, 2, 1, 2
eva_fisica                     int64           0, 2, 1, 1, 0
eva_transgnorma                int64           0, 2, 2, 2, 1
ethnicity                      float64         0.0, 0.0, 0.0, 0.0, 0.0
dg_psiq_cie_10_instudy         bool            False, False, True, False, False
dg_psiq_cie_10_dg              bool            True, False, False, True, False
dx_f3_mood                     int32           0, 0, 0, 0, 0
dx_f6_personality              int32           0, 0, 0, 1, 0
dx_f_any_severe_mental         bool            True, False, False, False, False
any_phys_dx                    bool            False, False, False, True, False
polysubstance_strict           int32           0, 1, 1, 1, 1
sex_rec_woman                  float64         0.0, 0.0, 0.0, 1.0, 0.0
cohabitation_family_of_origin  float64         0.0, 1.0, 0.0, 0.0, 1.0
cohabitation_with_couple_children float64         0.0, 0.0, 1.0, 1.0, 0.0
cohabitation_others            float64         0.0, 0.0, 0.0, 0.0, 0.0
sub_dep_icd10_status_drug_dependence float64         1.0, 0.0, 1.0, 1.0, 1.0
any_violence_1_domestic_violence_sex_abuse float64         0.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_referral            float64         1.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_dropout             float64         0.0, 1.0, 0.0, 1.0, 0.0
tr_outcome_adm_discharge_rule_violation_undet float64         0.0, 0.0, 1.0, 0.0, 0.0
tr_outcome_adm_discharge_adm_reasons float64         0.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_other               float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_sanitary_sector     float64         1.0, 0.0, 1.0, 1.0, 0.0
adm_motive_another_sud_facility_fonodrogas_senda_previene float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_justice_sector      float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_other               float64         0.0, 0.0, 0.0, 0.0, 0.0
first_sub_used_alcohol         float64         1.0, 1.0, 1.0, 0.0, 1.0
first_sub_used_cocaine_paste   float64         0.0, 0.0, 0.0, 1.0, 0.0
first_sub_used_cocaine_powder  float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_cocaine_powder float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_alcohol        float64         1.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_marijuana      float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_others         float64         0.0, 0.0, 0.0, 0.0, 0.0
tipo_de_vivienda_rec2_other_unknown float64         1.0, 0.0, 0.0, 0.0, 0.0
plan_type_corr_pg_pr           float64         0.0, 0.0, 1.0, 0.0, 0.0
plan_type_corr_m_pr            float64         0.0, 0.0, 0.0, 1.0, 0.0
plan_type_corr_pg_pai          float64         0.0, 0.0, 0.0, 0.0, 1.0
plan_type_corr_m_pai           float64         0.0, 0.0, 0.0, 0.0, 0.0
occupation_condition_corr24_unemployed float64         1.0, 0.0, 0.0, 0.0, 1.0
occupation_condition_corr24_inactive float64         0.0, 0.0, 0.0, 1.0, 0.0
marital_status_rec_single      float64         1.0, 1.0, 1.0, 0.0, 1.0
marital_status_rec_separated_divorced_annulled_widowed float64         0.0, 0.0, 0.0, 0.0, 0.0
first_sub_used_other           float64         0.0, 0.0, 0.0, 0.0, 0.0

For each imputed dataset (1–5), we identified and removed predictors with zero variance, as they provide no useful information and can destabilize models. We printed the dropped variables and produced a cleaned version of each design matrix. This ensures that all downstream analyses use only informative predictors.

Code
# Keep only these objects (note the per-imputation suffixes: the Surv
# objects are named y_surv_readm_1..5 and y_surv_death_1..5)
objects_to_keep = {
    "objects_to_keep",
    "imputations_list_jan26",
    *(f"imputation_nodum_{i}" for i in range(1, 6)),
    *(f"y_surv_readm_{i}" for i in range(1, 6)),
    *(f"y_surv_death_{i}" for i in range(1, 6)),
}

import types

for name in list(globals().keys()):
    obj = globals()[name]
    if (
        name not in objects_to_keep
        and not name.startswith("_")
        and not callable(obj)
        and not isinstance(obj, types.ModuleType)  # <- protects ALL modules
    ):
        del globals()[name]
Code
from IPython.display import display, Markdown

# 1. Define columns to exclude (same as before)
target_cols = [
    "readmit_time_from_disch_m",
    "readmit_event",
    "death_time_from_disch_m",
    "death_event",
]

leak_time_cols = [
    "readmit_time_from_adm_m",
    "death_time_from_adm_m",
]

center_id = ["center_id"]

cols_to_exclude = target_cols + center_id  + leak_time_cols

# 2. Create list of your EXISTING imputation DataFrames (1-5)
imputed_dfs = [
    imputation_nodum_1,
    imputation_nodum_2,
    imputation_nodum_3,
    imputation_nodum_4,
    imputation_nodum_5
]

# 3. Preprocessing loop
X_reduced_list = []

for d, df in enumerate(imputed_dfs):
    imputation_num = d + 1  # Convert 0-index to 1-index for display

    display(Markdown(f"\n=== Imputation dataset {imputation_num} ==="))

    # a) Identify and drop constant predictors
    const_mask = (df.nunique(dropna=False) <= 1)
    dropped_const = df.columns[const_mask].tolist()
    display(Markdown(f"**Constant predictors dropped ({len(dropped_const)}):**"))
    display(Markdown(f"{dropped_const if dropped_const else 'None'}"))

    # b) Remove constant columns
    X_reduced = df.loc[:, ~const_mask]

    # c) Drop target/leakage columns (if present)
    cols_to_drop = [col for col in cols_to_exclude if col in X_reduced.columns]
    if cols_to_drop:
        X_reduced = X_reduced.drop(columns=cols_to_drop)
        display(Markdown(f"**Dropped target/leakage columns:** {cols_to_drop}"))
    else:
        display(Markdown("No target/leakage columns found to drop"))

    # d) Store cleaned DataFrame
    X_reduced_list.append(X_reduced)

    # e) Report shapes
    display(Markdown(f"**Original shape:** {df.shape}"))
    display(Markdown(
        f"**Cleaned shape:** {X_reduced.shape} "
        f"(removed {df.shape[1] - X_reduced.shape[1]} columns)"
    ))

display(Markdown("\n✅ **Preprocessing complete! X_reduced_list contains 5 cleaned DataFrames.**"))

=== Imputation dataset 1 ===

Constant predictors dropped (0):

None

Dropped target/leakage columns: [‘readmit_time_from_disch_m’, ‘readmit_event’, ‘death_time_from_disch_m’, ‘death_event’, ‘center_id’, ‘readmit_time_from_adm_m’, ‘death_time_from_adm_m’]

Original shape: (88504, 43)

Cleaned shape: (88504, 36) (removed 7 columns)

=== Imputation dataset 2 ===

Constant predictors dropped (0):

None

Dropped target/leakage columns: [‘readmit_time_from_disch_m’, ‘readmit_event’, ‘death_time_from_disch_m’, ‘death_event’, ‘center_id’, ‘readmit_time_from_adm_m’, ‘death_time_from_adm_m’]

Original shape: (88504, 43)

Cleaned shape: (88504, 36) (removed 7 columns)

=== Imputation dataset 3 ===

Constant predictors dropped (0):

None

Dropped target/leakage columns: [‘readmit_time_from_disch_m’, ‘readmit_event’, ‘death_time_from_disch_m’, ‘death_event’, ‘center_id’, ‘readmit_time_from_adm_m’, ‘death_time_from_adm_m’]

Original shape: (88504, 43)

Cleaned shape: (88504, 36) (removed 7 columns)

=== Imputation dataset 4 ===

Constant predictors dropped (0):

None

Dropped target/leakage columns: [‘readmit_time_from_disch_m’, ‘readmit_event’, ‘death_time_from_disch_m’, ‘death_event’, ‘center_id’, ‘readmit_time_from_adm_m’, ‘death_time_from_adm_m’]

Original shape: (88504, 43)

Cleaned shape: (88504, 36) (removed 7 columns)

=== Imputation dataset 5 ===

Constant predictors dropped (0):

None

Dropped target/leakage columns: [‘readmit_time_from_disch_m’, ‘readmit_event’, ‘death_time_from_disch_m’, ‘death_event’, ‘center_id’, ‘readmit_time_from_adm_m’, ‘death_time_from_adm_m’]

Original shape: (88504, 43)

Cleaned shape: (88504, 36) (removed 7 columns)

Preprocessing complete! X_reduced_list contains 5 cleaned DataFrames.
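The constant-predictor rule used in step (a) above can be sketched on toy data; because `dropna=False` counts NaN as a value, an all-NaN column also registers as constant:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'b' is constant, 'c' is all-NaN
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [7, 7, 7],
    "c": [np.nan, np.nan, np.nan],
})

# Same rule as the preprocessing loop: at most one distinct value per column
const_mask = df.nunique(dropna=False) <= 1
dropped = df.columns[const_mask].tolist()
kept = df.loc[:, ~const_mask]

print(dropped)                 # ['b', 'c']
print(kept.columns.tolist())   # ['a']
```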

Dummify

A structured preprocessing pipeline was implemented prior to modeling. Ordered categorical variables (e.g., housing status, educational attainment, clinical evaluations, and substance use frequency) were manually mapped to numeric scales reflecting their natural ordering. For nominal categorical variables, prespecified reference categories were enforced to ensure consistent baseline comparisons across imputations. All remaining categorical predictors were then converted to dummy variables using one-hot encoding with the first category dropped to prevent multicollinearity. The procedure was applied consistently across all imputed datasets to ensure harmonized model inputs.

Code
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

def preprocess_features_robust(df):
    df_clean = df.copy()

    # ---------------------------------------------------------
    # 1. Ordinal encoding (your existing code)
    # ---------------------------------------------------------
    ordered_mappings = {
        # --- NEW: Housing & Urbanicity ---
        "tenure_status_household": {
            "illegal settlement": 4,                       # Situación Calle
            "stays temporarily with a relative": 3,        # Allegado
            "others": 2,                                   # En pensión / Otros
            "renting": 1,                                  # Arrendando
            "owner/transferred dwellings/pays dividends": 0 # Vivienda Propia
        },
        "urbanicity_cat": {
            "1.Rural": 2,
            "2.Mixed": 1,
            "3.Urban": 0
        },

        # --- Clinical Evaluations (Minimo -> Intermedio -> Alto) ---
        "evaluacindelprocesoteraputico": {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_consumo":      {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_fam":          {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_relinterp":    {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_ocupacion":    {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_sm":           {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_fisica":       {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},
        "eva_transgnorma":  {"logro minimo": 2, "logro intermedio": 1, "logro alto": 0},

        # --- Frequency (Less freq -> More freq) ---
        "prim_sub_freq_rec": {
            "1.≤1 day/wk": 0,
            "2.2–6 days/wk": 1,
            "3.Daily": 2
        },

        # --- Education (Less -> More) ---
        "ed_attainment_corr": {
            "3-Completed primary school or less": 2,
            "2-Completed high school or less": 1,
            "1-More than high school": 0
        }
    }

    for col, mapping in ordered_mappings.items():
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].astype(str).str.strip()
            df_clean[col] = df_clean[col].map(mapping)

            n_missing = df_clean[col].isnull().sum()
            if n_missing > 0:
                if n_missing == len(df_clean):
                    print(f"⚠️ WARNING: Mapping failed completely for '{col}'.")
                mode_val = df_clean[col].mode()[0]
                df_clean[col] = df_clean[col].fillna(mode_val)

    # ---------------------------------------------------------
    # 2. FORCE reference categories for dummies
    # ---------------------------------------------------------
    dummy_reference = {
        "sex_rec": "man",
        "marital_status_rec": "married/cohabiting",
        "cohabitation": "alone",
        "sub_dep_icd10_status": "hazardous consumption",
        "tr_outcome": "completion",
        "adm_motive": "spontaneous consultation",
        "tipo_de_vivienda_rec2": "formal housing",
        # A duplicate "ambulatory" entry was removed: in a Python dict the last
        # occurrence of a repeated key silently wins, so "pg-pab" is the reference.
        "plan_type_corr": "pg-pab",
        "occupation_condition_corr24": "employed",
        "any_violence": "0.No domestic violence/sex abuse",
        "first_sub_used": "marijuana",
        }

    for col, ref in dummy_reference.items():
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].astype(str).str.strip()
            cats = df_clean[col].unique().tolist()

            if ref in cats:
                new_order = [ref] + [c for c in cats if c != ref]
                cat_type = CategoricalDtype(categories=new_order, ordered=False)
                df_clean[col] = df_clean[col].astype(cat_type)
            else:
                print(f"⚠️ Reference '{ref}' not found in {col}")

    # ---------------------------------------------------------
    # 3. One-hot encoding
    # ---------------------------------------------------------
    df_final = pd.get_dummies(df_clean, drop_first=True, dtype=float)

    return df_final

X_encoded_list_final = [preprocess_features_robust(X) for X in X_reduced_list]
X_encoded_list_final = [clean_names(X) for X in X_encoded_list_final]
Code
from IPython.display import display, Markdown

# 1. DIAGNOSTIC: Check exact string values
display(Markdown("### --- Diagnostic Check ---"))
sample_df = X_encoded_list_final[0]

if 'tenure_status_household' in sample_df.columns:
    display(Markdown("**Unique values in 'tenure_status_household':**"))
    display(Markdown(str(sample_df['tenure_status_household'].unique())))
else:
    display(Markdown("❌ 'tenure_status_household' is missing entirely from input data!"))

if 'urbanicity_cat' in sample_df.columns:
    display(Markdown("**Unique values in 'urbanicity_cat':**"))
    display(Markdown(str(sample_df['urbanicity_cat'].unique())))

if 'ed_attainment_corr' in sample_df.columns:
    display(Markdown("**Unique values in 'ed_attainment_corr':**"))
    display(Markdown(str(sample_df['ed_attainment_corr'].unique())))

— Diagnostic Check —

Unique values in ‘tenure_status_household’:

[3 0 1 2 4]

Unique values in ‘urbanicity_cat’:

[0 1 2]

Unique values in ‘ed_attainment_corr’:

[1 2 0]

We recoded first substance used so that rare categories are collapsed into a single “Other” category.

Code
# Columns to combine
cols_to_group = [
    "first_sub_used_opioids",
    "first_sub_used_others",
    "first_sub_used_hallucinogens",
    "first_sub_used_inhalants",
    "first_sub_used_tranquilizers_hypnotics",
    "first_sub_used_amphetamine_type_stimulants",
]

# Loop over datasets 0–4 and modify in place
for i in range(5):
    df = X_encoded_list_final[i].copy()
    # Collapse into one dummy: if any of these == 1, mark as 1
    df["first_sub_used_other"] = df[cols_to_group].max(axis=1)
    # Drop the original rare-category dummies (the combined column has a different name)
    df = df.drop(columns=cols_to_group)
    # Replace the dataset in the original list
    X_encoded_list_final[i] = df
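As a quick sanity check, the row-wise `max` used above behaves as a logical OR across 0/1 dummies; this toy frame (hypothetical values, only two of the grouped columns) illustrates:

```python
import pandas as pd

# Row 0 hits the first dummy, row 1 hits the second, row 2 hits neither
toy = pd.DataFrame({
    "first_sub_used_opioids":   [1.0, 0.0, 0.0],
    "first_sub_used_inhalants": [0.0, 1.0, 0.0],
})

# Row-wise max over 0/1 indicators acts as an OR
toy["first_sub_used_other"] = toy[["first_sub_used_opioids",
                                   "first_sub_used_inhalants"]].max(axis=1)
collapsed = toy.drop(columns=["first_sub_used_opioids", "first_sub_used_inhalants"])
print(collapsed["first_sub_used_other"].tolist())  # [1.0, 1.0, 0.0]
```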
Code
import sys
fold_output(
    "Show first db of X_encoded_list_final (newer) glimpse",
    lambda: glimpse(X_encoded_list_final[0])
)
Show first db of X_encoded_list_final (newer) glimpse
Rows: 88504 | Columns: 56
adm_age_rec3                   float64         31.53, 20.61, 42.52, 60.61, 45.08
porc_pobr                      float64         0.175679117441177, 0.187835901975632, 0.130412444472313, 0.133759185671806, 0.08...
dit_m                          float64         15.967741935483872, 5.833333333333334, 0.4752688172043005, 6.966666666666667, 6....
tenure_status_household        int64           3, 0, 3, 0, 3
prim_sub_freq_rec              int64           1, 2, 2, 2, 2
national_foreign               int32           0, 0, 0, 0, 0
urbanicity_cat                 int64           0, 0, 0, 0, 0
ed_attainment_corr             int64           1, 2, 1, 1, 2
evaluacindelprocesoteraputico  int64           0, 2, 2, 2, 0
eva_consumo                    int64           0, 2, 2, 1, 0
eva_fam                        int64           1, 2, 2, 1, 0
eva_relinterp                  int64           0, 2, 2, 1, 0
eva_ocupacion                  int64           0, 2, 2, 2, 1
eva_sm                         int64           1, 2, 2, 1, 2
eva_fisica                     int64           0, 2, 1, 1, 0
eva_transgnorma                int64           0, 2, 2, 2, 1
ethnicity                      float64         0.0, 0.0, 0.0, 0.0, 0.0
dg_psiq_cie_10_instudy         bool            False, False, True, False, False
dg_psiq_cie_10_dg              bool            True, False, False, True, False
dx_f3_mood                     int32           0, 0, 0, 0, 0
dx_f6_personality              int32           0, 0, 0, 1, 0
dx_f_any_severe_mental         bool            True, False, False, False, False
any_phys_dx                    bool            False, False, False, True, False
polysubstance_strict           int32           0, 1, 1, 1, 1
sex_rec_woman                  float64         0.0, 0.0, 0.0, 1.0, 0.0
cohabitation_family_of_origin  float64         0.0, 1.0, 0.0, 0.0, 1.0
cohabitation_with_couple_children float64         0.0, 0.0, 1.0, 1.0, 0.0
cohabitation_others            float64         0.0, 0.0, 0.0, 0.0, 0.0
sub_dep_icd10_status_drug_dependence float64         1.0, 0.0, 1.0, 1.0, 1.0
any_violence_1_domestic_violence_sex_abuse float64         0.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_referral            float64         1.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_dropout             float64         0.0, 1.0, 0.0, 1.0, 0.0
tr_outcome_adm_discharge_rule_violation_undet float64         0.0, 0.0, 1.0, 0.0, 0.0
tr_outcome_adm_discharge_adm_reasons float64         0.0, 0.0, 0.0, 0.0, 0.0
tr_outcome_other               float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_sanitary_sector     float64         1.0, 0.0, 1.0, 1.0, 0.0
adm_motive_another_sud_facility_fonodrogas_senda_previene float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_justice_sector      float64         0.0, 0.0, 0.0, 0.0, 0.0
adm_motive_other               float64         0.0, 0.0, 0.0, 0.0, 0.0
first_sub_used_alcohol         float64         1.0, 1.0, 1.0, 0.0, 1.0
first_sub_used_cocaine_paste   float64         0.0, 0.0, 0.0, 1.0, 0.0
first_sub_used_cocaine_powder  float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_cocaine_paste  float64         0.0, 1.0, 1.0, 1.0, 1.0
primary_sub_mod_cocaine_powder float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_marijuana      float64         0.0, 0.0, 0.0, 0.0, 0.0
primary_sub_mod_others         float64         0.0, 0.0, 0.0, 0.0, 0.0
tipo_de_vivienda_rec2_other_unknown float64         1.0, 0.0, 0.0, 0.0, 0.0
plan_type_corr_pg_pr           float64         0.0, 0.0, 1.0, 0.0, 0.0
plan_type_corr_m_pr            float64         0.0, 0.0, 0.0, 1.0, 0.0
plan_type_corr_pg_pai          float64         0.0, 0.0, 0.0, 0.0, 1.0
plan_type_corr_m_pai           float64         0.0, 0.0, 0.0, 0.0, 0.0
occupation_condition_corr24_unemployed float64         1.0, 0.0, 0.0, 0.0, 1.0
occupation_condition_corr24_inactive float64         0.0, 0.0, 0.0, 1.0, 0.0
marital_status_rec_single      float64         1.0, 1.0, 1.0, 0.0, 1.0
marital_status_rec_separated_divorced_annulled_widowed float64         0.0, 0.0, 0.0, 0.0, 0.0
first_sub_used_other           float64         0.0, 0.0, 0.0, 0.0, 0.0

For each imputed dataset, we fitted two regularized Cox models (one for readmission and one for death) using Coxnet, which applies elastic-net penalization with a strong LASSO component to enable variable selection. The loop fits both models on every imputation, prints basic model information, and stores all fitted models so they can later be combined or compared across imputations.
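The elastic-net penalty Coxnet applies can be written out directly. The sketch below is illustrative only; the `alpha` and `l1_ratio` values are hypothetical, not the tuned ones:

```python
import numpy as np

def elastic_net_penalty(beta, alpha=1.0, l1_ratio=0.9):
    """Elastic-net penalty: alpha * (l1_ratio * ||b||_1 + 0.5*(1-l1_ratio) * ||b||_2^2).

    A high l1_ratio (strong LASSO component) drives coefficients exactly to zero,
    which is what enables variable selection in Coxnet.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()            # LASSO term
    l2 = 0.5 * (beta ** 2).sum()       # ridge term
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

# l1 = 3, l2 = 2.5  ->  0.9*3 + 0.1*2.5 = 2.95
print(elastic_net_penalty([0.0, 2.0, -1.0], alpha=1.0, l1_ratio=0.9))
```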

Create bins for follow-up (landmarks)

We extracted the observed event times and corresponding event indicators directly from the structured survival objects (y_surv_readm and y_surv_death). Using the observed event times, we constructed evaluation grids based on the 5th to 95th percentiles of the event-time distribution. These grids define standardized time points at which model performance is assessed for both readmission and mortality outcomes.

Code
import numpy as np
from IPython.display import display, Markdown

# Extract event times directly from structured arrays
event_times_readm = y_surv_readm["time"][y_surv_readm["event"]]
event_times_death = y_surv_death["time"][y_surv_death["event"]]

# Build evaluation grids (5th–95th percentiles, 50 points)
times_eval_readm = np.unique(
    np.quantile(event_times_readm, np.linspace(0.05, 0.95, 50))
)

times_eval_death = np.unique(
    np.quantile(event_times_death, np.linspace(0.05, 0.95, 50))
)

# Display only final result
display(Markdown(
    f"**Eval times (readmission):** `{times_eval_readm[:5]}` ... `{times_eval_readm[-5:]}`"
))

display(Markdown(
    f"**Eval times (death):** `{times_eval_death[:5]}` ... `{times_eval_death[-5:]}`"
))

Eval times (readmission): [0.38709677 0.67741935 1.03225806 1.41935484 1.76666667] ... [46.81833443 50.96030612 55.16129032 60.84848585 67.08322581]

Eval times (death): [0. 0.09677419 1.06666667 2.1691691 3.34812377] ... [74.72632653 78.4516129 82.39472203 86.41935484 92.36311828]

Correct immortal time bias

First, we corrected for immortal time bias: patients who died would otherwise appear in the readmission analysis as event-free follow-up extending past their death.

This correction is essentially the cause-specific hazard preparation. It is the correct way to handle Aim 3 unless you switch to a Fine-Gray model (which codes death as a distinct event type, 2, rather than as censoring, 0). For cause-specific models such as RSF or Coxnet, censoring at the time of death (event = 0) is the correct approach.

Code
import numpy as np

# Step 3. Replicate across imputations (safe copies)
n_imputations = len(X_encoded_list_final)
y_surv_readm_list = [y_surv_readm.copy() for _ in range(n_imputations)]
y_surv_death_list = [y_surv_death.copy() for _ in range(n_imputations)]

def correct_competing_risks(y_readm_list, y_death_list):
    corrected = []
    for y_readm, y_death in zip(y_readm_list, y_death_list):
        y_corr = y_readm.copy()

        # death observed and occurring strictly before the readmission/censoring time
        mask = (y_death["event"]) & (y_death["time"] < y_corr["time"])

        y_corr["event"][mask] = False
        y_corr["time"][mask] = y_death["time"][mask]

        corrected.append(y_corr)
    return corrected

# Step 4. Apply correction
y_surv_readm_list_corrected = correct_competing_risks(
    y_surv_readm_list,
    y_surv_death_list
)
Code
# Check type and length
type(y_surv_readm_list_corrected), len(y_surv_readm_list_corrected)

# Look at the first element
y_surv_readm_list_corrected[0][:5]   # first 5 rows
array([(False, 68.96774194), ( True,  7.        ), ( True, 13.25806452),
       ( True,  5.        ), ( True,  7.35483871)],
      dtype=[('event', '?'), ('time', '<f8')])
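The censoring logic of `correct_competing_risks` can be verified on a two-row toy array with the same structured dtype:

```python
import numpy as np

dt = [("event", "?"), ("time", "<f8")]

# Row 0: death at t=5 precedes the readmission record at t=10 -> censor at 5
# Row 1: no observed death -> readmission record left untouched
y_readm = np.array([(True, 10.0), (True, 4.0)], dtype=dt)
y_death = np.array([(True, 5.0), (False, 60.0)], dtype=dt)

y_corr = y_readm.copy()
mask = y_death["event"] & (y_death["time"] < y_corr["time"])
y_corr["event"][mask] = False          # readmission can no longer occur
y_corr["time"][mask] = y_death["time"][mask]  # truncate follow-up at death

print(y_corr.tolist())  # [(False, 5.0), (True, 4.0)]
```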
Code
from IPython.display import display, HTML
import html

def nb_print(*args, sep=" "):
    msg = sep.join(str(a) for a in args)
    display(HTML(f"<pre style='margin:0'>{html.escape(msg)}</pre>"))

The fully preprocessed and encoded feature matrices were renamed from X_encoded_list_final to imputations_list_mar26 to reflect the finalized March 2026 analytic version of the imputed datasets.

This object contains the harmonized, ordinal-encoded, and one-hot encoded predictor matrices for all five imputations and will serve as the definitive input for subsequent modeling procedures.

Code
imputations_list_mar26 = X_encoded_list_final
del X_encoded_list_final
Code
import numpy as np
import pandas as pd
from IPython.display import display, Markdown

# ── Build exclusion mask (same for all imputations, based on imputation 0) ──
df0 = imputations_list_mar26[0]

# Condition 1: tr_outcome_adm_discharge_adm_reasons == 1 AND death within ~7 days (0.23 months)
mask_adm_death = (
    (df0["tr_outcome_adm_discharge_adm_reasons"] == 1)
    & (y_surv_death["event"] == True)
    & (y_surv_death["time"] <= 0.23)
)

# Condition 2: tr_outcome_other == 1 (any time)
mask_other = df0["tr_outcome_other"] == 1

# Combined exclusion mask
exclude = mask_adm_death | mask_other
keep = ~exclude

# ── Report ──
n_total = len(df0)
n_excl_adm = mask_adm_death.sum()
n_excl_other = mask_other.sum()
n_excl_both = (mask_adm_death & mask_other).sum()
n_excl_total = exclude.sum()
n_remaining = keep.sum()

report = f"""### Exclusion Report

| Criterion | n excluded |
|---|---:|
| `tr_outcome_adm_discharge_adm_reasons == 1` & death time ≤ 7 days | {n_excl_adm} |
| `tr_outcome_other == 1` (any time) | {n_excl_other} |
| Both criteria (overlap) | {n_excl_both} |
| **Total unique excluded** | **{n_excl_total}** |
| **Remaining observations** | **{n_remaining}** / {n_total} |
"""
display(Markdown(report))

# ── Apply filter to all imputation lists and outcome arrays ──
imputations_list_mar26 = [df.loc[keep].reset_index(drop=True) for df in imputations_list_mar26]
y_surv_readm_list_mar26 = [y[keep] for y in y_surv_readm_list_corrected]
y_surv_death_list_mar26 = [y[keep] for y in y_surv_death_list]
y_surv_readm_list_corrected_mar26 = [y[keep] for y in y_surv_readm_list_corrected]

# Single (non-list) outcome arrays for convenience
y_surv_readm_mar26 = y_surv_readm[keep]
y_surv_death_mar26 = y_surv_death[keep]

# Rebuild eval time grids on the filtered data
event_times_readm_mar26 = y_surv_readm_mar26["time"][y_surv_readm_mar26["event"]]
event_times_death_mar26 = y_surv_death_mar26["time"][y_surv_death_mar26["event"]]

times_eval_readm_mar26 = np.unique(
    np.quantile(event_times_readm_mar26, np.linspace(0.05, 0.95, 50))
)
times_eval_death_mar26 = np.unique(
    np.quantile(event_times_death_mar26, np.linspace(0.05, 0.95, 50))
)

Exclusion Report

| Criterion | n excluded |
|---|---:|
| `tr_outcome_adm_discharge_adm_reasons == 1` & death time ≤ 7 days | 137 |
| `tr_outcome_other == 1` (any time) | 215 |
| Both criteria (overlap) | 0 |
| **Total unique excluded** | **352** |
| **Remaining observations** | **88152** / 88504 |

Train / test split (80/20)

  1. Sets a fixed random seed to make the 80/20 split exactly reproducible.

  2. Verifies required datasets exist (features and survival outcomes) before doing anything.

  3. Creates a “death-corrected” outcome list if it was not already available.

  4. Derives stratification labels from treatment plan and completion categories plus readmission/death events.

  5. Uses a step-down strategy if some strata are too rare, simplifying the stratification to keep it feasible.

  6. Caches a “full snapshot” of all imputations and outcomes so reruns don’t silently change the split.

  7. Checks row alignment so every imputation and every outcome has the same number of observations.

  8. Optionally checks stability across imputations for plan/completion columns (should not vary much).

  9. Loads split indices from disk when available, ensuring the exact same train/test split across sessions.

  10. Builds train/test datasets consistently for all imputations, then runs strict diagnostics to confirm balance.
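The step-down logic in point 5 can be sketched in isolation: composite labels are kept only when every stratum holds at least two rows; rare strata are first re-labeled with a coarser fallback, and only then is the full fallback used. The labels below are hypothetical toy strata, not the notebook's plan/outcome codes:

```python
import pandas as pd

def pick_strata(full_labels, fallback_labels, min_count=2):
    """Return the finest stratification whose smallest stratum has >= min_count rows."""
    full = pd.Series(full_labels)
    fallback = pd.Series(fallback_labels)

    if full.map(full.value_counts()).min() >= min_count:
        return full, "full"

    # Replace only the rare full strata with the coarser fallback label
    mixed = full.where(full.map(full.value_counts()) >= min_count, fallback)
    if mixed.map(mixed.value_counts()).min() >= min_count:
        return mixed, "mixed"

    if fallback.map(fallback.value_counts()).min() >= min_count:
        return fallback, "fallback"
    raise ValueError("No feasible stratification with >= min_count rows per stratum.")

# "1_2" and "1_3" are singleton full strata; both coarsen to "1", which then has 2 rows
full = ["1_1", "1_1", "1_2", "1_3"]
fb   = ["1",   "1",   "1",   "1"]
strata, mode = pick_strata(full, fb)
print(mode)                # mixed
print(strata.tolist())     # ['1_1', '1_1', '1', '1']
```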

Code
#@title 🧪 / 🎓 Reproducible 80/20 split before ML (integrated + idempotent + persisted)
# Stratification hierarchy:
#   1) plan + completion + readm_event + death_event
#   2) mixed fallback for rare full strata (<2) -> plan + readm + death
#   3) full fallback -> plan + readm + death

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from IPython.display import display, Markdown

SEED = 2125
TEST_SIZE = 0.20
FORCE_RESPLIT = False          # True to force new split
STRICT_SPLIT_CHECKS = True     # CI-style hard checks
MAX_EVENT_GAP = 0.01           # 1% tolerance
PERSIST_SPLIT_INDICES = True

# --- Project-root anchored paths ---

def find_project_root(markers=("AGENTS.md", ".git")):
    try:
        cur = Path.cwd().resolve()
    except OSError as e:
        raise RuntimeError(
            "Invalid working directory. Run this notebook from inside the project folder."
        ) from e

    for p in (cur, *cur.parents):
        if any((p / m).exists() for m in markers):
            return p

    raise RuntimeError(
        f"Could not locate project root starting from {cur}. "
        f"Expected one of markers: {markers}."
    )

PROJECT_ROOT = find_project_root()
OUT_DIR = PROJECT_ROOT / "_out"
OUT_DIR.mkdir(parents=True, exist_ok=True)

SPLIT_FILE = OUT_DIR / f"death_split_seed{SEED}_test{int(TEST_SIZE*100)}_mar26.npz"

def nb_print_md(msg):
    display(Markdown(str(msg)))

nb_print_md(f"**Project root:** `{PROJECT_ROOT}`")


# ---------- Requirements ----------
required = [
    "imputations_list_mar26",
    "y_surv_readm_list_mar26",
    "y_surv_readm_list_corrected_mar26",
    "y_surv_death_list_mar26",
]
missing = [v for v in required if v not in globals()]
if missing:
    raise ValueError(f"Missing required objects: {missing}")

if "y_surv_death_list_corrected_mar26" not in globals():
    y_surv_death_list_corrected_mar26 = [y.copy() for y in y_surv_death_list_mar26]

# ---------- Helpers ----------
def get_plan_labels(df):
    labels = np.zeros(len(df), dtype=int)
    if "plan_type_corr_pg_pr" in df.columns:
        labels[pd.to_numeric(df["plan_type_corr_pg_pr"], errors="coerce").fillna(0).to_numpy() == 1] = 1
    if "plan_type_corr_m_pr" in df.columns:
        labels[pd.to_numeric(df["plan_type_corr_m_pr"], errors="coerce").fillna(0).to_numpy() == 1] = 2
    if "plan_type_corr_pg_pai" in df.columns:
        labels[pd.to_numeric(df["plan_type_corr_pg_pai"], errors="coerce").fillna(0).to_numpy() == 1] = 3
    if "plan_type_corr_m_pai" in df.columns:
        labels[pd.to_numeric(df["plan_type_corr_m_pai"], errors="coerce").fillna(0).to_numpy() == 1] = 4
    return labels

def get_completion_labels(df):
    labels = np.zeros(len(df), dtype=int)
    if "tr_outcome_referral" in df.columns:
        labels[pd.to_numeric(df["tr_outcome_referral"], errors="coerce").fillna(0).to_numpy() == 1] = 1
    if "tr_outcome_dropout" in df.columns:
        labels[pd.to_numeric(df["tr_outcome_dropout"], errors="coerce").fillna(0).to_numpy() == 1] = 2
    if "tr_outcome_adm_discharge_rule_violation_undet" in df.columns:
        labels[pd.to_numeric(df["tr_outcome_adm_discharge_rule_violation_undet"], errors="coerce").fillna(0).to_numpy() == 1] = 3
    if "tr_outcome_adm_discharge_adm_reasons" in df.columns:
        labels[pd.to_numeric(df["tr_outcome_adm_discharge_adm_reasons"], errors="coerce").fillna(0).to_numpy() == 1] = 4
    if "tr_outcome_other" in df.columns:
        labels[pd.to_numeric(df["tr_outcome_other"], errors="coerce").fillna(0).to_numpy() == 1] = 5
    return labels

def build_strata(X0, y_readm0, y_death0):
    """
    Build stratification labels with progressive fallback:

    1) full: plan + completion + readmission_event + death_event
    2) mixed: only rare full strata (<2 rows) are replaced by fallback labels
       (plan + readmission_event + death_event)
    3) fallback: plan + readmission_event + death_event for all rows

    Returns:
        strata (np.ndarray), mode (str), readm_evt (np.ndarray), death_evt (np.ndarray)
    """
    plan = get_plan_labels(X0)
    comp = get_completion_labels(X0)
    readm_evt = y_readm0["event"].astype(int)
    death_evt = y_death0["event"].astype(int)

    full = pd.Series(plan.astype(str) + "_" + comp.astype(str) + "_" + readm_evt.astype(str) + "_" + death_evt.astype(str))
    if full.value_counts().min() >= 2:
        return full.to_numpy(), "full(plan+completion+readm+death)", readm_evt, death_evt

    fb = pd.Series(plan.astype(str) + "_" + readm_evt.astype(str) + "_" + death_evt.astype(str))
    mixed = full.copy()
    rare = mixed.map(mixed.value_counts()) < 2
    mixed[rare] = fb[rare]
    if mixed.value_counts().min() >= 2:
        return mixed.to_numpy(), "mixed(rare->plan+readm+death)", readm_evt, death_evt

    if fb.value_counts().min() >= 2:
        return fb.to_numpy(), "fallback(plan+readm+death)", readm_evt, death_evt

    raise ValueError("Could not build stratification labels with >=2 rows per stratum.")

def split_df_list(df_list, tr_idx, te_idx):
    tr = [df.iloc[tr_idx].reset_index(drop=True).copy() for df in df_list]
    te = [df.iloc[te_idx].reset_index(drop=True).copy() for df in df_list]
    return tr, te

def split_surv_list(y_list, tr_idx, te_idx):
    tr = [y[tr_idx].copy() for y in y_list]
    te = [y[te_idx].copy() for y in y_list]
    return tr, te

# ---------- Cache full data once (idempotent re-runs) ----------
if "_split_cache_death_mar26" not in globals():
    _split_cache_death_mar26 = {}
cache = _split_cache_death_mar26

if FORCE_RESPLIT:
    cache.pop("idx", None)

if FORCE_RESPLIT or "full" not in cache:
    cache["full"] = {
        "X": [df.reset_index(drop=True).copy() for df in imputations_list_mar26],
        "y_readm": [y.copy() for y in y_surv_readm_list_mar26],
        "y_readm_corr": [y.copy() for y in y_surv_readm_list_corrected_mar26],
        "y_death": [y.copy() for y in y_surv_death_list_mar26],
        "y_death_corr": [y.copy() for y in y_surv_death_list_corrected_mar26],
    }

full = cache["full"]

# ---------- Consistency checks ----------
n_imp = len(full["X"])
n = len(full["X"][0])

if any(len(df) != n for df in full["X"]):
    raise ValueError("Row mismatch inside full X list.")

for name, obj in [
    ("y_readm", full["y_readm"]),
    ("y_readm_corr", full["y_readm_corr"]),
    ("y_death", full["y_death"]),
    ("y_death_corr", full["y_death_corr"]),
]:
    if len(obj) != n_imp:
        raise ValueError(f"{name} length ({len(obj)}) != n_imputations ({n_imp})")
    if any(len(y) != n for y in obj):
        raise ValueError(f"Row mismatch between X and {name}.")

# ---------- Optional diagnostic: plan/completion consistency across imputations ----------
plan_comp_cols = [
    c for c in [
        "plan_type_corr_pg_pr",
        "plan_type_corr_m_pr",
        "plan_type_corr_pg_pai",
        "plan_type_corr_m_pai",
        "tr_outcome_referral",
        "tr_outcome_dropout",
        "tr_outcome_adm_discharge_rule_violation_undet",
        "tr_outcome_adm_discharge_adm_reasons"#,
        #"tr_outcome_other", #2026-03-26: Excluded from consistency check since it was used as an exclusion criterion and thus may differ by design across imputations
    ] if c in full["X"][0].columns
]

max_diff_rows = 0
if plan_comp_cols:
    base_pc = full["X"][0][plan_comp_cols].astype("string").fillna("__NA__").reset_index(drop=True)
    for i in range(1, n_imp):
        cur_pc = full["X"][i][plan_comp_cols].astype("string").fillna("__NA__").reset_index(drop=True)
        diff_rows = int((base_pc != cur_pc).any(axis=1).sum())
        max_diff_rows = max(max_diff_rows, diff_rows)

# ---------- Try loading indices from disk ----------
loaded_from_disk = False
if PERSIST_SPLIT_INDICES and (not FORCE_RESPLIT) and SPLIT_FILE.exists() and ("idx" not in cache):
    z = np.load(SPLIT_FILE, allow_pickle=False)
    tr = z["train_idx"].astype(int)
    te = z["test_idx"].astype(int)
    n_disk = int(z["n_full"][0]) if "n_full" in z else n
    if n_disk == n and tr.max() < n and te.max() < n:
        cache["idx"] = (np.sort(tr), np.sort(te))
        cache["strat_mode"] = str(z["strat_mode"][0]) if "strat_mode" in z else "loaded_from_disk"
        loaded_from_disk = True

# ---------- Compute or reuse split indices ----------
if FORCE_RESPLIT or "idx" not in cache:
    strata_used, strat_mode, readm_evt_all, death_evt_all = build_strata(
        full["X"][0], full["y_readm"][0], full["y_death"][0]
    )
    idx = np.arange(n)
    train_idx, test_idx = train_test_split(
        idx, test_size=TEST_SIZE, random_state=SEED, shuffle=True, stratify=strata_used
    )
    train_idx = np.sort(train_idx)
    test_idx = np.sort(test_idx)
    cache["idx"] = (train_idx, test_idx)
    cache["strat_mode"] = strat_mode

    if PERSIST_SPLIT_INDICES:
        np.savez_compressed(
            SPLIT_FILE,
            train_idx=train_idx,
            test_idx=test_idx,
            n_full=np.array([n], dtype=int),
            seed=np.array([SEED], dtype=int),
            test_size=np.array([TEST_SIZE], dtype=float),
            strat_mode=np.array([strat_mode], dtype="U64"),
        )
else:
    train_idx, test_idx = cache["idx"]
    train_idx = np.sort(train_idx)
    test_idx = np.sort(test_idx)
    readm_evt_all = full["y_readm"][0]["event"].astype(int)
    death_evt_all = full["y_death"][0]["event"].astype(int)

# ---------- Build train/test from full snapshot every run ----------
imputations_list_mar26_train, imputations_list_mar26_test = split_df_list(full["X"], train_idx, test_idx)

y_surv_readm_list_train, y_surv_readm_list_test = split_surv_list(full["y_readm"], train_idx, test_idx)
y_surv_readm_list_corrected_train, y_surv_readm_list_corrected_test = split_surv_list(full["y_readm_corr"], train_idx, test_idx)

y_surv_death_list_train, y_surv_death_list_test = split_surv_list(full["y_death"], train_idx, test_idx)
y_surv_death_list_corrected_train, y_surv_death_list_corrected_test = split_surv_list(full["y_death_corr"], train_idx, test_idx)

# Downstream code uses TRAIN only
imputations_list_mar26 = imputations_list_mar26_train
y_surv_readm_list = y_surv_readm_list_train
y_surv_readm_list_corrected = y_surv_readm_list_corrected_train
y_surv_death_list = y_surv_death_list_train
y_surv_death_list_corrected = y_surv_death_list_corrected_train

# ---------- Diagnostics + strict checks ----------
strata_diag, strat_mode_diag, _, _ = build_strata(full["X"][0], full["y_readm"][0], full["y_death"][0])
sdiag = pd.Series(strata_diag)
train_strata = set(sdiag.iloc[train_idx].unique())
test_strata = set(sdiag.iloc[test_idx].unique())
missing_in_test = sorted(train_strata - test_strata)
missing_in_train = sorted(test_strata - train_strata)

readm_gap = abs(readm_evt_all[train_idx].mean() - readm_evt_all[test_idx].mean())
death_gap = abs(death_evt_all[train_idx].mean() - death_evt_all[test_idx].mean())

# full-strata rarity report (before fallback)
strata_full = pd.Series(
    get_plan_labels(full["X"][0]).astype(str) + "_" +
    get_completion_labels(full["X"][0]).astype(str) + "_" +
    full["y_readm"][0]["event"].astype(int).astype(str) + "_" +
    full["y_death"][0]["event"].astype(int).astype(str)
)
vc_full = strata_full.value_counts()
rare_rows = int((strata_full.map(vc_full) < 2).sum())

if STRICT_SPLIT_CHECKS:
    assert len(np.intersect1d(train_idx, test_idx)) == 0, "Train/Test index overlap detected."
    assert (len(train_idx) + len(test_idx)) == n, "Train/Test sizes do not sum to n."
    assert len(missing_in_test) == 0, f"Strata missing in test: {missing_in_test}"
    assert len(missing_in_train) == 0, f"Strata missing in train: {missing_in_train}"
    assert readm_gap < MAX_EVENT_GAP, f"Readmission rate imbalance > {MAX_EVENT_GAP:.0%} (gap={readm_gap:.4f})"
    assert death_gap < MAX_EVENT_GAP, f"Death rate imbalance > {MAX_EVENT_GAP:.0%} (gap={death_gap:.4f})"

# ---------- Summary ----------
nb_print_md(f"**Loaded indices from disk:** `{loaded_from_disk}`")
nb_print_md(f"**Split file:** `{SPLIT_FILE}`")
nb_print_md(f"**Split mode used:** `{cache.get('strat_mode', strat_mode_diag)}`")
nb_print_md(f"**Plan/completion diff rows across imputations (max vs imp0):** `{max_diff_rows}`")
nb_print_md(f"**Full strata count:** `{vc_full.size}` | **Min full stratum size:** `{int(vc_full.min())}` | **Rows in rare full strata (<2):** `{rare_rows}`")
nb_print_md(f"**Train/Test sizes:** `{len(train_idx)}` ({len(train_idx)/n:.1%}) / `{len(test_idx)}` ({len(test_idx)/n:.1%})")
nb_print_md(
    "**Readmission rate all/train/test:** "
    f"`{readm_evt_all.mean():.3%}` / `{readm_evt_all[train_idx].mean():.3%}` / `{readm_evt_all[test_idx].mean():.3%}`"
)
nb_print_md(
    "**Death rate all/train/test:** "
    f"`{death_evt_all.mean():.3%}` / `{death_evt_all[train_idx].mean():.3%}` / `{death_evt_all[test_idx].mean():.3%}`"
)
nb_print_md(
    f"**Strata in train/test:** `{len(train_strata)}` / `{len(test_strata)}` | "
    f"**Missing train→test:** `{len(missing_in_test)}` | **Missing test→train:** `{len(missing_in_train)}`"
)

Project root: G:\My Drive\Alvacast\SISTRAT 2023\cons

Loaded indices from disk: True

Split file: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\death_split_seed2125_test20_mar26.npz

Split mode used: fallback(plan+readm+death)

Plan/completion diff rows across imputations (max vs imp0): 0

Full strata count: 95 | Min full stratum size: 1 | Rows in rare full strata (<2): 4

Train/Test sizes: 70521 (80.0%) / 17631 (20.0%)

Readmission rate all/train/test: 21.622% / 21.621% / 21.627%

Death rate all/train/test: 4.310% / 4.309% / 4.311%

Strata in train/test: 20 / 20 | Missing train→test: 0 | Missing test→train: 0

Code
# counts per stratum in train/test
train_counts = sdiag.iloc[train_idx].value_counts()
test_counts  = sdiag.iloc[test_idx].value_counts()

min_train = int(train_counts.min())
min_test  = int(test_counts.min())

nb_print_md(f"**Min stratum count in TRAIN (used strata):** `{min_train}`")
nb_print_md(f"**Min stratum count in TEST (used strata):** `{min_test}`")

# strata that got 0 in test or 0 in train
zero_in_test = sorted(set(train_counts.index) - set(test_counts.index))
zero_in_train = sorted(set(test_counts.index) - set(train_counts.index))

nb_print_md(f"**Strata with 0 in TEST:** `{len(zero_in_test)}`")
nb_print_md(f"**Strata with 0 in TRAIN:** `{len(zero_in_train)}`")

# show examples with their full-data counts
if len(zero_in_test) > 0:
    ex = zero_in_test[:10]
    nb_print_md(f"**Examples 0 in TEST (up to 10):** `{ex}`")
    nb_print_md(f"**Full-data counts:** `{[int(sdiag.value_counts()[k]) for k in ex]}`")

Min stratum count in TRAIN (used strata): 18

Min stratum count in TEST (used strata): 5

Strata with 0 in TEST: 0

Strata with 0 in TRAIN: 0

Code
strata_full = pd.Series(
    get_plan_labels(full["X"][0]).astype(str) + "_" +
    get_completion_labels(full["X"][0]).astype(str) + "_" +
    full["y_readm"][0]["event"].astype(int).astype(str) + "_" +
    full["y_death"][0]["event"].astype(int).astype(str)
)

vc = strata_full.value_counts()
display(Markdown(f"**# full strata:** `{vc.size}`"))
display(Markdown(f"**Min stratum size (full):** `{int(vc.min())}`"))
display(Markdown(f"**# strata with count < 2:** `{int((vc < 2).sum())}`"))

# full strata: 95

Min stratum size (full): 1

# strata with count < 2: 4

Code
plan = get_plan_labels(full["X"][0])
readm_evt = full["y_readm"][0]["event"].astype(int)
death_evt = full["y_death"][0]["event"].astype(int)

fb = pd.Series(plan.astype(str) + "_" + readm_evt.astype(str) + "_" + death_evt.astype(str))

rare_mask = strata_full.map(strata_full.value_counts()) < 2
n_rare = int(rare_mask.sum())

display(Markdown(f"**Rows in rare full-strata (<2):** `{n_rare}`"))
if n_rare > 0:
    display(Markdown(
        f"**Rare rows proportion:** `{n_rare/len(strata_full):.3%}`"
    ))

Rows in rare full-strata (<2): 4

Rare rows proportion: 0.005%

Code
# Use the actual stratification mode that was used to split
strata_used, strat_mode, _, _ = build_strata(full["X"][0], full["y_readm"][0], full["y_death"][0])

s = pd.Series(strata_used)
train_strata = set(s.iloc[train_idx].unique())
test_strata = set(s.iloc[test_idx].unique())

missing_in_test = sorted(train_strata - test_strata)
missing_in_train = sorted(test_strata - train_strata)

display(Markdown(f"**Strata used:** `{strat_mode}`"))
display(Markdown(f"**# strata in train:** `{len(train_strata)}` | **# strata in test:** `{len(test_strata)}`"))
display(Markdown(f"**Strata present in train but missing in test:** `{len(missing_in_test)}`"))
display(Markdown(f"**Strata present in test but missing in train:** `{len(missing_in_train)}`"))

Strata used: fallback(plan+readm+death)

# strata in train: 20 | # strata in test: 20

Strata present in train but missing in test: 0

Strata present in test but missing in train: 0

Code
from pathlib import Path
import pandas as pd
import numpy as np
from IPython.display import display, Markdown

PROJECT_ROOT = find_project_root()   # no hardcoded absolute path
OUT_DIR = PROJECT_ROOT / "_out"
OUT_DIR.mkdir(parents=True, exist_ok=True)

SPLIT_PARQUET = OUT_DIR / f"death_split_seed{SEED}_test{int(TEST_SIZE*100)}_mar26.parquet"

split_df = pd.DataFrame({
    "row_id": np.arange(n),
    "is_train": np.isin(np.arange(n), train_idx)
})

split_df.to_parquet(SPLIT_PARQUET, index=False)

display(Markdown(f"**Project root:** `{PROJECT_ROOT}`"))
display(Markdown(f"**Saved split to:** `{SPLIT_PARQUET}`"))

Project root: G:\My Drive\Alvacast\SISTRAT 2023\cons

Saved split to: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\death_split_seed2125_test20_mar26.parquet

Code
import pandas as pd
import numpy as np
from pathlib import Path

SEED = 2125
TEST_SIZE = 0.20

# Use the first imputation (complete data)
X_full = full["X"][0]
y_death_full = full["y_death"][0]

# Find admission age column
age_col = 'adm_age_rec3' if 'adm_age_rec3' in X_full.columns else \
          [c for c in X_full.columns if 'adm_age' in c][0]

# Create the 4-column split file
split_export = pd.DataFrame({
    'row_id': np.arange(1, len(X_full) + 1),  # 1-based for R
    'is_train': np.isin(np.arange(len(X_full)), train_idx),  # vectorized membership test
    'death_time_from_disch_m': np.round(y_death_full['time'], 2),
    'adm_age_rec3': X_full[age_col]
})

out_dir = PROJECT_ROOT / "_out"
out_dir.mkdir(exist_ok=True)

# Export
fname = out_dir / f"death_split_seed{SEED}_test{int(TEST_SIZE*100)}_mar26.parquet"
split_export.to_parquet(fname, index=False)

nb_print(f"Exported: {fname}")
nb_print(f"Total: {len(split_export)} rows")
nb_print(f"Train: {split_export['is_train'].sum()} ({100*split_export['is_train'].mean():.1f}%)")
nb_print(f"Test: {(~split_export['is_train']).sum()} ({100*(~split_export['is_train']).mean():.1f}%)")
nb_print(f"\nFirst 5 rows:")
nb_print(split_export.head())
Exported: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\death_split_seed2125_test20_mar26.parquet
Total: 88152 rows
Train: 70521 (80.0%)
Test: 17631 (20.0%)
First 5 rows:
   row_id  is_train  death_time_from_disch_m  adm_age_rec3
0       1      True                    68.97         31.53
1       2      True                    81.32         20.61
2       3      True                   116.74         42.52
3       4     False                    91.97         60.61
4       5      True                    31.03         45.08
Code
df0 = imputations_list_mar26[0]

# Quick descriptive checks on the first imputation
mean_age = df0["adm_age_rec3"].mean()
count_foreign = (df0["national_foreign"] == 1).sum()

# Print results
nb_print(f"Mean of adm_age_rec3: {mean_age:.4f}")
nb_print(f"Count of national_foreign == 1: {count_foreign}")
Mean of adm_age_rec3: 35.7256
Count of national_foreign == 1: 453

We cleaned our environment safely so that:

  • Old models

  • Temporary objects

  • Large intermediate datasets

do not interfere with the next modeling block.

Code
# Safe cleanup before Readmission XGBoost blocks
import types
import gc

# Ensure logger exists (some target cells expect it)
if "nb_print" not in globals():
    def nb_print(*args, **kwargs):
        print(*args, **kwargs)

# Compatibility: one Optuna/Bootstrap block checks jan26 naming
#if "imputations_list_jan26" not in globals() and "imputations_list_mar26" in globals():
#    imputations_list_jan26 = imputations_list_mar26

KEEP = {
    "nb_print", "study",
    "imputations_list_mar26", #"imputations_list",#"imputations_list_jan26"
    "X_train", "y_surv_readm_list_corrected", "y_surv_readm_list", "y_surv_death_list",
    # Optional plot config objects:
    "plt", "sns", "matplotlib", "mpl", "rcParams", "PROJECT_ROOT"
}

# ensure both variants are kept
KEEP.update({
    "y_surv_readm_list_corrected_mar26", "y_surv_readm_list_corrected",
    "y_surv_readm_list_mar26", "y_surv_death_list_mar26"
})

# after cleanup, sanity-check alignment before tuning
if "imputations_list_mar26" in globals() and "y_surv_readm_list_corrected" in globals():
    assert len(imputations_list_mar26[0]) == len(y_surv_readm_list_corrected[0]), \
        f"Row mismatch: X={len(imputations_list_mar26[0])}, y={len(y_surv_readm_list_corrected[0])}"
        
for name, obj in list(globals().items()):
    if name in KEEP or name.startswith("_"):
        continue
    if isinstance(obj, types.ModuleType):   # keep imports
        continue
    if callable(obj):                        # keep functions/classes
        continue
    del globals()[name]

gc.collect()

required = ["y_surv_readm_list_corrected", "y_surv_readm_list", "y_surv_death_list"]
missing = [x for x in required if x not in globals()]
nb_print("Missing required objects:", missing)
Missing required objects: []

ML

Advanced Survival Modeling: XGBoost & Stratified Evaluation

In this section, we transition to a Gradient Boosted Decision Tree (GBDT) framework using XGBoost. This serves as a robust non-linear benchmark to complement the neural network analysis for low-event survival data (approximately 4% death events).

Methodological Framework

  • Cox-Objective Boosting: We use the survival:cox objective, which optimizes the Cox partial log-likelihood within a boosting architecture. This enables flexible non-linear risk modeling and interaction learning while retaining the proportional hazards formulation.

  • 5-Fold Cross-Validation with Stratification (Death Model): We use 5-fold cross-validation with stratification to preserve key data structure across folds. In the death model, stratification is based on a combined label of treatment plan type and event status, with fallback to simpler stratification when rare strata make 5-fold splitting infeasible. (The readmission model currently uses plan-type-only stratification.)

  • Censoring-Aware Evaluation: Hyperparameter selection is based on Uno’s C-Index (IPCW), which is appropriate under right censoring. We also report the Integrated Brier Score (IBS) as a complementary calibration/overall prediction error metric.

Hyperparameter Optimization

Given the low event rate, we run a randomized search over a dense parameter grid. The search emphasizes regularization and tree-complexity controls (min_child_weight, gamma, reg_alpha, reg_lambda) to improve generalization and reduce overfitting.

Breslow Estimation

Because XGBoost with survival:cox outputs relative risk scores, we estimate the baseline hazard/survival using the Breslow estimator to derive absolute survival probabilities 𝑆(𝑡∣𝑥), enabling time-specific calibration metrics such as IBS.
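A simplified version of this conversion can be sketched as follows (a toy implementation, not the notebook's exact code). One assumption to keep in mind: XGBoost's `survival:cox` predictions are on the hazard-ratio scale by default, so scores should be converted to the log-hazard scale (e.g., via `predict(..., output_margin=True)`) before use here:

```python
import numpy as np

def breslow_baseline(times, events, log_risk):
    """Breslow estimator of the baseline cumulative hazard H0(t).

    times, events : follow-up times and 0/1 event indicators
    log_risk      : model scores on the log-hazard scale
    Returns (unique event times, H0 evaluated at those times).
    """
    order = np.argsort(times)
    t, e, r = times[order], events[order].astype(bool), np.exp(log_risk[order])
    # Risk-set sums: total exp(score) of subjects still at risk at each time
    at_risk = np.cumsum(r[::-1])[::-1]
    uniq_t = np.unique(t[e])
    h0 = np.zeros_like(uniq_t)
    for k, ut in enumerate(uniq_t):
        d = np.sum(e & (t == ut))                # events at this time
        denom = at_risk[np.searchsorted(t, ut)]  # risk set at ut
        h0[k] = d / denom
    return uniq_t, np.cumsum(h0)

def survival_probability(eval_times, uniq_t, H0, log_risk_new):
    """S(t|x) = exp(-H0(t) * exp(score_x)), step-function lookup in t."""
    idx = np.searchsorted(uniq_t, eval_times, side="right") - 1
    H0_at = np.where(idx >= 0, H0[np.clip(idx, 0, None)], 0.0)
    return np.exp(-np.outer(np.exp(log_risk_new), H0_at))

# Toy check on synthetic data (not the cohort)
rng = np.random.default_rng(0)
tt = rng.exponential(12.0, 500)
ev = (rng.random(500) < 0.5).astype(int)
lr = rng.normal(0.0, 0.3, 500)
ut, H0 = breslow_baseline(tt, ev, lr)
S = survival_probability(np.array([3.0, 6.0, 12.0]), ut, H0, lr[:5])
```

The resulting absolute probabilities S(t|x) are what IBS and other time-specific calibration metrics require; relative risk scores alone are not enough.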

Parameter tuning

Code
#@title ⚡ XGBoost Death Robust Tuning (100 Iterations, CPU Only, Dual Stratification + Fallback)
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold, ParameterSampler
from sksurv.metrics import concordance_index_ipcw, concordance_index_censored
import time
import gc
import os
from datetime import datetime
import warnings
from pathlib import Path

warnings.filterwarnings("ignore")

# Fallback in case nb_print is not defined globally
if 'nb_print' not in globals():
    def nb_print(*args, **kwargs):
        print(*args, **kwargs)

total_start_time = time.time()

# --- CPU CONFIGURATION ---
N_CORES = max(1, (os.cpu_count() or 1) - 2)  # guard: os.cpu_count() may return None
nb_print(f"⚙️ Parallel Execution Configured: Using {N_CORES} CPU cores.")

# --- 1. SETUP & DATA ---
nb_print("Preparing data for Robust XGBoost Tuning (Death)...")

try:
    if 'imputations_list_mar26' in locals():
        df_tune = imputations_list_mar26[0].copy()
        y_tune_struct = y_surv_death_list[0]
    else:
        df_tune = X_train.copy()
        y_tune_struct = y_surv_death_list[0]

    # Row alignment checks
    assert len(df_tune) == len(y_tune_struct), (
        f"X/y mismatch: df_tune={len(df_tune)}, y_tune_struct={len(y_tune_struct)}"
    )

    nb_print(f"  Data Shape    : {df_tune.shape}")
    nb_print(f"  Target        : Death (Events: {np.asarray(y_tune_struct['event']).sum()}, "
             f"Rate: {np.asarray(y_tune_struct['event']).mean():.3%})")

except Exception as e:
    raise ValueError(f"Data Error: {e}. Please run data loading and split cells first.")

# --- 2. STRATIFICATION HELPERS (Dual + Fallback) ---
def get_plan_stratification_labels(df):
    labels = np.zeros(len(df), dtype=int)
    if 'plan_type_corr_pg_pr' in df.columns:
        labels[df['plan_type_corr_pg_pr'] == 1] = 1
    if 'plan_type_corr_m_pr' in df.columns:
        labels[df['plan_type_corr_m_pr'] == 1] = 2
    if 'plan_type_corr_pg_pai' in df.columns:
        labels[df['plan_type_corr_pg_pai'] == 1] = 3
    if 'plan_type_corr_m_pai' in df.columns:
        labels[df['plan_type_corr_m_pai'] == 1] = 4
    return labels

def get_dual_stratification_labels(df, y_struct):
    labels = get_plan_stratification_labels(df)
    event_status = np.asarray(y_struct['event']).astype(int)
    return (labels * 10) + event_status

def choose_stratification(df, y_struct, n_splits=5):
    dual_labels = get_dual_stratification_labels(df, y_struct)
    dual_u, dual_c = np.unique(dual_labels, return_counts=True)

    if len(dual_u) > 1 and dual_c.min() >= n_splits:
        nb_print("Stratification mode: dual (plan_type x event).")
        return dual_labels, "dual", True

    rare_dual = {int(k): int(v) for k, v in zip(dual_u, dual_c) if v < n_splits}
    nb_print(f"[Fallback triggered] Dual stratification has classes with < {n_splits} samples: {rare_dual}")

    plan_labels = get_plan_stratification_labels(df)
    plan_u, plan_c = np.unique(plan_labels, return_counts=True)
    if len(plan_u) > 1 and plan_c.min() >= n_splits:
        nb_print("Stratification mode: plan_type only (fallback).")
        return plan_labels, "plan_only", True

    event_labels = np.asarray(y_struct['event']).astype(int)
    ev_u, ev_c = np.unique(event_labels, return_counts=True)
    if len(ev_u) > 1 and ev_c.min() >= n_splits:
        nb_print("Stratification mode: event only (fallback).")
        return event_labels, "event_only", True

    nb_print(f"[Fallback triggered] No valid stratification for {n_splits} folds. Using unstratified KFold.")
    return None, "kfold", False

strat_labels, strat_mode, use_stratified = choose_stratification(df_tune, y_tune_struct, n_splits=5)
y_xgb_label = np.where(np.asarray(y_tune_struct['event']), 
                       np.asarray(y_tune_struct['time']), 
                       -np.asarray(y_tune_struct['time']))

# --- 3. SEARCH SPACE ---
param_grid = {
    'learning_rate': [0.005, 0.01, 0.02, 0.05, 0.1],
    'max_depth': [3, 4, 5, 6, 8],
    'min_child_weight': [1, 5, 10, 20, 50],
    'subsample': [0.6, 0.7, 0.8, 0.9],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8],
    'reg_alpha': [0, 0.1, 1, 5, 10],
    'reg_lambda': [0.1, 1, 5, 10, 20],
    'gamma': [0, 0.1, 0.5, 1, 2]
}

N_ITER = 100
param_list = list(ParameterSampler(param_grid, n_iter=N_ITER, random_state=2125))

# --- 4. TUNING LOOP ---
nb_print(f"\n🚀 Starting Randomized Search ({N_ITER} combos)...")
nb_print(f"  Strategy: 5-Fold CV | Stratification mode: {strat_mode}")
nb_print(f"  Metric: Uno's C-Index (IPCW)")

results = []

if use_stratified:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2125)
else:
    cv = KFold(n_splits=5, shuffle=True, random_state=2125)

for i, sampled_params in enumerate(param_list):
    iter_start = time.time()
    
    params = sampled_params.copy()
    params['objective'] = 'survival:cox'
    params['eval_metric'] = 'cox-nloglik'
    params['tree_method'] = 'hist'
    params['seed'] = 2125
    params['nthread'] = N_CORES
    params['device'] = 'cpu'
    params['verbosity'] = 0

    fold_scores = []

    if use_stratified:
        split_iter = cv.split(df_tune, strat_labels)
    else:
        split_iter = cv.split(df_tune)

    for train_idx, val_idx in split_iter:
        X_tr, X_va = df_tune.iloc[train_idx], df_tune.iloc[val_idx]
        y_tr_xgb, y_va_xgb = y_xgb_label[train_idx], y_xgb_label[val_idx]
        y_tr_struct, y_va_struct = y_tune_struct[train_idx], y_tune_struct[val_idx]

        dtrain = xgb.DMatrix(X_tr, label=y_tr_xgb)
        dval = xgb.DMatrix(X_va, label=y_va_xgb)

        model = xgb.train(
            params, dtrain, num_boost_round=1500,
            evals=[(dval, 'val')], early_stopping_rounds=30,
            verbose_eval=False
        )

        risk_scores = model.predict(dval)

        try:
            c_val = concordance_index_ipcw(y_tr_struct, y_va_struct, risk_scores)[0]
        except Exception:
            c_val = concordance_index_censored(
                np.asarray(y_va_struct['event']),
                np.asarray(y_va_struct['time']),
                risk_scores
            )[0]

        fold_scores.append(c_val)

        del model, dtrain, dval, risk_scores
        gc.collect()

    avg_score = np.mean(fold_scores)
    std_score = np.std(fold_scores)
    results.append({**params, 'Unos_C_Index': avg_score, 'Std_Dev': std_score, 'Strat_Mode': strat_mode})

    if (i + 1) % 5 == 0:
        elapsed_min = (time.time() - total_start_time) / 60
        best_so_far = max(r['Unos_C_Index'] for r in results)
        nb_print(f"  [{i+1}/{N_ITER}] Best: {best_so_far:.4f} | Current: {avg_score:.4f} | Elapsed: {elapsed_min:.2f} min")

# --- 5. FINALIZE & EXPORT ---
total_duration_min = (time.time() - total_start_time) / 60
nb_print(f"\n🏁 Total Execution Time: {total_duration_min:.2f} minutes")

if "PROJECT_ROOT" not in globals():
    raise RuntimeError("PROJECT_ROOT is not defined. Run the project-root setup cell first.")

OUT_DIR = Path(PROJECT_ROOT) / "_out"
OUT_DIR.mkdir(parents=True, exist_ok=True)

df_results = pd.DataFrame(results).sort_values(by='Unos_C_Index', ascending=False)
best_config = df_results.iloc[0].to_dict()

timestamp_str = datetime.now().strftime("%Y%m%d_%H%M")
filename = OUT_DIR / f"XGB_Death_Robust_Tuning_5Fold_{timestamp_str}_mar26.csv"
df_results.to_csv(filename, index=False)

nb_print(f"\n🏆 Tuning Complete!")
nb_print(f"  Best C-Index: {best_config['Unos_C_Index']:.4f}")
nb_print(f"  Stratification used: {best_config['Strat_Mode']}")
nb_print(f"Saved to: {filename}")
⚙️ Parallel Execution Configured: Using 30 CPU cores.
Preparing data for Robust XGBoost Tuning (Death)...
  Data Shape    : (70521, 56)
  Target        : Death (Events: 3039, Rate: 4.309%)
Stratification mode: dual (plan_type x event).
🚀 Starting Randomized Search (100 combos)...
  Strategy: 5-Fold CV | Stratification mode: dual
  Metric: Uno's C-Index (IPCW)
  [5/100] Best: 0.7476 | Current: 0.7461 | Elapsed: 0.61 min
  [10/100] Best: 0.7476 | Current: 0.7439 | Elapsed: 1.47 min
  [15/100] Best: 0.7476 | Current: 0.7468 | Elapsed: 2.78 min
  [20/100] Best: 0.7476 | Current: 0.7436 | Elapsed: 3.63 min
  [25/100] Best: 0.7476 | Current: 0.7454 | Elapsed: 4.54 min
  [30/100] Best: 0.7486 | Current: 0.7436 | Elapsed: 5.49 min
  [35/100] Best: 0.7486 | Current: 0.7434 | Elapsed: 6.57 min
  [40/100] Best: 0.7486 | Current: 0.7430 | Elapsed: 7.62 min
  [45/100] Best: 0.7486 | Current: 0.7424 | Elapsed: 8.58 min
  [50/100] Best: 0.7486 | Current: 0.7452 | Elapsed: 9.27 min
  [55/100] Best: 0.7486 | Current: 0.7467 | Elapsed: 10.40 min
  [60/100] Best: 0.7486 | Current: 0.7460 | Elapsed: 11.79 min
  [65/100] Best: 0.7486 | Current: 0.7454 | Elapsed: 13.46 min
  [70/100] Best: 0.7486 | Current: 0.7459 | Elapsed: 14.37 min
  [75/100] Best: 0.7486 | Current: 0.7453 | Elapsed: 15.14 min
  [80/100] Best: 0.7486 | Current: 0.7483 | Elapsed: 16.50 min
  [85/100] Best: 0.7486 | Current: 0.7453 | Elapsed: 17.43 min
  [90/100] Best: 0.7486 | Current: 0.7455 | Elapsed: 18.81 min
  [95/100] Best: 0.7486 | Current: 0.7461 | Elapsed: 19.68 min
  [100/100] Best: 0.7486 | Current: 0.7442 | Elapsed: 20.36 min
🏁 Total Execution Time: 20.36 minutes
🏆 Tuning Complete!
  Best C-Index: 0.7486
  Stratification used: dual
Saved to: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\XGB_Death_Robust_Tuning_5Fold_20260305_1832_mar26.csv
Code
nb_print(f"\nTuning Complete!")
nb_print(f"  Best C-Index: {best_config['Unos_C_Index']:.4f}")
Tuning Complete!
  Best C-Index: 0.7486
Code
nb_print(best_config)
{'subsample': 0.7, 'reg_lambda': 5.0, 'reg_alpha': 0.0, 'min_child_weight': 5, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 0.0, 'colsample_bytree': 0.5, 'objective': 'survival:cox', 'eval_metric': 'cox-nloglik', 'tree_method': 'hist', 'seed': 2125, 'nthread': 30, 'device': 'cpu', 'verbosity': 0, 'Unos_C_Index': 0.7486327059193654, 'Std_Dev': 0.0054521701933017835, 'Strat_Mode': 'dual'}
Code
import pandas as pd
from IPython.display import HTML, display

# Relax display limits so pandas shows the full results table
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Convert DataFrame to HTML and wrap in a scrollable div
html_table = df_results.to_html()
scroll_box = f"""
<div style="max-height:500px; max-width:1000px; overflow-y:auto; overflow-x:auto; border:1px solid #ccc;">
{html_table}
</div>
"""
display(HTML(scroll_box))
subsample reg_lambda reg_alpha min_child_weight max_depth learning_rate gamma colsample_bytree objective eval_metric tree_method seed nthread device verbosity Unos_C_Index Std_Dev Strat_Mode
26 0.7 5.0 0.0 5 6 0.010 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.748633 0.005452 dual
79 0.7 10.0 0.0 5 6 0.005 1.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.748276 0.005658 dual
62 0.7 1.0 0.0 20 6 0.005 2.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.748194 0.006634 dual
35 0.7 1.0 1.0 1 5 0.005 0.1 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747850 0.004984 dual
3 0.6 5.0 1.0 1 4 0.010 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747594 0.004926 dual
27 0.6 20.0 0.0 1 4 0.020 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747508 0.003497 dual
33 0.6 5.0 1.0 5 6 0.010 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747469 0.005059 dual
2 0.7 5.0 5.0 20 5 0.010 0.5 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747456 0.004672 dual
11 0.7 10.0 0.0 20 5 0.005 2.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747445 0.004820 dual
15 0.6 10.0 5.0 5 5 0.005 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747320 0.005146 dual
72 0.7 10.0 0.0 5 4 0.020 0.1 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747297 0.005253 dual
43 0.9 5.0 0.0 20 6 0.020 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747149 0.006435 dual
71 0.7 5.0 0.1 1 4 0.020 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747131 0.005806 dual
10 0.7 10.0 1.0 20 5 0.020 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747074 0.004762 dual
82 0.9 1.0 5.0 20 8 0.005 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747065 0.005677 dual
86 0.9 10.0 1.0 10 5 0.005 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747049 0.004922 dual
22 0.6 20.0 0.0 1 3 0.020 0.1 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.747031 0.005368 dual
67 0.9 10.0 10.0 1 4 0.005 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746991 0.004775 dual
56 0.8 1.0 0.1 5 5 0.020 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746977 0.005150 dual
91 0.7 1.0 1.0 5 4 0.010 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746947 0.005518 dual
58 0.7 5.0 5.0 5 6 0.005 0.1 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746900 0.003416 dual
12 0.7 0.1 1.0 5 4 0.005 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746858 0.005409 dual
65 0.9 0.1 1.0 10 8 0.010 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746855 0.007099 dual
31 0.7 20.0 0.0 10 8 0.005 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746854 0.006460 dual
14 0.7 1.0 0.1 20 5 0.020 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746807 0.005310 dual
96 0.6 10.0 10.0 1 5 0.005 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746805 0.003293 dual
6 0.9 5.0 10.0 1 8 0.010 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746795 0.003845 dual
13 0.8 1.0 0.0 50 3 0.010 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746751 0.005893 dual
54 0.8 20.0 0.1 1 8 0.050 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746686 0.004723 dual
85 0.9 1.0 5.0 20 4 0.100 0.1 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746650 0.003268 dual
88 0.8 10.0 10.0 10 5 0.005 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746614 0.004183 dual
70 0.8 0.1 5.0 50 3 0.010 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746609 0.005504 dual
55 0.8 5.0 0.0 1 3 0.005 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746509 0.006427 dual
97 0.7 20.0 5.0 20 8 0.100 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746488 0.004795 dual
80 0.8 20.0 1.0 5 3 0.020 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746470 0.005792 dual
30 0.8 1.0 5.0 10 3 0.010 2.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746444 0.006092 dual
81 0.8 10.0 5.0 1 6 0.020 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746414 0.005448 dual
52 0.9 20.0 5.0 20 6 0.005 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746392 0.004390 dual
45 0.9 20.0 5.0 10 4 0.100 1.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746388 0.004534 dual
61 0.8 10.0 5.0 10 4 0.020 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746343 0.005409 dual
40 0.8 1.0 10.0 1 6 0.005 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746301 0.003124 dual
63 0.9 0.1 10.0 50 6 0.005 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746170 0.005178 dual
18 0.6 20.0 5.0 20 3 0.020 0.5 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746168 0.005169 dual
92 0.7 20.0 1.0 5 3 0.050 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746167 0.004773 dual
83 0.6 20.0 5.0 5 5 0.100 2.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746137 0.006410 dual
87 0.8 5.0 5.0 50 5 0.005 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746125 0.004781 dual
94 0.7 5.0 0.0 50 6 0.005 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746111 0.005550 dual
78 0.7 20.0 5.0 20 5 0.010 0.1 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746109 0.004160 dual
41 0.7 1.0 5.0 10 8 0.020 0.5 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746105 0.004690 dual
60 0.9 0.1 5.0 50 3 0.010 2.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746105 0.006236 dual
90 0.8 5.0 0.1 5 3 0.050 0.1 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746086 0.006413 dual
50 0.8 5.0 1.0 50 8 0.020 2.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746079 0.004857 dual
4 0.9 5.0 0.1 10 6 0.050 1.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746071 0.007413 dual
21 0.9 10.0 10.0 5 3 0.050 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.746046 0.004949 dual
53 0.8 5.0 10.0 50 5 0.020 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745978 0.004913 dual
59 0.7 1.0 5.0 5 6 0.050 2.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745967 0.006495 dual
77 0.8 10.0 5.0 50 3 0.010 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745966 0.005550 dual
25 0.8 20.0 5.0 5 5 0.050 2.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745948 0.003693 dual
42 0.6 0.1 0.1 50 6 0.010 0.5 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745914 0.006393 dual
69 0.8 0.1 10.0 5 4 0.050 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745909 0.004917 dual
5 0.6 5.0 0.0 50 3 0.020 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745903 0.006027 dual
36 0.6 0.1 10.0 10 6 0.005 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745796 0.003382 dual
57 0.7 1.0 10.0 20 6 0.005 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745774 0.003330 dual
20 0.7 10.0 10.0 20 5 0.005 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745716 0.003717 dual
76 0.6 5.0 0.0 10 5 0.050 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745712 0.005461 dual
68 0.8 0.1 0.1 50 3 0.050 0.1 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745617 0.005055 dual
28 0.6 5.0 5.0 20 3 0.005 0.5 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745573 0.006040 dual
89 0.9 20.0 0.1 50 5 0.050 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745544 0.003696 dual
93 0.9 0.1 5.0 5 6 0.100 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745509 0.005026 dual
24 0.7 0.1 1.0 50 3 0.050 1.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745376 0.005351 dual
64 0.7 20.0 10.0 50 5 0.010 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745371 0.005037 dual
74 0.9 10.0 10.0 10 5 0.100 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745284 0.002852 dual
84 0.8 5.0 10.0 10 3 0.100 2.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745284 0.005229 dual
49 0.9 20.0 5.0 50 3 0.100 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745243 0.004750 dual
8 0.6 10.0 0.0 50 3 0.010 0.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745231 0.006324 dual
48 0.6 1.0 1.0 50 8 0.005 1.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745203 0.006860 dual
51 0.6 20.0 0.1 50 6 0.010 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.745116 0.005527 dual
98 0.8 20.0 1.0 5 6 0.100 0.1 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744950 0.003910 dual
95 0.8 20.0 0.0 50 8 0.100 2.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744926 0.004955 dual
16 0.6 5.0 5.0 20 3 0.100 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744840 0.003910 dual
38 0.6 10.0 1.0 5 4 0.050 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744817 0.005888 dual
7 0.9 0.1 10.0 10 6 0.100 1.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744559 0.003551 dual
75 0.8 20.0 10.0 5 8 0.020 0.1 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744545 0.003332 dual
37 0.6 0.1 0.1 10 4 0.050 1.0 0.8 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744354 0.006533 dual
66 0.6 0.1 5.0 50 6 0.050 1.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744255 0.006016 dual
99 0.9 20.0 10.0 10 6 0.100 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.744248 0.005113 dual
46 0.8 1.0 10.0 10 8 0.050 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743995 0.004066 dual
9 0.7 20.0 1.0 20 8 0.100 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743941 0.003981 dual
23 0.8 1.0 1.0 50 8 0.100 0.1 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743693 0.007909 dual
29 0.7 5.0 1.0 50 4 0.050 0.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743627 0.005538 dual
19 0.7 20.0 1.0 10 5 0.100 0.5 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743562 0.005778 dual
17 0.7 5.0 0.0 50 4 0.050 1.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743515 0.005438 dual
0 0.7 20.0 0.0 10 3 0.100 1.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743424 0.007104 dual
73 0.8 1.0 0.1 50 8 0.100 2.0 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743418 0.009505 dual
34 0.9 5.0 1.0 5 5 0.100 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743384 0.006301 dual
47 0.7 0.1 0.0 1 6 0.050 0.5 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743314 0.005718 dual
32 0.6 0.1 0.1 10 6 0.050 2.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743075 0.006047 dual
39 0.6 1.0 0.0 5 8 0.020 0.0 0.5 survival:cox cox-nloglik hist 2125 30 cpu 0 0.743034 0.006412 dual
44 0.6 1.0 5.0 20 3 0.100 0.0 0.7 survival:cox cox-nloglik hist 2125 30 cpu 0 0.742399 0.006569 dual
1 0.6 1.0 5.0 5 6 0.100 0.1 0.6 survival:cox cox-nloglik hist 2125 30 cpu 0 0.741166 0.006817 dual

Optuna

  • Multi-objective tuning balances C-index and IBS.
  • Uses 5-fold cross-validation with dual stratification (plan type + event status).
  • Evaluates performance at 5 clinical time horizons.
  • Averages time-specific C-indices for robustness.
  • Computes survival via Breslow baseline hazard.
  • Converts risk scores into survival probabilities.
  • Uses IPCW C-index for censoring adjustment.
  • Applies early pruning for poor-performing trials.
  • Returns Pareto-optimal models, not a single winner.
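Before the full tuning cell, a minimal, self-contained sketch (toy arrays, not the study cohort) of the signed-time label encoding that XGBoost's `survival:cox` objective expects, and which the cell below builds via `np.where(event, time, -time)`:

```python
import numpy as np

# Toy outcome data (hypothetical, not the study cohort)
time = np.array([3.0, 12.0, 36.0, 60.0])
event = np.array([True, False, True, False])

# survival:cox label convention: +time for events, -time for censored
y_label = np.where(event, time, -time)  # +3, -12, +36, -60

# The encoding is lossless: event status and time are recoverable
recovered_event = y_label > 0
recovered_time = np.abs(y_label)
```

The sign carries censoring status, so a single label vector suffices for `xgb.DMatrix(X, label=y_label)`.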
Code
# @title Phase 1 - Death Optuna Multi-Objective (Stochastic MI: 1 imputation per trial)
import optuna
import numpy as np
import pandas as pd
import xgboost as xgb
import gc
import os
import warnings
import time
import joblib
from datetime import datetime
from sklearn.model_selection import StratifiedKFold, KFold
from sksurv.metrics import concordance_index_ipcw, brier_score
from pathlib import Path

start_time = time.time()
warnings.filterwarnings("ignore")

if 'nb_print' not in globals():
    def nb_print(*args, **kwargs):
        print(*args, **kwargs)

SEED = 2125
np.random.seed(SEED)

# Project-root and I/O directory configuration
if "PROJECT_ROOT" not in globals():
    raise RuntimeError("PROJECT_ROOT is not defined. Run the project-root setup cell first.")
PROJECT_ROOT = Path(PROJECT_ROOT).resolve()
INPUT_DIR = PROJECT_ROOT / "_input"
OUTPUT_DIR = PROJECT_ROOT / "_out"

INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

nb_print(f"PROJECT_ROOT: {PROJECT_ROOT}")
nb_print(f"INPUT_DIR: {INPUT_DIR}")
nb_print(f"OUTPUT_DIR: {OUTPUT_DIR}")

N_CORES = max(1, os.cpu_count() - 2)
N_TRIALS = 100
N_OPTUNA_JOBS = N_CORES  # parallel trials; set to 1 for strongest reproducibility
EVAL_HORIZONS = [3, 6, 12, 36, 60]
N_SPLITS = 5

nb_print("Preparing data for Death Phase 1 stochastic-MI tuning...")
nb_print(f"CPU available for notebook: {N_CORES} cores")

# ---------------------------
# Data setup
# ---------------------------
# Build y_death_struct_list FIRST, before the assert
n_imputations = len(imputations_tune)  # defined before use in the fallback branch below

if 'y_surv_death_list' in locals() and isinstance(y_surv_death_list, list) and len(y_surv_death_list) > 0:
    y_death_struct_list = [y.copy() for y in y_surv_death_list]
elif 'y_surv_death_list_mar26' in locals() and isinstance(y_surv_death_list_mar26, list) and len(y_surv_death_list_mar26) > 0:
    y_death_struct_list = [y.copy() for y in y_surv_death_list_mar26]
elif 'y_surv_death' in locals():
    y_death_struct_list = [y_surv_death.copy() for _ in range(n_imputations)]
else:
    raise ValueError("No y_surv_death_list, y_surv_death_list_mar26 or y_surv_death found.")

if len(y_death_struct_list) == 1 and n_imputations > 1:
    y_death_struct_list = [y_death_struct_list[0].copy() for _ in range(n_imputations)]

if len(y_death_struct_list) != n_imputations:
    raise ValueError(
        f"Mismatch: {n_imputations} imputations vs {len(y_death_struct_list)} death outcomes."
    )

# NOW assert is safe — both imputations_tune and y_death_struct_list exist
assert all(
    len(imp) == len(yd)
    for imp, yd in zip(imputations_tune, y_death_struct_list)
), f"Length mismatch! X={len(imputations_tune[0])}, y_death={len(y_death_struct_list[0])}"

# ---------------------------
# Stratification helpers
# ---------------------------
def get_plan_stratification_labels(df):
    labels = np.zeros(len(df), dtype=np.int32)
    if 'plan_type_corr_pg_pr' in df.columns: labels[df['plan_type_corr_pg_pr'] == 1] = 1
    if 'plan_type_corr_m_pr' in df.columns: labels[df['plan_type_corr_m_pr'] == 1] = 2
    if 'plan_type_corr_pg_pai' in df.columns: labels[df['plan_type_corr_pg_pai'] == 1] = 3
    if 'plan_type_corr_m_pai' in df.columns: labels[df['plan_type_corr_m_pai'] == 1] = 4
    return labels

def get_dual_stratification_labels(df, y_struct):
    plan_labels = get_plan_stratification_labels(df)
    event_status = np.asarray(y_struct['event']).astype(np.int32)
    return (plan_labels * 10) + event_status

def _is_stratifiable(labels, n_splits=5):
    unique, counts = np.unique(labels, return_counts=True)
    return (len(unique) > 1) and (counts.min() >= n_splits)

def _rare_classes(labels, n_splits=5):
    unique, counts = np.unique(labels, return_counts=True)
    return {int(k): int(v) for k, v in zip(unique, counts) if v < n_splits}

def make_cv_splits(X_imp, y_imp, n_splits=5, imp_idx=0):
    dual_labels = get_dual_stratification_labels(X_imp, y_imp)
    if _is_stratifiable(dual_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, dual_labels)), "dual"

    nb_print(f"[Fallback triggered][imp {imp_idx+1}] Dual not feasible. Rare classes: {_rare_classes(dual_labels, n_splits)}")

    plan_labels = get_plan_stratification_labels(X_imp)
    if _is_stratifiable(plan_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, plan_labels)), "plan_only"

    event_labels = np.asarray(y_imp['event']).astype(np.int32)
    if _is_stratifiable(event_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, event_labels)), "event_only"

    nb_print(f"[Fallback triggered][imp {imp_idx+1}] No stratification feasible. Using unstratified KFold.")
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    return list(cv.split(X_imp)), "kfold"

# ---------------------------
# Survival probability helper (Breslow)
# ---------------------------
def predict_survival_probs_breslow(y_tr, risk_tr, risk_va, eval_times):
    risk_tr = np.asarray(risk_tr, dtype=float)
    risk_va = np.asarray(risk_va, dtype=float)

    # Heuristic: hazard ratios are strictly positive, so any nonpositive
    # value indicates the model emitted log-risk; exponentiate in that case.
    if np.any(risk_tr <= 0):
        risk_tr = np.exp(risk_tr)
        risk_va = np.exp(risk_va)

    t_train = np.asarray(y_tr['time'])
    e_train = np.asarray(y_tr['event']).astype(bool)

    order = np.argsort(t_train)
    t_ord = t_train[order]
    e_ord = e_train[order]
    r_ord = risk_tr[order]

    unique_event_times = np.unique(t_ord[e_ord])
    if len(unique_event_times) == 0:
        return np.ones((len(risk_va), len(eval_times)), dtype=float)

    dH0 = np.zeros(len(unique_event_times), dtype=float)
    for i, t in enumerate(unique_event_times):
        at_risk = (t_ord >= t)
        denom = np.sum(r_ord[at_risk])
        if denom > 0:
            d = np.sum((t_ord == t) & e_ord)
            dH0[i] = d / denom

    H0 = np.cumsum(dH0)

    surv_probs = np.ones((len(risk_va), len(eval_times)), dtype=float)
    for j, tau in enumerate(eval_times):
        idx = np.searchsorted(unique_event_times, tau, side='right') - 1
        h0_tau = H0[idx] if idx >= 0 else 0.0
        surv_probs[:, j] = np.exp(-h0_tau * risk_va)

    return np.clip(surv_probs, 1e-6, 1.0)

# ---------------------------
# Precompute per-imputation payloads
# ---------------------------
imp_payloads = []
for imp_idx, (X_imp, y_imp) in enumerate(zip(imputations_tune, y_death_struct_list)):
    y_label = np.where(np.asarray(y_imp['event']), np.asarray(y_imp['time']), -np.asarray(y_imp['time']))
    splits, strat_mode = make_cv_splits(X_imp, y_imp, n_splits=N_SPLITS, imp_idx=imp_idx)

    imp_payloads.append({
        "imp_idx": imp_idx,
        "X": X_imp,
        "y_struct": y_imp,
        "y_label": y_label,
        "splits": splits,
        "strat_mode": strat_mode
    })
    nb_print(f"Imputation {imp_idx+1}: strat_mode={strat_mode}")

# ---------------------------
# Objective
# ---------------------------
def objective(trial):
    # Stochastic MI assignment (balanced by trial number)
    imp_idx = trial.number % n_imputations
    payload = imp_payloads[imp_idx]

    trial.set_user_attr("Sampled_Imputation", int(imp_idx + 1))
    trial.set_user_attr("Strat_Mode", payload["strat_mode"])

    params = {
        'objective': 'survival:cox',
        'eval_metric': 'cox-nloglik',
        'tree_method': 'hist',
        'device': 'cpu',
        'nthread': 1,  # avoid nested parallel oversubscription in Optuna
        'verbosity': 0,
        'seed': SEED,

        # death-focused search space
        'learning_rate': trial.suggest_float('learning_rate', 0.003, 0.05, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 7),
        'min_child_weight': trial.suggest_int('min_child_weight', 3, 40),
        'subsample': trial.suggest_float('subsample', 0.6, 0.85),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.65),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 5.0, log=True),   # log scale needs a positive lower bound (1e-4, not 0.0)
        'reg_lambda': trial.suggest_float('reg_lambda', 0.5, 15.0, log=True),
        'gamma': trial.suggest_float('gamma', 0.0, 1.0),                       # linear scale, so 0.0 is a valid lower bound
    }

    X = payload["X"]
    y_struct = payload["y_struct"]
    y_label = payload["y_label"]
    splits = payload["splits"]

    fold_c_indices = []
    fold_ib_scores = []
    fold_global_c_indices = []

    for fold_idx, (train_idx, val_idx) in enumerate(splits):
        X_tr, X_va = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_va = y_label[train_idx], y_label[val_idx]
        y_tr_struct, y_va_struct = y_struct[train_idx], y_struct[val_idx]

        dtrain = xgb.DMatrix(X_tr, label=y_tr)
        dval = xgb.DMatrix(X_va, label=y_va)

        model = xgb.train(
            params,
            dtrain,
            num_boost_round=1500,
            evals=[(dval, 'val')],
            early_stopping_rounds=30,
            verbose_eval=False
        )

        risk_tr = model.predict(dtrain)
        risk_va = model.predict(dval)

        # 1) Multi-horizon C-index
        h_c = []
        for tau in EVAL_HORIZONS:
            try:
                c_val = concordance_index_ipcw(y_tr_struct, y_va_struct, risk_va, tau=tau)[0]
                h_c.append(float(c_val))
            except Exception:
                pass
        avg_c = float(np.mean(h_c)) if len(h_c) > 0 else 0.5
        fold_c_indices.append(avg_c)

        # 2) Global C-index
        try:
            global_c = float(concordance_index_ipcw(y_tr_struct, y_va_struct, risk_va)[0])
        except Exception:
            global_c = 0.5
        fold_global_c_indices.append(global_c)

        # 3) IBS
        try:
            surv_probs_va = predict_survival_probs_breslow(y_tr_struct, risk_tr, risk_va, EVAL_HORIZONS)
            _, brier_at_tau = brier_score(y_tr_struct, y_va_struct, surv_probs_va, EVAL_HORIZONS)
            avg_ibs = float(np.mean(brier_at_tau))
        except Exception:
            avg_ibs = 0.25
        fold_ib_scores.append(avg_ibs)

        del model, dtrain, dval, risk_tr, risk_va
        gc.collect()

        # Prune aggressively: stop once the running mean C-index is below 0.60 after the second fold
        if fold_idx >= 1 and np.mean(fold_c_indices) < 0.60:
            raise optuna.TrialPruned()

    trial.set_user_attr("Global_C_Index", float(np.mean(fold_global_c_indices)))
    return float(np.mean(fold_c_indices)), float(np.mean(fold_ib_scores))

# ---------------------------
# Run study
# ---------------------------
sampler = optuna.samplers.NSGAIISampler(seed=SEED)
study = optuna.create_study(
    directions=['maximize', 'minimize'],
    sampler=sampler,
    study_name="XGB_Death_Optuna_Study_StochMI"
)

nb_print(f"Starting Phase 1 optimization with {N_TRIALS} trials | seed={SEED} | optuna_jobs={N_OPTUNA_JOBS}")
study.optimize(
    objective,
    n_trials=N_TRIALS,
    n_jobs=N_OPTUNA_JOBS,
    show_progress_bar=True,
    gc_after_trial=True
)

pareto_trials = [t for t in study.best_trials if t.values is not None and len(t.values) == 2]
nb_print(f"\nPhase 1 Pareto models: {len(pareto_trials)}")

phase1_rows = []
for t in pareto_trials:
    row = {
        "trial_id": int(t.number),
        "Phase1_Multi_Horizon_C_Index": float(t.values[0]),
        "Phase1_IBS": float(t.values[1]),
        "Phase1_Global_C_Index": float(t.user_attrs.get("Global_C_Index", np.nan)),
        "Sampled_Imputation": t.user_attrs.get("Sampled_Imputation", np.nan),
        "Strat_Mode": t.user_attrs.get("Strat_Mode", "NA"),
    }
    row.update(t.params)
    phase1_rows.append(row)

df_phase1 = pd.DataFrame(phase1_rows)
if len(df_phase1) > 0:
    df_phase1["Phase1_Distance_to_Ideal"] = np.sqrt(
        (1.0 - df_phase1["Phase1_Multi_Horizon_C_Index"])**2 + (df_phase1["Phase1_IBS"])**2
    )
    df_phase1 = df_phase1.sort_values("Phase1_Distance_to_Ideal", ascending=True).reset_index(drop=True)
    nb_print(df_phase1.head(10))

timestamp_str = datetime.now().strftime("%Y%m%d_%H%M")

study_path = INPUT_DIR / f"XGB_Death_Optuna_Study_StochMI_{timestamp_str}_mar26.pkl"
joblib.dump(study, study_path)
nb_print(f"\nStudy saved: {study_path}")

if len(df_phase1) > 0:
    phase1_csv = OUTPUT_DIR / f"XGB_Death_Optuna_Study_StochMI_{timestamp_str}_mar26.csv"
    df_phase1.to_csv(phase1_csv, index=False)
    nb_print(f"Phase 1 summary saved: {phase1_csv}")

elapsed_minutes = (time.time() - start_time) / 60
nb_print(f"Time taken: {elapsed_minutes:.2f} minutes")
PROJECT_ROOT: G:\My Drive\Alvacast\SISTRAT 2023\cons
INPUT_DIR: G:\My Drive\Alvacast\SISTRAT 2023\cons\_input
OUTPUT_DIR: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out
Preparing data for Death Phase 1 stochastic-MI tuning...
CPU available for notebook: 30 cores
Imputation 1: strat_mode=dual
Imputation 2: strat_mode=dual
Imputation 3: strat_mode=dual
Imputation 4: strat_mode=dual
Imputation 5: strat_mode=dual
[I 2026-03-05 19:56:23,527] A new study created in memory with name: XGB_Death_Optuna_Study_StochMI
Starting Phase 1 optimization with 100 trials | seed=2125 | optuna_jobs=30
100%|██████████| 100/100 [17:31<00:00, 10.51s/it] 
Phase 1 Pareto models: 5
   trial_id  Phase1_Multi_Horizon_C_Index  Phase1_IBS  Phase1_Global_C_Index  Sampled_Imputation Strat_Mode  learning_rate  max_depth  min_child_weight  \
0        55                      0.763571    0.017723               0.746154                   1       dual       0.030967          6                15   
1        95                      0.763288    0.017722               0.746774                   1       dual       0.030967          6                15   
2        35                      0.762946    0.017709               0.747810                   1       dual       0.028376          6                30   
3        75                      0.762676    0.017697               0.746381                   1       dual       0.045269          4                30   
4        46                      0.761964    0.017696               0.745657                   2       dual       0.038173          3                14   

   subsample  colsample_bytree  reg_alpha  reg_lambda     gamma  Phase1_Distance_to_Ideal  
0   0.778743          0.616890   0.013708    2.536795  0.982286                  0.237093  
1   0.778743          0.616890   0.530266    2.536795  0.982286                  0.237374  
2   0.792929          0.491632   1.406209    2.548292  0.369767                  0.237715  
3   0.779111          0.635094   0.004008    4.227691  0.158302                  0.237983  
4   0.708255          0.552737   0.216076    3.956397  0.000897                  0.238693  
Study saved: G:\My Drive\Alvacast\SISTRAT 2023\cons\_input\XGB_Death_Optuna_Study_StochMI_20260305_2013_mar26.pkl
Phase 1 summary saved: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\XGB_Death_Optuna_Study_StochMI_20260305_2013_mar26.csv
Time taken: 17.54 minutes
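The Breslow conversion in `predict_survival_probs_breslow` can be sanity-checked by hand: when all hazard ratios equal 1, the cumulative baseline hazard reduces to the Nelson-Aalen estimator. A toy check under that assumption (three hypothetical subjects, not study data):

```python
import numpy as np

# Events at t=1 and t=2, censoring at t=3; null model (all hazard ratios = 1)
t_train = np.array([1.0, 2.0, 3.0])
e_train = np.array([True, True, False])
risk = np.ones(3)

event_times = np.unique(t_train[e_train])  # event times: 1.0 and 2.0
dH0 = np.array([
    ((t_train == t) & e_train).sum() / risk[t_train >= t].sum()
    for t in event_times
])  # increments: 1/3 (3 at risk at t=1), then 1/2 (2 at risk at t=2)
H0 = np.cumsum(dH0)  # cumulative baseline hazard: [1/3, 5/6]

# Survival at t=2 for a subject with hazard ratio 1: exp(-5/6)
S_at_2 = np.exp(-H0[-1] * 1.0)
```

Matching these hand-computable increments against the notebook helper is a quick way to confirm the at-risk set and tie handling are correct.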
Code
import pandas as pd
from IPython.display import HTML, display

def show_scrollable_df(df, max_height=500, max_width=1200):
    html_table = df.to_html(index=False)
    scroll_box = f"""
    <div style="max-height:{max_height}px; max-width:{max_width}px;
                overflow-y:auto; overflow-x:auto; border:1px solid #ccc;">
    {html_table}
    </div>
    """
    display(HTML(scroll_box))
Code
nb_print("\nOptimal Models found (Pareto Front):")
best_trials = study.best_trials
for t in best_trials:
    global_c_val = t.user_attrs.get("Global_C_Index", "N/A")
    nb_print(f"Trial {t.number} -> Multi-Horizon C: {t.values[0]:.4f} | IBS: {t.values[1]:.4f} | Global C: {global_c_val:.4f}")
Optimal Models found (Pareto Front):
Trial 35 -> Multi-Horizon C: 0.7629 | IBS: 0.0177 | Global C: 0.7478
Trial 46 -> Multi-Horizon C: 0.7620 | IBS: 0.0177 | Global C: 0.7457
Trial 55 -> Multi-Horizon C: 0.7636 | IBS: 0.0177 | Global C: 0.7462
Trial 75 -> Multi-Horizon C: 0.7627 | IBS: 0.0177 | Global C: 0.7464
Trial 95 -> Multi-Horizon C: 0.7633 | IBS: 0.0177 | Global C: 0.7468
Code
show_scrollable_df(df_phase1)
trial_id Phase1_Multi_Horizon_C_Index Phase1_IBS Phase1_Global_C_Index Sampled_Imputation Strat_Mode learning_rate max_depth min_child_weight subsample colsample_bytree reg_alpha reg_lambda gamma Phase1_Distance_to_Ideal
55 0.763571 0.017723 0.746154 1 dual 0.030967 6 15 0.778743 0.616890 0.013708 2.536795 0.982286 0.237093
95 0.763288 0.017722 0.746774 1 dual 0.030967 6 15 0.778743 0.616890 0.530266 2.536795 0.982286 0.237374
35 0.762946 0.017709 0.747810 1 dual 0.028376 6 30 0.792929 0.491632 1.406209 2.548292 0.369767 0.237715
75 0.762676 0.017697 0.746381 1 dual 0.045269 4 30 0.779111 0.635094 0.004008 4.227691 0.158302 0.237983
46 0.761964 0.017696 0.745657 2 dual 0.038173 3 14 0.708255 0.552737 0.216076 3.956397 0.000897 0.238693
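Candidate ranking uses Euclidean distance to the ideal point (C-index = 1, IBS = 0), the same formula as `Phase1_Distance_to_Ideal` above. A toy illustration of the selection rule on hypothetical (C, IBS) pairs:

```python
import numpy as np

# Hypothetical Pareto candidates as (multi-horizon C-index, IBS) pairs
c_index = np.array([0.76, 0.74, 0.70])
ibs = np.array([0.018, 0.010, 0.005])

# Distance to the ideal point (C = 1, IBS = 0); smaller is better
dist = np.sqrt((1.0 - c_index) ** 2 + ibs ** 2)
winner = int(np.argmin(dist))  # index 0: the C-index gap dominates the small IBS differences
```

Because IBS values sit near 0.02 while C-index shortfalls are near 0.25, the distance is dominated by discrimination unless two candidates have nearly identical C-indices.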
Code
# @title Phase 2 - Death: Re-score top Pareto candidates on ALL imputations + final winner
import os
import re
import glob
import gc
import time
import joblib
import numpy as np
import pandas as pd
import xgboost as xgb
from datetime import datetime
from collections import defaultdict
from sklearn.model_selection import StratifiedKFold, KFold
from sksurv.metrics import concordance_index_ipcw, brier_score
from joblib import Parallel, delayed, parallel_backend
from pathlib import Path

start_time = time.time()

if 'nb_print' not in globals():
    def nb_print(*args, **kwargs):
        print(*args, **kwargs)

SEED = 2125
np.random.seed(SEED)

TOP_K = 20
EVAL_HORIZONS = [3, 6, 12, 36, 60]
N_SPLITS = 5
PROJECT_ROOT = Path(globals().get("PROJECT_ROOT", Path.cwd())).resolve()
INPUT_DIR = PROJECT_ROOT / "_input"
OUTPUT_DIR = PROJECT_ROOT / "_out"
INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

nb_print(f"PROJECT_ROOT: {PROJECT_ROOT}")

# ---------------------------
# Load study (in-memory or latest file)
# ---------------------------
expected_study_name = "XGB_Death_Optuna_Study_StochMI"
if 'study' in globals() and getattr(study, 'study_name', None) == expected_study_name:
    nb_print(f"Using in-memory study '{expected_study_name}'.")
else:
    pattern_name = "XGB_Death_Optuna_Study_StochMI_*_mar26.pkl"
    files = list(INPUT_DIR.glob(pattern_name))
    if not files:
        raise FileNotFoundError(f"No study found with pattern: {INPUT_DIR / pattern_name}")

    ts_files = []
    for f in files:
        m = re.search(r"XGB_Death_Optuna_Study_StochMI_(\d{8}_\d{4})_mar26\.pkl$", f.name)
        if m:
            try:
                ts_files.append((datetime.strptime(m.group(1), "%Y%m%d_%H%M"), f))
            except ValueError:
                pass

    if not ts_files:
        raise FileNotFoundError("No timestamp-valid Death study file found.")

    latest_file = max(ts_files, key=lambda x: x[0])[1]
    study = joblib.load(latest_file)
    nb_print(f"Loaded study from: {latest_file}")

# ---------------------------
# Data setup
# ---------------------------
# Reuse imputations_tune if already defined; otherwise fall back to saved imputations or X_train
if 'imputations_tune' not in globals():
    if 'imputations_list_mar26' in globals():
        imputations_tune = [df.copy() for df in imputations_list_mar26]
    elif 'X_train' in locals():
        imputations_tune = [X_train.copy()]
    else:
        raise ValueError("No imputations_tune, imputations_list_mar26, or X_train found.")
# if imputations_tune already exists, reuse it as-is

n_imputations = len(imputations_tune)

if 'y_surv_death_list' in locals() and isinstance(y_surv_death_list, list) and len(y_surv_death_list) > 0:
    y_death_struct_list = [y.copy() for y in y_surv_death_list]
elif 'y_surv_death' in locals():
    y_death_struct_list = [y_surv_death.copy() for _ in range(n_imputations)]
else:
    raise ValueError("No y_surv_death_list or y_surv_death found.")

if len(y_death_struct_list) == 1 and n_imputations > 1:
    y_death_struct_list = [y_death_struct_list[0].copy() for _ in range(n_imputations)]

if len(y_death_struct_list) != n_imputations:
    raise ValueError(
        f"Mismatch: {n_imputations} imputations vs {len(y_death_struct_list)} death outcomes."
    )

# ---------------------------
# Stratification helpers
# ---------------------------
def get_plan_stratification_labels(df):
    labels = np.zeros(len(df), dtype=np.int32)
    if 'plan_type_corr_pg_pr' in df.columns: labels[df['plan_type_corr_pg_pr'] == 1] = 1
    if 'plan_type_corr_m_pr' in df.columns: labels[df['plan_type_corr_m_pr'] == 1] = 2
    if 'plan_type_corr_pg_pai' in df.columns: labels[df['plan_type_corr_pg_pai'] == 1] = 3
    if 'plan_type_corr_m_pai' in df.columns: labels[df['plan_type_corr_m_pai'] == 1] = 4
    return labels

def get_dual_stratification_labels(df, y_struct):
    plan_labels = get_plan_stratification_labels(df)
    event_status = np.asarray(y_struct['event']).astype(np.int32)
    return (plan_labels * 10) + event_status

def _is_stratifiable(labels, n_splits=5):
    unique, counts = np.unique(labels, return_counts=True)
    return (len(unique) > 1) and (counts.min() >= n_splits)

def _rare_classes(labels, n_splits=5):
    unique, counts = np.unique(labels, return_counts=True)
    return {int(k): int(v) for k, v in zip(unique, counts) if v < n_splits}

def make_cv_splits(X_imp, y_imp, n_splits=5, imp_idx=0):
    dual_labels = get_dual_stratification_labels(X_imp, y_imp)
    if _is_stratifiable(dual_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, dual_labels)), "dual"

    nb_print(f"[Fallback triggered][imp {imp_idx+1}] Dual not feasible. Rare classes: {_rare_classes(dual_labels, n_splits)}")

    plan_labels = get_plan_stratification_labels(X_imp)
    if _is_stratifiable(plan_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, plan_labels)), "plan_only"

    event_labels = np.asarray(y_imp['event']).astype(np.int32)
    if _is_stratifiable(event_labels, n_splits):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        return list(cv.split(X_imp, event_labels)), "event_only"

    nb_print(f"[Fallback triggered][imp {imp_idx+1}] No stratification feasible. Using unstratified KFold.")
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    return list(cv.split(X_imp)), "kfold"

# ---------------------------
# Survival probability helper (Breslow)
# ---------------------------
def predict_survival_probs_breslow(y_tr, risk_tr, risk_va, eval_times):
    risk_tr = np.asarray(risk_tr, dtype=float)
    risk_va = np.asarray(risk_va, dtype=float)

    # Heuristic: hazard ratios are strictly positive, so any nonpositive
    # value indicates the model emitted log-risk; exponentiate in that case.
    if np.any(risk_tr <= 0):
        risk_tr = np.exp(risk_tr)
        risk_va = np.exp(risk_va)

    t_train = np.asarray(y_tr['time'])
    e_train = np.asarray(y_tr['event']).astype(bool)

    order = np.argsort(t_train)
    t_ord = t_train[order]
    e_ord = e_train[order]
    r_ord = risk_tr[order]

    unique_event_times = np.unique(t_ord[e_ord])
    if len(unique_event_times) == 0:
        return np.ones((len(risk_va), len(eval_times)), dtype=float)

    dH0 = np.zeros(len(unique_event_times), dtype=float)
    for i, t in enumerate(unique_event_times):
        at_risk = (t_ord >= t)
        denom = np.sum(r_ord[at_risk])
        if denom > 0:
            d = np.sum((t_ord == t) & e_ord)
            dH0[i] = d / denom

    H0 = np.cumsum(dH0)

    surv_probs = np.ones((len(risk_va), len(eval_times)), dtype=float)
    for j, tau in enumerate(eval_times):
        idx = np.searchsorted(unique_event_times, tau, side='right') - 1
        h0_tau = H0[idx] if idx >= 0 else 0.0
        surv_probs[:, j] = np.exp(-h0_tau * risk_va)

    return np.clip(surv_probs, 1e-6, 1.0)

# ---------------------------
# Build Pareto table from Phase 1
# ---------------------------
pareto_trials = [t for t in study.best_trials if t.values is not None and len(t.values) == 2]
if len(pareto_trials) == 0:
    raise ValueError("No valid Pareto trials in study.")

phase1_rows = []
for t in pareto_trials:
    row = {
        "trial_id": int(t.number),
        "Phase1_Multi_Horizon_C_Index": float(t.values[0]),
        "Phase1_IBS": float(t.values[1]),
        "Phase1_Global_C_Index": float(t.user_attrs.get("Global_C_Index", np.nan)),
        "Sampled_Imputation": t.user_attrs.get("Sampled_Imputation", np.nan),
        "Strat_Mode": t.user_attrs.get("Strat_Mode", "NA"),
    }
    row.update(t.params)
    phase1_rows.append(row)

df_phase1 = pd.DataFrame(phase1_rows)
df_phase1["Phase1_Distance_to_Ideal"] = np.sqrt(
    (1.0 - df_phase1["Phase1_Multi_Horizon_C_Index"])**2 + (df_phase1["Phase1_IBS"])**2
)
df_phase1 = df_phase1.sort_values("Phase1_Distance_to_Ideal", ascending=True).reset_index(drop=True)

top_k = min(TOP_K, len(df_phase1))
candidate_ids = df_phase1.head(top_k)["trial_id"].astype(int).tolist()
trial_map = {t.number: t for t in pareto_trials}
candidate_trials = [trial_map[i] for i in candidate_ids]

nb_print(f"Phase 1 Pareto count: {len(df_phase1)} | Phase 2 candidates: {len(candidate_trials)}")

# ---------------------------
# Precompute imputation payloads
# ---------------------------
imp_payloads = []
for imp_idx, (X_imp, y_imp) in enumerate(zip(imputations_tune, y_death_struct_list)):
    y_label = np.where(np.asarray(y_imp['event']), np.asarray(y_imp['time']), -np.asarray(y_imp['time']))
    splits, strat_mode = make_cv_splits(X_imp, y_imp, n_splits=N_SPLITS, imp_idx=imp_idx)

    imp_payloads.append({
        "imp_id": imp_idx,
        "X": X_imp,
        "y_label": y_label,
        "y_struct": y_imp,
        "splits": splits,
        "strat_mode": strat_mode
    })
    nb_print(f"Imputation {imp_idx+1}: strat_mode={strat_mode}")

# ---------------------------
# Parallel config
# ---------------------------
N_CORES = max(1, os.cpu_count() - 2)
CANDIDATE_PARALLEL = 3  # 2 is safer on RAM; 3 is faster if RAM allows
TOTAL_WORKERS = min(N_CORES, max(1, CANDIDATE_PARALLEL) * len(imp_payloads))
#XGB_THREADS = max(1, N_CORES // max(1, TOTAL_WORKERS))
#XGB_THREADS = min(XGB_THREADS, 2)
XGB_THREADS = 1  # consistent with Phase 1: 1 thread/model, parallelism via joblib workers

nb_print(
    f"Phase 2 parallel config -> workers={TOTAL_WORKERS}, "
    f"xgb_threads/model={XGB_THREADS}, candidate_parallel~{CANDIDATE_PARALLEL}"
)

# ---------------------------
# Evaluation function
# ---------------------------
def evaluate_trial_imputation(trial_id, trial_params, payload):
    X_imp = payload["X"]
    y_label = payload["y_label"]
    y_struct = payload["y_struct"]
    splits = payload["splits"]
    imp_id = payload["imp_id"]

    params = {
        'objective': 'survival:cox',
        'eval_metric': 'cox-nloglik',
        'tree_method': 'hist',
        'device': 'cpu',
        'nthread': XGB_THREADS,
        'verbosity': 0,
        'seed': SEED,
        **trial_params
    }

    fold_c = []
    fold_ibs = []
    fold_global_c = []

    for tr_idx, va_idx in splits:
        X_tr, X_va = X_imp.iloc[tr_idx], X_imp.iloc[va_idx]
        y_tr, y_va = y_label[tr_idx], y_label[va_idx]
        y_tr_struct, y_va_struct = y_struct[tr_idx], y_struct[va_idx]

        dtr = xgb.DMatrix(X_tr, label=y_tr)
        dva = xgb.DMatrix(X_va, label=y_va)

        model = xgb.train(
            params,
            dtr,
            num_boost_round=1500,
            evals=[(dva, "val")],
            early_stopping_rounds=30,
            verbose_eval=False
        )

        risk_tr = model.predict(dtr)
        risk_va = model.predict(dva)

        # Multi-horizon C-index
        h_c = []
        for tau in EVAL_HORIZONS:
            try:
                h_c.append(float(concordance_index_ipcw(y_tr_struct, y_va_struct, risk_va, tau=tau)[0]))
            except Exception:
                pass
        fold_c.append(float(np.mean(h_c)) if len(h_c) > 0 else 0.5)

        # Global C-index
        try:
            fold_global_c.append(float(concordance_index_ipcw(y_tr_struct, y_va_struct, risk_va)[0]))
        except Exception:
            fold_global_c.append(0.5)

        # IBS
        try:
            surv_probs_va = predict_survival_probs_breslow(y_tr_struct, risk_tr, risk_va, EVAL_HORIZONS)
            _, brier_at_tau = brier_score(y_tr_struct, y_va_struct, surv_probs_va, EVAL_HORIZONS)
            fold_ibs.append(float(np.mean(brier_at_tau)))
        except Exception:
            fold_ibs.append(0.25)

        del model, dtr, dva, risk_tr, risk_va
        gc.collect()

    return (
        int(trial_id),
        int(imp_id),
        float(np.mean(fold_c)),
        float(np.mean(fold_ibs)),
        float(np.mean(fold_global_c)),
        payload["strat_mode"]
    )

# ---------------------------
# Run all candidate x imputation tasks
# ---------------------------
tasks = [
    (t.number, t.params, payload)
    for t in candidate_trials
    for payload in imp_payloads
]

nb_print(f"Launching {len(tasks)} tasks ({len(candidate_trials)} candidates x {len(imp_payloads)} imputations)...")

with parallel_backend("threading", n_jobs=TOTAL_WORKERS):
    results = Parallel(verbose=10)(
        delayed(evaluate_trial_imputation)(trial_id, trial_params, payload)
        for trial_id, trial_params, payload in tasks
    )

# ---------------------------
# Aggregate to trial-level
# ---------------------------
trial_metrics = defaultdict(lambda: {"c": {}, "ibs": {}, "global_c": {}, "modes": {}})

for trial_id, imp_id, c_val, ibs_val, g_val, mode in results:
    trial_metrics[trial_id]["c"][imp_id] = c_val
    trial_metrics[trial_id]["ibs"][imp_id] = ibs_val
    trial_metrics[trial_id]["global_c"][imp_id] = g_val
    trial_metrics[trial_id]["modes"][imp_id] = mode

phase2_rows = []
phase2_detail = {}

for t in candidate_trials:
    m = trial_metrics[t.number]
    if len(m["c"]) != len(imp_payloads):
        raise RuntimeError(f"Trial {t.number} is missing imputation results.")

    c_per_imp = [m["c"][i] for i in range(len(imp_payloads))]
    ibs_per_imp = [m["ibs"][i] for i in range(len(imp_payloads))]
    global_c_per_imp = [m["global_c"][i] for i in range(len(imp_payloads))]
    modes_per_imp = [m["modes"][i] for i in range(len(imp_payloads))]

    row = {
        "trial_id": int(t.number),
        "Phase2_Multi_Horizon_C_Index": float(np.mean(c_per_imp)),
        "Phase2_IBS": float(np.mean(ibs_per_imp)),
        "Phase2_Global_C_Index": float(np.mean(global_c_per_imp)),
        "C_Index_SD_across_imputations": float(np.std(c_per_imp)),
        "IBS_SD_across_imputations": float(np.std(ibs_per_imp)),
        "Global_C_SD_across_imputations": float(np.std(global_c_per_imp)),
        "Strat_Modes": "|".join(modes_per_imp),
    }
    row.update(t.params)
    phase2_rows.append(row)

    phase2_detail[t.number] = {
        "c_per_imp": c_per_imp,
        "ibs_per_imp": ibs_per_imp,
        "global_c_per_imp": global_c_per_imp,
        "strat_modes_per_imp": modes_per_imp
    }

df_phase2 = pd.DataFrame(phase2_rows)
df_phase2["Distance_to_Ideal"] = np.sqrt(
    (1.0 - df_phase2["Phase2_Multi_Horizon_C_Index"])**2 + (df_phase2["Phase2_IBS"])**2
)
df_phase2 = df_phase2.sort_values(
    ["Distance_to_Ideal", "Phase2_Global_C_Index", "C_Index_SD_across_imputations"],
    ascending=[True, False, True]
).reset_index(drop=True)

winner = df_phase2.iloc[0]
winner_trial_id = int(winner["trial_id"])

nb_print("\nFinal winner from Phase 2 (all imputations):")
nb_print(f"  Trial ID: {winner_trial_id}")
nb_print(f"  Multi-Horizon C-Index: {winner['Phase2_Multi_Horizon_C_Index']:.4f}")
nb_print(f"  IBS: {winner['Phase2_IBS']:.4f}")
nb_print(f"  Global C-Index: {winner['Phase2_Global_C_Index']:.4f}")
nb_print(f"  Distance to Ideal: {winner['Distance_to_Ideal']:.4f}")
nb_print(f"  C-index SD across imputations: {winner['C_Index_SD_across_imputations']:.4f}")
nb_print(f"  IBS SD across imputations: {winner['IBS_SD_across_imputations']:.4f}")
nb_print(f"  Stratification modes: {winner['Strat_Modes']}")

# Optional: winner params dict
param_keys = sorted({k for t in candidate_trials for k in t.params.keys()})
winner_params = {k: winner[k] for k in param_keys}
nb_print("\nWinner hyperparameters:")
nb_print(winner_params)

# ---------------------------
# Save outputs
# ---------------------------
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M")

phase1_csv = OUTPUT_DIR / f"Death_Pareto_Phase1_{timestamp_str}_mar26.csv"
phase2_csv = OUTPUT_DIR / f"Death_Pareto_Phase2_{timestamp_str}_mar26.csv"

df_phase1.to_csv(phase1_csv, index=False)
df_phase2.to_csv(phase2_csv, index=False)

nb_print(f"\nSaved: {phase1_csv}")
nb_print(f"Saved: {phase2_csv}")

elapsed_minutes = (time.time() - start_time) / 60
nb_print(f"Time taken: {elapsed_minutes:.2f} minutes")
PROJECT_ROOT: G:\My Drive\Alvacast\SISTRAT 2023\cons
Using in-memory study 'XGB_Death_Optuna_Study_StochMI'.
Phase 1 Pareto count: 5 | Phase 2 candidates: 5
Imputation 1: strat_mode=dual
Imputation 2: strat_mode=dual
Imputation 3: strat_mode=dual
Imputation 4: strat_mode=dual
Imputation 5: strat_mode=dual
Phase 2 parallel config -> workers=15, xgb_threads/model=1, candidate_parallel~3
Launching 25 tasks (5 candidates x 5 imputations)...
[Parallel(n_jobs=15)]: Using backend ThreadingBackend with 15 concurrent workers.
[Parallel(n_jobs=15)]: Done   2 out of  25 | elapsed:  2.0min remaining: 23.2min
[Parallel(n_jobs=15)]: Done   5 out of  25 | elapsed:  2.2min remaining:  8.6min
[Parallel(n_jobs=15)]: Done   8 out of  25 | elapsed:  2.2min remaining:  4.6min
[Parallel(n_jobs=15)]: Done  11 out of  25 | elapsed:  2.2min remaining:  2.9min
[Parallel(n_jobs=15)]: Done  14 out of  25 | elapsed:  2.3min remaining:  1.8min
[Parallel(n_jobs=15)]: Done  17 out of  25 | elapsed:  3.5min remaining:  1.6min
[Parallel(n_jobs=15)]: Done  20 out of  25 | elapsed:  3.5min remaining:   52.9s
[Parallel(n_jobs=15)]: Done  23 out of  25 | elapsed:  3.6min remaining:   18.6s
[Parallel(n_jobs=15)]: Done  25 out of  25 | elapsed:  3.6min finished
Final winner from Phase 2 (all imputations):
  Trial ID: 55
  Multi-Horizon C-Index: 0.7622
  IBS: 0.0177
  Global C-Index: 0.7459
  Distance to Ideal: 0.2385
  C-index SD across imputations: 0.0009
  IBS SD across imputations: 0.0000
  Stratification modes: dual|dual|dual|dual|dual
Winner hyperparameters:
{'colsample_bytree': np.float64(0.6168899316303945), 'gamma': np.float64(0.9822857777079363), 'learning_rate': np.float64(0.03096732484594198), 'max_depth': np.int64(6), 'min_child_weight': np.int64(15), 'reg_alpha': np.float64(0.01370823890516611), 'reg_lambda': np.float64(2.536795457202958), 'subsample': np.float64(0.7787433700020623)}
Saved: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\Death_Pareto_Phase1_20260305_2039_mar26.csv
Saved: G:\My Drive\Alvacast\SISTRAT 2023\cons\_out\Death_Pareto_Phase2_20260305_2039_mar26.csv
Time taken: 3.60 minutes
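The Distance_to_Ideal ranking used above can be verified by hand. A minimal sketch recomputing it from the winner's (trial 55) Phase 2 mean C-index and IBS, as printed in the output above:

```python
import numpy as np

# Equal-weight Euclidean distance to the ideal point (C = 1, IBS = 0);
# a lower distance means a better compromise on the Pareto front.
c_index = 0.762197   # Phase2_Multi_Horizon_C_Index for trial 55
ibs = 0.017730       # Phase2_IBS for trial 55
distance = float(np.sqrt((1.0 - c_index) ** 2 + ibs ** 2))
print(f"{distance:.6f}")  # 0.238463, matching the Distance_to_Ideal column
```

Because the IBS is an order of magnitude smaller than 1 − C, the C-index term dominates this rule; the SD-based tie-breakers in the sort only matter when distances are nearly equal.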
Code
show_scrollable_df(df_phase2)
trial_id Phase2_Multi_Horizon_C_Index Phase2_IBS Phase2_Global_C_Index C_Index_SD_across_imputations IBS_SD_across_imputations Global_C_SD_across_imputations Strat_Modes learning_rate max_depth min_child_weight subsample colsample_bytree reg_alpha reg_lambda gamma Distance_to_Ideal
55 0.762197 0.017730 0.745860 0.000859 0.000007 0.000883 dual|dual|dual|dual|dual 0.030967 6 15 0.778743 0.616890 0.013708 2.536795 0.982286 0.238463
75 0.762089 0.017703 0.747055 0.000653 0.000006 0.000876 dual|dual|dual|dual|dual 0.045269 4 30 0.779111 0.635094 0.004008 4.227691 0.158302 0.238568
46 0.761746 0.017702 0.746341 0.000435 0.000007 0.000832 dual|dual|dual|dual|dual 0.038173 3 14 0.708255 0.552737 0.216076 3.956397 0.000897 0.238911
95 0.761729 0.017733 0.746544 0.001330 0.000008 0.000649 dual|dual|dual|dual|dual 0.030967 6 15 0.778743 0.616890 0.530266 2.536795 0.982286 0.238930
35 0.761695 0.017713 0.747041 0.000641 0.000005 0.000932 dual|dual|dual|dual|dual 0.028376 6 30 0.792929 0.491632 1.406209 2.548292 0.369767 0.238962
Code
from IPython.display import display, HTML

html_content = """
<div style="font-family: Arial; line-height: 1.6;">

<h2>📊 Pareto Front Analysis (Death Outcome)</h2>

<table style="border-collapse: collapse; width: 100%; font-size: 14px;">
<thead>
<tr style="background-color:#f5f5f5;">
<th style="border:1px solid #ccc; padding:8px;">Component</th>
<th style="border:1px solid #ccc; padding:8px;">Trial 55 (Phase 2 Winner)</th>
<th style="border:1px solid #ccc; padding:8px;">Interpretation</th>
</tr>
</thead>
<tbody>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>Multi-Horizon C-Index</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.7622</td>
<td style="border:1px solid #ccc; padding:8px;">
Strong time-specific discrimination across clinically relevant horizons (3–60 months), evaluated in Phase 2 across imputations.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>Integrated Brier Score (IBS)</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.0177</td>
<td style="border:1px solid #ccc; padding:8px;">
Very low average probabilistic prediction error across the selected horizons. This supports strong overall predictive quality, though IBS alone is not a standalone proof of calibration.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>Global C-Index</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.7459</td>
<td style="border:1px solid #ccc; padding:8px;">
Robust overall discrimination across full follow-up. Its lower value vs multi-horizon C-index suggests discrimination is not uniform across all time windows.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>C-Index SD (Across Imputations)</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.0009</td>
<td style="border:1px solid #ccc; padding:8px;">
Very low internal variability across imputations, indicating stable model ranking performance under missing-data uncertainty.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>Distance to Ideal (C=1, IBS=0)</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.2385</td>
<td style="border:1px solid #ccc; padding:8px;">
Best compromise point on the Pareto set under the chosen equal-weight Euclidean distance rule.
</td>
</tr>

</tbody>
</table>

<br>

<h3>🧠 Hyperparameter Robustness Interpretation (Trial 55)</h3>

<table style="border-collapse: collapse; width: 100%; font-size: 14px;">
<thead>
<tr style="background-color:#f5f5f5;">
<th style="border:1px solid #ccc; padding:8px;">Hyperparameter</th>
<th style="border:1px solid #ccc; padding:8px;">Value</th>
<th style="border:1px solid #ccc; padding:8px;">Statistical Meaning</th>
</tr>
</thead>
<tbody>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>learning_rate</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.031</td>
<td style="border:1px solid #ccc; padding:8px;">
Moderate shrinkage (η ≈ 0.03) balancing learning speed and stability; faster convergence than conservative rates (0.01) while controlling overfitting risk.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>max_depth</b></td>
<td style="border:1px solid #ccc; padding:8px;">6</td>
<td style="border:1px solid #ccc; padding:8px;">
Deeper trees (vs. shallow configurations) allowing moderate interaction complexity between predictors, constrained by strong regularization to prevent overfitting.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>min_child_weight</b></td>
<td style="border:1px solid #ccc; padding:8px;">15</td>
<td style="border:1px solid #ccc; padding:8px;">
Minimum Hessian weight of 15 required for splits; provides moderate protection against unstable partitioning in low-event regions without being overly restrictive.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>subsample</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.779</td>
<td style="border:1px solid #ccc; padding:8px;">
Stochastic row subsampling (~78% per tree) reducing variance and improving robustness against outliers while maintaining sufficient data exposure.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>colsample_bytree</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.617</td>
<td style="border:1px solid #ccc; padding:8px;">
Feature subsampling (~62% of predictors per tree), promoting ensemble diversity and reducing dominance of high-cardinality variables.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>reg_alpha (L1)</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.014</td>
<td style="border:1px solid #ccc; padding:8px;">
Minimal L1 regularization (α ≈ 0.01); effectively neutral on sparsity induction, allowing near-unconstrained leaf weight estimation within L2 constraints.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>reg_lambda (L2)</b></td>
<td style="border:1px solid #ccc; padding:8px;">2.54</td>
<td style="border:1px solid #ccc; padding:8px;">
Moderate-to-strong L2 regularization (λ ≈ 2.5) on leaf weights, providing primary defense against overfitting through ridge-style shrinkage and numerical stability.
</td>
</tr>

<tr>
<td style="border:1px solid #ccc; padding:8px;"><b>gamma</b></td>
<td style="border:1px solid #ccc; padding:8px;">0.98</td>
<td style="border:1px solid #ccc; padding:8px;">
High minimum gain threshold (γ ≈ 1.0) for splitting; enforces conservative tree growth by requiring substantial loss reduction, counterbalancing the deeper max_depth.
</td>
</tr>

</tbody>
</table>

<br>

<div style="background-color:#f9f9f9; padding:12px; border-left:4px solid #4CAF50;">
<p style="margin:0;"><b>Overall Interpretation:</b> Trial 55 represents a <b>complexity-constrained</b> configuration: deeper trees (depth 6) are tightly controlled by high gamma (0.98) and strong L2 regularization (λ=2.54), while minimal L1 (α≈0) preserves predictor inclusion. This trade-off achieves high discrimination (C=0.76) with very low Brier error (IBS=0.018) and excellent stability across imputations (SD=0.0009). The model leverages feature subsampling (62%) and row subsampling (78%) for ensemble diversity, making it robust for low-event-rate mortality prediction.</p>
</div>

</div>
"""

display(HTML(html_content))

📊 Pareto Front Analysis (Death Outcome)

Component Trial 55 (Phase 2 Winner) Interpretation
Multi-Horizon C-Index 0.7622 Strong time-specific discrimination across clinically relevant horizons (3–60 months), evaluated in Phase 2 across imputations.
Integrated Brier Score (IBS) 0.0177 Very low average probabilistic prediction error across the selected horizons. This supports strong overall predictive quality, though IBS alone is not a standalone proof of calibration.
Global C-Index 0.7459 Robust overall discrimination across full follow-up. Its lower value vs multi-horizon C-index suggests discrimination is not uniform across all time windows.
C-Index SD (Across Imputations) 0.0009 Very low internal variability across imputations, indicating stable model ranking performance under missing-data uncertainty.
Distance to Ideal (C=1, IBS=0) 0.2385 Best compromise point on the Pareto set under the chosen equal-weight Euclidean distance rule.

🧠 Hyperparameter Robustness Interpretation (Trial 55)

Hyperparameter Value Statistical Meaning
learning_rate 0.031 Moderate shrinkage (η ≈ 0.03) balancing learning speed and stability; faster convergence than conservative rates (0.01) while controlling overfitting risk.
max_depth 6 Deeper trees (vs. shallow configurations) allowing moderate interaction complexity between predictors, constrained by strong regularization to prevent overfitting.
min_child_weight 15 Minimum Hessian weight of 15 required for splits; provides moderate protection against unstable partitioning in low-event regions without being overly restrictive.
subsample 0.779 Stochastic row subsampling (~78% per tree) reducing variance and improving robustness against outliers while maintaining sufficient data exposure.
colsample_bytree 0.617 Feature subsampling (~62% of predictors per tree), promoting ensemble diversity and reducing dominance of high-cardinality variables.
reg_alpha (L1) 0.014 Minimal L1 regularization (α ≈ 0.01); effectively neutral on sparsity induction, allowing near-unconstrained leaf weight estimation within L2 constraints.
reg_lambda (L2) 2.54 Moderate-to-strong L2 regularization (λ ≈ 2.5) on leaf weights, providing primary defense against overfitting through ridge-style shrinkage and numerical stability.
gamma 0.98 High minimum gain threshold (γ ≈ 1.0) for splitting; enforces conservative tree growth by requiring substantial loss reduction, counterbalancing the deeper max_depth.

Overall Interpretation: Trial 55 represents a complexity-constrained configuration: deeper trees (depth 6) are tightly controlled by high gamma (0.98) and strong L2 regularization (λ=2.54), while minimal L1 (α≈0) preserves predictor inclusion. This trade-off achieves high discrimination (C=0.76) with very low Brier error (IBS=0.018) and excellent stability across imputations (SD=0.0009). The model leverages feature subsampling (62%) and row subsampling (78%) for ensemble diversity, making it robust for low-event-rate mortality prediction.

Optimism correction

🔟 Take-home messages (what the code does)

  • Implements Harrell’s bootstrap optimism correction.
  • Uses the final tuned XGBoost Cox model (Trial 55).
  • Estimates apparent C-index on full dataset.
  • Determines optimal boosting rounds via early stopping.
  • Trains final baseline model on 100% of data.
  • Runs 500 bootstrap resamples in parallel.
  • Retrains model inside each bootstrap sample.
  • Computes performance on bootstrap and original data.
  • Calculates optimism = apparent_boot − test_original.
  • Reports optimism-corrected C-index for internal validation.

🧩 Assumptions (5 key ones)

  • Bootstrap samples approximate the data-generating process.
  • Model structure and hyperparameters are fixed.
  • C-index is appropriate performance metric.
  • IPCW assumptions hold for censoring mechanism.
  • Sample size is large enough for stable bootstrap estimates.
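The steps above can be condensed into a small numerical sketch. The values here are toy numbers chosen only to mimic the reported magnitudes (they are not study data; the seed 2125 is reused purely for reproducibility of the sketch):

```python
import numpy as np

# Harrell's optimism correction, schematically:
# optimism_b = C(model_b on its own bootstrap sample) - C(model_b on original data)
# corrected  = apparent C-index - mean(optimism)
rng = np.random.default_rng(2125)
c_apparent = 0.80                                # apparent C on the full data (toy)
c_boot_app = rng.normal(0.82, 0.01, size=500)    # C on each bootstrap sample (toy)
c_boot_test = rng.normal(0.775, 0.01, size=500)  # same models tested on original data (toy)
optimism = c_boot_app - c_boot_test              # per-bootstrap optimism
c_corrected = c_apparent - optimism.mean()       # optimism-corrected estimate
lo, hi = np.quantile(c_apparent - optimism, [0.025, 0.975])  # percentile 95% CI
print(round(c_corrected, 3), round(lo, 3), round(hi, 3))
```

The real implementation below differs only in how each C value is obtained (an XGBoost Cox model refit per bootstrap, scored with IPCW concordance and a censored-concordance fallback).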
Code
# @title Harrell's Bootstrap Optimism Correction (Death, Parallelized, CPU-2, with 95% CI)
import numpy as np
import pandas as pd
import xgboost as xgb
import os
import gc
import time
import warnings
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from joblib import Parallel, delayed
from sksurv.metrics import concordance_index_ipcw, concordance_index_censored

start_time = time.time()
warnings.filterwarnings("ignore")

if "nb_print" not in globals():
    def nb_print(*args, **kwargs):
        print(*args, **kwargs)

SEED = 2125
B_ITERATIONS = 500
ALPHA = 0.05  # 95% CI
N_CORES = max(1, (os.cpu_count() or 2) - 2)

nb_print("Initializing Parallel Harrell's Bootstrap Optimism Correction for Death...")
nb_print(f"Parallel Execution Configured: Using {N_CORES} CPU cores.")

# --- 1) DATA SETUP ---
try:
    df_tune = imputations_list_mar26[0].copy()
    y_tune_struct = y_surv_death_list[0]

    assert len(df_tune) == len(y_tune_struct), (
        f"X/y mismatch: df_tune={len(df_tune)}, y_tune_struct={len(y_tune_struct)}"
    )
except Exception as e:
    raise ValueError(f"Data Error: {e}. Please ensure mar26 death structures are loaded.")

y_xgb_label = np.where(y_tune_struct["event"], y_tune_struct["time"], -y_tune_struct["time"])

# --- 2) STRATIFICATION HELPERS (dual + fallback for split only) ---
def get_plan_labels(df):
    labels = np.zeros(len(df), dtype=np.int32)
    if "plan_type_corr_pg_pr" in df.columns: labels[df["plan_type_corr_pg_pr"] == 1] = 1
    if "plan_type_corr_m_pr" in df.columns: labels[df["plan_type_corr_m_pr"] == 1] = 2
    if "plan_type_corr_pg_pai" in df.columns: labels[df["plan_type_corr_pg_pai"] == 1] = 3
    if "plan_type_corr_m_pai" in df.columns: labels[df["plan_type_corr_m_pai"] == 1] = 4
    return labels

def get_dual_labels(df, y_struct):
    return get_plan_labels(df) * 10 + np.asarray(y_struct["event"]).astype(np.int32)

def _strat_ok(labels, n_splits=5):
    u, c = np.unique(labels, return_counts=True)
    return len(u) > 1 and c.min() >= n_splits

def pick_split_labels(df, y_struct, n_splits=5):
    dual = get_dual_labels(df, y_struct)
    if _strat_ok(dual, n_splits):
        return dual, "dual"
    plan = get_plan_labels(df)
    if _strat_ok(plan, n_splits):
        nb_print("[Fallback triggered] dual -> plan_only for train/val split")
        return plan, "plan_only"
    evt = np.asarray(y_struct["event"]).astype(np.int32)
    if _strat_ok(evt, n_splits):
        nb_print("[Fallback triggered] dual -> event_only for train/val split")
        return evt, "event_only"
    nb_print("[Fallback triggered] dual -> unstratified split")
    return None, "none"

split_labels, split_mode = pick_split_labels(df_tune, y_tune_struct, n_splits=5)

# --- 3) WINNER HYPERPARAMETERS (use Phase 2 exact if available; fallback = hardcoded Trial 55 values) ---
if "df_phase2" in locals() and isinstance(df_phase2, pd.DataFrame) and len(df_phase2) > 0:
    w = df_phase2.iloc[0]
    params_winner = {
        "objective": "survival:cox",
        "eval_metric": "cox-nloglik",
        "tree_method": "hist",
        "device": "cpu",
        "verbosity": 0,
        "seed": SEED,
        "learning_rate": float(w["learning_rate"]),
        "max_depth": int(w["max_depth"]),
        "min_child_weight": int(w["min_child_weight"]),
        "subsample": float(w["subsample"]),
        "colsample_bytree": float(w["colsample_bytree"]),
        "reg_alpha": float(w["reg_alpha"]),
        "reg_lambda": float(w["reg_lambda"]),
        "gamma": float(w["gamma"]),
    }
    nb_print("Using exact winner hyperparameters from df_phase2 (row 0).")
else:
    params_winner = {
        "objective": "survival:cox",
        "eval_metric": "cox-nloglik",
        "tree_method": "hist",
        "device": "cpu",
        "verbosity": 0,
        "seed": SEED,
        "learning_rate": 0.03096732484594198,
        "max_depth": 6,
        "min_child_weight": 15,
        "subsample": 0.7787433700020623,
        "colsample_bytree": 0.6168899316303945,
        "reg_alpha": 0.01370823890516611,
        "reg_lambda": 2.536795457202958,
        "gamma": 0.9822857777079363,
    }
    nb_print("Using hardcoded Phase-2 winner hyperparameters (Trial 55).")

# --- 4) APPARENT PERFORMANCE ON ORIGINAL DATA ---
nb_print("Calculating apparent performance on the original full dataset...")

params_initial = params_winner.copy()
params_initial["nthread"] = N_CORES

split_kwargs = dict(test_size=0.2, random_state=SEED)
if split_labels is not None:
    split_kwargs["stratify"] = split_labels

X_train_app, X_val_app, y_train_xgb_app, y_val_xgb_app = train_test_split(
    df_tune, y_xgb_label, **split_kwargs
)

dtrain_app = xgb.DMatrix(X_train_app, label=y_train_xgb_app)
dval_app = xgb.DMatrix(X_val_app, label=y_val_xgb_app)

temp_model = xgb.train(
    params_initial,
    dtrain_app,
    num_boost_round=2000,
    evals=[(dval_app, "val")],
    early_stopping_rounds=30,
    verbose_eval=False
)

if getattr(temp_model, "best_iteration", None) is None or temp_model.best_iteration < 0:
    optimal_boost_rounds = 2000
else:
    optimal_boost_rounds = int(temp_model.best_iteration) + 1

nb_print(f"Optimal boosting rounds determined: {optimal_boost_rounds}")

dorig = xgb.DMatrix(df_tune, label=y_xgb_label)
baseline_model = xgb.train(
    params_initial,
    dorig,
    num_boost_round=optimal_boost_rounds,
    verbose_eval=False
)

risk_orig = baseline_model.predict(dorig)
try:
    c_apparent_orig = float(concordance_index_ipcw(y_tune_struct, y_tune_struct, risk_orig)[0])
except Exception:
    c_apparent_orig = float(
        concordance_index_censored(
            np.asarray(y_tune_struct["event"]),
            np.asarray(y_tune_struct["time"]),
            risk_orig
        )[0]
    )

nb_print(f"Baseline Apparent Global C-index: {c_apparent_orig:.4f}")

# --- 5) PARALLEL BOOTSTRAP WORKER ---
def parallel_bootstrap_worker(b, df_original, y_xgb_original, y_struct_orig, params, opt_rounds):
    boot_params = params.copy()
    boot_params["nthread"] = 1  # avoid CPU thrashing in parallel workers

    idx = np.arange(len(df_original))
    boot_idx = resample(idx, replace=True, n_samples=len(idx), random_state=SEED + b)

    X_boot = df_original.iloc[boot_idx]
    y_xgb_boot = y_xgb_original[boot_idx]
    y_struct_boot = y_struct_orig[boot_idx]

    dboot = xgb.DMatrix(X_boot, label=y_xgb_boot)
    dorig_local = xgb.DMatrix(df_original, label=y_xgb_original)

    model = xgb.train(
        boot_params,
        dboot,
        num_boost_round=opt_rounds,
        verbose_eval=False
    )

    risk_boot = model.predict(dboot)
    try:
        c_boot_app = float(concordance_index_ipcw(y_struct_boot, y_struct_boot, risk_boot)[0])
    except Exception:
        c_boot_app = float(
            concordance_index_censored(
                np.asarray(y_struct_boot["event"]),
                np.asarray(y_struct_boot["time"]),
                risk_boot
            )[0]
        )

    risk_test_orig = model.predict(dorig_local)
    try:
        c_boot_test = float(concordance_index_ipcw(y_struct_boot, y_struct_orig, risk_test_orig)[0])
    except Exception:
        c_boot_test = float(
            concordance_index_censored(
                np.asarray(y_struct_orig["event"]),
                np.asarray(y_struct_orig["time"]),
                risk_test_orig
            )[0]
        )

    optimism = c_boot_app - c_boot_test

    del model, dboot, dorig_local, risk_boot, risk_test_orig
    gc.collect()
    return optimism

# --- 6) RUN BOOTSTRAP ---
nb_print(f"\nLaunching {B_ITERATIONS} Parallel Bootstrap Iterations...")
optimism_values = Parallel(n_jobs=N_CORES, verbose=10)(
    delayed(parallel_bootstrap_worker)(
        b, df_tune, y_xgb_label, y_tune_struct, params_winner, optimal_boost_rounds
    )
    for b in range(B_ITERATIONS)
)

# --- 7) FINAL METRICS + 95% CI ---
optimism_values = np.asarray(optimism_values, dtype=float)
optimism_values = optimism_values[np.isfinite(optimism_values)]

if optimism_values.size == 0:
    raise ValueError("No valid bootstrap optimism values were produced.")

mean_optimism = float(np.mean(optimism_values))
c_index_corrected = float(c_apparent_orig - mean_optimism)

corrected_samples = c_apparent_orig - optimism_values
opt_ci_low, opt_ci_high = np.quantile(optimism_values, [ALPHA / 2, 1 - ALPHA / 2])
corr_ci_low, corr_ci_high = np.quantile(corrected_samples, [ALPHA / 2, 1 - ALPHA / 2])

nb_print("\n--------------------------------------------------")
nb_print("FINAL OPTIMISM-CORRECTED RESULTS (DEATH)")
nb_print("--------------------------------------------------")
nb_print(f"Split stratification mode                         : {split_mode}")
nb_print(f"Apparent C-Index (Original Data)                 : {c_apparent_orig:.4f}")
nb_print(f"Mean Optimism (from {optimism_values.size} boots): {mean_optimism:.4f}")
nb_print(f"Optimism 95% CI                                  : [{opt_ci_low:.4f}, {opt_ci_high:.4f}]")
nb_print(f"Optimism-Corrected C-Index                       : {c_index_corrected:.4f}")
nb_print(f"Corrected C-Index 95% CI                         : [{corr_ci_low:.4f}, {corr_ci_high:.4f}]")
nb_print("--------------------------------------------------")

# --- 8) EXPORT ---
os.makedirs("_out", exist_ok=True)
timestamp_str = pd.Timestamp.now().strftime("%Y%m%d_%H%M")

summary_df = pd.DataFrame({
    "Metric": ["Apparent_C_Index", "Mean_Optimism", "Corrected_C_Index"],
    "Value": [c_apparent_orig, mean_optimism, c_index_corrected],
    "CI_95_Lower": [np.nan, opt_ci_low, corr_ci_low],
    "CI_95_Upper": [np.nan, opt_ci_high, corr_ci_high],
    "Bootstrap_N": [np.nan, optimism_values.size, optimism_values.size],
    "Split_Strat_Mode": [split_mode, split_mode, split_mode]
})
summary_file = f"_out/XGB_Death_Bootstrap_Optimism_Results_{timestamp_str}_mar26.csv"
summary_df.to_csv(summary_file, index=False)

dist_df = pd.DataFrame({
    "optimism": optimism_values,
    "corrected_c_index_sample": corrected_samples
})
dist_file = f"_out/XGB_Death_Bootstrap_Optimism_Distribution_{timestamp_str}_mar26.csv"
dist_df.to_csv(dist_file, index=False)

nb_print(f"Results saved successfully to {summary_file}.")
nb_print(f"Bootstrap distribution saved to {dist_file}.")

elapsed_minutes = (time.time() - start_time) / 60
nb_print(f"Time taken: {elapsed_minutes:.2f} minutes")
Initializing Parallel Harrell's Bootstrap Optimism Correction for Death...
Parallel Execution Configured: Using 30 CPU cores.
Using exact winner hyperparameters from df_phase2 (row 0).
Calculating apparent performance on the original full dataset...
Optimal boosting rounds determined: 156
Baseline Apparent Global C-index: 0.8042
Launching 500 Parallel Bootstrap Iterations...
[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   1 tasks      | elapsed:   19.8s
[Parallel(n_jobs=30)]: Done  12 tasks      | elapsed:   22.7s
[Parallel(n_jobs=30)]: Done  25 tasks      | elapsed:   24.2s
[Parallel(n_jobs=30)]: Done  38 tasks      | elapsed:   39.0s
[Parallel(n_jobs=30)]: Done  53 tasks      | elapsed:   42.0s
[Parallel(n_jobs=30)]: Done  68 tasks      | elapsed:   54.2s
[Parallel(n_jobs=30)]: Done  85 tasks      | elapsed:  1.0min
[Parallel(n_jobs=30)]: Done 102 tasks      | elapsed:  1.2min
[Parallel(n_jobs=30)]: Done 121 tasks      | elapsed:  1.3min
[Parallel(n_jobs=30)]: Done 140 tasks      | elapsed:  1.5min
[Parallel(n_jobs=30)]: Done 161 tasks      | elapsed:  1.7min
[Parallel(n_jobs=30)]: Done 182 tasks      | elapsed:  2.0min
[Parallel(n_jobs=30)]: Done 205 tasks      | elapsed:  2.1min
[Parallel(n_jobs=30)]: Done 228 tasks      | elapsed:  2.4min
[Parallel(n_jobs=30)]: Done 253 tasks      | elapsed:  2.5min
[Parallel(n_jobs=30)]: Done 278 tasks      | elapsed:  2.8min
[Parallel(n_jobs=30)]: Done 305 tasks      | elapsed:  3.0min
[Parallel(n_jobs=30)]: Done 332 tasks      | elapsed:  3.3min
[Parallel(n_jobs=30)]: Done 361 tasks      | elapsed:  3.5min
[Parallel(n_jobs=30)]: Done 390 tasks      | elapsed:  3.7min
[Parallel(n_jobs=30)]: Done 421 tasks      | elapsed:  4.0min
[Parallel(n_jobs=30)]: Done 492 out of 500 | elapsed:  4.6min remaining:    4.4s
[Parallel(n_jobs=30)]: Done 500 out of 500 | elapsed:  4.6min finished
--------------------------------------------------
FINAL OPTIMISM-CORRECTED RESULTS (DEATH)
--------------------------------------------------
Split stratification mode                         : dual
Apparent C-Index (Original Data)                 : 0.8042
Mean Optimism (from 500 boots): 0.0457
Optimism 95% CI                                  : [0.0360, 0.0560]
Optimism-Corrected C-Index                       : 0.7584
Corrected C-Index 95% CI                         : [0.7482, 0.7681]
--------------------------------------------------
Results saved successfully to _out/XGB_Death_Bootstrap_Optimism_Results_20260305_2059_mar26.csv.
Bootstrap distribution saved to _out/XGB_Death_Bootstrap_Optimism_Distribution_20260305_2059_mar26.csv.
Time taken: 4.70 minutes
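As a quick arithmetic check on the printed results: because each corrected bootstrap sample is apparent minus optimism, the corrected CI bounds are simply the apparent C-index minus the flipped optimism CI bounds. Using the full-precision values reported in summary_df:

```python
# Reported values (full precision from the summary table above)
apparent = 0.804162
mean_optimism = 0.045736
opt_lo, opt_hi = 0.036018, 0.055972

corrected = apparent - mean_optimism
print(f"{corrected:.6f}")           # 0.758426 = corrected C-index
print(f"{apparent - opt_hi:.6f}")   # 0.748190 = corrected CI lower bound
print(f"{apparent - opt_lo:.6f}")   # 0.768144 = corrected CI upper bound
```

This identity holds exactly for percentile CIs taken from `c_apparent_orig - optimism_values`, which is how `corrected_samples` is defined in the code above.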
Code
from IPython.display import HTML, display
html_table = summary_df.to_html(index=True, escape=False)
scroll_box = f"""
<div style="max-height:600px; max-width:100%; overflow-y:auto; overflow-x:auto; border:1px solid #ddd; padding:6px;">
{html_table}
</div>
"""
display(HTML(scroll_box))
Metric Value CI_95_Lower CI_95_Upper Bootstrap_N Split_Strat_Mode
0 Apparent_C_Index 0.804162 NaN NaN NaN dual
1 Mean_Optimism 0.045736 0.036018 0.055972 500.0 dual
2 Corrected_C_Index 0.758426 0.748190 0.768144 500.0 dual