Workflow of data management and analysis - Chilean substance use treatment administrative data

Consolidation of Agreement 1 Databases from 2010 to 2024

Funded by FONDECYT Regular project 1191282

Authors and affiliations

Álvaro Castillo-Carniglia: Full professor, Departamento Nacional de Salud Pública, Facultad de Medicina, Universidad San Sebastián, and Núcleo Milenio para la Evaluación y Análisis de Políticas de Drogas

Andrés González-Santa Cruz: Ph.D. student, Public Health, UCH, Chile

Amaru Agüero Jiménez: Ph.D. student, Social Complexity Sciences, UDD, Chile

SISTRAT Datasets

This repository is organized into four main sections: data preparation, deduplication, predictive modeling, and documentation.


1. Data Preparation & Standardization

Core Datasets

  1. Data Preparation and Standardization of C1, TOP, Mortality & Hospitalizations

  2. Data Preparation and Standardization of C2 to C6


2. Data Cleaning & Deduplication (C1)

  1. Deduplication of C1 – Part 1 (Initial Processing)

  2. Deduplication of C1 – Part 2 (Advanced Matching Procedures)

  3. Deduplication of C1 – Part 3 (Resolution & Validation)

  4. Deduplication of C1 – Part 4 (Final Dataset Consolidation)


3. Predictive Modeling Pipeline

Database Formatting

  1. Descriptive Glimpse of the Database (Prediction Format)

  2. Translating survival analysis workflows from R to Python: data preprocessing

Machine Learning – XGBoost

  1. XGBoost – Discrimination-Based Model & SHAP Explanations

  2. XGBoost – Hyperparameter Tuning (Readmission)

  3. XGBoost – Hyperparameter Tuning (Death)


Penalized Survival Models

  1. Elastic Net Cox Proportional Hazards – Variable Importance & Performance Metrics

Deep Learning – DeepHit

  1. DeepHit – Hyperparameter Tuning Based on Discrimination

  2. DeepHit – Discrimination & SHAP Explanations


Deep Learning – DeepSurv

  1. DeepSurv – Hyperparameter Tuning Based on Discrimination

  2. DeepSurv – Discrimination & SHAP Explanations


Prediction & ML-informed survival modeling

  1. Step 2.1: Format and Export Databases for ML, Deep Learning Survival Models & Non-Proportional Hazards, Traditional Cox PH Baseline & Performance Evaluation

  2. Step 2.2: SHAP-Informed Models — ML-Enhanced Selection, Splines & Complex Interactions

  3. Step 2.3: Comparison of metrics of SHAP-Informed Models — ML-Enhanced Selection, Splines & Complex Interactions vs. Traditional Full PH‑adjusted Cox with stratification, DCA, threshold scenarios and threshold-dependent performance metrics


4. Documentation

  1. Codebook of C1

The main processes are summarized in the following figures.


Figure 1. Diagram of data preparation

Figure 2. Diagram of the thesis project

