Useful Links
- Data reference from Borealis.
- Code Sample (Coming soon)
Background
Sepsis, a severe response to infection leading to organ dysfunction or death, is a leading cause of mortality in children, particularly in developing countries. In 2017, an estimated 48.9 million cases of sepsis were reported globally, with children accounting for over half of these cases. A staggering 85% of these cases occurred in low- and middle-income countries (LMICs). Many sepsis-related deaths could be prevented with early detection and timely treatment using simple, highly effective interventions like antimicrobials and fluid resuscitation. Predicting the risk of in-hospital mortality from admission data can help clinicians identify critically ill children and allocate limited resources efficiently. However, there is currently no recommended mortality prediction model for LMICs. The 2024 Pediatric Sepsis Data Challenge aims to address this gap by inviting participants to develop open-source algorithms, using a real-world data-informed synthetic dataset, to predict in-hospital mortality and length of stay in children with sepsis.
The Challenge Objective
Participants are asked to design a working, open-source algorithm to predict in-hospital mortality and length of stay using only the provided synthetically generated dataset. We are seeking a parsimonious model that provides the optimal trade-off between effort in data collection and performance in clinical risk prediction. Ideally, the model should be capable of running on a mobile device, considering environments with unreliable electrical supply and internet connectivity.
The Challenge Data
Challenge Data Source
The original data used to generate the real-world data (RWD)-informed synthetic training set available to participants were obtained from a prospective, multisite, observational cohort study of children with suspected sepsis, aged 6 to 60 months at admission, at hospitals in Uganda (9). The study enrolled participants from six hospitals; in total, these facilities have a catchment area covering 30 districts with a total population of approximately 8.2 million individuals, including approximately 1.4 million children younger than 5 years of age, which provides a reasonable representation of the national Ugandan population outside Kampala (2).
The study team enrolled children with suspected sepsis, defined as children admitted with a proven or suspected infection, as determined by the treating medical team. Children who resided outside the hospital catchment area, or who were admitted for a short-term (less than 24 hours) observation period, for trauma, or immediately after birth (i.e., without first being discharged home), were excluded.
At admission, trained study nurses systematically collected data (full dataset A) on clinical, social, and demographic variables. Clinical data included anthropometry (to determine malnutrition status), vital signs, simple laboratory parameters (glucose, malaria rapid diagnostic test [RDT], HIV RDT, hematocrit, lactate), clinical signs and symptoms, co-morbidities, and healthcare history, including previous hospital admissions. Social and demographic variables included maternal and household details.
Challenge Data Generation
For this challenge, we created an RWD-informed, synthetically generated training dataset to reduce the risk of re-identification in this highly vulnerable population. The synthetic training set was generated from a random subset of the original data (full dataset A) comprising 2686 records (70% of the total dataset; training dataset B). All challenge solutions will be evaluated against the remaining 1235 records (30% of the total dataset; test dataset C).
The synthetic dataset was created in R Statistical Software using the synthpop package (7). We used the non-parametric classification and regression tree (CART) method to synthesize all variables. Variables were synthesized sequentially: the first variable (the outcome, in-hospital mortality) was synthesized by sampling with replacement, and each subsequent variable was synthesized conditionally on all previously synthesized variables. Missing data (31% of all data cells) were also synthesized as part of this process, and rules were specified to indicate when a variable was conditionally missing based on the value of another variable. All direct identifiers were removed to reduce the risk of re-identification, and data collected at discharge or post-discharge were not included, as they should not be used to predict in-hospital mortality.

The univariate distributions of each variable, and the bivariate distributions of each predictor with the outcome variable, were compared between the synthetic dataset and the withheld test dataset C. We evaluated several measures of distributional divergence, including the maximum mean discrepancy (MMD; smaller values indicate more similar datasets) for continuous variables and the Kullback-Leibler (KL) divergence (normalized to between 0 and 1, where higher values indicate more similar datasets) for continuous and categorical variables, to quantify any statistical differences between the generated synthetic dataset and the RWD.
The full training dataset contains 148 variables, consisting of clinical, social, and laboratory values; not all variables are available in every record.
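Because roughly a third of the data cells are missing and not every variable appears in every record, a useful first step is to quantify missingness per variable. Below is a minimal sketch in Python, assuming the synthetic training set has been downloaded as a CSV file (the filename is hypothetical; use the actual file provided on Borealis).

```python
import pandas as pd

# Hypothetical filename; substitute the actual file provided on Borealis.
df = pd.read_csv("synthetic_training_data.csv")

print(f"{df.shape[0]} records, {df.shape[1]} variables")

# Fraction of missing values per variable, worst first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing.head(20))

# Overall proportion of missing cells (around 0.31 in the synthetic set).
print("overall missingness:", df.isna().to_numpy().mean())
```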
Synthetic Data Results
Univariable distributions between the synthetic data and the test dataset C were similar (10). The bivariate distributions of each predictor variable against the outcome, in-hospital mortality, were also similar, with some exceptions where a categorical level was poorly represented.
Divergence statistics suggested that the synthetic dataset and test dataset C were similar. The MMD between the synthetic dataset and test dataset C was 0.043; for reference, the MMD between the training dataset B and test dataset C was 0.037. The KL-divergence between the synthetic dataset and test dataset C was 0.905 and 0.985 for continuous and categorical variables, respectively; the KL-divergence between the training dataset B and test dataset C was 0.923 and 0.987 (10). See (10) for a complete description of the dataset distributions and evaluation statistics.
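To make these divergence measures concrete, the sketch below estimates a Gaussian-kernel MMD between two continuous samples and a smoothed, unnormalized KL divergence between two categorical samples in Python. The variable names, kernel bandwidth, and smoothing are illustrative assumptions only; this does not reproduce the exact procedure or the normalized KL statistic used to evaluate the challenge data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import entropy

def gaussian_mmd(x, y, bandwidth=1.0):
    """Squared maximum mean discrepancy with a Gaussian (RBF) kernel;
    smaller values indicate more similar samples."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    k = lambda a, b: np.exp(-cdist(a, b, "sqeuclidean") / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def categorical_kl(p_sample, q_sample):
    """KL divergence between the empirical category frequencies of two samples
    (add-one smoothing keeps unseen categories finite)."""
    cats = sorted(set(p_sample) | set(q_sample))
    p = np.array([list(p_sample).count(c) + 1 for c in cats], dtype=float)
    q = np.array([list(q_sample).count(c) + 1 for c in cats], dtype=float)
    return entropy(p / p.sum(), q / q.sum())

# Hypothetical usage with made-up samples standing in for one continuous
# and one categorical variable from the synthetic and test datasets.
rng = np.random.default_rng(0)
synthetic_values = rng.normal(120, 15, 500)
test_values = rng.normal(122, 16, 500)
print("MMD:", gaussian_mmd(synthetic_values, test_values))
print("KL :", categorical_kl(["yes", "no", "no", "yes"], ["no", "no", "yes", "no"]))
```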
How It Works
There are two phases for the Challenge: a preliminary phase and an active phase. The preliminary phase allows us to introduce and ‘beta test’ the data, scores, and submission system before the active phase begins. Participation in the preliminary phase is mandatory for taking part in the active phase, because it gives us the feedback needed to improve the active phase. The preliminary phase also gives teams a chance to get acquainted with the data and variables and to begin work on preliminary algorithms. Teams must submit at least 1 entry during the preliminary phase to be eligible to participate in the active phase.
During the active phase, teams will have 3 opportunities to submit an entry. After each round, teams are given their results and can submit again. Please note that the number of submission opportunities may increase; we will keep teams updated.
We require teams to provide both the training code and the corresponding trained model in their submissions. Any preprocessing or reclassification of the training data must be performed within the submitted training code, so that modifications to the training data are an integral, reproducible part of the process. Each team's training code will be run on the entire training dataset, and the resulting trained model will then be evaluated on the original hidden data (dataset A).
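As a rough illustration of what "training code plus trained model" might look like, the skeleton below exposes one function that fits and saves a model and one that loads it and produces mortality risk predictions. The function names, file layout, outcome column name, and model choice are assumptions for illustration only, not the official submission interface.

```python
# Hypothetical submission skeleton; names, signatures, and the outcome column
# are illustrative assumptions, not the official challenge interface.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

OUTCOME = "inhospital_mortality"  # assumed name of the outcome column

def train(training_csv: str, model_file: str) -> None:
    """Fit a simple model on the provided training data and save it to disk."""
    df = pd.read_csv(training_csv)
    X = df.drop(columns=[OUTCOME]).select_dtypes("number").fillna(0)
    model = LogisticRegression(max_iter=1000).fit(X, df[OUTCOME])
    joblib.dump((model, list(X.columns)), model_file)

def predict(model_file: str, data_csv: str):
    """Load the saved model and return predicted in-hospital mortality risk."""
    model, cols = joblib.load(model_file)
    X = pd.read_csv(data_csv).reindex(columns=cols, fill_value=0).fillna(0)
    return model.predict_proba(X)[:, 1]
```

In practice, any preprocessing used here (imputation, encoding, feature selection) should live inside the training code so that it is reproduced exactly when the code is re-run on the full training set.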
We request that teams submit their code in Python. Upon submission, we will download each team's code and execute it within a containerized environment on a Google Cloud virtual machine (VM) instance. A maximum of 72 hours is allocated for training, followed by an additional 24 hours for validation and testing. To assist teams, we will provide a baseline model implemented in Python. This baseline model is a random forest classifier that takes all available features as input; each categorical variable is expanded into a set of binary (0/1) indicator variables, one for each distinct value of that variable.
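The baseline described above corresponds to a standard one-hot encoding of categorical variables followed by a random forest. Here is a minimal sketch of that idea, using a toy data frame with made-up variable names rather than the real challenge features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frame standing in for the real training data (variable names are illustrative).
df = pd.DataFrame({
    "age_months":  [7, 24, 55, 12],
    "malaria_rdt": ["positive", "negative", "negative", "positive"],
    "hiv_rdt":     ["negative", "negative", "positive", "negative"],
    "died":        [0, 1, 0, 0],
})

y = df.pop("died")
# One-hot encode categorical variables into 0/1 indicator columns,
# mirroring how the baseline model represents them.
X = pd.get_dummies(df, columns=["malaria_rdt", "hiv_rdt"], dtype=int)
print(X.columns.tolist())
# ['age_months', 'malaria_rdt_negative', 'malaria_rdt_positive',
#  'hiv_rdt_negative', 'hiv_rdt_positive']

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # in-sample predicted mortality risk
```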
References
- Rudd KE, Johnson SC, Agesa KM, Shackelford KA, Tsoi D, Kievlan DR, et al. Global, regional, and national sepsis incidence and mortality, 1990-2017: analysis for the Global Burden of Disease Study. Lancet (London, England). 2020;395(10219):200-11.
- Vincent JL. The Clinical Challenge of Sepsis Identification and Monitoring. PLoS Med. 2016 May 17;13(5):e1002022. doi: 10.1371/journal.pmed.1002022. PMID: 27187803; PMCID: PMC4871479. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4871479/
- Seki T, Kawazoe Y, Ohe K. Machine learning-based prediction of in-hospital mortality using admission laboratory data: A retrospective, single-site study using electronic health record data. PLoS One. 2021 Feb 5;16(2):e0246640. doi: 10.1371/journal.pone.0246640. PMID: 33544775; PMCID: PMC7864463. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0246640
- Vogenberg FR. Predictive and prognostic models: implications for healthcare decision-making in a modern recession. Am Health Drug Benefits. 2009 Sep;2(6):218-22. PMID: 25126292; PMCID: PMC4106488. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4106488/
- Reinhart K, Daniels R, Kissoon N, Machado FR, Schachter RD, Finfer S. Recognizing Sepsis as a Global Health Priority - A WHO Resolution. N Engl J Med. 2017 Aug 3;377(5):414-417. doi: 10.1056/NEJMp1707170. Epub 2017 Jun 28. PMID: 28658587. https://www.nejm.org/doi/10.1056/NEJMp1707170
- Ogero M, Sarguta RJ, Malla L, et al. Prognostic models for predicting in-hospital paediatric mortality in resource-limited countries: a systematic review. BMJ Open. 2020;10:e035045. doi: 10.1136/bmjopen-2019-035045. https://bmjopen.bmj.com/content/10/10/e035045.full
- Nowok B, Raab GM, Dibben C (2016). “synthpop: Bespoke Creation of Synthetic Data in R.” Journal of Statistical Software, 74(11), 1–26. doi:10.18637/jss.v074.i11.
- Uganda Bureau of Statistics. Population and censuses. 2022. https://www.ubos.org/explore-statistics/20/ (accessed September 5, 2022).
- Wiens MO, Bone JN, Kumbakumba E, Businge S, Tagoola A, Sherine SO, Byaruhanga E, Ssemwanga E, Barigye C, Nsungwa J, Olaro C, Ansermino JM, Kissoon N, Singer J, Larson CP, Lavoie PM, Dunsmuir D, Moschovis PP, Novakowski S, Komugisha C, Tayebwa M, Mwesigwa D, Zhang C, Knappett M, West N, Nguyen V, Mugisha NK, Kabakyenga J. Mortality after hospital discharge among children younger than 5 years admitted with suspected sepsis in Uganda: a prospective, multisite, observational cohort study. Lancet Child Adolesc Health. 2023 May 11:S2352-4642(23)00052-4. doi: 10.1016/S2352-4642(23)00052-4. Epub ahead of print. PMID: 37182535.
- Nguyen V, Huxford C, Rafiei A, Wiens M, Ansermino JM, Kissoon N, Kamaleswaran R. Data Challenges: 2024 Pediatric Sepsis Challenge. Borealis. 2023 June 21. doi: 10.5683/SP3/TFAV36.