Original research

A novel machine-learning risk prediction model for pan-cancer incidence: application in a large prospective cohort

Abstract

Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population.

Methods and analysis This prospective cohort study included 433 549 participants from the MJ cohort, comprising a male cohort (n=208 599) and a female cohort (n=224 950).

Results During an 8-year median follow-up, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost showed superior performance for both cohorts. The XGBoost model with all 155 features in males and 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females showed comparable performance: an AUC of 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years in the male cohort, and an AUC of 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years in the female cohort. High-risk individuals had at least a ninefold higher risk of pan-cancer incidence than low-risk individuals.

Conclusion We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the implementation of our risk model in clinical practice.

What is already known on this topic

  • Currently available cancer risk prediction models generally target one site-specific cancer at a time.

  • Most models have been developed using traditional biostatistical regression methods, which are less flexible, poorly suited to diverse data types and achieve only moderate performance with a limited set of predictors.

What this study adds

  • The study developed and internally validated machine-learning models that use routine health check-up data to predict the risk of developing any cancer in both men and women.

  • The machine-learning models achieved generally good discriminatory ability with a small set of predictors, matching the predictive performance of models with a large number of features.

How this study might affect research, practice or policy

  • This study can have important implications for cancer screening research, practice and policy. Integrating machine-learning techniques into cancer risk prediction using routine health check-up data enables more accurate risk stratification by modelling complex interactions between risk factors, thus helping to identify individuals at high risk of developing cancer and supporting early detection.

Introduction

Cancer has been a serious global health challenge for decades, and an estimated increase of 27.4% in cancer incidence and 31.3% in cancer deaths (approximately 13 million) is expected worldwide over the next decade.1–3 Early detection of cancer holds enormous promise for reducing the growing burden of cancer worldwide.4 Guideline-based cancer screenings have the potential to detect cancer early and reduce mortality, and have been recommended for major cancer types, such as lung, colorectal, breast, cervical and prostate cancer. However, cancer screening rates remain unsatisfactory and many barriers to improving them exist.5 In addition to cancer screening, cancer risk prediction models have been widely used in risk stratification and early detection by improving the accessibility and convenience of risk evaluation, enhancing the consistency and quality of clinical decisions and projecting both short-term and long-term risk with high precision and accuracy.6

Currently available guideline-based cancer screenings and cancer risk prediction models generally target one cancer type at a time, and the majority of cancer risk models were developed with traditional regression methods and achieved moderate prediction performance.7–14 One concern with single-cancer screening is that all-cause mortality shows no significant reduction even though cancer-specific mortality declines,15 a possible explanation being that populations eligible for a specific cancer screening also have a higher risk of other cancers.16 To improve the reduction in overall cancer mortality and cover a broad set of cancer types in one test, there is intense interest in developing blood-based multicancer early detection (MCED) tests, and some studies have shown promising results in terms of low false-positive rates and reductions in late-stage cancers and 5-year cancer deaths.17 18 Although many advances have been made, the widespread use of MCED tests for cancer screening still needs evaluation.19 On the other hand, it is desirable to have a simple multicancer risk prediction model that predicts the risk of developing any cancer, requires a minimal number of additional tests and can be readily used by primary care clinicians. Such a risk prediction model, accompanied by cancer-specific models that could guide the downstream screening modality, could facilitate early detection and intervention across a range of cancers, which is crucial for improving patient outcomes and survival rates.

Routine health check-ups are increasingly adopted worldwide; during these check-ups, individuals answer standardised questionnaires, undergo physical examinations and take laboratory tests. These health records, together with linkage to cancer registries, provide a unique opportunity to develop such a cancer risk prediction model. However, the longitudinal nature of these data brings a complex structure and high dimensionality, and traditional statistical models are generally not suited to such highly diverse and complex data types. Machine-learning (ML) models, on the other hand, have proven adept at inferring implicit data patterns and making accurate predictions from complex data. Although ML models have been used to predict the risk of specific cancers (skin, breast, liver and lung cancer),20–22 there are few health check-up data-based ML models that can predict the risk of developing any cancer.

In this study, we developed ML-based general cancer risk prediction models using routine health check-up records collected over decades in a large Asian population. We improved the models' interpretability by evaluating variable importance and by constructing light models with a minimal number of variables.

Methods

Data source

Our study population was from the MJ Cohort, a prospective cohort drawn from a self-paying medical screening programme conducted by the MJ Health Management Institution, Taiwan. The data consist of standardised clinic and questionnaire data collected between 1996 and 2008 from four geographically representative MJ locations across Taiwan.23 The questionnaire dataset includes demographic, medical history, lifestyle, diet, and personal and family disease history information updated at every visit, and the clinic dataset includes a series of medical test results including, but not limited to, blood tests, urine tests, body measurements, functional tests and physical examinations.23 Details of the data collection process have been published previously.23–26

Study population

We included 441 648 MJ cohort participants who were enrolled between 1996 and 2007 and were aged ≥20 years at their first visit. A total of 8099 participants were excluded based on the following criteria: a cancer history at recruitment (n=6749), invalid follow-up information (n=1138) or less than 1 year of follow-up (n=212). The final cohort consisted of 433 549 participants, among whom 9907 incident cancer cases were identified during follow-up.

Considering sex disparities in the incidence of cancer,27 we split the overall cohort by sex. The male cohort consisted of 208 599 participants (48.1%) and the female cohort of 224 950 participants (51.9%). Within each cohort, we randomly split the data into a training cohort (80%) and a validation cohort (20%). The male training cohort comprised 166 878 participants, of whom 4114 were diagnosed with cancer during follow-up, and the male validation cohort comprised 41 721 participants, of whom 1029 were diagnosed with cancer during follow-up. The female training cohort comprised 179 959 participants, with 3811 diagnosed with cancer during follow-up, and the female validation cohort comprised 44 991 participants, with 953 diagnosed with cancer during follow-up.

Ascertainment and follow-up of cancer

The primary outcome of interest was cancer incidence during the follow-up period after the baseline visit. Cancer cases were identified by linking each participant's unique identification number to the Taiwan Cancer Registry File and the Taiwan Death File and were classified according to International Classification of Diseases, Ninth Revision codes.23 Follow-up started at the date of the participant's baseline enrolment and ended at the date of cancer diagnosis, the date of death or the end of cohort follow-up (31 December 2007), whichever came first.
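As a minimal illustration of this follow-up definition, the sketch below derives each participant's time-to-event and censoring status in R; the data frame and column names (cohort, enrol_date, dx_date, death_date) are hypothetical and are not the authors' code.

```r
# Sketch of deriving follow-up time and cancer status; column names are hypothetical
library(dplyr)

end_of_followup <- as.Date("2007-12-31")

cohort <- cohort %>%
  mutate(
    exit_date = pmin(dx_date, death_date, end_of_followup, na.rm = TRUE),  # earliest of the three dates
    status    = as.integer(!is.na(dx_date) & dx_date <= exit_date),        # 1 = incident cancer
    time_yrs  = as.numeric(difftime(exit_date, enrol_date, units = "days")) / 365.25
  )
```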

Smoking-related cancers included oral, oesophageal, lung, stomach, liver, cervical, bladder and colorectal cancers, a subset of the smoking-related cancers defined in 'The Health Consequences of Smoking-50 Years of Progress—A Report of the Surgeon General' after matching with the cancer types available in the MJ cohort.28 Obesity-related cancers included oesophageal, cervical, ovarian, thyroid, breast, liver, stomach and colorectal cancers.29

Candidate predictors and data preprocessing

We included 99 questionnaire-based variables and 90 medical test-based variables. These 189 features covered demographic characteristics (2 variables), lifestyle (23 variables), personal health history (52 variables), personal medication history (8 variables), family history of cancer and other diseases (14 variables), and medical tests including blood (19 variables), urine (16 variables), functional tests (36 variables) and physical examination (19 variables). Of these, we removed 27 variables with missingness greater than 20% in the sex-specific cohorts. We further removed one variable from the female cohort because of its limited variability across participants. This resulted in 155 candidate variables for the male cohort and 160 candidate variables for the female cohort. We then imputed missing values by random sampling from non-missing values with the R package mice in each cohort.30 To avoid data leakage, the imputation was performed after splitting the dataset into training and validation cohorts. Prior to model training and prediction, we applied the Yeo-Johnson power transformation and centre-and-scale standardisation to all features with the R package caret.31
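A minimal sketch of this preprocessing is shown below, assuming the training and validation feature tables are available as data frames X_train and X_valid (hypothetical names); the mice method "sample" draws imputations at random from the observed values, as described above.

```r
# Sketch of the imputation and feature transformation steps; object names are hypothetical
library(mice)
library(caret)

# Impute by random sampling from non-missing values, separately in each split to avoid leakage
X_train_imp <- complete(mice(X_train, m = 1, method = "sample", seed = 1))
X_valid_imp <- complete(mice(X_valid, m = 1, method = "sample", seed = 1))

# Yeo-Johnson power transformation plus centring and scaling, learned on the training data only
pp <- preProcess(X_train_imp, method = c("YeoJohnson", "center", "scale"))
X_train_pp <- predict(pp, X_train_imp)
X_valid_pp <- predict(pp, X_valid_imp)
```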

Model development

Online supplemental figure 1 depicts the development of the ML-based models for cancer risk prediction. The code to implement the models is provided in online supplemental file 1.

Reference model

Following the conventional approach, we constructed multivariable Cox proportional hazards models including age and selected lifestyle variables in each cohort. The included variables were smoking status, drinking, physical activity, vegetable intake, fruit intake and waist-to-hip ratio.
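For reference, a minimal sketch of this model in R (using the survival package, with hypothetical variable and data frame names) might look as follows.

```r
# Sketch of the reference multivariable Cox proportional hazards model; variable names are hypothetical
library(survival)

ref_fit <- coxph(
  Surv(time_yrs, status) ~ age + smoking_status + drinking + physical_activity +
    vegetable_intake + fruit_intake + waist_hip_ratio,
  data = train_male   # fitted separately in the male and female training cohorts
)
summary(ref_fit)
```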

Full model

We trained ML-based survival models to predict cancer risk in the male and female cohorts separately. We implemented three ML survival prediction models, Lasso-Cox, Random Survival Forests (RSF) and eXtreme Gradient Boosting (XGBoost), in the training set with fivefold cross-validation to tune the model hyperparameters. The R packages glmnet, randomForestSRC and xgboost were used.
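The sketch below illustrates how the three full models could be fitted with these packages; the hyperparameter values, object names and predictor matrix are illustrative assumptions rather than the authors' exact settings.

```r
# Sketch of the three full survival models; tuning values and object names are illustrative
library(survival)
library(glmnet)
library(randomForestSRC)
library(xgboost)

x <- as.matrix(train[, predictor_cols])                     # preprocessed candidate predictors
y <- cbind(time = train$time_yrs, status = train$status)    # Cox response for glmnet

# Lasso-Cox: penalty selected by fivefold cross-validation
cv_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 5)

# Random Survival Forest
rsf_fit <- rfsrc(Surv(time_yrs, status) ~ .,
                 data = train[, c("time_yrs", "status", predictor_cols)], ntree = 500)

# XGBoost with a Cox objective: the label is survival time, negated for censored records
dtrain  <- xgb.DMatrix(x, label = ifelse(train$status == 1, train$time_yrs, -train$time_yrs))
xgb_fit <- xgb.train(params = list(objective = "survival:cox", eval_metric = "cox-nloglik",
                                   eta = 0.05, max_depth = 4),
                     data = dtrain, nrounds = 300)
```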

Light model

Among the 155 candidate variables for the male cohort and 160 candidate variables for the female cohort, we searched for a minimum set of variables in three steps. First, for each of the full models (Lasso-Cox, RSF and XGBoost), we selected the top 50 variables by averaging the variable importance across the fivefold cross-validation. Second, we counted the number of models in which each variable appeared and retained only the variables selected by at least two models. Third, we ordered the retained variables by the sum of their rankings within each model and evaluated how the area under the curve (AUC) varied as variables were added in that order; a sketch of this consensus selection is given below. Predictors in the Lasso-Cox, RSF and XGBoost models were all treated as continuous variables for training and validation.
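In the sketch, the fold-averaged importance tables (imp_lasso, imp_rsf, imp_xgb, each with columns feature and importance) are assumed objects and not the authors' code.

```r
# Sketch of the consensus variable selection across the three full models
top50 <- lapply(list(lasso = imp_lasso, rsf = imp_rsf, xgb = imp_xgb), function(d) {
  d <- d[order(-d$importance), ]
  d$rank <- seq_len(nrow(d))
  head(d, 50)                                       # top 50 variables per model
})

all_feats <- unique(unlist(lapply(top50, `[[`, "feature")))
n_models  <- sapply(all_feats,
                    function(f) sum(sapply(top50, function(d) f %in% d$feature)))
keep <- all_feats[n_models >= 2]                    # keep variables selected by >= 2 models

# Order retained variables by the sum of their within-model ranks (51 if absent from a model)
rank_sum <- sapply(keep, function(f)
  sum(sapply(top50, function(d) if (f %in% d$feature) d$rank[d$feature == f] else 51)))
keep_ordered <- keep[order(rank_sum)]
```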

Statistical analysis

Model performance

The performance of each model was assessed in terms of discriminatory accuracy and calibration in the validation set. For discriminatory accuracy, we compared Harrell's concordance index (C-index) and the time-dependent AUC of all three models. The 95% CIs of the performance metrics were calculated from 100 bootstrap resamplings. Calibration was evaluated as the agreement between observed and predicted survival probabilities at 3, 5 and 10 years. For each model, we ranked the predicted individual risks, divided participants into 10 groups according to deciles of predicted risk and computed the mean survival probability for each group. Calibration curves were assessed visually by plotting the predicted probabilities on the x-axis against the corresponding observed probabilities on the y-axis.
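A minimal sketch of these metrics is shown below; the validation data frame valid is assumed to hold time_yrs, status and a column risk of predicted risk scores (eg, from predict() on a fitted model), and the survival and timeROC packages are used for illustration only.

```r
# Sketch of discrimination and calibration metrics in the validation cohort; `valid` is hypothetical
library(survival)
library(timeROC)

# Harrell's C-index (reverse = TRUE because higher risk scores imply shorter time to cancer)
cindex <- concordance(Surv(time_yrs, status) ~ risk, data = valid, reverse = TRUE)$concordance

# Time-dependent AUC at 3, 5 and 10 years
td_auc <- timeROC(T = valid$time_yrs, delta = valid$status, marker = valid$risk,
                  cause = 1, times = c(3, 5, 10))$AUC

# Calibration at 5 years: observed risk within deciles of predicted risk
valid$decile <- cut(valid$risk, quantile(valid$risk, probs = seq(0, 1, 0.1)),
                    include.lowest = TRUE, labels = FALSE)
obs_risk_5y <- sapply(split(valid, valid$decile), function(g)
  1 - summary(survfit(Surv(time_yrs, status) ~ 1, data = g), times = 5, extend = TRUE)$surv)
```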

Sensitivity analysis

We assessed the robustness of our findings by evaluating model performance in (1) subgroups of participants stratified by cancer type; (2) participants with complete data and no imputation; (3) participants with at least 3 years of follow-up and (4) participants aged ≥40 years.

Risk stratification

We divided participants into 10 groups according to their projected 3-year, 5-year and 10-year risks of cancer and compared these with their observed cancer risks. To further illustrate how our models deliver personalised predictions to individuals with various risk profiles, we applied the light models to predict the 10-year risk of developing any cancer for hypothetical examples covering a range of risk profiles.
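As a hedged illustration of this stratification, the sketch below contrasts the highest and lowest deciles of predicted risk with a Cox model; it reuses the hypothetical valid data frame and risk column from the previous sketch.

```r
# Sketch of risk stratification: hazard ratio of the highest vs lowest predicted-risk decile
library(survival)

valid$risk_group <- cut(valid$risk, quantile(valid$risk, probs = seq(0, 1, 0.1)),
                        include.lowest = TRUE, labels = FALSE)
extremes <- subset(valid, risk_group %in% c(1, 10))
extremes$high_risk <- as.integer(extremes$risk_group == 10)

hr_fit <- coxph(Surv(time_yrs, status) ~ high_risk, data = extremes)
summary(hr_fit)    # HR and 95% CI for the highest vs lowest decile
```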

Patient and public involvement

Patients or the public were not involved in the design, or conduct, or reporting, or dissemination plans of our research.

Results

Baseline characteristics

Table 1 summarises selected baseline characteristics of the male and female cohorts. The mean age was 40.6 years (SD: 13.6) in the male cohort and 40.4 years (SD: 13.7) in the female cohort. Within a median follow-up of 8 years (IQR: 4–11 years), a total of 9907 incident cancer cases were reported, and the characteristics of the participants with a cancer diagnosis are shown in online supplemental table 1. The median age at cancer diagnosis in the MJ cohort was 57 years. We also compared the average age at cancer diagnosis for major cancers in our cohort with that from other sources in Taiwan,32 and found that the differences in age at cancer diagnosis were within 3–7 years. The percentage of cancer cases increased from 2.86% in those aged 40–59 years to 10.49% in those aged ≥60 years in the male cohort, while the corresponding increase in the female cohort was smaller (from 2.92% to 6.33%). Never-smokers and never-drinkers comprised 45.3% and 60.2% of the male cohort, respectively, while the proportions were much higher (82.8% and 81.2%, respectively) in the female cohort.

Table 1 Baseline characteristics of the study participants*

Model performance in the validation dataset

Based on the model performance, we selected a minimum set of variables and constructed a light model that achieved similar predictive performance to the full model that included all the candidate predictors. The light model contains 31 variables for the male cohort and 11 variables for the female cohort.

The XGBoost model consistently outperformed the reference multivariable Cox proportional hazards model (figure 1), the Lasso-Cox model (online supplemental tables 2 and 4) and the RSF model (online supplemental tables 3 and 5) in terms of discriminative accuracy in both the male and female validation cohorts. For example, in the male validation cohort, the XGBoost light model predicted 3-year all-cancer incidence with an AUC of 0.876 (95% CI 0.858 to 0.894) (online supplemental table 6), compared with an AUC of 0.849 (95% CI 0.827 to 0.871) for the Lasso-Cox model (online supplemental table 2) and 0.849 (95% CI 0.827 to 0.871) for the RSF model (online supplemental table 3). In the female cohort, the XGBoost model achieved a 3-year AUC of 0.746 (95% CI 0.721 to 0.771), compared with 0.738 (95% CI 0.712 to 0.764) for the Lasso-Cox model and 0.737 (95% CI 0.712 to 0.762) for the RSF model (online supplemental tables 2 and 3). The gains in 3-year AUC by the XGBoost model in both cohorts were statistically significant (p<0.05, online supplemental tables 4 and 5). Similar statistically significant gains by the XGBoost model were observed for the C-index, 5-year AUC and 10-year AUC in both the full and light models (online supplemental tables 4 and 5).

Figure 1

Discriminative accuracy of the reference model and the XGBoost light model in the male (A) and female (B) cohorts at 3, 5 and 10 years. The reference model was a multivariable Cox proportional hazards model including age, smoking status, drinking, physical activity, vegetable intake, fruit intake and waist-to-hip ratio. The XGBoost model included 31 variables in the male cohort and 11 variables in the female cohort.

When the XGBoost models were applied to participants whose age at enrolment was above the median (40 years) in the MJ cohort, the performance metrics in the male cohort (3-year AUC 0.818, 95% CI 0.795 to 0.841) and the female cohort (3-year AUC 0.641, 95% CI 0.605 to 0.677) were slightly lower than the full-model results (online supplemental table 6). Nonetheless, the XGBoost models still outperformed the Lasso-Cox models in this older participant group (online supplemental table 7).

In terms of calibration, the XGBoost models showed good agreement between observed and predicted risks in both the male and female cohorts (figure 2); the calibration curves for the Lasso-Cox and RSF models are shown in online supplemental figures 2 and 3.

Figure 2

Calibration curves of the XGBoost model in the male (A) and female (B) cohorts at 3, 5 and 10 years. The triangles represent groups of observations based on risk predicted by the XGBoost model; the x-axis value of each triangle is the average predicted risk within the group and the y-axis value is the observed risk of the group. The red line is the fitted linear calibration curve. The dashed line represents perfect calibration.

Variable importance

Relative importance ranks of the 31 variables in the male cohort and the 11 variables in the female cohort are illustrated in figure 3A,C. Age was ranked as the most important variable, followed by alpha-fetoprotein (AFP), in both the male and female cohorts. Moreover, the partial dependence plot of AFP depicts increasing risk with increasing AFP levels in both cohorts (figure 3B,D). Higher AFP was associated with a 30% increased risk of developing cancer (≥3 ng/mL vs <3 ng/mL, HR 1.30, 95% CI 1.23 to 1.38) in the male cohort and a 27% increased risk (≥4 ng/mL vs <4 ng/mL, HR 1.27, 95% CI 1.19 to 1.36) in the female cohort (online supplemental table 7). Besides AFP, eight biomarkers from laboratory tests were significantly associated with the risk of cancer incidence in both cohorts: serum aspartate transaminase (AST), carcinoembryonic antigen (CEA), gamma-glutamyl transferase, red cell distribution width, platelet count, haematocrit, alkaline phosphatase and the stool test. Changes in absolute 5-year cancer risk as the values of age, AFP and AST increase for a hypothetical male participant (A) and female participant (B) are shown in online supplemental figure 4.
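For illustration, information-gain importance and a simple partial dependence curve for AFP could be computed as below; xgb_fit and the preprocessed feature matrix x are hypothetical objects (as in the earlier model-fitting sketch), and the "AFP" column name is an assumption.

```r
# Sketch: feature importance (gain) and a manual partial dependence curve for AFP
library(xgboost)

imp <- xgb.importance(model = xgb_fit)     # data.table with Feature, Gain, Cover, Frequency
head(imp, 10)

# Vary AFP over a grid, holding other features at their observed values, and average predictions
afp_grid <- seq(min(x[, "AFP"]), max(x[, "AFP"]), length.out = 20)
pd_afp <- sapply(afp_grid, function(v) {
  x_mod <- x
  x_mod[, "AFP"] <- v
  mean(predict(xgb_fit, xgb.DMatrix(x_mod)))
})
plot(afp_grid, pd_afp, type = "l",
     xlab = "AFP (transformed scale)", ylab = "Mean predicted risk score")
```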

Figure 3

Scaled variable importance and the partial dependence plots of the top four important features in the male (A, B) and female (C, D) cohorts based on the light model. The size of the circles in (A, C) represents the value of variable importance, with larger sizes representing greater importance. AFP, alpha-fetoprotein; ALB, serum albumin; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, serum aspartate transaminase; CA, calcium; CEA, carcinoembryonic antigen; CHOL, cholesterol; CRE, creatinine; ERY, erythrocyte; FEV1, forced expiratory volume in 1 s; GGT, gamma-glutamyl transferase; GLO, globulin; HBV, hepatitis B virus; HCT, haematocrit; HDL, high-density lipoprotein; KUB, plain abdominal X-ray; LEU, leucocyte; MCH, mean corpuscular haemoglobin; PLA, platelet; RDW, red cell distribution width; SPL, spleen X-ray; T4, thyroxine; UA, uric acid; WCC, white cell count; WHR, waist-hip ratio.

Sensitivity analysis

In subgroups of participants stratified by cancer type, which included smoking-related cancers, obesity-related cancers, liver cancer and lung cancer, the XGBoost model (online supplemental table 6) consistently outperformed the Lasso-Cox (online supplemental table 2) and RSF models (online supplemental table 3). The XGBoost model achieved high discriminative accuracy for liver cancer (3-year AUC 0.961), lung cancer (3-year AUC 0.914), smoking-related cancers (3-year AUC 0.905) and obesity-related cancers (3-year AUC 0.889).

In participants with complete data, those aged ≥40 years and those with at least 3 years of follow-up, the XGBoost model achieved 3-year AUCs of 0.892, 0.818 and 0.840, respectively, in the male cohort and 0.724, 0.641 and 0.737 in the female cohort. Although the AUCs in the subgroups aged ≥40 years and with at least 3 years of follow-up were slightly lower than the XGBoost full-model performance, the XGBoost models still outperformed the Lasso-Cox models within the same subgroups (online supplemental table 7).

Risk stratification

We divided participants into 10 groups according to their projected 3-year, 5-year and 10-year risks of developing cancer (online supplemental figure 5A, B). The observed risk was at least 10-fold higher in the group with the highest (vs the lowest) predicted risk. Cumulative hazard curves also confirmed that the XGBoost model could stratify participants robustly (online supplemental figure 5C, D). The high-risk groups were more likely to develop cancer than the low-risk groups in both the male and female cohorts (p<0.0001), with HRs of 32.98 (95% CI 22.33 to 48.70) and 9.14 (95% CI 7.12 to 11.73), respectively.

We applied the XGBoost model to predict the 10-year risk of developing any cancer for hypothetical examples (figure 4). For example, for a 30-year-old non-smoking male with a relatively low-risk profile across tumour-related, liver-related and blood-test laboratory biomarkers and normal colorectal marker levels, the predicted 10-year risk of developing any cancer is 0.36%. The risk would increase to 4.72% if the same individual were 65 years old instead. Furthermore, the predicted risk increases sharply to 27.53% if both tumour-related laboratory biomarkers (AFP, CEA) and liver-related laboratory biomarkers (serum AST, serum albumin, gamma-glutamyl transferase, hepatitis B virus, globulin and spleen X-ray) are higher than normal values. For a 65-year-old female, the predicted 10-year risk of developing any cancer increased from 3.53% to 9.03% if both tumour-related and liver-related laboratory biomarkers were above normal values.

Figure 4

Application of the XGBoost risk prediction models in the (A) male cohort and (B) female cohort to predict the 10-year absolute risk of developing any cancer for hypothetical individuals with different risk profiles. AFP, alpha-fetoprotein; ALB, serum albumin; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, serum aspartate transaminase; CA, calcium; CEA, carcinoembryonic antigen; CHOL, cholesterol; CRE, creatinine; ERY, erythrocyte; FEV1, forced expiratory volume in 1 s; GGT, gamma-glutamyl transferase; GLO, globulin; HBV, hepatitis B virus; HCT, haematocrit; HDL, high-density lipoprotein; KUB, plain abdominal X-ray; LEU, leucocyte; MCH, mean corpuscular haemoglobin; PLA, platelet; RDW, red cell distribution width; SPL, spleen X-ray; T4, thyroxine; UA, uric acid; WHR, waist-hip ratio.

Discussion

In the large prospective MJ cohort, our study demonstrated that ML-based models (Lasso-Cox, RSF and XGBoost) were capable of integrating routine health check-up data to predict the overall risk of developing any cancer. The models also showed some capability to predict risks for specific cancer types. Our models achieved notably good performance. To the best of our knowledge, this is the first study to develop ML-based risk prediction models for developing any cancer using routine health check-up data in a large-scale Asian cohort.

As access to primary care clinics improves and awareness of personal health status increases, more individuals opt to participate in routine physical examinations.33 Consequently, early detection of cancer through on-site examination visits holds promise. However, since most of these examinations encompass only standard laboratory tests and questionnaires,34 it is challenging and often infeasible to apply all site-specific cancer risk prediction models. Instead, the physician must first identify patients with a higher risk of developing cancer and then recommend further testing to determine their risk for specific cancer types. From the patients' perspective, they may be hesitant to pay for additional tests themselves if their risk of developing cancer is unknown and medical expenses are high.35 36 Therefore, a general cancer risk prediction model specifically designed for primary care clinics and routine physical examinations is warranted. To develop and validate such a risk prediction model for overall cancer incidence, a large sample of study participants, years of follow-up, high-quality data, comprehensive records of cancer incidence and, most importantly, a powerful ML tool to process and analyse the complex dataset are needed. In the current study, we leveraged the MJ cohort, a large Asian database with health check-up data linked to a cancer registry, and developed and internally validated a prediction model for general cancer risk, which shows promise for widespread application in primary care clinics.

Historically, there have been no reported prediction models for the overall risk of developing any cancer in the general population. Among symptomatic participants from general practices in the UK, Hippisley-Cox and Coupland developed a prediction model for overall cancer risk using 'red flag' cancer symptoms (including weight loss, abdominal pain, indigestion, dysphagia, abnormal bleeding and lumps, and general symptoms including tiredness and constipation) as well as cancer risk factors.37 38 However, this model was not developed for the asymptomatic general population in primary care clinics. More models have been developed to predict the future risk of cancer at specific sites, including 6 breast cancer models,7–10 7 lung cancer models,11 12 10 colorectal cancer models13 and 6 prostate cancer models.14 Advances in next-generation sequencing technology enable the inclusion of complex genetic information in risk prediction models. For example, Kachuri et al demonstrated that integrating cancer-specific polygenic risk scores (PRSs) with family history and modifiable risk factors improved risk prediction for 16 specific cancer types.39 Although PRSs improved prediction accuracy, the costs of introducing genomic technologies into primary clinics are high.40 41 Our models used regular physical examination results from primary clinics at no additional cost and reached fairly good performance. In the current study, we developed a prediction model designed for the primary care clinic scenario and included variables commonly tested in physical examinations, such as platelet count, AST and CEA.

The volume and variety of clinical data are expanding substantially.42 While more data carry more information, they also introduce noise and complexity, necessitating efficient algorithms for handling large-scale computation and for learning non-linear relationships from data. In the fields of biomarker discovery and clinical outcome prediction, ML algorithms excel at identifying novel relationships not readily apparent to clinical experts or conventional statistical methods.43 44 In this study, we took an ML approach, which has been shown to outperform traditional models when dealing with high-dimensional data.45–47 However, ML models often overlook the value of time in time-to-event data, potentially affecting their performance. In a study comparing ML and statistical models in clinical risk prediction, Li et al concluded that ML models that ignore censoring can yield substantially biased risk predictions, limiting their application in long-term risk prediction.48 To reduce bias and compare models in the same survival setting, we adopted two ML models with algorithms for censored data implemented in R, namely RSF and XGBoost with the Cox partial likelihood as the objective function, that is, the function the algorithm optimises to measure the discrepancy between observed outcomes and model predictions. The model performance metrics across all scenarios suggested that the ML models improved survival prediction by extracting effective information from a complex feature space while handling censored data.

Another common issue with ML algorithms is explainability, as it is sometimes difficult to describe the model and each predictor's contribution explicitly. We used information gain, which represents the fractional contribution of each feature to the model, to rank the importance of the included predictors.49 Many of the selected variables have been shown to be associated with specific types of cancer in previous research. For example, hepatitis B virus, AFP, serum AST and serum alanine aminotransferase have been identified as risk factors for liver cancer in several studies.50 The faecal occult blood test has been proven predictive of colorectal cancer.51 Moreover, other important variables included in our light models, such as AFP, CEA, serum AST, serum albumin and platelet count, are commonly available from blood test results in primary clinics.

This cancer risk prediction model has clinical implications for the early detection of cancer in various scenarios. When applied to annual physical examinations, individuals can benefit from knowing their 3-year, 5-year and 10-year risk of pan-cancer incidence. Individuals with low risk might be encouraged to make lifestyle changes, while those with high risk will be prompted to consult health professionals and prepare for further investigation. Another application is to improve the balance among the benefits, harms and costs of cancer screening by accurately targeting candidates. Many established screening guidelines are based on limited sets of conditions,52 which ignore the value of routine health check-up data and result in a large screening pool. With limited resources for follow-up, the screening rates for breast cancer, cervical cancer and colorectal cancer missed the Healthy People 2020 targets by 3%, 8% and 10%, respectively.53 Risk stratification based on the model results can help health professionals practise patient-focused care while alerting high-risk individuals to pursue further screening in a timely and efficient manner.

There were some limitations to this study. First, although our models demonstrated fairly good discriminatory ability, external validation in other populations is still warranted. In addition, when our models are implemented elsewhere with missing values for certain biomarkers, imputation of missing data is required before applying the models; the missing value of a biomarker can be imputed from the individual's historical measurements of that biomarker or from available biomarkers or predictors with which it is highly correlated. Second, participants of the MJ cohort engaged in a self-paying medical screening programme and generally had above-average socioeconomic positions, which may limit the generalisability of our findings to other populations. Third, we did not account for temporal changes in the predictors during follow-up. Fourth, although the duration of follow-up in our study is relatively long (median 8 years), follow-up did not extend to the most recent date; we plan to update all analyses by linking the baseline dataset to more recent cancer incidence and death registry databases in future studies. Lastly, our models were not developed for predicting outcomes such as cancer recurrence and did not incorporate omics data. However, we developed this model with primary care clinics as the main scenario, and laboratory tests for genomics and other omics data are not widely available in these settings.

This work has several important strengths. First, our models were developed in a large population with accurate and complete recording of questionnaires and health check-up data. Additionally, long-term follow-up and ascertainment of outcomes were assured by a centralised death file and cancer registry. Second, the laboratory tests adopted in our study are commonly available in primary clinics. Such quantitative medical examinations are less biased than self-reported symptoms and help explore dose–response relationships. We presented partial dependence plots that show the marginal effects of biomarkers on the outcome, whether the relationships are linear or nonlinear. Third, the interpretability of the XGBoost model is enhanced through feature visualisation, feature importance and the relationship between important predictors and cancer risk.

In conclusion, we developed and internally validated machine-learning models to estimate the risk of developing any cancer in men and women using routine health check-up data. External validation is warranted before the implementation of our risk model in clinical practice. The integration of ML techniques in risk prediction enables more accurate risk stratification of individuals by modelling complex interactions between risk factors, thus assisting both health professionals and individuals in the early detection of cancer.