Discussion
In the large prospective MJ cohort, our study demonstrated that ML-based models (Lasso-Cox, RSF and XGBoost) could integrate routine health check-up data to predict participants’ overall risk of developing any cancer. The models were also able, to a certain degree, to predict risks for specific cancer types. Our models achieved notably good performance. To the best of our knowledge, this is the first study to develop ML-based overall risk prediction models for any cancer using routine health check-up data in a large-scale Asian cohort.
As access to primary care clinics improves and awareness of personal health status increases, more individuals opt to participate in routine physical examinations.33 Consequently, early detection of cancer through on-site examination visits holds promise. However, since most of these examinations only encompass standard laboratory tests and questionnaires,34 it is challenging and often unfeasible to apply all site-specific cancer risk prediction models. Meanwhile, the physician must first identify patients with a higher risk of developing cancer, and then recommend further testing to determine their risk for specific cancer types. From the patients’ perspective, individuals may be hesitant to pay for additional tests themselves if their risk of developing cancer is unknown and medical expenses are high.35 36 Therefore, a general cancer risk prediction model specifically designed for primary care clinics and routine physical examinations is warranted. Developing and validating such a model for overall cancer incidence requires a large sample of study participants, years of follow-up, high-quality data, comprehensive records of cancer incidence and, most importantly, a powerful ML tool to process and analyse the complex dataset. In the current study, we leveraged the MJ cohort, a large Asian database with health check-up data linked to a cancer registry, and developed and internally validated a prediction model for general cancer risk, which shows promise for widespread application in primary care clinics.
Historically, there have been no reported prediction models for the overall risk of developing any cancer in the general population. Among symptomatic participants from general practices in the UK, Hippisley-Cox and Coupland developed a prediction model for overall cancer risk using ‘red flag’ cancer symptoms (including weight loss, abdominal pain, indigestion, dysphagia, abnormal bleeding and lumps, as well as general symptoms such as tiredness and constipation) together with cancer risk factors.37 38 However, this model was not developed for the asymptomatic general population in primary care clinics. More models have been developed to predict the future risk of cancer at specific sites, including six breast cancer models,7–10 seven lung cancer models,11 12 ten colorectal cancer models13 and six prostate cancer models.14 Advances in next-generation sequencing technology enable the inclusion of complex genetic information in risk prediction models. For example, Kachuri et al demonstrated that integrating cancer-specific polygenic risk scores (PRS) with family history and modifiable risk factors improved risk prediction for 16 specific cancer types.39 Although PRS improved prediction accuracy, the costs of introducing genomic technologies into primary clinics are high.40 41 Our models used regular physical examination results from primary clinics at no additional cost and reached fairly good model performance. In the current study, we developed a prediction model for the primary care clinic scenario that included variables commonly tested in physical examinations, such as platelet count, AST and CEA.
The volume and variety of clinical data are expanding substantially.42 While more data add information, they also introduce noise and complexity, necessitating efficient algorithms that can handle large-scale computation and learn non-linear relationships from data. In biomarker discovery and clinical outcome prediction, ML algorithms excel at identifying novel relationships not readily apparent to clinical experts or conventional statistical methods.43 44 In this study, we took the ML approach, which has been shown to outperform traditional models when dealing with high-dimensional data.45–47 However, ML models often overlook the value of time in time-to-event data, potentially affecting their performance. In a study comparing ML and statistical models in clinical risk prediction, Li et al concluded that ML models that do not account for censoring could produce substantially biased risk predictions, limiting their application in long-term risk prediction.48 To reduce bias and compare models within the same context (survival settings), we adopted two ML models with implementations for censored data in R, namely RSF and XGBoost with Cox as the objective function, that is, the function the algorithm optimises during training to quantify the discrepancy between observed and predicted outcomes. The model performance metrics in all scenarios suggested that ML models improved performance in survival analysis by extracting effective information from a complex feature space when dealing with censored data.
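To illustrate how a Cox-type objective uses censored observations, the following Python sketch evaluates the negative Cox partial log-likelihood for a set of predicted risk scores. It is an illustration only and not the R implementation used in this study: the function name and the toy data are hypothetical, and tied event times receive no special handling. Censored participants contribute no event term of their own, but they remain in the risk set of every event that occurs before their censoring time.

```python
import math

def cox_neg_partial_loglik(times, events, scores):
    """Negative Cox partial log-likelihood (illustrative, no tie correction).

    times  : follow-up time for each participant
    events : 1 if cancer was observed, 0 if the participant was censored
    scores : predicted log-relative-hazard for each participant
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored participants add no event term
        # Risk set: everyone still under observation at time t_i,
        # which includes participants censored later than t_i.
        risk_sum = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
        total += scores[i] - math.log(risk_sum)
    return -total

# Hypothetical toy data: the second participant is censored at year 3.
times = [2, 3, 5]
events = [1, 0, 1]
well_ordered = [1.0, 0.0, -1.0]   # higher score for the earlier event
badly_ordered = [-1.0, 0.0, 1.0]  # scores reversed

loss_good = cox_neg_partial_loglik(times, events, well_ordered)
loss_bad = cox_neg_partial_loglik(times, events, badly_ordered)
```

In this toy setting, scores that rank the earlier event as higher risk yield a smaller loss than the reversed scores, which is the behaviour a survival-aware objective is meant to reward.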
Another common issue with ML algorithms is explainability, as it is sometimes difficult to describe the model and each predictor’s contribution explicitly. We used information gain, which represents the fractional contribution of each feature to the model, to rank the importance of the included predictors.49 Many of the selected variables have been shown to be associated with specific types of cancer in previous research. For example, hepatitis B virus, AFP, serum AST and serum alanine transaminase have been identified as risk factors for liver cancer in several studies.50 The faecal occult blood test has been shown to be predictive of colorectal cancer.51 Moreover, other important variables included in our light models, such as AFP, CEA, serum AST, serum albumin and platelet count, are commonly available from blood test results in primary clinics.
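As a minimal sketch of what a fractional contribution means, the Python snippet below normalises per-feature total gain so that the shares sum to 1 and then ranks the features. The function name and the gain values are hypothetical and chosen only for illustration; they are not results from this study.

```python
def fractional_gain(total_gain):
    """Convert per-feature total gain into fractional contributions, ranked.

    total_gain : dict mapping feature name -> summed gain over all splits
    Returns a list of (feature, share) pairs, largest share first.
    """
    total = sum(total_gain.values())
    if total <= 0:
        raise ValueError("total gain must be positive")
    shares = {name: gain / total for name, gain in total_gain.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

# Hypothetical gain values, for illustration only.
ranked = fractional_gain({"AFP": 30.0, "CEA": 50.0, "AST": 20.0})
```

Because the shares are relative, they indicate which predictors the fitted model relied on most, not the direction or size of each predictor’s effect on risk.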
This cancer risk prediction model has clinical implications for the early detection of cancer in several scenarios. When applied to annual physical examinations, individuals can benefit from knowing their 3-year, 5-year and 10-year risk of pan-cancer. Individuals with low risk might be encouraged to make lifestyle changes, while those with high risk will be prompted to consult health professionals and prepare for further investigation. Another application is to improve the balance among the benefits, harms and costs of cancer screening by accurately targeting candidates. Many established screening guidelines are based on limited sets of conditions,52 which ignore the value of routine health check-up data and result in a large screening pool. With limited resources for follow-up, the screening rates for breast cancer, cervical cancer and colorectal cancer missed Healthy People 2020 targets by 3%, 8% and 10%, respectively.53 Risk stratification based on the model results can help health professionals practise patient-focused care while alerting high-risk individuals to pursue further screening in a timely and efficient manner.
There were some limitations to this study. First, although our models demonstrated fairly good discriminatory ability, external validation in other populations is still warranted. In addition, when our models are implemented elsewhere and values of certain biomarkers are missing, the missing data must be imputed before the model is applied. A missing biomarker value can be imputed from historical values of that biomarker or from other available biomarkers or predictors that are highly correlated with it. Second, participants of the MJ cohort were those who engaged in a medical screening programme and generally had above-average socioeconomic positions, which may limit the generalisability of our findings to other populations. Third, we did not account for temporal changes in the predictors during the follow-up. Fourth, although the duration of follow-up in our study was relatively long (median=8 years), follow-up did not extend to the most recent date. We plan to update all the analyses by linking the baseline dataset to a more recent cancer incidence registry database and a death registry database in future studies. Lastly, our models were not developed for predicting outcomes such as cancer recurrence and did not incorporate omics data. However, we developed this model with primary care clinics as the main scenario, and laboratory tests for genomics and other omics data are not widely available in these settings.
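The imputation step mentioned under the first limitation could be sketched as follows in Python. This is one possible approach under stated assumptions, not a procedure used in this study: the function name is hypothetical, the participant’s most recent earlier measurement is used when one exists, and the fallback to a reference-population median is our assumption for the case where no history is available.

```python
from statistics import median

def impute_biomarker(current, history, reference_values):
    """Return a usable value for a single biomarker.

    current          : the current measurement, or None if missing
    history          : the participant's earlier measurements, oldest first
    reference_values : observed values from a reference population
                       (assumed fallback, not described in the study)
    """
    if current is not None:
        return current  # nothing to impute
    past = [value for value in history if value is not None]
    if past:
        return past[-1]  # most recent earlier measurement
    return median(reference_values)

# Hypothetical AST values (U/L), for illustration only.
from_history = impute_biomarker(None, [20.0, None, 25.0], [10.0, 30.0, 50.0])
from_reference = impute_biomarker(None, [], [10.0, 30.0, 50.0])
```

Imputing from a highly correlated biomarker, the other option named above, would require a fitted relationship between the two measurements and is not shown here.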
This work has several important strengths. First, our model was developed on a large population whose questionnaire and health check-up data were recorded accurately and completely. Additionally, long-term follow-up and ascertainment of outcomes were assured by a centralised death file and cancer registry. Second, the laboratory tests adopted in our study are commonly available in primary clinics. Such quantitative medical examinations are less prone to bias than self-reported symptoms and help explore dose–response relationships. We presented partial dependence plots, which show the marginal effects of biomarkers on the outcomes regardless of whether the relationships are linear or non-linear. Third, the interpretability of the XGBoost model is enhanced through feature visualisation, feature importance and the relationship between important predictors and cancer risk.
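For readers unfamiliar with partial dependence, the Python sketch below shows the underlying computation: the feature of interest is set to each grid value for every participant in turn, and the model’s predictions are averaged. The function name, the toy model and the data are hypothetical and serve only to illustrate the idea; they do not reproduce the models or plots of this study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction as one feature is varied over a grid.

    predict : function taking a dict of feature values, returning a number
    rows    : list of dicts, one per participant
    feature : name of the feature to vary
    grid    : values at which to evaluate the feature
    """
    curve = []
    for value in grid:
        # Override the feature for every participant, keep all else fixed.
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve

# Hypothetical toy model and data, for illustration only.
def toy_model(row):
    return 2 * row["x"] + row["y"]

rows = [{"x": 0, "y": 1}, {"x": 5, "y": 3}]
curve = partial_dependence(toy_model, rows, "x", [0, 1, 2])
```

Plotting the grid values against the resulting averages gives the partial dependence curve; because nothing in the computation assumes linearity, the same procedure applies to non-linear models such as RSF or XGBoost.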
In conclusion, we developed and internally validated machine-learning models to estimate the risk of developing any cancer in men and women using routine health check-up data. External validation is warranted before the implementation of our risk model in clinical practice. The integration of ML techniques in risk prediction enables more accurate risk stratification of individuals by modelling complex interactions between risk factors, thus assisting both health professionals and individuals in the early detection of cancer.