ISI-BUDS 2023 (Irvine Summer Institute in Biostatistics and Undergraduate Data Science)
R Project: Predicting Colorectal Cancer Stage Using Metabolomic Data
In this project, we analyzed data from a colorectal cancer study, "Colorectal Cancer Detection Using Targeted Serum Metabolic Profiling," and applied machine learning methods to predict the stage of colorectal cancer.
Exploratory Data Analysis:
- We cleaned the data and replaced outliers with values imputed using SVD (singular value decomposition).
- PLS-DA (Partial Least Squares Discriminant Analysis) revealed that the polyp group resembled the controls more closely than it resembled the cancer groups.
Prediction:
- We used various machine learning models to predict colorectal cancer stage, of which boosted trees and random forests performed best.
- The most interesting part of this project was our use of nested cross-validation (nested CV). Ordinary cross-validation repeatedly splits the data into a training set, used to fit the model, and a held-out testing set, used to evaluate the model’s performance, so that overfitting shows up as poor held-out performance. If the same splits are also used to tune model parameters, however, the performance estimate becomes optimistically biased. Nested CV reduces this bias by using two levels of folds: the data are split into outer folds, and the training portion of each outer fold is split again into inner folds. The inner folds are used to tune model parameters, and the outer folds are used only to measure the performance of the tuned model.
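The nested CV procedure described above can be sketched as follows. This is a minimal illustration, not the project's code: the project itself was written in R, whereas this sketch uses plain Python, a made-up one-dimensional toy dataset, and a k-nearest-neighbour classifier standing in for the boosted trees and random forests. All function names (k_fold, knn_predict, accuracy, nested_cv) and the fold counts are hypothetical choices made for the example.

```python
import random


def k_fold(indices, k, rng):
    """Shuffle `indices` and split them into k roughly equal, disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Toy stand-in model: majority vote of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(data, train_idx, test_idx, k):
    """Fit on train_idx (k-NN just stores the points) and score on test_idx."""
    train = [data[i] for i in train_idx]
    hits = [knn_predict(train, data[i][0], k) == data[i][1] for i in test_idx]
    return sum(hits) / len(hits)


def nested_cv(data, candidate_ks, outer_k=5, inner_k=3, seed=0):
    """Return one held-out accuracy per outer fold."""
    rng = random.Random(seed)
    outer_folds = k_fold(range(len(data)), outer_k, rng)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        # Outer training set: every outer fold except the held-out one.
        train_idx = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        # Inner folds are drawn only from the outer training set.
        inner_folds = k_fold(train_idx, inner_k, rng)

        def inner_score(k):
            scores = []
            for v, val_idx in enumerate(inner_folds):
                fit_idx = [j for f, fold in enumerate(inner_folds) if f != v for j in fold]
                scores.append(accuracy(data, fit_idx, val_idx, k))
            return sum(scores) / len(scores)

        # Inner loop: choose the hyperparameter without touching test_idx.
        best_k = max(candidate_ks, key=inner_score)
        # Outer loop: score the tuned model on the untouched outer fold.
        outer_scores.append(accuracy(data, train_idx, test_idx, best_k))
    return outer_scores


# Made-up toy data: two classes drawn from shifted normal distributions.
rng = random.Random(42)
toy = [(rng.gauss(0, 1), 0) for _ in range(30)] + [(rng.gauss(3, 1), 1) for _ in range(30)]
scores = nested_cv(toy, candidate_ks=(1, 3, 5))
print(sum(scores) / len(scores))  # mean accuracy across the outer folds
```

The key property is that each outer test fold is never seen while the hyperparameter is being chosen, so the averaged outer score is not inflated by the tuning step.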
Conclusion:
- Our best models detected cancer at rates comparable to colonoscopy and CT-colonography (the gold standards for colorectal cancer screening, which are also highly invasive) and outperformed other colorectal cancer detection tests that use fecal blood or DNA to detect the presence of cancer.