Abstract
Genetic risk scores (GRS) are crucial tools for estimating an individual's genetic liability to various traits and diseases, computed as a weighted sum of trait-associated allele counts. Traditionally, GRS models assume additive, linear effects of risk variants. However, complex traits often involve nonadditive interactions, such as epistasis, which are not captured by these conventional methods. In this study, we investigate the use of random forest (RF) models as a model-free approach for constructing GRS, leveraging RF's capacity to capture complex, nonlinear interactions among genetic variants. Specifically, we introduce two new RF-based GRS strategies that boost RF performance and incorporate base data information when available: (1) ctRF, which optimizes linkage disequilibrium (LD) clumping and p-value thresholds within RF; and (2) wRF, which adjusts the chance of SNP inclusion in tree nodes based on their association strength. Through simulation studies and real-data applications to Alzheimer's disease, body mass index, and atopy, we find that ctRF consistently outperforms other RF-based methods and classical additive models when traits exhibit complex genetic architectures. Additionally, incorporating informative base data into RF-GRS construction can enhance predictive accuracy. Our findings suggest that RF-based GRS can effectively capture intricate genetic interactions and offer a robust alternative to traditional GRS methods, especially for complex traits with nonlinear genetic effects.
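As a rough illustration of two ideas in the abstract, the sketch below computes a classical additive GRS as a weighted sum of allele counts and shows one possible way to draw candidate SNPs for a tree node with probability tied to association strength. It is not the authors' implementation: the function names, the toy genotypes and weights, and the choice of -log10(p) as the sampling weight are all assumptions made for this example.

```python
import numpy as np

def additive_grs(genotypes, weights):
    """Classical GRS: weighted sum of allele counts for each individual."""
    genotypes = np.asarray(genotypes, dtype=float)  # shape (n_individuals, n_snps), counts 0/1/2
    weights = np.asarray(weights, dtype=float)      # shape (n_snps,), e.g. base-data effect sizes
    return genotypes @ weights

def sample_candidate_snps(p_values, n_candidates, rng):
    """Hypothetical helper: draw SNP indices for one tree node.

    Assumption: inclusion probability is proportional to -log10(p);
    the paper's actual wRF weighting scheme may differ.
    """
    strength = -np.log10(np.asarray(p_values, dtype=float))
    probs = strength / strength.sum()
    return rng.choice(len(probs), size=n_candidates, replace=False, p=probs)

# Toy data (made up for illustration): 2 individuals, 3 SNPs.
genotypes = np.array([[0, 1, 2],
                      [2, 0, 1]])
weights = np.array([0.5, -0.2, 0.1])
scores = additive_grs(genotypes, weights)

rng = np.random.default_rng(0)
candidates = sample_candidate_snps([1e-8, 1e-3, 0.5], n_candidates=2, rng=rng)
```

The additive score is linear in the allele counts by construction, which is why it cannot represent epistasis; a random forest fitted on the same genotype matrix has no such restriction.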
| Original language | English |
|---|---|
| Article number | e70022 |
| Journal | Genetic Epidemiology |
| Volume | 49 |
| Issue number | 8 |
| DOIs | |
| State | Published - Dec 2025 |
Bibliographical note
Publisher Copyright: © 2025 The Author(s). Genetic Epidemiology published by Wiley Periodicals LLC.