Application of Convolutional Neural Network in Establishing Human Variome Database

Project: National Science and Technology Council Academic Grants

Project Details


As the cost of sequencing continues to decrease, human whole-genome sequencing (WGS) is now affordable and is being performed at an unprecedented scale. Building on our previous project, MOST 104-2321-B-182-007-MY3, we established mSignatureDB (Nucleic Acids Research, 2018 Jan 4;46(D1):D964-D970; IF = 11.561; R/C = 4%, Biochemistry & Molecular Biology) to decipher mutational signatures in human cancers. However, are there disease- or phenotype-associated mutational signatures that originate from germline mutations? In this proposal, we planned to extend our work by identifying genetic variants in over 17,000 whole-genome sequencing datasets (Taiwan Biobank and the Broad Institute's Genome Aggregation Database) using a convolutional neural network (CNN) approach, and then decomposing novel mutational signatures from the mutational profiles of these non-cancer samples to answer this question. We also planned to create a genotype-phenotype database and apply CNN methods to unravel all plausible connections between phenotypes and genotypes, constructing the most comprehensive human variome database worldwide.

The proposed project aimed to:
1. Establish Spark, CNN, and GPU-accelerated pipelines for GATK4 HaplotypeCaller, AI-based DeepVariant, and de novo mutational signature decomposition
2. Establish the most comprehensive human variome database for the Taiwanese population and extend it to cover populations worldwide
3. Use the GPU-accelerated pipeline to decipher novel mutational signatures in cancer and non-cancer datasets
4. Establish an integrative platform to uncover all plausible links between phenotypes and mutational signatures using the CNN strategy

The ultimate goal of this proposal is to establish the first AI-based genetic variant database for the Taiwanese population, which may not only reflect the true population frequencies of genetic variations but also facilitate the prioritization of pathogenic variants and benefit research in precision medicine.
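To make the idea of de novo mutational signature decomposition more concrete, the sketch below factorizes a matrix of mutational profiles into signatures and per-sample exposures. The proposal does not state which algorithm the pipeline uses; non-negative matrix factorization (NMF) with multiplicative updates is assumed here because it is a common choice for this task. The function name `decompose_signatures`, the 96-channel layout, and the synthetic input data are all illustrative assumptions, not part of the project.

```python
import numpy as np


def decompose_signatures(profiles, n_signatures, n_iter=500, seed=0, eps=1e-12):
    """Factorize a (channels x samples) count matrix V into W @ H.

    W holds one signature per column (each normalized to sum to 1);
    H holds the exposure of every sample to every signature.
    Hypothetical helper: plain NMF with Frobenius-norm multiplicative updates.
    """
    V = np.asarray(profiles, dtype=float)
    rng = np.random.default_rng(seed)
    n_channels, n_samples = V.shape
    W = rng.random((n_channels, n_signatures)) + eps
    H = rng.random((n_signatures, n_samples)) + eps

    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    # Rescale so each signature is a probability distribution over channels;
    # the scale moves into the exposures, leaving W @ H unchanged.
    scale = np.maximum(W.sum(axis=0), eps)
    W /= scale
    H *= scale[:, None]
    return W, H


# Synthetic example (assumed data): 2 hidden signatures, 96 channels, 20 samples.
rng = np.random.default_rng(1)
true_W = rng.dirichlet(np.ones(96), size=2).T          # shape (96, 2)
true_H = rng.integers(50, 500, size=(2, 20)).astype(float)
V = true_W @ true_H

W, H = decompose_signatures(V, n_signatures=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)
```

In practice the number of signatures is not known in advance and is usually chosen by comparing reconstruction error and stability across several candidate values; the sketch fixes it at 2 only because the synthetic data were generated that way.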

Project IDs

Project ID: PB10901-2587
External Project ID: MOST108-2221-E182-043-MY3
Effective start/end date: 01/08/20 to 31/07/21


  • germ-line mutation
  • mutational signature
  • Taiwan Biobank
  • whole-genome sequencing

