Project Details
Abstract
Building on results from our first attempt, two years ago, at detecting human enterovirus (EV) recombination by machine learning, we propose a two-year project to address this problem more systematically and comprehensively. More than 100 human EV types (including 97 serotypes and 10+ subgenotypes of EV-A71 and CV-A16) are available in the database. They cluster into four species, HEV-A, -B, -C and -D, according to their VP1 diversity, with as few as 4 serotypes in HEV-D and as many as 57 in HEV-B. In our earlier work, we developed a deep learning framework using TensorFlow and Keras in which a long short-term memory (LSTM) network was implemented to classify 5 EV types (3 EV-A71 genotypes and 2 other serotypes, CV-A4 and CV-A16), and we obtained a qualitatively good prediction for an EV recombinant known to have inherited its genomic fragments from 4 of these 5 types. Two major issues remain. The first is the need to include more than 100 putative parental EV types in the training dataset, not just the 5 used previously. The second is the imbalance of data among these types, which may prevent the LSTM from training well enough to classify accurately. In this proposal, we aim to detect EV recombination by first simulating the EV sequences needed for classification among EV serotypes, genotypes and subgenotypes, and then feeding them into an LSTM model for recombination detection. In the first year, we will collect all publicly available EV genomes and develop a new substitution matrix for each of the four EV species to improve the quality of multiple sequence alignment (MSA) before constructing phylogenetic trees to label their types.
We will evaluate alignment quality by comparing MSAs built with and without the new matrices, construct phylogenetic trees, and calculate within-type sequence identities by scanning along the virus genomes to remove unqualified training sequences. Qualified MSAs will then be used to generate simulated sequences with SANTA-SIM to address the data imbalance. In the second year, we will optimize the hyperparameters of our LSTM model using our benchmark data. We will compare our detected recombination events with those of previous studies and also compare the performance of our LSTM model with that of other existing approaches. Finally, we will implement a website that hosts predicted recombinants and predicts EV recombination at users' request. Based on preliminary results from our previous investigations, we are confident that, once the sequence data are balanced to include all EV types and the hyperparameters of our LSTM model are tuned on these benchmark data, a powerful LSTM model can be established for EV recombination detection. Our final LSTM code and benchmark datasets will be uploaded to GitHub.
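As an illustration of the filtering step described in the abstract (scanning along aligned genomes and computing within-type sequence identity), the following is a minimal sketch, not the project's actual implementation. The function names, the window and step sizes, and the identity threshold are all hypothetical choices made for this example; the abstract does not specify them.

```python
from typing import List, Tuple


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip alignment columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def windowed_identity(
    a: str, b: str, window: int = 200, step: int = 50
) -> List[Tuple[int, float]]:
    """Identity in sliding windows along two aligned sequences.

    `window` and `step` are assumed values, not taken from the project.
    Returns (window_start, identity) pairs.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    last_start = max(len(a) - window, 0)
    return [
        (start, pairwise_identity(a[start:start + window], b[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def passes_filter(
    candidate: str,
    reference: str,
    threshold: float = 0.75,  # hypothetical cutoff
    window: int = 200,
    step: int = 50,
) -> bool:
    """Keep a candidate only if every window meets the identity threshold."""
    return all(
        identity >= threshold
        for _, identity in windowed_identity(candidate, reference, window, step)
    )


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    candidate = "ACGTACGTTTTT"
    print(windowed_identity(candidate, reference, window=4, step=4))
    print(passes_filter(candidate, reference, threshold=0.5, window=4, step=4))
```

A sequence whose identity to its type reference drops sharply in one region, as in the toy example, would be flagged by such a scan. Whether a low-identity window indicates a mislabeled sequence or a recombinant is a judgment this sketch does not attempt to make.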
Project IDs
Project ID: PB10907-4373
External Project ID: MOST109-2221-E182-043-MY2
Status | Finished |
---|---|
Effective start/end date | 01/08/20 → 31/07/21 |
Keywords
- enterovirus
- species
- multiple sequence alignment
- classification
- deep learning
- recombination
- benchmark