Machine Learning Genomics Sequences Classification: Deep Learning vs. LLM


Details
Hello Data Scientists,
Deep learning models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have been applied effectively to genomics text data classification, including DNA and protein sequences.

Large Language Models (LLMs) are trained on vast amounts of text data to perform a variety of language-related tasks, including text classification. LLMs offer a powerful approach to DNA sequence classification, leveraging their ability to learn and generalize from large datasets. By adapting these models to biological data, researchers can gain insight into genomic sequences and their functions, advancing fields such as genomics, bioinformatics, and personalized medicine. Examples of LLMs used for DNA sequence classification include BERT, GPT, BioBERT, DNABERT, Transformer-XL, XLNet, and T5. For protein sequence classification, examples include Evolutionary Scale Modeling (ESM), ProtBERT, and Tasks Assessing Protein Embeddings (TAPE).

The main question of this presentation is: “Can LLMs perform better than deep learning models for genomics text data classification?” The presentation will show a simple comparison of these models on a protein sequence classification task with multiple class labels. Let's find out whether we really need LLMs for genomics text data classification.
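Before any of these models can classify a sequence, the raw string of residues has to be turned into "words." A common preprocessing step, for CNN/LSTM pipelines and LLM tokenizers alike, is overlapping k-mer tokenization. The sketch below is a hypothetical illustration (the function name, the k value, and the example sequence are assumptions, not taken from the presentation):

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1),
    so text-based classifiers can treat it like a sentence of words."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Illustrative protein fragment, not real experimental data.
protein = "MKTAYIAKQR"
tokens = kmer_tokenize(protein, k=3)
print(tokens[:3])  # ['MKT', 'KTA', 'TAY']
```

A length-n sequence yields n - k + 1 overlapping tokens; these can then feed an embedding layer in a CNN/LSTM or be mapped to a transformer vocabulary.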
Thanks
Ernest Bonat, Ph.D.