Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision
This work presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to build an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that use encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze at byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generation of mutations of the Influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvements over existing state-of-the-art results.
This model is built using the HuggingFace platform. Please install the following dependencies:
pip install torch transformers datasets evaluate
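As a quick sanity check of the byte-level setup described above, the google/byt5-base tokenizer (the same one passed to the pre-training script below) can be inspected directly. This snippet is illustrative only and is not part of the training pipeline:

# Verify that each base in a DNA sequence maps to a single byte-level token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
ids = tokenizer("ACGTACGT").input_ids
print(len(ids), ids)  # 8 byte tokens plus one end-of-sequence token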
The configuration file can be created using the T5Config class in the transformers library, with the parameters described in the paper.
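A minimal sketch of this step is shown below. It starts from the published google/byt5-base configuration; the commented-out overrides are placeholders for the hyperparameters reported in the paper, not ENBED's actual values:

# Create and save the model configuration consumed by pretrain.py (--config_name).
from transformers import T5Config

config = T5Config.from_pretrained("google/byt5-base")
# Placeholder overrides: substitute the architecture hyperparameters from the paper.
# config.num_layers = ...
# config.num_decoder_layers = ...
# config.d_model = ...
config.save_pretrained("./dna-byt5-base")

Once the configuration directory has been saved, the pre-training script can be run using the following command: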
export PROJECT_DIR="YOUR_PROJECT_DIR"
source $PROJECT_DIR/env/bin/activate
export HF_DATASETS_CACHE=$PROJECT_DIR/.cache/
export HF_TRANSFORMERS_CACHE=$PROJECT_DIR/.cache/
python pretrain.py \
--config_name="./dna-byt5-base" \
--output_dir="./dna-byt5-base-output" \
--tokenizer_name="google/byt5-base" \
--dataset_name="PATH_TO_FASTA_PRETRAINING_FILE" \
--max_seq_length="16384" \
--per_device_train_batch_size="4" \
--per_device_eval_batch_size="4" \
--learning_rate="0.00001" \
--weight_decay="0.001" \
--warmup_steps="5000" \
--overwrite_output_dir \
--logging_steps="200" \
--save_steps="2500" \
--eval_steps="15000" \
--seed="42" \
--preprocessing_num_workers="32"
After pre-training, the model can be fine-tuned on a downstream task using the following command:
python finetune.py \
--model_name_or_path model/ \
--do_train \
--do_eval \
--dataset_name data/ \
--output_dir outputs/ \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--overwrite_output_dir \
--predict_with_generate \
--num_train_epochs 10 \
--text_column sequence \
--summary_column label \
--preprocessing_num_workers 32
Please see the example notebook influenza_generation.ipynb, which uses the pre-trained model to generate mutations of the Influenza virus.
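For orientation, the core of that generation workflow looks roughly like the sketch below; the checkpoint path, the input fragment, and the sampling settings are illustrative assumptions rather than values taken from the notebook:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the byte-level tokenizer and a trained ENBED checkpoint (placeholder path).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model = T5ForConditionalGeneration.from_pretrained("./dna-byt5-base-output")

# Illustrative fragment only; replace with an actual Influenza genome segment.
parent = "ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTT"
inputs = tokenizer(parent, return_tensors="pt")

with torch.no_grad():
    mutated = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=64)
print(tokenizer.decode(mutated[0], skip_special_tokens=True))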
If you use this work, please cite:
@article{malusare2024understanding,
title={Understanding the natural language of DNA using encoder--decoder foundation models with byte-level precision},
author={Malusare, Aditya and Kothandaraman, Harish and Tamboli, Dipesh and Lanman, Nadia A and Aggarwal, Vaneet},
journal={Bioinformatics Advances},
volume={4},
number={1},
pages={vbae117},
year={2024},
publisher={Oxford University Press}
}
(C) 2023 CLAN Labs, Purdue University