Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision
This work presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to build an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that use encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze at byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generation of mutations of the Influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvements over existing state-of-the-art results.
This model is built using the HuggingFace platform. Please install the following dependencies:
pip install torch transformers datasets evaluate
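As a quick sanity check of the byte-level setup described above, the google/byt5-base tokenizer (the same one passed to the pre-training script below) can be inspected directly. This snippet is illustrative only and is not part of the training pipeline:

# Verify that each base in a DNA sequence maps to a single byte-level token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
ids = tokenizer("ACGTACGT").input_ids
print(len(ids), ids)  # 8 byte tokens plus one end-of-sequence token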
The configuration file can be created using the T5Config class in the transformers library, with the parameters described in the paper.
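A minimal sketch of this step is shown below. It starts from the published google/byt5-base configuration; the commented-out overrides are placeholders for the hyperparameters reported in the paper, not ENBED's actual values:

# Create and save the model configuration consumed by pretrain.py (--config_name).
from transformers import T5Config

config = T5Config.from_pretrained("google/byt5-base")
# Placeholder overrides: substitute the architecture hyperparameters from the paper.
# config.num_layers = ...
# config.num_decoder_layers = ...
# config.d_model = ...
config.save_pretrained("./dna-byt5-base")

Once the configuration directory has been saved, the pre-training script can be run using the following command: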
export PROJECT_DIR="YOUR_PROJECT_DIR"
source $PROJECT_DIR/env/bin/activate
export HF_DATASETS_CACHE=$PROJECT_DIR/.cache/
export HF_TRANSFORMERS_CACHE=$PROJECT_DIR/.cache/
python pretrain.py \
--config_name="./dna-byt5-base" \
--output_dir="./dna-byt5-base-output" \
--tokenizer_name="google/byt5-base" \
--dataset_name="PATH_TO_FASTA_PRETRAINING_FILE" \
--max_seq_length="16384" \
--per_device_train_batch_size="4" \
--per_device_eval_batch_size="4" \
--learning_rate="0.00001" \
--weight_decay="0.001" \
--warmup_steps="5000" \
--overwrite_output_dir \
--logging_steps="200" \
--save_steps="2500" \
--eval_steps="15000" \
--seed="42" \
--preprocessing_num_workers="32"
After pre-training, the model can be fine-tuned on a downstream task using the following command:
python finetune.py \
--model_name_or_path model/ \
--do_train \
--do_eval \
--dataset_name data/ \
--output_dir outputs/ \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--overwrite_output_dir \
--predict_with_generate \
--num_train_epochs 10 \
--text_column sequence \
--summary_column label \
--preprocessing_num_workers 32
Please see the example notebook influenza_generation.ipynb, which uses the pre-trained model to generate mutations of the Influenza virus.
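For orientation, the core of that generation workflow looks roughly like the sketch below; the checkpoint path, the input fragment, and the sampling settings are illustrative assumptions rather than values taken from the notebook:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the byte-level tokenizer and a trained ENBED checkpoint (placeholder path).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model = T5ForConditionalGeneration.from_pretrained("./dna-byt5-base-output")

# Illustrative fragment only; replace with an actual Influenza genome segment.
parent = "ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTT"
inputs = tokenizer(parent, return_tensors="pt")

with torch.no_grad():
    mutated = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=64)
print(tokenizer.decode(mutated[0], skip_special_tokens=True))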
If you use this work, please cite:
@article{malusare2024understanding,
title={Understanding the natural language of DNA using encoder--decoder foundation models with byte-level precision},
author={Malusare, Aditya and Kothandaraman, Harish and Tamboli, Dipesh and Lanman, Nadia A and Aggarwal, Vaneet},
journal={Bioinformatics Advances},
volume={4},
number={1},
pages={vbae117},
year={2024},
publisher={Oxford University Press}
}
(C) 2023 CLAN Labs, Purdue University