Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

Abstract

This work presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to build an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes that group multiple base pairs and thereby lose byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generation of mutations of the Influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvements over existing state-of-the-art results.

Installation

This model is built on the Hugging Face Transformers platform. Please install the following dependencies:

pip install torch transformers datasets evaluate 

Pre-training script

The model configuration can be created with the T5Config class from the transformers library, using the hyperparameters described in the paper.
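The following sketch is an illustration rather than part of the repository: it starts from the google/byt5-base configuration and saves it to the ./dna-byt5-base directory passed below as --config_name; the individual hyperparameters should be overridden with the values reported in the paper.

# Minimal sketch: build and save the configuration consumed by pretrain.py.
from transformers import T5Config

config = T5Config.from_pretrained("google/byt5-base")  # start from the ByT5-base configuration
# Override hyperparameters here with the values reported in the paper, e.g.
# config.num_layers = ...  # placeholder; see the paper for the actual values
config.save_pretrained("./dna-byt5-base")  # directory passed as --config_name

The pre-training script can then be run with the following command: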

export PROJECT_DIR="YOUR_PROJECT_DIR"
source $PROJECT_DIR/env/bin/activate

export HF_DATASETS_CACHE=$PROJECT_DIR/.cache/
export TRANSFORMERS_CACHE=$PROJECT_DIR/.cache/

python pretrain.py \
	--config_name="./dna-byt5-base" \
	--output_dir="./dna-byt5-base-output"  \
	--tokenizer_name="google/byt5-base" \
	--dataset_name="PATH_TO_FASTA_PRETRAINING_FILE" \
	--max_seq_length="16384" \
	--per_device_train_batch_size="4" \
	--per_device_eval_batch_size="4" \
	--learning_rate="0.00001" \
	--weight_decay="0.001" \
	--warmup_steps="5000" \
	--overwrite_output_dir \
	--logging_steps="200" \
	--save_steps="2500" \
	--eval_steps="15000" \
	--seed="42" \
	--preprocessing_num_workers="32" 

Fine-tuning script

python finetune.py \
    --model_name_or_path model/ \
    --do_train \
    --do_eval \
    --dataset_name data/ \
    --output_dir outputs/ \
    --per_device_train_batch_size=16 \
    --per_device_eval_batch_size=16 \
    --overwrite_output_dir \
    --predict_with_generate \
    --num_train_epochs 10 \
    --text_column sequence \
    --summary_column label \
    --preprocessing_num_workers 32
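
The fine-tuning data must expose the columns named by --text_column and --summary_column, i.e. sequence and label. The sketch below is an assumption about that layout rather than part of the repository; how finetune.py actually reads the contents of data/ is defined by that script.

# Minimal sketch of the expected column layout (hypothetical example values).
from datasets import Dataset, DatasetDict

train = Dataset.from_dict({
    "sequence": ["ACGTACGTAGCTAGCT", "TTGACCTAGGATCCAA"],  # input genomic sequences (placeholders)
    "label": ["enhancer", "promoter"],                     # seq2seq targets (placeholders)
})
validation = Dataset.from_dict({
    "sequence": ["GGCATTACGGATCGAT"],
    "label": ["splice site"],
})

dataset = DatasetDict({"train": train, "validation": validation})
dataset.save_to_disk("data/")  # assumption: adjust to whatever on-disk format finetune.py expects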

Example: Generating mutations using the seq2seq model

Please see the example file influenza_generation.ipynb, which uses the pre-trained model to generate mutations of the Influenza virus.
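
For reference, a minimal generation sketch is shown below. It assumes a checkpoint directory model/ that is compatible with T5ForConditionalGeneration; the input sequence and sampling parameters are placeholders, and the notebook remains the authoritative example.

# Minimal sketch: byte-level seq2seq generation with a trained checkpoint.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")  # byte-level ByT5 tokenizer
model = T5ForConditionalGeneration.from_pretrained("model/")   # placeholder checkpoint directory

sequence = "ATGAAGACTATCATTGCTTTGAGC"  # placeholder genomic input segment
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))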

Citing this work

@misc{malusare2023understanding,
      title={Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision}, 
      author={Aditya Malusare and Harish Kothandaraman and Dipesh Tamboli and Nadia A. Lanman and Vaneet Aggarwal},
      year={2023},
      eprint={2311.02333},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

(C) 2023 CLAN Labs, Purdue University
