Embedding-PFP: Embedding-based protein function prediction

To use this repo, first clone or download the goPredSim repo from RostLab, then merge it with this repo.

This experiment tests how well protein embedding models, which take a raw protein sequence as input and return a fixed-size real-valued vector as output, support protein function prediction (PFP). Most of these embeddings are produced by language models, whose goal is to capture sequence information by predicting part of the sequence from the rest. In an LSTM model, for example, the output vector for a token is predicted from the current input token together with the previous hidden state vector, whose value depends on all previous tokens. In the newer BERT model, some tokens in the sequence are masked and the model is tasked with predicting them from the unmasked tokens. BERT is an attention-based model: it looks at every other token in the sentence and assigns each one an attention weight. To capture the position of each token, BERT uses positional encodings, which are added to the token vectors before being fed into the model. Both LSTM and BERT return an output vector for each token in the sentence, which contains information about the token itself as well as the context surrounding it (in the context of PFP, a token is a residue and a sequence of residues is a sentence).
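For illustration, below is a minimal sketch of this per-residue embedding step. It assumes the publicly available Rostlab/prot_bert checkpoint on Hugging Face; the checkpoint, preprocessing, and variable names are assumptions for the example, not code from this repo.

```python
# Sketch: per-residue embeddings from a BERT-style protein language model
# (assumes the public "Rostlab/prot_bert" checkpoint, not bundled with this repo).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

# ProtBERT expects residues separated by spaces (residues are the "tokens" of the protein "sentence").
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One context-enriched 1024-d vector per residue (dropping the [CLS]/[SEP] special tokens).
per_residue = outputs.last_hidden_state[0, 1:-1]   # shape: (num_residues, 1024)

# Averaging over residues yields the fixed-size embedding for the whole protein.
protein_embedding = per_residue.mean(dim=0)        # shape: (1024,)
```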

Many top-performing PFP models currently use traditional sequence alignment methods to find homologues of the query sequence in an annotated sequence database and then transfer those annotations to the query. A major drawback of these systems is scalability: as more annotated sequences are added to the database, inference becomes slower because BLAST has to search through more sequences, a problem similar to the one nearest neighbor models face. In short, homology-based models require no training but have O(N) inference time, where N is the size of the annotated dataset. Regression-based models, which require protein sequences to be represented as points in a Euclidean space via an embedding model, have O(N) training time but do inference in O(1). Furthermore, regression models on embedded protein sequences can make effective use of information other than sequence homology (for example, PPI graphs). Most importantly, embedded protein vectors allow us to build end-to-end differentiable models, which means every component of the model can be optimized simultaneously so that the components work well together.

This work builds on work by RostLab, which used protein embedding vectors generated by ProtBERT, an adaptation of the BERT model for embedding protein sequences. Each of the 20 amino acids is associated with a vector, and ProtBERT first converts the input sequence into a list of these vectors. The final layer of ProtBERT returns a list of context-enriched vectors of the same length as the input, and we take the average of this list to get the embedding for the whole protein; ProtBERT vectors have 1024 dimensions.

The file fully_connected.py in this repo describes a model consisting of two linear transformations with a hidden dimension of 256. The GO annotations are taken from the goPredSim repo, which contains only the deepest GO terms. To help regression models capture the relationships between GO terms, I propagated these annotations upward through the GO DAG to include every GO term that has at least one descendant in the original dataset, except for the three root terms (which represent MFO, BPO, and CCO and therefore carry no information). GO terms that appear in the GO DAG but in no annotation were ignored, leaving roughly 25,000 GO terms in total. For each embedded sequence, we construct a binary target vector whose size equals the number of annotated GO terms, each dimension corresponding to a unique GO term: a dimension is set to 1 if that GO term appears in the annotation set of the sequence, and 0 otherwise. The final layer of the prediction model is a sigmoid, which returns the probability (between 0 and 1) that the sequence is annotated with each GO term. Since CAFA3 only accepts probability values to two decimal places, all probabilities lower than 0.01 were discarded.
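As a rough illustration of this setup, here is a minimal PyTorch sketch of a two-layer predictor with multi-hot GO targets. The class name, the nonlinearity between the two linear layers, and the helper function are my own assumptions for the example, not necessarily what fully_connected.py actually does.

```python
# Sketch: 1024-d ProtBERT embedding -> 256-d hidden layer -> sigmoid over all GO terms.
import torch
import torch.nn as nn

NUM_GO_TERMS = 25000   # approximate size of the propagated GO vocabulary
EMBED_DIM = 1024       # ProtBERT embedding size
HIDDEN_DIM = 256

class FullyConnectedPFP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, HIDDEN_DIM),
            nn.ReLU(),                        # a nonlinearity here is assumed, not stated in the repo
            nn.Linear(HIDDEN_DIM, NUM_GO_TERMS),
            nn.Sigmoid(),                     # per-GO-term probability between 0 and 1
        )

    def forward(self, protein_embedding):
        return self.net(protein_embedding)

def multi_hot_labels(annotated_terms, go_term_to_index):
    """Binary target vector: 1 for every GO term annotated to the protein, else 0."""
    target = torch.zeros(NUM_GO_TERMS)
    for term in annotated_terms:
        target[go_term_to_index[term]] = 1.0
    return target
```

A model like this would typically be trained with a binary cross-entropy loss between the sigmoid outputs and the multi-hot target vector.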

The negative_sampling.py file describes a model that's mostly identical to the fully_connected.py model, but instead of predicting the probability of every GO terms for each data point during the training phase, we only predict the n positively labeled GO terms and n randomly selected negative GO terms (i.e. GO terms that does not appear in the annotation set of the current sequence). This training strategy is inspired by the negative sampling strategy used in training Word2Vec word embeddings, which was designed to reduce computational complexity. It was also hoped that by reducing the number of negative predictions per forward pass, the gradient would move towards a direction that captures more information about positive GO terms, thus improving recall. Training using negative sampling was only marginally faster, and the difference in Fmax score is inconclusive.
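For comparison, here is a minimal sketch of one negative-sampling training step under the same assumptions as the model sketch above. Here positive_indices is assumed to be the list of GO term indices annotated to the current protein; a real implementation would restrict the final matrix multiplication to the sampled rows to obtain the speedup, which this sketch does not do.

```python
# Sketch: loss computed only on the n positive GO terms plus n random negatives.
import torch
import torch.nn.functional as F

def negative_sampling_loss(model, protein_embedding, positive_indices, num_go_terms):
    n = len(positive_indices)

    # Sample n GO terms that are not in the annotation set of this protein.
    positives = set(positive_indices)
    negative_indices = []
    while len(negative_indices) < n:
        candidate = torch.randint(num_go_terms, (1,)).item()
        if candidate not in positives:
            negative_indices.append(candidate)

    indices = torch.tensor(positive_indices + negative_indices)
    targets = torch.cat([torch.ones(n), torch.zeros(n)])

    # Full forward pass, but the loss (and hence the gradient) only touches
    # the sampled output dimensions.
    scores = model(protein_embedding)[indices]
    return F.binary_cross_entropy(scores, targets)
```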

Below are the results of both experiments, together with the nearest neighbor method originally proposed by RostLab. The nearest neighbor results were reimplemented and recalculated by me, and they differ quite a bit from the results reported in the original paper; I am still investigating the reason for this.

| Fmax | Negative sampling | Fully connected | Nearest neighbor |
|------|-------------------|-----------------|------------------|
| BPO  | 0.373             | 0.358           | 0.377            |
| CCO  | 0.664             | 0.677           | 0.724            |
| MFO  | 0.520             | 0.506           | 0.474            |

Comparing per-protein Fmax scores reveals that 59% of proteins had a higher Fmax when predicted with the fully connected model than with the nearest neighbor method. Below is a histogram of the differences between the Fmax of the fully connected method (Fmax_fc) and the nearest neighbor method (Fmax_nn).

(Histogram of Fmax_fc − Fmax_nn: image missing.)
