AntiRef: reference clusters of human antibody sequences

AntiRef (Antibody Reference Clusters), which was inspired by UniRef, provides clustered datasets of filtered human antibody sequences. The Jupyter notebooks in this repository contain all code necessary to recreate AntiRef entirely from scratch:

download: downloads raw data from the Observed Antibody Space (OAS) repository. Note that the combined total of these datasets is quite large -- nearly 4TB after decompression.
filter: filters the raw sequences to ensure only productive, full-length sequences are used to compile AntiRef.
cluster: performs a nested clustering procedure using several identity thresholds. This process is similar to that used by UniRef, although the thresholds were optimized for antibody sequence data rather than general protein sequences.
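
To make the filter step concrete, here is a minimal sketch of what a "productive, full-length" check could look like. It is not the code from the filter notebook. It assumes AIRR-style column names (`productive` with the value `T`, and the `fwr1_aa` through `fwr4_aa` region columns) and treats a sequence as full-length when every region is non-empty; the actual criteria used for AntiRef may differ, and the toy rows are invented for illustration.

```python
import csv
import io

# Assumed AIRR-style region columns; the real filter notebook may use others.
REGION_COLUMNS = ["fwr1_aa", "cdr1_aa", "fwr2_aa", "cdr2_aa",
                  "fwr3_aa", "cdr3_aa", "fwr4_aa"]


def is_productive_full_length(row: dict) -> bool:
    """Keep a row only if it is flagged productive and every region is present."""
    if row.get("productive") != "T":
        return False
    return all((row.get(col) or "").strip() for col in REGION_COLUMNS)


# Hypothetical toy data: "a" passes, "b" is unproductive, "c" lacks FWR1.
toy_csv = "\n".join([
    "sequence_id,productive," + ",".join(REGION_COLUMNS),
    "a,T,QVQL,GFTF,MHWV,ISYD,RFTI,ARDY,WGQG",
    "b,F,QVQL,GFTF,MHWV,ISYD,RFTI,ARDY,WGQG",
    "c,T,,GFTF,MHWV,ISYD,RFTI,ARDY,WGQG",
])

reader = csv.DictReader(io.StringIO(toy_csv))
kept = [row["sequence_id"] for row in reader if is_productive_full_length(row)]
print(kept)
```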

What is nested clustering?

AntiRef is a series of antibody sequence datasets, each clustered at an identity threshold of decreasing stringency. Rather than clustering the filtered input dataset at each threshold in parallel, we perform the clustering sequentially, using the output of each round as the input for the next clustering iteration.

This has two primary benefits. First and most importantly, it ensures that cluster and sequence names are conserved across all AntiRef datasets. Each cluster is named after its representative sequence (as determined by mmseqs), and by using the output of one clustering round as input for the next, we can ensure that every representative sequence is present in all previous clustering outputs. For example, if we separately clustered the input dataset at 99% and 98% identity, some cluster representatives in the 98% dataset could be absent from the 99% dataset, because those sequences were not selected as representatives of their respective 99% clusters.
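
The naming guarantee can be illustrated with a toy example. The sketch below is not the AntiRef pipeline: it swaps mmseqs for a simple greedy clustering with a position-by-position identity on equal-length strings, and the sequences and thresholds are made up. It only shows that feeding each round's representatives into the next round makes the representative sets nested.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; a toy stand-in for alignment-based identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Assign each sequence to the first representative it matches at >= threshold.

    Returns {representative_name: [member_names]}; a sequence that matches no
    existing representative becomes a new representative.
    """
    clusters = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def nested_cluster(seqs: dict, thresholds: list) -> dict:
    """Cluster sequentially: each round sees only the previous round's representatives."""
    levels = {}
    current = dict(seqs)
    for t in thresholds:
        clusters = greedy_cluster(current, t)
        levels[t] = clusters
        current = {rep: current[rep] for rep in clusters}
    return levels


# Hypothetical toy sequences (length 10).
toy = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAAAA",  # exact duplicate of s1
    "s3": "AAAAAAAAAC",  # 90% identical to s1
    "s4": "AAAAAAAACC",  # 80% identical to s1
    "s5": "CCCCCCCCCC",
}

levels = nested_cluster(toy, [1.0, 0.9, 0.8])
for t, clusters in levels.items():
    print(t, sorted(clusters))
```

Because each round only ever sees names that survived the previous round, every representative at a looser threshold is also a representative at every stricter one.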

Why AntiRef?

Biases in the human antibody repertoire result in publicly available antibody sequence datasets containing many duplicate or highly similar sequences. These redundant sequences are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine learning models of human antibodies. Identity-based clustering provides a solution. However, the extremely large size of available antibody repertoire datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data.

Starting from a dataset of ~335M unique, full-length, productive human antibody sequences from the Observed Antibody Space repository, several AntiRef cluster sets were generated. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef (100, 90 and 50 percent identity) to cluster general protein sequences are suboptimal for antibody clustering. AntiRef provides reference antibody sequence datasets clustered at a range of relevant identity thresholds: 100, 99, 98, 96, 94, 92 and 90 percent. AntiRef90, which uses the lowest clustering threshold of any AntiRef dataset, is roughly one-third the size of the filtered input dataset and less than half the size of the non-redundant AntiRef100.

Where can I download AntiRef datasets?

AntiRef datasets are available on Zenodo and can be downloaded at the following links:

AntiRef100: representative sequences resulting from clustering all filtered AntiRef input sequences at 100% identity.
AntiRef99: representative sequences resulting from clustering AntiRef100 at 99% identity.
AntiRef98: representative sequences resulting from clustering AntiRef99 at 98% identity.
AntiRef96: representative sequences resulting from clustering AntiRef98 at 96% identity.
AntiRef94: representative sequences resulting from clustering AntiRef96 at 94% identity.
AntiRef92: representative sequences resulting from clustering AntiRef94 at 92% identity.
AntiRef90: representative sequences resulting from clustering AntiRef92 at 90% identity.
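
The chain of inputs and outputs listed above can be sketched as a small plan builder. This is an illustration only: it assumes the `mmseqs easy-cluster` workflow with its `--min-seq-id` option and its `<prefix>_rep_seq.fasta` output, which may not match the module or settings used in the cluster notebook. The file names (`filtered.fasta`, `antirefNN`, `tmp`) are hypothetical, and nothing is executed.

```python
THRESHOLDS = [100, 99, 98, 96, 94, 92, 90]  # percent identity, from the list above


def build_plan(filtered_fasta: str, thresholds=THRESHOLDS) -> list:
    """Return (dataset_prefix, input_fasta, command) for each nested round.

    Each round reads the representative sequences written by the round
    before it, so the inputs form a chain rather than a fan-out.
    """
    plan = []
    current_input = filtered_fasta
    for pct in thresholds:
        prefix = f"antiref{pct}"  # hypothetical output prefix
        cmd = [
            "mmseqs", "easy-cluster",
            current_input, prefix, "tmp",
            "--min-seq-id", f"{pct / 100:.2f}",
        ]
        plan.append((prefix, current_input, cmd))
        # easy-cluster writes representatives to <prefix>_rep_seq.fasta
        current_input = f"{prefix}_rep_seq.fasta"
    return plan


plan = build_plan("filtered.fasta")
for prefix, input_fasta, cmd in plan:
    print(prefix, "<-", input_fasta, "|", " ".join(cmd))
```

A real run would also need coverage and alignment settings chosen for antibody data; those are deliberately left out here.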

How should I cite AntiRef?

AntiRef has been published in Bioinformatics Advances and can be cited as:

Briney B. (2023). AntiRef: reference clusters of human antibody sequences.
Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad109.

Zenodo provides a unique DOI for each version of a deposited dataset. The DOI of the current version of AntiRef (v2022.12.14) is 10.5281/zenodo.7474336, so an appropriate citation would be:

Briney, Bryan. (2022). AntiRef: reference clusters of human antibody sequences (v2022.12.14). 
[Data set]. Zenodo. https://doi.org/10.5281/zenodo.7474336
