Investigating genome size variation with k-mers

A repository accompanying the manuscript "Measuring the invisible – The sequences causal of genome size differences in eyebrights (Euphrasia) revealed by k-mers".

Motivation

Intraspecific genome size (GS) variation is due to presence/absence variation, which may affect single-copy regions or genomic repeats. To date, studies targeting the sequences underpinning GS variation have commonly used low-pass sequencing ("genome skimming") data analysed with the RepeatExplorer pipeline. Such studies have convincingly identified repeats involved in GS variation, but they necessarily paint an incomplete picture: with genome skimming data, it is not possible to assess the contribution of low- and single-copy sequences to GS variation.

Here, we implement an alternative approach. We use k-mers (short sub-sequences of length 21 generated from sequencing reads) from high-coverage sequencing data sets. We compare k-mer inventories between individuals, which allows us to assess the role of all genomic copy-number classes, from single-copy sequences to highly repetitive satellite DNAs.
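
To make the idea of a k-mer inventory concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not part of the pipeline, which relies on KMC3 for counting at scale. The function names (revcomp, count_kmers, compare_inventories) are hypothetical, and counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption made here because reads come from both strands.

```python
from collections import Counter

K = 21  # k-mer length used in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k=K):
    """Count canonical k-mers across an iterable of read strings.

    K-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[min(kmer, revcomp(kmer))] += 1
    return counts


def compare_inventories(counts_a, counts_b):
    """Map every k-mer seen in either sample to its pair of counts."""
    return {
        kmer: (counts_a.get(kmer, 0), counts_b.get(kmer, 0))
        for kmer in counts_a.keys() | counts_b.keys()
    }
```

A k-mer whose count differs between two individuals after accounting for sequencing depth points to a copy-number difference; a k-mer that is present in one individual and absent from the other points to presence/absence variation.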

Requirements and dependencies

K-mer tool kit. You will need to have KMC3 installed. KMC3 can be set up with anaconda, for instance by running conda install -c bioconda kmc, or (generating a new environment) conda create -n kmc -c bioconda kmc.
File links. You should rename (or generate links to) your sequencing data files so that each sample has a unique prefix that can be used to easily select all of an individual's files.
Quality filtering/trimming. Optional but recommended: trim and clean your sequencing data. Sequencing errors do not matter much, because they generate unique k-mers that do not significantly affect estimates. Sequencing adapter contamination, however, can show up as high-copy-number k-mers, biasing genome size estimates. We used fastp.
Organellar assemblies. You need reference sequences for the plastid and mitochondrial genomes. You may choose to assemble them de novo from your data using GetOrganelle or download something suitable from a repository. These assemblies are then used to remove organelle k-mers from your data, which would otherwise bias genome size estimates.
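
The organelle filtering described in the last item can be pictured as a set subtraction. The sketch below is a toy, in-memory version and does not reproduce how the pipeline scripts do it. The names canonical_kmers and remove_organelle_kmers are hypothetical, and the sample counts are assumed to be keyed by canonical k-mers. Note that the sketch treats each reference as a linear sequence, so k-mers spanning the junction of a circular organelle genome would be missed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Return the set of canonical k-mers found in a (linear) sequence."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            rc = kmer.translate(_COMPLEMENT)[::-1]
            kmers.add(min(kmer, rc))
    return kmers


def remove_organelle_kmers(sample_counts, organelle_seqs, k=21):
    """Drop every k-mer that occurs in any organelle reference sequence.

    sample_counts: dict mapping canonical k-mer -> count in the sample.
    organelle_seqs: iterable of reference sequences (plastid, mitochondrion).
    """
    organelle = set()
    for seq in organelle_seqs:
        organelle |= canonical_kmers(seq, k)
    return {kmer: n for kmer, n in sample_counts.items() if kmer not in organelle}
```

Because organelle genomes are present in many copies per cell, their k-mers appear at very high counts; leaving them in would inflate the apparent amount of high-copy nuclear sequence.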

Running the pipeline

The pipeline has two steps:

Generation of k-mer databases and k-mer spectra. These need to be analysed manually to assess the sequencing coverage, for instance using Tetmer.
Generation of the scaled and binned joint k-mer spectra
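
As a rough picture of what a scaled and binned joint k-mer spectrum is, the following Python sketch builds one from two in-memory count tables. It is a sketch under stated assumptions, not the pipeline's implementation: the function name joint_spectrum, the linear scaling by a per-sample coverage value (as estimated in step 1), and the fixed-width bins centred on multiples of bin_width are all choices made here for illustration; the repository's scripts may scale and bin differently.

```python
import math
from collections import Counter


def joint_spectrum(counts_a, counts_b, cov_a, cov_b, bin_width=1.0):
    """Build a 2D histogram of k-mers by scaled multiplicity in two samples.

    counts_a, counts_b: dicts mapping k-mer -> raw count in each sample.
    cov_a, cov_b: per-sample k-mer coverage used to scale raw counts
                  (assumed to come from the manual spectrum analysis).
    Returns a Counter mapping (bin_a, bin_b) -> number of distinct k-mers.
    """
    joint = Counter()
    for kmer in counts_a.keys() | counts_b.keys():
        # Scale raw counts to approximate copy number in each sample.
        x = counts_a.get(kmer, 0) / cov_a
        y = counts_b.get(kmer, 0) / cov_b
        # Assign each scaled value to the nearest bin centre.
        bin_a = math.floor(x / bin_width + 0.5)
        bin_b = math.floor(y / bin_width + 0.5)
        joint[(bin_a, bin_b)] += 1
    return joint
```

In such a table, k-mers on the diagonal have similar scaled multiplicity in both samples, while off-diagonal bins collect k-mers whose copy number differs between the two individuals.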

