Use reference genetic map to interpolate genetic position for a query set of variants
Certain statistical genetic applications require the analyst to annotate variants by genetic position, typically in centimorgans or morgans. Unlike physical position, which is canonically treated as a constant across genomes, genetic position may potentially vary due to a number of more complicated factors. As such, it is typically estimated at a nonuniform mesh of locations across the genome in a dedicated study, and then those estimates are used to create estimates for individual downstream analyses.
In the case that the original estimation mesh overlaps with the experimental variant set, annotation with genetic position is straightforward. However, often one must annotate positions that were not present in the original mesh. To do so, canonically a parameter in units (centimorgans / megabase) is reported across the mesh; this represents the rate of change of genetic distance across physical distance. This is almost always a linear approximation of a nonlinear function. To annotate new sites, the analyst must locate the nearest mesh location with lower physical position than the experimental site, and compute the genetic position for the site using the linear approximation
experimental.gpos = mesh.gpos + (experimental.ppos - mesh.ppos) / 1000000 * mesh.rate
where gpos
is genetic position in centimorgans, ppos
is physical position in the
reference genome, and rate
is the (centimorgans / megabase) parameter mentioned above.
This calculation is fairly straightforward, though minor complications arise in corner cases:
- an experimental variant physically located before the first mesh position on a chromosome is assigned the genetic position 0; this is true of all such variants meeting this condition
- similarly, an experimental variant physically located after the last mesh position on
a chromosome is assigned the genetic distance of that last mesh position
- since the first mesh location is arbitrarily assigned genetic position 0, these are actually the same condition, but they tend to look different at first glance
- experimental genetic position is typically only computed for autosomes and the X chromosome
- experimental genetic position is genome build-specific; there are conceptually different
estimates depending on which reference genome is used in the estimate
- note, however, that many versions of the human genetic recombination map were simply translated from earlier genome reference builds with liftOver, which has substantial flaws
All of the above is manageable, but fiddly and error prone, and is a repeated task that people implement individually each time they do it.
This repo contains a standalone utility that performs this annotation and interpolation on a variety of different input data formats and using an assortment of possible genetic maps. It is (will be) tested such that its output is reliable, and when approximations are required, those will be described and exposed to user configuration as much as possible.
There are two primary installation methods, depending on your needs.
-
If needed, install mamba
-
Install the package with mamba:
mamba create -n igp -c https://raw.githubusercontent.com/lightning-auriga/conda-builds/default/conda-builds -c bioconda -c conda-forge interpolate-genetic-position
-
Activate the resulting environment:
mamba activate igp
-
The tool should now be available as
interpolate-genetic-position.out
See the provided environment.yaml file for conda-formatted dependencies for building this package.
By default, a build process involving a conda/mamba environment is supported.
-
if you wish to use
conda
and it's not currently available, you can install it with the instructions here -
navigate into your project directory (interpolate-genetic-position)
-
create the
conda
environment for installation as follows:mamba env create -f environment.yaml
-
activate the conda environment:
mamba activate interpolate-genetic-position-env
-
update (create) the necessary
configure
scripts withautoreconf
:autoreconf --force --install
- note that this can also be run with
./generate.bash
inside the repo
- note that this can also be run with
-
run
configure
:./configure --with-boost=${CONDA_PREFIX} --with-boost-libdir=${CONDA_PREFIX}/lib
- if you are planning on installing software to a local directory, run instead
./configure --prefix=/install/dir [...]
- if you are planning on installing software to a local directory, run instead
-
run
make -j{ncores}
-
if desired, run
make install
. if permissions issues are reported, see above for reconfiguring with./configure --prefix
.
By default, the final compiled program can be run with
interpolate-genetic-position.out
Parameter | Description |
---|---|
--input -i |
Input file of variants or regions to annotate. Needs to be sorted, chromosome and position. Can be gzipped. If not specified, will be read as plaintext from stdin. |
--preset -p |
Format of input variant file. Accepted formats: bim , map , snp , vcf , bed . |
--genetic-map -g |
Input recombination map. Needs to be sorted, chromosome and position. Can be gzipped (except bigwigs). If not specified, will be read as plaintext from stdin. |
--map-format -m |
Format of recombination map. Accepted formats: bolt , bedgraph , bigwig (see below for further discussion). |
--output -o |
Output file. Will match format of input. Cannot currently be gzipped. If not specified, will be written to stdout. |
--output-format -f |
Format of output file. Accepted formats: bolt , bim , map , `snp (see below for further discussion). |
--verbose -v |
Whether to print extremely verbose debug logs. You probably don't want this. |
--output-morgans |
Report output genetic position in morgans, instead of the default centimorgans. |
--region-step-interval |
Add a fixed genetic distance at the boundaries of end positions of bedfile region queries, such that the output data have a step-like structure. This functionality is included for experimental purposes, and in most applications this setting should be kept at its default of 0. |
--precision |
Specify the number of bits to use for internal floating point calculations. Higher values will lead to higher fidelity calculations, at the cost of RAM and potentially performance. The program is capable of detecting situations where it has experienced catastrophic floating point errors, and in such a situation, try increasing this value to fix the problem. Default is 64. |
--fixed-output-width |
Specify the number of digits after the decimal to be reported in output floating point values. More digits will potentially lead to more consistent downstream interpolation calculations, at the cost of file size. Leaving this unspecified or setting this to 0 causes the output floating point width to be scaled to the number of significant digits per value. |
--help -h |
Print brief help message and exit. |
--version |
Print version string for current build. |
Note that, of the above, either -i
or -g
can be read from stdin, but not both.
This program can attempt to automatically reformat input files into different format output files, but the effectiveness of such a conversion with the available information varies.
Input Format | Valid Output Formats | Notes |
---|---|---|
bim | bim, map, snp | |
map | map | Map files lack allele information, and so allele-containing formats are not possible. |
snp | bim, map, snp | |
vcf | bim, map, snp | Vcf is not itself a supported output format. Note that for markers with multiple alternate alleles, only the first will be reported. |
bed | bolt | Input bed regions are converted into bolt-format genetic maps. |
There are an assortment of different recombination rate/genetic map files available publicly online. The choice of which map to use has a variety of constraints:
- Consider your input experimental data. Depending on the genome build used, you may be constrained in which map you choose. The oldest of these maps dates back to HapMap2/hg18. A large number of maps in later builds are derived directly from this map with liftOver, and are arguably of low quality
- Certain experimental applications might benefit from the use of a recombination map that was estimated from a population that matches the genetic ancestry of your dataset. In general, there haven't been many great findings derived from using population-specific maps, and most people tend to rely on either cosmopolitan or single ancestry (read: European) maps. Most tools don't really seem to care too much
The following primary maps are supported by this tool:
- Genetic maps released with BOLT-LMM (
--map-format bolt
)- Available in the tarball in this directory. Download
and extract the latest version (2.4.1 as of 2023); the map you want is in the
tables
subdirectory.- Make sure you choose the one matching your dataset's genome build
- These maps cover hg17, hg18, hg19 (GRCh37), and hg38 (GRCh38).
- Available in the tarball in this directory. Download
and extract the latest version (2.4.1 as of 2023); the map you want is in the
- Recombination rate tracks from UCSC (
--map-format bigwig
or--map-format bedgraph
)- Available as bigwigs in this directory. Download the version you prefer (I have used this one).
- The link in this readme is just for hg38/GRCh38. You can go digging around for others if you like; they all have benefits and drawbacks. Hopefully the track format is consistent enough that it'll be compatible.
- In theory, this should work on any of the aforementioned genome builds; in practice, I've only tested it for hg38 so far.
- If you have mismatched contig issues with a bigwig, you can convert a bigwig to bedgraph and have it in any chromosome annotation.
However, the bedgraph needs to be sorted by karyotyping order, not the lexicographical sort the UCSC tracks use.
To convert a UCSC track to a bedgraph for use with this program, do some version of the following:
mamba install -c bioconda -c conda-forge ucsc-bigwigtobedgraph
bigWigToBedGraph recombAvg.bw recombAvg.bedgraph
sed 's/^chrX/23/ ; s/^chr//' recombAvg.bedgraph | sort -k 1,1g -k 2,2g -k3,3g | sed -r 's/^23/chrX/ ; s/^([^c])/chr\1/' > recombAvg.sorted.bedgraph
- Note that there's one potentially significant advantage to using bedgraphs: as currently implemented, the program will load all recombination data for a chromosome at once (clearing memory between chromosomes). For most recombination maps and computers, this RAM usage is pretty trivial. However, if you have a particularly fine-grained recombination map, the intervals themselves could be a memory block. If you convert such a map to bedgraph (and sort it) in advance, the program will stream linewise and have fixed, constant, low memory use.
For idiosyncratic reasons, input queries in bed format are emitted as bolt-format genetic maps. This is due to the need to specify both rate and position in the output blocks. Note that query blocks are fragmented based on whatever mosaic pattern in which they happen to intersect with the initial genetic map. Due to the lack of end positions in bolt format maps, a final bolt entry per-chromosome is injected into the output with fixed rate 0, to prevent implied interpolation with the rate of the last estimated block.
The flag --region-step-interval [double-precision value]
causes the bolt-format genetic map emitted for
bedfile queries to contain stepwise increments in genetic position at the end position of each query interval.
This is for experimental purposes; for most practical uses of this tool, this flag should be left at its default of 0.
When --region-step-interval
is non-zero, bedfile column 4 is considered a group label, such that consecutive queries
with the same column 4 entry will not have their genetic position incremented by the fixed specified interval.
This behavior allows distinct query regions to be assigned to the same conceptual unit. This behavior is obviously very niche,
and since --region-step-interval
should almost always be 0, this behavior will have no practical impact in most circumstances.
This program can accept input files as streams and can emit output to stream. To have a file be read from
or written to stream, simply omit its relevant flag (-i
, -g
, -o
). For input streams, the corresponding
format flag should still be set. Only one input stream (either -i
or -g
) can be read from an input stream
per run.
The recombination rates and physical positions involved in these maps creates situations where some interpolations create extremely small differences in estimated recombination rates. Fixed precision in computational calculations and output files can create cascading rounding errors when using genetic maps to annotate, for example, a series of variants on a chromosome where the genetic position of successive variants must never decrease, even by a very small amount.
To allow the user to conditionally avoid this difficulty, the program offers the following:
- the program will check its output genetic positions and error out if it finds a position that decreases relative to the prior output result;
- the program has the configuration flag
--precision
that controls the amount of memory in RAM that is used for calculations; and - the program has the configuration flag
--fixed-output-width
that controls the number of digits after the decimal place that are reported for any decimal value in its output.
If the user encounters the error in the first point above, they can then increase in tandem the values
of --precision
and --fixed-output-width
until the observed errors go away. Unfortunately, it is extremely
challenging to predict how to adjust these values without simply trying different combinations. In a test
case with real data, the permutation of values --precision 256 --fixed-width 60
addressed all observed issues.
This is the most basic expected use case.
interpolate-genetic-position.out -i infile.bim -p bim -g genetic_map.tsv -m bolt -o infile_with_gpos.bim -f bim
Note that vcf output format is not supported, so you'll have to do some further processing if you want to add the interpolated values back into e.g. the INFO field somewhere.
interpolate-genetic-position.out -i infile.vcf.gz -p vcf -g genetic_map.tsv -m bolt -o interpolated_values.map -f map
Regions don't have a single genetic distance associated with them, and so for completeness, when annotating a bed region file, the output becomes a genetic map with associated genetic distance and rate (cM/Mb) data.
interpolate-genetic-position.out -i infile.bed -p bed -g genetic_map.tsv -m bolt -o new_recombination_map.tsv -f bolt
Take a bed file of regions, turn it into its own recombination map, and use that map to annotate a second file
Either the input query file (-i
), or the input recombination map (-g
), but not both at once, can be read
from stream by leaving the corresponding argument unspecified. Note that the corresponding format flag
is still required. The output can also be streamed in the same manner with the same restriction.
interpolate-genetic-position.out -i infile.bed -p bed -g genetic_map.tsv -m bolt -f bolt |
interpolate-genetic-position.out -i infile.vcf -p vcf -m bolt -o annotated_variants.map -f map
See ChangeLog.md.