Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Q on coverage, multi-genome alignment, and maf2synteny #41

Open
ohdongha opened this issue Jun 30, 2022 · 4 comments
Open

Q on coverage, multi-genome alignment, and maf2synteny #41

ohdongha opened this issue Jun 30, 2022 · 4 comments

Comments

@ohdongha
Copy link

ohdongha commented Jun 30, 2022

Hi, thanks for this fast and easy-to-run aligner :) I have two questions.

  1. How the "coverage" in the report was calculated? I had a coverage value 0.95 from a human GRCh38 to human T2T alignment run. Is the coverage = (total non-overlapping length covered by the alignment) * 2 / (length of genome 1 + length of genome 2)?
  2. The instruction goes sibeliaz genome1.fa genome2.fa but what if I want to align more than two genomes? Can I just keep adding genome3.fa genome4.fa ...? And in this case how the coverage would be calculated?

Thanks a lot!

@iminkin
Copy link
Collaborator

iminkin commented Jun 30, 2022

Hi!

The coverage is calculated as # the total number of positions in all genomes covered by the alignment / total length of the genomes. The alignments blocks output by SibeliaZ do not overlap, at least they are not supposed to. You can add as many input genomes as you want. One more note on coverage: SibeliaZ filters out repeats with many copies, so sometimes the coverage can be less than expected. This behavior is parameter-dependent though.

@ohdongha ohdongha changed the title Q on coverage and multi-genome alignment Q on coverage, multi-genome alignment, and maf2synteny Jul 1, 2022
@ohdongha
Copy link
Author

ohdongha commented Jul 1, 2022

Thanks, Ilia, for the quick response :) I have a few more questions regarding the "non-overlapping" part since this appears a requirement for maf2synteny.

  1. The 1st alignment from the GRCh38-T2T comparison (attached below) had numerous (=180) sequences from mostly chromosome 1s from the two assemblies. Likewise, the alignments at the beginning of the MAF output contained many sequences aligned. But this number of sequences decreased and later alignments are mostly 1:1.
    human_GRCh38_vs_T2T_firstBlock.maf.txt
    I assume "non-overlapping" means that even if each alignment can be 1:multi, multi:1, or multi:multi, the same genomic region won't appear in more than one alignment. Is it correct?

  2. The 1st alignment appears to be from repeats considering the many sequences in it. I used the default setting (-a 150) which would remove k-mers with frequency >150. Does this mean that repeat with copy number <150 can be still captured in the alignment while higher-copy repeats are more likely removed?

  3. I read that MAF is "reference-based". Does SibeliaZ consider one of the genomes as the reference, e.g. the first one among the two (or many) given as arguments? If so, does the output MAF have a distinction about which is from the reference?

  4. How does the maf2synteny deal with the sequences in 1:multi, multi:1, or multi:multi alignments when merging alignments to create longer synteny blocks? Will they be ignored or somehow some (or one?) of them be included in the synteny block?

  5. Is there an easy way to find out the proportions of each genome that were aligned 1:1 or 1:multi / multi:1 / multi:multi from the MAF or GFF output?

Thanks again!

@iminkin
Copy link
Collaborator

iminkin commented Jul 11, 2022

Hi @ohdongha ,

Sorry for the late reply.

  1. Yes, each position in the genome is only supposed to be a part of a single alignment block.
  2. Yes, lower frequency repeats can be captured, depending on the parameter -a.
  3. MAF can be both reference or non-reference based. I am not sure, but it is likely that the notion of MAF being reference-based comes from the fact that earlier WGA tools like MultiZ were reference-based. SibeliaZ does not treat any particular input genome as a reference. Also, some other aligners (e.g. Cactus) also use MAF, but are not reference-based.
  4. I am not the author of maf2synteny, so I would refer this question to Mikhail Kolmogorov, who wrote the software. AFAIK, maf2synteny follows the general logic of Sibelia, where shorter blocks are gradually merged to form longer ones. This process does not have specific conditions on the multiplicity of the blocks. In other words, the resulting blocks could be either 1:multi, multi:1, or multi:multi, all depending on the actual data.
  5. I think the easiest way to compute it is to write a short Python script analyzing the data. It should be relatively straightforward. For example, BioPython has MAF-parser, which makes reading and operating MAF files very easy (https://biopython.org/docs/1.75/api/Bio.AlignIO.MafIO.html).

Hope this helps!

@ohdongha
Copy link
Author

ohdongha commented Jul 11, 2022

Thanks a lot, Ilia.
I noticed that sometimes the first sequence of the SibeliaZ output maf was on - strand direction and some tools (e.g. phyloFit of PHAST) don’t like it. In case of Cactus, it seems to estimate an ancestral sequence and put it on top of all alignments (even for a pairwise comparison) (and in + direction) so that the ancestral sequence kind of serve as the reference. But these may be more about the MAF specification not very well defined/agreed upon?

I also noticed that some alignment blocks have sequences from only one species (more precisely, only one chromosome of a species). Does SibeliaZ print paralogous alignment as well?

I will also try ‘maf2synteny’ and the MAF-parser and report back!
Dong-Ha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants