Input file preparation #158

molinfzl · 2024-05-06T12:56:14Z

I would like to know how each input file was obtained, and whether I can also use hg38 as the reference genome for other projects.
My current understanding is that the two 2bit files are converted from the genome FASTA files, the chain file is obtained by lastz alignment, and the bed12 file is converted from the hg38 GFF annotation file from NCBI. However, the gene IDs and transcript IDs in the isoforms.tsv and U12sites.tsv you provide differ from the ones I get when I convert the annotation myself. Do you have any suggestions on this?

MichaelHiller · 2024-05-06T13:27:41Z

To the first question: hg38 and the input gene annotation work really well for other placental mammals. For birds, you can use chicken. For other species, pick a well-assembled and well-annotated species as the reference.

Yes, faToTwoBit from the Kent source code converts FASTA to 2bit.
For chains, please use https://github.com/hillerlab/make_lastz_chains after repeat modeling and repeat masking both reference and query.
UCSC's gff3ToGenePred can convert an annotation in GFF3 to genePred.
genePredToBed then converts the genePred into bed12.
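
As a rough sketch of the conversion steps above (not the exact commands used for the provided files), the three Kent tools can be chained from Python. The file names, the output directory, and the helper names `build_commands` and `run_commands` are made up for illustration. Only the basic `input output` argument form of each tool is used.

```python
from pathlib import Path
import shutil
import subprocess


def build_commands(genome_fa, annotation_gff3, out_dir):
    """Return the three Kent-tool invocations as argv lists (nothing is run)."""
    out = Path(out_dir)
    stem = Path(genome_fa).stem  # e.g. "hg38" for "hg38.fa"
    two_bit = out / f"{stem}.2bit"
    gene_pred = out / f"{stem}.genePred"
    bed12 = out / f"{stem}.bed"
    return [
        # FASTA -> 2bit
        ["faToTwoBit", str(genome_fa), str(two_bit)],
        # GFF3 -> genePred
        ["gff3ToGenePred", str(annotation_gff3), str(gene_pred)],
        # genePred -> bed12
        ["genePredToBed", str(gene_pred), str(bed12)],
    ]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        subprocess.run(argv, check=True)
```

Calling `build_commands("hg38.fa", "hg38.gff3", "toga_input")` only returns the command lines, so they can be inspected before `run_commands` executes them. The chain file is not covered here, since it comes from make_lastz_chains.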

Our human annotation is a few years old, so you will likely get updated and new transcripts if you use the current NCBI annotation. For U12, we used a database that is now also outdated. I would recommend running intronIC to infer U12 introns.

Hope that helps

molinfzl · 2024-05-06T13:40:41Z

Thank you for your answer, it is very helpful.
I have another question about how to get the isoforms file. I followed your instructions and found that the website has been updated to the GRCh38.p14 genome, and there I can get the gene IDs and transcript IDs used by Ensembl. However, the transcripts in my NCBI GFF file use identifiers of the form XM_054347076.1. How should I handle the correspondence between the two? Looking forward to your answer.
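
One possible way around the ID mismatch, sketched here as an assumption rather than the maintainers' method, is to derive the gene-to-transcript table from the same GFF3 that produced the bed12, so both files share one ID scheme. The sketch assumes transcripts are `mRNA` features whose `ID` and `Parent` attributes name the transcript and its gene, and that the isoforms file is a two-column TSV with the gene first. Both assumptions should be checked against the bed12 name column and the provided toga.isoforms.tsv. The sample records are invented.

```python
import io


def parse_attributes(column9):
    """Split the GFF3 attribute column into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def isoform_pairs(gff3_lines, transcript_types=("mRNA",),
                  transcript_key="ID", gene_key="Parent"):
    """Yield (gene_id, transcript_id) for each transcript feature."""
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = parse_attributes(cols[8])
        if transcript_key in attrs and gene_key in attrs:
            yield attrs[gene_key], attrs[transcript_key]


def write_isoforms(pairs, handle):
    """Write one 'gene<TAB>transcript' line per pair (assumed layout)."""
    for gene_id, transcript_id in pairs:
        handle.write(f"{gene_id}\t{transcript_id}\n")


# Invented example records, not taken from any real annotation.
SAMPLE_GFF3 = [
    "##gff-version 3\n",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene-GENEA\n",
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=rna-TX1;Parent=gene-GENEA\n",
    "chr1\texample\tmRNA\t150\t900\t.\t+\t.\tID=rna-TX2;Parent=gene-GENEA\n",
    "chr1\texample\texon\t100\t300\t.\t+\t.\tID=exon-1;Parent=rna-TX1\n",
]
```

The attribute keys are parameters because the identifier that ends up in the bed12 name column depends on how the GFF3 was converted. Whatever key is used, the transcript IDs in the isoforms file have to match the bed12 names exactly.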

molinfzl · 2024-05-07T01:54:42Z

Hello, I found a TOGAInput folder in the directory, which contains toga.isoforms.tsv, toga.transcripts.bed, and toga.U12introns.tsv. You also provide hg38.2bit. Are these files complete, and can I use them directly as input files? Do I then only need to prepare the chain file and the 2bit file of the query genome?

MichaelHiller · 2024-05-07T05:33:05Z

Yes, that is pretty much complete. chrY mostly has genes only on the PAR (including them would lead to many false 2:1 orthologies). For the same reason, we don't include the _hap, _fix, _random, _other scaffolds, which often represent variants in the population.
These are the exact files we have been using for a while.
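
For a self-made annotation, a similar exclusion could be approximated by filtering the bed12 on the chromosome name. This is only a sketch under assumptions: that chrY transcripts are dropped entirely, and that variant scaffolds can be recognised by a `_hap`, `_fix`, or `_random` suffix. The pattern and the function name `keep_bed_line` are illustrative, not taken from TOGA.

```python
import re

# Assumed naming pattern for variant scaffolds, e.g. chr6_apd_hap1,
# chr1_KQ031383v1_fix, chr1_KI270706v1_random. Adjust to the assembly.
VARIANT_SCAFFOLD = re.compile(r"_(hap\d*|fix|random)$")


def keep_bed_line(line, excluded_names=frozenset({"chrY"}),
                  excluded_pattern=VARIANT_SCAFFOLD):
    """Return True if the bed line's chromosome should stay in the annotation."""
    chrom = line.split("\t", 1)[0]
    if chrom in excluded_names:
        return False
    return excluded_pattern.search(chrom) is None


def filter_bed(lines):
    """Keep only the bed lines on chromosomes that pass keep_bed_line."""
    return [line for line in lines if keep_bed_line(line)]
```

The matching isoforms entries would need to be removed as well, so that every transcript listed there still exists in the filtered bed12.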

We will release an updated input annotation with a new TOGA version (in the works).
