Input file preparation #158

Open
molinfzl opened this issue May 6, 2024 · 4 comments

Comments

@molinfzl

molinfzl commented May 6, 2024

I would like to know how each input file was obtained, and whether I can also use hg38 as the reference genome for other projects.
My current understanding is that the two 2bit files are converted from the genome FASTA files, the chain file is obtained by lastz alignment, and the bed12 file is converted from the NCBI GFF annotation of the hg38 genome. However, the gene IDs and transcript IDs in the isoforms.tsv and U12sites.tsv files you provide differ from the ones I get when I convert the annotation myself. Do you have any suggestions on this?

@MichaelHiller
Collaborator

To the first question: hg38 with our input gene annotation works really well for other placental mammals. For birds, you can use chicken. For other species, pick a well-assembled and well-annotated species as the reference.

Yes, faToTwoBit from the kent source code converts FASTA to 2bit.
For chains, please use https://github.com/hillerlab/make_lastz_chains after repeat modeling and repeat masking both reference and query.
UCSC's gff3ToGenePred converts a GFF3 annotation to genePred.
genePredToBed then converts the genePred file into bed12.
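
The three conversion steps named above can be chained roughly as follows. This is only a sketch: it assumes the UCSC tools faToTwoBit, gff3ToGenePred and genePredToBed are on the PATH and are called with plain positional input and output arguments. The function names, file names and output prefix are hypothetical.

```python
import subprocess


def build_conversion_commands(genome_fa, annotation_gff3, out_prefix):
    """Return the three UCSC tool invocations as argument lists.

    Output names are derived from out_prefix (a hypothetical naming scheme).
    """
    two_bit = f"{out_prefix}.2bit"
    gene_pred = f"{out_prefix}.genePred"
    bed12 = f"{out_prefix}.bed"
    return [
        ["faToTwoBit", genome_fa, two_bit],              # FASTA -> 2bit
        ["gff3ToGenePred", annotation_gff3, gene_pred],  # GFF3 -> genePred
        ["genePredToBed", gene_pred, bed12],             # genePred -> bed12
    ]


def run_conversion(genome_fa, annotation_gff3, out_prefix):
    """Run the steps in order; stops at the first failing tool."""
    for cmd in build_conversion_commands(genome_fa, annotation_gff3, out_prefix):
        subprocess.run(cmd, check=True)
```

Calling `run_conversion("hg38.fa", "hg38.gff3", "hg38")` would write hg38.2bit, hg38.genePred and hg38.bed. The chain file is not covered here, since it comes from make_lastz_chains.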

Our human annotation is a few years old, so you will likely get updated transcripts and new transcripts if you use the current NCBI annotation. For U12, we used a database that is now also outdated. I would recommend running intronIC to infer U12 introns.
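
If the bed12 file is regenerated from a newer annotation, the isoforms table has to be rebuilt from the same GFF3 so that the IDs agree. The sketch below is one possible way to do that, not the method used for the provided files. It assumes a two-column gene/transcript layout (check this against the toga.isoforms.tsv in the repository) and that transcript features carry `ID` and `Parent` attributes. The function names are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def isoforms_from_gff3(lines, transcript_types=("mRNA", "transcript"),
                       id_attribute="ID"):
    """Collect (gene, transcript) pairs from the transcript features of a GFF3.

    id_attribute selects which attribute is used as the transcript name;
    it must match the names in column 4 of the bed12 file (an assumption
    to verify for your own conversion).
    """
    pairs = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = parse_attributes(cols[8])
        if id_attribute in attrs and "Parent" in attrs:
            pairs.append((attrs["Parent"], attrs[id_attribute]))
    return pairs


def write_isoforms(pairs, path):
    """Write the pairs as a tab-separated file, one gene/transcript per line."""
    with open(path, "w") as handle:
        for gene, transcript in pairs:
            handle.write(f"{gene}\t{transcript}\n")
```

The transcript names written here only work if they are identical to the names in the bed12 file, so it is worth comparing a few entries by hand before running anything.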

Hope that helps

@molinfzl
Author

molinfzl commented May 6, 2024

Thank you for your answer, it is very helpful.
I have another question about how to get the isoforms file. I followed your instructions and found that the website has been updated to the GRCh38.p14 genome, where I can also get the gene IDs and transcript IDs assigned by Ensembl. However, the transcripts in my NCBI GFF file use identifiers of the form XM_054347076.1. How should I handle the correspondence between the two? Looking forward to your answer.

@molinfzl
Author

molinfzl commented May 7, 2024

Hello, I found a TOGAInput folder in the directory, which contains toga.isoforms.tsv, toga.transcripts.bed, and toga.U12introns.tsv. You also provide hg38.2bit. Are these files complete, and can I use them directly as input files? Do I then only need to prepare the chain file and the 2bit file of the query genome?

@MichaelHiller
Collaborator

Yes, that is pretty much complete. The exceptions: chrY mostly has genes only in the PAR (including them would lead to many false 2:1 orthologies). For the same reason, we don't include the _hap, _fix, _random, _other scaffolds, which often represent variants in the population.
These are the exact files we have been using for a while.
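
For a self-made annotation, the same kind of exclusion could be applied to a bed12 file along these lines. This is a sketch under assumptions: the tag list is a guess at how such scaffolds are named (`_alt` is added because hg38 uses that tag for alternate loci; it is not from the list above), and the function names are hypothetical.

```python
# Chromosomes dropped entirely (chrY, to avoid false 2:1 orthologies).
EXCLUDED_CHROMS = {"chrY"}
# Name fragments marking variant or unplaced scaffolds (assumed naming).
EXCLUDED_TAGS = ("_hap", "_fix", "_random", "_alt")


def keep_chrom(chrom):
    """Return True if annotations on this sequence should be kept."""
    if chrom in EXCLUDED_CHROMS:
        return False
    return not any(tag in chrom for tag in EXCLUDED_TAGS)


def filter_bed_lines(lines):
    """Keep only BED lines whose first column passes keep_chrom."""
    kept = []
    for line in lines:
        # Skip blank lines and BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t", 1)[0]
        if keep_chrom(chrom):
            kept.append(line)
    return kept
```

Substring matching is used instead of a suffix check because names such as chr6_apd_hap1 carry a number after the tag.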

We will release an updated input annotation with a new TOGA version (in the works).
