Input file preparation #158
Comments
To the first question: hg38 with our input gene annotation works really well as the reference for other placental mammals. For birds, you can use chicken. For other species, pick a well-assembled and well-annotated species as the reference. Yes, faToTwoBit from the Kent source code converts FASTA to 2bit. Our human annotation is a few years old, so you will likely get updated and new transcripts if you use the current NCBI annotation. For U12, we used a database that is now also outdated; I would recommend running intronIC to infer U12 introns. Hope that helps.
Thank you for your answer, it will be very helpful.
Hello, I found a TOGAInput folder in the directory, which contains toga.isoforms.tsv, toga.transcripts.bed, and toga.U12introns.tsv. You also provide hg38.2bit. Are these files complete? Can I use them directly as input files, so that I only need to prepare the chain file and the 2bit file of the query genome?
Yes, that is pretty much complete. chrY is not included: it mostly has genes only on the PAR, and including them would lead to many false 2:1 orthologies. For the same reason, we don't include the _hap, _fix, _random, and _other scaffolds, which often represent variants in the population. We will release an updated input annotation with a new TOGA version (in the works).
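For anyone building their own reference annotation along the same lines, here is a minimal Python sketch of that filtering step. It is not the maintainers' script. It assumes chrY is dropped as a whole and that unwanted scaffolds can be recognised by the substrings `_hap`, `_fix`, `_random`, and `_alt` in the chromosome name; `_alt` is my guess at one of the "_other" cases, so adjust the list to your assembly's naming scheme. The function names are hypothetical.

```python
# Assumed markers for scaffolds to exclude; "_alt" is a guess, not from the thread.
EXCLUDED_MARKERS = ("_hap", "_fix", "_random", "_alt")
EXCLUDED_CHROMS = {"chrY"}


def keep_chrom(chrom):
    """Return True if annotations on this chromosome/scaffold should be kept."""
    if chrom in EXCLUDED_CHROMS:
        return False
    return not any(marker in chrom for marker in EXCLUDED_MARKERS)


def filter_bed_lines(lines):
    """Yield only the BED lines whose first column passes keep_chrom."""
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        if keep_chrom(chrom):
            yield line
```

To use it on a file, open the BED12 file, pass the file object to `filter_bed_lines`, and write the yielded lines to a new file.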
I would like to know how each input file was obtained, and whether I can also use hg38 as the reference genome in other projects.
My current understanding is that the two 2bit files are converted from the genome FASTA files, the chain file is obtained from a lastz alignment, and the BED12 file is converted from the hg38 GFF annotation file at NCBI. However, the gene IDs and transcript IDs in the isoforms.tsv and U12sites.tsv you provide differ from those I converted myself. Do you have any suggestions on this?
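One way to narrow down an ID mismatch like this is to check that the BED12 file and the isoforms file agree with each other. Below is a rough Python sketch, not part of TOGA. It assumes the isoforms file is tab-separated with the gene ID in the first column and the transcript ID in the second, and that the BED12 name column (column 4) holds the transcript ID. The function names are hypothetical.

```python
def bed_transcript_ids(bed_lines):
    """Collect the name column (4th field) from BED12 lines."""
    ids = set()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 12:  # ignore headers and malformed lines
            ids.add(fields[3])
    return ids


def isoform_transcript_ids(tsv_lines):
    """Collect transcript IDs from an assumed two-column gene<TAB>transcript file."""
    ids = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            ids.add(fields[1])
    return ids


def compare_ids(bed_lines, tsv_lines):
    """Return (IDs only in the BED file, IDs only in the isoforms file), both sorted."""
    bed = bed_transcript_ids(bed_lines)
    iso = isoform_transcript_ids(tsv_lines)
    return sorted(bed - iso), sorted(iso - bed)
```

If both returned lists are empty, the two files use the same transcript IDs. If the isoforms file has a header row, its second column will show up in the second list and can be ignored.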