Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bed file error #156

Open
LliliansCalvo opened this issue Apr 22, 2024 · 10 comments
Open

Bed file error #156

LliliansCalvo opened this issue Apr 22, 2024 · 10 comments

Comments

@LliliansCalvo
Copy link

Hi I am trying to run TOGA, so far u successfully ran the first part to make the alignement.
However whenever I try to run TOGA I get an error related to the bed file that I don't really know how to troubleshoot.
Thanks!

./toga.py make_lastz_chains/Dmel_Cfel_2bit/Dm6.Cfel.final.chain.gz /work/FAC/FBM/Drosophila_genome/dm6.ncbiRefSeq_gxf2bed.bed /work/FAC/FBM/Drosophila_genome/dm6.2bit mod_GCA_030586385.1_ASM3058638v1_genomic.2bit --kt --pn Dmel_Cfel_results --cb 3,5 --cjn 500 --ms

ERROR:
#### Initiating TOGA class ####
# python interpreter path: /users/lcalvogo/.conda/envs/llilians_env/bin/python3
# python interpreter version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0]
Version 1.1.8.dev
Commit: 97eb5a17ce76fccd9858b2ed738c51cd661292aa
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.1
* xgboost: 2.0.1
! scikit-learn: Not installed - will try to install
* joblib: 1.3.2
* h5py: 3.10.0
Calling cmd:
/work/FAC/FBM/TOGA/./configure.sh

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
gzip -dc make_lastz_chains/Dmel_Cfel_2bit/Dm6.Cfel.final.chain.gz | /work/FAC/FBM/TOGA/./modules/chain_score_filter stdin 15000 > /work/FAC/FBM/TOGA/Dmel_Cfel_results/temp/genome_alignment.chain

Command finished with exit code 0.
filter_bed: transcript CR11023 skipped: out-of-frame transcript
filter_bed: Error! Bed file is corrupted, thickEnd MUST be >= thickStart.
Problem occurred at line 2, gene NM_078715.4
filter_bed: Error! Bed file is corrupted, thickEnd MUST be >= thickStart.
Problem occurred at line 2, gene NM_078715.4



where 
head  /work/FAC/FBM/Drosophila_genome/dm6.ncbiRefSeq_gxf2bed.bed 
chr2L	7528	9484	CR11023	0	+	7528	9484	0	2	588,1292,	0,664,
chr2L	9838	18570	NM_078715.4	0	-	17133	11214	0	9	310,160,779,1192,106,643,443,109,1506,	8422,7214,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_001258875.2	0	-	17133	11214	0	1	311,	11227,
chr2L	9838	21376	NM_164350.3	0	-	15645	11214	0	9	311,141,779,1192,106,643,443,109,1506,	11227,10041,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_001258873.2	0	-	17133	11214	0	9	241,160,779,1192,106,643,443,109,1506,	11297,7214,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	18570	NM_164349.3	0	-	17133	11214	0	10	310,143,160,779,1192,106,643,443,109,1506,	8422,8187,7214,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_164352.3	0	-	15645	11214	0	10	28,135,141,779,1192,106,643,443,109,1506,	11510,11227,10041,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_001258872.2	0	-	17133	11214	0	10	241,136,160,779,1192,106,643,443,109,1506,	11297,10041,7214,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_164348.3	0	-	19941	11214	0	9	241,136,779,1192,106,643,443,109,1506,	11297,10046,5094,3844,3681,2447,1940,1571,0,
chr2L	9838	21376	NM_164351.3	0	-	15645	11214	0	9	311,136,779,1192,106,643,443,109,1506,	11227,10041,5094,3844,3681,2447,1940,1571,0,
@MichaelHiller
Copy link
Collaborator

Your input dm6.ncbi*bed file is corrupted. All coordinates in bed are always listed such that the start < end, also for - strand genes.
This is the case for the transcription start/end, but not translation (thick) start/end, where you have 17133 11214.

This needs to be fixed.

Afterwards, you can do this check with UCSC kent tools
bedToGenePred dm6.ncbiRefSeq_gxf2bed.bed stdout | genePredCheck stdin
to see if all is fine now.

@LliliansCalvo
Copy link
Author

LliliansCalvo commented Apr 23, 2024

Thanks a lot!

I got a new bed file and still get errors, I understand the last error about the not allowed characters but I don't understand the other ones since the bed file passed the previous checks.
Thanks a lot !

bedToGenePred result.bed stdout | genePredCheck stdin
checked: 35121 failed: 0

head result.bed 
chrX	3373158	3522787	NR_123782.1	0	+	3522787	3522787	0	4	565,1294,56,110,	0,1781,77851,149519,
chrX	686833	749496	NM_057942.5	0	+	687698	747631	0	16	293,286,135,177,755,128,120,156,432,198,3234,282,24,321,335,2108,	0,739,35425,41717,43402,44225,50335,50676,53593,54845,55696,59000,59446,59683,60070,60555,
chrM	14130	14916	rna-Dmel_CR34096	0	-	14916	14916	0	1	786,	0,
chrM	14057	14130	rna-Dmel_CR34095	0	-	14130	14130	0	1	73,	0,
chrM	12734	14058	rna-Dmel_CR34094	0	-	14058	14058	0	1	1324,	0,
chrM	12669	12734	rna-Dmel_CR34093	0	-	12734	12734	0	1	65,	0,
chrM	11720	12659	YP_009047278.1	0	-	11720	12659	0	1	939,	0,
chrM	11637	11703	rna-Dmel_CR34091	0	+	11703	11703	0	1	66,	0,
chrM	10498	11635	YP_009047277.1	0	+	10498	11635	0	1	1137,	0,
chrM	9970	10495	YP_009047276.1	0	+	9970	10495	0	1	525,	0,
chrM	9903	9964	rna-Dmel_CR34088	0	-	9964	9964	0	1	61,	0,
chrM	9837	9903	rna-Dmel_CR34087	0	+	9903	9903	0	1	66,	0,

./toga.py make_lastz_chains/Dmel_Cfel_2bit/Dm6.Cfel.final.chain.gz /work/FAC/FBM/Drosophila_genome/result.bed /work/FAC/FBM/Drosophila_genome/dm6.2bit mod_GCA_030586385.1_ASM3058638v1_genomic.2bit --kt --pn Dmel_Cfel_results --cb 3,5 --cjn 500 --ms

cat slurm-42609212.out 

#### Initiating TOGA class ####
# python interpreter path: /users/lcalvogo/.conda/envs/llilians_env/bin/python3
# python interpreter version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0]
Version 1.1.8.dev
Commit: 97eb5a17ce76fccd9858b2ed738c51cd661292aa
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.1
* xgboost: 2.0.1
! scikit-learn: Not installed - will try to install
* joblib: 1.3.2
* h5py: 3.10.0
Calling cmd:
/work/FAC/FBM/TOGA/TOGA/./configure.sh

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
gzip -dc make_lastz_chains/Dmel_Cfel_2bit/Dm6.Cfel.final.chain.gz | /work/FAC/FBM/TOGA/TOGA/./modules/chain_score_filter stdin 15000 > /work/FAC/FBM/TOGA/TOGA/Dmel_Cfel_results/temp/genome_alignment.chain

Command finished with exit code 0.
filter_bed: transcript rna-Dmel_CR34096 skipped: no CDS
filter_bed: transcript rna-Dmel_CR34095 skipped: no CDS
filter_bed: transcript rna-Dmel_CR34094 skipped: no CDS
filter_bed: transcript NM_057942.5 skipped: out-of-frame transcript
filter_bed: transcript NM_001169159.2 skipped: out-of-frame transcript
filter_bed: transcript YP_009047274.1 skipped: out-of-frame transcript
filter_bed: Error! 35 transcript names contain not allowed characters
* rna-gnl|FlyBase|CR45851-RA
filter_bed: Allowed characters are: a-zA-Z0-9._-
filter_bed: Allowed characters are: a-zA-Z0-9._-

@MichaelHiller
Copy link
Collaborator

I believe the | chars are not allowed.
Maybe replace your gene and transcript IDs with 'normal' names.

The others are warnings of transcripts which do not have a proper reading frame and will be ignored.

@LliliansCalvo
Copy link
Author

Thanks,

Now with 'normal' IDs it stills trows some of the warnings as before and it moves a bit forward but stalls later.

head /work/FAC/FBM/Drosophila_genome/dm6.refGene.gtf.bed
chrX	11933704	11983915	NM_001370018	0	-	11935993	11980057	0	31	2491,245,152,201,164,122,209,203,282,95,183,84,111,165,202,144,564,336,514,198,105,60,116,104,221,156,101,140,319,414,70,	0,2556,2987,5216,5990,6216,14923,15347,16171,18424,20394,28315,28525,28746,30353,30921,34946,36626,37754,38341,39552,39720,40519,42698,42899,43184,43757,45920,46134,47058,50141,
chrX	11933704	11983915	NM_001370019	0	-	11935258	11980057	0	31	2491,245,152,201,164,122,209,203,282,95,183,84,111,165,202,144,564,336,514,198,105,60,116,104,221,156,101,140,319,414,70,	0,2556,2987,5216,5990,6216,14923,15347,16171,18424,20394,28315,28525,28746,30353,30921,34946,36626,37754,38341,39552,39720,40519,42698,42899,43184,43757,45920,46134,47058,50141,
chrX	11933704	11983915	NM_206696	0	-	11936044	11980057	0	31	2491,245,152,201,164,122,209,203,282,95,183,84,111,165,202,144,564,336,514,198,105,60,116,104,221,156,101,140,319,414,70,	0,2556,2987,5216,5990,6216,14923,15347,16171,18424,20394,28315,28525,28746,30353,30921,34946,36626,37754,38341,39552,39720,40519,42698,42899,43184,43757,45920,46134,47058,50141

./toga.py make_lastz_chains/Dmel_Cfel/Dm6.Cfel.final.chain.gz  /work/FAC/FBM/Drosophila_genome/dm6.refGene.gtf.bed   /work/FAC/FBM/Drosophila_genome/dm6.2bit mod_GCA_030586385.1_ASM3058638v1_genomic.2bit --kt --pn Dmel_Cfel_results --cb 3,5 --cjn 500 --ms

filter_bed: transcript NR_001851_3 skipped: no CDS
filter_bed: transcript NR_001851_4 skipped: no CDS
filter_bed: transcript NR_001851_5 skipped: no CDS
filter_bed: transcript NM_001299840 skipped: out-of-frame transcript
filter_bed: transcript NM_001299811 skipped: out-of-frame transcript
[...]

Continue without isoforms file: not provided
Found 1870 sequences in /work/FAC/FBM/Drosophila_genome/dm6.2bit
Found 1870 sequences in /work/FAC/FBM/Drosophila_genome/dm6.2bit
Found 1197 sequences in /work/FAC/FBM/TOGA/TOGA/mod_GCA_030586385.1_ASM3058638v1_genomic.2bit
Traceback (most recent call last):
  File "/work/FAC/FBM/TOGA/TOGA/./toga.py", line 1600, in <module>
    main()
  File "/work/FAC/FBM/TOGA/TOGA/./toga.py", line 1595, in main
    toga_manager = Toga(args)
  File "/work/FAC/FBM/TOGA/TOGA/./toga.py", line 261, in __init__
    self.__check_param_files()
  File "/work/FAC/FBM/TOGA/TOGA/./toga.py", line 339, in __check_param_files
    TogaSanityChecker.check_2bit_file_completeness(self.q_2bit, q_chrom_to_size, self.chain_file)
  File "/work/FAC/FBM/TOGA/TOGA/modules/toga_sanity_checks.py", line 87, in check_2bit_file_completeness
    raise ValueError(err)
ValueError: Error! 2bit file: /work/FAC/FBM/TOGA/TOGA/mod_GCA_030586385.1_ASM3058638v1_genomic.2bit; chain/bed file: /work/FAC/FBM/TOGA/TOGA/Dmel_Cfel_results/temp/genome_alignment.chain; Some chromosomes present in the chain/bed file are not found in the Two bit file. First <=100: JAPIVC010000807
JAPIVC010000905
JAPIVC010000227
JAPIVC010000952
JAPIVC010000420

But those IDs do exist in the 2bit file. I even converted the 2bit back to fasta and the headers are there

grep -e ">" mod_GCA_030586385.1_ASM3058638v1_genomic_2bit.fa | head
>JAPIVC010000001.1
>JAPIVC010000002.1
>JAPIVC010000003.1
>JAPIVC010000004.1
>JAPIVC010000005.1
(llilians_env) [lcalvogo@curnagl TOGA]$ grep -e "JAPIVC010000905" mod_GCA_030586385.1_ASM3058638v1_genomic_2bit.fa | head
>JAPIVC010000905.1

@MichaelHiller
Copy link
Collaborator

You have JAPIVC010000807.1 and not JAPIVC010000807 in the 2bit file

@LliliansCalvo
Copy link
Author

LliliansCalvo commented Apr 24, 2024

But in theory that's not true because:

  1. The bit file was build using the fasta file where the headers have the .1
make_lastz_chains/HL_kent_binaries/faToTwoBit /work/FAC/FBM/Gilly/genome/mod_GCA_030586385.1_ASM3058638v1_genomic.fna mod_GCA_030586385.1_ASM3058638v1_genomic.2bit

grep -e ">" /work/FAC/FBM/Gilly/genome/mod_GCA_030586385.1_ASM3058638v1_genomic.fna | head 
>JAPIVC010000001.1
>JAPIVC010000002.1
>JAPIVC010000003.1
>JAPIVC010000004.1

  1. When I reverse converted the 2bit file back to fasta the headers still have the .1
HL_kent_binaries/twoBitToFa mod_GCA_030586385.1_ASM3058638v1_genomic.2bit mod_GCA_030586385.1_ASM3058638v1_genomic_2bit.fa
 
grep -e ">"  mod_GCA_030586385.1_ASM3058638v1_genomic_2bit.fa| head
>JAPIVC010000001.1
>JAPIVC010000002.1
>JAPIVC010000003.1
>JAPIVC010000004.1
>JAPIVC010000005.1

@MichaelHiller
Copy link
Collaborator

All scaffold names have to match, otherwise the method does not know what is what.
Sorry

@LliliansCalvo
Copy link
Author

Thanks!
Yes, I got that.
But in theory they all match since they all have the .1 (in both fasta and 2bit file). Which is why i dont understand the TOGA error. Unless >header.1 is just not allowed altogether.

@MichaelHiller
Copy link
Collaborator

I am a bit confused. The TOGA error indicates that either the bed or the chain file have JAPIVC010000001 while the 2bit has JAPIVC010000001.1

We always strip the .1 from scaffold name when we load a genome, as there is no reason to have .1 attached.

Could you pls try to remove the .1 from all files that TOGA gets?

@kirilenkobm
Copy link
Member

Hi @LliliansCalvo
pls also check that the chain file has the same names.
There must be a match between chromosome/scaffold names in chain AND bed AND both 2bits.

I suspect the chain file has chromosome/scaffold names without .1...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants