Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input format error #8

Open
hanclair opened this issue Sep 13, 2022 · 4 comments
Open

Input format error #8

hanclair opened this issue Sep 13, 2022 · 4 comments
Assignees

Comments

@hanclair
Copy link

Hello,

I am hitting an error with what I assume to be a file format issue. This is the error message that I get:

2022-09-13 10:09:24,357 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Method: rr
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Null Model: perm
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-13 10:09:24,360 | tejaas.main | DEBUG | Using 2 cores
2022-09-13 10:09:24,389 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-13 10:09:24,769 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-13 10:09:25,710 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-13 10:09:25,710 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
2022-09-13 10:09:30,776 | tejaas.iotools.data | DEBUG | Selecting common samples of genotype and gene expression
2022-09-13 10:09:30,776 | tejaas.iotools.data | DEBUG | Before expression selection: 7155 genes from 218 samples
Traceback (most recent call last):
  File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
    sys.exit(main())
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
    data.load()
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 248, in load
    expression_selected = rpkm._normalize_expr(expression[:, exprmask][indices, :])
IndexError: arrays used as indices must be of integer (or boolean) type

This is my Tejaas command:

mpirun -n ${NCORE} ${RUN_PATH} --vcf ${GENOFILE} --include-SNPs ${INCSTRING} --chrom ${CHROM} \
                               --gx ${GXPRFILE} --gxcorr ${GXPRFILE} --gtf ${GENEINFO} --trim --outprefix ${OUTPREFIX_RR} \
                               --method rr --prior-sigma 0.1 --knn 30 --cismask --null perm --psnpthres 0.1 --pgenethres 0.1 --biotype 

This is the format of my expression file:

Gene_IDs        A001    A002    A003    A004    A005   
g5649   4.80323384689818        5.02117811663358        3.92602967965938        3.92602967965938        4.29661519218949        
g5653   4.28097534826939        5.02117811663358        3.92602967965938        3.92602967965938        3.92602967965938        
g5654   3.92602967965938        3.92602967965938        5.08096788235617        5.17450993990352        4.29661519218949        

My VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A001    A002    A003    A004    A005    
1       243     ScPoQx8_4268;HRSCAF=5273_243    T       C       131     .       .       GT:PL   0/1:64,0,34     0/0:.   ./.:.   0/0:.   ./.:.   
1       247     ScPoQx8_4268;HRSCAF=5273_247    A       G       125     .       .       GT:PL   0/1:64,0,35     0/0:.   ./.:.   0/0:.   ./.:.     
1       275     ScPoQx8_4268;HRSCAF=5273_275    C       T       159     .       .       GT:PL   0/0:.   0/0:.   ./.:.   0/0:.   ./.:.   

My annotation file:

ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        gene    1       1540    0.98    -       .       ID "g1"; gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        mRNA    1       1540    0.98    -       .       ID "g1.t1"; Parent "g1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        CDS     1110    1540    0.98    -       0        Parent "g1.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        exon    1110    1540    .       -       .        Parent "g1.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        gene    15395   25390   0.65    +       .       ID "g2"; gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        mRNA    15395   25390   0.65    +       .       ID "g2.t1"; Parent "g2" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        CDS     15395   15430   0.65    +       0        Parent "g2.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        exon    15395   15430   .       +       .        Parent "g2.t1" gene_type "protein_coding";

This GTF was generated from a GFF3 using rtracklayer in R, and then I manually added gene_type "protein_coding"; to the last field.

Troubleshooting that I tried:

  1. Judging by the solution from the closed issue "GTF read error", I assumed my issue is the GTF file format. I tried to verify that the expression and VCF file formats are acceptable by test running Tejaas with my expression and VCF files and the gencode.v19 GTF that came with the example, but hit the same error as above.
  2. I thought perhaps Tejaas does not allow for VCF with missing data, so I removed all the sites with any missing genotypes in the VCF but also hit the same error as above. (Are VCFs with missing genotypes allowed?)
  3. I simplified the chromosome names in the annotation file to match those from gencode, like this:
chr1    AUGUSTUS        gene    28451   29982   0.29    -       .       ID "g13599"; gene_type "protein_coding";
chr1    AUGUSTUS        mRNA    28451   29982   0.29    -       .       ID "g13599.t1"; Parent "g13599" gene_type "protein_coding";
chr1    AUGUSTUS        CDS     28451   28712   0.43    -       1        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        exon    28451   28712   .       -       .        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        CDS     29864   29982   0.29    -       0        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        exon    29864   29982   .       -       .        Parent "g13599.t1" gene_type "protein_coding";

but ended up with a different error:

2022-09-13 12:23:29,867 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Method: rr
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Null Model: perm
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-13 12:23:29,872 | tejaas.main | DEBUG | Using 2 cores
2022-09-13 12:23:29,909 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-13 12:23:30,386 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-13 12:23:31,327 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-13 12:23:31,327 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
Traceback (most recent call last):
  File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
    sys.exit(main())
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
    data.load()
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 241, in load
    gene_info = readgtf.gencode(self.args.gtf_file, trim=self.args.gxtrim, biotype=self.args.biotype)
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/readgtf.py", line 48, in gencode
    gene_name = infolist[4].strip().split(' ')[1].replace('"','')
IndexError: list index out of range

Now I am not sure which input file is likely causing the issue.
I am able to successfully run the example files that came with the download.

Any suggestions would much appreciated. Thank you.

@banskt
Copy link
Member

banskt commented Sep 16, 2022

Hi,

Thank you for creating this issue.

The problem is because of the GTF file format, which is explained here. Note that fields are tab-separated, and the last column (field 9) is semicolon-separated list of tag-value pairs. You missed the semicolons. Tejaas looks for three tags in this field, namely "gene_name", "gene_id" and "gene_type". The first column should be the chromosome number, with or without the 'chr' prefix. However, with the input file that you are using, the debug messages should have included a line which says,

No gene annotations found in GTF file. Check feature 'gene' is present, with 'gene_id' and 'gene_name' annotations

I am surprised that this message was not produced. Would it be possible for you to share the GTF file that you are using so that I can check why it is not producing this particular error message?

Explanation of the error messages

IndexError: arrays used as indices must be of integer (or boolean) type

This happens because an empty array is used. No genes were selected because the gene names provided in the expression file did not match the gene names in the GTF file (obtained from field 9 of GTF with tag "gene_id"). In your GTF file, no "gene_id" tag-value pair is provided.

Troubleshooting 1: You found the same error because the gene_id from gencode.v19 GTF are different from the gene names that you are using.

Troubleshooting 2: Tejaas allows missing genotype.

Troubleshooting 3: I think this is because the GTF file was not tab-separated.

Fixed GTF File format

chr1    AUGUSTUS    gene    1   29982   0.29    -   .   gene_id "g5649"; gene_type "protein_coding"; gene_name "FOO";
chr1    AUGUSTUS    mRNA    28451   29982   0.29    -   .   gene_id "g13599.t1"; Parent "g13599"; gene_type "protein_coding"; gene_name "BAR";
chr1    AUGUSTUS    CDS 28451   28712   0.43    -   1   Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g2"; gene_name "FOOBAR";
chr1    AUGUSTUS    exon    28451   28712   .   -   .   Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g3"; gene_name "FOO2";
chr1    AUGUSTUS    CDS 29864   29982   0.29    -   0   Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g4"; gene_name "FOO3";
chr1    AUGUSTUS    exon    29864   29982   .   -   .   Parent "g13599.t1"; gene_type "protein_coding"; gene_id "g5"; gene_name "FOO4";

Note the new tag-value pairs, tab separation between fields.

@banskt banskt self-assigned this Sep 16, 2022
@hanclair
Copy link
Author

Thank you so much for this extremely detailed response.

I am unfortunately still have trouble with my GTF file format. I revised the input GTF to look like this:

chr1	AUGUSTUS	gene	1	1540	0.98	-	.	gene_id "g5649"; gene_type "protein_coding"; gene_name "g5649";
chr1	AUGUSTUS	mRNA	1	1540	0.98	-	.	gene_id "g5649"; Parent "g5649"; gene_type "protein_coding"; gene_name "g5649";
chr1	AUGUSTUS	CDS	1110	1540	0.98	-	0	Parent "g5649.t1"; gene_type "protein_coding"; gene_id "g5649"; gene_name "g5649";
chr1	AUGUSTUS	exon	1110	1540	.	-	.	Parent "g5649.t1"; gene_type "protein_coding"; gene_id "g5649"; gene_name "g5649";

The error I am now hitting is:

2022-09-27 10:34:39,010 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-27 10:34:39,012 | tejaas.utils.args | INFO | Method: rr
2022-09-27 10:34:39,013 | tejaas.utils.args | INFO | Null Model: perm
2022-09-27 10:34:39,013 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-27 10:34:39,013 | tejaas.main | DEBUG | Using 2 cores
2022-09-27 10:34:39,041 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-27 10:34:39,476 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-27 10:34:40,391 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-27 10:34:40,391 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
Traceback (most recent call last):
  File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
    sys.exit(main())
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
    data.load()
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 241, in load
    gene_info = readgtf.gencode(self.args.gtf_file, trim=self.args.gxtrim, biotype=self.args.biotype)
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/readgtf.py", line 48, in gencode
    gene_name = infolist[4].strip().split(' ')[1].replace('"','')
IndexError: list index out of range

I checked that the GTF file is properly tab separated. Attached is the first 1000 lines of the GTF that I am using, as an example.

test_chr.gtf.txt

Thank you again for all your time. I really appreciate it.

@banskt
Copy link
Member

banskt commented Oct 14, 2022

The GTF file format is correct. I found that the problem is because the PyPI release version of our package was not using the latest Github changes. We sincerely apologize for your trouble. Hope you will find Tejaas useful to detect trans-eQTLs. If you would need any help in selecting the parameters or running the software, please let us know.

I have now updated the package to the latest Github version. You can update it using:

pip install tejaas --upgrade

and hopefully it will work. If not, please do let us know.

Also, note that the --biotype flag in the command line requires an argument (protein_coding or lncRNA).

@hanclair
Copy link
Author

hanclair commented Feb 28, 2023

The program runs for me now with the update you mentioned!

An additional questions: I don't know if I am misunderstanding the --chrom tag. My understanding is that if I have multiple chromosomes in the vcf (each chromosome designated as an integer in the CHROM column of the vcf), then I can use this parameter to designate which chromosome I would choose SNPs from (ie --chrom 2 --include-SNPs 0:100 would mean first 100 SNPs from chr2). However, regardless of what integer I put for the --chrom tag, it defaults to picking the SNPs from the first chromosome in the vcf. I could even put a number that doesn't exist as a chromosome in the vcf and it would still run and default to SNPs that are on the first chromosome.

I have been getting around this problem by pre-subsetting the vcf to each only contain one chromosome. I see in the parallelization example it says "file_path_here_${CHRM}.vcf.gz", so perhaps subsetting the vcf into one chromosome per file is what I am meant to do? Just wondering if I am misunderstanding this, since this is a required parameter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants