Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not run vep #244

Closed
toanddt opened this issue Jun 10, 2020 · 13 comments · Fixed by #251
Closed

Can not run vep #244

toanddt opened this issue Jun 10, 2020 · 13 comments · Fixed by #251
Milestone

Comments

@toanddt
Copy link

toanddt commented Jun 10, 2020

I can run normal vep tool with the same input file but errors occured when I run cannoli vep with following command:
cannoli-submit --master local --deploy-mode client -- vep /input/sample.vcf /output/out -executable /home/spark/ensembl-vep/vep -cache file:///home/spark/.vep/homo_sapiens/95_GRCh38/ -single
Log:
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119) at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104) at org.bdgenomics.adam.rdd.OutFormatterRunner.<init>(OutFormatter.scala:32) at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:876) at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:838) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Please help to resolve the issue

@heuermh
Copy link
Member

heuermh commented Jun 11, 2020

Hello @toanddt! Thank you for submitting this issue.

I am afraid the malformed header line error reported above might be masking the real issue. Cannoli is expecting valid VCF on stdout from vep.

Might you be able to look for any error message(s) reported by vep itself in the logs?

@toanddt
Copy link
Author

toanddt commented Jun 11, 2020

Thank for your quick response @heuermh.
This a vep logs:
`-------------------- EXCEPTION --------------------
MSG: ERROR: "so" is not a valid value for --terms
Valid values: SO, display, NCBI

STACK Bio::EnsEMBL::VEP::Config::check_config /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm:649
STACK Bio::EnsEMBL::VEP::Config::new /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm:487
STACK Bio::EnsEMBL::VEP::BaseRunner::new /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:92
STACK toplevel /home/spark/ensembl-vep/vep:216
Date (localtime) = Thu Jun 11 05:24:51 2020
Ensembl API version = 95
`
And I found out the rout cause, because cannoli hardcode vep args

`class Vep(
val args: VepArgs,
val stringency: ValidationStringency = ValidationStringency.LENIENT,
sc: SparkContext) extends CannoliFnVariantContextDataset, VariantContextDataset {

override def apply(variantContexts: VariantContextDataset): VariantContextDataset = {

val builder = CommandBuilders.create(args.useDocker, args.useSingularity)
  .setExecutable(args.executable)
  .add("--format")
  .add("vcf")
  .add("--output_file")
  .add("STDOUT")
  .add("--vcf_info_field")
  .add("ANN")
  .add("--terms")
  .add("so")
  .add("--no_stats")
  .add("--offline")
  .add("--dir_cache")
  .add(if (args.addFiles) "$0" else absolute(args.cachePath))`

What version of VEP you used for testing?

@heuermh
Copy link
Member

heuermh commented Jun 12, 2020

Thanks for the additional information!

Oh, that is frustrating -- why would they make an incompatible change soSO? We have to hard code some command line arguments in order for vep to output valid VCF ANN specification-formatted annotations. I will have to find at what version they made this change.

@toanddt
Copy link
Author

toanddt commented Jun 16, 2020

Hi @heuermh
Did you find out the version of vep which can be used to run with cannoli?

@heuermh
Copy link
Member

heuermh commented Jul 4, 2020

Hello @toanddt, it appears that SO has been the default for --terms for a long while back, so I feel it is pretty safe to make the change in #251. Please let me know if that works for you.

@toanddt
Copy link
Author

toanddt commented Jul 7, 2020

Thank @heuermh. But I am still not able to run cannoli - vep, another issue occured . Are you able to run cannoli - vep at your side?

@heuermh
Copy link
Member

heuermh commented Jul 9, 2020

I am also not able to run the current version of vep via its Bioconda/Biocontainers Docker image

...
20/07/08 21:30:29 INFO Vep: Piping DatasetBoundVariantContextDataset with 0 reference sequences and 0
samples to vep with command: [docker, run, -i, -v, /data:/data, --rm,
quay.io/biocontainers/ensembl-vep:100.2--pl526hecda079_0, vep, --format, vcf, --output_file, STDOUT,
--vcf_info_field, ANN, --no_stats, --offline, --dir_cache, /data, --species, homo_sapiens]
files: []
...
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
20/07/08 21:30:34 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw a header line specifying VCF version
	at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:108)
	at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104)
	at org.bdgenomics.adam.rdd.OutFormatterRunner.<init>(OutFormatter.scala:32)
	at org.bdgenomics.adam.rdd.GenomicDataset.$anonfun$pipe$8(GenomicDataset.scala:872)

I tried with both the indexed and non-indexed versions of the VEP cache, no difference. It appears that vep is outputting invalid VCF or no VCF at all to stdout.

Will continue to investigate...

@heuermh
Copy link
Member

heuermh commented Jul 13, 2020

See also bigdatagenomics/adam#2150

@heuermh heuermh modified the milestones: 0.10.0, 0.11.0 Jul 13, 2020
@dmaziec
Copy link

dmaziec commented Oct 24, 2020

Hello,
The root cause of this issue is that currently --vcf flag, which is used to set output in VCF format, is missing.
After adding following changes: dmaziec/cannoli@5684640 it is possible to save it as VCF

@heuermh
Copy link
Member

heuermh commented Oct 26, 2020

Thank you @dmaziec1! I left a few questions as comments on your commit

@heuermh
Copy link
Member

heuermh commented Oct 26, 2020

To answer some of my questions, from http://grch37.ensembl.org/info/docs/tools/vep/vep_formats.html#output

VCF output

The VEP script can also generate VCF output using the --vcf flag.

Main information about the specificity of the VEP VCF output format:

  • Consequences are added in the INFO field of the VCF file, using the key "CSQ" (you can change it using --vcf_info_field).

Thus it appears for Canolli we should be using

vep --format vcf --output_file STDOUT --vcf --vcf_info_field ANN --no_stats --offline --dir_cache ...

(edited)

@heuermh
Copy link
Member

heuermh commented Oct 28, 2020

I've discovered multiple issues that will prevent mapping Ensembl VEP ANN VCF INFO field values to TranscriptEffect.

See bigdatagenomics/adam#2279

@heuermh
Copy link
Member

heuermh commented Dec 16, 2020

Properly mapping Ensembl VEP ANN VCF INFO field values to TranscriptEffects will take some work, possibly upstream in VEP itself. Meanwhile, I've merged #251 which retains VEP annotations in the default CSQ INFO field and does not attempt to map those to TranscriptEffects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants