Can not run vep #244

toanddt · 2020-06-10T02:39:32Z

I can run normal vep tool with the same input file but errors occured when I run cannoli vep with following command:
cannoli-submit --master local --deploy-mode client -- vep /input/sample.vcf /output/out -executable /home/spark/ensembl-vep/vep -cache file:///home/spark/.vep/homo_sapiens/95_GRCh38/ -single
Log:
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119) at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104) at org.bdgenomics.adam.rdd.OutFormatterRunner.<init>(OutFormatter.scala:32) at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:876) at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:838) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Please help to resolve the issue

The text was updated successfully, but these errors were encountered:

heuermh · 2020-06-11T02:39:13Z

Hello @toanddt! Thank you for submitting this issue.

I am afraid the malformed header line error reported above might be masking the real issue. Cannoli is expecting valid VCF on stdout from vep.

Might you be able to look for any error message(s) reported by vep itself in the logs?

toanddt · 2020-06-11T05:53:01Z

Thank for your quick response @heuermh.
This a vep logs:
`-------------------- EXCEPTION --------------------
MSG: ERROR: "so" is not a valid value for --terms
Valid values: SO, display, NCBI

STACK Bio::EnsEMBL::VEP::Config::check_config /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm:649
STACK Bio::EnsEMBL::VEP::Config::new /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/Config.pm:487
STACK Bio::EnsEMBL::VEP::BaseRunner::new /home/spark/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:92
STACK toplevel /home/spark/ensembl-vep/vep:216
Date (localtime) = Thu Jun 11 05:24:51 2020
Ensembl API version = 95
`
And I found out the rout cause, because cannoli hardcode vep args

`class Vep(
val args: VepArgs,
val stringency: ValidationStringency = ValidationStringency.LENIENT,
sc: SparkContext) extends CannoliFnVariantContextDataset, VariantContextDataset {

override def apply(variantContexts: VariantContextDataset): VariantContextDataset = {

val builder = CommandBuilders.create(args.useDocker, args.useSingularity)
  .setExecutable(args.executable)
  .add("--format")
  .add("vcf")
  .add("--output_file")
  .add("STDOUT")
  .add("--vcf_info_field")
  .add("ANN")
  .add("--terms")
  .add("so")
  .add("--no_stats")
  .add("--offline")
  .add("--dir_cache")
  .add(if (args.addFiles) "$0" else absolute(args.cachePath))`

What version of VEP you used for testing?

heuermh · 2020-06-12T17:15:50Z

Thanks for the additional information!

Oh, that is frustrating -- why would they make an incompatible change so → SO? We have to hard code some command line arguments in order for vep to output valid VCF ANN specification-formatted annotations. I will have to find at what version they made this change.

toanddt · 2020-06-16T02:50:06Z

Hi @heuermh
Did you find out the version of vep which can be used to run with cannoli?

heuermh · 2020-07-04T19:11:07Z

Hello @toanddt, it appears that SO has been the default for --terms for a long while back, so I feel it is pretty safe to make the change in #251. Please let me know if that works for you.

toanddt · 2020-07-07T03:16:32Z

Thank @heuermh. But I am still not able to run cannoli - vep, another issue occured . Are you able to run cannoli - vep at your side?

heuermh · 2020-07-09T02:39:02Z

I am also not able to run the current version of vep via its Bioconda/Biocontainers Docker image

...
20/07/08 21:30:29 INFO Vep: Piping DatasetBoundVariantContextDataset with 0 reference sequences and 0
samples to vep with command: [docker, run, -i, -v, /data:/data, --rm,
quay.io/biocontainers/ensembl-vep:100.2--pl526hecda079_0, vep, --format, vcf, --output_file, STDOUT,
--vcf_info_field, ANN, --no_stats, --offline, --dir_cache, /data, --species, homo_sapiens]
files: []
...
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
20/07/08 21:30:34 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw a header line specifying VCF version
	at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:108)
	at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104)
	at org.bdgenomics.adam.rdd.OutFormatterRunner.<init>(OutFormatter.scala:32)
	at org.bdgenomics.adam.rdd.GenomicDataset.$anonfun$pipe$8(GenomicDataset.scala:872)

I tried with both the indexed and non-indexed versions of the VEP cache, no difference. It appears that vep is outputting invalid VCF or no VCF at all to stdout.

Will continue to investigate...

heuermh · 2020-07-13T20:36:58Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can not run vep #244

Can not run vep #244

toanddt commented Jun 10, 2020

heuermh commented Jun 11, 2020

toanddt commented Jun 11, 2020

heuermh commented Jun 12, 2020

toanddt commented Jun 16, 2020

heuermh commented Jul 4, 2020

toanddt commented Jul 7, 2020

heuermh commented Jul 9, 2020

heuermh commented Jul 13, 2020

dmaziec commented Oct 24, 2020 •

edited

heuermh commented Oct 26, 2020

heuermh commented Oct 26, 2020 •

edited

heuermh commented Oct 28, 2020

heuermh commented Dec 16, 2020

Can not run vep #244

Can not run vep #244

Comments

toanddt commented Jun 10, 2020

heuermh commented Jun 11, 2020

toanddt commented Jun 11, 2020

heuermh commented Jun 12, 2020

toanddt commented Jun 16, 2020

heuermh commented Jul 4, 2020

toanddt commented Jul 7, 2020

heuermh commented Jul 9, 2020

heuermh commented Jul 13, 2020

dmaziec commented Oct 24, 2020 • edited

heuermh commented Oct 26, 2020

heuermh commented Oct 26, 2020 • edited

heuermh commented Oct 28, 2020

heuermh commented Dec 16, 2020

dmaziec commented Oct 24, 2020 •

edited

heuermh commented Oct 26, 2020 •

edited