Streaming VCFs using smart_open? #260

Open
dbrami opened this issue Jan 6, 2023 · 3 comments

Comments

@dbrami

dbrami commented Jan 6, 2023

Hi,
I've seen open issue #174, "Can't use cyvcf2 against AWS S3", and I'm assuming the intent is for VCFs to be downloaded locally before being opened by cyvcf2.
My question is: how easy or hard would it be to use smart_open together with cyvcf2 to stream only the needed regions of a VCF from AWS S3, instead of downloading the whole VCF first?

@brentp
Owner

brentp commented Jan 7, 2023

Hi,
You should be able to use cyvcf2 directly. But if the handle you pass to VCF has a fileno method or is an integer, then it will be treated as a file descriptor.
Note that htslib will handle AWS authentication for you if you set, for example, the environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
AWS_DEFAULT_REGION
AWS_DEFAULT_PROFILE
AWS_PROFILE
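
As a rough pre-flight check, the variables listed above can be inspected before an S3 URL is handed to cyvcf2. This is only a minimal sketch: the helper name `unset_aws_vars` is hypothetical, and which of these variables are actually needed depends on the credentials setup (for instance, a session token only applies to temporary credentials).

```python
import os

# Variable names copied from the list above.
AWS_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
    "AWS_DEFAULT_PROFILE",
    "AWS_PROFILE",
)


def unset_aws_vars(env=None):
    """Return the names from AWS_ENV_VARS that are missing or empty in env.

    `env` defaults to the current process environment; passing a plain dict
    makes the check easy to try out in isolation.
    """
    env = os.environ if env is None else env
    return [name for name in AWS_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    # Prints whichever of the listed variables are not set in this shell.
    print("Not set:", unset_aws_vars())
```

Not every variable has to be set; the output is just a quick way to see what the process will actually inherit.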

@dbrami
Author

dbrami commented Jan 8, 2023

Could you elaborate please?
I would just:

  • set all the needed AWS OS environment variables you listed
  • use cyvcf2 like so:
from cyvcf2 import VCF

for variant in VCF('s3://abc-def-results/P-20230106-0003/folder1/sampe1.vcf.gz'):
    pass  # process each variant here
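
Since the original question was about fetching only the needed regions, here is a hedged sketch of a region query rather than a full iteration. It assumes the cyvcf2 build is linked against an htslib with S3 support and that an index for the VCF is available alongside it; the helper names `format_region` and `iter_region` are hypothetical, and the bucket in the usage comment is a placeholder. cyvcf2 is imported lazily so the formatting helper can be used on its own.

```python
def format_region(chrom, start, end):
    """Build a 'chrom:start-end' region string (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return f"{chrom}:{start}-{end}"


def iter_region(vcf_path, chrom, start, end):
    """Yield variants overlapping one region of an indexed VCF/BCF."""
    # Imported here so format_region can be used without cyvcf2 installed.
    from cyvcf2 import VCF

    vcf = VCF(vcf_path)
    try:
        # Calling the VCF object with a region string runs an indexed query,
        # so only the requested region is read rather than the whole file.
        yield from vcf(format_region(chrom, start, end))
    finally:
        vcf.close()


# Hypothetical usage (placeholder bucket and key):
# for variant in iter_region("s3://my-bucket/sample.vcf.gz", "chr1", 1_000_000, 1_100_000):
#     print(variant.CHROM, variant.POS, variant.REF, variant.ALT)
```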

@dbrami
Author

dbrami commented Jan 19, 2023

Hi,
The proposed solution does not seem to work. Did I miss something?

(base) ➜  AncestryML git:(main) ✗ export AWS_ACCESS_KEY_ID=ABCDE...
(base) ➜  AncestryML git:(main) ✗ export AWS_SECRET_ACCESS_KEY=DEFGH...

(ML) ➜  AncestryML git:(main) ✗ python VCF_to_hash.py -p P-20230109-1234 -v s3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
2023-01-18 16:57:55,894 - root - INFO - Logger initialized
2023-01-18 16:57:55,895 - root - INFO - Parsing command-line parameters
2023-01-18 16:57:55,898 - root - INFO - Parsing general config file cfg/GenomeHashConfig.yaml
2023-01-18 16:57:55,911 - root - INFO - Processing VCF:	/Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
[E::hts_open_format] Failed to open file "/Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz" : No such file or directory
Traceback (most recent call last):
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 352, in <module>
    main()
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 315, in main
    sample_variants_dict, samples_list, sample_project_dict = parse_vcf(vcf_list, df_var_signature,
  File "/Users/bramid/PycharmProjects/AncestryML/VCF_to_hash.py", line 218, in parse_vcf
    vcf = VCF(input_vcf)
  File "cyvcf2/cyvcf2.pyx", line 258, in cyvcf2.cyvcf2.VCF.__init__
  File "cyvcf2/cyvcf2.pyx", line 190, in cyvcf2.cyvcf2.HTSFile._open_htsfile
OSError: Error opening /Users/bramid/PycharmProjects/AncestryML/s3:/1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based/NA20787/NA20787.hard-filtered.vcf.gz
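
One possible reading of the log above: the path in the "Processing VCF" line already has the working directory prepended and `s3://` collapsed to `s3:/`, before cyvcf2 is ever called. That is the pattern `os.path.abspath` produces when given a URL, so the script may be normalizing the `-v` argument as a local path. This is an assumption, since `VCF_to_hash.py` is not shown here. A minimal sketch of a guard, with the hypothetical name `resolve_vcf_path` and an assumed set of remote schemes:

```python
import os
from urllib.parse import urlparse

# Schemes treated as remote here; this set is an assumption, adjust as needed.
REMOTE_SCHEMES = {"s3", "http", "https", "ftp", "gs"}


def resolve_vcf_path(path):
    """Return remote URLs unchanged; make local paths absolute."""
    if urlparse(path).scheme.lower() in REMOTE_SCHEMES:
        return path
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    url = "s3://bucket/key.vcf.gz"  # placeholder URL
    # abspath treats the URL as a relative file name and collapses the "//".
    print(os.path.abspath(url))
    # The guard leaves the URL exactly as it was given.
    print(resolve_vcf_path(url))
```

If the script does something like this, passing the untouched URL to `VCF` would at least rule out path mangling as the cause; whether the installed htslib can open `s3://` URLs is a separate question.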
