Timeout waiting for connection from pool for 1000 genomes vcf on AWS #1951

Open
akmorrow13 opened this issue Mar 11, 2018 · 9 comments

Comments

@akmorrow13
Contributor

val x = sc.loadGenotypes("s3a://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")

This generates the error "Unable to execute HTTP request: Timeout waiting for connection from pool" when using net.fnothaft:jsr203-s3a:0.0.2.

The error occurs with both Hadoop-BAM 7.9.2 and 7.9.1.

@fnothaft
Member

Sigh, I am seeing this too...

@akmorrow13
Contributor Author

@fnothaft how are you running? Are you on EMR or running through Toil on standard AWS instances? Apparently EMR dropped support for s3a. However, I can still loadAlignments from s3a, but not vcfs. Fortunately, s3 works just fine for vcfs (but is sloww)

@heuermh
Member

heuermh commented Mar 15, 2018

Apparently EMR dropped support for s3a.

When did that happen? And at a specific version of EMR?

Fortunately, s3 works just fine for vcfs (but is sloww)

Practically, conductor is still a good solution for s3 → HDFS, and is faster than s3-dist-cp. Conductor can't upload directories of Parquet+Avro from HDFS → s3 though, so you'd need to fall back to s3-dist-cp for that.

@akmorrow13
Contributor Author

I'm not sure which version s3a was dropped from. @delagoya may know more, as they were my source.

@fnothaft
Member

Are you able to use s3n?

@delagoya

I am checking with the EMR team to find out which URL encodings are supported.

@dstockstad

Was just passing through. Hopefully everyone has seen this page but linking just in case:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Interesting that s3:// on EMR is slower than s3a://, considering EMRFS (EMR's proprietary S3 implementation) is one of its selling points. You might be able to use s3a URLs consistently by setting the following parameters:

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>

<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>

Link:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

This is all untested but I might give this a whirl when I get a moment and see if I can get this working and post results here.
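
For anyone who prefers to generate the configuration rather than hand-edit it, here is a minimal sketch that renders the two properties above as `<property>` entries inside a `<configuration>` root element, the layout Hadoop site files such as core-site.xml use. The function name `to_core_site_xml` is hypothetical, and whether these settings resolve the timeout on EMR remains unverified, as noted above.

```python
import xml.etree.ElementTree as ET

# The two properties quoted in the comment above (untested on EMR).
S3A_PROPERTIES = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
}


def to_core_site_xml(properties):
    """Render a name -> value mapping as Hadoop <property> entries."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_core_site_xml(S3A_PROPERTIES))
```

The output is meant to be merged into an existing core-site.xml, not to replace it.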

@heuermh
Member

heuermh commented May 7, 2018

@dstockstad Thanks for the note! Where do those properties need to be specified?

@dstockstad

You're going to want to do it using the instructions here:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

The settings go into core-site. So something like this:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]

Keep in mind that I still have not actually verified this, so I can't say for sure whether it will work; it might also need additional configuration.
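
Hand-edited JSON is easy to get subtly wrong (a single trailing comma is enough to make it invalid), so one option is to build the core-site classification programmatically and let a JSON library serialize it. The sketch below does that with the two properties from this thread. The function names and the idea of saving the output to a file are assumptions for illustration, and the settings themselves are still untested on EMR.

```python
import json

# The two properties suggested in this thread (not verified on EMR).
S3A_PROPERTIES = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
}


def build_emr_configurations(properties):
    """Wrap Hadoop properties in a single core-site classification."""
    return [{"Classification": "core-site", "Properties": dict(properties)}]


def render_configurations(properties):
    """Serialize the classification list to JSON text."""
    return json.dumps(build_emr_configurations(properties), indent=2)


if __name__ == "__main__":
    text = render_configurations(S3A_PROPERTIES)
    # Round-trip through the parser to confirm the text is valid JSON.
    assert json.loads(text) == build_emr_configurations(S3A_PROPERTIES)
    print(text)
```

If the output is saved to a file, it could be supplied at cluster creation time, for example through the `--configurations file://...` option of `aws emr create-cluster`.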
