`s3a` URLs don't work as in documentation #556

acruise · 2024-02-22T00:23:45Z

EDIT: this helped, the doc may need to be updated:

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
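For readers who land here first, a minimal sketch of the whole spark-shell session with the workaround applied. It only reuses calls that appear in the report below, and it assumes the same docker-aut image, where `sc` is provided by the shell. Setting `fs.defaultFS` is global to the session, so treat this as a stopgap rather than a fix.

```scala
// Run inside spark-shell from the docker-aut image; `sc` comes from the shell.
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Credentials, as in the documentation (values redacted).
sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")

// Workaround from the EDIT above: make the bucket the default filesystem,
// so code that asks Hadoop for the default FS gets s3a instead of file:///.
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")

// Same call as in the report below.
RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```

One side effect to keep in mind: with `fs.defaultFS` pointing at the bucket, paths without a scheme also resolve against `s3a://commoncrawl/`, so local files in the same session need an explicit `file://` prefix.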

Describe the bug
According to the docs, aut should be able to read data from s3a URLs, but every way I've tried it, I get the same result (wrong FS...)

This specific run is built from docker-aut @ b64c02a343ad02ac36e84a2393ed52d86f0fb4ee, but a standalone Sparkling build does the same thing. I would file the ticket against Sparkling, but your docs actually exist, and no good deed goes unpunished ;)

I've verified that the credentials I'm providing can read this file using aws s3 cp etc.

alex@alex-work-pc:~/dev/docker-aut$ docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d
24/02/22 00:10:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://231b303af8fa:4040
Spark context available as 'sc' (master = local[*], app id = local-1708560659354).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.16)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import io.archivesunleashed._, io.archivesunleashed.matchbox._
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")

scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")

scala> RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
  at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:156)
  at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:61)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:95)
  ... 51 elided

scala>
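The trace above reaches `RawLocalFileSystem` through `Globber` from `HdfsIO.files`, and the message says `expected: file:///`. That is consistent with the glob running on the session's default filesystem rather than on one derived from the path's scheme, though I have not confirmed this against the Sparkling source. The sketch below is a simplified stand-in for the scheme and authority comparison behind `Wrong FS`, written against `java.net.URI` only. `WrongFsModel`, `accepts`, and `checkPath` are hypothetical names for illustration, not Hadoop's implementation.

```scala
import java.net.URI

// Simplified model of the "Wrong FS" check; NOT Hadoop's actual code.
object WrongFsModel {
  // True when a filesystem rooted at `fsUri` would accept `path`.
  def accepts(fsUri: URI, path: URI): Boolean = {
    val scheme = path.getScheme
    if (scheme == null) true // unqualified path: resolved against this FS
    else if (!scheme.equalsIgnoreCase(fsUri.getScheme)) false
    else {
      val pathAuthority = path.getAuthority
      // A path without an authority (bucket/host) is treated as a match here.
      pathAuthority == null || pathAuthority.equalsIgnoreCase(fsUri.getAuthority)
    }
  }

  // Mirrors the shape of the exception message seen in the trace.
  def checkPath(fsUri: URI, path: URI): Unit =
    if (!accepts(fsUri, path))
      throw new IllegalArgumentException(s"Wrong FS: $path, expected: $fsUri")
}

val warc = new URI("s3a://commoncrawl/crawl-data/example.warc.gz") // placeholder key

// Default FS is local: the s3a path is rejected, as in the report.
println(WrongFsModel.accepts(new URI("file:///"), warc))

// Default FS set to the bucket (the workaround): the same path is accepted.
println(WrongFsModel.accepts(new URI("s3a://commoncrawl/"), warc))
```

In a live session, comparing `FileSystem.get(sc.hadoopConfiguration).getUri` with `new Path("s3a://...").getFileSystem(sc.hadoopConfiguration).getUri` (both from `org.apache.hadoop.fs`) should show which filesystem each lookup resolves to; the second needs the s3a connector on the classpath.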

To Reproduce
Steps to reproduce the behavior (e.g.):

1. git clone git@github.com:archivesunleashed/docker-aut.git && cd docker-aut
2. docker build .
3. docker run -it <hash of above>
4. import packages and set credentials as documented
5. eval RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)

Expected behavior
A DataFrame is returned by RecordLoader ;)

Screenshots
If applicable, add screenshots to help explain your problem.

Environment information

AUT version: HEAD of docker-aut, currently at b64c02a343ad02ac36e84a2393ed52d86f0fb4ee
OS: Ubuntu 23.10, but Docker
Java version: 11 (from Dockerfile)
Apache Spark version: 3.3.1 (from Dockerfile)
Apache Spark w/aut: (from Dockerfile)
Apache Spark command used to run AUT: docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d


ruebot · 2024-02-27T12:37:08Z

Thanks for finding this issue, and creating a great ticket! I appreciate it.

Looks like this was something I missed regression testing when I worked with @helgeho on pulling in Sparkling [here](c8fa256#diff-7e582908ce37e25cdae381cebc539965c62f5a241cf1ea38fcafe9683b6ce44cR96). It'll be a while before I can pivot any of my research time to this, since my funding on this project ended back in July 2023. But I'll definitely try to set aside some time in the future to get this working again.

Also, I see your example uses a WAT file from Common Crawl, and @helgeho also flagged the WAT file usage. aut is geared toward W/ARC files. So, if you want to do WAT-specific work, you might want to look more at Sparkling. iirc, it has a fair bit of support for working with them.

Documentation flagged for future work:

acruise mentioned this issue Feb 22, 2024

s3a URLs don't work in WarcLoader (Wrong FS: s3a://...) internetarchive/Sparkling#3

Open

ruebot added the bug label Feb 27, 2024

ruebot self-assigned this Feb 27, 2024
