This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

enwiki-latest-pages-articles-multistream.xml.bz2 not a valid bz2 #35

Open
Aditi138 opened this issue Jun 12, 2017 · 5 comments

Comments

@Aditi138

Hi,

I was running prepare.sh for en-US and it's throwing the following exception, which leaves the generated corpus empty. Can you please suggest an alternative solution?

Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

@jind11

jind11 commented Dec 26, 2017

I have the same issue. Have you solved it?

@dav009
Contributor

dav009 commented Dec 27, 2017

gonna check the format of the dump

@jind11

jind11 commented Dec 27, 2017

I found that the problem comes from the download step. The command curl -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2" only gives me a 186-byte file, which is wrong. I changed it to curl -L -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2", and now I can download the correct 14 GB file and the problem is resolved.
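For anyone who wants a guard against this failure, here is a minimal sketch of the download step with a sanity check. It assumes the tiny file was a redirect response rather than the dump itself, which is what curl's -L (follow redirects) flag would address. The function names is_bzip2 and download_dump are hypothetical and are not part of prepare.sh, and the -f flag (fail on HTTP errors) is an addition that does not appear in the original command.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of prepare.sh): succeeds only if the file
# starts with the bzip2 magic bytes "BZh".
is_bzip2() {
  [ "$(head -c 3 "$1" 2>/dev/null)" = "BZh" ]
}

# Hypothetical wrapper around the download step: follows redirects (-L),
# fails on HTTP errors (-f), and rejects anything that is not a bzip2 stream.
download_dump() {
  local language="$1"
  local file="${language}wiki-latest-pages-articles-multistream.xml.bz2"

  curl -L -f -O "http://dumps.wikimedia.org/${language}wiki/latest/${file}" || return 1

  if ! is_bzip2 "$file"; then
    echo "error: ${file} is not a bzip2 stream ($(wc -c < "$file") bytes)" >&2
    return 1
  fi
}
```

Calling download_dump en would fetch the English dump into the current directory and stop with an error message if the result is not bzip2 data, rather than letting the Java step fail later with "Stream is not in the BZip2 format".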

@tgalery
Contributor

tgalery commented Dec 27, 2017

@jind11 can you send a PR?

@jind11

jind11 commented Dec 27, 2017

@tgalery Sure. I also have several other bug fixes for the prepare.sh file; I will upload the PR in the next two days.
