This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

Issue creating corpus - Invalid byte 2 of 4-byte UTF-8 sequence #36

Open
ArthurCamara opened this issue Jul 17, 2017 · 5 comments

Comments

@ArthurCamara

I'm trying to manually create a corpus, using the following command:
java -Xmx10G -Xms10G -cp target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki working/enwiki-latest-pages-articles-multistream.xml.bz2 /mnt/hd0/Arthur/data/en-wiki-latest.lines

resulting in the following error:

[Fatal Error] :965698439:106: Invalid byte 2 of 4-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 965698439; columnNumber: 106; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    ... 5 more

I'm using the latest Wikipedia dump, and the sha1sum matches.

Any idea what could be causing this?
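
One way to narrow this down, offered only as a sketch and not tested against this dump: since the checksum matches, it may help to check whether the decompressed XML is itself strict UTF-8, independently of Xerces. The helper below is hypothetical (the `Utf8Check` class and `isStrictUtf8` method are not part of wiki2vec) and uses only the JDK's `CharsetDecoder` configured to report malformed bytes instead of replacing them. It assumes the caller supplies an already decompressed stream, because the JDK has no built-in bzip2 support.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of wiki2vec or Xerces.
class Utf8Check {

    /**
     * Returns true if every byte of the stream decodes as strict UTF-8,
     * false as soon as a malformed or truncated sequence is found.
     * The stream must already be decompressed.
     */
    static boolean isStrictUtf8(InputStream in) throws IOException {
        // REPORT makes the decoder throw on bad input rather than
        // silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Discard the decoded text; only decoding errors matter here.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If a check like this passes on the decompressed dump (for example by feeding it `System.in` from a `bzip2 -dc` pipe), the bytes are well formed and the problem more likely lies in how the parser reads them. If it fails, the data itself contains an invalid sequence.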

@jind11

jind11 commented Dec 27, 2017

I am having a similar problem.

@dav009
Contributor

dav009 commented Dec 28, 2017

@jind11 just curious, is this with the English Wikipedia?

@jind11

jind11 commented Dec 28, 2017

yes

@tgalery
Contributor

tgalery commented Dec 28, 2017

Haven't had much time to dig into this, but here are a couple of questions: do you get the same error with older English dumps, or with dumps of other languages?

@sunan93

sunan93 commented Jan 22, 2018

I am also having the same issue. If anyone has come across a fix, please share it here.
