
Chinese Wikipedia StackOverflowError #14

Open
nick-magnini opened this issue Jan 25, 2016 · 2 comments

@nick-magnini

Chinese Wikipedia pops this error when creating the word2vec corpus using the org.idio.wikipedia.word2vec.Word2VecCorpus class:

java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3705)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4160)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    .........
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4144)
    at java.util.regex.Pattern$Slice.match(Pattern.java:3882)
    at java.util.regex.Pattern$Start.match(Pattern.java:3420)
    at java.util.regex.Matcher.search(Matcher.java:1211)
    at java.util.regex.Matcher.find(Matcher.java:604)
    at java.util.regex.Matcher.replaceAll(Matcher.java:914)
    at scala.util.matching.Regex.replaceAllIn(Regex.scala:298)
    at org.idio.wikipedia.word2vec.ArticleCleaner$.cleanStyle(ArticleCleaner.scala:69)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:65)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:56)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
    at java.lang.Thread.run(Thread.java:809)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-01-25 16:25:08 WARN  TaskSetManager:71 - Lost task 57.0 in stage 0.0 (TID 57, localhost): TaskKilled (killed intentionally)
@dav009
Contributor

dav009 commented Jan 26, 2016

Uh, rather weird, but I have definitely not used this on any Asian language.

Probably related to this issue: http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings
I definitely want to replace the class I added to clean the Wikipedia boilerplate. I assume Chinese can have very long paragraphs with no spaces whatsoever?
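
If that is the cause, it should be reproducible outside Spark: java.util.regex matches some quantified constructs with one recursive call per repetition, so a long enough run with nothing for the pattern to break on overflows the thread stack, which would explain the wall of repeated Pattern$Curly.match0 frames above. A minimal sketch, assuming an illustrative pattern rather than the actual one in ArticleCleaner.cleanStyle:

    object RegexOverflowRepro {
      def main(args: Array[String]): Unit = {
        // One long CJK run with no whitespace, like a long Chinese paragraph.
        val text = "的" * 200000
        // A greedy quantifier over an alternation costs one recursive call
        // per repetition inside java.util.regex, so a long enough input
        // throws java.lang.StackOverflowError, as in the trace above.
        "(?:的|\\s)+".r.findFirstIn(text)
      }
    }

Until the cleaner is replaced, one stopgap is to run the regex-heavy call on a thread with a larger stack via the four-argument Thread constructor (cleanWithBigStack and the 32 MB figure are illustrative, and the stack-size argument is only a hint to the VM):

    object BigStackClean {
      // Runs `clean` on a dedicated thread whose requested stack is large
      // enough for deep regex recursion, then hands the result back.
      def cleanWithBigStack(text: String, clean: String => String): String = {
        var result = text
        val t = new Thread(null, new Runnable {
          def run(): Unit = { result = clean(text) }
        }, "regex-clean", 32L * 1024 * 1024)
        t.start()
        t.join()
        result
      }
    }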

I will try to address this and the other issues you have mentioned over the weekend.

@jiesutd

jiesutd commented Dec 22, 2017

Hi,
Is there any update on this problem? I am also facing a similar error when dealing with Chinese text:

2017-12-22 23:11:31 ERROR Executor:96 - Exception in task 122.0 in stage 0.0 (TID 122)
java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 125.0 in stage 0.0 (TID 125)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 126.0 in stage 0.0 (TID 126)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 123.0 in stage 0.0 (TID 123)
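
If this is the same regex recursion, a stopgap that may help (untested here; the 16m value is an assumption) is to give the driver and executor JVMs a larger thread stack, e.g. in spark-defaults.conf:

    spark.driver.extraJavaOptions    -Xss16m
    spark.executor.extraJavaOptions  -Xss16m

A bigger stack only raises the article length at which the overflow returns, though; the real fix is still replacing the regex-based cleaner as discussed above.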

@mal removed the fandango label on Jan 10, 2018