
Tagger escapes at Stream index: 34160469 #10

Open
petulla opened this issue Nov 3, 2019 · 7 comments
petulla commented Nov 3, 2019

I raised another issue related to this. I can't get past indexing at stream index 34160469. The

tapioca index-dump wiki_collection latest-all.json.bz2 --profile profiles/human_organization_place.json

step fails at this point every time. Any idea what might be happening? The previous steps completed successfully. Is there a pre-trained model I can use to stand in for any of the steps for testing?

Solr 8.2
Python 3.7.4
macOS Mojave

2019-11-03 16:23:49,575 opentapioca.taggerfactory INFO     Stream index: 34160469
2019-11-03 16:23:49,576 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
Traceback (most recent call last):
  File "/Users/username/.pyenv/versions/jupyter3/bin/tapioca", line 11, in <module>
    load_entry_point('opentapioca==0.1.0', 'console_scripts', 'tapioca')()
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/opentapioca-0.1.0-py3.7.egg/opentapioca/cli.py", line 118, in index_dump
    batch_size=2000, commit_time=10, delete_excluded=False, skip_docs=skip)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/opentapioca-0.1.0-py3.7.egg/opentapioca/taggerfactory.py", line 91, in index_stream
    self._push_documents(batch, collection_name, commit)
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/opentapioca-0.1.0-py3.7.egg/opentapioca/taggerfactory.py", line 121, in _push_documents
    r.raise_for_status()
  File "/Users/username/.pyenv/versions/jupyter3/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/wiki_collection/update?commit=false
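As an aside for anyone else who hits this: the traceback shows index_dump being called with skip_docs=skip, which suggests the CLI exposes an option to skip a number of documents when resuming. If so, something along these lines should restart indexing just past the failing batch (the flag name is a guess; check tapioca index-dump --help). Note that this skips the failing documents rather than fixing them:

tapioca index-dump wiki_collection latest-all.json.bz2 --profile profiles/human_organization_place.json --skip 34160469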
wetneb (Member) commented Nov 4, 2019

@petulla it might be worth checking the Solr logs for any errors there?

petulla (Author) commented Nov 4, 2019

This is the error. Any ideas? The same document id throws the error every time.

org.apache.solr.common.SolrException: Exception writing document id Q23672838 to the index; possible analysis error: input automaton is too large: 1001

Full readout:

org.apache.solr.common.SolrException: Exception writing document id Q23672838 to the index; possible analysis error: input automaton is too large: 1001
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:244)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:257)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:487)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:337)
	at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:337)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
	at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:231)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:475)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:75)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DocBasedVersionConstraintsProcessor.processAdd(DocBasedVersionConstraintsProcessor.java:396)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:507)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:156)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:121)
	at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
...
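A quick way to see what is special about this particular entity (one guess: an unusually large set of labels and aliases blowing up the tagger's analysis graph) is to pull it from the Wikidata API and count them:

curl 'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q23672838&props=labels|aliases&format=json'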

wetneb (Member) commented Nov 4, 2019

Hmm, it looks like we are running into a hard-coded bound on the size of the index here, not sure if we can do much about it! We probably need to report that upstream to Solr. I haven't got much time to investigate this right now though.

If you want a quick fix, try narrowing down the scope of the profile (by selecting smaller classes of Wikidata items to include), which should decrease the size of the index and hopefully avoid this bug. Sorry that I cannot give a more satisfactory fix!
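If the limit turns out to be the one in the analysis chain, there may be a schema knob to try: the "input automaton is too large" message is raised when graph expansion during analysis exceeds a cap, and Solr's ConcatenateGraphFilterFactory (used by tagger-style field types) accepts a maxGraphExpansions attribute. The field type below is only a sketch modeled on the stock Solr tagger example, not OpenTapioca's exact schema, and the value is an arbitrary "big enough" guess:

<!-- Sketch: stock tagger-style field type with a raised expansion cap.
     Adjust to match the collection's actual "tag" field type. -->
<fieldType name="tag" class="solr.TextField" postingsFormat="FST50"
           omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- raise the cap that trips "input automaton is too large" -->
    <filter class="solr.ConcatenateGraphFilterFactory"
            preservePositionIncrements="false"
            maxGraphExpansions="100000"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

After a schema change like this, the collection would need to be reloaded and the failing batch re-indexed.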

petulla (Author) commented Nov 4, 2019

Hm. So editing a profile file like human_organization_location.json would restrict the index?

I'm confused because I assumed you had run this on the full Wikipedia dataset.

@wetneb
Copy link
Member

wetneb commented Nov 4, 2019

I have indeed run this on the full Wikidata dump, but that was a while ago now and Wikidata grows all the time, so it is totally possible that this error appeared in the meantime.

Yes, I would change human_organization_location.json to restrict to whatever suits you best, depending on your use case.

petulla (Author) commented Nov 4, 2019

Can you try just running a recent dump and seeing if it works? I'm trying Facebook's recent NEL codebase now, but I may need to return to this, and I'm concerned a fix may take several hours at minimum.

wetneb (Member) commented Nov 4, 2019

I do intend to re-run this myself on a recent dump in the coming months; I will report back here once this is done.
