"requests.exceptions.HTTPError: 404 Client Error" while trying tapioca train-classifier #11

Open
heathersherry opened this issue Nov 19, 2019 · 12 comments

Comments

@heathersherry
Contributor

heathersherry commented Nov 19, 2019

Dear authors,

Thanks for sharing the great project.

I tried to follow the documentation of this project to run it. Everything went smoothly until I tried to train a classifier on the dataset.
I created a Solr collection named collection_5 and ran:
bunzip2 < latest-all.json.bz2 | tapioca index-dump collection_5 - --profile profiles/human_organization_place.json
Everything worked well, and the Wikidata dump was indexed into the Solr collection successfully.

Then I tried this command to train the classifier:
tapioca train-classifier -c collection_5 -b data/wd_2019-02-24.bow.pkl -p data/wd_2019-02-24.pgrank.npy -d data/merged_RSS-500_and_istex_train.ttl -o data/rss_istex_classifier.pkl
It fails with the following error:

Traceback (most recent call last):

  File "/usr/local/bin/tapioca", line 11, in <module>
    load_entry_point('opentapioca==0.1.0', 'console_scripts', 'tapioca')()
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/data2/xxx/related_work/opentapioca/opentapioca/cli.py", line 184, in train_classifier
    best_params = clf.crossfit_model(d, parameter_grid, max_iter=max_iter)
  File "/data2/xxx/related_work/opentapioca/opentapioca/classifier.py", line 113, in crossfit_model
    docid_to_mentions[str(context.uri)] = self.create_mentions(context.mention)
  File "/data2/xxx/related_work/opentapioca/opentapioca/classifier.py", line 78, in create_mentions
    mentions = self.tagger.tag_and_rank(phrase)
  File "/data2/xxx/related_work/opentapioca/opentapioca/tagger.py", line 52, in tag_and_rank
    r.raise_for_status()
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:8983/solr/collection_5/tag?overlaps=NO_SUB&tagsLimit=500&fl=id%2Clabel%2Caliases%2Cextra_aliases%2Cdesc%2Cnb_statements%2Cnb_sitelinks%2Cedges%2Ctypes&wt=json&indent=off

(I put the opentapioca project in the folder /data2/xxx/related_work)

Could you please give me some hints for solving this problem? Is the problem caused by Solr? I have checked the status of Solr, and everything seems to be working well:

Found 1 Solr nodes: 

Solr process 8173 running on port 8983
{
  "solr_home":"/data2/sherry/related_work/solr-8.2.0/server/solr",
  "version":"8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:11:04",
  "startTime":"2019-10-30T06:39:53.937Z",
  "uptime":"20 days, 2 hours, 43 minutes, 18 seconds",
  "memory":"3.3 GB (%83.2) of 4 GB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"6"}}

Thanks a lot!

@wetneb
Member

wetneb commented Nov 19, 2019

If you created the Solr collection yourself, then it probably lacks the /tag endpoint that is required by opentapioca.

You should run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json for a collection that does not exist yet: tapioca will create the collection by itself, with the appropriate /tag endpoint.

There might be a way to add the endpoint after the fact, having already ingested the dump in a collection - but I am not sure how!

I will make it clearer in the docs that you should not create the Solr collection yourself.
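One way to tell whether a collection has the tagger handler before training is to send a small request to its /tag URL and look at the status code. The sketch below uses only the standard library. The names `build_tag_url` and `has_tag_endpoint` are hypothetical and not part of opentapioca, and the query parameters are copied from the URL in the traceback above.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_tag_url(solr_base, collection, fields=("id", "label")):
    """Build the tagger URL for a collection, mirroring the URL in the traceback."""
    query = urllib.parse.urlencode({
        "overlaps": "NO_SUB",
        "tagsLimit": 500,
        "fl": ",".join(fields),
        "wt": "json",
    })
    return "{}/{}/tag?{}".format(solr_base.rstrip("/"), collection, query)


def has_tag_endpoint(url, opener=urllib.request.urlopen):
    """Return False when Solr answers 404 for the /tag URL, True on success.

    Other HTTP errors propagate, since they point to a different problem.
    `opener` is injectable so the check can be exercised without a running Solr.
    """
    request = urllib.request.Request(
        url,
        data="test".encode("utf-8"),  # the tagger reads the text from the POST body
        headers={"Content-Type": "text/plain"},
    )
    try:
        opener(request)
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise
    return True
```

Calling `has_tag_endpoint(build_tag_url("http://localhost:8983/solr", "collection_5"))` against a collection that was created by hand would be expected to return False, matching the 404 reported above.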

@heathersherry
Contributor Author

heathersherry commented Nov 24, 2019

Dear authors,

Thanks for the quick explanation!
I have tried several times to run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json without first creating the Solr collection myself. However, I get another error message:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/original/update?commit=false

Actually, if I create the Solr collection first and then run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json, there is no problem. However, running tapioca train-classifier afterwards produces the error from my previous post.

Moreover, I also tried bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json. That also fails, with the following error:

Traceback (most recent call last):
  File "/home/sherry/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

I have also checked the status of Solr, and it is the same as in my previous post, so I am not sure why there is a "Connection aborted" error.

Could you please give some hints? Thanks a lot!

@wetneb
Member

wetneb commented Nov 24, 2019

For the HTTP 400 error you get, there should be some logs available in the Solr web interface. Can you check there and report what exactly causes this Bad Request error?

@heathersherry
Contributor Author

heathersherry commented Nov 26, 2019

Hi,

Here are the logs of the program (I run Solr and OpenTapioca from a Linux terminal):

2019-11-25 15:04:38,220 opentapioca.taggerfactory INFO     Stream index: 10674820
2019-11-25 15:04:38,221 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-25 15:04:41,453 opentapioca.taggerfactory INFO     Stream index: 10676994
2019-11-25 15:04:41,453 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
Traceback (most recent call last):
  File "/usr/local/bin/tapioca", line 11, in <module>
    load_entry_point('opentapioca==0.1.0', 'console_scripts', 'tapioca')()
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/data2/xxx/related_work/opentapioca/opentapioca/cli.py", line 118, in index_dump
    batch_size=2000, commit_time=10, delete_excluded=False, skip_docs=skip)
  File "/data2/xxx/related_work/opentapioca/opentapioca/taggerfactory.py", line 91, in index_stream
    self._push_documents(batch, collection_name, commit)
  File "/data2/xxx/related_work/opentapioca/opentapioca/taggerfactory.py", line 121, in _push_documents
    r.raise_for_status()
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/cony_collection_5/update?commit=false

(I named the collection cony_collection_5.)
Meanwhile, I checked the status of Solr, and it seems to be working well:

$ /home/xxx/solr-8.3.0/bin/solr status

Found 1 Solr nodes:

Solr process 933 running on port 8983
{ 
  "solr_home":"/home/xxx/solr-8.3.0/server/solr",
  "version":"8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22",
  "startTime":"2019-11-25T12:26:18.094Z",
  "uptime":"0 days, 14 hours, 59 minutes, 5 seconds",
  "memory":"5.1 GB (%16) of 32 GB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"7"}}

@wetneb
Member

wetneb commented Nov 26, 2019

Thanks! The Solr logs themselves should be accessible on the Solr web interface. By default it runs at http://hostname:8983/solr/.

@heathersherry
Contributor Author

heathersherry commented Nov 26, 2019

Thanks a lot for the reply!
Actually, I have run bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json five times, starting Solr with a different amount of memory each time (4G, 8G, 16G, 32G and 64G). However, the error occurs every time, and the run always terminates at the same point, when the stream index reaches 10676994.

2019-11-26 09:02:02,305 opentapioca.taggerfactory INFO     Stream index: 10670820
2019-11-26 09:02:02,306 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:04,307 opentapioca.taggerfactory INFO     Stream index: 10672820
2019-11-26 09:02:04,309 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:06,467 opentapioca.taggerfactory INFO     Stream index: 10674820
2019-11-26 09:02:06,468 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:12,948 opentapioca.taggerfactory INFO     Stream index: 10676994
2019-11-26 09:02:12,949 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
Traceback (most recent call last): ...

Therefore, I guess that the error is caused by the data. To skip the offending data, would it be fine to wrap line 121, r.raise_for_status(), in opentapioca/opentapioca/taggerfactory.py in a try/except block, so that this step is skipped when there is an error?

P.S. I am running the experiments on a Linux server and cannot reach the web interface from it. I will try that method later if the above solution does not help.
Thanks a lot for your help. :)
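Without access to the web interface, the reason for a 400 is often still available in the body of the HTTP response: Solr normally replies with a JSON document whose `error.msg` field describes the problem. A minimal sketch of extracting it follows. `solr_error_message` is a hypothetical helper, and the sample body is made up for illustration rather than taken from this issue.

```python
import json


def solr_error_message(body_text):
    """Extract the human-readable error from a Solr JSON error response.

    Falls back to the raw text when the body is not the expected JSON.
    """
    try:
        payload = json.loads(body_text)
    except ValueError:
        return body_text.strip()
    if not isinstance(payload, dict):
        return body_text.strip()
    error = payload.get("error")
    if isinstance(error, dict) and "msg" in error:
        return str(error["msg"])
    return body_text.strip()


# Illustrative body only; the real text would come from the response
# object (r.text) just before r.raise_for_status() is called.
sample = '{"responseHeader": {"status": 400}, "error": {"msg": "example message", "code": 400}}'
print(solr_error_message(sample))
```

Logging this message next to the failing batch would probably show which document or field Solr rejected, without needing the web interface.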

@heathersherry
Contributor Author

Adding exception handling around line 121, r.raise_for_status(), in opentapioca/opentapioca/taggerfactory.py solves this problem. Now I can successfully run the application. Thanks a lot!
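The thread does not show the exact change, so the following is only a sketch of the general idea: catch the error raised for a batch, record it, and carry on with the next batch. `PushError` stands in for `requests.exceptions.HTTPError`, and `push_all` and `push` are hypothetical names, not opentapioca functions.

```python
import logging

logger = logging.getLogger("indexing_sketch")


class PushError(Exception):
    """Stand-in for requests.exceptions.HTTPError in this sketch."""


def push_all(batches, push):
    """Push every batch, skipping the ones that fail.

    Returns the indices of the skipped batches so they can be inspected later.
    """
    skipped = []
    for index, batch in enumerate(batches):
        try:
            push(batch)
        except PushError as error:
            # Log instead of aborting, so one bad batch does not stop the run.
            logger.warning("Skipping batch %d (%d docs): %s", index, len(batch), error)
            skipped.append(index)
    return skipped
```

Note that skipping a whole batch (2000 documents, going by the batch_size shown in the traceback) drops every document in it, not only the one Solr rejected, so keeping the list of skipped batches is worthwhile.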

@wetneb
Member

wetneb commented Nov 28, 2019

@heathersherry wonderful! Do you think you could create a pull request for that change? I think it would make a lot of sense!

@wetneb wetneb reopened this Nov 28, 2019
@heathersherry
Contributor Author

heathersherry commented Dec 2, 2019

> @heathersherry wonderful! Do you think you could create a pull request for that change? I think it would make a lot of sense!

Sure! Thanks again for creating such a great project. :)
Shall I create the pull request against the default branch? Currently it seems that permission is denied.

@wetneb
Member

wetneb commented Dec 2, 2019

Yes, you should be able to create a pull request by first creating a fork of this repository in your own account, pushing your change there and then creating the pull request. Alternatively, if you only want to propose a change to a single file (as is the case here), you should be able to view that file on GitHub and use the edit link there.
If neither of these works for you, I have invited you as a collaborator to this project, which should make things easier.

heathersherry added a commit to heathersherry/opentapioca that referenced this issue Dec 3, 2019
ziodave pushed a commit to ziodave/opentapioca that referenced this issue Feb 24, 2021
@dinani65

dinani65 commented Mar 29, 2021

I also get the same error when I want to create a collection.
Command:
tapioca index-dump col2 latest-all.json.bz2 --profile profiles/human_organization_location.json
Error:
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/admin/collections?action=CREATE&name=col2&collection.configName=tapioca&numShards=1
Log:
org.apache.solr.common.SolrException: Solr instance is not running in SolrCloud mode.
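The log message indicates that the collection creation request reached a Solr that was started in standalone mode. Cloud mode is enabled by starting Solr with `bin/solr start -c`. As a rough check from Python, the system info endpoint (`/solr/admin/info/system?wt=json`) reports a `mode` field. The sketch below assumes that field reads `solrcloud` in cloud mode, and `is_solrcloud` is a hypothetical helper, not part of opentapioca.

```python
import json


def is_solrcloud(system_info):
    """Return True when a parsed /admin/info/system response reports cloud mode."""
    return system_info.get("mode") == "solrcloud"


# Trimmed, illustrative responses; a real one has many more fields.
standalone = json.loads('{"mode": "std"}')
cloud = json.loads('{"mode": "solrcloud"}')
print(is_solrcloud(standalone), is_solrcloud(cloud))
```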

@lucyhorowitz

lucyhorowitz commented Dec 16, 2022

I am getting the same error as just above, but in the log there are a few NoSuchFileExceptions about solr-9.0.0/lib and /dist and then org.apache.solr.common.SolrException: Error CREATEing SolrCore 'collection5_shard1_replica_n1': Unable to create core [collection5_shard1_replica_n1] Caused by: solr.XSLTResponseWriter.
