Docker & OpenTapioca #31

kerphi · 2021-01-19T19:55:35Z

Would it be possible to have a Docker image to help with testing and deploying OpenTapioca?
It could be a great feature to help newcomers get started with OpenTapioca.

wetneb · 2021-01-19T20:35:25Z

It would be great to have that indeed! I am unlikely to find the time to work on this soon but would be very much in favour of including that in the repository.

eracle · 2022-08-04T15:47:34Z

@wetneb Hey, could you specify which version of ZooKeeper you are using, and what your local config is? Maybe it would be cool to have a series of steps for your specific ZooKeeper install procedure.

wetneb · 2022-08-04T16:10:45Z

I use Solr 7.7.3 and the Zookeeper that is bundled in it. I do not install Zookeeper itself, I just download Solr and that comes with Zookeeper in it.

eracle · 2022-08-05T14:09:14Z

Solr versions before 8.11.1 are affected by the Log4Shell vulnerability (CVE-2021-44228), which is a security problem.
Do you think your project would also work with Solr 8 and up?
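
For illustration, the version check implied above can be sketched in a few lines of Python. This is only a sketch: the 8.11.1 threshold is the one quoted in this comment, the function names are hypothetical, and it assumes plain dotted numeric version strings.

```python
# Threshold taken from the comment above; treat it as an assumption,
# not as a substitute for the official Solr security advisories.
FIRST_PATCHED = (8, 11, 1)


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '7.7.3' into (7, 7, 3)."""
    return tuple(int(part) for part in version.split("."))


def predates_log4shell_fix(version: str) -> bool:
    """Return True if the given Solr version is older than 8.11.1."""
    return parse_version(version) < FIRST_PATCHED
```

With this sketch, predates_log4shell_fix("7.7.3") is true, which is why moving past the Solr 7.7.3 setup mentioned earlier matters here.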

wetneb · 2022-08-05T14:14:45Z

I have not checked. I am not actively maintaining this project as you can see. But I will always be happy to merge PRs.

eracle · 2022-08-05T16:04:49Z

Ok I kinda solved the previous problem. I will have a PR ready soon. One question, should I update the settings_template.py file:

# The name of the Solr collection where Wikidata is indexed
SOLR_COLLECTION = 'wd_2019-02-24'

# The path to the language model, trained with "tapioca train-bow"
LANGUAGE_MODEL_PATH='data/wd_2019-02-24.bow.pkl'
# The path to the pagerank Numpy vector, computed with "tapioca compute-pagerank"
PAGERANK_PATH='data/wd_2019-02-24.pgrank.npy'
# The path to the trained classifier, obtained from "tapioca train-classifier"
CLASSIFIER_PATH='data/rss_istex_classifier.pkl'

wetneb · 2022-08-05T16:11:56Z

I am not sure what you want to change in the settings_template.py, but I assume that if you want to change things there, you probably have a good reason to :)

eracle · 2022-08-06T23:07:00Z

The CLI was asking me something about the settings.py file that probably is not included in the docs.
Should I copy the settings_template.py file and rename it as settings.py?

Another question:
the following command:

tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

What's my_collection_name? Could you provide some examples of its value?

wetneb · 2022-08-07T07:08:13Z

> The CLI was asking me something about the settings.py file that probably is not included in the docs.
> Should I copy the settings_template.py file and rename it as settings.py?

Indeed! And feel free to have a look at its contents and check if there is anything there that you want to change for your own purposes.
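
A minimal sketch of that copy step in Python, assuming both files live in the same directory (the function name ensure_settings and its parameters are hypothetical, not part of OpenTapioca). It never overwrites an existing settings.py:

```python
import shutil
from pathlib import Path


def ensure_settings(project_dir=".", template="settings_template.py",
                    target="settings.py"):
    """Copy the template to settings.py unless settings.py already exists.

    Returns True if a copy was made, False if the target was already there.
    """
    project = Path(project_dir)
    target_path = project / target
    if target_path.exists():
        # Keep any local edits the user has already made.
        return False
    shutil.copyfile(project / template, target_path)
    return True
```

After the copy, the values in settings.py (collection name, model paths) can be edited for the local setup.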

> Another question

The docs say:

> Pick a Solr collection name (without creating the collection in advance) and run:
> tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

So the intention behind this sentence is to say that:

you can come up with a collection name of your own and it can be arbitrary. For instance, bubble_tea_is_overrated could be a nice collection name, just like a_little_waltz_in_the_park would be a nice one too.
Once you have made up your mind about your collection name, you can insert it in the command mentioned in the docs. For instance: tapioca index-dump a_little_waltz_in_the_park latest-all.json.bz2 --profile profiles/human_organization_place.json
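
The same substitution can be sketched in Python, which may help when scripting the indexing step. The helper name index_dump_command is hypothetical; the argument order simply mirrors the command quoted from the docs above.

```python
import shlex


def index_dump_command(collection,
                       dump="latest-all.json.bz2",
                       profile="profiles/human_organization_place.json"):
    """Build the argument list for 'tapioca index-dump' with a chosen collection name."""
    return ["tapioca", "index-dump", collection, dump, "--profile", profile]


cmd = index_dump_command("a_little_waltz_in_the_park")
# Render the list as a single shell-safe string for display or logging.
print(shlex.join(cmd))
# To actually run it, something like subprocess.run(cmd, check=True) would do.
```

Whatever name is chosen here presumably needs to match SOLR_COLLECTION in settings.py, since that setting is described as the collection where Wikidata is indexed.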

If you can think of ways to make the docs more understandable for you in both locations, do not hesitate to open a PR with the phrasing you would have preferred there, I am sure it is going to be much better.

eracle · 2022-08-18T20:57:13Z

@wetneb Hi Antonin,
I am testing the branch on my personal server and at the moment I am running the indexing.
Unfortunately, Solr gets killed by the operating system because it uses too much memory. It looks like a memory leak or something similar happens in SolrCloud during indexing.
How much memory did you have on your server?

I also noticed there is a skip_docs parameter.
Did you manually restart the indexing process by passing the last failing skip_docs number?

wetneb · 2022-08-19T08:12:54Z

Hi @eracle,

On my previous server I had 20+GB RAM. Now much less, so I can no longer update the index.

Yes I suspect skip_docs can be used to resume the indexing from an offset, but I do not remember exactly.
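
To illustrate what resuming from an offset would mean, here is a minimal Python sketch. It is not OpenTapioca's actual implementation: the function iter_entities is hypothetical, and it assumes the usual Wikidata JSON dump layout, where the file is one JSON array with one entity per line.

```python
import bz2
import json


def iter_entities(dump_path, skip_docs=0):
    """Yield entities from a bz2-compressed dump, skipping the first skip_docs."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        seen = 0
        for line in dump:
            # Each entity line ends with a comma; the array brackets sit on their own lines.
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            seen += 1
            if seen <= skip_docs:
                # Assumed to be indexed already by an earlier run; not parsed again.
                continue
            yield json.loads(line)
```

Under these assumptions, a run that died after N documents could be restarted with skip_docs=N. The dump still has to be decompressed from the start, so skipping saves parsing and indexing work but not read time.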

ziodave mentioned this issue on Mar 1, 2021: Contributions #32 (Open)

eracle mentioned this issue on Aug 24, 2022: Dockerizing opentapioca #45 (Open)
