Docker & OpenTapioca #31

Open
kerphi opened this issue Jan 19, 2021 · 11 comments

Comments

@kerphi

kerphi commented Jan 19, 2021

Would it be possible to have a Docker image to help testing/deploying OpenTapioca?
It could be a great feature to help newcomers get started with OpenTapioca.

@wetneb
Member

wetneb commented Jan 19, 2021

It would be great to have that indeed! I am unlikely to find the time to work on this soon but would be very much in favour of including that in the repository.

@ziodave mentioned this issue Mar 1, 2021

@eracle
Collaborator

eracle commented Aug 4, 2022

@wetneb Hey, could you specify which version of zookeeper you are using, and what is your local config? Maybe it would be cool to have a series of steps for your specific zookeeper install procedure.

@wetneb
Member

wetneb commented Aug 4, 2022

I use Solr 7.7.3 and the Zookeeper that is bundled in it. I do not install Zookeeper itself, I just download Solr and that comes with Zookeeper in it.

@eracle
Collaborator

eracle commented Aug 5, 2022

Solr versions before 8.11.1 are affected by the Log4Shell CVE, which is a security problem.
Do you think your project would also work with Solr 8 and up?
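
To make the version threshold above concrete, here is a minimal sketch that checks whether a Solr version string falls below 8.11.1, the cutoff mentioned in this comment. The function names are hypothetical, the sketch assumes plain dotted numeric version strings, and it takes the cutoff at face value rather than consulting a vulnerability database.

```python
# Cutoff taken from the comment above (assumption: first release considered safe).
FIXED_IN = (8, 11, 1)


def parse_version(version):
    """Turn a dotted version string such as '7.7.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def predates_log4shell_fix(version):
    """Return True if the given Solr version is older than the cutoff."""
    return parse_version(version) < FIXED_IN


if __name__ == "__main__":
    for candidate in ("7.7.3", "8.11.1"):
        print(candidate, predates_log4shell_fix(candidate))
```

With this check, the Solr 7.7.3 release mentioned earlier in the thread falls on the affected side of the cutoff.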

@wetneb
Member

wetneb commented Aug 5, 2022

I have not checked. I am not actively maintaining this project as you can see. But I will always be happy to merge PRs.

@eracle
Collaborator

eracle commented Aug 5, 2022

Ok I kinda solved the previous problem. I will have a PR ready soon. One question, should I update the settings_template.py file:

# The name of the Solr collection where Wikidata is indexed
SOLR_COLLECTION = 'wd_2019-02-24'

# The path to the language model, trained with "tapioca train-bow"
LANGUAGE_MODEL_PATH='data/wd_2019-02-24.bow.pkl'
# The path to the pagerank Numpy vector, computed with "tapioca compute-pagerank"
PAGERANK_PATH='data/wd_2019-02-24.pgrank.npy'
# The path to the trained classifier, obtained from "tapioca train-classifier"
CLASSIFIER_PATH='data/rss_istex_classifier.pkl'

@wetneb
Member

wetneb commented Aug 5, 2022

I am not sure what you want to change in the settings_template.py, but I assume that if you want to change things there, you probably have a good reason to :)

@eracle
Collaborator

eracle commented Aug 6, 2022

The CLI was asking me something about the settings.py file that probably is not included in the docs.
Should I copy the settings_template.py file and rename it as settings.py?

Another question:
the following command:

tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

What's my_collection_name? Could you provide some examples of its value?

@wetneb
Member

wetneb commented Aug 7, 2022

> The CLI was asking me something about the settings.py file that probably is not included in the docs.
> Should I copy the settings_template.py file and rename it as settings.py?

Indeed! And feel free to have a look at its contents and check if there is anything there that you want to change for your own purposes.
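
For readers who want to script this step, a minimal sketch follows. It assumes settings_template.py sits in the project directory and that settings.py belongs next to it; the helper name `ensure_settings` is hypothetical and not part of OpenTapioca.

```python
import shutil
from pathlib import Path


def ensure_settings(project_dir="."):
    """Copy settings_template.py to settings.py unless settings.py already exists.

    Returns True if a copy was made, False if settings.py was already there.
    """
    project = Path(project_dir)
    template = project / "settings_template.py"
    settings = project / "settings.py"
    if settings.exists():
        # Never overwrite local changes.
        return False
    shutil.copyfile(template, settings)
    return True
```

After copying, settings.py can be edited by hand, for instance to point SOLR_COLLECTION at the collection name you chose.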

> Another question

The docs say:

Pick a Solr collection name (without creating the collection in advance) and run:
tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

So the intention behind this sentence is to say that:

  • you can come up with a collection name of your own and it can be arbitrary. For instance, bubble_tea_is_overrated could be a nice collection name, just like a_little_waltz_in_the_park would be a nice one too.
  • Once you have made up your mind about your collection name, you can insert it in the command mentioned in the docs. For instance: tapioca index-dump a_little_waltz_in_the_park latest-all.json.bz2 --profile profiles/human_organization_place.json
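
The two points above can be sketched in Python as a small helper that checks a made-up collection name and builds the command from the docs. The helper name `index_dump_command` is hypothetical, and the character rule (letters, digits, underscores, periods and hyphens, not starting with a hyphen) is an assumption about what Solr accepts rather than something stated in this thread.

```python
import re
import shlex

# Assumed naming rule; check the Solr documentation for the authoritative one.
VALID_NAME = re.compile(r"[A-Za-z0-9_.][A-Za-z0-9_.\-]*")


def index_dump_command(collection_name,
                       dump_path="latest-all.json.bz2",
                       profile="profiles/human_organization_place.json"):
    """Build the argument list for the index-dump command quoted in the docs."""
    if not VALID_NAME.fullmatch(collection_name):
        raise ValueError(f"unsuitable collection name: {collection_name!r}")
    return ["tapioca", "index-dump", collection_name, dump_path,
            "--profile", profile]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it.
    print(shlex.join(index_dump_command("a_little_waltz_in_the_park")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues if the dump path contains spaces.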

If you can think of ways to make the docs more understandable for you in both locations, do not hesitate to open a PR with the phrasing you would have preferred there, I am sure it is going to be much better.

@eracle
Collaborator

eracle commented Aug 18, 2022

@wetneb Hi Antonin,
I am testing the branch on my personal server and at the moment I am running the indexing.
Unfortunately, Solr stops because the operating system kills it for using too much memory. It looks like a memory leak or something similar happens in SolrCloud during indexing.
How much memory did you have on your server?

I also noticed there is a skip_docs parameter.
Do you usually restart the indexing process manually by passing the last failing skip_docs number?

@wetneb
Member

wetneb commented Aug 19, 2022

Hi @eracle,

On my previous server I had 20+GB RAM. Now much less, so I can no longer update the index.

Yes I suspect skip_docs can be used to resume the indexing from an offset, but I do not remember exactly.
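
The general idea of resuming from an offset can be sketched as follows. This is a generic illustration, not OpenTapioca's code: `index_from_offset`, `IndexingInterrupted` and the `index_one` callback are hypothetical names, and the real skip_docs option may count documents differently (for example in batches).

```python
from itertools import islice


class IndexingInterrupted(Exception):
    """Raised when indexing fails; carries the offset to resume from."""

    def __init__(self, next_offset, cause):
        super().__init__(
            f"indexing stopped; resume with skip_docs={next_offset}")
        self.next_offset = next_offset
        self.cause = cause


def index_from_offset(documents, index_one, skip_docs=0):
    """Index documents one by one, skipping the first skip_docs of them.

    Returns the total number of documents consumed, including skipped ones.
    """
    position = skip_docs
    for doc in islice(documents, skip_docs, None):
        try:
            index_one(doc)
        except Exception as exc:
            # position is the index of the document that failed.
            raise IndexingInterrupted(position, exc) from exc
        position += 1
    return position
```

A caller would catch `IndexingInterrupted`, restart the backend, and call the function again with `skip_docs` set to the reported offset, so that already indexed documents are not sent twice.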
