
Error and termination when hitting an unavailable URL #9

Open
kloppjp opened this issue Apr 4, 2015 · 4 comments

@kloppjp

kloppjp commented Apr 4, 2015

Background: For Simplified Chinese, there is no "bq" combination, so the downloader quits with an error message when iterating through the data.

Suggestion: Wouldn't it be nicer to wrap the data retrieval part in a try/except block, or to replace the assert with an if statement that prints an error message and then moves on to the next file?
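
To make the suggestion concrete, here is a minimal sketch of the "report and move on" behaviour. It is not the downloader's actual code: `MissingFile`, `iter_available`, and `fake_fetch` are hypothetical names, and the download itself is replaced by a stand-in function. Only the "file does not exist" case is skipped; any other error still propagates.

```python
from typing import Callable, Iterable, Iterator, Tuple


class MissingFile(Exception):
    """Hypothetical error: the server says the file does not exist (e.g. HTTP 404)."""


def iter_available(
    indices: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Iterator[Tuple[str, bytes]]:
    """Yield (index, payload) pairs, skipping indices whose file is missing.

    Only MissingFile is swallowed. Timeouts, broken connections and other
    exceptions are not caught here, so they still stop the run.
    """
    for index in indices:
        try:
            payload = fetch(index)
        except MissingFile:
            print(f"skipping {index!r}: no such file")
            continue
        yield index, payload


def fake_fetch(index: str) -> bytes:
    # Stand-in for a real download; 'bq' plays the role of the missing file.
    if index == "bq":
        raise MissingFile(index)
    return index.encode()


result = list(iter_available(["bp", "bq", "br"], fake_fetch))
```

With the stand-in above, `result` holds the entries for `bp` and `br`, and `bq` is reported and skipped.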

@dimazest
Owner

dimazest commented Apr 5, 2015

Hi,

Thanks for the bug report. The issue is not trivial to fix, because different languages are missing different indices. I would avoid a try ... except block because it might hide real issues, for example when a file that should be retrieved is not retrieved due to a poor connection.

For the time being, you can pass indices to readline_google_store:

>>> from google_ngram_downloader import readline_google_store

>>> fname, url, records = next(readline_google_store(ngram_len=5, indices=['cd', 'ed'], lang='chi-sim'))
>>> fname
'googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> next(records)
Record(ngram='CDP _NOUN_ _NOUN_ _NOUN_ _NOUN_', year=1983, match_count=1, volume_count=1)
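
One way to build such an `indices` list without typing it by hand is sketched below. The assumptions are loud ones: the two-letter naming scheme is inferred only from the 'cd', 'ed' and 'bq' examples in this thread, `KNOWN_MISSING` and `two_letter_indices` are hypothetical names, and 'bq' for 'chi-sim' is the only missing index actually reported here.

```python
from itertools import product
from string import ascii_lowercase

# Assumption: indices known to be absent per language. Only 'bq' for
# 'chi-sim' comes from this thread; the mapping itself is hypothetical.
KNOWN_MISSING = {"chi-sim": {"bq"}}


def two_letter_indices(lang):
    """Return all two-letter indices except those listed as missing for lang."""
    missing = KNOWN_MISSING.get(lang, set())
    return [
        a + b
        for a, b in product(ascii_lowercase, repeat=2)
        if a + b not in missing
    ]


indices = two_letter_indices("chi-sim")
# The list could then be passed on, e.g.
# readline_google_store(ngram_len=5, indices=indices, lang='chi-sim')
```

Any other index that turns out to be missing for a language would have to be added to the mapping by hand, which is the drawback of this approach.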

@kloppjp
Author

kloppjp commented Apr 6, 2015

It works this way. However, I have to make sure that I know all the indices, so in the long term it would be handier if the script could check that itself (e.g. download the Google Ngram page and check whether it contains the links corresponding to the indices). That sounds a bit like overkill, though...
Anyway, thanks for the quick reply, very much appreciated! :)
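
The "check the page for links" idea could look roughly like the sketch below. It assumes the page is plain HTML with one `<a href>` per file and that file names follow the pattern of the example above (`googlebooks-chi-sim-all-5gram-20120701-cd.gz`). `LinkCollector` and `available_indices` are hypothetical names, and the sample HTML is hand-written for illustration rather than taken from the real page.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def available_indices(html, lang, ngram_len):
    """Return the set of indices that have a link for this language and n-gram length."""
    # Assumed file name pattern: googlebooks-<lang>-all-<n>gram-<date>-<index>.gz
    pattern = re.compile(
        r"googlebooks-%s-all-%dgram-\d{8}-(?P<index>[^/]+)\.gz$"
        % (re.escape(lang), ngram_len)
    )
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        match = pattern.search(href)
        if match:
            found.add(match.group("index"))
    return found


# Hand-written sample standing in for the downloaded page.
sample = """
<a href="http://storage.googleapis.com/books/ngrams/books/googlebooks-chi-sim-all-5gram-20120701-cd.gz">cd</a>
<a href="http://storage.googleapis.com/books/ngrams/books/googlebooks-chi-sim-all-5gram-20120701-ed.gz">ed</a>
<a href="http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-bq.gz">bq</a>
"""

found = available_indices(sample, "chi-sim", 5)
```

Here `found` contains only the 'chi-sim' 5-gram indices from the sample; the link for the other language is ignored.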

@dimazest
Owner

dimazest commented Apr 8, 2015

I'm very busy right now, but once I get time, I'll just copy the indices from the page.

@tianhuil

tianhuil commented Sep 9, 2019

Hi, there is a pending PR from my fork that solves this. In the meantime, you can pip install the fork:

> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master
