Google published new ngrams, 20200217 #22

Open
lahosken opened this issue Jun 5, 2021 · 2 comments
lahosken commented Jun 5, 2021

The new datasets are listed at https://storage.googleapis.com/books/ngrams/books/datasetsv3.html . As an example URL, one file of ngrams is at http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz


7shoe commented Jun 7, 2021

Indeed. I tried to debug this, and the necessary code changes seem to be restricted to util.py.
However, new problems arise. I'll use the German v3 2-grams (version 20200217) as a reference.
The challenges are:

  1. The code currently has no variable for picking the version.
  2. The URLs from which the data is downloaded follow a different naming scheme; see your ...00016-of-00024.gz URL above.
  3. The line structure of the Google n-gram v3 files seems to have changed compared to v2.

Changing version = '20120701' in def iter_google_store(...) in util.py to another version causes a new bug: the file template no longer matches. For v3 it should be
FILE_TEMPLATE_GER_NEW = '{ngram_len}-{index}-of-{full_number}.gz'
instead of
FILE_TEMPLATE = 'googlebooks-{lang}-all-{ngram_len}gram-{version}-{index}.gz'
I also commented out assert len(data) == 4 in def readline_google_store.
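
To make the new naming scheme concrete, here is a minimal sketch of how a v3 file URL could be assembled. The names BASE_URL, V3_FILE_TEMPLATE and v3_file_url are hypothetical and not part of util.py; the URL layout is inferred only from the single example URL in the first comment, so other languages or versions may differ.

```python
# Hypothetical helper, not part of the existing util.py.
# The layout is inferred from the one example URL in this thread.
BASE_URL = 'http://storage.googleapis.com/books/ngrams/books'
V3_FILE_TEMPLATE = '{ngram_len}-{index:05d}-of-{total:05d}.gz'


def v3_file_url(lang, ngram_len, index, total, version='20200217'):
    """Build the URL of one v3 file from its index and the total file count."""
    file_name = V3_FILE_TEMPLATE.format(
        ngram_len=ngram_len, index=index, total=total,
    )
    return '{}/{}/{}/{}'.format(BASE_URL, version, lang, file_name)


# Reproduces the example URL quoted in the first comment.
print(v3_file_url('eng', 1, 16, 24))
```

Passing the index and total as integers and zero-padding them in the template avoids having to keep strings such as '00024' in sync by hand.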

In def iter_google_store(...) we need to get full_number right to build the proper file URLs. This number depends on the language lang (German in my case) and on ngram_len; for German v3 I used:

#version = '20120701'
version = '20200217' # New: v3
session = requests.Session()

# Case-by-case lookup of the total number of files for each ngram_len.
# All branches must be one if/elif chain; otherwise ngram_len == 1
# falls through to the final else and gets overwritten.
if version == '20200217' and ngram_len == 1:
    full_number = '00008'
elif version == '20200217' and ngram_len == 2:
    full_number = '00181'
elif version == '20200217' and ngram_len == 3:
    full_number = '01369'
elif version == '20200217' and ngram_len == 4:
    full_number = '01003'
elif version == '20200217' and ngram_len == 5:
    full_number = '02262'
else:
    # unknown version/ngram_len combination
    full_number = 0
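
The same lookup could also be written as a table, which makes it harder to break the if/elif chain by accident. This is only a sketch: V3_GER_FILE_COUNTS and full_number_for are hypothetical names, and the counts are copied from the snippet above without independent verification.

```python
# Hypothetical lookup table; the values are copied from this comment
# (German, version 20200217) and have not been verified independently.
V3_GER_FILE_COUNTS = {
    1: '00008',
    2: '00181',
    3: '01369',
    4: '01003',
    5: '02262',
}


def full_number_for(version, ngram_len):
    """Return the zero-padded total file count for German v3 n-grams."""
    if version != '20200217':
        raise ValueError('no file counts known for version %r' % version)
    try:
        return V3_GER_FILE_COUNTS[ngram_len]
    except KeyError:
        raise ValueError('no file count known for ngram_len %r' % ngram_len)
```

Raising an error for unknown combinations, instead of falling back to 0, means a wrong version or n-gram length fails before any malformed URL is requested.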

Printing a line of the old version (20120701) yields
0 0005_NUM 1901 1 1
which has 4 fields (n-gram, year, count, publications), as asserted in the code and mentioned in the documentation.
The 1st line of the new version has 29 entries, though. It took me some time to figure out that these are all year,count,publications triplets, e.g. '1929,1,1', '1930,5,3', etc.
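
As an illustration of that line structure, the following sketch splits one v3 line into the n-gram and its triplets. It assumes the fields are tab-separated, as in the older files; parse_v3_line is a hypothetical name, and the sample line is made up from the values quoted in this comment rather than copied from a real file.

```python
def parse_v3_line(line):
    """Split a v3 line into (ngram, [(year, count, publications), ...]).

    Assumes tab-separated fields where every field after the n-gram
    is a comma-separated "year,count,publications" triplet.
    """
    fields = line.rstrip('\n').split('\t')
    ngram = fields[0]
    triplets = [
        tuple(int(value) for value in field.split(','))
        for field in fields[1:]
    ]
    return ngram, triplets


# Made-up sample line built from the values mentioned above.
sample = '0 0005_NUM\t1929,1,1\t1930,5,3\n'
ngram, triplets = parse_v3_line(sample)
print(ngram, triplets)
```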

I summed the counts/publications up across years and used the first year of appearance as the year, i.e.

ngram = data[0]
if version == '20200217' and lang == 'ger':
    # v3: every field after the n-gram is a "year,count,publications" triplet
    triplets = [tuple(map(int, field.split(','))) for field in data[1:]]
    min_year = min(year for year, _, _ in triplets)   # first year of appearance
    count = sum(count for _, count, _ in triplets)    # total count over all years
    pubs = sum(pubs for _, _, pubs in triplets)       # total publications over all years
    other = [min_year, count, pubs]
else:
    # older version (v2/v1)
    assert len(data) == 4
    other = map(int, data[1:5])

yield Record(ngram, *other)
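
Summing across years discards the per-year information that v2 records carried. An alternative, sketched below, is to yield one record per triplet so that downstream code still sees one (n-gram, year) pair per record. The Record defined here is a local stand-in whose field names are an assumption, and iter_v3_records is a hypothetical name.

```python
from collections import namedtuple

# Local stand-in for the library's Record; the field names are an assumption.
Record = namedtuple('Record', 'ngram year match_count volume_count')


def iter_v3_records(data):
    """Yield one Record per year from an already split v3 line.

    `data` is the list of fields: the n-gram followed by
    "year,count,publications" triplets.
    """
    ngram = data[0]
    for field in data[1:]:
        year, match_count, volume_count = map(int, field.split(','))
        yield Record(ngram, year, match_count, volume_count)


for record in iter_v3_records(['0 0005_NUM', '1929,1,1', '1930,5,3']):
    print(record)
```

Whether per-year records or the aggregated form is preferable depends on how the rest of the library consumes Record, which this thread does not settle.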

However, I have only tested this with the German v3 n-grams (i.e. version = 20200217).

Owner

dimazest commented Jun 8, 2021

Thanks for the analysis. I'll have a look at what v3 has to offer.

Pull requests are welcome.
