feature request: live-caching system #146

Open
abubelinha opened this issue May 1, 2024 · 1 comment
Labels
enhancement idea to support API features currently not supported by pygbif

Comments

abubelinha commented May 1, 2024

The current caching system only writes to disk when the script finishes.
That could be improved (perhaps as a user option) for some use cases under unreliable network connections:

I came to pygbif today because requests was failing due to network issues on GBIF's side. I eventually solved that problem, but in the meantime I decided to give pygbif a try instead of using my own hand-made API requests. BTW, I am not a requests expert, as you can see in that issue.

I was assuming these two things:

  • My naive requests loop has no caching system: if it fails (or if I stop it), when I re-run it, it has to redownload everything.
  • I expected the pygbif caching system to speed up that situation a lot. So if the script gets killed by an error (or by the user) and I re-run it, all cached requests should go fast (as described in Caching API requests #52) and the loop should quickly reach the point where it was killed in the previous run.

Now I am surprised because that is not happening.

I made a very simple test with this example file (510 names):
https://www.gbif.org/tools/speciesMatching/advancedExample.csv
I used the tqdm progress bar package combined with a for loop; it reports total time and iterations per second (which here means names matched per second).
Of course, the speeds below may vary a lot depending on network issues (today is a bad day, as per gbif-api#128).
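For reference, the per-name rate can also be measured with the standard library alone. This is just a sketch of what tqdm reports, and the workload below is a placeholder, not a real API call:

```python
import time

def rate(fn, items):
    """Return iterations per second for fn applied to each item,
    roughly the figure tqdm prints next to the progress bar."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed

# placeholder workload standing in for a name-matching request
ips = rate(lambda name: name.lower(), ["Abies alba"] * 1000)
```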

  • The first pygbif matching run (2.96 it./s) is much slower than requests (12 to 15 it./s).
    That's not just because of writing things to cache:
    even without enabling caching(True), it is slow (3.5 it./s).
    Of course, if the loop finishes (script not killed), a second pygbif run is very fast, since there is nothing to download (~180 it./s).
  • The disappointing thing is when network errors occur: if the script does not finish, the cache is never written to disk.
    So if I rerun the script, it goes at exactly the same slow speed, about 4 times slower than using the requests package.

Maybe there is already an option (that I don't know about) to tell pygbif to "write cache right now"?

If not, it might be worth modifying the pygbif cache system so it optionally writes to disk after each API call.
That would probably make it a bit slower, but no work would be lost if the script is killed for some reason before it ends normally.
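To illustrate the idea, here is a minimal write-through cache sketch (my own toy illustration, not pygbif's actual cache code): every set() persists the cache to disk immediately, so an interrupted run keeps all completed requests.

```python
import json
import os
import tempfile

class LiveCache:
    """Toy cache that flushes to disk after every write."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        # write-through: persist now instead of at script exit
        with open(self.path, "w") as f:
            json.dump(self.data, f)

path = os.path.join(tempfile.mkdtemp(), "gbif_cache.json")
LiveCache(path).set("Abies alba", {"matchType": "EXACT"})
# a fresh instance sees the entry even if the first "run" never exited cleanly
reloaded = LiveCache(path).get("Abies alba")
```

The obvious trade-off is one disk write per API call; flushing every N calls would be a middle ground between "write at exit" and "write always".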

The other thing to check is why pygbif is 4 times slower than requests, but that would be a different issue.
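One thing that might be worth checking for that speed gap (an assumption on my side, not a diagnosed cause): whether each call opens a fresh HTTPS connection. With plain requests, reusing a single Session keeps connections pooled between calls:

```python
import requests

# Reusing one Session keeps TCP/TLS connections alive across calls;
# opening a fresh connection per request can easily cost several times
# the latency of the request itself.
session = requests.Session()
session.headers.update({"Accept": "application/json"})
# each session.get(...) inside the loop would now reuse the pooled connection
```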

This is basically what I do in my loops:

```python
for name in ['name1', 'name2', ...]:
    result = pygbif.species.name_backbone(name)  # returns a dictionary
```

```python
for name in ['name1', 'name2', ...]:
    r = requests.get("https://api.gbif.org/v1/species/match?name={}".format(name))
    # then I parse the JSON response to get a dictionary
```
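A side note on the requests loop above: species names usually contain spaces, so it is safer to let the query string be URL-encoded than to format it by hand (stdlib sketch):

```python
from urllib.parse import urlencode

base = "https://api.gbif.org/v1/species/match"
# urlencode escapes the space in the name (Abies+alba),
# which "name={}".format(name) does not do
url = base + "?" + urlencode({"name": "Abies alba"})
```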

Has anyone else noticed this difference in their speeds?

@CecSve CecSve added the enhancement idea to support API features currently not supported by pygbif label May 23, 2024
CecSve commented May 23, 2024

Maybe related to #98. I have not investigated it yet, but as that issue mentions, it would require a bigger update to the codebase. @jhnwllr, will you take a look and see if it is related?
