feature request: live-caching system #146

Open
abubelinha opened this issue May 1, 2024 · 1 comment
Labels
enhancement idea to support API features currently not supported by pygbif

Comments

abubelinha commented May 1, 2024

The current caching system only writes to disk when the script finishes.
That could be improved (perhaps as a user option) for some use cases under unreliable network connections:

I came to pygbif today because requests was failing due to network issues on GBIF's side. I eventually solved that problem, but in the meantime I decided to give pygbif a try instead of using my own hand-made API requests. BTW, I am not a requests expert, as you can see in that issue.

I was assuming these two things:

  • My naive requests loop has no caching system: if it fails (or if I stop it), when I re-run it, it has to redownload everything.
  • I expected the pygbif caching system to speed up that situation a lot. So if the script gets killed by an error (or by the user) and I re-run it, all cached requests should go fast (as described in Caching API requests #52) and the loop should quickly reach the point where it was killed in the previous run.

Now I am surprised because that is not happening.

I made a very simple test with this example file (510 names):
https://www.gbif.org/tools/speciesMatching/advancedExample.csv
I used the tqdm progress bar package combined with a for loop; it reports total time and iterations per second (which here means names matched per second).
Of course, the speeds below may vary a lot depending on network issues (today is a bad day, as per gbif-api#128).
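For reference, the per-name rate can also be measured with the standard library alone. This is just a sketch of what tqdm reports, and the workload below is a placeholder, not a real API call:

```python
import time

def rate(fn, items):
    """Return iterations per second for fn applied to each item,
    roughly the figure tqdm prints next to the progress bar."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed

# placeholder workload standing in for a name-matching request
ips = rate(lambda name: name.lower(), ["Abies alba"] * 1000)
```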

  • The first pygbif matching run (2.96 it./s) is much slower than requests (12 to 15 it./s).
    That's not just because of writing things to cache:
    even without enabling caching(True), it is slow (3.5 it./s).
    Of course, if the loop finishes (script not killed), a second pygbif run is very fast, since there is nothing to download (~180 it./s).
  • The disappointing thing is when network errors occur: if the script does not finish, the cache is never written to disk.
    So if I rerun the script, it goes at exactly the same slow speed, about 4 times slower than using the requests package.

Maybe there is already an option (that I don't know about) to tell pygbif to "write cache right now"?

If not, it might be worth modifying the pygbif cache system so it optionally writes to disk after each API call.
That would probably make it a bit slower, but no work would be lost if the script is killed for some reason before it ends normally.
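To illustrate the idea, here is a minimal write-through cache sketch (my own toy illustration, not pygbif's actual cache code): every set() persists the cache to disk immediately, so an interrupted run keeps all completed requests.

```python
import json
import os
import tempfile

class LiveCache:
    """Toy cache that flushes to disk after every write."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        # write-through: persist now instead of at script exit
        with open(self.path, "w") as f:
            json.dump(self.data, f)

path = os.path.join(tempfile.mkdtemp(), "gbif_cache.json")
LiveCache(path).set("Abies alba", {"matchType": "EXACT"})
# a fresh instance sees the entry even if the first "run" never exited cleanly
reloaded = LiveCache(path).get("Abies alba")
```

The obvious trade-off is one disk write per API call; flushing every N calls would be a middle ground between "write at exit" and "write always".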

The other thing to check is why pygbif is 4 times slower than requests, but that would be a different issue.
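One thing that might be worth checking for that speed gap (an assumption on my side, not a diagnosed cause): whether each call opens a fresh HTTPS connection. With plain requests, reusing a single Session keeps connections pooled between calls:

```python
import requests

# Reusing one Session keeps TCP/TLS connections alive across calls;
# opening a fresh connection per request can easily cost several times
# the latency of the request itself.
session = requests.Session()
session.headers.update({"Accept": "application/json"})
# each session.get(...) inside the loop would now reuse the pooled connection
```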

This is basically what I do in my loops:

```python
for name in ['name1', 'name2', ...]:
    result = pygbif.species.name_backbone(name)  # returns a dictionary
```

```python
for name in ['name1', 'name2', ...]:
    r = requests.get("https://api.gbif.org/v1/species/match?name={}".format(name))
    # then I parse the JSON response to get a dictionary
```
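A side note on the requests loop above: species names usually contain spaces, so it is safer to let the query string be URL-encoded than to format it by hand (stdlib sketch):

```python
from urllib.parse import urlencode

base = "https://api.gbif.org/v1/species/match"
# urlencode escapes the space in the name (Abies+alba),
# which "name={}".format(name) does not do
url = base + "?" + urlencode({"name": "Abies alba"})
```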

Has anyone else noticed this difference in their speeds?

@CecSve CecSve added the enhancement idea to support API features currently not supported by pygbif label May 23, 2024
CecSve commented May 23, 2024

Maybe related to #98. I have not investigated it yet, but as that issue mentions, it would require a bigger update to the codebase. @jhnwllr, will you take a look and see if it is related?
