
There is a limit of 1000 results per search. #824

Closed

BBI-YggyKing opened this issue Jun 21, 2018 · 9 comments
@BBI-YggyKing
Contributor

The GitHub API limits searches to 1000 results. This limit affects searches performed via PyGitHub, such as GitHub.search_issues.

There appears to be no indication that a search has hit this limit - no exception or error is raised, as far as I can tell. Perhaps an exception should be raised when this happens (if it can be detected).

It is possible to work around this limit by issuing multiple search queries, but such queries must be tailored to the particular goal of the search - for example, iterating over search_issues by progressive date ranges - and I cannot think of a way to generalise this.

Any thoughts on how to address this? Is there a general solution?

Note that this issue has nothing to do with rate limiting or pagination of results.
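One possible detection mechanism: PyGithub's search results are PaginatedList objects whose totalCount attribute reflects the API's total_count field, which can exceed 1000 even though only the first 1000 results are retrievable. A minimal sketch of a detection helper (the helper name is my own, not part of PyGithub):

```python
GITHUB_SEARCH_CAP = 1000  # GitHub Search returns at most 1000 results per query

def hit_search_cap(total_count):
    """True when more results matched than a single search can return."""
    return total_count > GITHUB_SEARCH_CAP

# Usage (assuming an authenticated Github instance `gh`):
#   results = gh.search_issues("type:pr is:closed repo:owner/name")
#   if hit_search_cap(results.totalCount):
#       ...narrow the query, e.g. by date range
```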

@BBI-YggyKing
Contributor Author

BBI-YggyKing commented Jun 21, 2018

Here is a workaround demonstrating how to retrieve all pull requests in a range of dates, even if there are more than 1000 results:

EDIT: I will rewrite this as a method that yields, rather than a class - it will be simpler.

class PullRequestQuery:
    """Iterate over all closed pull requests in a date range, restarting
    the search as needed to step past GitHub's 1000-result cap."""

    def __init__(self, git, repo, since, until):
        self.git = git
        self.repo = repo
        self.until = until
        self.issues = self.__query(since, until)

    def __iter__(self):
        skip = False
        while True:
            results = False
            for issue in self.issues:
                if not skip:
                    results = True
                    yield issue.as_pull_request()
                skip = False

            # If there are no more results, stop iterating.
            if not results:
                break

            # Start a new query picking up where we left off. The previous
            # issue will be the first one returned, so skip it.
            self.issues = self.__query(issue.closed_at.strftime('%Y-%m-%dT%H:%M:%SZ'), self.until)
            skip = True

    def __query(self, since, until):
        querystring = 'type:pr is:closed repo:%s/%s closed:"%s..%s"' % (self.repo.organization.login, self.repo.name, since, until)
        return self.git.search_issues(query=querystring, sort="updated", order="asc")

With this class, you can now do this sort of thing:

from github import Github

git = Github(user, passwd)
org = git.get_organization(orgname)
repo = org.get_repo(reponame)
for pull in PullRequestQuery(git, repo, "2017-01-01", "2017-12-31"):
    print("%s: %s" % (pull.number, pull.title))
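The generator rewrite mentioned in the EDIT above might look like the sketch below. This is not PyGithub API: `search_issues` is injected as a callable with the shape of `Github.search_issues`, and duplicates are filtered by issue number instead of skipping the first result.

```python
def closed_pull_requests(search_issues, repo_full_name, since, until):
    """Yield every closed PR issue in [since, until], restarting the search
    past GitHub's 1000-result cap by advancing the closed-date lower bound.

    `search_issues` is any callable with the shape of Github.search_issues.
    """
    lower = since
    seen = set()           # issue numbers already yielded, to de-duplicate
    while True:
        query = 'type:pr is:closed repo:%s closed:"%s..%s"' % (
            repo_full_name, lower, until)
        progressed = False
        for issue in search_issues(query, sort="updated", order="asc"):
            if issue.number in seen:
                continue
            seen.add(issue.number)
            progressed = True
            lower = issue.closed_at.strftime('%Y-%m-%dT%H:%M:%SZ')
            yield issue    # call issue.as_pull_request() for the PR object
        if not progressed:
            break          # the last window produced nothing new: done
```

Against the real API this would be driven as `for issue in closed_pull_requests(git.search_issues, "org/repo", "2017-01-01", "2017-12-31")`.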

@mfonville
Contributor

Reading the GitHub API docs about search, I also notice that incomplete_results is missing from the search-results processing in PyGithub. Including that value might already help with detecting whether search results are (in)complete.
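A raw search response carries both total_count and incomplete_results, so a small classifier over the parsed JSON could distinguish the two failure modes. A sketch - the function name and return labels are illustrative, not from any library:

```python
GITHUB_SEARCH_CAP = 1000  # GitHub Search returns at most 1000 results per query

def classify_search_response(response):
    """Classify a parsed GitHub search response dict.

    'timed_out' - the API set incomplete_results (the query hit its time limit)
    'capped'    - more matches exist than the 1000 retrievable results
    'complete'  - every match can be paged through
    """
    if response.get("incomplete_results"):
        return "timed_out"
    if response.get("total_count", 0) > GITHUB_SEARCH_CAP:
        return "capped"
    return "complete"
```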

@BBI-YggyKing
Contributor Author

Now that I have PyGithub forked and running locally from source (I'm looking at #606) perhaps I can investigate this further.

@stale

stale bot commented Sep 30, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 30, 2018
@stale stale bot closed this as completed Oct 7, 2018
@djwgit

djwgit commented Jan 29, 2020

Could this issue be re-opened? A general solution for other kinds of searches would also be useful.

@Piuli

Piuli commented Jun 4, 2021

Does anybody have a solution? This is blocking us from exploring the marketplace.

@jtsai-quid

Got the same problem in 1.55 😩

@Piuli

Piuli commented Jun 15, 2021

You can also retrieve over 1,000 results with a method similar to what BBI-YggyKing described, but without using the API. However, it may not return all of the results.

https://stackoverflow.com/questions/67844111/how-can-i-scrape-more-than-1-000-results-from-github-marketplace/67991835#67991835

@oscarpobletes

Check out https://github.com/oscarpobletes/GitHubMines !

This is an extraction tool that lets you perform searches on GitHub and bypass some of the limits imposed by the GitHub GraphQL API.
