GitHub API code search, missing items in JSON

I've been trying to build a tool that needs to fetch the URLs of all files in a GitHub code search result. For example, when you go here and search for uber.com api_key, you'll see that there are 381 code results, and I want to get the URLs of all 381 files.

To do that, I learned how to use the GitHub API v3 and wrote the following function:

import requests
from time import sleep, time


def fetchItems(search, GITHUB_API):
    """Collect the html_url of every file returned by a GitHub code search."""
    items = set()
    pageNumber = 1

    url = "https://api.github.com/search/code"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {GITHUB_API}",
    }

    while True:

        sleep(3)  # trying to avoid rate limit, not successful though :(

        params = {
            "q": search,
            "per_page": 30,  # default value, it can be increased to 100
            "page": pageNumber,
        }

        # note: verify=False disables TLS certificate verification
        r = requests.get(url=url, headers=headers, params=params, verify=False)

        if r.status_code == 403:  # rate limit exceeded: sleep until it resets, then retry the same page
            epochReset = int(r.headers["X-Ratelimit-Reset"])
            epochNow = time()

            if epochNow < epochReset:
                sleep((epochReset - epochNow) + 1)

            sleep(1)
            continue

        response = r.json()
        pageItems = response.get("items", [])

        if not pageItems:  # empty page, or an error body without "items": stop paging
            break

        for file in pageItems:
            items.add(file["html_url"])

        pageNumber += 1

    return items

The per_page parameter sets the number of items returned in each page, and page is the page number :). By incrementing page in every request, you should be able to get all items, as far as I understand.
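As a rough sketch of the paging arithmetic described above: the helper names below (pages_needed, check_completeness) are hypothetical and not part of any GitHub client. They rely on the total_count and incomplete_results fields that the search response carries alongside items, and on the documented cap of 1,000 results per search query. The total reported by the API is not guaranteed to match the number shown on the website, so treat the comparison as a hint only.

```python
import math

# The REST search endpoints serve at most 1,000 results per query.
MAX_SEARCH_RESULTS = 1000


def pages_needed(total_count, per_page=30):
    """Number of requests needed to page through total_count results."""
    reachable = min(total_count, MAX_SEARCH_RESULTS)
    return math.ceil(reachable / per_page)


def check_completeness(response, collected):
    """Compare what one decoded /search/code page reports with what was collected.

    response  -- decoded JSON of a single search response (a dict)
    collected -- set of URLs gathered so far
    Returns (number of results still unaccounted for, incomplete_results flag).
    """
    total = response.get("total_count", 0)
    incomplete = response.get("incomplete_results", False)
    return max(total - len(collected), 0), incomplete
```

For 381 results, pages_needed(381) gives 13 requests at the default page size and pages_needed(381, 100) gives 4. Calling check_completeness on the last decoded page shows whether the API itself reports more results than were stored, or flags the result set as incomplete.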

However, when I opened my database and checked the items that had been written, I saw only 377 files, so 4 of the files are missing:

here.

I checked the database writer function and I'm sure there is nothing wrong with it. Does the GitHub API omit items from the JSON, or am I doing something wrong?

There is 1 best solution below

Robert J. Walker

One possibility is that the index that powers the GitHub /search/code endpoint only indexes files less than 384 KB. If any of the files found in your search on the site are too big, they won't show up in your response.
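One way to test this hypothesis is to take the files that appear on the website but not in the API result and check their sizes against the limit. The sketch below is only illustrative: explain_missing and its inputs are hypothetical, the sizes would have to be gathered separately (for example from the size field of the repository contents API), and reading "384 KB" as 384 * 1024 bytes is an assumption.

```python
# Assumption: "384 KB" is interpreted as 384 * 1024 bytes.
SEARCH_INDEX_LIMIT_BYTES = 384 * 1024


def too_large_for_code_search(size_bytes):
    """True if a file of this size would fall outside the code search index."""
    return size_bytes >= SEARCH_INDEX_LIMIT_BYTES


def explain_missing(web_files, api_urls):
    """Classify files seen on the website but absent from the API result.

    web_files -- dict mapping html_url -> file size in bytes (collected by hand)
    api_urls  -- set of html_url values returned by the API
    """
    report = {}
    for url, size in web_files.items():
        if url in api_urls:
            continue  # the API returned this file, nothing to explain
        if too_large_for_code_search(size):
            report[url] = "too large"
        else:
            report[url] = "unexplained"
    return report
```

Any file reported as "unexplained" is under the size limit, so something else (such as the rate limit handling or the paging loop) would have to account for it.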