fix issue with incorrect number of total pages if one of the seeds is a redirect

Following the changes in webrecorder/browsertrix-crawler#475 and webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, the number of extra seeds needs to be subtracted from the seen-list size to get the actual page count.
ikreymer committed Apr 4, 2024
1 parent 83c9203 commit 19d47b1
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions backend/btrixcloud/operator/crawls.py
@@ -1178,6 +1178,11 @@ async def get_redis_crawl_stats(
pages_done = await redis.llen(f"{crawl_id}:d")

pages_found = await redis.scard(f"{crawl_id}:s")
# account for extra seeds and subtract from seen list
extra_seeds = await redis.llen(f"{crawl_id}:extraSeeds")
if extra_seeds:
    pages_found -= extra_seeds

sizes = await redis.hgetall(f"{crawl_id}:size")
archive_size = sum(int(x) for x in sizes.values())
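
The adjustment in this hunk can be tried in isolation with a minimal sketch like the one below. It assumes only what the diff shows: the seen entries live in a set at `{crawl_id}:s` and the extra seeds in a list at `{crawl_id}:extraSeeds`. `FakeRedis`, `count_pages_found`, `demo`, and the sample URLs are hypothetical names and data made up for illustration; they are not part of btrixcloud or of any Redis client library, and the real contents of the `extraSeeds` entries are not shown in this commit.

```python
import asyncio


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only the two calls used here."""

    def __init__(self, sets=None, lists=None):
        self.sets = sets or {}
        self.lists = lists or {}

    async def scard(self, key):
        # Number of members in a set; 0 if the key is missing.
        return len(self.sets.get(key, set()))

    async def llen(self, key):
        # Number of items in a list; 0 if the key is missing.
        return len(self.lists.get(key, []))


async def count_pages_found(redis, crawl_id):
    # Size of the seen set, which also contains any redirected seeds.
    pages_found = await redis.scard(f"{crawl_id}:s")
    # Subtract the extra seeds so they are not counted as separate pages.
    extra_seeds = await redis.llen(f"{crawl_id}:extraSeeds")
    return pages_found - extra_seeds


def demo():
    # Three seen entries, one of which exists only because a seed redirected.
    redis = FakeRedis(
        sets={
            "crawl1:s": {
                "http://example.com/",
                "https://example.com/",
                "https://example.com/about",
            }
        },
        # Placeholder entry: the real entry format is an assumption.
        lists={"crawl1:extraSeeds": ["placeholder-extra-seed"]},
    )
    return asyncio.run(count_pages_found(redis, "crawl1"))
```

With three seen entries and one extra seed, `demo()` returns 2. Subtracting unconditionally gives the same result as the `if extra_seeds:` guard in the diff, since a missing or empty list has length 0.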
