PageIterator skipping a page when browsing `list_objects_v2` with Delimiter #3119

dboyadzhiev · 2024-02-14T12:48:55Z

Describe the bug

Pagination of S3 list_objects_v2 skips a page when using CommonPrefixes (i.e. Delimiter) together with StartingToken.

Use case:
Our API provides a list of S3 "folders" and supports pagination. It is a wrapper over our internal S3 bucket and forwards the information. The first response of the API returns a list of common prefixes and the next token provided by the PageIterator. The second request uses this token to continue the listing.

Expected Behavior

Using the paginator.paginate() method with the Delimiter parameter and not setting StartingToken should return all pages starting from the first one and its next token.
Using it again but this time with a given StartingToken (the first page next token) should return all pages starting from the second one and its next token.
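For comparison, the expected page-by-page contract can be sketched against the low-level `list_objects_v2` client call, which accepts `ContinuationToken` and `MaxKeys` directly. This is only an illustration: the helper name `list_prefix_page` and its parameters are hypothetical, and the client is passed in so the snippet does not depend on a particular session setup.

```python
def list_prefix_page(client, bucket, prefix="", token=None, page_size=5):
    """Fetch one page of common prefixes plus the token for the next page.

    Hypothetical helper: `client` is expected to behave like
    boto3.client("s3"), i.e. expose list_objects_v2(**kwargs).
    """
    kwargs = {
        "Bucket": bucket,
        "Prefix": prefix,
        "Delimiter": "/",
        "MaxKeys": page_size,
    }
    # ContinuationToken has to be left out entirely on the first request;
    # passing None would not be accepted as a valid string parameter.
    if token is not None:
        kwargs["ContinuationToken"] = token

    response = client.list_objects_v2(**kwargs)

    # CommonPrefixes is absent from the response when nothing was rolled up.
    prefixes = [p["Prefix"] for p in response.get("CommonPrefixes", [])]

    # NextContinuationToken is only present while the listing is truncated.
    return prefixes, response.get("NextContinuationToken")
```

Calling `list_prefix_page(s3_client, BUCKET_NAME)` and then calling it again with `token=` set to the returned token should yield page 1 and page 2 in order, which is the behavior expected from the paginator with StartingToken.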

Current Behavior

When paginator.paginate() is called with StartingToken, the second page comes back with an empty CommonPrefixes list, while the third page has a valid CommonPrefixes list. The prefixes that belong to the second page are effectively skipped.

Reproduction Steps

You need a bucket with date partitions and files in them.

S3://by_bucket/2023-01-01/file1.json
S3://by_bucket/2023-01-01/file2.json
S3://by_bucket/2023-01-02/file1.json
S3://by_bucket/2023-01-02/file2.json
...
S3://by_bucket/2023-12-01/file1.json
S3://by_bucket/2023-12-01/file2.json

import boto3

BUCKET_NAME = ""
PREFIX = ""
token = None

s3_client = boto3.client("s3")

def request_page(token):
    paginator = s3_client.get_paginator('list_objects_v2')
    return paginator.paginate(
        Bucket=BUCKET_NAME,
        Delimiter='/',
        Prefix=PREFIX,
        PaginationConfig={'PageSize': 5, 'StartingToken': token}
    )

# Simulate multiple requests to an API
steps = 0

# First request 
# print page 1 prefixes
# keep the token for page 2
print("Request 1")
for page in request_page(token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

# Second request
# print page 2 prefixes
# keep the token for page 3
print("Request 2")
for page in request_page(next_token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

Output:

> Request 1
> S3://by_bucket/2023-01-01
> S3://by_bucket/2023-01-02
> S3://by_bucket/2023-01-03
> S3://by_bucket/2023-01-04
> S3://by_bucket/2023-01-05
> done in step: 1
>
> Request 2
> []
> S3://by_bucket/2023-01-11
> S3://by_bucket/2023-01-12
> S3://by_bucket/2023-01-13
> S3://by_bucket/2023-01-14
> S3://by_bucket/2023-01-15
> done in step: 3

Possible Solution

No response

Additional Information/Context

I traced the issue down to PageIterator.__iter__() (.venv/lib/python3.11/site-packages/botocore/paginate.py):

            if first_request:
                # The first request is handled differently.  We could
                # possibly have a resume/starting token that tells us where
                # to index into the retrieved page.
                if self._starting_token is not None:
                    starting_truncation = self._handle_first_request(
                        parsed, primary_result_key, starting_truncation
                    )
                first_request = False
                self._record_non_aggregate_key_values(parsed)

The primary_result_key is initialized a few lines before that as self.result_keys[0], and result_keys come from the paginator definition in venv/lib/python3.11/site-packages/botocore/data/s3/2006-03-01/paginators-1.json:

"ListObjectsV2": {
      "more_results": "IsTruncated",
      "limit_key": "MaxKeys",
      "output_token": "NextContinuationToken",
      "input_token": "ContinuationToken",
      "result_key": [
        "Contents",
        "CommonPrefixes"
      ]
    },

so the primary result key is Contents, which is missing from the parsed S3 response body (with the Delimiter set, the keys in this bucket are rolled up into CommonPrefixes).
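To make the suspected mechanism concrete, here is a toy model of a resume step that treats only the first result key as primary. It is not botocore's code: the function name `resume_first_page` is hypothetical, and the assumption that non-primary result keys are reset to an empty value on a resumed first request is inferred from the `[]` seen in the output above.

```python
def resume_first_page(parsed, result_keys, truncate_amount=0):
    """Toy model (assumption, not botocore's implementation) of resuming
    pagination when only result_keys[0] is treated as the primary key."""
    primary, *secondary = result_keys
    page = dict(parsed)

    # The primary key is sliced from the resume offset, if it is a list.
    data = page.get(primary)
    page[primary] = data[truncate_amount:] if isinstance(data, list) else None

    # Assumed behavior: secondary keys are blanked on the resumed page.
    for key in secondary:
        page[key] = [] if isinstance(page.get(key), list) else None
    return page


# A response shaped like the one in this issue: no Contents, only prefixes.
parsed = {
    "CommonPrefixes": [{"Prefix": "2023-01-06/"}, {"Prefix": "2023-01-07/"}],
    "IsTruncated": True,
}
resumed = resume_first_page(parsed, ["Contents", "CommonPrefixes"])
print(resumed["CommonPrefixes"])
```

Under that assumption the resumed page loses its CommonPrefixes even though S3 returned them, which would match the empty second page in the reproduction.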

SDK version used

1.31.17

Environment details (OS name and version, etc.)

MacOS 14.2.1 (23C71)


dboyadzhiev added the bug and needs-triage labels · Feb 14, 2024

RyanFitzSimmonsAK self-assigned this · May 9, 2024

RyanFitzSimmonsAK added the investigating, s3, and p2 labels and removed the needs-triage label · May 9, 2024
