PageIterator skipping a page when browsing list_objects_v2 with Delimiter #3119

Open
dboyadzhiev opened this issue Feb 14, 2024 · 0 comments
Labels
bug This issue is a confirmed bug. investigating This issue is being investigated and/or work is in progress to resolve the issue. p2 This is a standard priority issue s3

dboyadzhiev commented Feb 14, 2024

Describe the bug

The paginator for S3 list_objects_v2 skips a page when CommonPrefixes (i.e. Delimiter) is used together with StartingToken.

Use case:
Our API provides a list of S3 "folders" and supports pagination. It is a wrapper over our internal S3 bucket and forwards the information. The first response of the API returns a list of common prefixes and the next token provided by the PageIterator. The second request uses this token to continue the listing.

Expected Behavior

Using the paginator.paginate() method with the Delimiter parameter and not setting StartingToken should return all pages starting from the first one and its next token.
Using it again but this time with a given StartingToken (the first page next token) should return all pages starting from the second one and its next token.
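
For comparison, the expected page-by-page contract can be sketched with direct list_objects_v2 calls, passing NextContinuationToken back as ContinuationToken. This is only an illustration and does not go through the paginator; the helper name list_prefix_page and its parameters are hypothetical, and client is assumed to be a boto3 S3 client.

```python
def list_prefix_page(client, bucket, prefix="", token=None, page_size=5):
    """Fetch one page of common prefixes plus the token for the next page.

    Hypothetical helper: `client` is assumed to be a boto3 S3 client (or any
    object exposing list_objects_v2 with the same keyword arguments).
    """
    kwargs = {
        "Bucket": bucket,
        "Delimiter": "/",
        "Prefix": prefix,
        "MaxKeys": page_size,
    }
    # Only send ContinuationToken when resuming; the first call has none.
    if token is not None:
        kwargs["ContinuationToken"] = token

    response = client.list_objects_v2(**kwargs)

    # Both keys may be absent from the response, so read them defensively.
    prefixes = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    return prefixes, response.get("NextContinuationToken")


# Usage (requires boto3 and credentials; "my-bucket" is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   page1, token = list_prefix_page(s3, "my-bucket")
#   page2, token = list_prefix_page(s3, "my-bucket", token=token)
```

With this contract, the second call returns the prefixes that immediately follow the first page, which is what the paginator is expected to do when given StartingToken.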

Current Behavior

When paginator.paginate() is called with StartingToken, the first page it yields (page 2 overall) has an empty CommonPrefixes list, and the following page (page 3) has a valid CommonPrefixes list. The prefixes that belong to page 2 are never returned.

Reproduction Steps

You need a bucket with date partitions and files in them.

S3://by_bucket/2023-01-01/file1.json
S3://by_bucket/2023-01-01/file2.json
S3://by_bucket/2023-01-02/file1.json
S3://by_bucket/2023-01-02/file2.json
...
S3://by_bucket/2023-12-01/file1.json
S3://by_bucket/2023-12-01/file2.json

import boto3

BUCKET_NAME = ""  # name of the bucket to list
PREFIX = ""       # key prefix to list under ("" lists from the bucket root)

s3_client = boto3.client("s3")


def request_page(token):
    """Return a page iterator that starts at `token` (None starts at page 1)."""
    paginator = s3_client.get_paginator('list_objects_v2')
    return paginator.paginate(
        Bucket=BUCKET_NAME,
        Delimiter='/',
        Prefix=PREFIX,
        PaginationConfig={'PageSize': 5, 'StartingToken': token}
    )


def run_request(label, token, steps):
    """Print pages until one contains prefixes; return (next_token, steps)."""
    print(label)
    next_token = None
    for page in request_page(token):
        steps += 1

        # Use .get(): S3 omits these keys when there is nothing to report.
        prefixes = page.get('CommonPrefixes', [])
        print(prefixes)
        next_token = page.get('NextContinuationToken')

        if prefixes:
            print(f"done in step: {steps}")
            break
    return next_token, steps


# Simulate multiple requests to an API
steps = 0

# First request: print page 1 prefixes and keep the token for page 2
next_token, steps = run_request("Request 1", None, steps)

# Second request: print page 2 prefixes and keep the token for page 3
next_token, steps = run_request("Request 2", next_token, steps)

Output:

> Request 1
> S3://by_bucket/2023-01-01
> S3://by_bucket/2023-01-02
> S3://by_bucket/2023-01-03
> S3://by_bucket/2023-01-04
> S3://by_bucket/2023-01-05
> done in step: 1
>
> Request 2
> []
> S3://by_bucket/2023-01-11
> S3://by_bucket/2023-01-12
> S3://by_bucket/2023-01-13
> S3://by_bucket/2023-01-14
> S3://by_bucket/2023-01-15
> done in step: 3

Possible Solution

No response

Additional Information/Context

I followed the issue down to PageIterator.__iter__() (.venv/lib/python3.11/site-packages/botocore/paginate.py)

            if first_request:
                # The first request is handled differently.  We could
                # possibly have a resume/starting token that tells us where
                # to index into the retrieved page.
                if self._starting_token is not None:
                    starting_truncation = self._handle_first_request(
                        parsed, primary_result_key, starting_truncation
                    )
                first_request = False
                self._record_non_aggregate_key_values(parsed)

The primary_result_key is initialized a few lines before that as self.result_keys[0], and result_keys comes from the paginator definition in venv/lib/python3.11/site-packages/botocore/data/s3/2006-03-01/paginators-1.json

"ListObjectsV2": {
      "more_results": "IsTruncated",
      "limit_key": "MaxKeys",
      "output_token": "NextContinuationToken",
      "input_token": "ContinuationToken",
      "result_key": [
        "Contents",
        "CommonPrefixes"
      ]
    },

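Here the first result_key (and therefore the primary_result_key) is Contents, which is missing from the parsed S3 response body because the Delimiter rolls every key up into CommonPrefixes.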
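
To make the suspected mechanism concrete, here is a simplified model of what first-request handling appears to do when resuming from a token. This is not botocore's code: the function handle_first_request below is a hypothetical stand-in written for illustration, and it assumes that the primary result key is sliced while every other result key is reset to an empty value.

```python
def handle_first_request(parsed, result_keys, truncate_index=0):
    """Simplified, hypothetical model of resuming from a starting token.

    Assumption: the primary key (result_keys[0]) is sliced from
    `truncate_index`, and every other result key is reset to an empty value.
    """
    primary, *secondary = result_keys

    data = parsed.get(primary)
    parsed[primary] = data[truncate_index:] if isinstance(data, list) else None

    # Secondary keys are blanked, even if they hold the only real results.
    for key in secondary:
        parsed[key] = [] if isinstance(parsed.get(key), list) else None
    return parsed


result_keys = ["Contents", "CommonPrefixes"]

# First response after resuming: with a Delimiter there is no Contents key,
# only CommonPrefixes (sample values are made up for the illustration).
page = {
    "CommonPrefixes": [{"Prefix": "2023-01-06/"}, {"Prefix": "2023-01-07/"}],
    "NextContinuationToken": "token-for-page-3",
}

resumed = handle_first_request(dict(page), result_keys)
print(resumed["CommonPrefixes"])  # []
```

Under that assumption, the first resumed page loses its CommonPrefixes while its NextContinuationToken is kept, which would match the empty list followed by page 3 in the output above.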

SDK version used

1.31.17

Environment details (OS name and version, etc.)

MacOS 14.2.1 (23C71)

@dboyadzhiev dboyadzhiev added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Feb 14, 2024
@RyanFitzSimmonsAK RyanFitzSimmonsAK self-assigned this May 9, 2024
@RyanFitzSimmonsAK RyanFitzSimmonsAK added investigating This issue is being investigated and/or work is in progress to resolve the issue. s3 p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels May 9, 2024