
Issue with read after write consistency #531

Open
gusostow opened this issue Oct 28, 2022 · 5 comments
gusostow commented Oct 28, 2022

Version: 2.0.0

I find that files uploaded to S3 are not always immediately available for access.

It was my understanding that this should not happen given S3's strong read-after-write consistency. Is s5cmd exiting before the PUT request returns?

#!/bin/bash
set -e

while true; do
    name=$(openssl rand -hex 12)
    path="/tmp/$name"
    outpath="/tmp/$name/out"
    url="s3://bucket/$name"

    mkdir "$path"
    mkdir "$outpath"
    touch "$path/file"

    s5cmd cp "$path/*" "$url/"
    s5cmd --log trace cp "$url/*" "$outpath/"

    if [[ ! -f "$outpath/file" ]]; then
        echo "$outpath/file doesn't exist"
        exit 1
    else
        echo "$outpath/file exists"
    fi
    sleep 1
done

This tends to fail after a few iterations for me.

Note that a similar variant of this test never fails when the file is transferred directly (e.g. s5cmd cp "$path/file" "$url/file") instead of via the directory/prefix wildcard.

gusostow (Author)

Looking at the logs more closely: the second command, s5cmd --log trace cp "$url/*" $outpath/, which copies back from S3 to local, exits 0 even though no files are copied via the wildcard. It's my understanding that s5cmd normally exits 0 and logs an error when this is the case.

See example output from the reproducer:

cp /tmp/19bb0c6e83d17480924f9206/file s3://BUCKET/19bb0c6e83d17480924f9206/file
DEBUG: Request s3/ListObjectsV2 Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /?list-type=2&prefix=aostow%2F19bb0c6e83d17480924f9206%2F HTTP/1.1
Host: asd-tmp.s3.amazonaws.com
User-Agent: aws-sdk-go/1.40.25 (go1.18.3; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=AKIAWP5DEQE432HWSWWU/20221028/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=36910278f845f4e08f6959d0821e3df29b4fdf633facc0a70657ccecba2ed7d7
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20221028T183948Z
Accept-Encoding: gzip


-----------------------------------------------------
DEBUG: Response s3/ListObjectsV2 Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Fri, 28 Oct 2022 18:39:49 GMT
Server: AmazonS3
X-Amz-Bucket-Region: us-east-1
X-Amz-Id-2: Tr7r7r2dyW3MTw9SkOlsfyXxC+R3ORNbZv2RNSqJPZaXqy4mn8FalkBjksHDa0dSQR2BoDQ/ibk=
X-Amz-Request-Id: GH4N1ZXGZ02TE3ST


-----------------------------------------------------
/tmp/19bb0c6e83d17480924f9206/out/file doesn't exist

s5cmd cp exited with 0

s5cmd is listing objects and then exiting 0 without copying anything.


mwkaufman commented May 12, 2023

Seeing the same lack of strong consistency with the script below in the latest beta v2.1.0-beta.1-3e08061, as well as in v2.0.0 and v1.4.0-d7a0dda:

#!/bin/bash

echo "### COPY"
s5cmd --log debug --json --stat cp s3://bucket1/my.parquet s3://bucket2/my.parquet

echo "### LIST, copied item missing"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet
sleep 5

echo "### LIST2, same query after sleep, item is there"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet

It doesn't happen 100% of the time, but over 50% of the time I'd say. I wrote similar scripts using the AWS CLI and github.com/aws/aws-sdk-go v1.40.25, and they never showed this inconsistency. The file used in testing is 122 KB; s5cmd behaves the same whether it is a new file or an overwritten one.

mwkaufman

So this seems tied to S3 having a fast clock. I.e., we did a copy at 16:27:27, failed a read at 16:27:28, and then succeeded a read at 16:27:33 that showed a LastModified time of 16:27:29. s5cmd discarded the object on the initial read due to the logic that ignores objects with future timestamps, introduced in https://github.com/peak/s5cmd/pull/188/files. Accordingly, when we went back to s5cmd 1.0.0 it worked fine.

Thoughts on a flag to optionally disable ignoring files with future timestamps? e.g. s5cmd --show-files-with-future-lastmodified-dates ls s3://bucket2/my.parquet

https://github.com/peak/s5cmd/blob/v2.1.0-beta.1/storage/s3.go#L208 -> if mod.After(now) && ignoreFutureFiles {
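
As a minimal, self-contained sketch of the behavior being described (the date below is arbitrary, only the clock times matter, and ignoreFutureFiles stands in for the hypothetical flag-controlled switch):

package main

import (
	"fmt"
	"time"
)

func main() {
	// From the reported timeline: the client captures "now" at 16:27:28,
	// but S3's (slightly fast) clock stamps the new object 16:27:29.
	now := time.Date(2023, 5, 12, 16, 27, 28, 0, time.UTC)
	lastModified := time.Date(2023, 5, 12, 16, 27, 29, 0, time.UTC)

	// Hypothetical switch, wired to something like the proposed
	// --show-files-with-future-lastmodified-dates flag (filter on by default).
	ignoreFutureFiles := true

	if lastModified.After(now) && ignoreFutureFiles {
		fmt.Println("object skipped: LastModified is in the client's future")
		return
	}
	fmt.Println("object listed")
}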

denizsurmeli (Contributor) commented Jul 6, 2023

The reason this problem occurs is that the client and the server have different clocks, as @mwkaufman mentioned.

s5cmd/storage/s3.go

Lines 333 to 337 in 74d21bb

// track the instant object iteration began,
// so it can be used to bypass objects created after this instant
if now.IsZero() {
	now = time.Now().UTC()
}

s5cmd/storage/s3.go

Lines 345 to 349 in 74d21bb

mod := aws.TimeValue(c.LastModified).UTC()
if mod.After(now) {
	objectFound = true
	continue
}

These checks were added for #168. ListObjectsV2 paginates when more than 1000 objects match. If you move an object that matches the wildcard pattern, and the moved object then shows up in a later ListObjectsV2 page, it gets recopied because it still matches the pattern, creating an infinite loop. This can happen because s5cmd does not list all the objects before it starts copying; it runs both operations simultaneously. gsutil, by contrast, lists first and then copies, and aws-cli has different wildcard semantics than s5cmd.
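
As a toy, self-contained illustration of that loop (not s5cmd code): the "listing" below keeps discovering keys produced by its own "moves", so without some cutoff, such as the timestamp filter, it would never drain.

package main

import "fmt"

func main() {
	// Toy model of the #168 scenario: keys are paged through one by one,
	// and each listed key is immediately "moved" to a destination that
	// still matches the listing prefix, so later pages keep surfacing
	// the objects produced by the iteration itself.
	bucket := []string{"prefix/a", "prefix/b"}
	for i := 0; i < len(bucket); i++ {
		if i == 8 { // cap the demo; without a cutoff this never terminates
			fmt.Println("... and so on, forever")
			break
		}
		key := bucket[i]
		moved := key + ".moved" // destination still starts with "prefix/"
		bucket = append(bucket, moved)
		fmt.Printf("moved %s -> %s\n", key, moved)
	}
}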

Another approach is #567, where we take the time from the response header instead of the client clock. But LastModified has millisecond precision, while the date returned in the header has only second precision. So imagine operating as in #168: you start at 18:00:00.555; the actual server time is 18:00:00.750, but the header reports 18:00:01. You move an object and it gets a LastModified of 18:00:00.850. The object slips past the filter, causing #168 to reoccur. Granted, this is squeezing into a 0.0001 percent corner of the possibilities, and you could argue this residual problem still exists in the current implementation as well.
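
A small runnable sketch of that edge case, using the timestamps from the example above (the HTTP Date header format only carries whole seconds):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The HTTP Date header has only second precision, so a true server
	// time of 18:00:00.750 comes back rounded to 18:00:01.
	headerNow, err := http.ParseTime("Thu, 06 Jul 2023 18:00:01 GMT")
	if err != nil {
		panic(err)
	}

	// The moved object's LastModified has sub-second precision: 18:00:00.850.
	lastModified := time.Date(2023, 7, 6, 18, 0, 0, 850_000_000, time.UTC)

	// The filter only skips objects strictly newer than "now", so this
	// object passes and a later list page can pick it up again,
	// re-creating the #168 loop.
	fmt.Println("skipped:", lastModified.After(headerNow)) // skipped: false
}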

So, what are some possible solutions?

  • Implementing the flag solution proposed by @mwkaufman.
  • Try Getting Time from Server for List Calls #567.
  • Implementing the solution that gsutil uses: list everything first, then copy (see the sketch after this list). This creates memory pressure when there are too many keys.
  • Using Amazon's NTP. Amazon synchronizes its servers per region using NTP, and we could use the server-provided clock as the client clock.
  • Decoupling the filter from ls and moving the check into the commands that actually need it.
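
For the gsutil-style option, here is a rough sketch using aws-sdk-go v1 (the SDK s5cmd builds on); the bucket and prefix names are placeholders and error handling is abbreviated. Because the listing is fully drained before any copy starts, the copies' own output can never re-enter the iteration, so no timestamp filter is needed; the trade-off is holding every matching key in memory:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	client := s3.New(sess)

	// Phase 1: drain the full listing before copying anything.
	var keys []string
	err := client.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String("bucket1"),  // placeholder source bucket
		Prefix: aws.String("prefix/"), // placeholder prefix
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			keys = append(keys, aws.StringValue(obj.Key))
		}
		return true // keep paging
	})
	if err != nil {
		log.Fatal(err)
	}

	// Phase 2: copy from the frozen listing. Objects written by these
	// copies cannot appear in `keys`, so there is no re-listing loop and
	// hence no need to drop objects with future LastModified times.
	for _, key := range keys {
		_, err := client.CopyObject(&s3.CopyObjectInput{
			Bucket:     aws.String("bucket2"), // placeholder destination
			Key:        aws.String(key),
			CopySource: aws.String("bucket1/" + key), // URL-encode special characters in real use
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("copied", key)
	}
}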

denizsurmeli (Contributor) commented Jul 13, 2023

After further research and discussion, we have concluded that the current solution is the best-effort solution at the moment. We are leaning towards adding a flag to disable the future-object check when ls is called.

igungor removed this from the v2.2.0 milestone Aug 21, 2023