S3g sidecar passthru #8287

Closed · wants to merge 51 commits
Conversation

lukemarsden (Contributor)

Draft for collaboration/knowledge sharing

lukemarsden changed the base branch from master to 2.3.x on October 17, 2022
// body into memory, if it does that's bad for large writes
// and we should figure out how we can stream it to disk
// first...
awsauth.Sign4(req, awsauth.Credentials{
lukemarsden (Contributor Author):

Looks like we probably need to plumb the volume with the web identity token into it. See https://pachyderm.slack.com/archives/C01LBA4NSJU/p1666033437705929?thread_ts=1666032276.294569&cid=C01LBA4NSJU

sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
# sc.setLogLevel("DEBUG")
sc.setSystemProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true")
lukemarsden (Contributor Author):

probably don't need this any more

import os

conf = SparkConf()
minio = False
lukemarsden (Contributor Author):

could strip this out


Writes to `s3_out` then work reliably with Spark, even when Spark writes a large amount of data. (With the normal S3 gateway, large writes cause slow-downs and errors relating to "copyFile" failing.)

This directory contains a worked example. We've built and pushed the Docker image for you already, so all you need to do is run:
lukemarsden (Contributor Author):

link to the bits?

var CurrentBucket string = "out"

// This is like pipeline_name-<job-id>
var CurrentTargetPath string = ""
lukemarsden (Contributor Author):

JobScopedPrefix

awsauth "github.com/smartystreets/go-aws-auth"
)

type RawS3Proxy struct {
lukemarsden (Contributor Author):

put the global variables inside the struct, duh
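Taken together with the `JobScopedPrefix` rename suggested above, the reviewer's suggestion could look like this; this is a sketch, and any field layout beyond the two names mentioned in the review is an assumption:

```go
package main

// RawS3Proxy folds the package-level globals into the proxy struct,
// renaming CurrentTargetPath to JobScopedPrefix as suggested in review.
type RawS3Proxy struct {
	CurrentBucket   string // was a global, e.g. "out"
	JobScopedPrefix string // was CurrentTargetPath; like pipeline_name-<job-id>
}
```

Keeping this state on the struct also means each proxy instance can serve a different job without racing on shared globals.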

lukemarsden commented Oct 18, 2022

Wireshark: https://github.com/eldadru/ksniff#ksniff

kubectl sniff -p pipeline-spark-s3-demo-v3-a1b2 storage

Then set a display filter in Wireshark to `http`.

// transform each of the response headers
for k, v := range resp.Header {
for i, vv := range v {
v[i] = transform(vv)
lukemarsden (Contributor Author):

should be untransform

lukemarsden and others added 17 commits October 18, 2022 19:57
… (see user container pipeline logs grep for PROXY)
…e storage secrets, and we don't want to leak them to the user code. The triggering and blocking on the copy however needs to happen from the user code. So we make an http call from the user container on localhost to a new API endpoint on the storage container called /finish, BEFORE we finish the commit in the worker, and block until it completes.
msteffen (Contributor):

Closing PR as it's obsolete (though we should consider revisiting the project someday)

@msteffen msteffen closed this Aug 17, 2023
4 participants