Add LogStore to retain build logs, and a S3LogStore implementation #967

manics · 2020-10-03T13:16:14Z

Prototype of uploading repo2docker build logs to S3 (jupyterhub/binderhub#1156)

If you don't have access to an S3 server run a local one:

docker run -it --rm -e MINIO_ACCESS_KEY=minio -e MINIO_SECRET_KEY=minio123 -p 9000:9000 minio/minio server /data

Download an S3 client e.g. the minio client from https://min.io/download

Add temporary client config called mybindertest for the local test server

export MC_HOST_mybindertest=http://minio:minio123@localhost:9000

Create a bucket called mybinder

mc mb mybindertest/mybinder

Allow public downloads from the bucket

mc policy set download mybindertest/mybinder

Repo2docker S3 logs config

cat << EOF > s3config.py 
c.Repo2Docker.s3_logs_endpoint = 'http://localhost:9000'
c.Repo2Docker.s3_logs_access_key = 'minio'
c.Repo2Docker.s3_logs_secret_key = 'minio123'
c.Repo2Docker.s3_logs_bucket = 'mybinder'
EOF

Build with image name asd123

repo2docker --config s3config.py --image-name asd123 https://github.com/binder-examples/requirements

Logs should be available at
http://localhost:9000/mybinder/buildlogs/asd123/repo2docker.log

If you can't be bothered to setup your own Minio server you can use the public demo one at play.min.io:

export MC_HOST_mybindertest=https://Q3AM3UQ867SPQQA43P2F:zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG@play.min.io
mc mb mybindertest/mybinder
mc policy set download mybindertest/mybinder

Those demo credentials are public, I've already run the above so assuming no-one else has modified it you just need:

cat << EOF > s3config.py 
c.Repo2Docker.s3_logs_endpoint = 'https://play.min.io'
c.Repo2Docker.s3_logs_access_key = 'Q3AM3UQ867SPQQA43P2F'
c.Repo2Docker.s3_logs_secret_key = 'zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG'
c.Repo2Docker.s3_logs_bucket = 'mybinder'
EOF

repo2docker --config s3config.py --no-run https://github.com/binder-examples/conda
...
Successfully tagged r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319:latest
Uploading log to mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

https://play.minio.io/mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

minrk

Awesome! I think this is a great approach. I had been thinking we should do it at the deployment/mybinder.org level, but this does seem lots simpler.

repo2docker/app.py

minrk · 2020-10-05T07:02:03Z

repo2docker/app.py

+            and self.s3_logs_secret_key
+            and self.s3_logs_bucket
+        ):
+            logfile = tempfile.NamedTemporaryFile("w", delete=False)


These builds logs have been useful in the past, maybe they should be kept anyway, even if not uploading to s3?

repo2docker/app.py

setup.py

manics · 2020-10-05T08:29:50Z

Thanks for reviewing! I started with repo2docker for this prototype because it was easier to get the logs, but I've since figured out how to do it on the BinderHub side: jupyterhub/binderhub#1161. Obviously all your review comments from here also apply there.

I think BinderHub is a better location since it integrates multiple components though the repo2docker is a bit simpler. Let me know which you prefer and I'll put more time into it.

betatim · 2020-10-07T05:59:24Z

Came here to find out why "store logs in s3" needed changes in repo2docker. My guess at an implementation would have either been changes to BinderHub chart (inject a sidecar) or in BinderHub itself. Is it worth discussing the pros and cons?

Having thought about it for a bit while writing this I am starting to think that r2d is the right place (sidecar or "native"), not bhub. By putting the code into r2d we can restart the hub pod without losing logs and maintain the kind of "fire and forget" setup we have right now for builds (bhub asks for a build to start and then doesn't need to maintain any state about it).

betatim · 2020-10-07T06:19:07Z

About naming the bucket: I think there are two options:

use a name derived from the resolved ref (in the Git case that would be a SHA1 and URL)
use a name derived from the unresolved ref (in the Git case that would be a branch/tag name and URL)

(the "derived from" is meant to represent us hashing/mangling/escaping/processing the inputs to generate a bucket name)

I like naming it after the resolved ref as that generates unique names for free and we can have a history of previous builds. However I can't (yet) think of a nice way for a user to find those logs again. The scenario i am thinking about is this:

user builds master of org/repo thinking "let me quickly build this repo"
build fails with some error that looks easy to fix
user makes a commit in org/repo
user builds master of org/repo again
error persists, "ok, maybe let me look at that original log again, must have overlooked something"
user now has to go find the original commit's SHA1 in order to get access to the log??

Basically in a situation where you build and rebuild and rerebuild the same repo by adding commits to your repository it becomes quite tricky to (re)find the log of a previous build.

I think this is a user interface problem which we could solve in BinderHub with some GUI or by clever structuring of the bucket names or some other way. I think the right level of complexity for a user would be presenting them with a chronologically sorted list of build logs that has a link back to the source repository (in the state it was when we built it) and maybe even the human recognisable ref that they typed into the BinderHub form.

minrk · 2020-10-07T08:13:26Z

Basically in a situation where you build and rebuild and rerebuild the same repo by adding commits to your repository it becomes quite tricky to (re)find the log of a previous build.

I see two pieces of information useful for solving this problem:

We could use a directory prefix (s3 doesn't really have directories) based on the repo only, so that it's easy to list all build logs for a repo. This may also come in handy for per-repo quotas/culling purposes, though e.g. 30-day internal bucket expiry may be sufficient for our purposes.
We can use metadata fields like we do in the analytics-archive to store information like resolved_ref, unresolved_ref, repo, date, etc.

Key information: prefix can be used to filter files to a short manageable list per-repo, metadata cannot, hence the prefix. bucket-storage like S3/GCS does not support things like list-files-matching-metadata.

Then we can have a BinderHub API based on this info to retrieve the build log history for a repo, e.g.

GET /api/build-logs/$provider/$repo -> list of build logs for repo, sorted by date with info about which ref they were building
GET /api/build-log/$provider/$repo/$ref -> single log after resolving ref

manics · 2020-10-07T19:02:23Z

About naming the bucket: I think there are two options:

use a name derived from the resolved ref (in the Git case that would be a SHA1 and URL)
use a name derived from the unresolved ref (in the Git case that would be a branch/tag name and URL)

For the benefit of anyone new to S3 I thought I'd clear up a bit of terminology. You have an S3 server, a bucket, and a key which references an object. With many public clouds you also have a region but ignore that for now.

This means the URL to your object (in this case a log-file) is something like https://<s3-server.example.org>/<bucket>/<key>

The bucket is a container for multiple objects.
The key is a single string, / has no meaning. As @minrk mentioned you can simulate a directory structure by doing a prefix search, but since it's not a real directory you can confuse people by creating both /bucket/foo and /bucket/foo/bar as objects 😀 (some S3 implementations will prevent this).

So in my original example:
https://play.minio.io/mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

Server: https://play.minio.io
Bucket: mybinder
Key: buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

This adds unnecessary complexity since it depends on the S3 server, so rely on the user to know how to get the URL

Add boto3 to test dependencies

…mage name).

AWS S3 ACLs are legacy: https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/

repo2docker knows nothing about the logstore other than it should copy the Docker build logs to it

manics · 2020-10-16T18:43:56Z

I think this is quite close to being ready. I took @minrk's suggestion (thanks!) of using a Configurable which I think makes things a lot nicer, the changes to app.py are fairly small, most of the logic is in logstore.S3LogStore.

The main issue left is how to configure S3LogStore, in particular how to set the log name when integrated into BinderHub. At present S3LogStore knows nothing about the R2D configuration, so it relies on S3LogStore.logname to be set to the desired S3 key/path, which means it has to be set in a config file passed as repo2docker --config my_config.py:

c.Repo2Docker.logstore = 'repo2docker.logstore.S3LogStore'
c.S3LogStore.endpoint = 'http://192.168.1.1:9000'
c.S3LogStore.access_key = 'minio'
c.S3LogStore.secret_key = 'minio123'
c.S3LogStore.bucket = 'mybinder'
c.S3LogStore.logname = 'buildlogs/repo2docker.log'
c.S3LogStore.metadata = {"R2D-Key": "value",}

This is probably OK for the static S3 connection parameters, but the logname will change with every build so the config would need to be rewritten. Options:

Don't change anything, always rewrite the config file
Add a new repo2docker argument to set the logname, and possibly the metadata if we want it.
Allow configurables to be set on the command line: --S3LogStore.logname=foo/bar.txt

Any thoughts? I quite like option 3 as it matches other Jupyter/Ipython applications, and hides the new feature from a typical user. It's not currently working in R2D though, was this a deliberately design choice? It seemed easy enough to add:

diff --git a/repo2docker/__main__.py b/repo2docker/__main__.py
index 907460d..5ea22d8 100644
--- a/repo2docker/__main__.py
+++ b/repo2docker/__main__.py
@@ -232,7 +232,8 @@ def make_r2d(argv=None):
         print(__version__)
         sys.exit(0)
 
-    args = get_argparser().parse_args(argv)
+    print(argv)
+    args, unknown_args = get_argparser().parse_known_args(argv)
 
     r2d = Repo2Docker()
 
@@ -240,6 +241,9 @@ def make_r2d(argv=None):
         r2d.log_level = logging.DEBUG
 
     r2d.load_config_file(args.config)
+    if unknown_args:
+        print("Unknown args: ", unknown_args)
+        r2d.parse_command_line(unknown_args)
     if args.appendix:
         r2d.appendix = args.appendix

manics · 2020-10-16T19:06:18Z

On K8S sidecar vs native in BinderHub: AFAIK since the repo2docker container logs to stdout it's not possible to capture them in a sidecar without changing repo2docker to log to a location the sidecar logging agent can read, such as a file on a shared volume or a network socket. I might be wrong on this though.

BinderHub obtains the logs using the K8s API
https://github.com/jupyterhub/binderhub/blob/acdab1952385eead7650590464bd218f1892fca6/binderhub/build.py#L350

consideRatio · 2022-10-31T09:24:50Z

I'm onboard with the idea and I love that you have created an abstract LogStore class where s3 is one implementation of it.

Happy to help with review efforts if you want to bring this to life again @manics!

s3 logs prototype

e2633b0

This was referenced Oct 3, 2020

Persist repository build logs for later access jupyterhub/binderhub#1156

Open

WIP: Upload build logs to S3 jupyterhub/binderhub#1161

Closed

minrk reviewed Oct 5, 2020

View reviewed changes

manics added 10 commits October 8, 2020 22:10

Remove ANSI terminal escapes from logs

fff010b

Move S3 log upload into a Configurable class

83d0f28

Remove S3LogStore.close return URL

52c732b

This adds unnecessary complexity since it depends on the S3 server, so rely on the user to know how to get the URL

Add mock unittest for S3LogStore

d69ca94

Add boto3 to test dependencies

S3LogStore set charset=utf-8

c3991e8

S3LogStore add metadata

a028dc2

S3LogStore fix logname

6d54148

S3LogStore only log if non-empty, default key repo2docker.log (no i…

792bbe0

…mage name).

logstore: remove ACL

cd90863

AWS S3 ACLs are legacy: https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/

Move all log config into S3LogStore

0d28357

repo2docker knows nothing about the logstore other than it should copy the Docker build logs to it

Merge remote-tracking branch 'origin/master' into s3logs

8cadb5b

consideRatio changed the title ~~WIP: s3 logs prototype~~ s3 logs prototype Oct 30, 2022

consideRatio changed the title ~~s3 logs prototype~~ Add support for uploading build logs to S3 Oct 31, 2022

consideRatio added the new label Oct 31, 2022

consideRatio changed the title ~~Add support for uploading build logs to S3~~ Add LogStore, and a S3LogStore implementation to retaining build logs Oct 31, 2022

consideRatio changed the title ~~Add LogStore, and a S3LogStore implementation to retaining build logs~~ Add LogStore, and a S3LogStore implementation to retain build logs Oct 31, 2022

consideRatio changed the title ~~Add LogStore, and a S3LogStore implementation to retain build logs~~ Add LogStore to retain build logs, and a S3LogStore implementation Oct 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LogStore to retain build logs, and a S3LogStore implementation #967

Add LogStore to retain build logs, and a S3LogStore implementation #967

manics commented Oct 3, 2020

minrk left a comment

minrk Oct 5, 2020

manics commented Oct 5, 2020

betatim commented Oct 7, 2020

betatim commented Oct 7, 2020

minrk commented Oct 7, 2020

manics commented Oct 7, 2020 •

edited

manics commented Oct 16, 2020

manics commented Oct 16, 2020

consideRatio commented Oct 31, 2022

Add LogStore to retain build logs, and a S3LogStore implementation #967

Are you sure you want to change the base?

Add LogStore to retain build logs, and a S3LogStore implementation #967

Conversation

manics commented Oct 3, 2020

minrk left a comment

Choose a reason for hiding this comment

minrk Oct 5, 2020

Choose a reason for hiding this comment

manics commented Oct 5, 2020

betatim commented Oct 7, 2020

betatim commented Oct 7, 2020

minrk commented Oct 7, 2020

manics commented Oct 7, 2020 • edited

manics commented Oct 16, 2020

manics commented Oct 16, 2020

consideRatio commented Oct 31, 2022

manics commented Oct 7, 2020 •

edited