
Docker layer caching #2696

Open: wants to merge 6 commits into main

Conversation

@joeleonjr (Contributor) commented:

Description:

Scanning Docker images, especially at scale, is relatively slow. Docker images are made up of layers, which are immutable and identified by a unique sha256 digest. The same layers often appear across different images (whether tagged versions of the same image or completely different images sharing the same base image layers). Users should be able to cache the results of scanning layers and opt to skip re-scanning those layers in future scans.

The primary use cases:

  • Security researchers / bug bounty hunters (BB) who want to scan thousands of Docker images in one go.
  • Code owners who want a more efficient way to scan all tagged versions of the images they're responsible for securing.

I added a decent number of unit tests as well as some integration tests to ensure that Docker layer caching works appropriately. Also, I couldn't figure out a way to tie scan results to particular layers without making small modifications to the engine; in particular, I hooked into the notification logic to cache layer findings. If there's a better way to accomplish this, I'm happy to re-implement.
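
To illustrate the intended flow (the helper names below are illustrative only, not the exact API added in this PR): before chunking a layer, the Docker source looks up the layer digest in the cache and skips layers that previously scanned clean, and findings are written back to the cache from the engine's notification path.

// Illustrative sketch; layerCache, SeenClean, and chunkLayer are hypothetical names.
for _, layer := range layers {
    digest := layer.Digest // immutable sha256 digest, stable across images

    skip, err := layerCache.SeenClean(digest)
    if err != nil {
        return err
    }
    if skip {
        continue // previously scanned with no findings, so don't re-chunk it
    }

    chunkLayer(ctx, layer) // chunk as usual; results are tied back to digest when notified
}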

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@joeleonjr requested review from a team as code owners on April 11, 2024, 13:35
@joeleonjr added the enhancement, pkg/engine, and pkg/sources labels on Apr 11, 2024
@rgmz (Contributor) left a comment:

I looked into this in the past, and a solution like this is what I concluded made sense as well.

Though, it's possible that google/go-containerregistry has some obscure/poorly-documented caching feature (https://github.com/google/go-containerregistry/tree/main/pkg/v1/cache), or that it's possible to piggy-back on the native docker caches.
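
For reference, wiring that go-containerregistry cache in looks roughly like the sketch below. Note that it caches layer blobs on disk (keyed by digest) so they aren't re-pulled from the registry, which is different from caching scan results; the cache path here is just an example.

import (
    "github.com/google/go-containerregistry/pkg/name"
    v1 "github.com/google/go-containerregistry/pkg/v1"
    "github.com/google/go-containerregistry/pkg/v1/cache"
    "github.com/google/go-containerregistry/pkg/v1/remote"
)

// cachedImage pulls an image reference and wraps it so that layer reads go
// through an on-disk blob cache instead of hitting the registry every time.
func cachedImage(imageRef string) (v1.Image, error) {
    ref, err := name.ParseReference(imageRef)
    if err != nil {
        return nil, err
    }
    img, err := remote.Image(ref)
    if err != nil {
        return nil, err
    }
    // "/tmp/layer-cache" is an example path, not something this PR defines.
    return cache.Image(img, cache.NewFilesystemCache("/tmp/layer-cache")), nil
}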

@@ -150,6 +150,8 @@ var (

dockerScan = cli.Command("docker", "Scan Docker Image")
dockerScanImages = dockerScan.Flag("image", "Docker image to scan. Use the file:// prefix to point to a local tarball, otherwise a image registry is assumed.").Required().Strings()
dockerCache = dockerScan.Flag("cache", "Use layer caching. Don't re-scan a layer that has already been scanned and is in the layer caching db.").Bool()
dockerCacheDB = dockerScan.Flag("cache-db", "Path to the layer caching database. Default is trufflehog_layers.sqlite3").Default("trufflehog_layers.sqlite3").String()
@rgmz (Contributor) commented on Apr 11, 2024:

If we're persisting data, it may be prudent to make the name more generic in case there's other stuff worth caching in the future.

e.g., caching binary files rather than re-scanning them

@joeleonjr (author) replied:

Are you thinking about data caching outside of the Docker source or within it?

Contributor replied:

I was thinking about data caching in general. I've previously experimented with using sqlite to do things like skip duplicate docker layers, binary files, and commits; cache GitHub API E-Tags; track progress so that scans can be resumed; etc.

Then again, based on @rosecodym's comment about performance it may not be desirable to re-use the same database for multiple purposes.

@rosecodym (Contributor) left a comment:

This looks great! I love those tests.

As we discussed synchronously:

  • There is technically a race condition here that can cause a layer to be erroneously skipped if the chunking code completes before the first found secret is notified. This is a manifestation of a general difficulty with caching within TruffleHog's architecture, but it is unlikely to cause enough problems to outweigh the value of this change.
  • SQLite's default serialization might effectively reduce the concurrency of secret notification to 1 for heavy Docker scans, but since this makes heavy Docker scans so much faster overall, that is almost certainly an acceptable tradeoff.
    • If the concurrency reduction does occur, SQLite has a multithreaded setting that might ameliorate it a bit (a sketch of related settings follows this list).
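
If that contention does show up, here is a minimal sketch of one mitigation, assuming the mattn/go-sqlite3 driver (this uses WAL mode plus a busy timeout rather than SQLite's threading-mode setting, but it targets the same write contention):

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // assumed driver; adjust if the PR uses a different one
)

func openLayerDB(path string) (*sql.DB, error) {
    // WAL lets readers proceed while a writer holds the lock, and the busy timeout
    // makes contending writers wait (up to 5s) instead of failing with SQLITE_BUSY.
    dsn := "file:" + path + "?_journal_mode=WAL&_busy_timeout=5000"
    return sql.Open("sqlite3", dsn)
}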

if err != nil {
    return false, err
}
if !verified && !unverified_with_error {
Contributor commented:

It doesn't look like you need to track these separately - is this implementation speculating about feature requirements?

@joeleonjr (author) replied:

We could consolidate to one var, but I'm assuming some environments will encounter verification errors for certain detectors on an ongoing basis, and they might want to skip those (which would require another CLI flag) in favor of only re-scanning when there is a verified secret. But maybe that's not a frequent enough occurrence to justify storing them separately.

@ahrav (Collaborator) left a comment:

Very interesting, nice work!

// Handle docker layer caching if applicable
if e.dockerCache && r.SourceType == sourcespb.SourceType_SOURCE_TYPE_DOCKER {
    layer := r.SourceMetadata.GetDocker().Layer
    db, err := docker.ConnectToLayersDB(e.dockerCacheDb)
Collaborator commented:

suggestion: I noticed that the code creates and closes a new database connection for each result processed in the notifyResults function. Establishing and tearing down a connection for every result could introduce significant performance overhead, especially if the number of results is large.

To mitigate this, I suggest establishing database connections earlier in the engine's lifecycle and leveraging connection pooling. A pool of pre-established connections lets multiple goroutines share and reuse them efficiently, instead of paying the cost of creating and closing a connection per operation.

Concretely, we could create the pool when initializing the engine; then, for each result, acquire a connection from the pool, perform the database operations, and release it back when finished. This would minimize connection overhead and improve the overall performance of notifyResults.

What are your thoughts? Do you think it's feasible to establish connections earlier and utilize connection pooling in this scenario?

@joeleonjr (author) replied:

As I authored the code, I noticed it was creating/closing db connections a lot, but I honestly wasn't sure of the most efficient way to implement it. I think connection pooling would be a good addition, but I could use a hand with it since I haven't implemented anything like that before.

Collaborator replied:

Sure thing. Maybe we can do something like:

// Add the pool to the engine.

type Engine struct {
  ...
  dbPool *sql.DB
}

// Connect during init.

func WithDockerCache(dockerCache bool, dockerCacheDb string) Option {
    return func(e *Engine) {
        if dockerCache {
            err := e.InitializeDockerCache(dockerCacheDb)
            ...
        }
    }
}

func (e *Engine) InitializeDockerCache(dockerCacheDb string) error {
    var err error
    e.dbPool, err = sql.Open("sqlite3", dockerCacheDb)
    ...

    // Not sure if we want to make these configurable or use defaults.
    // Default could be the number of workers used for notifying?
    e.dbPool.SetMaxOpenConns(10)
    e.dbPool.SetMaxIdleConns(5)
    return nil
}

// Use the pool for operations.

func (e *Engine) updateDockerCache(layer string) error {
    // sql.DB is itself a pool; Conn checks out a single connection from it.
    conn, err := e.dbPool.Conn(context.Background())
    if err != nil {
        return err
    }
    defer conn.Close()

    // Perform database operations using the acquired connection
    // ...

    return nil
}

// InitializeLayersDB initializes the SQLite database with the digest table
// Schema: digest (digest TEXT UNIQUE, verified BOOLEAN, unverified_with_error BOOLEAN, completed BOOLEAN)
// It returns an error if encountered
func InitializeLayersDB(db *sql.DB) error {
Collaborator commented:

question: Would you mind explaining the rationale behind having separate verified and unverified_with_error columns in the database schema? What distinguishes a credential with verified = FALSE from one with unverified_with_error = TRUE?

From what I can gather, the code sets verified to TRUE when a credential is successfully verified, and sets unverified_with_error to TRUE when an error occurs during the verification process. However, the naming of unverified_with_error seems a bit ambiguous to me. Could you clarify what the “with_error” part signifies?

Moreover, I was wondering if we could potentially simplify the schema by using a single verified column to represent the verification status. This way, we could consolidate the UpdateVerified and UpdateUnverified functions into a single UpdateVerificationStatus function that takes the verification status as a parameter.

For example:

func UpdateVerificationStatus(db *sql.DB, digest string, verified bool) error {
    _, err := db.Exec("UPDATE digest SET verified = ? WHERE digest = ?", verified, digest)
    if err != nil {
        return err
    }
    return nil
}

By doing this, we could streamline the code and make it more straightforward.

Of course, I understand that the current schema might be designed to capture more granular information about the verification process. If that’s the case, could you provide some insights into the specific requirements or use cases that necessitate the separate columns?

@joeleonjr (author) replied:

@rosecodym brought up a similar point. It might make sense to consolidate to one. I was just considering the situation where there's an unverified secret due to something like a network error; in that case, we wouldn't want to skip that layer, since the secret could actually be valid. The reason for splitting into two vars is that I could imagine an org blocking certain traffic that keeps causing a verification error, where the user doesn't want to keep re-scanning that layer and would rather only look at verified secrets. But that could be a super small edge case and not worth the added code complexity.

pkg/sources/docker/docker_caching.go (outdated comment thread, resolved)
// SkipDockerLayer will return True iff the layer has been scanned before and no secrets were found.
// This function factors in previously unverified secrets that had errors, since they could still be valid.
// It returns an error if encountered
func SkipDockerLayer(db *sql.DB, digest string) (bool, error) {
Collaborator commented:

question: Do you anticipate this table growing significantly over time? And do you expect the SkipDockerLayer function to be called frequently within the application's workflow?

If so, it might be worthwhile to add a composite index on the (digest, completed, verified, unverified_with_error) columns to optimize the query in SkipDockerLayer.
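
For illustration, a sketch of what that could look like inside InitializeLayersDB (the index name is hypothetical). Since digest is already UNIQUE, SQLite maintains an implicit index on it, so the composite index would mainly act as a covering index for the status columns:

// Hypothetical addition to InitializeLayersDB; idx_digest_status is an illustrative name.
_, err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_digest_status
    ON digest (digest, completed, verified, unverified_with_error)`)
if err != nil {
    return err
}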

@joeleonjr (author) replied:

I'm not sure how large the db will actually get. I can run a test during some upcoming research, but my assumption is very few users will surpass 1M rows. Most will probably be between 10k-20k. If it's worth the performance boost to add a composite index, I'm game, but I could use an assist on doing it properly.

Contributor replied:

I can run some tests against a large Artifactory instance to get some realistic numbers, if you supply the commands.

pkg/sources/docker/docker.go (two resolved comment threads)
@rgmz mentioned this pull request on May 15, 2024