
Docker layer caching #2696

Open: wants to merge 6 commits into main

Conversation

@joeleonjr (Contributor) commented:

Description:

Scanning Docker images, especially at scale, is relatively slow. Docker images are made up of layers, which are immutable and identified by a unique sha256 digest. The same layers often appear across different images (whether tagged versions of the same image or completely different images sharing the same base image layers). Users should be able to cache the results of scanning layers and opt to skip re-scanning those layers in future scans.

The primary use cases:

  • Security researchers / bug bounty hunters (BB) who want to scan thousands of Docker images in one go.
  • Code owners who want a more efficient way to scan all tagged versions of the images they're responsible for securing.

I added a decent number of unit tests as well as some integration tests to ensure that Docker layer caching works appropriately. Also, I couldn't figure out a way to tie scan results to particular layers without making small modifications to the engine; in particular, I hooked into the notification logic to cache layer findings. If there's a better way to accomplish this, I'm happy to re-implement.
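
To illustrate the intended flow (the helper names below are illustrative only, not the exact API added in this PR): before chunking a layer, the Docker source looks up the layer digest in the cache and skips layers that previously scanned clean, and findings are written back to the cache from the engine's notification path.

// Illustrative sketch; layerCache, SeenClean, and chunkLayer are hypothetical names.
for _, layer := range layers {
    digest := layer.Digest // immutable sha256 digest, stable across images

    skip, err := layerCache.SeenClean(digest)
    if err != nil {
        return err
    }
    if skip {
        continue // previously scanned with no findings, so don't re-chunk it
    }

    chunkLayer(ctx, layer) // chunk as usual; results are tied back to digest when notified
}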

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@joeleonjr requested review from a team as code owners on April 11, 2024, 13:35
@joeleonjr added the enhancement, pkg/engine, and pkg/sources labels on Apr 11, 2024
@rgmz (Contributor) left a comment:

I looked into this in the past, and a solution like this is what I concluded made sense as well.

Though, it's possible that google/go-containerregistry has some obscure/poorly-documented caching feature (https://github.com/google/go-containerregistry/tree/main/pkg/v1/cache), or that it's possible to piggy-back on the native docker caches.
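
For reference, wiring that go-containerregistry cache in looks roughly like the sketch below. Note that it caches layer blobs on disk (keyed by digest) so they aren't re-pulled from the registry, which is different from caching scan results; the cache path here is just an example.

import (
    "github.com/google/go-containerregistry/pkg/name"
    v1 "github.com/google/go-containerregistry/pkg/v1"
    "github.com/google/go-containerregistry/pkg/v1/cache"
    "github.com/google/go-containerregistry/pkg/v1/remote"
)

// cachedImage pulls an image reference and wraps it so that layer reads go
// through an on-disk blob cache instead of hitting the registry every time.
func cachedImage(imageRef string) (v1.Image, error) {
    ref, err := name.ParseReference(imageRef)
    if err != nil {
        return nil, err
    }
    img, err := remote.Image(ref)
    if err != nil {
        return nil, err
    }
    // "/tmp/layer-cache" is an example path, not something this PR defines.
    return cache.Image(img, cache.NewFilesystemCache("/tmp/layer-cache")), nil
}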

@@ -150,6 +150,8 @@ var (

dockerScan = cli.Command("docker", "Scan Docker Image")
dockerScanImages = dockerScan.Flag("image", "Docker image to scan. Use the file:// prefix to point to a local tarball, otherwise a image registry is assumed.").Required().Strings()
dockerCache = dockerScan.Flag("cache", "Use layer caching. Don't re-scan a layer that has already been scanned and is in the layer caching db.").Bool()
dockerCacheDB = dockerScan.Flag("cache-db", "Path to the layer caching database. Default is trufflehog_layers.sqlite3").Default("trufflehog_layers.sqlite3").String()
@rgmz (Contributor) commented on Apr 11, 2024:

If we're persisting data, it may be prudent to make the name more generic in case there's other stuff worth caching in the future.

e.g., caching binary files rather than re-scanning them

@joeleonjr (author) replied:

Are you thinking about data caching outside of the Docker source or within it?

Contributor replied:

I was thinking about data caching in general. I've previously experimented with using sqlite to do things like skip duplicate docker layers, binary files, and commits; cache GitHub API E-Tags; track progress so that scans can be resumed; etc.

Then again, based on @rosecodym's comment about performance it may not be desirable to re-use the same database for multiple purposes.

@rosecodym (Contributor) left a comment:

This looks great! I love those tests.

As we discussed synchronously:

  • There is technically a race condition here that can cause a layer to be erroneously skipped if the chunking code completes before the first found secret is notified. This is a manifestation of a general difficulty with caching within TruffleHog's architecture, but it is unlikely to cause enough problems to outweigh the value of this change.
  • SQLite's default serialization might effectively reduce the concurrency of secret notification to 1 for heavy Docker scans, but since this makes heavy Docker scans so much faster overall, that is almost certainly an acceptable tradeoff.
    • If the concurrency reduction does occur, SQLite has a multithreaded setting that might ameliorate it a bit (a sketch of related settings follows this list).
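
If that contention does show up, here is a minimal sketch of one mitigation, assuming the mattn/go-sqlite3 driver (this uses WAL mode plus a busy timeout rather than SQLite's threading-mode setting, but it targets the same write contention):

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // assumed driver; adjust if the PR uses a different one
)

func openLayerDB(path string) (*sql.DB, error) {
    // WAL lets readers proceed while a writer holds the lock, and the busy timeout
    // makes contending writers wait (up to 5s) instead of failing with SQLITE_BUSY.
    dsn := "file:" + path + "?_journal_mode=WAL&_busy_timeout=5000"
    return sql.Open("sqlite3", dsn)
}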

if err != nil {
    return false, err
}
if !verified && !unverified_with_error {
Contributor commented:

It doesn't look like you need to track these separately - is this implementation speculating about feature requirements?

@joeleonjr (author) replied:

We could consolidate to one var, but I'm assuming some environments will encounter verification errors for certain detectors on an ongoing basis, and they might want to skip those (which would require another CLI flag) in favor of only re-scanning when there is a verified secret. But maybe that's not a frequent enough occurrence to justify storing them separately.

@ahrav (Collaborator) left a comment:

Very interesting, nice work!

// Handle docker layer caching if applicable
if e.dockerCache && r.SourceType == sourcespb.SourceType_SOURCE_TYPE_DOCKER {
    layer := r.SourceMetadata.GetDocker().Layer
    db, err := docker.ConnectToLayersDB(e.dockerCacheDb)
Collaborator commented:

suggestion: I noticed that the code creates and closes a new database connection for each result processed in the notifyResults function. Establishing and tearing down a connection for every result could introduce significant performance overhead, especially if the number of results is large.

To mitigate this, I suggest establishing database connections earlier in the engine's lifecycle and leveraging connection pooling. A pool of pre-established connections lets multiple goroutines share and reuse them efficiently, instead of paying the cost of creating and closing a connection per operation.

Concretely, we could create the pool when initializing the engine; then, for each result, acquire a connection from the pool, perform the database operations, and release it back when finished. This would minimize connection overhead and improve the overall performance of notifyResults.

What are your thoughts? Do you think it's feasible to establish connections earlier and utilize connection pooling in this scenario?

@joeleonjr (author) replied:

As I authored the code, I noticed it was creating/closing db connections a lot, but I honestly wasn't sure of the most efficient way to implement it. I think connection pooling would be a good addition, but I could use a hand with it since I haven't implemented anything like that before.

Collaborator replied:

Sure thing. Maybe we can do something like:

// Add the pool to the engine.

type Engine struct {
  ...
  dbPool *sql.DB
}

// Connect during init.

func WithDockerCache(dockerCache bool, dockerCacheDb string) Option {
    return func(e *Engine) {
        if dockerCache {
            err := e.InitializeDockerCache(dockerCacheDb)
            ...
        }
    }
}

func (e *Engine) InitializeDockerCache(dockerCacheDb string) error {
    var err error
    e.dbPool, err = sql.Open("sqlite3", dockerCacheDb)
    ...

    // Not sure if we want to make these configurable or use defaults.
    // Default could be the number of workers used for notifying?
    e.dbPool.SetMaxOpenConns(10)
    e.dbPool.SetMaxIdleConns(5)
    return nil
}

// Use the pool for operations.

func (e *Engine) updateDockerCache(layer string) error {
    // sql.DB is itself a pool; Conn checks out a single connection from it.
    conn, err := e.dbPool.Conn(context.Background())
    if err != nil {
        return err
    }
    defer conn.Close()

    // Perform database operations using the acquired connection
    // ...

    return nil
}

// InitializeLayersDB initializes the SQLite database with the digest table
// Schema: digest (digest TEXT UNIQUE, verified BOOLEAN, unverified_with_error BOOLEAN, completed BOOLEAN)
// It returns an error if encountered
func InitializeLayersDB(db *sql.DB) error {
Collaborator commented:

question: Would you mind explaining the rationale behind having separate verified and unverified_with_error columns in the database schema? What distinguishes a credential with verified = FALSE from one with unverified_with_error = TRUE?

From what I can gather, the code sets verified to TRUE when a credential is successfully verified, and sets unverified_with_error to TRUE when an error occurs during the verification process. However, the naming of unverified_with_error seems a bit ambiguous to me. Could you clarify what the “with_error” part signifies?

Moreover, I was wondering if we could potentially simplify the schema by using a single verified column to represent the verification status. This way, we could consolidate the UpdateVerified and UpdateUnverified functions into a single UpdateVerificationStatus function that takes the verification status as a parameter.

For example:

func UpdateVerificationStatus(db *sql.DB, digest string, verified bool) error {
    _, err := db.Exec("UPDATE digest SET verified = ? WHERE digest = ?", verified, digest)
    if err != nil {
        return err
    }
    return nil
}

By doing this, we could streamline the code and make it more straightforward.

Of course, I understand that the current schema might be designed to capture more granular information about the verification process. If that’s the case, could you provide some insights into the specific requirements or use cases that necessitate the separate columns?

@joeleonjr (author) replied:

@rosecodym brought up a similar point. It might make sense to consolidate to one. I was just considering the situation where there's an unverified secret due to something like a network error; in that case, we wouldn't want to skip that layer, since the secret could actually be valid. The reason for splitting into two vars is that I could imagine an org blocking certain traffic that keeps causing a verification error, where the user doesn't want to keep re-scanning that layer and would rather only look at verified secrets. But that could be a super small edge case and not worth the added code complexity.

pkg/sources/docker/docker_caching.go (outdated comment thread, resolved)
// SkipDockerLayer will return True iff the layer has been scanned before and no secrets were found.
// This function factors in previously unverified secrets that had errors, since they could still be valid.
// It returns an error if encountered
func SkipDockerLayer(db *sql.DB, digest string) (bool, error) {
Collaborator commented:

question: Do you anticipate this table growing significantly over time? And do you expect the SkipDockerLayer function to be called frequently within the application's workflow?

If so, it might be worthwhile to add a composite index on the (digest, completed, verified, unverified_with_error) columns to optimize the query in SkipDockerLayer.
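
For illustration, a sketch of what that could look like inside InitializeLayersDB (the index name is hypothetical). Since digest is already UNIQUE, SQLite maintains an implicit index on it, so the composite index would mainly act as a covering index for the status columns:

// Hypothetical addition to InitializeLayersDB; idx_digest_status is an illustrative name.
_, err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_digest_status
    ON digest (digest, completed, verified, unverified_with_error)`)
if err != nil {
    return err
}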

@joeleonjr (author) replied:

I'm not sure how large the db will actually get. I can run a test during some upcoming research, but my assumption is very few users will surpass 1M rows. Most will probably be between 10k-20k. If it's worth the performance boost to add a composite index, I'm game, but I could use an assist on doing it properly.

Contributor replied:

I can run some tests against a large Artifactory instance to get some realistic numbers, if you supply the commands.

pkg/sources/docker/docker.go (two resolved comment threads)
@rgmz mentioned this pull request on May 15, 2024