Kevin's Blog

The only way to discover the limits of the possible is to go beyond them into the impossible. - Arthur C. Clarke

Mar 4, 2018 - 5 minute read - Comments - post-mortem

How We Deleted All Our Base Images

The Problem

At my organization, we typically push over 100 images a day. We are running Docker Trusted Registry with an NFS storage backend powered by NetApp ONTAP, which has dedupe, compression, etc. Storage space isn’t a huge concern for us, but we noticed our registry tipping the scales at over 500 GB. This included months of old builds and assets that are of no value to the organization. We realized we needed a strategy to clean up old manifests, but how?

Docker Trusted Registry is an amazing product; at some point I’ll write a blog post on how we’re using some of its features, such as Security Scanning, Notary and Trusts. It allows you to schedule garbage collection, which removes manifests that have no tag associated with them. DTR also sports a robust API, which makes a lot of this stuff trivial (compared to the v2 open-source registry). However, as consumers of the registry, we still have to tell DTR which tags to remove.

We set out to solve two problems initially.

  • Determine which tags/manifests are live to prevent deletion of in-use image tags
  • Determine our retention policy

The first problem was fairly easy to solve, as we could interface directly with UCP to determine what tag/manifests were in operation. The second was simply a “comfortable” number as we’ve never had to roll-back (we try and employ a “roll-forward” mentality).

We missed something

During our first pass with the cleanup script we quickly noticed that some of our older base images, which some services still depended on, were being purged. This exposed a shortcoming in our approach: we had not considered the impact of deleting manifests that were used in the build process for our production images. We remedied the issue by rolling the impacted images up to the most recent published base image. We then set out to discuss our options for preventing this from happening in the future.

Solutions

We came up with a couple of viable solutions.

  • Scrape Dockerfiles and look for tags in the FROM statement
  • Implement a whitelist for “base” images and prevent their deletion for eternity

The first option sounded horrid and we dismissed it because of the sheer amount of overhead it would create (from a compute/network/io/cognitive perspective). The second option sounded more palatable and could be easily implemented within our current environment.
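
A minimal sketch of what the whitelist option could look like, assuming the whitelist is just a set of protected tag names. The type, the function names and the example tags here are hypothetical and are not taken from our actual tool:

```go
package main

import "fmt"

// tagWhitelist is a hypothetical set of tag names that must never be deleted.
type tagWhitelist map[string]struct{}

// newTagWhitelist builds the set from a list of tag names.
func newTagWhitelist(tags ...string) tagWhitelist {
	w := make(tagWhitelist, len(tags))
	for _, t := range tags {
		w[t] = struct{}{}
	}
	return w
}

// protects reports whether at least one of a manifest's tags is whitelisted.
func (w tagWhitelist) protects(manifestTags []string) bool {
	for _, t := range manifestTags {
		if _, ok := w[t]; ok {
			return true
		}
	}
	return false
}

func main() {
	// Example tag names are made up for illustration.
	w := newTagWhitelist("jdk8-base", "alpine-base")

	fmt.Println(w.protects([]string{"build-1234", "jdk8-base"})) // true
	fmt.Println(w.protects([]string{"build-1235"}))              // false
}
```

Note that an empty whitelist protects nothing, so a check like this is only as safe as the configuration that feeds it.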

What happened?

So how did we manage to delete our base images?

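While the whitelist functionality was under development there was an issue loading the configuration file which set the values for settings.DTR.MaxManifestCount, settings.DTR.MaxManifestAge, and settings.DTR.Whitelist.Tags. The configuration service returned 0, 0, and an empty []*string respectively, and that’s where our program started blasting everything it came across. With a maximum count of 0 and a maximum age of 0, no manifest is ever young enough to keep, and an empty whitelist protects nothing, so anything not currently in use gets deleted. Consider the code below: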

Go ahead, delete everything
manifests := registry.GetRegistryRepositoryManifestList(repo.Namespace, repo.Name)
totalCount := len(manifests)

if totalCount <= settings.DTR.MaxManifestCount {
    return
}

keepStack := []string{}
maxAge := time.Duration(settings.DTR.MaxManifestAge) * 24 * time.Hour

// Iterate Manifest
for _, manifest := range manifests {
    // Check if digest is in-use by a service
    if _, inUse := digestStack[manifest.Digest]; inUse {
        keepStack = append(keepStack, manifest.Digest)
        continue
    }

    // Check for rollback use
    if _, inUse := rbDigestStack[manifest.Digest]; inUse {
        keepStack = append(keepStack, manifest.Digest)
        continue
    }

    // Pull tags
    tags := registry.GetRegistryRepositoryTagNamesForDigest(repo, manifest.Digest)

    // Check Whitelist
    keep := false
    for _, tag := range tags {
        if utils.StringInArray(settings.DTR.Whitelist.Tags, *tag) {
            keep = true
            break
        }
    }

    // Check if we keep it
    if keep {
        keepStack = append(keepStack, manifest.Digest)
        continue
    }

    // Check age
    if time.Now().Sub(manifest.CreatedAt) < maxAge {
        if len(keepStack) < settings.DTR.MaxManifestCount {
            keepStack = append(keepStack, manifest.Digest)
            continue
        }
    }

    // Delete it
    if err := registry.DeleteRegistryRepositoryManifest(repo, manifest); err != nil {
        // log error
    }
}

The Aftermath

Our GitLab Slack channel, which announces changes to our many service repositories, started reporting failed builds. Our pipeline page, which is how we promote services into production, began reporting that image manifests had gone missing! The engineer who noticed the fatal issue stopped the job, but not before it wreaked havoc on at least two dozen registries.

There is a silver lining here: we don’t build images on our laptops and push them to the registry. Everything is built by our CI system, so it’s pretty easy to go back and republish images. When we began republishing images, we ran into a problem rebuilding a base image that had Oracle jdk-8u152. Oracle likes to change their download links, so this particular image refused to build.

thanks oracle
RUN curl -jksSLH "Cookie: oraclelicense=accept-securebackup-cookie" \
    "http://download.oracle.com/otn-pub/java/jdk/8u152-b16/aa0333dd3019491ca4f6ddbe78cdb6d0/server-jre-8u152-linux-x64.tar.gz" \
    -o /tmp/java.tar.gz

While we do take hourly snapshots of our registry, we decided that rolling to the latest jdk8 made the most sense vs. restoring from backup. (forced upgrade, ftw) It’s also worth mentioning here that most of the services in this environment are still day 0.

Retrospective

Business Impact

  1. We effectively stopped engineers from being able to check in code and deploy changes into the pipeline for a number of services until we had the new base images published.

Lessons Learned

  1. We should sort out how to best clone the production registry into a development registry environment. If anyone here is a DTR administrator who has some tips on how this might be accomplished with minimal effort, please reach out to me.
  2. There are some mission-critical images that we should consider pushing to a standby registry in case something impacts our production DTR instance. This isn’t the first time we’ve had to rebuild images, and upstream dependencies that fail to download/compile have blocked us.

Action Items

  1. Create custom build plans for critical base and system images which are designed to build nightly but do not push to our registry. This will ensure we always know whether each image can still be rebuilt.
  2. Research feasibility of standing up a fully replicated clone of the registry for development purposes. [self-note: is this overkill?]
  3. Research standing up an offsite registry (v2) for storing base/critical images.