Tooling and workflows for high-output engineering teams (and their stakeholders)[1]

My team at SeaWorld is, depending on how you look at things, either stretched ridiculously thin or incredibly efficient at what we do. We’re responsible for: branded toolchains around our iOS and Android codebases (producing builds for each of our brands on each platform from two distinct codebases); a handrolled API gateway frequently described as “the mobile API”, which caches CMS content and abstracts over several different backend systems; and a Next.js application that gets variously webviewed into the native apps and delivered on its own. We get all that done, delivering no small number of features, and all on a team of about 10 engineers, 1.5 product people, 1 designer, and 1 QA (and no project managers, if you can believe it! Jokes, I want a PM like I want a static compiler for Python). I contend that the only thing that makes this remotely possible is our continuous deployment automation, and I further assert that focusing on engineering productivity is an excellent locus for IC attention as you move up in seniority and into management/the chain of command. Join me for a tour of the last two years of evolution in CI and CD with my brand new team.

I shipped the first version of our backend/API proxy/”mobile API” about two years ago as of the 18th, and was just pleased, looking back through the commit history, to see that the second commit was about CI and that it was working by the fourth commit.

That continuous deployment processes must be stood up before anyone may merge a single commit of application code into trunk is a conviction I’ve held for an extremely long time now. I’ve only ever regretted backing down, and I’ve never regretted sticking to my guns. Backing down has almost always been obnoxiously (if not painfully) expensive in dollars, hours, broken dreams, and shattered promises, while holding the line has only ever been slightly costly in the short term (in dollars, and in the initial anxiety of superiors who start to fidget at about three weeks in with no shiny buttons on the screen). The sweet spot for a greenfield CD rollout, placating nontechnical nerves while laying a foundation for quality at the start of the product, is something like “one button that plays a random fart noise if you’re in audience A and a whinnying pony if you’re in audience B”. I get ahead of myself, though, talking about audiences at this point. Allow me to sketch out my vision for a low-effort, high-value software development and release flow for a team you expect to run at the scale of about 30 engineers.

Principles:

  • automate everything
  • push as much information into tooling as possible
  • seriously, no manual tasks
  • push notifications, pull information
  • configure everything to block merge
  • never merge anything you’re not comfortable shipping today

Tooling

  • Task tracking
  • Version control
  • Deployment
  • Observability
  • Tests and Test Automation
  • Low-code UI test authoring
  • Feature flagging

The components all need to talk to each other; that’s why it takes so long to figure out how to make a minimally viable web out of them. Stamping out a setup like this might not make a lot of sense for a true pre-MVP product (although having just the skeleton in place will make moving at any velocity with more than one person hacking concurrently tractable), but I think the following setup is reasonable to expect of any new development at companies with over 30 engineers, and it should serve an individual team of 8-10 or a team of teams of 30ish. I can’t say further, because I haven’t personally scaled it further.

The most critical relationship to get right is the one between task tracking, version control, and deployment. Task tracking is for sharing what the team is working on, the state of the changes, and where a given change sits in your release flow. If anyone has questions about the state of a ticket, you point them at the task tracker and show them how to traverse from ticket to open pull request (if any), and, once the PR is merged, how to follow its changes as they churn their way through the bowels of your continuous deployment pipelines to land on some poor unsuspecting server or end-user device. To get there, your version control system has to tie into task tracking, registering the relevance of both individual commits and entire pull requests on a ticket. CD systems also have to tie into the task tracker, so they can comment on or set states on a ticket, saying at the least that it, or related changes, have moved from a preview environment to a production environment.
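As a sketch of what that CD-to-tracker tie-in can look like in practice, here’s a hypothetical post-deploy step that comments on a Jira Cloud ticket over its REST API. It assumes Node 18+ for the built-in fetch; the ticket key, the environment variable names, and the idea of parsing the key out of the branch name are all assumptions for illustration, not something these pipelines prescribe.

    // Hypothetical post-deploy hook: comment on the Jira ticket referenced by the
    // change (say, a key like "MOB-123" parsed out of the branch name) so anyone
    // watching the ticket can see where the change sits in the release flow.
    // Assumes JIRA_BASE_URL, JIRA_EMAIL and JIRA_API_TOKEN come from the pipeline.
    async function commentOnTicket(ticketKey: string, environment: string, buildUrl: string): Promise<void> {
      const auth = Buffer.from(`${process.env.JIRA_EMAIL}:${process.env.JIRA_API_TOKEN}`).toString("base64");

      const response = await fetch(`${process.env.JIRA_BASE_URL}/rest/api/3/issue/${ticketKey}/comment`, {
        method: "POST",
        headers: { Authorization: `Basic ${auth}`, "Content-Type": "application/json" },
        // Jira Cloud expects comment bodies in Atlassian Document Format.
        body: JSON.stringify({
          body: {
            type: "doc",
            version: 1,
            content: [{ type: "paragraph", content: [{ type: "text", text: `Deployed to ${environment}: ${buildUrl}` }] }],
          },
        }),
      });

      if (!response.ok) {
        throw new Error(`Jira comment failed: ${response.status} ${await response.text()}`);
      }
    }

    // e.g. commentOnTicket("MOB-123", "preview", "https://pr-123.preview.example.com");

The same shape works for whatever tracker you actually run; the point is that the pipeline, not a human, is the thing doing the telling.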

I insist on getting observability up immediately after continuous deployment if I haven’t managed to push it into the definition of done for the first pass at CD. Observability defines how quickly you can react to changes and how well you understand the state of your system; it can give you ample warning of impending scaling challenges; and the moment you start running more than one service at a time, it gives you additional superpowers: answering which services are involved in a given bug/problem/???, the precise versions of all services that were running when the first symptoms arose, how requests flow through your systems, what’s taking forever, and so on. The vendors providing observability as a service have gotten remarkably good over the last decade, and I wouldn’t bring a product to any kind of scale without using something like Honeycomb or Datadog (I don’t know who the distant third is, but it sure ain’t New Relic). One caveat, though: if a vendor isn’t way deep into the OpenTracing standard…run, don’t walk.
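Since the vendors all speak OpenTelemetry these days (the successor that OpenTracing merged into), getting a Node service emitting traces from day one is genuinely small. The sketch below assumes the @opentelemetry/sdk-node, auto-instrumentations, and OTLP exporter packages, with the collector endpoint supplied via the standard environment variables; the service name is made up.

    // tracing.ts: minimal OpenTelemetry bootstrap for a Node service. Assumes the
    // @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node and
    // @opentelemetry/exporter-trace-otlp-http packages, with the collector endpoint
    // supplied via the standard OTEL_EXPORTER_OTLP_* environment variables.
    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

    const sdk = new NodeSDK({
      serviceName: "mobile-api",                          // hypothetical service name
      traceExporter: new OTLPTraceExporter(),             // reads the endpoint from the environment
      instrumentations: [getNodeAutoInstrumentations()],  // HTTP servers, fetch, DB clients, etc.
    });

    sdk.start();

    // Flush spans on shutdown so the last requests before a deploy aren't lost.
    process.on("SIGTERM", () => {
      sdk.shutdown().finally(() => process.exit(0));
    });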

Tests are a sad necessity at any scale. Don’t be the contractor that delivers without tests. Don’t be the developer that has to be browbeaten into writing tests. And for the love of potatoes, don’t break the build.

Subject of breaking the build, how do we keep that from happening?!

By never merging code into trunk that doesn’t pass tests and QA, naturally!

And how, dear Benjamin, dear Benjamin, dear Benjamin, and how, dear Benjamin, shall we block merge on QA approval?

With DevOps wizards (or your tech lead who’s committed to 10xing their team output through engineering productivity investments)!

Your devops wizards are to update all pipelines such that whenever a pull request is opened, automation kicks into gear and produces a preview build containing only the changes from that PR. For the native builds, use some kind of preflight el-cheapo distribution system that you can poop builds into all day. For API and web application builds, you are free to leverage dynamic DNS and rapid provisioning and all of the automation you set up when you first stood this project up to make ephemeral environments that stand up with a PR and shut down when the PR is merged or closed unmerged (you probably also want to routinely kill any preview environments older than a month, as any change that’s been open that long is dubiously up to date anyway, and in any case that’s a ludicrous amount of time for a change proposal to stay open outside the most exigent circumstances). Use release notes on each native build to connect it back to both the open PR and the work item, as you’re going to have folks looking through that list of builds with one or the other (but never both) in hand. Deployment automation must also circle back to both the open PR and the open ticket, leaving a comment on both (or setting the appropriate status) so that product and QA folks can look at the work as it evolves. For native builds, this is a link to the relevant build in your preview distribution system; for APIs and web applications, this is a link to the ephemeral preview environment that works in QA and Product browsers (SSL errors are tolerable, but no breakage related to SSL; experience speaking, there).
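To make the “circle back to the PR” step concrete, here’s a rough sketch assuming GitHub and its Octokit client; the org, repo, and preview URL pattern are placeholders, and the equivalent exists for whichever VCS host you actually use.

    // Hypothetical "circle back to the PR" step, assuming GitHub and the Octokit
    // client; the org, repo and preview URL pattern are placeholders.
    import { Octokit } from "@octokit/rest";

    async function postPreviewLink(prNumber: number): Promise<void> {
      const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
      const previewUrl = `https://pr-${prNumber}.preview.example.com`;

      await octokit.rest.issues.createComment({
        owner: "example-org",       // placeholder org
        repo: "mobile-api",         // placeholder repo
        issue_number: prNumber,     // PR comments go through the issues API
        body: `Preview environment is up: ${previewUrl}\n(Torn down automatically when this PR is merged or closed.)`,
      });
    }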

This means that you can now add your QA staff as required approvers for all user-facing changes. Obviously, the first thing that you’ll do when setting this all up is specifically exclude them from reviewing anything related to CD machinery.

You’re now in the glorious position of having gotten rid of the archaic notion of dedicated and shared dev, QA, and prod environments. Shared development goes away, as code only ever gets merged into trunk and deployed immediately. You’ll quickly want an environment for regression, but this environment has to be as close to production as possible, ideally connected into production systems but obscured from the public. The upside of standing one of these up is that you can actually run an entire regression suite “against production” before you allow that code to go so far as to hit production.
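As an illustration of pointing a regression suite at that production-adjacent environment, here’s a sketch using Playwright as a stand-in runner (nothing here prescribes one); the base URL variable and the header your edge might check to let staff and CI traffic through are hypothetical.

    // playwright.config.ts: pointing the regression suite at the production-adjacent
    // environment. Playwright is a stand-in runner, and the bypass header is a
    // hypothetical mechanism for keeping the environment obscured from the public.
    import { defineConfig } from "@playwright/test";

    export default defineConfig({
      use: {
        // e.g. https://regression.example.com, sitting next to (and wired into) production
        baseURL: process.env.REGRESSION_BASE_URL,
        // Hypothetical header your edge checks before letting staff/CI traffic through.
        extraHTTPHeaders: { "x-internal-access": process.env.INTERNAL_ACCESS_TOKEN ?? "" },
      },
      retries: 1, // one retry to separate genuine regressions from flakes
    });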

Now, to leverage all of this amazing work we’ve done in automating and improving our release workflow on behalf of our friends in QA, we’re going to get them to write automated tests as well. They’re not going to write code, though, oh no. Not even I am that quixotic.

A number of products on the market today offer no-code UI test authoring. In general, there’s a browser plugin and a browser UI that bridges between windows, so that you can click around in your web or native application in one window and record those actions in a GUI in the test-runner product. Once a test is recorded with its assertions, the general workflow (extracted from looking at a few of these products) is for the QA staff to grab the tests’ public IDs from the test-runner UI and get a dev to include those tests in their pull request. Modern VCS providers nearly all offer tooling for folks to make changes to an open PR without actually having to check the code out, which is perfect for QA staff. They click around in the test-authoring GUI, make a few new tests, grab their IDs, go to the PR, know where to dump in the additional test IDs, and as soon as they “save” (read: commit) that change, CD kicks off a new build including those tests. QA marks their work as done, approves the pull request (or kicks it back), and engineering merges at their convenience.
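Here’s one way that “QA drops test IDs into the PR” handoff can look on disk: a checked-in list of vendor test IDs that QA edits through the VCS web editor, and a CI step that asks the vendor to run them against the preview environment. The vendor client here is a stand-in, not any real product’s API; only the shape of the workflow is the point.

    // ui-tests.ts: a file QA can edit from the VCS web editor without a checkout.
    // The IDs are the public identifiers copied out of the no-code authoring tool.
    export const uiTestIds: string[] = [
      "chk-0d2f31", // hypothetical: checkout happy path
      "nav-77a9c4", // hypothetical: bottom-tab navigation smoke test
    ];

    // ci/run-ui-tests.ts: CI step that runs those tests against the preview environment.
    // runRemoteTest() stands in for whatever your vendor's API or CLI actually exposes.
    import { uiTestIds } from "../ui-tests";

    async function runRemoteTest(testId: string, targetUrl: string): Promise<boolean> {
      // Placeholder: call the vendor's "run this test against this URL" endpoint here.
      throw new Error(`not implemented: run ${testId} against ${targetUrl}`);
    }

    async function main(): Promise<void> {
      const target = process.env.PREVIEW_URL ?? "https://pr-123.preview.example.com";
      const results = await Promise.all(uiTestIds.map((id) => runRemoteTest(id, target)));
      if (results.some((passed) => !passed)) {
        process.exit(1); // fail the check so the PR stays blocked
      }
    }

    main();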

There’s ooooone last component that really makes it all sing: feature flags. You will inevitably (perhaps this is merely an inevitability in the too-busy-to-modernize enterprise) encounter features that are just too much of a pain in the neck to validate in preview or even regression environments, features you simply cannot know will work well until they hit the production environment.

For that case, I suggest feature flags. Any reasonable feature flag provider should give you percentage-based rollouts (handling and persisting which users have been enrolled in the rollout and which have not) and stacked audiences. Stacked audiences allow you to (for example) build a single feature flag for both Android and iOS and ramp the two platforms separately, while at the same time having one audience for employees with the feature turned to 100% no matter which platform they’re on, and a beta group of loyal customers eager to provide feedback on both iOS and Android. This gives you the freedom to say “this feature does not go to general availability on iOS or Android; it goes to 100% availability for employees of the company; and maybe next week we’ll turn it on for our loyal guests beta group on Android, because iOS needs another week or two to incubate”.
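To make “stacked audiences” concrete, here’s a hand-rolled illustration of the evaluation order; a real vendor gives you this (plus persistence, targeting UIs, and client SDKs), and every name and shape below is made up. A deterministic bucket() to pair with it is sketched a couple of paragraphs down.

    // Hand-rolled illustration of stacked audiences. A real vendor gives you this
    // (plus persistence, targeting UIs, and client SDKs); every name here is made up.
    type User = { id: string; email: string; platform: "ios" | "android" };

    type AudienceRule = {
      name: string;
      matches: (user: User) => boolean;
      rolloutPercent: number; // 0-100
    };

    const betaGuests = new Set(["guest-17", "guest-42"]); // hypothetical loyal-customer beta group

    // Rules are evaluated top to bottom; the first matching audience wins.
    const newCheckoutAudiences: AudienceRule[] = [
      { name: "employees",    matches: (u) => u.email.endsWith("@example.com"),                  rolloutPercent: 100 },
      { name: "beta-android", matches: (u) => betaGuests.has(u.id) && u.platform === "android",  rolloutPercent: 100 },
      { name: "beta-ios",     matches: (u) => betaGuests.has(u.id) && u.platform === "ios",      rolloutPercent: 0 },
      { name: "ga-android",   matches: (u) => u.platform === "android",                          rolloutPercent: 0 },
      { name: "ga-ios",       matches: (u) => u.platform === "ios",                              rolloutPercent: 0 },
    ];

    function isEnabled(user: User, audiences: AudienceRule[], bucket: (userId: string) => number): boolean {
      const rule = audiences.find((a) => a.matches(user));
      if (!rule) return false;
      return bucket(user.id) < rule.rolloutPercent; // bucket() maps a user ID to 0-99
    }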

Feature flags give you the freedom to take huge features and deliver them incrementally. The first thing you do is make the flag. The second thing you do is start merging changes behind it. This way you can propose and merge small incremental changes, and QA can sign off on them, validating the entire feature after it’s fully baked, merged into release trunks, and deployed to the production environment. Product staff similarly get to experiment with the feature in their production builds (folks whose jobs aren’t explicitly to perform QA ultimately being somewhat resistant to going through the hoopla of installing and running sample builds) without risking exposing those unbaked features to the general public using the applications.

Any feature flagging vendor should give you full cross-platform support, meaning that backend systems should be able to use the same flag as frontend systems, and that users are bucketed deterministically, so we get strong guarantees that if native (or JS) enrolls an arbitrary user in a feature flag, the rest of our systems using feature-flagged behavior will see that same user as enrolled.
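What deterministic bucketing usually boils down to is hashing a stable key. The sketch below is shaped to plug into the bucket() parameter in the audience example above; it illustrates the idea rather than any particular vendor’s scheme.

    import { createHash } from "node:crypto";

    // Deterministic bucketing sketch, shaped to plug into the bucket() parameter in
    // the audience example above: hash flag key + user ID into a number from 0-99 so
    // that every service and client evaluating the same flag for the same user lands
    // on the same side of the rollout percentage. Real SDKs do something equivalent
    // (often with per-flag salts); the exact scheme here is purely illustrative.
    function bucketFor(flagKey: string): (userId: string) => number {
      return (userId) => {
        const digest = createHash("sha256").update(`${flagKey}:${userId}`).digest();
        return digest.readUInt32BE(0) % 100; // first four bytes, reduced to 0-99
      };
    }

    // e.g. isEnabled(user, newCheckoutAudiences, bucketFor("new-checkout"))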

Let’s put it all together:

  1. Engineers open a pull request
  2. Automation craps a link to a preview of the work in question back on to the PR and relevant work tickets
  3. Tests run
    1. Unit
    2. Integration
    3. UI
  4. Product and QA review, provide feedback
  5. Engineering peers review code, provide feedback
  6. Engineering, Product and QA approve merge to trunk
  7. Code is merged to trunk
  8. Application is rebuilt
  9. Application is deployed to regression environment
  10. Regression tests are run
  11. Everyone personally responsible for the change reviews the change in preproduction (TestFlight, Play Store Beta track, a version of your API/webapp deployed next to but not serving production traffic and still accessible to eng/qa/product staff)
  12. CD waits for approvals from product, qa and engineering
  13. Everyone approves and new binaries go to servers/customer devices

Where to go from here?

I’d love to figure out how to rebuild Jira/ADO/Trello in source control, alongside documentation. It’s a weird kink, but I want to push as much as possible into source control. Tickets, comments, the whole nine yards. Can you imagine? A ticket, all in markdown, where folks who don’t want to get their programming swords out can comment with a web UI, and it handles putting their comments in the right place and attributing them via their logged-in state in the code-viewer UI? And documentation! Documentation should live in (markdown? whatever!) files right next to the code files, so developers can edit the raw markdown while product/QA/bizops get in and document things using the web editor! One of my biggest gripes about Confluence was how it simply doesn’t afford a change flow that looks anything like the one we have over on the raw-text side of the business (and on top of that, they had to reinvent change tracking and build all-new UIs for diffing. The hubris!).

Some of this changes if you go to a k8s-type GitOps model, but not a whole lot. Your deployment binaries are now container images, and the cluster-internal reconciler is now responsible for consuming events from your CD pipeline like “image is done and passes tests” and then committing back to source control a change to the files governing which image should be running in production (a change which engineering then has to approve and monitor the rollout of, etc.) before the cluster coordinator can roll the new image out for the task in question.
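For flavor, here’s roughly what that “commit back to source control” step can look like as a pipeline script; the manifest path, image registry, and branch naming are all made up, and a real setup would more likely lean on your GitOps tooling’s own image-automation machinery.

    // Hypothetical CD step: once the new image passes its tests, rewrite the manifest
    // that pins which image should be running and push that change on a branch for the
    // reconciler (and a human approver) to pick up. Paths, registry and tag are made up.
    import { readFileSync, writeFileSync } from "node:fs";
    import { execSync } from "node:child_process";

    function proposeImageBump(manifestPath: string, newTag: string): void {
      const manifest = readFileSync(manifestPath, "utf8");
      const updated = manifest.replace(
        /(image:\s*registry\.example\.com\/mobile-api:)\S+/,
        `$1${newTag}`,
      );
      writeFileSync(manifestPath, updated);

      const branch = `deploy/mobile-api-${newTag}`;
      execSync(`git checkout -b ${branch}`, { stdio: "inherit" });
      execSync(`git add ${manifestPath}`, { stdio: "inherit" });
      execSync(`git -c user.name=cd-bot -c user.email=cd-bot@example.com commit -m "deploy: mobile-api ${newTag}"`, { stdio: "inherit" });
      execSync(`git push origin ${branch}`, { stdio: "inherit" });
      // A follow-up step opens a pull request from this branch so engineering can
      // approve and monitor the rollout before the reconciler acts on it.
    }

    // e.g. proposeImageBump("deploy/production/mobile-api.yaml", process.env.GIT_SHA ?? "latest");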

This got very didactic around workflows and tooling requirements, which I only 25% expected. I zero percent expected that I wouldn’t actually talk about my team’s journey from our previous state to our current one, but that merely leaves me with the opportunity to write another autobiographical piece.

References

[1] I like that “stakeholder” imparts this image of someone holding a pointy wooden bit to one’s chest…

2 thoughts on “Tooling and workflows for high-output engineering teams (and their stakeholders)”

  1. Funny how the world changes. Your clients want a shiny new button every 3 weeks, mine were blown away (and nervous about) every 3 months. When we went to monthly they just didn’t know what to make of that rate of change.

    So if I get this correctly, speed of implementation plus automation eliminates the need for shared integration testing. Because a new build automatically includes changes since the developer started work, and is run through your integrated regression test. Does QA hold responsibility for ensuring new features are included in the automated regression test after each build?

    1. There are two gates for regression. First, we run tests against the proposed change as it would look when merged into trunk. In the continuous integration (CI) pipeline, this is a “squash and merge”, where we collapse the discrete changes into one omnibus change and then merge that change into trunk only in the context of the testrunner, and not pushed back to the shared trunk branch. We then run test suites (unit, integration, UI) against code that is effectively “trunk plus the proposed changes”. The second gate is that in the continuous deployment (CD) pipeline, responsible for taking the result of a changeset that all humans and automated systems have blessed for merge and deploying it to production systems, we run the full regression suite against the code actually merged into trunk and built as a production artifact. If that set of tests fails, we consider “the build broken”, and we expect the individuals involved to drop everything they’re doing and fix whatever breakage they introduced into the production build.

      “Shared integration testing” goes away almost completely, because of those two steps: 1) all tests are run against the result of the proposed merge and 2) the entire regression suite is run again on the actual artifact heading to the production environment.
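      To make that first gate concrete, the CI step amounts to something like the sketch below, assuming origin/main is trunk and the proposed branch name arrives via an environment variable; branch names and the test command are illustrative.

        // ci/test-as-merged.ts: run the suites against "trunk plus the proposed change"
        // without ever touching the shared trunk branch. Names are illustrative.
        import { execSync } from "node:child_process";

        const sh = (cmd: string) => execSync(cmd, { stdio: "inherit" });

        const prBranch = process.env.PR_BRANCH ?? "feature/some-change"; // provided by CI

        sh(`git fetch origin main ${prBranch}`);      // the current trunk plus the proposal
        sh("git checkout --detach origin/main");      // build on trunk without moving it
        sh(`git merge --squash origin/${prBranch}`);  // collapse the PR into one omnibus change
        sh('git -c user.name=ci -c user.email=ci@example.com commit -m "ci: squashed candidate"');
        sh("npm test");                               // unit/integration/UI suites against the candidate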

To get a bit into the weeds about QA’s involvement: QA uses a no-code UI testing tool to author new UI tests and encode what a few years ago they would have done manually. They then grab the public IDs for those UI tests from the no-code provider, and stack another change on the proposed pull request adding their new tests. When the pull request is merged into trunk, the new test IDs get merged into trunk with the originally-proposed changes.

Consider the scenario: branch A has changesets A.1, A.2, A.3, and introduces new tests from the QA team T.1, T.2, T.3. Branch B has B.1 and B.2, and introduces new tests T.9, T.10, T.11. Branch A gets merged in first, but branch B doesn’t have those changes. Branch B’s changes, written in ignorance of branch A’s changes, actually break the implementation of A. The author of branch B has no idea that this is the case (and a naive automation implementation wouldn’t either), so the first time that anyone would know that we had a broken build would be when the CD pipelines attempted to take the change to the production environments, run tests T.1/T.2/T.3, and report a broken and unreleasable build.

      The hedge we have in place against this scenario is a check to ensure that branch B is in fact the current state of trunk plus the proposed change. Yes this means that engineers have to spend some time “rebasing” their proposed changes on trunk as other engineers’ changes are merged, but a) this shouldn’t be particularly painful and b) if it is it’s almost certainly a result of proposing large changesets and not small changesets, and the pain of rebasing large changes on trunk as other folks get their work in is meant to encourage engineers to keep their changes small and merges into trunk frequent.

      This is where runtime feature flagging comes in: big features go behind a feature flag that gets evaluated at runtime, which lets folks working on big changes that take forever to land merge their work incrementally, without risking breaking the production experience. Once their feature is complete, folks can review it in production by virtue of being logged in with a company email.

      Feature flags add a bit of post-release complexity in that Product and QA now have to work together outside of the engineering merge and release process to determine that a feature is well-enough baked for general availability, and to throw the appropriate feature flag to subject all users to it. Feature flags also add a not insignificant amount of maintenance complexity, as they need ripping out aggressively, and engineering leadership needs to hold the line on deprecating and removing them on a regular cadence.
