My development team uses a number of automated systems to assist in day-to-day development. Jenkins runs tests and packages the code, Dreadnot deploys the code to different environments, and Hubot facilitates communication over IRC. These systems are only platforms for delivery: to automate team processes, custom code is written on top of them. For example, our IRC bot sends out a reminder 15 minutes before our daily standup – this isn’t built into Hubot; for it to work, custom code needs to be written. After enough of this, a team’s development process is held together by a fragile web of one-off code and configured systems.
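
To make the scale of this glue concrete, here is a minimal sketch of what such a standup reminder might look like as a Hubot script. The channel name and schedule are hypothetical, and the loose typing reflects that Hubot doesn’t ship official TypeScript definitions:

```typescript
// standup-reminder.ts – illustrative sketch, not our actual script.
import { CronJob } from "cron";

module.exports = (robot: any) => {
  // Fire at 9:45 every weekday, 15 minutes before a 10:00 standup.
  // "#dev" is a placeholder channel name.
  new CronJob("0 45 9 * * 1-5", () => {
    robot.messageRoom("#dev", "Standup in 15 minutes!");
  }).start();
};
```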

Code that enables team effectiveness is just like any other code. In the vocabulary of the agile software development lifecycle, the development team itself serves as the product manager for this code, which presents some unique challenges. No single individual or focused team is responsible for it; these systems are a secondary concern, necessary for doing work but not intrinsically valuable. Automation can also obscure discussions about the real issues: are discussions happening about team processes, or simply about the tools that automate them? Are people talking through processes or getting lost in implementation details?

I’ve described our acceptance testing setup in the past. A year later we have around 100 acceptance suites and four pre-release environments: a staging environment (which integrates with other staging APIs) and three preproduction environments (which mirror production). Because of the realities of testing against real APIs and running over a thousand browser acceptance tests, we often need to do a bit of manual testing before promoting the build in preproduction using the Jenkins Production_Deploy_OK job. We have heavily customized Jenkins to fit our desired workflow. We have jobs named Chef_Deploy_Production, Bus_Station, Production_Artifacts, and Deploy_Preprod_ORD – each is important in a different way, and wading through the sheer number of jobs can be intimidating. I would estimate that 20 Jenkins jobs are involved in moving code from the Git master branch to our production deploy certification job. New members of the team often see only the jobs as configured in Jenkins, rather than the end purpose the jobs achieve when working together. Compounding this complexity, Jenkins interacts with our IRC bot – certain jobs trigger IRC notifications, and certain bot commands trigger other jobs.
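
Jenkins does at least expose its job list over a JSON API, so a small script can give newcomers a map of what exists. Here is a sketch, assuming a hypothetical Jenkins host and anonymous read access (the tree filter is standard Jenkins API syntax):

```typescript
// list-jobs.ts – sketch: print every Jenkins job and its last-build status.
// The host is a placeholder; authentication is omitted for brevity.
interface JenkinsJob {
  name: string;
  color: string; // Jenkins reports job status as a color, e.g. "blue" = passing
}

async function listJobs(base: string): Promise<void> {
  const res = await fetch(`${base}/api/json?tree=jobs[name,color]`);
  const data = (await res.json()) as { jobs: JenkinsJob[] };
  for (const job of data.jobs) {
    console.log(`${job.name.padEnd(40)} ${job.color}`);
  }
}

listJobs("https://jenkins.example.com").catch(console.error);
```

A listing like this doesn’t remove the complexity, but it’s a cheap way to see the whole pipeline at once instead of clicking through the Jenkins UI job by job.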

I don’t think we’re unique in this: the system supporting our team has gotten huge and complicated, and the more custom our team processes have become, the less they are supported by out-of-the-box tools. When developers talk through the system, the atmosphere is primarily one of confusion: unless you’ve worked closely with the details, all you see is a set of jobs in the Jenkins UI that must succeed; when they fail, things don’t work as expected and it isn’t obvious why. This is because, while these systems enforce processes, the team only sees the mechanisms of the process rather than the process itself.

Additionally, few systems remain successful when set in stone, and discussions around extending these systems can be difficult. Custom code and workflow systems like Jenkins can do almost anything you want, so feasibility offers no guidance in evaluating your approaches. A discussion centered on code tends to stay at the level of which pieces need to be automated: “Well, the system does X, how could we make it do Y?” As a member of an engineering team dealing with code all day, it’s easy to make the discussion about the code and not think about the bigger issues: are we automating practical processes? Do we even understand the problem well enough to invest effort in automating a process? Is this even a process we can successfully add to our existing set of processes?

I find it most helpful to talk through which behaviors you want your systems to encourage, and to automate later, only once you are certain you have the right process. Talking through behaviors means discussing the end purpose of the original process – what it is intended to assist and avoid – before investing in the effort required to automate it. Taking a behavior-specific approach to these questions also defers automation until we have validated that our processes work manually. If a behavior doesn’t work manually, can you really be certain that the only barrier to it working is a lack of automation?

As an example, when changes are pushed to our preproduction environment, we expect the people whose changes are included to come to a decision about whether or not their code works. To support these behaviors, we’ve instrumented our IRC bot with a few different commands (a sketch of one follows the list):

  • a !pipeline command to show which acceptance suites are failing
  • a !preprod command for communication around what people are seeing on the preproduction environment
  • a !force-certify-preprod command for build promotion when there is a failing acceptance suite
  • a !good command for recording that you have manually tested your changes in the preproduction environment, and they are working as expected
  • a !bad command for recording that you have manually tested your changes in the preproduction environment, and they are not working as expected (or are breaking the site)
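
None of these commands is complicated; most are a few lines of glue around Hubot’s standard hooks. As an illustrative sketch (the brain key and reply wording are invented, while robot.hear and robot.brain are real Hubot APIs), !good and !bad might look something like this:

```typescript
// preprod-feedback.ts – hypothetical sketch of the !good/!bad commands.
module.exports = (robot: any) => {
  const record = (verdict: string) => (res: any) => {
    // Persist each tester's verdict in Hubot's brain, keyed by IRC nick.
    const results = robot.brain.get("preprodResults") || {};
    results[res.message.user.name] = { verdict, at: new Date().toISOString() };
    robot.brain.set("preprodResults", results);
    res.send(`Recorded: preprod looks ${verdict} to ${res.message.user.name}.`);
  };

  robot.hear(/^!good$/i, record("good"));
  robot.hear(/^!bad$/i, record("bad"));
};
```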

These commands define the behaviors we expect our developers to follow around the preproduction environment. While this is a lot of commands, prior to bot instrumentation we interacted with Jenkins manually. Rather than run the !force-certify-preprod command, we’d manually trigger the Production_Deploy_OK job through Jenkins. This had to be done with a very specific parameter identifying the build deployed to the preproduction environment, or the job would fail (or worse, certify the wrong build). Not only was this obscure, it didn’t match the behavior we wanted on promotion: run Production_Deploy_OK only after failing acceptance suites had been triaged. People could run the job without any communication with the team, with limited visibility into who had done it, and without clearly listing which test suites were failing. Instrumenting the IRC bot to handle these commands gives us the level of customization needed to support our desired team processes – customization that generic tools such as Jenkins will never support out of the box.
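
Under the hood, !force-certify-preprod reduces to one well-formed HTTP call that the bot can get right every time. A sketch, assuming a hypothetical parameter name and eliding credentials (buildWithParameters is Jenkins’ standard endpoint for parameterized jobs):

```typescript
// certify.ts – sketch: trigger Production_Deploy_OK for the exact build
// currently on preprod. BUILD_NUMBER is an assumed parameter name.
async function forceCertifyPreprod(base: string, build: string): Promise<void> {
  const url =
    `${base}/job/Production_Deploy_OK/buildWithParameters` +
    `?BUILD_NUMBER=${encodeURIComponent(build)}`;
  const res = await fetch(url, { method: "POST" });
  if (!res.ok) {
    throw new Error(`Jenkins rejected the certify request: ${res.status}`);
  }
  console.log(`Certification queued for build ${build}.`);
}
```

The point of wrapping this in a bot command is that the bot, not a human, supplies the build number currently deployed to preprod, so the wrong-parameter failure mode disappears – and the command itself is visible to everyone in the channel.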

Just as you need processes, you need to automate those processes around desired behaviors to grow your team’s effectiveness. The more processes you automate, the greater the disconnect between your processes and the “out of the box” workflows provided by tools like Jenkins, leading your team toward custom instrumentation. Once a custom instrumentation system is in place, it can sometimes seem like an end in itself; however, it is just a mechanism to enforce a process. Extend your processes by reviewing the behaviors that team members are expected to perform, automate them when you know they’re right, and remove systems when they’re no longer useful.