Goin’ back to Windows… Windows… Windows…

Three years ago, I wrote about moving off of Windows/Mac and onto Linux for personal computing and developing.

Three years later, I’ve moved back to Windows.

In that 3-year-old post, which was about habit change and work/life balance, the key features I was looking for in a developer machine were a solid terminal and easier installation/updating of software via package management. The things I use most as a developer these days are Go, a database such as PostgreSQL, docker, a terminal to interact with those things, Jenkins, a web browser, and an IDE of some sort.

In the intervening years, Windows has come a long way. With Windows Subsystem for Linux, Virtualization and top-notch docker support, Visual Studio Code and its myriad of plugins (including Go), Chocolatey, and a few other goodies, Windows has become, for me, just about as enjoyable as Linux and (gasp) probably even more enjoyable than Mac.

So why did I decide to even make this journey in the first place? Two reasons:

  1. a busted Linux laptop
  2. influencers

I was using a dual-booted Samsung Chronos laptop, running Ubuntu. Out of the blue, a few months ago, it became terribly unstable and slow. I simply could not figure it out. Maybe a disk issue? Who knows. I didn’t invest much energy into it because I’ve been shedding Samsung from my life for a few years now and this was an opportunity.

While researching replacement options, I became increasingly enamored with the Lenovo Yoga series, though it was running Windows so… boo. But then a funny thing happened. Internet-famous-to-me people like Jessie Frazelle, Brian Ketelsen, and others had moved to Microsoft to work on Windows-related tech, evangelizing Azure, Go, docker, WSL, etc. Developer friends such as Ray Camden and Sean Corfield were talking about their move back to Windows. And the more I read from them, the more I started to even consider giving Windows another shot.

I liked the hardware of the Yoga a lot, especially for the price, so I took a risk. Worst case scenario was that the Windows experience would suck and I’d just dual-boot a Linux distro on it.

As of now, I’m really glad I’ve made the switch. In the posts that follow, I’ll discuss the various software and configurations I’m currently using as part of this journey.

Next post: Learning about WSL and Windows Automation from Jessie Frazelle

The Curious Case of the Slow Jenkins Job

When I started work this morning, I expected a normal manager day: emails, meetings, shepherding some proposed infrastructure changes through our change management process.

That was not to be. What followed instead was most of the day on the edge of my limited Linux troubleshooting abilities, trying to diagnose performance degradation on our production Jenkins server.

Around 10:30 AM, Andy messaged me:

“Anecdotal Jenkins slowness. Something that regularly takes 3 minutes on my machine takes 18 minutes on Jenkins”

This is a story about troubleshooting.

Prologue: The environment

This story’s main characters are RHEL, Jenkins, New Relic, job-dsl-plugin, and the anti-virus software we run on our linux servers.

It will become obvious within about 30 seconds, but I will confess without shame up front that I am no Brendan Gregg.

Chapter 1: tmux, top, iostat

At the very start of this story, my mindset is: Gather Facts. I don’t know whether I can solve this problem or if I even have the skills (probably not), but before I pull in our sysadmins, I need to be able to satisfactorily articulate the problem and tell them all the things I’ve done to figure out what’s what. So:

If I can’t figure out a performance problem with a mix of New Relic, easily accessible logs, or intuition, I get on the server and get tmux going and pull up a cheat sheet. Once I re-learn how to split windows, I get a few windows going with top and iostat -dmx 5.
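
For reference, the cheat-sheet version of what I end up typing (tmux’s default prefix is Ctrl-b):

tmux new -s debugging         # new named session
# Ctrl-b %        split the current pane vertically
# Ctrl-b "        split the current pane horizontally
# Ctrl-b <arrow>  move between panes
top                           # per-process CPU and memory
iostat -dmx 5                 # extended per-device I/O stats in MB, refreshed every 5 seconds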

I’m trying to whittle down the list of culprits. I’m asking: Are there obvious processes consuming all CPU or RAM? Is I/O through the roof and we have a saturation problem? Basically: what’s consuming resources?

My interpretation of the data I was seeing in New Relic was that in this case, disk and network were the bottlenecks. But after talking more with Andy about what his code was doing, I was able to rule out network.

iostat -dmx 5 didn’t show any excessive waiting, and top confirmed what I saw in New Relic: RAM wasn’t an issue.

Andy’s django-admin command was consuming about 80% CPU. He told me enough about his code that I learned it was writing thousands of files to disk, so I ran watch -n 10 'ls | wc -l' on the directory in question to get a sense of how many files it was writing per second, which was somewhere between 300 and 500. New Relic was showing about 100% CPU utilization.

The other 20% of CPU were 2 antivirus scanning processes. That’s interesting. I went back to New Relic to see what it thought about the running processes, and it was showing those AV processes at around 30% total. Because this is an average over time, that suggested to me that those processes periodically spike much higher than what I was seeing in the moment.

So I relaxed my eyeballs and just let them sit on that tmux split window, switching back and forth between top, iostat, and watch. Then I saw it.

Antivirus spiked to over 100% CPU. iostat showed writes per second dropping, and the numbers I was seeing in watch confirmed a noticeable drop in file writes per second.

And, also: git. There were multiple git processes now running, and taking a suspiciously long time to complete. I say suspicious because while most of our Jenkins jobs pull from one git repo or another, those git commands should be completing very quickly because they’re just updates to previously cloned repositories.

Why would I be seeing long-running git commands and corresponding significant spikes in antivirus process CPU?

Chapter 2: Another long-running Jenkins job

I needed to find out what was running those git processes at that time, so I went into the Jenkins UI and sorted jobs by last success. I found a job whose timing aligned with what I was just seeing and looked at its build history: about 2 minutes per build. That seemed wrong because I know that job well, and it should only take a few seconds.

I ran it and watched the Jenkins console… and I watched it just sit and spin at the git checkout step. This is for a repository that rarely changes. WTH?

Oh: top is going nuts while this is happening. AV CPU spiking.

I then ran the git command in a shell, just to confirm:

$ time /usr/local/bin/git fetch --tags --progress https://github.com/foo/bar.git +refs/heads/*:refs/remotes/origin/* --depth=1
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0

real    0m21.462s
user    0m0.908s
sys     0m2.193s

21 seconds for a git fetch? That’s nuts.

To get another data point, I went onto another Jenkins server in our dev environment and ran the same code. It ran between 2 and 7 seconds. Still way too long, but nothing like I was seeing on this production Jenkins server.

Chapter 3: tailing AV logs

OK, so AV is clearly emerging as the culprit, and I needed to see into it. I mentioned that I’m no Brendan Gregg; I am not the linux process whisperer (though I aspire to that). I’m an old-school “the answer my friend, is always in the logs, the answer is always in the logs” kinda person, so I needed to see AV logs.

I asked a sysadmin how to see what AV was scanning, and he told me the command to run to configure AV to enable slow-scan logging.

This ended up being the key to this whole affair.

I turned on slow-scan logging, kicked off Andy’s job, and started watching those logs.

And, perhaps not surprisingly, nothing jumped out at me. Sure, it was hitting the files that Andy’s job was creating. No surprise there.

I kept watching. At this point, having never seen these logs before, I don’t have any gut or intuition to guide me, so I fall back on old faithful: just look for stuff that doesn’t look like the other stuff you saw a few minutes ago.

Huh. A lot of git objects start showing up, from other Jenkins jobs running and doing their normal fetch thing. I’m not sure how to interpret that because I don’t know if that’s normal or not from AV’s perspective.

Chapter 4: Email!

At this point, I’ve spent a few hours on this and think I have enough information to send to our Linux sysadmins to see if they have any ideas. So I start writing an email.

I recount the condensed version of what I’ve typed above, along with timings from commands and snippets from those AV logs. They look a bit like this:

1498589352.642894 0.016822 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/bb/f755b6e2c4b64d9397144667504c6da2ce8b17
1498589352.720311 0.023426 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/74/f6f21d05c429539a299f3c59d1ea95ed30472b
1498589352.745168 0.022793 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-33140
1498589353.490658 0.014124 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37352
1498589353.964574 0.015037 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37691
1498589354.185565 0.017393 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/69/a840cd5cf06a36f3fff17948e6f9db4ccb9903
1498589354.608213 0.024442 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/24/f44b5ea3b338569de587e4e10d014fe0bb3afa

And then, before I hit send, something that I’ve put in that email jumps out at me. I hadn’t even noticed it when I was reading logs, but being forced to condense my investigation and see it all tightly, for whatever reason, leads my eyes to:

1498589352.745168 0.022793 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-33140
1498589353.490658 0.014124 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37352
1498589353.964574 0.015037 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37691

Wait: what is that /refs/tags stuff? I know this git repo well. It’s the one above that was taking 21 seconds to fetch, and its Jenkins job taking 2 minutes, when it should only be taking seconds. There are no tags in that repo. Why is AV spending time looking at git tags that I don’t even know exist, and does that correlate to the suspicious slowness?

Chapter 5: OMG

I went to that job’s workspace and ran git tag | wc -l and there it was: almost 30,000 tags.

I then went to the Jenkins job configuration screen and saw it: the job was configured to create a per-build git tag. This job runs every 15 minutes. It had accrued almost a year’s worth of per-build tags.

And AV was presumably scanning them all, every build.

I wiped the workspace and re-ran that job.

Under 3 seconds. AV didn’t even have time to show up in top.

Good grief.

Chapter 6: Damage Assessment

My next step was to learn how many Jenkins jobs suffered from a similar condition: configured with per-build tags, and accruing a lot of tags. I whipped up a quick-n-dirty shell script to give me a gut check.

# count git tags in every workspace that contains a .git directory
cd "$JENKINS_HOME/workspace"

for i in * ; do
  if [ -d "$i" ]; then
    if [ -d "$i/.git" ]; then
      echo "$i"
      /usr/local/bin/git --git-dir "$i/.git" tag | wc -l
    fi
  fi
done > tag_counts.txt

cat tag_counts.txt

I see a hundred or so workspaces with a non-zero number of tags, and several with a pretty high number. The most important ones are the ones that run frequently and have a high number of per-build tags, because every time one of those jobs runs, as I just learned, it’s going to spike AV.

Chapter 7: Back to Andy’s job

While attempting to troubleshoot Andy’s job’s performance problem, we stumbled into another one, namely, that frequently run Jenkins jobs configured with a per-build git tag will eventually create so many tags that they will cause AV to spike significantly for a non-trivial amount of time.

And when a job causes AV spikes, it has the side effect of slowing down other jobs.

We still haven’t gotten to the bottom of why Andy’s job runs a lot slower in Jenkins (womp womp womp, I know, what a disappointment!), but we did find a culprit for the sporadic slowdowns. We at least learned that when his job runs noticeably longer than the previous run, it’s because other jobs with the condition described above are running, causing AV to spike and ultimately degrading his job’s disk write performance.

Chapter 8: job-dsl-plugin

You might be wondering: how did you end up creating so many jenkins jobs with per-build tags and not even know it? You’re probably also wondering: why don’t you frequently wipe your workspaces? I’ll answer the first question. The second one I’m going to investigate in due time.

As readers will know, I have written a lot on this blog about job-dsl-plugin. It is amazing and transformed my entire team’s use of Jenkins. Except probably for occasional quickies or experiments, we don’t use the Jenkins UI to create jobs. We use job-dsl.

Some history: Quite a while ago, the default Jenkins git plugin behavior when creating jobs from the user interface was to automatically add the per-build tag. When you created a job and added a git SCM, you had to explicitly remove that behavior. Well, way back then, job-dsl-plugin replicated that behavior in its simplest dsl, such that if you used the simplest dsl, you got the default behavior from the UI:

scm {
  git("https://github.com/foo/bar.git", "master")
}

would result in having the per-build tag added automatically. (I remember the default UI behavior quite well; I learned about the job-dsl behavior today from a Google Groups posting.)

Quite frankly: we did not know this. If you had asked me if our job-dsl jobs had per-build tagging, I’d have looked at our source code and said “of course not; it’s not explicitly configured to do so”.

I learned today that job-dsl-plugin added clear documentation around this last year, but we started using job-dsl before that and apparently just missed that this was happening. Whoops! Our bad. I can understand job-dsl’s reasoning for retaining previous git plugin behavior here, though I do wish they had changed the behavior such that it matched git plugin’s behavior as of 2.0. Right now, the situation is that creating a git checkout with job-dsl using the simple DSL does not match the simple UI behavior. That’s unfortunate.

How will we mitigate these two problems?

Let’s reverse this and talk about the job-dsl problem first and the fact that a fair number of workspaces have a lot of per-build tags second.

First, job-dsl creating per-build tags: in our job-dsl jobs, we use that simple SCM dsl all over the place. So it’s going to take us a bit to fix, but most likely we’ll:

  1. update our jenkins-automation builders to make it as easy as possible to get the scm/git behavior we want (see the sketch after this list).
  2. Once we have that repo updated, we’ll set about changing all our job-dsl jobs to use whatever solution we devise. Note: Imagine having to do this manually in the Jenkins UI across half a dozen Jenkins servers! Screw that noise. Again, I am so glad we use job-dsl. We’ll be able to change all our job-dsl jobs in under an hour.
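
For item 1, the direction we’ll most likely take is job-dsl’s nested git closure instead of the one-liner. Here’s a rough sketch; my understanding from the job-dsl docs is that declaring the extensions block yourself (even an empty one) keeps the legacy per-build tag from being added, but verify that against your job-dsl version:

scm {
  git {
    remote {
      url("https://github.com/foo/bar.git")
    }
    branch("master")
    // no perBuildTag() here: declaring extensions ourselves avoids the
    // implicit per-build tag that the simple git(url, branch) form adds
    extensions {
    }
  }
}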

Second, to resolve all the existing per-build tags, that’s simply a matter of wiping out our workspaces. This is done easily via Scriptler or a Jenkins job using a Groovy build step:

for(item in jenkins.model.Jenkins.instance.items) {
  // skip folders; only wipe workspaces of actual jobs
  if(item.class.canonicalName != 'com.cloudbees.hudson.plugins.folder.Folder') {
    println("Wiping out workspace of job " + item.name)
    item.doDoWipeOutWorkspace()
  }
}

Credit to Vincent Dupain for that script.

Epilogue

My expected manager day turned out to be really different and ultimately fun. I regrettably cancelled one meeting with a colleague, so I’ll need to make up for that. I learned some things and got to play detective for a few hours. I still haven’t solved Andy’s job problem but I got one step closer.

If you want another troubleshooting mystery, here’s one on how long-forgotten settings changes on one computer in the network broke Chocolatey and Powershell on another computer in the network.

I love server-sleuth stories about troubleshooting, and I’d love to hear yours!

Going Serverless: using Twilio and AWS Lambda to make phone calls with an AWS IoT button


Credit

Full credit for this idea and post goes to my colleague Andy, who presented this approach at a recent lunch-and-learn at work. A few months after his presentation, I decided to try replicating what he did. This blog post represents what I learned from that effort. Thanks Andy!

The Goal

For the 3 phones in our house: when one is misplaced, press an Amazon IoT button and have it dial that phone. A single click should call one phone number, double click another, and long click the third. Then the phone would ring, and I would listen for it, and I would find the phone, and that’d be swell.

This was most assuredly not an effort to create something terribly useful. My actual goal was just to learn more about Lambda and IoT, with some practical (ish) real-world utility.

In this post, I will not go into excruciating detail on every single step. I’ll get into the weeds where I had particular trouble or learned something useful. This is not an Introduction to Lambda post. It also assumes some familiarity with AWS and the AWS console.

The Tech

AWS IoT Button

First, an Amazon IoT button. The button supports single, double, and long click, and so ultimately I want my function to respond differently depending on the type of click. These things are ridiculously priced at $20 and get you about 2000 clicks before the battery dies.

Second, AWS Lambda. I used Python for the code (shown below). I also used nficano/python-lambda to deploy the code, although I definitely recommend learning how to use lambda without any such deploy library first.

Third, a Twilio account and phone number. You get $15 credit for a trial account, but I decided just to buy a number ($1/month). Total cost is $1 / month for the number + 1.3 cents per call. Once you create an account, you’ll get a phone number and “sid” and “token” values to use in the code.

Fourth, twimlets. In short, using a twimlet URL within the lambda function was the easiest way I could find to have the phone call say something (in a terrible robo-voice) or play an mp3 when you pick up the phone. More on this later.

Tying it all together: Click Button –> Triggers lambda function –> uses Twilio API to dial a phone number –> that API call uses a twimlet URL that says something (or plays a file) to the person who answers the phone call.

Let’s dig in. I’ll be using Python for this. Andy has a version that uses JavaScript / NodeJS.

Using the Twilio API

Before jumping in to the Lambda bits, let’s look at the meat of the code, which is using the Twilio API to make a phone call. You can do that with a single command.
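
A minimal sketch of that single call with the twilio-python client; the sid, token, and number variables are placeholders you set yourself, and the URL is the hold-music twimlet from Twilio’s sample code:

from twilio.rest import Client

client = Client(twilio_sid, twilio_token)   # credentials from your Twilio dashboard
call = client.calls.create(
    to=to_number,                           # the phone you want to ring
    from_=from_number,                      # your Twilio number
    url="http://twimlets.com/holdmusic?Bucket=com.twilio.music.ambient",
)
print(call.sid)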

pip install the Twilio library and run that in a python shell (set the variables, obviously). In a few seconds, you should get a phone call with a greeting and then some hold music.

The code, draft 1

I mentioned above that I’m using the nficano/python-lambda library to make developing lambda functions a bit simpler. I also mentioned that when you’re just getting started with Lambda — doing your Hello World version — I strongly recommend learning how to do it with the plain old Lambda console before bringing in any tooling.

But once you get that under your belt, a library such as python-lambda really helps.

All of this code lives at https://github.com/marcesher/twilio-lambda, and I’ll walk through it here, modifying it as I go.

First, I created a new local directory, created a virtualenv, and created a requirements.txt file with these contents:
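
Roughly these two lines, assuming the PyPI package names python-lambda and twilio (add python-dotenv or similar if that’s how you choose to load the .env file described below):

python-lambda
twilio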

After activating the virtualenv, I installed those libraries into it:

pip install -r requirements.txt

Second, I ran

lambda init

which generated several files, including skeletons for service.py and some config files.

Third, I knew I’d be configuring environment variables for the Twilio sid and token (again, which you get from your Twilio account once you sign up), the “from” phone number, and the various “to” phone numbers.

I keep the environment variables in a “.env” file:
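
Something along these lines; the variable names are my own placeholders rather than necessarily what’s in the repo:

TWILIO_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_TOKEN=your_auth_token_here
FROM_NUMBER=+15555550100
TO_NUMBER_SINGLE=+15555550101
TO_NUMBER_DOUBLE=+15555550102
TO_NUMBER_LONG=+15555550103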

lambda init also created an “event.json” file and added some dummy variables. That file will eventually become important for local testing when we want to simulate double and long clicks on the button. But I’m going to hold off on that for now.

I then wiped out the stuff in the generated “service.py” and replaced it with this:
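
A sketch of that first draft, not the exact code from the repo; the environment variable names match the placeholder .env above:

import os

from twilio.rest import Client


def handler(event, context):
    # grab credentials and numbers from environment variables
    client = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])

    # dial the "single click" number and play hold music when answered
    call = client.calls.create(
        to=os.environ["TO_NUMBER_SINGLE"],
        from_=os.environ["FROM_NUMBER"],
        url="http://twimlets.com/holdmusic?Bucket=com.twilio.music.ambient",
    )
    return call.sid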

Then, using the python-lambda library, and ensuring that the service.py has the code above, I ran it like so:

Invoke the function locally
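
That is, from the project directory with the virtualenv active, using the command the python-lambda tooling provides:

lambda invoke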

If all goes well, then the number you’ve configured to get called should get called by your Twilio number. If you’re using a trial account, you’ll first hear a kindly gentleman’s voice telling you you’re using a trial account, and then when you press any key on your phone, you’ll hear some hold music. Later in this post, we’ll change that hold music to something more useful.

This lambda invoke thing is just a nice way of running your python function and having it simulate the event object using your event.json file. It’s quite similar to how you’d use the “Test” button in the AWS Lambda console, where you also can create a JSON event object for testing.

Changing the phone call

First, that deceptively simple, single Twilio API call was, for me, infuriating to get working as I wanted. The URL it uses in the sample code is to some hold music, and I wanted to change it to actually say something, which according to the docs means pointing to a URL that returns “Twiml” — Twilio XML — which tells twilio how to behave. Here’s a sample:
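
A minimal TwiML document that just speaks a message looks like this (my own minimal example rather than one copied from the docs):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello from Twilio</Say>
</Response>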

So I figured I’d just host a static XML file on GitHub or S3. But when I tried those URLs in the client call, it failed, with an error indicating that the URL didn’t accept a POST request (GH and S3 only support GET). I got to the point where I thought I might actually have to set up a web server just to serve the damn XML file. But before I did that, I took one more look at the sample code that Twilio provides, and that’s when I noticed that Twimlets thing.

So then I went and checked out Twimlets.com, and lo and behold, there are all manner of handy helpers. I read the one for “Simple Message”, and then used its “Twimlet Generator” UI to create a properly encoded URL that I could then use in my service. Here’s a sample that just says “Hello from twilio” when you pick up the phone:
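
Roughly, the generated URL looks like this; I’m reconstructing the encoding from memory, so verify yours against the Twimlet Generator output:

http://twimlets.com/message?Message%5B0%5D=Hello%20from%20twilio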

Just with that simple message twimlet, you can configure it to say any number of different things, or play any number of mp3s. It’s slick.

Getting it working in Lambda

The AWS Lambda function

With the function working locally — dialing a number with twilio and saying something when I picked up — it was time to start configuring the Lambda function. This is going to be a multi-step affair:

  1. create the function (and IAM role if it doesn’t exist)
  2. configure the environment variables
  3. deploy the code
  4. test in the Lambda console

Wiring up the button, and modifying the service to respond to different types of clicks, will come last.

I used the AWS console to create a new Lambda function named “hello_twilio”. I configured it thusly in the console:

Lambda function configuration

I want to talk about the “Role” stuff. When I first started with Lambda, I got hung up a bit there. Ultimately, it’s just adding an IAM role with appropriate permissions. Here’s the role I created, and which you see in that screenshot above, called “lambda_execution_role”. At a minimum, your role will require CloudWatch and Lambda execution privileges. I threw in a few more for good measure.

Configuring environment variables

On the “Code” tab of the Lambda function, I added environment variables for the twilio sid and token, and from and to numbers. They’re named identically to the variables shown in the .env file, above.

Deploying the function

Because this Lambda function has dependencies — in this case, the Twilio library — you’ll need to bundle your function up as a zip file and upload it to Lambda. The AWS documentation explains how to do this manually. But if you’re using the nficano/python-lambda library, you can use it to do that for you.

First, you’ll want to be sure that the config.yml file created via lambda init has the right values. In my case, those are the values that appear above in the console. Note that “service.handler” means “the handler function in the service.py file”. Deploying with this config will overwrite anything you previously configured in the console, so make sure this file has the correct values:
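
Mine looked roughly like this. The key names are as I remember them from the file lambda init generates, so double-check against your own config.yml; the function name, handler, and role match what’s described above, while the region and runtime are placeholders:

region: us-east-1
function_name: hello_twilio
handler: service.handler
description: Dial a phone via Twilio when the IoT button is pressed
runtime: python2.7
role: lambda_execution_role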

Then, it’s simply a matter of:

lambda deploy

Testing in the Lambda console

With the code working locally, I wanted to then test it in the Lambda console. After ensuring that environment variables were set, I clicked the “Test” button and then inspected the CloudWatch log output below. Any invocation of your Lambda function — from the Test button or from the real IoT button — will stuff the output into CloudWatch. This will become really important in a minute, when I discuss testing the function with the button and trying to change behavior based on the type of button click.

Wiring up the IoT button

The button comes with a tiny manual that walks you through how to activate the button, secure it, and so forth. My experience was that it was quite straightforward and only took a few minutes. Once the button is configured, you’ll be able to add it as a trigger to the Lambda function.

You wire up the IoT button in the Lambda console “Triggers” tab. Click “Add Trigger”, then “AWS IoT”. This will walk you through a wizard where it’ll create some files that you then have to add to your Button.

Once the button is all wired up, you’re ready to press!

You can then view the output in CloudWatch. Additionally, assuming that the code successfully invokes the Twilio API, you can also view the Twilio logs from your Twilio dashboard. This ended up being really helpful for me when I was trying to get that URL working as described above.

Changing behavior based on click type

Once I had the function successfully invoked with the button, I wanted to add one last thing: change the behavior of the function so that it would use a different number for a double and a long click.

This was way harder to figure out than I expected, even though the resultant answer is dead simple.

I simply could not figure out from reading documentation how to have the lambda function respond differently to double and long clicks. I knew they were supported, but no manner of googling “AWS Lambda IoT Double Click” or other such things led me to relevant docs. I went on an “IoT Rules” goose chase for a while, to no avail. Then, on a whim, I decided to just add a print statement of the event object to see if anything was in there.

After adding the print statement, re-deploying, clicking the button, and looking at the logs in CloudWatch, I noticed that there was a “clickType” key in the event object with a value of “SINGLE”. Naturally, I tried double-clicking, then long clicking, and saw in the logs the values of “DOUBLE” and “LONG”. So it ended up being really simple to respond to different click types, but it took a fair bit of time and just dumb luck to figure out how to do it.

And although I admittedly haven’t searched much since getting it working, I still haven’t found the documentation that spells out the clickType being added to the event object by the button.

The final code ended up looking like this:
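
Here’s a reconstruction of where the code ended up; the authoritative version lives in the GitHub repo linked above, and the environment variable names match my placeholder .env from earlier:

import os

from twilio.rest import Client

# map the IoT button's clickType values to the env var holding the number to dial
NUMBERS = {
    "SINGLE": "TO_NUMBER_SINGLE",
    "DOUBLE": "TO_NUMBER_DOUBLE",
    "LONG": "TO_NUMBER_LONG",
}


def handler(event, context):
    # printing the event is what revealed clickType in the CloudWatch logs
    print(event)

    click_type = event.get("clickType", "SINGLE")
    to_number = os.environ[NUMBERS.get(click_type, "TO_NUMBER_SINGLE")]

    client = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])
    call = client.calls.create(
        to=to_number,
        from_=os.environ["FROM_NUMBER"],
        url="http://twimlets.com/message?Message%5B0%5D=Hello%20from%20twilio",
    )
    return call.sid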

Supporting clickType for local invocation and Lambda console testing

Once I figured out that handling different button clicks was simply a matter of a “clickType” key in the event object, it was fairly straightforward to mimic that locally and within the Lambda console.

For local development: that lambda init way up above had created an “event.json” file. I opened that up and replaced the contents, like so:
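
Something like this is enough to simulate a double click; the real button event carries additional fields (serial number and so on), but clickType is all this function reads:

{
  "clickType": "DOUBLE"
}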

Then, lambda invoke will automatically inject that as an entry in the event object.

For testing in the Lambda console, you need to configure the Test Event object. This is under the “Actions” dropdown menu.

Then, when you click the “Test” button in the console, it’ll use that as the event object.

Wrapping up

Again, huge props to Andy for inspiring this little Lambda / Twilio journey. It was a lot of fun.

Finally, I know this post hand-waved over a bunch of stuff (“to wire up the button… wire up the button” 🙂 ). But if you have questions about any of the stuff I’ve left out, please do ask.

PostScript

This started out as mostly just a silly-ish way to learn more about Lambda and IoT. But it certainly impressed the kids, and they get a kick out of pressing the button to annoy us. It’s like Leave it to Beaver: clean, wholesome fun the whole family can enjoy 😉


Jenkins-as-code: comparing job-dsl and Pipelines

In the previous post in this series, I covered my favorite development-time helper: running job scripts from the command line. In this post, I’ll cover the differences between job-dsl and Pipelines, and how I currently see the two living together in the Jenkins ecosystem.

job-dsl refresher

If you’re coming into this post directly, without reading the preceding articles in the series, I strongly encourage you to start at the start and then come back. For the rest of you, a quick refresher:

job-dsl is a way of creating Jenkins jobs with code instead of the GUI. Here’s a very simple example:
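
(This is a representative sketch; the job name, repo URL, and build step are placeholders.)

job('example-job') {
  scm {
    git("https://github.com/foo/bar.git", "master")
  }
  triggers {
    scm('H/15 * * * *')   // poll the repo every 15 minutes
  }
  steps {
    shell('echo "Hello from job-dsl"')
  }
}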

When processed via a Jenkins seed job, this code will turn into the familiar Jenkins jobs you know and love.

What are Pipelines?

Jenkins Pipelines are a huge topic, way more than I am going to cover in a single blog post. So I’m going to give the 30,000-foot view, leave a whole bunch of stuff out, and, I hope, whet your appetite for learning more. For the impatient, skip these next few paragraphs and head straight for the Jenkins Pipeline tutorial.

At its simplest, a Pipeline is very job-dsl-ish: it lets you describe operations in code. But you have to shift your mindset quite a bit from the Freestyle jobs you know well. When configuring a Freestyle job, you have the vast array of Jenkins plugins at your fingertips in the GUI (and job-dsl) — SCM management, Build Triggers, Build Steps, Post-build actions.

Top-level view of a Freestyle job configuration

But with Pipelines, it’s different. You get Build Triggers and your Pipeline definition. But what about the other stuff, you ask? This is where the mindset shift comes in. Those things are no longer configured at the job level, but at the Pipeline level. And plugins are not automatically supported in Pipelines, so currently you get a subset of all available Jenkins functionality in Pipelines.

Top-level view of a Pipeline job configuration

Thus in practice, that means things like git repos, build steps, email, test recording/reporting, publishers, etc are all done — in text — in the Pipeline definition.

Here’s an example that ships with Jenkins:

Example Pipeline script
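
In the same spirit, a minimal scripted Pipeline looks something like this (a sketch rather than a copy of the shipped example; the repo URL and shell commands are placeholders):

node {
  stage('Checkout') {
    git url: 'https://github.com/foo/bar.git'
  }
  stage('Build') {
    sh './build.sh'
  }
  stage('Test') {
    sh './test.sh'
  }
}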

This is kinda sorta like…

Probably confusing, I know. Let’s try to think of it this way: If you’ve read Jez Humble and David Farley’s Continuous Delivery, or have otherwise implemented build/deploy pipelines in Jenkins for years, you already have a solid conceptual sense of pipelines. It’s just that in Jenkins world until rather recently, you probably did this in one of two ways:

  1. Upstream / downstream jobs (possibly in combination with the Delivery Pipeline plugin); or
  2. Via the Build Flow plugin, with independent jobs being orchestrated via a simple text DSL

Either way, you probably had independent Freestyle jobs tied together somehow to make a pipeline.

Well, Jenkins Pipelines still certainly enable you to do that — and I’ll talk specifically about option #2 momentarily — but the big change here is that Pipelines enable you to do all that orchestration in a single job.

Whereas before you might have separate BuildJob, TestJob, and DeployJob tied together in one of the manners above, with Pipelines, you can do all that in a single job, using the concept of Stages to separate the discrete steps of the pipeline.

Cool! What else do I get with this?

Even with the simplest of Pipelines, you get:

  • Durability, to survive Jenkins restarts
  • Pausing for user input
  • Parallelism built in
  • Pipeline snippet generator to help you build pipelines
  • Nice visualization of your Pipelines right in the UI

But wait, there’s more

You also get Travis CI style CI via a Jenkinsfile, and Multibranch pipelines which enable pipeline configuration for different branches of a repo. From the Jenkins Pipeline tutorial: “In a multibranch pipeline configuration, Jenkins automatically discovers, manages, and executes jobs for multiple source repositories and branches.”

In addition, Jenkins Blue Ocean is shaping up to have beautiful visualizations for Pipeline jobs.

Using Pipelines right now, today

Let’s say you still really like loosely coupled jobs that can run independently or together as part of a Pipeline (caveat: I have encouraged this approach for years, and it’s why I’ve long used Build Flow plugin over Build/Delivery Pipeline plugin). Right now, today, you can replace your Build Flow jobs with Pipelines.

In fact, you should do this. From the Build Flow wiki page:

A simple Pipeline script to build other jobs looks like this:
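
(A sketch using the hypothetical job names from earlier; the build step comes from the Pipeline: Build Step plugin.)

stage('Build') {
  build job: 'BuildJob'
}
stage('Test') {
  build job: 'TestJob'
}
stage('Deploy') {
  build job: 'DeployJob'
}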

Overall, pretty similar to Build Flow. And don’t worry, you get all the parallelism, retry, etc. that you’re used to.

I am tremendously grateful for the people who’ve built and maintained the Build Flow plugin over the years. For me, it’s been a cornerstone of continuous delivery, enabling independent, reusable, loosely coupled jobs. Build Flow developers: Thank you!

But it’s time to move on: Pipeline will replace Build Flow.

Do Pipelines replace job-dsl?

Now, to the final question: should Pipelines replace job-dsl?

I believe that’s the wrong question.

job-dsl will be complementary to Pipelines. Even if I were to stop using Freestyle jobs entirely, and build nothing but Pipelines, I’d still use job-dsl to create those jobs. In fact, if you go back to my initial post where I described the problems we were trying to solve when we adopted job-dsl, none of them are solved by Pipelines. In that respect, Pipeline is just another type of job.

A friggin’ awesome type of job, no doubt. I am incredibly excited about Pipelines and look forward to using them more. And here’s how I’ll be building those jobs, as job-dsl has full support for Pipelines:
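
(A sketch of job-dsl’s pipelineJob support; the job name and script path are placeholders.)

pipelineJob('example-pipeline') {
  definition {
    cps {
      // the Pipeline script itself lives in source control alongside the job definition
      script(readFileFromWorkspace('pipelines/example-pipeline.groovy'))
      sandbox()
    }
  }
}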

So what is the right question?

If asking whether Pipelines replace job-dsl is the wrong question, what’s the right question?

I believe it’s:

  1. When should Pipeline replace Freestyle jobs?
  2. When should Pipeline — via Jenkinsfile — replace creating jobs directly in Jenkins (via GUI or job-dsl)?

I’m going to mostly cop out of answering those questions right now, as, for me, the answers are still evolving as I work more with Pipelines.

My initial gut reactions are:

  1. replace Build Flows, as mentioned above
  2. replace Freestyle jobs when there’s no value in running that set of jobs independently
  3. replace Freestyle jobs when you’d benefit from what Multibranch provides
  4. replace Jenkins-built jobs with Jenkinsfile when you have a TravisCI-style workflow that you want to use in Jenkins instead, and you’ve seriously considered the safety and security implications for your organization (my thoughts are in the very earliest stages here)

Next up: encouraging adoption of Jenkins-as-code among teams

In the final planned post in this Jenkins-as-code series, I will address how we encouraged adoption of this approach amongst our development teams. I’ll cover where we succeeded, where we stumbled, and the work I think we still have to do.