When I started work this morning, I expected a normal manager day: emails, meetings, shepherding some proposed infrastructure changes through our change management process.
That was not to be. Instead, I spent most of the day at the edge of my limited Linux troubleshooting abilities, trying to diagnose performance degradation on our production Jenkins server.
Around 10:30 AM, Andy messaged me:
“Anecdotal Jenkins slowness. Something that regularly takes 3 minutes on my machine takes 18 minutes on Jenkins”
This is a story about troubleshooting.
Prologue: The environment
This story’s main characters are RHEL, Jenkins, New Relic, job-dsl-plugin, and the antivirus software we run on our Linux servers.
It will become obvious within about 30 seconds, but I will confess without shame up front that I am no Brendan Gregg.
Chapter 1: tmux, top, iostat
At the very start of this story, my mindset is: Gather Facts. I don’t know whether I can solve this problem or whether I even have the skills (probably not), but before I pull in our sysadmins, I need to be able to articulate the problem satisfactorily and tell them everything I’ve done to figure out what’s what. So:
If I can’t figure out a performance problem with some mix of New Relic, easily accessible logs, and intuition, I get on the server, get tmux going, and pull up a cheat sheet. Once I re-learn how to split windows, I get a few panes going with top and iostat -dmx 5.
I’m trying to whittle down the list of culprits. I’m asking: Are there obvious processes consuming all the CPU or RAM? Is I/O through the roof, pointing at a saturation problem? Basically: what’s consuming resources?
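For concreteness, the two panes were basically just this (the flags are nothing exotic; they’re straight off the cheat sheet):

# one pane: per-process CPU and memory, to spot the obvious hogs
top

# another pane: extended per-device I/O stats in MB/s, refreshed every 5 seconds
iostat -dmx 5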
My interpretation of the data I was seeing in New Relic was that in this case, disk and network were the bottlenecks. But after talking more with Andy about what his code was doing, I was able to rule out network.
iostat -dmx 5 didn’t show any excessive waiting, and top confirmed what I saw in New Relic: RAM wasn’t an issue.
Andy’s django-admin command was consuming about 80% CPU. He told me enough about his code that I learned it was writing thousands of files to disk, so I ran watch -n 10 'ls | wc -l' on the directory in question to get a sense of how many files it was writing per second, which was somewhere between 300 and 500. New Relic was showing about 100% CPU utilization.
The other roughly 20% of CPU was going to two antivirus scanning processes. That’s interesting. I went back to New Relic to see what it thought about the running processes, and it was showing those AV processes at around 30% total. Because that’s an average over time, it suggested to me that those processes periodically spike much higher than what I was seeing in the moment.
So I relaxed my eyeballs and just let them sit on that tmux split window, switching back and forth between top, iostat, and watch. Then I saw it.
Antivirus spiked to over 100% CPU. Writes per second in iostat dropped, and the file counts I was tracking with watch slowed down correspondingly, confirming what iostat was showing.
And, also: git. There were multiple git processes running now, and they were taking a suspiciously long time to complete. I say suspicious because, while most of our Jenkins jobs pull from one git repo or another, those git commands should complete very quickly: they’re just updates to previously cloned repositories.
Why would I be seeing long-running git commands and corresponding significant spikes in antivirus process CPU?
Chapter 2: Another long-running Jenkins job
I needed to find out what was running those git processes at that time, so I went into the Jenkins UI, sorted jobs by last success, and found a job whose timing aligned with what I was just seeing. Its build history showed recent builds taking about 2 minutes. That seemed wrong, because I know that job well and it should take only a few seconds.
I ran it and watched the Jenkins console… and watched it just sit and spin at the git checkout step. This is for a repository that rarely changes. WTH?
Oh: top is going nuts while this is happening. AV CPU spiking.
I then ran the git command in a shell, just to confirm:
$ time /usr/local/bin/git fetch --tags --progress https://github.com/foo/bar.git +refs/heads/*:refs/remotes/origin/* --depth=1
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
real 0m21.462s
user 0m0.908s
sys 0m2.193s
21 seconds for a git fetch? That’s nuts.
To get another data point, I went onto another Jenkins server in our dev environment and ran the same code. It took between 2 and 7 seconds: still way too long, but nothing like what I was seeing on this production Jenkins server.
Chapter 3: tailing AV logs
OK, so AV was clearly emerging as the culprit, and I needed to see into it. I mentioned that I’m no Brendan Gregg; I am not the Linux process whisperer (though I aspire to that). I’m an old-school “the answer, my friend, is always in the logs; the answer is always in the logs” kinda person, so I needed to see the AV logs.
I asked a sysadmin how to see what AV was scanning, and he told me the command to run to configure AV to enable slow-scan logging.
This ended up being the key to this whole affair.
I turned on slow-scan logging, kicked off Andy’s job, and started watching those logs.
And, perhaps not surprisingly, nothing jumped out at me. Sure, it was hitting the files that Andy’s job was creating. No surprise there.
I kept watching. Having never seen these logs before, I had no gut feel or intuition to guide me, so I fell back on old faithful: just look for stuff that doesn’t look like the other stuff you saw a few minutes ago.
Huh. A lot of git objects started showing up, from other Jenkins jobs running and doing their normal fetch thing. I wasn’t sure how to interpret that, because I didn’t know whether that was normal from AV’s perspective.
Chapter 4: Email!
At this point, I’ve spent a few hours on this and think I have enough information to send to our Linux sysadmins to see if they have any ideas. So I start writing an email.
I recount the condensed version of what I’ve typed above, along with timings from commands and snippets from those AV logs. They look a bit like this:
1498589352.642894 0.016822 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/bb/f755b6e2c4b64d9397144667504c6da2ce8b17
1498589352.720311 0.023426 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/74/f6f21d05c429539a299f3c59d1ea95ed30472b
1498589352.745168 0.022793 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-33140
1498589353.490658 0.014124 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37352
1498589353.964574 0.015037 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37691
1498589354.185565 0.017393 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/69/a840cd5cf06a36f3fff17948e6f9db4ccb9903
1498589354.608213 0.024442 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/objects/24/f44b5ea3b338569de587e4e10d014fe0bb3afa
And then, before I hit send, something that I’ve put in that email jumps out at me. I hadn’t even noticed it when I was reading the logs, but being forced to condense my investigation and see it all in one place, for whatever reason, leads my eyes to:
1498589352.745168 0.022793 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-33140
1498589353.490658 0.014124 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37352
1498589353.964574 0.015037 /var/lib/jenkins/workspace/jenkins-job-name-here/.git/refs/tags/jenkins-jenkins-job-name-here-37691
Wait: what is that refs/tags stuff? I know this git repo well. It’s the one above that was taking 21 seconds to fetch, with its Jenkins job taking 2 minutes when it should take only seconds. There are no tags in that repo. Why is AV spending time looking at git tags that I don’t even know exist, and does that correlate with the suspicious slowness?
Chapter 5: OMG
I went to that job’s workspace and ran git tag | wc -l, and there it was: almost 30,000 tags.
I then went to the Jenkins job configuration screen and saw it: the job was configured to create a per-build git tag. This job runs every 15 minutes, which is 96 builds a day, so it had accrued almost a year’s worth of per-build tags.
And AV was presumably scanning them all, every build.
I wiped the workspace and re-ran that job.
Under 3 seconds. AV didn’t even have time to show up in top.
Good grief.
Chapter 6: Damage Assessment
My next step was to learn how many Jenkins jobs suffered from a similar condition: configured with per-build tags, and accruing a lot of tags. I whipped up a quick-n-dirty shell script to give me a gut check.
cd "$JENKINS_HOME/workspace"
for i in * ; do
    if [ -d "$i" ]; then
        if [ -d "$i/.git" ]; then
            echo "$i"
            /usr/local/bin/git --git-dir "$i/.git" tag | wc -l
        fi
    fi
done > tag_counts.txt
cat tag_counts.txt
I saw a hundred or so workspaces with a non-zero number of tags, and several with a pretty high count. The most important ones are those that run frequently and have accrued a lot of per-build tags, because every time one of those jobs runs, as I had just learned, it’s going to spike AV.
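If I were doing this again, I’d probably reshape that output into name-and-count pairs and sort it so the worst offenders float to the top. Something like this (a quick sketch, not what I actually ran; it assumes each workspace’s .git sits at its top level, which was true here):

cd "$JENKINS_HOME/workspace"
for d in */.git ; do
    # strip the trailing /.git to get the workspace name, then count that repo's tags
    printf '%s %s\n' "${d%/.git}" "$(/usr/local/bin/git --git-dir "$d" tag | wc -l)"
done | sort -k2,2 -rn | head -20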
Chapter 7: Back to Andy’s job
While attempting to troubleshoot Andy’s job’s performance problem, we stumbled into another one: frequently run Jenkins jobs configured with a per-build git tag will eventually accrue so many tags that they cause AV to spike significantly for a non-trivial amount of time.
And when a job causes AV spikes, it has the side effect of slowing down other jobs.
We still haven’t gotten to the bottom of why Andy’s job runs so much slower in Jenkins (womp womp womp, I know, what a disappointment!), but we did find one culprit. We at least learned that when his job sporadically runs longer than the previous run, it’s because other jobs with the tag problem described above are running and spiking AV, which ultimately degrades his job’s disk write performance.
Chapter 8: job-dsl-plugin
You might be wondering: how did you end up creating so many Jenkins jobs with per-build tags and not even know it? You’re probably also wondering: why don’t you frequently wipe your workspaces? I’ll answer the first question; the second one I’m going to investigate in due time.
As readers will know, I have written a lot on this blog about job-dsl-plugin. It is amazing and has transformed my entire team’s use of Jenkins. Except for the occasional quickie or experiment, we don’t use the Jenkins UI to create jobs. We use job-dsl.
Some history: quite a while ago, the default Jenkins git plugin behavior when creating jobs from the user interface was to automatically add the per-build tag. When you created a job and added a git SCM, you had to explicitly remove that behavior. Way back then, job-dsl-plugin replicated that behavior in its simplest DSL, so using the simple form got you the same default as the UI:
scm {
    git("https://github.com/foo/bar.git", "master")
}
would result in the per-build tag being added automatically. (I remember the default UI behavior quite well; I learned about the job-dsl behavior today from a Google Groups posting.)
Quite frankly: we did not know this. If you had asked me if our job-dsl jobs had per-build tagging, I’d have looked at our source code and said “of course not; it’s not explicitly configured to do so”.
I learned today that job-dsl-plugin added clear documentation around this last year, but we started using job-dsl before that and apparently just missed that this was happening. Whoops! Our bad. I can understand job-dsl’s reasoning for retaining the old git plugin behavior, though I do wish they had changed it to match the git plugin’s behavior as of 2.0. As it stands, configuring a git checkout with the simple job-dsl DSL does not match the simple UI behavior. That’s unfortunate.
How will we mitigate these two problems?
Let’s take these in reverse and talk about the job-dsl problem first, and the fact that a fair number of workspaces have accumulated a lot of per-build tags second.
First, job-dsl creating per-build tags: in our job-dsl jobs, we use that simple SCM DSL all over the place, so it’s going to take us a bit to fix. Most likely we’ll:
- Update our jenkins-automation builders to make it as easy as possible to get the scm/git behavior we want (see the sketch after this list for the likely shape of it).
- Once we have that repo updated, we’ll set about changing all our job-dsl jobs to use whatever solution we devise. Note: imagine having to do this manually in the Jenkins UI across half a dozen Jenkins servers! Screw that noise. Again, I am so glad we use job-dsl. We’ll be able to change all our job-dsl jobs in under an hour.
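For what it’s worth, my current understanding (from the job-dsl docs and that Google Groups thread, not yet proven out on our servers) is that the closure form of the git DSL doesn’t add the per-build tag unless you explicitly ask for it, so the builder change will probably end up looking something like this:

scm {
    git {
        remote {
            url("https://github.com/foo/bar.git")
        }
        branch("master")
        // No perBuildTag() in an extensions block here, so no tag per build.
        // That's my reading of the current job-dsl docs; verify against the API
        // viewer for your job-dsl and git plugin versions before rolling it out.
    }
}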
Second, resolving all the existing per-build tags is simply a matter of wiping out our workspaces. That’s easily done via Scriptler or a Jenkins job using a Groovy build step:
for(item in jenkins.model.Jenkins.instance.items) {
    if(item.class.canonicalName != 'com.cloudbees.hudson.plugins.folder.Folder') {
        println("Looking at job " + item.name)
        println("Wiping out workspace of job " + item.name)
        item.doDoWipeOutWorkspace()
    }
}
Credit to Vincent Dupain for that script.
Epilogue
My expected manager day turned out to be really different and ultimately fun. I regrettably cancelled one meeting with a colleague, so I’ll need to make up for that. I learned some things and got to play detective for a few hours. I still haven’t solved Andy’s job problem but I got one step closer.
If you want another troubleshooting mystery, here’s one on how long-forgotten settings changes on one computer in the network broke Chocolatey and PowerShell on another computer in the network.
I love server-sleuth stories about troubleshooting, and I’d love to hear yours!