Broken windows in a building instill a sense of abandonment. If they are not _continuously_ fixed, both this perception and the number of broken windows increase.
I spent a fair bit of last week planning for, and then doing, the first software upgrade using the very useful Chef solution from Opscode. A Chef server applies run-lists (or roles) to one or more Chef clients (the client is installed on the Windows boxes). The client connects to the Chef server at specified intervals, checks for any updates to the run-list applied to that client, and runs it.
A run-list is made up of “recipes”, which are written in Ruby. A recipe may install some software, query something, or parse a file. The options really are endless.
I was going to upgrade two applications across 200 Windows boxes, but I decided to do 100 first and see how things went.
One part of the upgrade was un-installing a Java application, which was wrapped up in a Janel Windows service wrapper. There was a recipe which did this.
Now, I know from experience that (even with the correct event handlers) Janel does not react swiftly to a Windows service stop request. I have seen a mixture of slow responses, with the very occasional “this process is not responding” message.
Two roles were going to be applied: an un-install, then a re-install.
From my experience, I was expecting 5% of the Windows boxes under Chef control not to complete the un-install recipe. The reality was a 30% failure rate. These servers needed manual intervention (which was time consuming).
Subsequently, the Chef recipe has been modified to kill the process if it doesn’t respond to the Windows service _stop_ command.
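The stop-then-kill fallback can be sketched in plain Ruby. This is not the actual recipe, just a minimal illustration of the idea. The `stop_or_kill` helper, its three callables and the timeout values are all hypothetical stand-ins for whatever really talks to the Windows service.

```ruby
# Hypothetical helper (not from the real recipe): ask the service to stop,
# wait up to `timeout` seconds, and only kill the process if it is still up.
# `stop`, `running` and `kill` are callables supplied by the caller;
# `sleeper` is injectable so the wait can be skipped in tests.
def stop_or_kill(stop:, running:, kill:, timeout: 30, poll: 1,
                 sleeper: ->(seconds) { sleep(seconds) })
  stop.call                      # polite stop request first
  waited = 0
  while running.call && waited < timeout
    sleeper.call(poll)           # give the service time to shut down
    waited += poll
  end
  return :stopped unless running.call

  kill.call                      # last resort: the service ignored the stop
  :killed
end
```

The return value says which path was taken, so a run can report how many boxes had to be killed rather than stopped cleanly.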
But if I had applied the un-install role to all 200 nodes, we would have lost our “out-of-hours” release slot.
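Doing half the boxes first is a staged rollout, and the same idea can be written down as a rough sketch. Everything here is an assumption for illustration: the `staged_rollout` name, the batch size, the failure threshold and the `apply` callable (which returns true when a node completes the role) are not part of Chef.

```ruby
# Hypothetical sketch: apply a role batch by batch and halt as soon as a
# batch fails more often than we are prepared to tolerate.
def staged_rollout(nodes, batch_size:, max_failure_rate:, apply:)
  failed = []
  attempted = 0
  nodes.each_slice(batch_size) do |batch|
    batch_failed = batch.reject { |node| apply.call(node) }
    failed.concat(batch_failed)
    attempted += batch.size
    if batch_failed.size.to_f / batch.size > max_failure_rate
      # Stop here so the remaining nodes are left untouched.
      return { halted: true, attempted: attempted, failed: failed }
    end
  end
  { halted: false, attempted: attempted, failed: failed }
end

# Simulated run: 200 nodes, 3 in every 10 fail, and we tolerate 5%.
result = staged_rollout((1..200), batch_size: 100, max_failure_rate: 0.05,
                        apply: ->(node) { node % 10 >= 3 })
```

In this simulated run the rollout halts after the first 100 nodes, leaving the other 100 alone until the recipe is fixed.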
What has this got to do with _broken windows_? Well, the Janel Windows service was something which was applied at the last minute, by developers close to a shipping deadline.
If Janel is a broken window, it should be “fixed” before the first upgrade happens. The broken window needs to be fixed when it first appears.
Fixing the broken windows relies on developers understanding the complete system (not just the code), the devOps team (if you have one) understanding what’s being deployed (by Chef), and people talking to each other :0)