The devops approach was meant to give us agile development AND more manageable applications – reducing cost of operations and making apps more reliable. Being close to ops helps, but it is not the full story. It works well for individual applications, but can miss the complex interactions between applications, and between applications and the users or consumers who operate them. The Sprint approach that grew up within Google is good as a starting point, but a more continuous relationship needs to grow out of it.
When I was a system programmer, I worked on a project setting up a HL7 (1) communication network for a group of hospitals. In early configurations, different applications processed HL7 transactions at different rates, so you needed a queuing mechanism to allow backlogs to be solved without losing data. So far, so good. Except, each laboratory application and queuing application was developed separately and often came from different vendors. Even within the same application, each component might have its own commands to collect information on transactions. Developers gave no thought to the operational need for measuring and managing the overall queues once in production. (Developers over here, operations over there).
This hot mess went live.
The phone calls began. Hospitals complaining about laboratory results not getting back to the wards in time. It turned out that some format errors caused transactions to get stuck in the queues (the application code definitely under-delivered in dealing with variances in field content). Operations needed a way to see what was going on between the applications.
An in-house developer created a complex PERL script triggered the various commands, cleaned up the results, and reported on the queues. The first attempt revealed all the details of the queues. Yippee, development solved the problem.
DevOps will only solve part of the problem.
Now operations had ALL the details, but without any meaningful way to interpret and act on it. Operations did not know what to identify as an impending problem. How long a delay was acceptable in a hospital? (Developers over here, operations over there, customers further out.)
Finally, the developers met with some nurses (customers!) to understand the ‘time’ requirement. The goal was that a transaction from labs to bedside should always be faster than the time for a human (generally a nurse) to physically run there.
A rewrite of a script highlighted queues approaching thresholds. Now meaningful thresholds could be set and recommended actions be added in. The stuck transactions were flagged for development to sort out. If queues got too long, the relevant hospital could be informed to switch over to manual delivery of results to wards. The angry phone calls stopped.
- A little background. HL7 is a standard for transfer of clinical information between various healthcare applications.