Author: Paul Spiller
As our next-generation digital banking platform takes shape we’re rethinking our assumptions about tracing and monitoring.
A different kind of software
In 2016 our CTO, Clayton Locke, set the ball rolling on the development of the Engagement Platform, a next-generation financial services platform based on a microservices architecture. The software is accompanied by a new vision for how we produce user interfaces, called UX Evolution, and a bespoke DevOps deployment capability called Continuously Available Deployment.
With so many things changing we’ve had to look again at how we get some important but unsung jobs done. Top of the list? Monitoring.
A different kind of problem
The decentralisation of microservices delivers flexibility, scalability and resilience. It’s a natural fit for our Agile development approach and our federated Client Delivery teams, and it makes deployments smaller, faster, more frequent and less risky.
There’s no better way to architect the Engagement Platform but that doesn’t mean it’s all plain sailing. One of the trickier problems we’ve faced has been figuring out the right way to keep track of what all that decentralised software is doing.
The problems of software monitoring magnify as you fragment a system – and a microservices architecture isn’t just fragmented, it’s fragmented into pieces that are designed to care as little about each other as possible.
In a monolithic architecture the pathways through the system all occur inside a small and tightly controlled collection of software. Crudely, if that software is running then the system is up and if it’s not then the system is down. There are subtleties, of course: “running” doesn’t necessarily imply “running well” or “doing what it’s supposed to do”. Even so, it’s still the most straightforward scenario for software monitoring.
In our microservices architecture the pathways through the system can touch tens (perhaps hundreds one day) of independent and loosely coupled components. The collection of components involved will also change depending on what the software is being asked to do. It’s easier for complex, fragmented systems like this to harbour problems and it’s harder to figure out exactly what’s gone wrong when they happen.
Components can fail because they have a fault themselves, or because of a fault cascading down from an upstream component or dependency. In an environment like that what should you watch, and how? Or, to put it bluntly, if something’s broken how will you know what to restart?
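To make the "what to restart?" question concrete, here is a minimal sketch rather than anything from the Engagement Platform itself. It assumes we already have a map of which component depends on which, plus a set of components currently reporting as unhealthy. The component names and the root_causes helper are hypothetical. The idea is simply that an unhealthy component whose own dependencies are all healthy is a better restart candidate than one that is merely suffering from a fault upstream.

```python
def root_causes(dependencies, unhealthy):
    """Return the unhealthy components that have no unhealthy dependency.

    dependencies maps a component name to the names it calls.
    unhealthy is the set of component names currently failing checks.
    """
    return {
        name
        for name in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(name, ()))
    }


# Hypothetical dependency map, for illustration only.
deps = {
    "web-ui": ["login", "accounts"],
    "login": ["user-store"],
    "accounts": ["ledger"],
    "user-store": [],
    "ledger": [],
}

failing = {"web-ui", "login", "user-store"}
suspects = root_causes(deps, failing)
print(sorted(suspects))  # ['user-store']
```

Here web-ui and login are failing only because user-store is, so user-store is the one to look at first. The sketch ignores circular dependencies and partial failures, both of which a real system would have to handle.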
Of course, the right answer depends on what you want to know and why.
A different kind of monitoring
Our thinking about what to monitor in the Engagement Platform has coalesced around a number of dualities.
Our monitoring will have two audiences: we want to stay on top of what our software is doing, but we also want to create dashboards for customers so that they can see what they want and need to know about what their systems are doing.
We want to get two very different kinds of information from our monitoring too: technical metrics and business metrics. Technical metrics cover things like bandwidth or latency that tell sysadmins about a system’s performance and the stresses being applied to it. Business metrics describe the higher-level functionality and software features such as the number of logins that have occurred over a given period.
Metrics like login numbers give us a view on the software’s business impact, but they’re an important health check too. Just because a system is running and its components are busy, that doesn’t mean it’s OK (like a livelock in a multithreaded system where everything’s working but nothing’s getting done). It’s why we added positive and negative monitoring to our list. I want to know what isn’t happening – if, all of a sudden, nobody’s logging in, I want to know.
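Negative monitoring can be sketched as a check on the absence of events. The example below is an illustration under assumptions, not our implementation: the AbsenceMonitor class, its parameters and the five-minute window are all hypothetical, and it expects event timestamps (in seconds) to be recorded in order.

```python
from collections import deque


class AbsenceMonitor:
    """Flags when fewer than min_events were seen in the last window."""

    def __init__(self, window_seconds, min_events=1):
        self.window_seconds = window_seconds
        self.min_events = min_events
        self._timestamps = deque()

    def record(self, timestamp):
        # Call this whenever the business event (e.g. a login) happens.
        self._timestamps.append(timestamp)

    def is_quiet(self, now):
        # Drop events that have fallen out of the window, then count the rest.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) < self.min_events


logins = AbsenceMonitor(window_seconds=300)
logins.record(1000.0)
logins.record(1100.0)

print(logins.is_quiet(now=1200.0))  # False: two logins in the last 5 minutes
print(logins.is_quiet(now=1500.0))  # True: nobody has logged in recently
```

A sensible threshold depends on the time of day and the customer, so in practice min_events would probably come from historical data rather than a fixed number.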
Perhaps the most important outcome of our monitoring rethink, though, was the realisation that what we started out thinking of as ‘monitoring’ turned into two very different things: ‘monitoring’ and ‘tracing’.
Monitoring is a top-down view of a system, or aspects of it, over time. You’re trying to look at the woods and not the trees.
Tracing is the opposite: it observes how a single request passes through the system. When you’re trying to find out what went wrong in a system where perhaps 15 separate components are working together to get something done, tracing is invaluable.
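The core mechanism behind tracing is small enough to sketch: every request gets an identifier, each component that touches the request records a timed "span" tagged with that identifier, and the spans are collected somewhere they can be stitched back together. The code below is a toy, in-process illustration of that idea only. It is not Zipkin's API, and the span helper, the SPANS list and the component names are all hypothetical.

```python
import uuid
from contextlib import contextmanager
from time import perf_counter

SPANS = []  # stand-in for a real trace collector


@contextmanager
def span(trace_id, component, parent=None):
    """Record how long one component spent on one request."""
    start = perf_counter()
    try:
        yield component
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "component": component,
            "parent": parent,
            "duration_s": perf_counter() - start,
        })


def handle_login(trace_id):
    # Three hypothetical components, each passing the same trace ID along.
    with span(trace_id, "gateway") as gateway:
        with span(trace_id, "login", parent=gateway) as login:
            with span(trace_id, "user-store", parent=login):
                pass  # real work would happen here


trace_id = uuid.uuid4().hex
handle_login(trace_id)

# Spans are recorded as they finish, so the innermost call comes first.
path = [s["component"] for s in SPANS if s["trace_id"] == trace_id]
print(path)  # ['user-store', 'login', 'gateway']
```

In a real microservices system the trace ID would travel between services in request headers, and the spans would be sent to a shared collector rather than a local list.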
A lot of good work has already been done in the field of tracing. Companies that rely heavily on microservices, such as Uber and Airbnb, are lining up behind Twitter’s open source Zipkin tracer, so we’re looking at that with a great deal of interest.
A different kind of storage
In rethinking our approach to monitoring we’ve identified many different types of data we want to access – and that, in turn, has caused us to rethink our approach to storage.
“When relational databases are used inappropriately, they exert a significant drag on application development.”
– Martin Fowler, ThoughtWorks
The CAP theorem has it that a distributed computer system can provide no more than two out of three from Consistency, Availability and Partition Tolerance. Every database makes trade-offs, so the sort of database you need for any given data depends on things like the data’s characteristics, how often you want to read it or write to it, performance, scale, cost and business criticality.
We’re taking that idea of matching the database to the data, known as Polyglot Persistence, and applying it to the different types of data we capture for monitoring and tracing. Tracing demands detailed, time series data. Monitoring can generate huge amounts of data but it doesn’t all need to be kept: it can be safely aggregated and compressed.
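As a rough illustration of what aggregating monitoring data can mean, the sketch below rolls raw (timestamp, value) samples up into fixed-size time buckets and keeps only a summary of each bucket. The downsample function, the 60-second bucket and the sample values are assumptions made for the example, not a description of our storage layer.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Summarise (timestamp, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical latency samples: (seconds since start, milliseconds).
raw = [(0, 10.0), (20, 30.0), (59, 20.0), (60, 40.0), (90, 60.0)]
summary = downsample(raw, bucket_seconds=60)

print(summary[0])   # {'count': 3, 'min': 10.0, 'max': 30.0, 'mean': 20.0}
print(summary[60])  # {'count': 2, 'min': 40.0, 'max': 60.0, 'mean': 50.0}
```

Five samples become two summary rows, and the saving grows with the sampling rate. The trade-off is that the individual samples are gone, which is why this treatment suits monitoring data and not the detailed records that tracing needs.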
Changing the way we think about capturing and storing metrics is just one of the things my team is working on, but it’s a great example of how our decision to adopt a microservices architecture is driving innovation and renewal up and down the Intelligent Environments software stack.