The Other Interface

It ain't pretty, but ignore it at your peril

One of the most succinct definitions of a technical architect is: a technologist who is responsible for a system meeting its Non-Functional Requirements (NFRs).

What are often perceived as the most interesting NFRs relate to performance, stability and availability. However, recently I've been paying a lot of attention to perhaps the least glamorous of all the non-functionals: supportability. In a mature system, the lion's share of the time it takes to fix a fault is taken up by diagnosing where the fault lies. Once you've diagnosed it, fixing it is often trivial (testing the fix less so, but that's a discussion for another day).

So how do you decrease this diagnosis time? It boils down to logging and monitoring. There are some excellent monitoring tools available, and I've seen some good home-grown applications that provide a very informative real-time view of what's going on under the hood of a process (for Java systems, JMX greatly facilitates rolling your own, although you get a lot out of the box with Sun's Java distribution these days). Historical concerns about monitoring tools slowing processes down have all but disappeared: such tools are now used on the most latency-sensitive of trading systems. But while it's relatively easy to recognise a good monitoring tool, a good approach to logging is less self-evident.
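As an illustration of how little code "rolling your own" takes, here's a minimal standard MBean; the class, attribute and object names are all invented for the sake of the example:

    // OrderProcessorMBean.java -- standard MBean convention: the
    // management interface is named after the class, plus "MBean".
    public interface OrderProcessorMBean {
        long getMessagesProcessed();
        long getLastMessageTimestamp();
    }

    // OrderProcessor.java
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.management.ObjectName;

    public class OrderProcessor implements OrderProcessorMBean {
        private final AtomicLong messagesProcessed = new AtomicLong();
        private volatile long lastMessageTimestamp;

        public long getMessagesProcessed() { return messagesProcessed.get(); }
        public long getLastMessageTimestamp() { return lastMessageTimestamp; }

        public void onMessage() {
            messagesProcessed.incrementAndGet();
            lastMessageTimestamp = System.currentTimeMillis();
        }

        public static void main(String[] args) throws Exception {
            OrderProcessor processor = new OrderProcessor();
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                    processor, new ObjectName("com.example:type=OrderProcessor"));
            // Attach jconsole (shipped with the JDK) and the attributes
            // above are readable in real time, no log files involved.
            Thread.sleep(Long.MAX_VALUE); // keep the process alive
        }
    }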

I've encountered dramatically different views on application logging. At one extreme is the view that the log of a healthy long-running process should be short and readable, no bigger than one screen from top to bottom. At the other is the view that a log file should be exhaustive, often gigabytes in size, with carefully designed post-processing scripts (yes, not just grep) used to build a picture of what was going on in the process at a given point in time, or in response to a given event.

The best approach will depend on the nature of your system and how it is supported. I'm currently working on a system supported by several different teams; the development team forms a third or even fourth level of support. Therefore what the system dumps out in its logs feeds into human processes: messages logged at Error level should require manual intervention and possibly escalation to the next level of support, whereas warnings and below should be ignorable.

Everyone who can change the code needs to be aware of this, so a logging policy needs to be defined, published and enforced. Ideally this policy will make your system as close to self-diagnosing as possible. Where this has not been done, the black art of knowing which errors can be ignored, or where to look if a process fails with no log information at all, can hugely increase the support costs of the system: it slows the resolution of support incidents, steepens the learning curve for new joiners to the team, makes testing more difficult, and reduces software quality by hiding or delaying the discovery of bugs.

If there is one principle that applies to all logging policies, it is: don't cry wolf, and don't die quietly. To put it another way:

  • Messages logged as Error / Severe / Fatal should indicate actual problems with the system, and should never be ignorable.
  • When the system fails, log the current state wherever the code gives you the scope to do so (the sketch below shows both rules in code).
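As a minimal sketch of this policy in Java, assuming log4j and some hypothetical trade-booking types (Trade, ValidationException and the booking logic are illustrative, not from any real system):

    import org.apache.log4j.Logger;

    // Hypothetical domain types, just enough to make the policy concrete.
    class Trade {
        final String id;
        final double notional;
        Trade(String id, double notional) { this.id = id; this.notional = notional; }
        public String toString() { return "Trade[id=" + id + ", notional=" + notional + "]"; }
    }

    class ValidationException extends Exception {
        ValidationException(String message) { super(message); }
    }

    public class TradeBooker {
        private static final Logger LOG = Logger.getLogger(TradeBooker.class);

        public void book(Trade trade) {
            try {
                validate(trade);
                persist(trade);
            } catch (ValidationException e) {
                // Anticipated and recoverable: a warning that first-line
                // support can safely ignore -- don't cry wolf.
                LOG.warn("Trade " + trade.id + " rejected: " + e.getMessage());
            } catch (RuntimeException e) {
                // Unexpected: capture the state we hold before propagating,
                // so diagnosis doesn't start from nothing -- don't die quietly.
                LOG.error("Booking failed, state=" + trade, e);
                throw e;
            }
        }

        private void validate(Trade trade) throws ValidationException { /* ... */ }
        private void persist(Trade trade) { /* ... */ }
    }

The point is the asymmetry: anticipated failures are warnings that support can ignore, while anything unexpected dumps the state we hold before the exception propagates.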

This may sound obvious, but I'm finding out the hard way that retrofitting even such a simple logging policy to a mature system can be very costly.

There's perhaps no right answer as to what makes an ideal application log, but there are many wrong answers. The worst of all is to ignore this interface to your system. So define your logging standard at the same time as you define your other non-functional requirements, and enforce it just as rigorously as the system evolves.

About the author

Kenneth Roper is a development team leader at a tier-1 investment bank. He is interested in applications with low-latency requirements or large memory footprints. He spends a lot of time reading garbage collection logs and snow reports.

E-mail: kenneth.roper at codingthearchitecture.com


Re: The Other Interface

I had a similar article brewing now that I'm in the frame for third-line support, and your two points at the end really resonate: don't die quietly, and don't cry wolf.

I'll put my vote in for logging everything, as you always wish you could go back in time and crank up the logging when an unexpected circumstance arises. Certainly don't log it all at the same level, and perhaps not even in the same place. I've favoured separate log files in the past, giving both the low-noise and kitchen-sink editions you describe.
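A sketch of that two-edition setup, sticking with log4j from the Java world (file names and patterns are illustrative; in practice this would live in a log4j.properties file, but the code form shows the idea):

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;
    import org.apache.log4j.RollingFileAppender;

    public class LoggingSetup {
        public static void main(String[] args) throws Exception {
            // The low-noise edition: warnings and errors only, short
            // enough for a human to read top to bottom.
            RollingFileAppender support = new RollingFileAppender(
                    new PatternLayout("%d %-5p %c - %m%n"), "logs/support.log");
            support.setThreshold(Level.WARN);

            // The kitchen-sink edition: everything, intended for
            // post-processing scripts rather than eyes.
            RollingFileAppender full = new RollingFileAppender(
                    new PatternLayout("%d %-5p [%t] %c - %m%n"), "logs/full.log");
            full.setThreshold(Level.DEBUG);

            Logger root = Logger.getRootLogger();
            root.setLevel(Level.DEBUG);
            root.addAppender(support);
            root.addAppender(full);

            root.info("appears in full.log only");
            root.error("appears in both files");
        }
    }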

If at all possible, I'd suggest that if you can grep the log, or otherwise divine the situation from it, you should consider making that logic part of your logging or management interface. In other words, don't leave the black art of interpreting the log output to individuals, and don't just try to capture it in your wiki: treat supportability as a proper requirement and embed it in the application.
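One way that embedding might look, again leaning on JMX (the checks themselves are hypothetical):

    // HealthCheckMBean.java -- the "black art" exposed as an operation.
    public interface HealthCheckMBean {
        String selfDiagnose();
    }

    // HealthCheck.java
    public class HealthCheck implements HealthCheckMBean {
        public String selfDiagnose() {
            StringBuilder report = new StringBuilder();
            // Each check below is something support staff would otherwise
            // do by grepping logs or consulting a wiki page.
            long freeBytes = Runtime.getRuntime().freeMemory();
            if (freeBytes < 16 * 1024 * 1024) {
                report.append("Low heap: ").append(freeBytes).append(" bytes free\n");
            }
            // ...queue depths, last-heartbeat ages, stuck-thread checks...
            return report.length() == 0 ? "OK" : report.toString();
        }
    }

Registered the same way as any other MBean, this turns "is it sick, and why?" into a button in jconsole rather than folklore.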

Re: The Other Interface

We've just been through a similar process on the .NET project I'm involved in. It's a distributed application (ASP.NET web-tier calling through to a set of middle-tier services via WCF) and everything is logged in the Windows Event Log. It's a bit weird moving from the Java world where everything is dumped out via log4j/Commons Logging to working on a system where there are no log files at all, but it kind of makes sense within the context of the environment. There are a couple of things worth mentioning about the current logging policy...
  1. Only errors are logged to the event log and, therefore, we have no informational messages at all. I'm not sure how I feel about this, because it feels like I'm looking at the system with one eye closed. The reasons for this? Partly concerns over logging too much information on boxes that could be compromised by hackers, and partly that nobody wants to fill up the event log with stuff that isn't deemed important. I'd like to say that there's a good management interface, but I've not found it yet. In essence, once the system is running, it's off and running on its own.
  2. Whenever we do log something, we always make sure sensitive information is masked out (starred), for example bank account numbers and passwords. As an aside, I've seen a couple of large e-commerce systems that include customer names, addresses and credit card numbers in their log files. Just imagine what would happen if these fell into the wrong hands! A sketch of a simple masking helper follows.
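For anyone wanting to do the same, here's one way a masking helper could look (the account-number pattern is illustrative; a real one needs to match your actual data formats):

    import java.util.regex.Pattern;

    public class LogMasker {
        // Hypothetical pattern: 12-20 digit account numbers; keep the
        // first and last four digits, star out the middle.
        private static final Pattern ACCOUNT =
                Pattern.compile("\\b(\\d{4})\\d{4,12}(\\d{4})\\b");

        public static String mask(String message) {
            return ACCOUNT.matcher(message).replaceAll("$1****$2");
        }

        public static void main(String[] args) {
            System.out.println(mask("Payment from account 1234567890123456 failed"));
            // prints: Payment from account 1234****3456 failed
        }
    }

Ideally the masking is applied centrally (say, in a wrapping logger or a custom layout) so that no individual call site can forget it.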
