One of the most succinct definitions of a technical architect is: a technologist who is responsible for a system meeting its Non-Functional Requirements (NFRs).
What are often perceived as the most interesting NFRs relate to performance, stability and availability. However, recently I've been paying a lot of attention to perhaps the least glamorous of all the non-functionals: supportability. In a mature system, the lion's share of the time it takes to fix a fault is taken up by diagnosing where the fault lies. Once you've diagnosed it, fixing it is often trivial (testing the fix less so, but that's a discussion for another day).
So how do you decrease this diagnosis time? It boils down to logging and monitoring. There are some excellent monitoring tools available, and I've seen some good home-grown applications which provide a very informative real-time view of what's going on under the hood of a process (for Java systems, JMX greatly facilitates rolling your own, although you get a lot out of the box with Sun's Java distribution these days). Historical concerns about monitoring tools slowing processes down have all but disappeared: such tools are used on the most latency-sensitive of trading systems.
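To give a flavour of how little code rolling your own takes, here's a minimal JMX sketch; the QueueMonitor bean and its depth attribute are invented for illustration. Registering the bean makes the attribute visible, live, in JConsole or any other JMX-aware tool:

```java
// QueueMonitorMBean.java -- the management interface; JMX exposes
// its getters as read-only attributes in any JMX console.
public interface QueueMonitorMBean {
    int getDepth();
}

// QueueMonitor.java -- the implementation, registered at startup.
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class QueueMonitor implements QueueMonitorMBean {

    private volatile int depth; // updated by the application as work arrives

    public int getDepth() {
        return depth;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new QueueMonitor(),
                new ObjectName("myapp:type=QueueMonitor"));
        // The Depth attribute is now visible in JConsole or similar.
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to inspect it
    }
}
```

While it's relatively easy to recognise a good monitoring tool, a good approach to logging is less self-evident.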
I've encountered dramatically different views on application logging, ranging from the view that the log of a healthy long-running process should be short and readable, no bigger than one screen from top to bottom, to the view that a log file should be exhaustive, often gigabytes in size, and that carefully designed post-processing scripts (yes, not just grep) can then be used to build a picture of what was going on in the process at a given point in time, or in response to a given event.
The best approach will depend on the nature of your system and how it is supported. I'm currently working on a system supported by several different teams; the development team forms a third or even fourth level of support. Therefore, what the system dumps out in its logs feeds into human processes: messages logged at Error level should require manual intervention, and possibly escalation to the next level of support, whereas warnings and below should be safe to ignore.
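To make that concrete, here's a sketch, assuming log4j, of what the policy looks like in code; the FeedHandler component and its heartbeat scenario are invented for illustration. The rule it encodes: Error means a human must act (and the message says what to do), Warn means noteworthy but safe to ignore.

```java
import org.apache.log4j.Logger;

// Hypothetical component illustrating the severity policy: Error pages
// the support team, Warn does not.
public class FeedHandler {

    private static final Logger LOG = Logger.getLogger(FeedHandler.class);

    public void onHeartbeatMissed(int missed, int tolerated) {
        if (missed > tolerated) {
            // Unrecoverable without intervention: support must act, so log
            // at Error level, and say what to do, not just what happened.
            LOG.error("Feed silent after " + missed
                    + " heartbeats; failover to backup feed required");
        } else {
            // The system will recover by itself: ignorable, so Warn only.
            LOG.warn("Missed heartbeat " + missed + " of " + tolerated
                    + " tolerated; no action needed yet");
        }
    }
}
```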
Everyone who can change the code needs to be aware of this, so a logging policy needs to be defined, published and enforced. Ideally this policy will make your system as close to self-diagnosing as possible. Where this hasn't happened, the black art of knowing which errors can be ignored, or where to look when a process fails with no log information at all, can hugely increase the support costs of the system: it slows the resolution of support incidents, steepens the learning curve for new joiners in the team, makes testing more difficult, and reduces software quality by hiding or delaying the discovery of bugs.
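Enforcement needn't rest on code review alone. As one sketch of policing the policy automatically, assuming log4j and JUnit (and reusing the invented FeedHandler above), a test can capture log output in memory and fail the build if a healthy code path cries wolf at Error level:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

import junit.framework.TestCase;

// Captures log events in memory so tests can make assertions about them.
class CapturingAppender extends AppenderSkeleton {
    final List<LoggingEvent> events = new ArrayList<LoggingEvent>();

    protected void append(LoggingEvent event) {
        events.add(event);
    }

    public boolean requiresLayout() {
        return false;
    }

    public void close() {
    }
}

public class LoggingPolicyTest extends TestCase {

    public void testHealthyPathDoesNotCryWolf() {
        CapturingAppender appender = new CapturingAppender();
        Logger.getRootLogger().addAppender(appender);
        try {
            // A scenario the system should recover from on its own.
            new FeedHandler().onHeartbeatMissed(1, 3);
            for (LoggingEvent event : appender.events) {
                assertFalse("Healthy path logged at Error level: "
                                + event.getMessage(),
                        event.getLevel().isGreaterOrEqual(Level.ERROR));
            }
        } finally {
            Logger.getRootLogger().removeAppender(appender);
        }
    }
}
```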
If there is one approach which is relevant to all logging policies, it is don't cry wolf, and don't die quietly. To put it another way: if a message appears in the log at Error level, someone should need to act on it; and if the process fails, it should never do so without leaving an Error in the log to say why.
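The "don't die quietly" half is cheap to retrofit at the process boundary. As a minimal sketch (again assuming log4j; the class name is mine), a last-resort handler ensures that even an unanticipated failure leaves an Error in the log rather than disappearing to stderr:

```java
import org.apache.log4j.Logger;

public class LastResortHandler implements Thread.UncaughtExceptionHandler {

    private static final Logger LOG = Logger.getLogger(LastResortHandler.class);

    public void uncaughtException(Thread t, Throwable e) {
        // The one message support will see if all else fails: log it
        // at Error level with the full stack trace.
        LOG.error("Thread " + t.getName() + " died unexpectedly", e);
    }

    public static void main(String[] args) {
        // Install once at startup, before any worker threads are spawned.
        Thread.setDefaultUncaughtExceptionHandler(new LastResortHandler());
        throw new IllegalStateException("simulated unanticipated failure");
    }
}
```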
This may sound obvious, but I'm finding out, at my expense, that applying even such a simple logging policy to a mature system after the fact can be very costly.
There's perhaps no right answer as to what makes an ideal application log; however, there are many wrong answers, and the worst of all is to ignore this interface to your system. So define your logging standard at the same time as you define your other non-functional requirements, and enforce it just as rigorously as the system evolves.