Most of the systems I've worked on in the recent past have been latency- rather than throughput-oriented. However, my current project is definitely throughput-focused and scales horizontally rather than vertically (this is a simplification, but basically correct).
This has led to me making a few errors based on incorrect assumptions. As you may have read, I've been retrofitting performance metrics; we're trying to discover the level of overhead compared to 'real' work, so I've been running a single node on my machine to determine the outputs required for statistics. I wanted to write the results to a database, and the DBA asked me to produce an estimate of the production load before he would create my table structures. Cursing the formality of DBAs, I made some calculations based on the number of tasks and the metrics for each. The number I came up with was huge, and the DBA almost coughed up his breakfast.
I was surprised by the number because I had failed to realise that my local test was NOT representative of the real system. If the application were vertically scaled, the quantity of metrics in production would be (roughly) the same as on my desktop PC, but this application runs across hundreds of grid machines, each of which would output these metrics.
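To make the scale of the mistake concrete, here's the back-of-the-envelope shape of the calculation. The figures below are made up purely for illustration; they are not the real project numbers:

```python
# Illustrative figures only -- not the real numbers from the project.
tasks_per_node_per_hour = 10_000
metrics_per_task = 20

# What I measured locally: a single node.
local_rows_per_hour = tasks_per_node_per_hour * metrics_per_task
print(f"local:      {local_rows_per_hour:,} rows/hour")   # 200,000

# What production actually looks like: hundreds of nodes,
# each emitting the same stream of metrics.
grid_nodes = 400
production_rows_per_hour = local_rows_per_hour * grid_nodes
print(f"production: {production_rows_per_hour:,} rows/hour")  # 80,000,000
```

The per-node estimate is harmless; multiplying it by the node count is what produces the breakfast-threatening figure.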
I had just designed a distributed denial of service attack on our own database server. The solution was quite simple: I got the machines that aggregate the task outputs to also aggregate the performance metrics, which reduced the load by a couple of orders of magnitude.
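A minimal sketch of the idea, assuming the aggregator nodes already see every task's output (all names here are hypothetical, not from the real codebase):

```python
from collections import defaultdict


class MetricAggregator:
    """Rolls up per-task metrics on an aggregator node so the database
    sees one summary row per metric per interval, not one row per task."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)

    def record(self, metric_name, value):
        # Called once per task result -- cheap in-memory accumulation.
        self.counts[metric_name] += 1
        self.totals[metric_name] += value

    def flush(self):
        # Called once per reporting interval; returns one summary row
        # (name, count, mean) per metric, then resets the accumulators.
        rows = [(name, self.counts[name], self.totals[name] / self.counts[name])
                for name in self.counts]
        self.counts.clear()
        self.totals.clear()
        return rows


# One aggregator fronting thousands of tasks:
agg = MetricAggregator()
for _ in range(10_000):
    agg.record("task_duration_ms", 42.0)
for row in agg.flush():
    print(row)  # ('task_duration_ms', 10000, 42.0) -- one row, not 10,000
```

With this shape, the database load scales with the number of aggregators and the flush interval, not with the number of tasks or grid machines.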
I think this is a good example of what happens when you fail to take the architecture of a system into account when writing code. All developers need to be aware of the architecture and how it affects the code they are writing. Does anyone else have similar experiences?