Software Architecture for Developers

What is Significant?

Don't worry, I'm not about to get philosophical on you

Recently I wrote a quick blog about taking metrics for optimisation. I suggested they should only be included if the improvement was significant, but how do you define significant?

You may think that 'significant' is just a matter of opinion, but it actually has a very specific meaning in statistics (see Wikipedia's description). You can have a read through the maths, but it basically comes down to "a result is called statistically significant if it is unlikely to have occurred by chance".

This is really important and something that performance testers and optimisers often forget. For example...

Imagine that you perform some kind of performance test on your system or code. This could be anything (e.g. latency response timings, throughput per time unit, etc.) but we'll assume here that it's units processed in 10 minutes. The figure you get is 20. You spend a day modifying a piece of code you think will affect the performance, retest and get 22. A 10% improvement - pretty good.

You hand the new code over to a colleague who also does a test. She says that it's worse by 5%. Slander! You take it to your boss who says there is no difference...

What we've done is perform three tests on the old and new system. Let's list them and perform seven others as well:

Old: 20 20 22 19 19 21 19 19 19 22
New: 22 19 22 20 20 19 19 19 21 19

Now it's obvious what happened (although it probably was before). Your test does not produce constant figures even without changes. Both sets have a range of 3 (19-22), an average of 20 and a sample variance of about 1.56.
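
If you want to check those figures yourself, here is a minimal Python sketch using only the standard library. The function name summarise is my own, and I'm assuming the sample variance (dividing by n - 1), which gives 14/9 for both sets.

```python
import statistics

# Throughput figures from the two lists above (units processed in 10 minutes).
old = [20, 20, 22, 19, 19, 21, 19, 19, 19, 22]
new = [22, 19, 22, 20, 20, 19, 19, 19, 21, 19]


def summarise(runs):
    """Return (range, mean, sample variance) for a list of run results."""
    spread = max(runs) - min(runs)
    mean = statistics.mean(runs)
    variance = statistics.variance(runs)  # sample variance, divides by n - 1
    return spread, mean, variance


for label, runs in (("Old", old), ("New", new)):
    spread, mean, variance = summarise(runs)
    print(f"{label}: range={spread} mean={mean} variance={variance:.2f}")
```

Both lines of output should show the same range, mean and variance, which is the whole point: the two sets of runs are indistinguishable on these measures.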

If you had performed ten runs on the original code first you would have realised that a single result of 22 for the new code is not significant as it's within the range of the previous figures.
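
As a rough first check (not a proper statistical test), you could simply ask whether a new result falls inside the spread you have already seen for the old code. The helper name within_baseline_range is made up for this sketch.

```python
# Ten runs of the original code, taken from the "Old" list above.
old = [20, 20, 22, 19, 19, 21, 19, 19, 19, 22]


def within_baseline_range(result, baseline_runs):
    """True if a single result lies inside the range already seen for the baseline."""
    return min(baseline_runs) <= result <= max(baseline_runs)


# The first "improved" figure of 22 is no better than the old code's best run.
print(within_baseline_range(22, old))  # prints True
```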

Performing the test multiple times on the new code would increase your confidence that it's a significant change. You can test this statistically (but the maths is beyond the scope of this blog entry).
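
For the curious, here is one possible way to do it without much maths: an exact permutation test on the difference in means. This is my choice of technique for illustration, not the only option (a t-test is another common one), and the function name permutation_p_value is an assumption of this sketch. It pools all the results, tries every way of splitting them into two groups of the original sizes, and reports the fraction of splits whose gap in means is at least as large as the one actually observed.

```python
from itertools import combinations

# Results from the two lists above (units processed in 10 minutes).
old = [20, 20, 22, 19, 19, 21, 19, 19, 19, 22]
new = [22, 19, 22, 20, 20, 19, 19, 19, 21, 19]


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in means.

    Returns the fraction of all splits of the pooled results (into groups of
    the original sizes) whose gap in means is at least as large as observed.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)

    def scaled_gap(sum_a):
        # |mean_a - mean_b| multiplied by n_a * n_b, which avoids division.
        return abs(sum_a * n_b - (total - sum_a) * n_a)

    observed = scaled_gap(sum(a))
    splits = 0
    at_least_as_extreme = 0
    for indices in combinations(range(len(pooled)), n_a):
        splits += 1
        if scaled_gap(sum(pooled[i] for i in indices)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / splits


# Both sets have the same mean, so every split is "at least as extreme".
print(permutation_p_value(old, new))  # prints 1.0
```

A value near 1 says the difference is entirely consistent with chance; a small value (0.05 is a common, if arbitrary, threshold) suggests a real change. Enumerating every split is only practical for small numbers of runs like these; for larger sets you would sample random splits instead.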

Just to leave you with a challenge, the project I'm currently working on has a task that we wish to optimise, but it takes ten hours to run even on a grid of several hundred machines. How do we run realistic, pre-production tests that we know are statistically significant?



Re: What is Significant?

Interesting problem when you have a task that takes so long to run. I am assuming that you don't have a full dev environment at the same specs as production? I think your first task is to get some crude measurements to determine what your main issues are.

Are your CPUs running at >70% all the time? If they are, you're CPU bound, so tuning your code is likely to help.

What's your context switching like? Are you spending all your time dealing with cache misses?

Look at the amount of I/O activity, especially if your CPU usage is low.

Are you using a DB? Most developers write awful SQL and/or don't look after the DB, so there is a fair chance this is a big source of your problems.

If it appears you're CPU bound, optimising individual parts of the task will improve the performance of the whole. If you're heavily context switching or I/O bound then it's time to refactor the architecture. I've never worked with a grid but the same principles apply; it's just all a little more complicated. I'd be interested to know more about what you're doing but I guess you can't give too much away.

Re: What is Significant?

All good questions, Dave! The actual type of work each node does is not really important to the question (you could apply the same problems to a render farm for an animation studio).

The development environment has probably 10% of the total power of the production one, so trying to simulate an eight-hour production run would take a working week. Getting 20 runs to average across would be...

We slice and dice the data so we get something representative we can run on the smaller system. I suppose the animation equivalent would be to render every tenth frame.
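
As a toy illustration of the "every tenth frame" idea, here is a short Python sketch. The frame numbers and the helper name every_nth are hypothetical; the real slicing depends on the data.

```python
def every_nth(items, n, offset=0):
    """Take every n-th item, starting at index `offset`, across the whole list."""
    return items[offset::n]


# Hypothetical film of 100 frames, numbered from 1.
frames = list(range(1, 101))
sample = every_nth(frames, 10)
print(sample)  # prints [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

The sample is a tenth of the size but still spans the whole film rather than just the opening scenes.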

Re: What is Significant?

The approach I would suggest on reading the limited information is as follows.

First, figure out what your bottleneck is before you start tuning or experimenting. You can do this by using a profiler, reading system statistics, squeezing network bandwidth, etc. The bottleneck resource is the one that is continually at or near maximum usage. Once you have done this, design an experiment/test case that you can use to evaluate a possible solution/improvement.

To ensure your experiment is valid, you must ensure that the measurements are representative. The first step is to ensure that your test dataset is representative. For example, if you have a production dataset with 10,000 small problems and 100 large problems, a sample of it taken for testing purposes should contain about 100 small problems for each large one.
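
One way to keep that ratio is to sample the same fraction from each category (stratified sampling). The sketch below is only an illustration: the function name stratified_sample, the tuple layout of a problem and the 10% fraction are all assumptions, not anything from the real system.

```python
import random
from collections import defaultdict


def stratified_sample(items, category_of, fraction, seed=0):
    """Sample the same fraction of items from every category."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for item in items:
        groups[category_of(item)].append(item)

    sample = []
    for members in groups.values():
        # Keep at least one item per category so rare cases are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical production dataset: 10,000 small and 100 large problems.
problems = [("small", i) for i in range(10_000)] + [("large", i) for i in range(100)]

sample = stratified_sample(problems, category_of=lambda p: p[0], fraction=0.1)
small = sum(1 for kind, _ in sample if kind == "small")
large = sum(1 for kind, _ in sample if kind == "large")
print(small, large)  # prints 1000 10
```

A plain random sample of the whole dataset would get the ratio roughly right on average, but with only 100 large problems a small sample could easily contain very few of them.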

Secondly, the size of the dataset is a significant factor in the performance of an IT system because of the big-O efficiency of its algorithms, so using a number of datasets of significantly differing sizes is recommended: for example, a small dataset that a developer can run on their machine in an hour, one that requires the whole night, one that requires a week on one machine, one that takes two machines a day and one that takes four machines a day.
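
With timings from several dataset sizes, one rough way to see how the run time scales is to fit a straight line to log(time) against log(size): a slope near 1 suggests roughly linear behaviour, near 2 roughly quadratic. The sketch below is an assumption-laden illustration; scaling_exponent is a made-up name and the timings are generated from a formula purely to show the function working, not measured from anything.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic timings that grow with the square of the dataset size.
sizes = [100, 1_000, 10_000]
times = [3e-6 * n ** 2 for n in sizes]
print(round(scaling_exponent(sizes, times), 3))  # close to 2 for this synthetic data
```

Real measurements are noisy, so the fitted slope is only a hint, and it says nothing about effects (caching, I/O, network) that only appear at production scale.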

Each test case should be run a number of times on each dataset so that you can determine systematic errors, variance, standard deviation, etc. You then have a picture of which possible solutions are actually likely to improve the situation.
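
A minimal sketch of that idea in Python, assuming each candidate solution can be called as a function on a dataset. The names measure and build_matrix, the two candidates and the two datasets are all hypothetical stand-ins.

```python
import statistics
import time


def measure(func, data, repeats=5):
    """Run func(data) `repeats` times; return (mean, stdev) of wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


def build_matrix(candidates, datasets, repeats=5):
    """Measure every candidate on every dataset, keyed by (candidate, dataset)."""
    return {
        (cand_name, data_name): measure(func, data, repeats)
        for cand_name, func in candidates.items()
        for data_name, data in datasets.items()
    }


# Hypothetical stand-ins for the real solutions and datasets.
candidates = {
    "baseline": sorted,
    "candidate": lambda xs: sorted(xs, reverse=True),
}
datasets = {
    "small": list(range(1_000)),
    "medium": list(range(100_000)),
}

matrix = build_matrix(candidates, datasets)
for (cand_name, data_name), (mean, stdev) in matrix.items():
    print(f"{cand_name:10} {data_name:8} mean={mean:.6f}s stdev={stdev:.6f}s")
```

Comparing the means only makes sense alongside the standard deviations: if two candidates differ by less than the run-to-run spread, more repeats are needed before drawing a conclusion.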

Re: What is Significant?

Building a matrix of results as you suggest could work quite well - it should enable a 'line of representative/best/worst performance' to be plotted. My only fear is the level of grid configuration I would need to do!

Re: What is Significant?

If you haven't read this article (you probably have though) http://labs.google.com/papers/mapreduce.html it's an interesting read on an alternative architecture to the typical Java app. These kinds of problems are going to become much more frequent as architectures move from raw MHz to massive parallelism to improve performance. It's an interesting area that far too few people are experienced in. What is your performance goal? Could you get a sensible metric by breaking the application down into its component parts and looking at each component in turn? Good luck!

Re: What is Significant?

Running a cut-down version of the test is an option. Statistical significance is usually based on very cut-down samples backed by the law of large numbers or the central limit theorem. Alternatively a "sufficiently" accurate simulation could be created based on knowledge of the workings of the device under test.

The hard part is the creation of such a test which is statistically representative, particularly one which isn't simply a reflection of your preconceptions of where a bottleneck is likely to be!

A combination of real-world measurements and a simulation that possesses the same characteristics may well be good enough to aid the development of an optimisation which can be verified in a production-like environment without burning weeks of trials.

Sounds like a good example of software engineering as opposed to computer science!

Re: What is Significant?

To continue the animation example... If you chose the first five minutes of the film to test a new rendering algorithm on, you'd have to be careful in case the more complex objects appeared towards the end of the film!
