What is Significant?
Don't worry, I'm not about to get philosophical on you
Recently I wrote a quick blog post about taking metrics for optimisation. I suggested that optimisations should only be included if the improvement was significant, but how do you define significant?
You may think that 'significant' is just a matter of opinion, but it actually has a very specific meaning in statistics (see Wikipedia's description). You can have a read through the maths, but it basically comes down to "a result is called statistically significant if it is unlikely to have occurred by chance".
This is really important and something that performance testers and optimisers often forget. For example...
Imagine that you perform some kind of performance test on your system or code. This could measure anything, e.g. response latency or throughput per unit of time, but we'll assume here that it's units processed in 10 minutes. The figure you get is 20. You spend a day modifying a piece of code you think will affect the performance, retest and get 22. A 10% improvement - pretty good.
You hand the new code over to a colleague who also does a test. She says that it's worse by 5%. Slander! You take it to your boss who says there is no difference...
What we've done is perform three tests on each of the old and new systems. Let's list them and perform seven others as well:
Old: 20 20 22 19 19 21 19 19 19 22
New: 22 19 22 20 20 19 19 19 21 19
Now it's obvious what happened (although it probably was before). Your test does not produce constant figures even without changes. Both sets of results have a range of 3 (19 to 22), an average of 20 and a sample variance of about 1.56.
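If you want to check those figures yourself, here is a minimal Python sketch using only the standard library. The function name summarise_runs is my own choice for illustration, and I'm assuming the variance quoted is the sample variance (dividing by n - 1), which comes out at 14/9, roughly 1.56, for both sets.

```python
import statistics

# The ten results for each version, copied from the lists above.
old = [20, 20, 22, 19, 19, 21, 19, 19, 19, 22]
new = [22, 19, 22, 20, 20, 19, 19, 19, 21, 19]


def summarise_runs(runs):
    """Return the range, mean and sample variance of a list of results."""
    return {
        "min": min(runs),
        "max": max(runs),
        "range": max(runs) - min(runs),
        "mean": statistics.mean(runs),
        # statistics.variance is the sample variance (divides by n - 1).
        "variance": statistics.variance(runs),
    }


for label, runs in (("Old", old), ("New", new)):
    print(label, summarise_runs(runs))
```

Both lines of output should show the same range, mean and variance, which is the whole point: the two versions are indistinguishable on these numbers.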
If you had performed ten runs on the original code first you would have realised that a single result of 22 for the new code is not significant as it's within the range of the previous figures.
Performing the test multiple times on the new code would increase your confidence that it's a significant change. You can test this statistically (but the maths is beyond the scope of this blog entry).
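For anyone who wants to try a test without the maths, one option is a permutation test. This is just one of several possible techniques, not necessarily the best one for your data, and the function name and parameters below are my own. The idea: if the code change made no difference, the labels "old" and "new" are arbitrary, so shuffle them many times and count how often the shuffled difference in averages is at least as big as the one you actually saw.

```python
import random


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


old = [20, 20, 22, 19, 19, 21, 19, 19, 19, 22]
new = [22, 19, 22, 20, 20, 19, 19, 19, 21, 19]
print(permutation_p_value(old, new))
```

For the two lists above the averages are identical, so every shuffle is at least as extreme as what was observed and the p-value is 1: no evidence at all of a change. A small value (0.05 is a common but arbitrary threshold) would suggest the difference is unlikely to be chance.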
Just to leave you with a challenge: the project I'm currently working on has a task that we wish to optimise, but it takes ten hours to run even on a grid of several hundred machines. How do we run realistic, pre-production tests that we know are statistically significant?
Re: What is Significant?
The development environment is probably 10% of the total power of the production one, so trying to simulate an eight hour production run would take a working week. Getting 20 runs to average across would be...
We slice and dice the data so we get something representative we can run on the smaller system. I suppose the animation equivalent would be to render every tenth frame.
Re: What is Significant?
First figure out what your bottleneck is before you start tuning or experimenting. You can do this by using a profiler, reading system statistics, squeezing network bandwidth, etc. The bottleneck resource is the one that is continually at or near maximum usage. Once you have done this, design an experiment or test case that you can use to evaluate a possible solution or improvement.
To ensure your experiment is valid, you must ensure that the measurements are representative. The first step is to make sure that your test dataset is representative. For example, if your production dataset has 10,000 small problems and 100 large problems, a sample taken for testing should contain about 100 small problems for each large one.
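One way to keep that 100-to-1 ratio is to sample each category separately (stratified sampling). The sketch below is only an illustration: the function stratified_sample, the key function and the ("small", i) tuples are all made up here, and a real dataset would need a sensible way of deciding what counts as small or large.

```python
import random


def stratified_sample(items, key, fraction, seed=0):
    """Sample the same fraction from every category returned by key(item)."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        # Keep at least one item so no category disappears from the sample.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical stand-in for a production dataset: 10,000 small, 100 large.
production = [("small", i) for i in range(10_000)]
production += [("large", i) for i in range(100)]

test_set = stratified_sample(production, key=lambda p: p[0], fraction=0.01)
print(sum(1 for p in test_set if p[0] == "small"),
      sum(1 for p in test_set if p[0] == "large"))
```

A 1% sample taken this way contains 100 small problems and 1 large one, so the proportions of the full dataset are preserved.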
Secondly, the size of the dataset is a significant factor in the performance of an IT system because of the big-O efficiency of its algorithms, so using a number of datasets of significantly differing sizes is recommended. For example: a small dataset that a developer can run on their machine in an hour, one that requires the whole night, one that requires a week on one machine, one that takes two machines a day and one that takes four machines a day.
Each test case should be run a number of times on each dataset so that you can determine systematic errors, variance, standard deviation, etc. Then you have a picture of which possible solutions are actually likely to improve the situation.
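To see why dataset size matters, you can estimate how run time grows with size by fitting a straight line to log(time) against log(size); the slope approximates the exponent k in time ~ size^k. This is a rough sketch under that power-law assumption. The function name is mine, and the timings below are synthetic values generated from a quadratic formula purely to show the calculation, not real measurements.

```python
import math


def scaling_exponent(sizes, timings):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic example only: pretend the run time is proportional to size squared.
sizes = [1_000, 10_000, 100_000]
timings = [3e-6 * n ** 2 for n in sizes]
print(scaling_exponent(sizes, timings))
```

An exponent near 1 suggests results from a small dataset will scale up roughly in proportion; an exponent near 2 or more means a small test could badly understate the cost of a production run.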
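A minimal sketch of repeating a test case and summarising the spread might look like this. The names time_runs, spread and the toy workload are placeholders I've made up; in practice the workload would be your real test case on one of the datasets.

```python
import statistics
import time


def time_runs(workload, repeats=5):
    """Run workload() several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def spread(durations):
    """Return the mean and sample standard deviation of the durations."""
    return statistics.mean(durations), statistics.stdev(durations)


# Toy workload standing in for a real test case.
durations = time_runs(lambda: sum(range(100_000)), repeats=5)
mean, stdev = spread(durations)
print(f"mean={mean:.6f}s stdev={stdev:.6f}s")
```

If the difference between two candidate solutions is smaller than the standard deviation of either, more runs are needed before drawing any conclusion.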
Re: What is Significant?
Running a cut-down version of the test is an option. Statistical significance is usually based on very cut-down samples backed by the law of large numbers or the central limit theorem. Alternatively a "sufficiently" accurate simulation could be created based on knowledge of the workings of the device under test.
The hard part is the creation of such a test which is statistically representative, particularly one which isn't simply a reflection of your preconceptions of where a bottleneck is likely to be!
A combination of real-world measurements and a simulation that possesses the same characteristics may well be good enough to aid the development of an optimisation which can be verified in a production-like environment without burning weeks of trials.
Sounds like a good example of software engineering as opposed to computer science!
Robert works for an investment bank and has spent many years creating and maintaining trading systems. He knows far more about low latency data systems and garbage collection than is good for anyone. If you find yourself in a pub with him then do NOT mention phrases like "mark and sweep" or "memory profiling" as you'll be stuck there for hours and might be forced to break an ashtray over your own head to get away from him.


