Monday, December 04, 2006

Benchmarking, Lies, and Statistics

Work on my book hasn't been going as well as I'd hoped (it's amazing how procrastination and laziness can throw you off a schedule). Recently, though, I've been doing a little better. Last week I was working on some material on benchmarking Ruby code, and suddenly I started to channel Zed Shaw.

I realized that in my discussion of how to use the Benchmark library, I was completely ignoring all of Zed's great advice about applying statistics to performance testing. I really wanted to get that advice in front of anyone reading my book, so I wrote this little sidebar (below).

When I was done with it, I realized two things. First, I wanted to get the information out to a wider audience sooner, rather than later; and second, I think there's an opportunity for an enterprising tool writer to take the idea (and sample code) below and create a tremendously useful benchmarking tool. I talked with my editor, and he agreed to let me post the sidebar here — that takes care of the first realization. Hopefully the lazyweb will take care of the second.

I don't have a deep background in statistics, so I relied on a lot of advice while writing this (any mistakes are certainly mine though). If you think I'm all wet, I'd love to collect some feedback. Please drop a comment here explaining why you think the concept, method, or math needs work. (Of course, if you want to say something nice about it, I certainly won't mind.)

Statistics and Benchmarking

It's important to note (though it's beyond the scope of this book to do much more than that) that running a single benchmark of one method against another isn't sufficient to determine whether you've really made a performance improvement. Zed Shaw has talked about this at length.

A quick way to address this is to run a series of trials of both your original and your new code and calculate the delta between each pair of results. Then you can calculate the mean and standard deviation of those deltas and check whether 0 falls within the mean +/- 2 standard deviations (roughly a 95% confidence interval). If it does, you can't claim a statistically significant difference between the two versions.

The Benchmark library actually makes it fairly easy to build something like this. Here's a quick-n-dirty example:


require 'benchmark'
iterations = 10
loops = 100_000
deltas = []

# Arithmetic mean of an array of numbers
def mean(ary)
  ary.inject(0) { |sum, i| sum + i } / ary.length.to_f
end

# Standard deviation of an array, given its mean
def std_dev(ary, mean)
  Math.sqrt(ary.inject(0) { |dev, i| dev + (i - mean) ** 2 } / ary.length.to_f)
end

# Time each version several times, recording the delta between each pair of runs
iterations.times do
  original_result = Benchmark.realtime do
    1.upto(loops) { |num| num + 1 }
  end
  new_result = Benchmark.realtime do
    for num in (1..loops) do num + 1 end
  end
  deltas << (original_result - new_result)
end

deviation = std_dev(deltas, mean(deltas))
mean_delta = mean(deltas) 

printf "The deviation in the deltas was %f\n", deviation
printf "The mean delta was %f\n", mean_delta
# If 0 falls within mean_delta +/- 2 standard deviations, we can't claim
# a statistically significant difference between the two versions
max = mean_delta + 2.0 * deviation
min = mean_delta - 2.0 * deviation
if min <= 0 && 0 <= max
  puts "There's no statistical difference"
end

With this example, you get a more accurate picture of the performance difference between the for loop and the upto method:


The deviation in the deltas was 0.005296
The mean delta was 0.006870
There's no statistical difference
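
Plugging those numbers in, the interval is 0.006870 +/- 2 * 0.005296, or roughly -0.003722 to 0.017462. Since 0 falls inside that range, we can't claim that either version is faster.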

For more information (and better tests) go cuddle up with a good book on statistics because my brain hurts from doing this much.

6 comments:

James H. said...

That article of Zed's, along with my course this semester on the theoretical foundations of computer science, has really renewed my appreciation of mathematics as it directly relates to computer science. It's sometimes easy to get lost in the mundane activities of programming and forget that we're really just writing various English-like incantations of mathematics.

gnupate said...

Hmm, I guess I wasn't as clear as I should have been. The article was actually mine; I just relied on a lot of things Zed has said (and some advice from him) while writing it.

Anonymous said...

I have to ask: why doesn't Benchmark come with this already?

gnupate said...

sapphirecat,
I wish it did. I don't think many people think about the statistical end of it.

I have to admit that I find it pretty cool that Benchmark is part of the Ruby standard library at all, though; most languages don't even go that far.
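
For comparison, here's roughly what the stock library gives you out of the box (just a minimal sketch; the labels and loop count are arbitrary). It prints the user/system/real times for a single run, and the statistics are left to you:

require 'benchmark'

loops = 100_000

Benchmark.bm(7) do |x|
  x.report("upto:") { 1.upto(loops) { |num| num + 1 } }
  x.report("for:") do
    for num in (1..loops) do num + 1 end
  end
end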

Anonymous said...

module Enumerable
  def sum
    inject(0) { |acc, i| acc + i }
  end

  def average
    sum / length.to_f
  end

  # note: this divides by n, so it's really the population variance
  def sample_variance
    avg = average
    inject(0) { |acc, i| acc + (i - avg) ** 2 } / length.to_f
  end

  def standard_deviation
    Math.sqrt(sample_variance)
  end
end

[1,2,3,4].sum #=> 10
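
With these mixed in, the sidebar's math reduces to deltas.average and deltas.standard_deviation. A couple more illustrative results (the last one rounded):

[1,2,3,4].average #=> 2.5
[1,2,3,4].standard_deviation #=> 1.118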



Unknown said...

I think you meant:

if min <= 0 && 0 <= max

and not

if min <= 0 || 0 >= max

The former checks that 0 actually falls inside the interval; the latter just checks whether either min or max is negative.