Thursday, January 17, 2008

Interview with Sander Land, Developer of charlie

Sander Land recently released charlie a Ruby library for using genetic algorithms. Since the initial release, he’s made several updates, and posted some performance information for charlie on Jruby 1.1 RC1. I was intrigued by the activity, and asked Sander to join me for a quick interview.


Why the name ‘charlie’?

Sander It’s a reference to Charles Darwin. The name was originally the project’s code name, and it just stuck.

Everything I know about genetic algorithms, I learned from wikipedia. I’m guessing a lot of other people are in the same boat. Maybe you could help us all out. What are genetic algorithms?

Sander Wikipedia is not a bad place to start:

“A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover.”

The basic idea is that you have a population of solutions, and some kind of fitness ranking. You select solutions such that those with higher fitness (on average) get more children. This makes the solution converge towards some local or global optimum. Crossover and mutation are used to generate more variation in the solutions, so it can explore new parts of the search space.

There are a lot of ways to do selection, crossover and mutation. The basic idea behind charlie is to make it easy to choose these parts, and to be able to compare several of these strategies to see which works the best.

Ok, so what are they good for?

Sander As wikipedia says “optimization and search problems”. That is, any time you need to optimize something or when you can restate your problem as optimizing something. Examples are finding a shortest cycle through a graph, or finding the airplane wing with the most lift.

A big advantage of GA over other optimization algorithms is that you only need a simple representation of your solution (usually an array of floats or a bitstring) and a fitness function which returns a higher value for better solutions. There’s no need for calculating derivatives or knowing specialized algorithms for each problem.

So, is delta1 a good example of the space that GAs play in?

Sander It’s not a typical example, but it’s probably possible:

  1. Take a bitstring of length (# of lines in original file) as your genotype. Use zero/one at position i to mean “solution does (not) include line i”.
  2. Take 1/(# of ones) as the fitness value, but assign a value of 0 to uninteresting files.
  3. Initialize all your initial solutions to all ones.

The problem is hard to solve with GA because you’re forcing the uninteresting files to have a fitness value of zero. There is no way to tell ‘how uninteresting’ the file is. This makes step (3) a necessity, instead of being able to use the standard random initialization, because otherwise you would most likely start with a population of uninteresting files and no way to determine which are the better solutions. It also has a negative effect on the convergence of the algorithm, because every small mistake, even when combined with an improvement somewhere else, is going to result in an fitness of zero.

To get back to my earlier, more typical, examples:

  • All cycles through a graph will have some length.
  • All wing designs will have some lift force. This could be negative, but even in that case you can tell ‘how wrong’ the solution is.

What needs did you have that made you write charlie?

Sander I had used GAs a few times, using a ~250 line Ruby script which was rather messy (think global variables with a hash of several procs of 5+ lines each). I wanted to make a library which was a lot cleaner and easier to use, but without restricting what genotype/selection/crossover/mutation I could use. I also wanted to learn more about genetic algorithms and genetic programming, and thought writing a library would be a nice way to do this.

Finally, when using a GA to solve a problem there was always the question “which settings are the best?”. Most settings work reasonably well, and it can be hard to determine whether some are actually better, or just ‘lucky’ the single time you tried them. Hence the focus on benchmarks to answer these questions.

What books/websites would you recommend for learning more about genetic algorithms?

Sander This website has some basic info and a lot of links to news articles, tutorials, etc.

A simple google search for “genetic algorithms tutorial” will also give some decent results, but reading a book is probably the best way to learn more about GA.

Some of the examples included with charlie are from “Evolutionary Computation for Modeling and Optimization” by Daniel Ashlock. He has the book on his website.

Is Ruby really a good language to do things like this in?

Sander I think so. Ruby certainly makes it easy.

As for performance issues, the fitness function is often the bottleneck for more complex problems. Because you always need to implement this function yourself, it is easy to increase performance by writing this in a lower level language.

Another issue is parallelization. This is not yet implemented anywhere, but I have something planned for multiple parallelly evolving populations (multi-deme GA). I’m not sure how difficult it will be to do this.

Wikipedia talks about a number of programming methods related to genetic algorithms, do you see charlie growing to encompass any of these, or do we need to look for other libraries (and other developers)?

Sander Some of those programming methods, like Evolution strategies and Evolutionary programming are very similar to GA, and sometimes even nearly indistinguishable, so they are easy to fit in a GA framework. A basic form of Genetic programming is already implemented, as there is a tree-based genotype. I definitely plan to extend charlie to encompass more GP techniques, and to try some selection and crossover techniques more common to ES and EP.

I currently don’t have any plans for extending the library to methods that are not similar to GA.

What have you learned about Ruby while writing charlie?

Sander Not much, actually. I’ve been programming in Ruby for more than two years now, and picked up most of the metaprogramming skills needed for this library from reading blog posts and making ruby quizzes.

However, with this being my first ‘large’ project, I’ve learned how to use rdoc, rake, hoe and other tools.

How did you end up picking up Ruby as a programming language?

Sander I was looking for a new programming language, because using C++ or PHP for everything was inconvenient. I first tried python, which was interesting, but somehow not interesting enough to make me stop searching. A few days later I tried Ruby. With the pickaxe included in the installer it was easy to learn and I liked its consistency. Within a week I joined the ruby-talk mailing list, and made Ruby my language of choice.

You were one of the first people to post benchmarks for JRuby 1.1 RC1. How much work did it take to get charlie running on JRuby?

Sander No work at all, all my tests succeeded on the first try. Again, great job JRuby team.

Does charlie run on Rubinius or 1.9 yet?

Sander Ruby 1.9: Yes, as of version 0.6.0 there is partial support, and the recently released version 0.7.0 has full support. This required a few workarounds, since there is still a rather nasty bug in 1.9 which makes superclass lookups fail in some cases. (note: The bug is reported here.)

Rubinius: Just tried it. Over 50% of the tests fail, so I guess that’s a no. But Rubinius’ performance is probably still too low to seriously consider using it for this kind of application.

What’s the coolest thing that someone’s used charlie for thus far?

Sander I don’t know! Except for you mailing me for the interview, and some people telling me they were going to check the library out, there has been no feedback at all.

What plans do you have for charlie?

Sander I have some additional features planned, like extra crossovers and genotypes for 2D arrays. I also plan to learn more about genetic algorithms and genetic programming, and include more features in the library as I learn about these things.

1Delta assists you in minimizing “interesting” files subject to a test of their interestingness. A common such situation is when attempting to isolate a small failure-inducing substring of a large input that causes your program to exhibit a bug’

No comments: