Friday, April 01, 2011

Protocol Buffers - Brian Palmer's Take

Here's another in my continuing series of Ruby Protocol Buffers posts. This time, I've got an interview with Brian Palmer. (I interviewed Brian about his participation in a 'Programming Death Match' back in 2006.)

While working at Mozy, Brian worked with Protocol Buffers, and he now maintains the ruby-protocol-buffers library. He left Mozy last September and is now working at a startup called Instructure.

I hope you enjoy Brian's take on Protocol Buffers.

How did you get started using Protocol Buffers?

Brian: At Mozy, the back-end storage system is written in C++. When we started to standardize our messaging, we evaluated a few different libraries, like Thrift, but ultimately settled on Protocol Buffers because of their great performance and minimal footprint. Our servers handle terabytes of new data a day, so we became sort of "I/O snobs" I guess. We hated using any library that handled its own I/O in a way that we couldn't pipe everything through our finely tuned event loop and zero-copy data structures. So Protocol Buffers, where the data structure and wire format are standardized but the surrounding protocol is up to you, was a perfect fit.

What are its primary application domains?

Brian: I'd say Protocol Buffers are a better fit than, say, JSON, for more performance-sensitive code, since the wire format is extremely space efficient but fast to parse and generate. Or if you need a very flexible protocol. For instance, we use protocol buffers for the message header but can just pack the actual file data in a raw byte array in the same message, saving the overhead of having our protocol library parse all those terabytes of data.

Or if you want a flexible data format that still allows a bit more definition than just an arbitrary map of key/value pairs, like JSON. Requiring that fields be defined up front in the .proto file can be really helpful when trying to coordinate communication between different apps internally, especially with the guaranteed backwards compatibility.

Why did you decide to write a Ruby library?

Brian: Mozy's back-end is C++, but we use Ruby for the integration test suite for that system, along with all the web software. So we found that we really needed Protocol Buffers for Ruby. This was 3+ years ago -- at the time, we looked at the existing ruby-protobuf library and it wasn't at all suitable for our needs.

Initially this was just going to be a small internal tool, part of our testing framework. There wasn't any talk of open sourcing the library until we'd already been using it internally for a couple years. I just looked at ruby-protobuf again when you contacted me, and it looks like it's come a long way in both completeness and performance. Makes me a bit sad that I might have muddied the waters with another competing library when neither is a clear winner, that's unfortunate. Hopefully somebody finds it useful, though.

How much effort did you put into making your library performant? Ruby-ish? What are the trade-offs between the two?

Brian: My main focus was on performance, since we found that ruby's time spent encoding/decoding protocol buffers was actually a bottleneck in running our integration tests. Internally at Mozy we install the library as a debian package, rather than a gem, and this includes the C extension that is currently disabled in the gem packaging, which provides another performance boost, especially on ruby 1.8.7.

I think the library remains pretty ruby-ish though. The only place that doesn't feel very ruby-ish is in the code generation: while in theory you could just write your .pb.rb files by hand without writing a .proto file, it's not very natural in the current implementation. So you'll want to use the code generator. But the runtime API is very natural to use, I think.

Since our main goal was interfacing with an existing C++ infrastructure that uses .proto files extensively, this was a natural trade-off for us to make. It wouldn't take much effort to make a more ruby-ish DSL for the modules, though.
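A more ruby-ish DSL of the kind Brian mentions might look something like the following sketch. This is entirely hypothetical (all names are made up, and it belongs to neither library); it just shows how a block-based builder could replace generated .pb.rb files:

```ruby
# Hypothetical block-style DSL for defining a message in plain Ruby,
# instead of generating a .pb.rb file from a .proto definition.
class MessageBuilder
  attr_reader :fields

  def initialize(&block)
    @fields = []
    instance_eval(&block)
  end

  def required(type, name, tag)
    @fields << { rule: :required, type: type, name: name, tag: tag }
  end

  def optional(type, name, tag)
    @fields << { rule: :optional, type: type, name: name, tag: tag }
  end
end

# Mirrors the Person message used elsewhere in these posts:
person = MessageBuilder.new do
  required :string, :name, 1
  required :int32,  :id,   2
  optional :string, :email, 3
end

person.fields.map { |f| f[:name] }  # => [:name, :id, :email]
```

A real implementation would also generate accessors and serialization methods from those field definitions, but the builder pattern above is the core of the idea.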

With three competing implementations out there, how do you see the playing field shaking out?

Brian: To be honest, I haven't looked closely at the other ruby offerings in a couple years. But I can say that while generally I'm a big fan of choice, I'm not sure it really makes sense to have three ruby libraries for something as simple as protocol buffers — one library could probably easily be made to serve everybody's purposes. So in that sense, I hope a clear winner is established, if just to avoid fragmentation of effort.

What would you like to see happen in terms of documentation for ruby-protocol-buffers?

Brian: The best documentation for Protocol Buffers in general is definitely going to remain on Google's site. And my online library documentation covers how to use the runtime API pretty well. But it'd be nice to have more explanation of how to get up and running with ruby-protocol-buffers.

Thursday, March 31, 2011

Protocol Buffers - BJ Neilsen's Take

BJ Neilsen (@localshred or at github) is a member of my local Ruby Brigade, and he's hacking with/on Protocol Buffers with Ruby — oh, and he's a fan of Real Salt Lake too.

He works for MoneyDesktop, a Provo, Utah-based startup, where he helped them transition away from a less-than-desirable PHP solution to Rails. They now enjoy an entirely new service architecture driven by Ruby (and Protobuf). When not working with Ruby, he runs OneSimpleGoal and plays around with iOS and Objective-C.

To get another take on Protocol Buffers, I asked BJ to join me for a quick interview. Enjoy!

How did you get started using Protocol Buffers?

BJ: At the beginning of 2010 I was hired by a startup in Provo to help build out their product offering. The entire application was written in Java, but for the piece I was to be in charge of I was given free rein to choose a platform. Of course I chose Ruby, but it soon became apparent that we needed a solid way to get data from one application to the other.

This need launched a refactor to a more service-oriented approach. Different solutions were researched for dealing with data interchange such as Thrift and the like, but we ended up choosing Protobuf for its simplicity, pedigree, and multi-platform support. No XML, no WSDL, just simple definitions compiled to the language of your choice. Defining a Data Structure and API with one declarative language, and then being able to build the client and server implementations in two different languages was a huge win. We created a Socket-based RPC server on the Java side, and called the endpoints from Ruby. It was very simple.
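Since protobuf standardizes the message format but leaves the surrounding protocol up to you, a socket-based RPC setup like the one BJ describes needs its own message framing. A common approach (sketched here with hypothetical helper names, independent of any protobuf library) is a length prefix in front of each serialized message:

```ruby
require "stringio"

# Frame a serialized message with a 4-byte big-endian length prefix,
# so the reader knows exactly how many bytes to pull off the socket.
def write_framed(io, payload)
  io.write([payload.bytesize].pack("N"))
  io.write(payload)
end

def read_framed(io)
  len = io.read(4).unpack1("N")
  io.read(len)
end

# StringIO stands in for a real socket here:
io = StringIO.new
write_framed(io, "serialized-protobuf-bytes")
io.rewind
read_framed(io)  # => "serialized-protobuf-bytes"
```

With framing in place, the payload on either side can be any serialized protobuf message, which is what makes the Java-server/Ruby-client split straightforward.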

I'm now with a new company, and the new team was very receptive to the idea of a Protobuf Service ecosystem for our service-oriented application. It is currently the primary method of internal data interchange between multiple service applications. At the time of writing, we have over 20 different proto definition files, 63 separate defined data types (including Enums), and 15 independent service classes implementing a total of 32 service endpoints.

What do you see as the strengths of the Protocol Buffers data format?

BJ: One of the greatest strengths of Protobuf is its clear data definitions. Open up any .proto file and it's not hard to deduce the structure of the represented Data Types. Defining Service endpoints is similarly simple, meaning all of the ambiguity of Wiki-based (or similar) API documentation is immediately eliminated. Clarity is crucial when building a large system with a team of any size. Being able to clearly understand how and what data is transferred within the system is absolutely key, especially when you hire beyond your core development team and need to get people contributing quickly.

I've already mentioned the power we gained from being able to tie together a Service architecture with multiple languages in a unified API. The Protobuf project officially supports Java, C++, and Python implementations for the definitions compiler and data serialization code, but they have a ton of third party code listed for many other languages like Objective-C and JavaScript (with support in Node.js as well).

Which Protocol Buffers implementation are you using? How did you end up choosing it?

BJ: The only Ruby project listed on Protobuf's "Third Party" page (at the time) was Mack's Ruby-Protobuf. This was a great start as the compiler was built in YACC. However, once I started integrating the API into our Ruby application, it became clear that the RPC side had been half-baked and just sort of thrown out into the wild. Files were compiled and stubbed in the wrong places, meaning that if I added any code to the stubbed client or server files, subsequent compiles would overwrite my changes. Not good.

BJ: By that time we were full-steam ahead on the Protobuf implementation in the other services, so I basically had to go in and rewrite the compiler code generation for each of the services, as well as do a complete rewrite of the entire RPC backend to become compatible with the Protobuf SocketRPC library written for Java. Since that first rewrite in early 2010, I've done another rewrite (late 2010) to use EventMachine as the RPC backend, and I can tell you it's light-years faster. The DSL is much sexier too, looking much more like an AJAX request with callbacks than a standard socket connection with byte-reading hell. You can get that code on my github fork on the compatibility-0.4.0 branch.

What are your plans for your fork of Mack's ruby-protobuf? Will it get wrapped into his distribution, or will you go all the way, rename it, and start publishing it as a gem?

BJ: Fantastic question. Currently I've packaged the gem internally for our SOA ecosystem to get around the problem of getting it into a full release with the original code. I've embarked on merge-hell attempting to get my code to work with theirs several times now, and each time it just feels like it's not worth it. I've yet to have contact with the original developers (I'm fairly sure they live in Japan), and so I'm not entirely sure they'd accept any patches I'd send anyway.

I've also toyed with the idea that since I've changed a significant chunk of the original code I could just make it my own gem with some witty name (and a reference to the original). The only thing that has kept me from that path is that a) I'd prefer not to insult the original developers, and b) I'm a bit ashamed that there aren't very many tests backing up the RPC backend (the major piece that I wrote from scratch).

Each day we have thousands of successful RPC calls with a virtually non-existent error rate running through the EventMachine RPC code written into this gem, so it has certainly been battle tested in a heavily used production system. Unfortunately it just doesn't have that warm fuzzy feeling (for those who haven't used it yet) that you get when you have 200 green tests behind each class. However, patches with tests are certainly welcome :).

Anyone can pull from my fork on the compatibility-0.4.0 branch (essentially the "master" I build the gem from) and build their own gem if they wish. I'd be happy to answer any questions that may arise, and I may even be available to consult with anyone on how to implement Protobuf in their current system.

You gave a presentation on Protocol Buffers at uv.rb. How was it received? Do you see more people starting to use this data format?

BJ: To be honest, I'm not sure my presentation went the way I'd hoped, certainly not well enough to highlight many of the benefits and reasons for using Protobuf. I spent too much time showing the "How" instead of the "Why". I think many people left the meeting intrigued, but it was also marred by a drawn-out rant by a few of the developers present, debating whether it was more prudent to use REST/JSON than a more declarative format like Protobuf.

The argument is moot simply because both styles are great; they just fulfill slightly different needs. When it comes to "Code as Documentation" it's hard to argue against Protobuf, a format that is much easier for devs from other languages to buy into. I've never had a developer come to work on a Protobuf API who, after being shown the .proto files, could not understand how to read or extend the definitions.

I hope that developers will give the format a try because I think it's the next level up from normal web application design. It's the start of understanding that for larger applications, different tools should be considered to help alleviate the pains of a (potentially) larger system and the needs of moving data from one place to another on the fly.

Ok, that's a pretty intriguing statement. What different tools should we be looking at (or developing) to work on larger systems and larger data sets?

BJ: Hopefully I don't get myself into too much hot water with the answer to this question (or go off on a large tangent), but here we go. Keep in mind also that this long-winded answer comes with a grain of salt, because every system will be designed to meet different goals. Therefore, there is no "one true way" as some would tout.

That being said, if you are looking to build a system for growth, there are certain concepts and technologies that should at least be considered from the outset. Service-Oriented Architecture (SOA) is a way of designing a system for growth; to me it's the most natural way to begin with the journey in mind. For those new to SOA, a short primer: it involves creating smaller independent applications that are easier to write and maintain because they focus on smaller feature sets, and when roped together they give you the benefit of all the systems working as a whole, ready to scale.

In this type of system we never want to share data between service applications directly, such as connecting from Service A to Service B's database to get user data. We share data by creating APIs for each service application (with protobuf of course :)), then publish those APIs for our other services to consume. If one application needs user data, it doesn't connect to the user database, it connects to the internal User service's API to gather the data. Naturally protobuf fits extremely well here, but REST/JSON or SOAP or (insert other transport protocol here) can obviously be used also.
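The "talk to the service's API, never its database" rule can be sketched as follows. All names here (UserClient, the transport, the endpoint) are hypothetical, standing in for whatever RPC or HTTP layer a given system uses:

```ruby
# Hypothetical sketch: other services fetch user data through the User
# service's API rather than its database. The transport object is the
# only thing that knows how to reach the User service (protobuf RPC,
# REST/JSON, or otherwise); callers never see a database connection.
class UserClient
  def initialize(transport)
    @transport = transport
  end

  def find_user(id)
    @transport.call(:get_user, id: id)
  end
end

# A fake transport standing in for the real RPC layer:
fake_transport = Object.new
def fake_transport.call(endpoint, args)
  { name: "user-#{args[:id]}" } if endpoint == :get_user
end

client = UserClient.new(fake_transport)
client.find_user(42)  # => {:name=>"user-42"}
```

One nice side effect of the indirection is exactly what the fake transport demonstrates: services can be tested against stubs without any of the other services (or their databases) running.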

Other "large systems" or so-called "enterprise" technologies that fit well into an SOA system are background jobs (queues) and various types of messaging systems.

Queueing is essential for the speed and scalability of a system, as it offloads non-critical (yet important) processing to separate threads or processes. A simple example of how a queue can improve the speed and usability of a system is sending an email when a user is created. The user generally doesn't care (or know) that you are sending them an email when their account is created, but they do care if it takes 10 seconds. So rather than tie up the user's request just to send an email, you queue that "job" for later (even if it's processed milliseconds later) and let the request return the result of the user creation immediately. Workers in other threads or processes will pick up the email job and send the email for you.
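The email example above can be sketched with Ruby's built-in Queue and a worker thread. This is only an in-process illustration of the pattern; a system like the one described would use something like Resque to do the same thing across processes and machines:

```ruby
require "thread"

# In-process sketch of background jobs: the request path enqueues the
# email job and returns immediately; a worker thread delivers it later.
jobs = Queue.new
sent = Queue.new

worker = Thread.new do
  job = jobs.pop
  # Stand-in for actually delivering the email:
  sent << "emailed #{job[:to]}"
end

# "Create the user" and enqueue the email without blocking on delivery
# (the address is a made-up example):
jobs << { to: "new-user@example.com" }
user_created = true

worker.join
result = sent.pop  # => "emailed new-user@example.com"
```

The request path only pays the cost of a queue push; the slow work happens on the worker's time, which is the whole point of the technique.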

The main queueing system we use is GitHub's excellent Resque coupled with my own little resque-remote plugin. Resque-remote gives us the ability to queue a job for another service to consume.

Messaging is such an enormous topic that I'm not sure I'm the one you want describing its ins and outs. The short of it is that in certain contexts we've found it can make more sense to use push-based data transfer rather than pull-based. Take the user creation example: when a user is created in my User Service application, the user service doesn't know about any other systems that may be interested that a user was created, and frankly it shouldn't care. The User Service should only be responsible for posting a message (to a message service or bus) that an event occurred in the system; in this case, a user was created. Once the event is published, the user service can go about its merry way. Other parts of the system may be listening to the message (event) bus for user-creation events and their associated data, and they will receive the data as a push. This specific messaging paradigm is usually referred to as PubSub (Publish/Subscribe). As I've already mentioned, there are many, many more messaging patterns that can be followed.
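The PubSub pattern described above can be boiled down to a few lines. This is a tiny in-process sketch (a real system would use a message broker between processes), with made-up event and class names:

```ruby
# Tiny in-process sketch of PubSub: the publisher never knows who,
# if anyone, is listening for its events.
class MessageBus
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(event, &handler)
    @subscribers[event] << handler
  end

  def publish(event, payload)
    @subscribers[event].each { |handler| handler.call(payload) }
  end
end

bus = MessageBus.new
welcomes = []

# A mailer service subscribes to user-creation events:
bus.subscribe("user.created") { |user| welcomes << "welcome #{user[:name]}" }

# The User service just publishes the event and moves on:
bus.publish("user.created", name: "bj")

welcomes  # => ["welcome bj"]
```

Note that publishing an event nobody subscribes to is a no-op, which is exactly the decoupling being described: the User service's code doesn't change when listeners come and go.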

These are just a few of the systems we've put in place to manage data transfer complexity in our SOA ecosystem. There's also another branch for data warehousing such as ETL data transfer systems like Pentaho or Jasper. The possibilities are... well, you get the idea.

The coolest part about all of this is that you can use Ruby for 100% of these so-called enterprise situations. We do. You don't have to use Java or .NET to solve "Big Boy" problems. When I first started with Ruby, I wasn't entirely sure of this, but I certainly am now.

So, you've read along this far. What do you think? How are you using Protocol Buffers? Why did you choose to go down this route?

Saturday, March 26, 2011

Ruby and Protocol Buffers, Take One and a Half

In a comment on my previous post on Protocol Buffers, Clayton O'Neill recommended trying out the Java protobuf library with JRuby. I'll get to that eventually, but his comment made me wonder how JRuby and Rubinius would do with this little test.
I fired up rvm and looped through my installed versions. Here are the results:
ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-linux]
real 3m11.857s
user 3m11.024s
sys 0m0.124s
jruby 1.5.5 (ruby 1.8.7 patchlevel 249) (2010-11-10 4bd4200) (Java HotSpot(TM) Client VM 1.6.0_24) [i386-java]
real 2m54.035s
user 2m53.355s
sys 0m0.388s
rubinius 1.1.1 (1.8.7 release 2010-11-16 JI) [i686-pc-linux-gnu]
real 1m59.693s
user 2m5.292s
sys 0m0.148s
ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
real 1m50.293s
user 1m49.811s
sys 0m0.092s
I certainly wouldn't choose a Ruby implementation based on this alone, but it's good to see where things stand at the get-go. As I keep going with this exploration, I'll try to keep posting timing results.

Update: Evan Phoenix (@evanphx) pointed out that I was using old versions of Rubinius and JRuby. Since I'm boxed into which Ruby implementation I use (1.8.7 on our boxes at work), I wasn't thinking about keeping things up to date in my RVM installation. I've updated JRuby and Rubinius and rerun the test. The results are as follows:
jruby 1.6.0 (ruby 1.8.7 patchlevel 330) (2011-03-15 f3b6154) (Java HotSpot(TM) Server VM 1.6.0_24) [linux-i386-java]
real 1m38.390s
user 1m42.914s
sys 0m0.508s
rubinius 1.2.4dev (1.8.7 536a6eb8 yyyy-mm-dd JI) [i686-pc-linux-gnu]
real 1m58.138s
user 2m1.492s
sys 0m0.144s

Thursday, March 24, 2011

Ruby and Protocol Buffers, Take One

At work, we're moving from XML to protocol buffers.  While we're mostly a Java shop, the operations/sysadmin team I'm on does a lot of Ruby. I was interested in how we might use the same technology for some of our stuff. After a bit of looking, I found two libraries that looked mature enough to investigate:

ruby-protobuf, by MATSUYAMA Kengo (@macks_jp), was straightforward to install and use.  It has a good online tutorial and the README has all I needed to get started.
ruby-protocol-buffers, by Brian Palmer, was also easy to install and use.  It seems a bit lacking in online documentation, but does have some examples to follow.  (If Brian's name rings a bell, it might be because I interviewed him some time ago about winning a programming contest sponsored by Mozy's former incarnation.)
I started out with a very simple proto file:

package bench;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

I compiled this with rprotoc for ruby-protobuf and with ruby-protoc for ruby-protocol-buffers. This generated the following (which I edited lightly).  For ruby-protobuf:

### Generated by rprotoc. DO NOT EDIT!
### <proto file: bench.proto>
# package bench;
# message Person {
#   required string name = 1;
#   required int32 id = 2;
#   optional string email = 3;
# }
require 'protobuf/message/message'
require 'protobuf/message/enum'
require 'protobuf/message/service'
require 'protobuf/message/extend'

module Bench1
  class Person1 < ::Protobuf::Message
    defined_in __FILE__
    required :string, :name, 1
    required :int32, :id, 2
    optional :string, :email, 3
  end
end
For ruby-protocol-buffers:
#!/usr/bin/env ruby
# Generated by the protocol buffer compiler. DO NOT EDIT!

require 'protocol_buffers'

# Reload support
Object.__send__(:remove_const, :Bench2) if defined?(Bench2)

module Bench2
  # forward declarations
  class Person2 < ::ProtocolBuffers::Message; end

  class Person2 < ::ProtocolBuffers::Message
    required :string, :name, 1
    required :int32, :id, 2
    optional :string, :email, 3

    gen_methods! # new fields ignored after this point
  end
end

Then I pulled out the statistical benchmarking I wrote about a while ago (since no one else has taken the bait, maybe I should bundle up a gem for that).  Instead of quoting the whole thing at you, here are the pertinent loops.  For ruby-protobuf:

msg = Bench1::Person1.new(:name => idx.to_s,
                          :id => idx,
                          :email => idx.to_s)
msg_str = msg.serialize_to_string
msg == msg.parse_from_string(msg_str)

For ruby-protocol-buffers:

msg = Bench2::Person2.new(:name => idx.to_s,
                          :id => idx,
                          :email => idx.to_s)
msg == Bench2::Person2.parse(msg.to_s)
And here are the results:

$ rvm 1.9.2
$ ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]
$ time ruby ProtoBufBench
testing ruby-protobuff against ruby-protocol-buffer
The deviation in the deltas was 0.021731
The mean delta was 0.198301
max = 0.241761921640842 :: min = 0.154839326146633
ruby-protocol-buffer was better

real 1m50.672s
user 1m50.599s
sys 0m0.092s

$ rvm 1.8.7
$ ruby -v
ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-linux]
$ time ruby ProtoBufBench
testing ruby-protobuff against ruby-protocol-buffer
The deviation in the deltas was 0.009414
The mean delta was -2.205984
max = -2.18715483485056 :: min = -2.22481263341116
There's no statistical difference

real 3m8.131s
user 3m7.996s
sys 0m0.056s

I didn't try compiling the C extension for ruby-protocol-buffers, and I haven't tried any more involved .proto files yet.  I'll work on those in the next couple of days and post results as I see them.

Wednesday, March 23, 2011

Review - Eloquent Ruby

The system management/administration team that I work on is starting to do more scripting and tool building.  That means bringing a bunch of people up to speed on Ruby.  We're using a combination of the Pickaxe Book and pair programming/mentoring to help bootstrap people.  So far it's been working pretty well.
Watching everyone else reading and learning made me want to get in on the action.  Fortunately, there's a new book out from Russ Olsen (@russolsen) — Eloquent Ruby.  I had the opportunity to interview Russ (look for it to show up soon) about his book, and Addison-Wesley was kind enough to send me a copy of it.
Eloquent Ruby is primarily aimed at people coming to Ruby from other languages.  It aims to explore and explain the idioms common to our community, and I think it does a great job of it.
It will serve current rubyists well too.  I learned several things from it as I read, and cemented other concepts as well.  Some of the book's explanations have crept their way into discussions about Ruby here at work.  It's good stuff.
Section One, on basics, is a rich source of little gems that will help you use Ruby's built-in classes more effectively.  Section Two, on modules and classes, helps you build better classes of your own.  Section Three, which covers metaprogramming, dives into an oft-discussed but underused side of Ruby to push your Ruby-fu to the limit.  Section Four covers a variety of things that either don't fit into the other sections or build on concepts from them.
With this book, Russ has hit another one out of the park.  Go grab a copy for yourself.
If you're interested in Design Patterns in Ruby, you can read my review here.  You can also read a previous interview with Russ.