LegNeato! Christian Legnitto's blog about Mozilla, Apple, technology, and random stuff

17 Jul 2010

Mozilla Pulse and RabbitMQ

I did a lightning talk at the Mozilla Summit about my pet infrastructure project, Mozilla Pulse. I'll be talking about it in more depth in a future blog post. This post is more a call for help from message broker experts.

I've been running into issues with RabbitMQ (the Erlang message broker that runs on pulse). I griped a little on Twitter and got some responses, so I decided to write a more in-depth description of what I am running into. I'm not going to explain any message-broker-specific terminology, so feel free to skip this post if you don't know what I am talking about. None of this should matter if you just want to use pulse in the future.

The general idea of using a message broker at Mozilla is to make useful tools on top of infrastructure, with the infrastructure (producers) only loosely coupled to the tools (consumers). Because of this, I came up with this configuration for an initial prototype:

Exchanges

org.mozilla.exchange.bugzilla (topic)

  • All Bugzilla messages are routed in here. Bugzilla is the producer, with permissions of ".*bugzilla" ".*bugzilla" ".*bugzilla". That is, the Bugzilla producer can do anything to the Bugzilla exchange
  • The message routing key hierarchy looks like bug.added, bug.changed.[field], etc
  • The plan was to add more, sticking logic in the producer (that is, bug.changed.resolution when the message data is CLOSED should be elevated to bug.closed instead, etc)
  • The message rate is very high-volume for Mozilla's Bugzilla, as you can imagine
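
The "elevation" logic mentioned in the list above could look roughly like this in the producer. This is only a sketch: the function name is hypothetical, and the single CLOSED rule is just the example given above, not a full list of planned rules.

```python
def routing_key_for_change(field, new_value):
    """Return the routing key for a change to one Bugzilla field.

    Hypothetical helper for illustration; not the real producer code.
    """
    # A resolution change whose new value is CLOSED is promoted to a
    # dedicated key so consumers can bind to it directly.
    if field == "resolution" and new_value == "CLOSED":
        return "bug.closed"
    # Everything else keeps the generic bug.changed.[field] form.
    return "bug.changed.%s" % field
```

Keeping this in the producer means a consumer that only cares about closed bugs can bind to bug.closed and never has to inspect the message data itself.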

org.mozilla.exchange.hg (topic)

  • All hg.mozilla.org messages are routed in here. HG is the producer, with permissions of ".*hg" ".*hg" ".*hg". That is, the HG producer can do anything to the HG exchange
  • The message routing key hierarchy looks like hg.mozilla.central.repo.[opened/closed], hg.releases.mozilla.1.9.2.[commit/push], etc
  • The message rate is not that high-volume, though when watching all repositories it could be a bit bursty
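
To see how a permission string such as ".*hg" lines up with an exchange name, here is a small sketch. It assumes the broker treats each permission as a regular expression searched against the resource name without anchoring; that is my understanding, not something confirmed here. The helper name is made up, and the special handling of the empty pattern is deliberately not modeled.

```python
import re


def permits(pattern, resource_name):
    """Return True if a permission regex covers the given resource name.

    Assumption: the pattern is applied as an unanchored search.
    """
    return re.search(pattern, resource_name) is not None


HG_PATTERN = ".*hg"
print(permits(HG_PATTERN, "org.mozilla.exchange.hg"))        # True
print(permits(HG_PATTERN, "org.mozilla.exchange.bugzilla"))  # False
print(permits(".*", "org.mozilla.exchange.build"))           # True
```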

org.mozilla.exchange.build (topic)

  • All build.mozilla.org messages are routed in here. Buildbot is the producer, with permissions of ".*build" ".*build" ".*build". That is, the Buildbot producer can do anything to the build exchange
  • This is currently experimental and the routing keys haven't been figured out to provide the most value
  • Very high-volume, though less so than the Bugzilla exchange
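
Since all three exchanges are topic exchanges, consumers pick messages by binding with a pattern over the dot-separated routing key. The sketch below follows the matching rules as the AMQP spec describes them ("*" stands for exactly one word, "#" for zero or more words). It is an illustration of those rules, not the broker's actual implementation, and the function names are my own.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP-style topic pattern matches a routing key."""
    return _match(pattern.split("."), routing_key.split("."))


def _match(pat, key):
    # An exhausted pattern only matches an exhausted key.
    if not pat:
        return not key
    head, rest = pat[0], pat[1:]
    if head == "#":
        # "#" may absorb zero or more words, so try every split point.
        return any(_match(rest, key[i:]) for i in range(len(key) + 1))
    if not key:
        return False
    # "*" matches any single word; otherwise the words must be equal.
    if head == "*" or head == key[0]:
        return _match(rest, key[1:])
    return False


print(topic_matches("bug.changed.*", "bug.changed.resolution"))  # True
print(topic_matches("bug.*", "bug.changed.resolution"))          # False
print(topic_matches("bug.#", "bug.changed.resolution"))          # True
```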

Consumers

These were my general goals for consumers:

  1. Be as simple as possible so people can start playing with pulse, proving the idea and getting some momentum
  2. I do not want to be the bottleneck for experimentation, so no user accounts or administration tasks necessary to just consume messages
  3. Users writing consumers should not need to learn about any of the underlying message broker terminology or technology
  4. Users could be running consumers on their local machines, and when they reconnect all the messages they missed should be there waiting (they could clear the old messages or process them depending on their needs)

Because of those, I came up with the following plan:

  1. Create a user named public with a password of public and permissions of "" "" ".*", which as far as I know means the user can read from anything but not write or create. The public user can still write and create server-created resources, which means when it asks for the foo queue, the server will create it if it doesn't exist and public will then only have access to read from it
  2. Create a trivial shim library in Python on top of carrot to abstract out the message broker bits and help Mozilla-specific consumers get up and running quickly
  3. Make sure people testing set a unique string for their applabel, which means their queue will be unique and message delivery will not fall back to round-robin between different people
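
The round-robin concern in the last item can be shown with a toy model. This is not broker code: the function and the queue and consumer names are invented for illustration, and it assumes that every bound queue gets its own copy of each message while consumers sharing one queue take turns.

```python
from itertools import cycle


def deliver(messages, queues):
    """Simulate delivery; `queues` maps queue name -> list of consumers."""
    received = {}
    for consumers in queues.values():
        for consumer in consumers:
            received.setdefault(consumer, [])
        # Within one queue, messages are handed out in turn.
        for message, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(message)
    return received


messages = ["m1", "m2", "m3", "m4"]

# Two people who picked the same applabel end up on the same queue.
print(deliver(messages, {"shared": ["alice", "bob"]}))
# {'alice': ['m1', 'm3'], 'bob': ['m2', 'm4']}

# Unique applabels give each person a queue of their own.
print(deliver(messages, {"alice-q": ["alice"], "bob-q": ["bob"]}))
# {'alice': ['m1', 'm2', 'm3', 'm4'], 'bob': ['m1', 'm2', 'm3', 'm4']}
```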

So, seemed like a good plan, right? And it worked! Until...

Issues

Deleting unused queues

It became clear that people (myself included) had created some queues and then later switched to a different queue. The old queues were sitting there accumulating messages which would never be consumed. I went to delete the queues and... rabbitmqctl doesn't have a delete queue command. Darn. OK, I have the BQL plugin installed, so not a huge deal to pop in and delete them through that, but it seems odd this functionality is missing.

Running out of memory with old persister

There were some bugs in the Bugzilla producer which caused messages to be extremely throttled. I fixed them and immediately the broker ran out of memory and fell over. This was because there were 10 or so queues that weren't having messages actively consumed, each with ~1000 messages. I didn't see this in testing because all my testing consumers were running and consuming the messages that were sent without any buildup. Additionally, the server is running on a VM (it's a prototype after all) which doesn't have a bunch of memory to begin with.

I tried to connect to the queues with a Python consumer (using carrot) to drain them, but everything just hung. I could not drain the queues and unblock the server, which meant I couldn't write an administration script that removed 500 messages from any queue with more than 500 un-acked messages.
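
For what it's worth, the decision part of that administration script is simple. Here is a sketch, assuming the per-queue message counts have already been collected somehow (for example from rabbitmqctl's queue listing). The function name and the queue names are hypothetical, and the actual draining, which is the part that hung, is not shown.

```python
def plan_drain(queue_depths, threshold=500, batch=500):
    """Return {queue name: number of messages to remove}.

    Only queues holding more than `threshold` messages are included,
    and at most `batch` messages are scheduled for removal from each.
    """
    return {
        name: min(batch, depth)
        for name, depth in queue_depths.items()
        if depth > threshold
    }


# Hypothetical queue names and depths.
print(plan_drain({"old-test-queue": 1000, "active-queue": 12}))
# {'old-test-queue': 500}
```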

Reading around, a lot of people are running into this problem. The good news is that the new persister is supposed to fix it, though it isn't quite done yet. It looks like the new persister is in QA and many people on the mailing lists are running it, so I decided to take the plunge on this prototype system.

Incompatibilities between RabbitMQ 1.7.x and 1.8.x

The prototype pulse system was running RabbitMQ 1.7.x and everything was working well (except for the out-of-memory bit above). To get the new persister, I had to update to 1.8 (as the latest persister branch is 1.8-based). I decided to upgrade to the 1.8 release and make sure everything else still worked before adding the additional layer of pre-release code on top. This is what I did:

  1. Downloaded rabbitmq-public-umbrella
  2. Compiled, installed, and then activated some plugins

I deleted the old persister log, started the server, and immediately found an issue.

The public user couldn't seem to create queues anymore. Darn, that meant people wouldn't be able to use my shim lib. Reading around, it looked like it could be caused by having a 1.7.x data directory with 1.8.x, so I deleted the whole data directory and let RabbitMQ recreate it. I then built up the exchanges, users, and permissions exactly as before. The problem was still there.

So, it looks like the RabbitMQ change to the new AMQP semantics in 1.8 broke what I was doing. Apparently, it is no longer possible to have a read-only user create a queue. I guess this makes sense, though it was my (naive) understanding that automatic queue creation was built into the AMQP spec. That is, the read-only user is requesting it, and if it exists it is handed back to the user, otherwise the server creates it on their behalf. Perhaps this is a bug?

In any case, I opened up the permissions for the public user (this is a prototype system with no real users, remember).

Running out of memory with new persister

I decided to take the plunge and make sure the new persister fixed my memory issue before pursuing the permissions issue. This is roughly what I did to upgrade:

  1. Downloaded rabbitmq-public-umbrella
  2. Downloaded the new persister branch
  3. Replaced rabbitmq-server in rabbitmq-public-umbrella with the persister branch
  4. Compiled, installed, and then activated some plugins

I then created some queues, started up the Bugzilla producer, and sent thousands of messages through. RabbitMQ fell over again, as far as I can tell with the same problem. I deleted the whole data directory and let RabbitMQ recreate it. I then built up the exchanges, users, and permissions exactly as before. And it still ran out of memory.

Questions

  1. Are people successfully running the new persister for RabbitMQ?
  2. Do I need to explicitly turn on the new persister when using the new persister branch? If so, how? There are (understandably) no docs that I can find.
  3. Am I setting up the exchanges, queues, and vhosts wrong? As far as I can tell everything was working great before the OOM stuff and the 1.8 semantic changes.
  4. Is there a better way to structure what I want to do?
  5. Is my use-case not supported by RabbitMQ? That would be odd, as this seems like the exact use case that message brokers were made to solve. Do other brokers support what I want?

25 Jun 2010

Reminder: Firefox 3.6.6 and 3.5.11 code freeze is TONIGHT @ 11:59 pm Pacific

Just a reminder that code freeze for Firefox 3.6.6 and 3.5.11 is TONIGHT @ 11:59 pm Pacific time.

If you have any bugs in these queries, we need your attention on them ASAP (if you aren't in the critical path for Firefox 4 beta of course):

If you don't think one of your bugs should be blocking, please say so in the bug or email me directly.

If you have any bugs in these queries, your patch needs to be landed:

These bugs have the checkin-needed keyword and it would be great for other people to land them:

Ehsan and Reed may beat you to those checkin-needed bugs though!