I did a lightning talk at the Mozilla Summit about my pet infrastructure project, Mozilla Pulse. I’ll be talking about it in more depth in a future blog post. This post is more a call for help from message broker experts.

I’ve been running into issues with RabbitMQ (the Erlang message broker that Pulse runs on). I griped a little on Twitter and got some responses, so I decided to write a more in-depth description of what I am running into. I’m not going to explain any message-broker-specific terminology, so feel free to skip this post if you don’t know what I am talking about. None of this should matter if you just want to use Pulse in the future.

The general idea of using a message broker at Mozilla is to make it easy to build useful tools on top of our infrastructure, with the infrastructure (producers) loosely coupled from the tools (consumers). With that in mind, I came up with the following configuration for an initial prototype:

Exchanges

org.mozilla.exchange.bugzilla (topic)

  • All Bugzilla messages are routed in here. Bugzilla is the producer, with permissions of “.*bugzilla” “.*bugzilla” “.*bugzilla” (configure, write, and read); that is, the Bugzilla producer can do anything to the Bugzilla exchange
  • The message routing key hierarchy looks like bug.added, bug.changed.[field], etc.
  • The plan was to add more keys by putting logic in the producer (for example, a bug.changed.resolution message whose data is CLOSED would be elevated to bug.closed instead); a rough sketch of this follows the list
  • The message rate is very high for Mozilla’s Bugzilla, as you can imagine
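
To make that concrete, here is a rough, untested sketch of what the producer side could look like in Python with carrot (the library the consumer shim below is built on). The host, credentials, and message fields are made-up placeholders, and the real producer may not look like this at all; only the exchange name, routing keys, and the CLOSED elevation rule come from the description above.

    from carrot.connection import BrokerConnection
    from carrot.messaging import Publisher

    # Hypothetical connection details; the real Pulse host and credentials differ.
    conn = BrokerConnection(hostname="pulse.example.org", port=5672,
                            userid="bugzilla", password="secret", virtual_host="/")

    def publish_bug_change(bug_id, field, value):
        # Default routing key follows the bug.changed.[field] hierarchy.
        routing_key = "bug.changed.%s" % field
        # The "elevation" idea from above: a resolution change to CLOSED gets
        # published under a more specific key instead.
        if field == "resolution" and value == "CLOSED":
            routing_key = "bug.closed"
        publisher = Publisher(connection=conn,
                              exchange="org.mozilla.exchange.bugzilla",
                              exchange_type="topic",
                              routing_key=routing_key)
        publisher.send({"bug_id": bug_id, "field": field, "value": value})
        publisher.close()

    publish_bug_change(12345, "resolution", "CLOSED")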

org.mozilla.exchange.hg (topic)

  • All hg.mozilla.org messages are routed in here. HG is the producer, with permissions of “.*hg” “.*hg” “.*hg” (configure, write, and read); that is, the HG producer can do anything to the HG exchange
  • The message routing key hierarchy looks like hg.mozilla.central.repo.[opened/closed], hg.releases.mozilla.1.9.2.[commit/push], etc.; see the binding sketch after this list
  • The message rate is not that high, though when watching all repositories it can be a bit bursty
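
As an illustration of that hierarchy, a consumer interested only in mozilla-central tree state could bind with a topic wildcard. This is an untested carrot sketch; the queue name, connection details, and message contents are made up, and only the exchange name and routing keys come from the setup above.

    from carrot.connection import BrokerConnection
    from carrot.messaging import Consumer

    conn = BrokerConnection(hostname="pulse.example.org", port=5672,
                            userid="public", password="public", virtual_host="/")

    # "*" matches exactly one routing-key segment, so this sees only
    # hg.mozilla.central.repo.opened and hg.mozilla.central.repo.closed;
    # "hg.releases.#" would match everything under the release repositories.
    consumer = Consumer(connection=conn,
                        queue="example.tree-watcher",   # hypothetical queue name
                        exchange="org.mozilla.exchange.hg",
                        exchange_type="topic",
                        routing_key="hg.mozilla.central.repo.*")

    def on_message(message_data, message):
        print message_data
        message.ack()

    consumer.register_callback(on_message)
    consumer.wait()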

org.mozilla.exchange.build (topic)

  • All build.mozilla.org messages are routed in here. Buildbot is the producer, with permissions of “.*build” “.*build” “.*build” (configure, write, and read); that is, the Buildbot producer can do anything to the build exchange
  • This is currently experimental, and the routing keys haven’t yet been worked out to provide the most value
  • Very high volume, though less so than the Bugzilla exchange

Consumers

These were my general goals for consumers:

  1. Be as simple as possible so people can start playing with pulse, proving the idea and getting some momentum
  2. I do not want to be the bottleneck for experimentation, so no user accounts or administration tasks should be necessary just to consume messages
  3. Users writing consumers should not need to learn about any of the underlying message broker terminology or technology
  4. Users could be running consumers on their local machines, and when they reconnect all the messages they missed should be there waiting (they could clear the old messages or process them depending on their needs)

Because of those, I came up with the following plan:

  1. Create a user named public with a password of public and permissions of “” “” “.*” (configure, write, and read), which as far as I know means the user can read from anything but not write or create anything. The public user can still write to and create server-created resources, which means that when it asks for the foo queue, the server will create it if it doesn’t exist, and public will then only have access to read from it
  2. Create a trivial shim library in python on top of carrot to abstract out the message broker bits and help Mozilla-specific consumers get up and running quickly
  3. Make sure people testing set a unique string for their applabel, which means their queue will be unique and message delivery will not fall back to round-robin between different people (a rough sketch of such a consumer follows this list)
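
To give a flavour of what the shim is meant to hide, here is a rough, untested sketch of the sort of wrapper I have in mind. The class name, applabel-to-queue mapping, and connection details are all hypothetical, not the shim’s actual API.

    from carrot.connection import BrokerConnection
    from carrot.messaging import Consumer

    class PulseConsumer(object):
        """Hypothetical wrapper: hides the broker details behind an applabel."""

        def __init__(self, exchange, routing_key, applabel, callback):
            self.connection = BrokerConnection(hostname="pulse.example.org",
                                               port=5672, userid="public",
                                               password="public", virtual_host="/")
            # A unique applabel gives each person their own queue, so delivery
            # is not round-robined between different people, and a durable,
            # non-auto-delete queue keeps missed messages around across
            # reconnects (goal 4 above).
            self.consumer = Consumer(connection=self.connection,
                                     queue="consumer.%s" % applabel,
                                     exchange=exchange,
                                     exchange_type="topic",
                                     routing_key=routing_key,
                                     durable=True,
                                     auto_delete=False)
            self.consumer.register_callback(callback)

        def listen(self):
            self.consumer.wait()

    def got_bug_message(message_data, message):
        print message_data
        message.ack()

    PulseConsumer(exchange="org.mozilla.exchange.bugzilla",
                  routing_key="bug.#",
                  applabel="my-test-app",
                  callback=got_bug_message).listen()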

So, seemed like a good plan, right? And it worked! Until…

Issues

Deleting unused queues

It became clear that people (myself included) had created some queues and then later switched to different queues. The old queues were sitting there accumulating messages that would never be consumed. I went to delete the queues and… rabbitmqctl doesn’t have a delete-queue command. Darn. OK, I have the BQL plugin installed, so it’s not a huge deal to pop in and delete them through that, but it seems odd that this functionality is missing.
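
(For what it’s worth, a queue can also be deleted over AMQP itself from a small script. This is an untested py-amqplib sketch with made-up connection details and queue name, roughly equivalent to what I did through BQL.)

    from amqplib import client_0_8 as amqp

    # Hypothetical admin connection and queue name.
    conn = amqp.Connection(host="pulse.example.org:5672",
                           userid="admin", password="secret", virtual_host="/")
    chan = conn.channel()

    # queue.delete removes the queue and discards any messages still in it.
    chan.queue_delete(queue="some.abandoned.queue")

    chan.close()
    conn.close()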

Running out of memory with old persister

There were some bugs in the Bugzilla producer which caused messages to be extremely throttled. I fixed them, and immediately the broker ran out of memory and fell over. This was because there were ten or so queues that weren’t having messages actively consumed, each with ~1000 messages. I didn’t see this in testing because all my testing consumers were running and consuming messages as they were sent, without any buildup. Additionally, the server is running on a VM (it’s a prototype, after all), which doesn’t have much memory to begin with.

I tried to connect to the queues with a Python consumer (using carrot) to drain them, but everything just hung. I could not drain the queues and unblock the server, which meant I couldn’t even write an administration script that removed 500 messages from any queue with > 500 unacked messages.
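
For reference, the kind of administration script I had in mind is roughly the following, using basic.get to pull and acknowledge messages one at a time. It is an untested sketch with made-up connection details and queue name, and with the hung connections it never got anywhere.

    from amqplib import client_0_8 as amqp

    conn = amqp.Connection(host="pulse.example.org:5672",
                           userid="admin", password="secret", virtual_host="/")
    chan = conn.channel()

    # Pull and acknowledge up to 500 messages from a backed-up queue.
    for _ in xrange(500):
        msg = chan.basic_get(queue="some.backed.up.queue")
        if msg is None:
            break  # queue is already drained
        chan.basic_ack(msg.delivery_tag)

    chan.close()
    conn.close()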

Reading around, a lot of people are running into this problem. The good news is that the new persister is supposed to fix it, though it isn’t quite done yet. It looks like the new persister is in QA and many people on the mailing lists are running it, so I decided to take the plunge on this prototype system.

Incompatibilities between RabbitMQ 1.7.x and 1.8.x

The prototype pulse system was running RabbitMQ 1.7.x and everything was working well (except for the out of memory bit above). To get the new persister, I had to update to 1.8 (as the latest persister branch is 1.8 based). I decided to upgrade to 1.8 release and make sure everything else still worked before adding the additional layer of pre-release code on top. This is what I did:

  1. Downloaded rabbitmq-public-umbrella
  2. Compiled, installed, and then activated some plugins

I deleted the old persister log, started the server, and immediately found an issue.

The public user couldn’t seem to create queues anymore. Darn, that meant people wouldn’t be able to use my shim lib. Reading around, it looked like it could be caused by having a 1.7.x data directory with 1.8.x, so I deleted the whole data directory and let RabbitMQ recreate it. I then built up the exchanges, users, and permissions exactly as before. The problem was still there.

So, it looks like the RabbitMQ change to the new AMQP semantics in 1.8 broke what I was doing. Apparently, it is no longer possible to have a read-only user create a queue. I guess this makes sense, though it was my (naive) understanding that automatic queue creation was built into the AMQP spec. That is, the read-only user is requesting it, and if it exists it is handed back to the user, otherwise the server creates it on their behalf. Perhaps this is a bug?

In any case, I opened up the permissions for the public user (this is a prototype system with no real users, remember).

Running out of memory with new persister

I decided to take the plunge and make sure the new persister fixed my memory issue before pursuing the permissions issue. This is roughly what I did to upgrade:

  1. Downloaded rabbitmq-public-umbrella
  2. Downloaded the new persister branch
  3. Replaced rabbitmq-server in rabbitmq-public-umbrella with the persister branch
  4. Compiled, installed, and then activated some plugins

I then created some queues, started up the Bugzilla producer, and sent thousands of messages through. RabbitMQ fell over again, as far as I can tell with the same problem. I deleted the whole data directory and let RabbitMQ recreate it. I then built up the exchanges, users, and permissions exactly as before. And it still ran out of memory.

Questions

  1. Are people successfully running the new persister for RabbitMQ?
  2. Do I need to explicitly turn on the new persister when using the new persister branch? If so, how? There are (understandably) no docs that I can find.
  3. Am I setting up the exchanges, queues, and vhosts wrong? As far as I can tell everything was working great before the OOM stuff and the 1.8 semantic changes.
  4. Is there a better way to structure what I want to do?
  5. Is my use-case not supported by RabbitMQ? That would be odd, as this seems like the exact use case that message brokers were made to solve. Do other brokers support what I want?

One Response to Mozilla Pulse and RabbitMQ

  1. Hi Christian,

    The current default branch is regularly merged into the new persister branch, so whilst it is correct to say the new persister is currently based off the 1.8 release, that’s only true in that sense. The new persister stores messages in a completely different format from the old one, and there is currently no tool to allow upgrading from a released version of Rabbit to the new persister without losing persistent messages.

    The issue you ran into when going from 1.7 to 1.8 is subtly different. Whilst both use the old persister, the on-disk format of messages and the database schema both changed between those versions, again resulting in no state-maintaining upgrade path. To date we have never produced a tool which can do upgrades maintaining state when the database schema or on-disk formats have changed.

    You talk about users creating queues. I think that what you want is for all users to use queues which have server-generated names, thus guaranteeing they are private, and to declare them “exclusive”, which means that when the connection that created the queue disappears, the queue itself (and any bindings to the queue) also automatically gets deleted.

    I quote this text from http://www.rabbitmq.com/admin-guide.html#access-control:

    “Some AMQP operations can create resources with server-generated names. Every user has configure, write and read permissions for such resources. However, the names are strong and not discoverable as part of the protocol, only through management functionality. Therefore these resources are in effect private to the user unless they choose to divulge their names to other users.”

    Thus I think that if you force users to create server-named queues, you don’t need to grant any privileges to your public user. It’ll need read access to the exchange to create the binding, and it should automatically have write access to create the binding to the private queues. If this doesn’t work please let us know.
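
    For example, with py-amqplib the pattern looks roughly like this (an untested sketch; only the exchange name is taken from your post):

        from amqplib import client_0_8 as amqp

        conn = amqp.Connection(host="pulse.example.org:5672",
                               userid="public", password="public",
                               virtual_host="/")
        chan = conn.channel()

        # An empty queue name asks the server to generate one; exclusive means
        # the queue (and its bindings) are deleted when this connection closes.
        queue_name, _, _ = chan.queue_declare(queue="", exclusive=True)
        chan.queue_bind(queue=queue_name,
                        exchange="org.mozilla.exchange.bugzilla",
                        routing_key="bug.#")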

    The 1.8 semantic changes concern what happens when you *re*declare a queue. Previously, if the queue already existed and you redeclared it with different attributes, it would still come back with an OK result. This is misleading because it could lead the user to think that a queue had been created with the specified attributes when in fact it had not. Thus you must now ensure you redeclare with the same attributes, otherwise the redeclaration will fail and close the channel. Full details can be found in the lower half of http://lists.rabbitmq.com/pipermail/rabbitmq-announce/2010-June/000025.html
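
    Concretely, something like the following (an untested sketch with made-up connection details and queue name) now fails on the second declare and closes the channel, rather than silently returning OK:

        from amqplib import client_0_8 as amqp

        conn = amqp.Connection(host="pulse.example.org:5672",
                               userid="public", password="public",
                               virtual_host="/")
        chan = conn.channel()

        # First declaration creates the queue as durable.
        chan.queue_declare(queue="example.queue", durable=True, auto_delete=False)

        # Redeclaring with different attributes used to come back with OK anyway;
        # as of 1.8 the broker rejects it and closes the channel.
        chan.queue_declare(queue="example.queue", durable=False, auto_delete=False)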

    I am very curious about you managing to get the new persister to crash. I’ve been on holiday for the last week and have not caught up with the rabbitmq-discuss mailing list since getting back, so I have no idea whether you’ve posted everything there too. If not, I’d request you do so, and ideally include the various rabbit logs showing the crash – the new persister just *should not* crash.

    One thing that might be happening is that Rabbit is raising flow control to request that publishers stop sending further messages to Rabbit – even with the new persister this can happen sometimes to allow disks to catch up, but it tends to only be necessary at high data rates. The client must respond with a flow_ok message to the broker to confirm it understands the flow control, and it must then not send any further messages – this is usually handled by the AMQP client library, which simply makes any subsequent publishes block – until it receives a further flow control message from the broker, informing it that it can resume. Now, I notice you’re using a Python client, and they have historically not supported flow control, which can lead to Rabbit forcibly disconnecting clients that do not respond appropriately to the flow control messages.

    You have a very interesting use case, and there is absolutely nothing about it that shouldn’t work perfectly well with RabbitMQ. I’ll try to catch up with the mailing list early next week, but in the meantime, please try to dig through the logs, if you’ve not already done so, to see if you can find some sort of stack trace with the crash in it, if it’s there at all, or whether it’s some other issue.