diff --git a/deps/rabbitmq_federation/README-hacking b/deps/rabbitmq_federation/README-hacking
new file mode 100644
index 0000000000..6432552fe3
--- /dev/null
+++ b/deps/rabbitmq_federation/README-hacking
@@ -0,0 +1,143 @@
+This file is intended to tell you How It All Works, concentrating on
+the things you might not expect.
+
+The theory
+==========
+
+The 'x-federation' exchange is defined in
+rabbit_federation_exchange. This starts up a bunch of link processes
+(one for each upstream; sketched below) which:
+
+ * Connect to the upstream broker
+ * Create a queue and bind it to the upstream exchange
+ * Keep bindings in sync with the downstream exchange
+ * Consume messages from the upstream queue and republish them to the
+ downstream exchange (matching confirms with acks)
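+
+In sketch form, using the Erlang AMQP client directly (the real link
+is a more elaborate gen_server; these names are illustrative):
+
+  %% Assumes -include_lib("amqp_client/include/amqp_client.hrl").
+  start_link_sketch(UpstreamParams, UpX) ->
+      {ok, Conn} = amqp_connection:start(UpstreamParams),
+      {ok, Ch}   = amqp_connection:open_channel(Conn),
+      %% A private queue upstream buffers messages for this link.
+      #'queue.declare_ok'{queue = Q} =
+          amqp_channel:call(Ch, #'queue.declare'{durable = true}),
+      %% One such binding per downstream binding (kept in sync later).
+      amqp_channel:call(Ch, #'queue.bind'{queue       = Q,
+                                          exchange    = UpX,
+                                          routing_key = <<"a-key">>}),
+      %% Deliveries then arrive as Erlang messages for republishing.
+      amqp_channel:subscribe(Ch, #'basic.consume'{queue = Q}, self()),
+      {Conn, Ch, Q}.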
+
+Each link process monitors the connections / channels it opens, and
+dies if they die. We use supervisor2 to ensure that we get some
+backoff when restarting.
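+
+supervisor2 extends the stock OTP supervisor with delayed restarts; a
+child spec along these lines provides the backoff (the module name and
+delay are illustrative):
+
+  init([]) ->
+      %% {transient, 5}: restart the link on abnormal exit, but wait
+      %% 5 seconds first, so a broken upstream is not redialled in a
+      %% tight loop.
+      {ok, {{one_for_one, 3, 10},
+            [{link, {federation_link_sketch, start_link, []},
+              {transient, 5}, 5000, worker, [federation_link_sketch]}]}}.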
+
+We use process groups to identify all link processes for a certain
+exchange, as well as all link processes together.
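+
+With pg2, the process group module of the day, that looks roughly
+like this (the group names are illustrative):
+
+  %% Every link joins one global group plus one group per exchange,
+  %% so either population can be enumerated or messaged later.
+  ok = pg2:create(federation_links),
+  ok = pg2:create({federation_links, XName}),
+  ok = pg2:join(federation_links, self()),
+  ok = pg2:join({federation_links, XName}, self()),
+  Links = pg2:get_members({federation_links, XName}).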
+
+However, there are a bunch of wrinkles:
+
+
+Wrinkle: The exchange will be recovered when the Erlang client is not available
+===============================================================================
+
+Exchange recovery happens within the rabbit application - therefore,
+at the time the exchange is recovered, we can't make any connections,
+since the amqp_client application has not yet started. Each link
+therefore begins in a 'not_started' state: when it is created it
+checks whether the rabbitmq_federation application is running; if so
+it starts fully, and if not it stays 'not_started'. When
+rabbitmq_federation starts, it sends a 'go' message to all links,
+prodding them to bring up the link.
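+
+In gen_server terms the dance looks something like this (fragments
+only; start_fully/1 is a stand-in for the real startup code):
+
+  init(Args) ->
+      %% amqp_client is guaranteed to be up once rabbitmq_federation
+      %% is running, so use the latter as the signal to start fully.
+      case lists:keymember(rabbitmq_federation, 1,
+                           application:which_applications()) of
+          true  -> {ok, start_fully(Args)};
+          false -> {ok, {not_started, Args}}
+      end.
+
+  %% 'go' is cast by the application at startup, e.g. to each member
+  %% of the links process group.
+  handle_cast(go, {not_started, Args}) ->
+      {noreply, start_fully(Args)};
+  handle_cast(go, State) ->
+      {noreply, State}.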
+
+
+Wrinkle: On reconnect we want to assert bindings atomically
+===========================================================
+
+If the link goes down for whatever reason, then by the time it comes
+up again the bindings downstream may no longer be in sync with those
+upstream. Therefore on link establishment we want to ensure that a
+certain set of bindings exists. (Of course bringing up a link for the
+first time is a simple case of this.) And we want to do this with AMQP
+methods. But if we were to tear down all bindings and recreate them,
+we would have a time period when messages would not be forwarded for
+bindings that *do* still exist before and after.
+
+We use exchange-to-exchange bindings to work around this:
+
+We bind the upstream exchange (X) to the upstream queue (Q) via an
+internal fanout exchange (IXA), with routing keys R1 and R2, like so:
+
+ X----R1,R2--->IXA---->Q
+
+This has the same effect as binding the queue to the exchange directly.
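+
+In AMQP methods this is roughly as follows, exchange.bind being the
+RabbitMQ exchange-to-exchange binding extension (bind_via/5 is a
+hypothetical helper, not the module's real API):
+
+  bind_via(Ch, X, IX, Q, Keys) ->
+      %% The internal fanout is invisible to ordinary clients.
+      amqp_channel:call(Ch, #'exchange.declare'{exchange = IX,
+                                                type     = <<"fanout">>,
+                                                internal = true}),
+      %% One exchange-to-exchange binding per routing key.
+      [amqp_channel:call(Ch, #'exchange.bind'{source      = X,
+                                              destination = IX,
+                                              routing_key = K})
+       || K <- Keys],
+      amqp_channel:call(Ch, #'queue.bind'{queue = Q, exchange = IX}).
+
+  %% Initial setup: X----R1,R2--->IXA---->Q
+  bind_via(Ch, X, IXA, Q, [<<"R1">>, <<"R2">>]).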
+
+Now imagine the link has gone down, and is about to be
+reestablished. In the meantime, routing has changed downstream so
+that we now want routing keys R1 and R3. On link reconnection we can
+create and bind another internal fanout exchange IXB:
+
+ X----R1,R2--->IXA---->Q
+ | ^
+ | |
+ \----R1,R3--->IXB-----/
+
+and then delete the original exchange IXA:
+
+ X Q
+ | ^
+ | |
+ \----R1,R3--->IXB-----/
+
+This means that messages matching R1 are always routed during the
+switchover. Messages for R3 will start being routed as soon as we bind
+the second exchange, and messages for R2 will be stopped in a timely
+way. Of course this could lag the downstream situation somewhat, in
+which case some R2 messages will get thrown away downstream since they
+are unroutable. However this lag is inevitable when the link goes
+down.
+
+This means that the downstream only needs to keep track of whether the
+upstream is currently going via internal exchange A or B. This is
+held in the exchange scratch space in Mnesia.
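+
+Using the bind_via/5 sketch from above, the whole switchover is then
+just (the scratch-space update is left as a comment):
+
+  %% Bring up IXB with the new key set before touching IXA, so that
+  %% routing for R1 is never interrupted.
+  bind_via(Ch, X, IXB, Q, [<<"R1">>, <<"R3">>]),
+  amqp_channel:call(Ch, #'exchange.delete'{exchange = IXA}).
+  %% ...then record "now going via B" in the exchange scratch space.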
+
+
+Wrinkle: We need to amalgamate bindings
+=======================================
+
+Since we only bind to one exchange upstream, but the downstream
+exchange can be bound to many queues, we can have duplicated bindings
+downstream (same source, routing key and args but different
+destination) that cannot be duplicated upstream (since the destination
+is the same). The link therefore maintains a mapping of (Key, Args) to
+set(Dest). Duplicated bindings do not get repeated upstream, and are
+only unbound upstream when the last one goes away downstream.
+
+Furthermore, this works as an optimisation, since it tends to reduce
+upstream binding count and churn.
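+
+A sketch of that bookkeeping, where bind_upstream/1 and
+unbind_upstream/1 stand in for the real AMQP calls on the upstream
+channel:
+
+  add_binding(KA, Dest, Map) ->
+      case maps:find(KA, Map) of
+          error ->
+              bind_upstream(KA),        %% first destination: bind
+              Map#{KA => sets:from_list([Dest])};
+          {ok, Dests} ->                %% already bound upstream
+              Map#{KA => sets:add_element(Dest, Dests)}
+      end.
+
+  remove_binding(KA, Dest, Map) ->
+      Dests = sets:del_element(Dest, maps:get(KA, Map)),
+      case sets:size(Dests) of
+          0 -> unbind_upstream(KA),     %% last destination: unbind
+               maps:remove(KA, Map);
+          _ -> Map#{KA => Dests}
+      end.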
+
+
+Wrinkle: We may receive binding events out of order
+===================================================
+
+The rabbit_federation_exchange callbacks are invoked by channel
+processes within rabbit. Therefore they can be executed concurrently,
+and can arrive at the link processes in an order that does not
+correspond to the wall clock.
+
+We need to keep the state of the link in sync with Mnesia. Therefore
+not only do we need to impose an ordering on these events, we need to
+impose Mnesia's ordering on them. We therefore added a function to the
+callback interface, serialise_events. When this returns true, the
+callback mechanism inside rabbit increments a per-exchange counter
+within an Mnesia transaction, and returns the value as part of the
+add_binding and remove_binding callbacks. The link process then queues
+up these events, and replays them in order. The link process's state
+thus always follows Mnesia (it may be delayed, but the effects happen
+in the same order).
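+
+The replay buffer can be sketched like this, with apply_event/1
+standing in for actually acting on an add_binding / remove_binding
+event:
+
+  %% State is {NextSerial, Pending}; Pending maps serial -> event.
+  event(Serial, Event, {Next, Pending}) ->
+      replay({Next, gb_trees:insert(Serial, Event, Pending)}).
+
+  replay({Next, Pending} = State) ->
+      case gb_trees:is_empty(Pending) of
+          true  -> State;
+          false ->
+              case gb_trees:smallest(Pending) of
+                  {Next, Event} ->      %% the one we were waiting for
+                      apply_event(Event),
+                      replay({Next + 1,
+                              gb_trees:delete(Next, Pending)});
+                  _ ->                  %% a later serial; keep waiting
+                      State
+              end
+      end.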
+
+
+Other issues
+============
+
+Since links are implemented in terms of AMQP, link failure may cause
+messages to be redelivered. If you're unlucky this could lead to
+duplication.
+
+Message duplication can also happen with some topologies. In some
+cases it may not be possible to set max_hops such that messages arrive
+once at every node.
+
+While we correctly order bind / unbind events, we don't do the same
+thing for exchange creation / deletion. (This is harder - if you
+delete and recreate an exchange with the same name, is it the same
+exchange? What if its type changes?) This would only be an issue if
+exchanges churn rapidly; however, we could get into a state where
+Mnesia sees create-delete-create-delete (CDCD) while we see CDDC, and
+leave a link process running when we shouldn't.