author    Matthew Sackman <matthew@rabbitmq.com>  2011-04-10 12:53:52 +0100
committer Matthew Sackman <matthew@rabbitmq.com>  2011-04-10 12:53:52 +0100
commit    700c770bfd458bb63bc1ade77c2ce4a3f6bb3811 (patch)
tree      a1f04d3c11022f8a5b0107b0402e08ae38d1aeb6
parent    3fe72bb890545f839265f3a2a00f889170ade6d1 (diff)
download  rabbitmq-server-git-700c770bfd458bb63bc1ade77c2ce4a3f6bb3811.tar.gz
Work on documentation of ha
-rw-r--r--  src/rabbit_mirror_queue_coordinator.erl | 146
-rw-r--r--  src/rabbit_mirror_queue_master.erl      |  40
2 files changed, 124 insertions(+), 62 deletions(-)
diff --git a/src/rabbit_mirror_queue_coordinator.erl b/src/rabbit_mirror_queue_coordinator.erl
index 84220a5b54..7e521e4997 100644
--- a/src/rabbit_mirror_queue_coordinator.erl
+++ b/src/rabbit_mirror_queue_coordinator.erl
@@ -60,47 +60,49 @@
%% ||
%% consumers
%%
-%% The master is merely an implementation of BQ, and thus is invoked
-%% through the normal BQ interface by the amqqueue_process. The slaves
+%% The master is merely an implementation of bq, and thus is invoked
+%% through the normal bq interface by the amqqueue_process. The slaves
%% meanwhile are processes in their own right (as is the
%% coordinator). The coordinator and all slaves belong to the same gm
%% group. Every member of a gm group receives messages sent to the gm
-%% group. Because the master is the BQ of amqqueue_process, it doesn't
+%% group. Because the master is the bq of amqqueue_process, it doesn't
%% have sole control over its mailbox, and as a result, the master
-%% itself cannot be passed messages directly, yet it needs to react to
-%% gm events, such as the death of slaves. Thus the master creates the
-%% coordinator, and it is the coordinator that is the gm callback
-%% module and event handler for the master.
+%% itself cannot be passed messages directly (well, it could be, via
+%% the amqqueue:run_backing_queue_async callback, but that would induce
+%% additional unnecessary loading on the master queue process), yet it
+%% needs to react to gm events, such as the death of slaves. Thus the
+%% master creates the coordinator, and it is the coordinator that is
+%% the gm callback module and event handler for the master.
%%
%% Consumers are only attached to the master. Thus the master is
%% responsible for informing all slaves when messages are fetched from
-%% the BQ, when they're acked, and when they're requeued.
+%% the bq, when they're acked, and when they're requeued.
%%
%% The basic goal is to ensure that all slaves perform actions on
-%% their BQ in the same order as the master. Thus the master
-%% intercepts all events going to its BQ, and suitably broadcasts
+%% their bqs in the same order as the master. Thus the master
+%% intercepts all events going to its bq, and suitably broadcasts
%% these events on the gm. The slaves thus receive two streams of
%% events: one stream is via the gm, and one stream is from channels
-%% directly. Note that whilst the stream via gm is guaranteed to be
-%% consistently seen by all slaves, the same is not true of the stream
-%% via channels. For example, in the event of an unexpected death of a
+%% directly. Whilst the stream via gm is guaranteed to be consistently
+%% seen by all slaves, the same is not true of the stream via
+%% channels. For example, in the event of an unexpected death of a
%% channel during a publish, only some of the mirrors may receive that
%% publish. As a result of this problem, the messages broadcast over
%% the gm contain published content, and thus slaves can operate
%% successfully on messages that they only receive via the gm. The key
%% purpose of also sending messages directly from the channels to the
%% slaves is that without this, in the event of the death of the
-%% master, messages can be lost until a suitable slave is promoted.
+%% master, messages could be lost until a suitable slave is promoted.
%%
-%% However, there are other reasons as well. For example, if confirms
-%% are in use, then there is no guarantee that every slave will see
-%% the delivery with the same msg_seq_no. As a result, the slaves have
-%% to wait until they've seen both the publish via gm, and the publish
+%% However, that is not the only reason. For example, if confirms are
+%% in use, then there is no guarantee that every slave will see the
+%% delivery with the same msg_seq_no. As a result, the slaves have to
+%% wait until they've seen both the publish via gm, and the publish
%% via the channel before they have enough information to be able to
-%% issue the confirm, if necessary. Either form of publish can arrive
-%% first, and a slave can be upgraded to the master at any point
-%% during this process. Confirms continue to be issued correctly,
-%% however.
+%% perform the publish to their own bq, and subsequently issue the
+%% confirm, if necessary. Either form of publish can arrive first, and
+%% a slave can be upgraded to the master at any point during this
+%% process. Confirms continue to be issued correctly, however.
%%
%% Because the slave is a full process, it impersonates parts of the
%% amqqueue API. However, it does not need to implement all parts: for
@@ -108,6 +110,106 @@
%% a slave from a channel: it is only publishes that go both
%% directly to the slaves and via gm.
%%
+%% Slaves can be added dynamically. When this occurs, there is no
+%% attempt made to sync the current contents of the master with the
+%% new slave: the slave starts empty, regardless of the state of
+%% the master. The slave therefore needs to be able to detect and ignore
+%% operations which are for messages it has not received: because of
+%% the strict FIFO nature of queues in general, this is
+%% straightforward - all new publishes that the new slave receives via
+%% gm should be processed as normal, but fetches which are for
+%% messages the slave has never seen should be ignored. Similarly,
+%% acks for messages the slave never fetched should be
+%% ignored. Eventually, as the master is consumed from, the messages
+%% at the head of the queue which were there before the slave joined
+%% will disappear, and the slave will become fully synced with the
+%% state of the master. The detection of the sync-status of a slave is
+%% done entirely based on length: if the slave and the master both
+%% agree on the length of the queue after the fetch of the head of the
+%% queue, then the queues must be in sync. The only other possibility
+%% is that the slave's queue is shorter, and thus the fetch should be
+%% ignored.
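The length-based sync check above can be sketched as a small predicate. Note that the module name `sync_check` and function `should_process_fetch/2` are illustrative assumptions, not the real rabbit_mirror_queue_slave API:

```erlang
-module(sync_check).
-export([should_process_fetch/2]).

%% A slave applies a fetch instruction arriving via gm only when its
%% own queue is at least as long as the master's queue was before
%% the fetch (i.e. the master's reported post-fetch length + 1). A
%% shorter local queue means the head message predates this slave
%% joining the group, so the fetch must be ignored.
should_process_fetch(LocalLen, MasterLenAfterFetch) ->
    LocalLen >= MasterLenAfterFetch + 1.
```

Once the two lengths agree after a fetch, the queues are in sync from that point on, which is exactly the detection described above.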
+%%
+%% Because acktags are issued by the bq independently, and because
+%% there is no requirement for the master and all slaves to use the
+%% same bq, all references to msgs going over gm are by msg_id. Thus
+%% upon acking, the master must convert the acktags back to msg_ids
+%% (which happens to be what bq:ack returns), then send the msg_ids
+%% over gm; the slaves must then convert the msg_ids back to acktags
+%% (a mapping the slaves themselves must maintain).
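That acktag/msg_id translation can be sketched as follows; the `dict`-based mapping, module name, and function names are hypothetical (the real slave keeps this mapping in its own process state):

```erlang
-module(ack_translation).
-export([slave_msg_ids_to_acktags/2]).

%% Slave side: translate msg_ids arriving over gm back into this
%% slave's own acktags, silently dropping msg_ids it has never
%% fetched (e.g. messages that predate the slave joining). Returns
%% the acktags to pass to the slave's own bq:ack, plus the mapping
%% with the acked msg_ids removed.
slave_msg_ids_to_acktags(MsgIds, MsgIdToAckTag) ->
    AckTags = [AckTag || MsgId <- MsgIds,
                         {ok, AckTag} <- [dict:find(MsgId, MsgIdToAckTag)]],
    Remaining = lists:foldl(fun dict:erase/2, MsgIdToAckTag, MsgIds),
    {AckTags, Remaining}.
```

The single-element-list generator is an idiomatic way to let the `{ok, AckTag}` pattern filter out `error` results from `dict:find/2`.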
+%%
+%% When the master dies, a slave gets promoted. This will be the
+%% eldest slave, and thus the one most likely to be sync'd with the
+%% master. The design of gm is that the
+%% notification of the death of the master will only appear once all
+%% messages in-flight from the master have been fully delivered to all
+%% members of the gm group. Thus at this point, the slave that gets
+%% promoted cannot broadcast different events in a different order
+%% than the master for the same msgs: there is no possibility for the
+%% same msg to be processed by the old master and the new master - if
+%% it was processed by the old master then it will have been processed
+%% by the slave before the slave was promoted, and vice versa.
+%%
+%% Upon promotion, all msgs pending ack are requeued as normal, the
+%% slave constructs state suitable for use in the master module, and
+%% then dynamically changes into an amqqueue_process with the master
+%% as the bq, and the slave's bq as the master's bq. Thus the very
+%% same process that was the slave is now a full amqqueue_process.
+%%
+%% In the event of channel failure, there is the possibility that a
+%% msg that was being published only makes it to some of the
+%% mirrors. If it makes it to the master, then the master will push
+%% the entire message onto gm, and all the slaves will publish it to
+%% their bq, even though they may not receive it directly from the
+%% channel. This currently will create a small memory leak in the
+%% slave's msg_id_status mapping as the slaves will expect that
+%% eventually they'll receive the msg from the channel. If the message
+%% does not make it to the master then the slaves that receive it will
+%% hold onto the message, assuming it'll eventually appear via
+%% gm. Again, this will currently result in a memory leak, though this
+%% time, it's the entire message rather than tracking the status of
+%% the message, which is potentially much worse. This may eventually
+%% be solved by monitoring publishing channels in some way.
+%%
+%% We don't support transactions on mirror queues. To do so is
+%% challenging. The underlying bq is free to add the contents of the
+%% txn to the queue proper at any point after the tx.commit comes in
+%% but before the tx.commit-ok goes out. This means that it is not
+%% safe for all mirrors to simply issue the bq:tx_commit at the same
+%% time, as the addition of the txn's contents to the queue may
+%% subsequently be inconsistently interwoven with other actions on the
+%% bq. The solution to this is, in the master, to wrap the PostCommitFun
+%% and do the gm:broadcast in there: at that point, you're in the bq
+%% (well, there's actually nothing to stop that function being invoked
+%% by some other process, but let's pretend for now: you could always
+%% use run_backing_queue to ensure you really are in the queue process
+%% (the _async variant would be unsafe from an ordering point of
+%% view)), the
+%% gm:broadcast is safe because you don't have to worry about races
+%% with other gm:broadcast calls (same process). Thus this signal
+%% would indicate sufficiently to all the slaves that they must insert
+%% the complete contents of the txn at precisely this point in the
+%% stream of events.
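A purely illustrative sketch of the PostCommitFun wrapping idea (transactions remain unsupported; the module and function names are invented, and the broadcast is abstracted as a fun so only the shape is shown):

```erlang
-module(tx_wrap).
-export([wrap_post_commit/2]).

%% Wrap the bq's post-commit continuation so that, at the moment the
%% txn's contents are added to the queue (inside the queue process),
%% a broadcast marks that exact point in the gm event stream for the
%% slaves. BroadcastFun stands in for something like
%% gm:broadcast(GM, {tx_commit, Txn}); running it in the queue
%% process means it cannot race with the master's other broadcasts.
wrap_post_commit(PostCommitFun, BroadcastFun) ->
    fun (BQS) ->
            ok = BroadcastFun(),
            PostCommitFun(BQS)
    end.
```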
+%%
+%% However, it's quite difficult for the slaves to make that happen:
+%% they would be forced to issue the bq:tx_commit at that point, but
+%% then stall processing any further instructions from gm until they
+%% receive the notification from their bq that the tx_commit has fully
+%% completed (i.e. they need to treat what is an async system as being
+%% fully synchronous). This is not too bad (apart from the
+%% vomit-inducing notion of it all): just need a queue of instructions
+%% from the gm; but then it gets rather worse when you consider what
+%% needs to happen if the master dies at this point and the slave in
+%% the middle of this tx_commit needs to be promoted.
+%%
+%% Finally, we can't possibly hope to make transactions atomic across
+%% mirror queues, and it's not even clear that that's desirable: if a
+%% slave fails whilst there's an open transaction in progress then
+%% when the channel comes to commit the txn, it will detect the
+%% failure and destroy the channel. However, the txn will have
+%% actually committed successfully in all the other mirrors (including
+%% master). To do this bit properly would require 2PC and all the
+%% baggage that goes with that.
+%%
%%----------------------------------------------------------------------------
start_link(Queue, GM) ->
diff --git a/src/rabbit_mirror_queue_master.erl b/src/rabbit_mirror_queue_master.erl
index e6a71370ae..481ee7c49d 100644
--- a/src/rabbit_mirror_queue_master.erl
+++ b/src/rabbit_mirror_queue_master.erl
@@ -44,46 +44,6 @@
%% For general documentation of HA design, see
%% rabbit_mirror_queue_coordinator
-%%
-%% Some notes on transactions
-%%
-%% We don't support transactions on mirror queues. To do so is
-%% challenging. The underlying bq is free to add the contents of the
-%% txn to the queue proper at any point after the tx.commit comes in
-%% but before the tx.commit-ok goes out. This means that it is not
-%% safe for all mirrors to simply issue the BQ:tx_commit at the same
-%% time, as the addition of the txn's contents to the queue may
-%% subsequently be inconsistently interwoven with other actions on the
-%% BQ. The solution to this is, in the master, wrap the PostCommitFun
-%% and do the gm:broadcast in there: at that point, you're in the BQ
-%% (well, there's actually nothing to stop that function being invoked
-%% by some other process, but let's pretend for now: you could always
-%% use run_backing_queue_async to ensure you really are in the queue
-%% process), the gm:broadcast is safe because you don't have to worry
-%% about races with other gm:broadcast calls (same process). Thus this
-%% signal would indicate sufficiently to all the slaves that they must
-%% insert the complete contents of the txn at precisely this point in
-%% the stream of events.
-%%
-%% However, it's quite difficult for the slaves to make that happen:
-%% they would be forced to issue the tx_commit at that point, but then
-%% stall processing any further instructions from gm until they
-%% receive the notification from their bq that the tx_commit has fully
-%% completed (i.e. they need to treat what is an async system as being
-%% fully synchronous). This is not too bad (apart from the
-%% vomit-inducing notion of it all): just need a queue of instructions
-%% from the GM; but then it gets rather worse when you consider what
-%% needs to happen if the master dies at this point and the slave in
-%% the middle of this tx_commit needs to be promoted.
-%%
-%% Finally, we can't possibly hope to make transactions atomic across
-%% mirror queues, and it's not even clear that that's desirable: if a
-%% slave fails whilst there's an open transaction in progress then
-%% when the channel comes to commit the txn, it will detect the
-%% failure and destroy the channel. However, the txn will have
-%% actually committed successfully in all the other mirrors (including
-%% master). To do this bit properly would require 2PC and all the
-%% baggage that goes with that.
%% ---------------------------------------------------------------------------
%% Backing queue