path: root/qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml
author Keith Wall <kwall@apache.org> 2014-10-12 15:48:18 +0000
committer Keith Wall <kwall@apache.org> 2014-10-12 15:48:18 +0000
commit 39b5bc8a8456235540730c88922f60855a16e46b (patch)
tree ed495f958ebe78b2559dbbbb0c4e292e1555250f /qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml
parent 71c395d26e425571573b0bda06e23e5f84c1ca2c (diff)
QPID-6108: [Java Broker Documentation] Rewrite HA documentation to reflect the new model and the include multi-node support.
* Correct many spelling errors
* Improve web-console documentation around add/edit/delete entities, and the setting of context variables
* Extract new top level section for backup/recovery

git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1631195 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml')
-rw-r--r-- qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml 1268
1 files changed, 458 insertions, 810 deletions
diff --git a/qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml b/qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml
index d838bee626..9dfbbaf764 100644
--- a/qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml
+++ b/qpid/doc/book/src/java-broker/Java-Broker-High-Availability.xml
@@ -26,882 +26,530 @@
<chapter id="Java-Broker-High-Availability">
<title>High Availability</title>
- <section role="h3" id="Java-Broker-High-Availability-GeneralIntroduction">
+ <section id="Java-Broker-High-Availability-GeneralIntroduction">
<title>General Introduction</title>
- <para>The term High Availability (HA) usually refers to having a number of instances of a service such as a Message Broker
- available so that should a service unexpectedly fail, or requires to be shutdown for maintenance, users may quickly connect
- to another instance and continue their work with minimal interuption. HA is one way to make a overall system more resilient
- by eliminating a single point of failure from a system.</para>
- <para>HA offerings are usually categorised as <emphasis role="bold">Active/Active</emphasis> or <emphasis role="bold">Active/Passive</emphasis>.
- An Active/Active system is one where all nodes within the cluster are usuaully available for use by clients all of the time. In an
- Active/Passive system, one only node within the cluster is available for use by clients at any one time, whilst the others are in
- some kind of standby state, awaiting to quickly step-in in the event the active node becomes unavailable.
- </para>
+ <para>The term High Availability (HA) usually refers to having a number of instances of a
+ service such as a Message Broker available so that, should an instance fail unexpectedly or
+ need to be shut down for maintenance, users may quickly connect to another instance and
+ continue their work with minimal interruption. HA is one way to make an overall system more
+ resilient by eliminating a single point of failure.</para>
+ <para>HA offerings are usually categorised as <emphasis role="bold">Active/Active</emphasis> or
+ <emphasis role="bold">Active/Passive</emphasis>. An Active/Active system is one where all
+ nodes within the group are usually available for use by clients all of the time. In an
+ Active/Passive system, only one node within the group is available for use by clients at any
+ one time, whilst the others are in some kind of standby state, waiting to step in quickly in
+ the event the active node becomes unavailable. </para>
</section>
- <section role="h3" id="Java-Broker-High-Availability-OfferingsOfJavaBroker">
- <title>HA offerings of the Java Broker</title>
- <para>HA is provided by way of the HA features built into the <ulink url="&oracleBdbProductOverviewUrl;">Java Edition of the Berkley Database
- (BDB JE)</ulink> and as such is currently only available to Java Broker users who use the optional BDB JE based persistence store. This
- <emphasis role="bold">optional</emphasis> store requires the use of BDB JE which is licensed under the Sleepycat Licence, which is
- not compatible with the Apache Licence and thus BDB JE is not distributed with Qpid. Users who elect to use this optional store for
- the broker have to provide this dependency.</para>
- <para>HA in the Java Broker provides an <emphasis role="bold">Active/Passive</emphasis> mode of operation with Virtual hosts being
- the unit of replication. The Active node (referred to as the <emphasis role="bold">Master</emphasis>) accepts all work from all the clients.
- The Passive nodes (referred to as <emphasis role="bold">Replicas</emphasis>) are unavailable for work: the only task they must perform is
- to remain in synch with the Master node by consuming a replication stream containing all data and state.</para>
- <para>If the Master node fails, a Replica node is elected to become the new Master node. All clients automatically failover
- <footnote><para>The automatic failover feature is available only for AMQP connections from the Java client. Management connections (JMX)
- do not current offer this feature.</para></footnote> to the new Master and continue their work.</para>
- <para>The Java Broker HA solution is incompatible with the HA solution offered by the CPP Broker. It is not possible to co-locate Java and CPP
- Brokers within the same cluster.</para>
- <para>HA is not currently available for those using the the <emphasis role="bold">Derby Store</emphasis> or <emphasis role="bold">Memory
- Message Store</emphasis>.</para>
+ <section id="Java-Broker-High-Availability-OverviewOfHA">
+ <title>Overview of HA within the Java Broker</title>
+ <para>The Java Broker provides an HA implementation offering an <emphasis role="bold"
+ >Active/Passive</emphasis> mode of operation. When using HA, multiple instances of the Java
+ Broker work together to form a high availability group of two or more nodes.</para>
+ <para>The remainder of this section now talks about the specifics of how HA is achieved in terms
+ of the <link linkend="Java-Broker-Concepts">concepts</link> introduced earlier in this
+ book.</para>
+ <para>The <link linkend="Java-Broker-Concepts-Virtualhosts">Virtualhost</link> is the unit of
+ replication. This means that any <emphasis>durable</emphasis> queues, exchanges, and bindings
+ belonging to that virtualhost, any <emphasis>persistent</emphasis> messages contained within
+ the queues and any attribute settings applied to the virtualhost itself are automatically
+ replicated to all nodes within the group.<footnote>
+ <para>Transient messages and messages on non-durable queues are not replicated.</para>
+ </footnote></para>
+ <para>It is the <link linkend="Java-Broker-Concepts-Virtualhost-Nodes">Virtualhost Nodes</link>
+ (from different Broker instances) that join together to form a group. The virtualhost nodes
+ collectively coordinate the group: they organise replication between the master and
+ replicas and conduct elections to determine which node becomes the new master in the event of
+ the old master failing.</para>
+ <para>When a virtualhost node is in the <emphasis>master</emphasis> role, the virtualhost
+ beneath it is available for messaging work. Any write operations sent to the virtualhost are
+ automatically replicated to all other nodes in the group.</para>
+ <para>When a virtualhost node is in the <emphasis>replica</emphasis> role, the virtualhost
+ beneath it is always unavailable for messaging work. Any attempted connections to a virtualhost
+ in this state are automatically turned away, allowing a messaging client to discover where the
+ master currently resides. When in the replica role, the node's sole responsibility is to consume
+ a replication stream in order that it remains up to date with the master.</para>
+ <para>Messaging clients must discover which node hosts the active virtualhost. This can be
+ achieved using a static technique (for instance, a failover URL, which is a feature of the Qpid
+ Java Client), or a dynamic one utilising some kind of proxy or virtual IP (VIP).</para>
+ <para>The figure that follows illustrates a group formed of three virtualhost nodes from three
+ separate Broker instances. A client is connected to the virtualhost node that is in the master
+ role. The two virtualhost nodes <literal>weather1</literal> and <literal>weather3</literal>
+ are replicas and are receiving a stream of updates.</para>
+ <figure id="Java-Broker-High-Availability-OverviewOfHA-Figure">
+ <title>3-node group deployed across three Brokers.</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="images/HA-Overview.png" format="PNG" scalefit="1"/>
+ </imageobject>
+ <textobject>
+ <phrase>Diagram showing a 3 node group deployed across three Brokers</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>Currently, the only virtualhost/virtualhost node type offering HA is BDB HA. Internally,
+ this leverages the HA capabilities of the Berkeley DB JE edition. BDB JE is an <link
+ linkend="Java-Broker-Miscellaneous-Installing-Oracle-BDB-JE">optional dependency</link> of
+ the Broker.</para>
+ <note>
+ <para>The Java Broker HA solution is incompatible with the HA solution offered by the CPP
+ Broker. It is not possible to co-locate Java and CPP Brokers within the same group.</para>
+ </note>
</section>
+ <section id="Java-Broker-High-Availability-CreatingGroup">
+ <title>Creating a group</title>
+ <para>This section describes how to create a group. At a high level, creating a group involves
+ first creating the first node standalone, then creating subsequent nodes that reference the
+ first node, so that each new node can introduce itself and the group is gradually built up.</para>
+ <para>A group is created through either <link
+ linkend="Java-Broker-Management-Channel-Web-Console">Web Management</link> or the <link
+ linkend="Java-Broker-Management-Channel-REST-API">REST API</link>. These instructions
+ presume you are using Web Management. As an example, they build the group
+ illustrated in <xref linkend="Java-Broker-High-Availability-OverviewOfHA-Figure"
+ />.</para>
+ <para><orderedlist>
+ <listitem>
+ <para>Install a Broker on each machine that will be used to host the group. As messaging
+ clients will need to be able to connect to and authenticate with all Brokers, it usually
+ makes sense to choose a common authentication mechanism e.g. Simple LDAP Authentication,
+ External with SSL client authentication or Kerberos.</para>
+ </listitem>
+ <listitem>
+ <para>Select one Broker instance to host the first node instance. This choice is an
+ arbitrary one. The node is special only whilst creating the group. Once creation is complete,
+ all nodes will be considered equal.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <para>Click the <literal>Add</literal> button on the Virtualhost Panel on the Broker
+ tab.</para>
+ <orderedlist>
+ <listitem>
+ <para>Give the Virtualhost node a unique name e.g. <literal>weather1</literal>. The
+ name must be unique within the group and unique to that Broker. It is best if the
+ node names are chosen from a different nomenclature than the machine names
+ themselves.</para>
+ </listitem>
+ <listitem>
+ <para>Choose <literal>BDB_HA</literal> and select <literal>New group</literal>
+ </para>
+ </listitem>
+ <listitem>
+ <para>Give the group a name e.g. <literal>weather</literal>. The group name must be
+ unique and will be the name also given to the virtualhost, so this is the name the
+ messaging clients will use in their connection url.</para>
+ </listitem>
+ <listitem>
+ <para>Give the address of this node. This is an address on this node's host that
+ will be used for replication purposes. The hostname <emphasis>must</emphasis> be
+ resolvable by all the other nodes in the group. This is separate from the address
+ used by messaging clients to connect to the Broker. It is usually best to choose a
+ symbolic name, rather than an IP address.</para>
+ </listitem>
+ <listitem>
+ <para>Now add the node addresses of all the other nodes that will form the group. In
+ our example we are building a three node group so we give the node addresses of
+ <literal>chaac:5000</literal> and <literal>indra:5000</literal>.</para>
+ </listitem>
+ <listitem>
+ <para>Click Add to create the node. The virtualhost node will be created with the
+ virtualhost. As there is only one node at this stage, the role will be
+ master.</para>
+ </listitem>
+ </orderedlist>
+ <para>
+ <figure>
+ <title>Creating 1st node in a group</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="images/HA-Create-1.png" format="PNG" scalefit="1"/>
+ </imageobject>
+ <textobject>
+ <phrase>Creating 1st node in a group</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ </para>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <para>Now move to the second Broker that is to join the group. Click the <literal>Add</literal>
+ button on the Virtualhost Panel on the Broker tab of the second Broker.</para>
+ <orderedlist>
+ <listitem>
+ <para>Give the Virtualhost node a unique name e.g.
+ <literal>weather2</literal>.</para>
+ </listitem>
+ <listitem>
+ <para>Choose <literal>BDB_HA</literal> and choose <literal>Existing group</literal>
+ </para>
+ </listitem>
+ <listitem>
+ <para>Give the details of the <emphasis>existing node</emphasis>. Following our
+ example, specify <literal>weather</literal>, <literal>weather1</literal> and
+ <literal>thor:5000</literal></para>
+ </listitem>
+ <listitem>
+ <para>Give the address of this node.</para>
+ </listitem>
+ <listitem>
+ <para>Click Add to create the node. The new node will use the details of the existing
+ node to contact it and introduce itself into the group. At this stage, the group will have
+ two nodes, with the second node in the replica role.</para>
+ </listitem>
+ <listitem>Repeat these steps until you have added all the nodes to the
+ group.</listitem>
+ </orderedlist>
+ <para>
+ <figure>
+ <title>Adding subsequent nodes to the group</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="images/HA-Create-2.png" format="PNG" scalefit="1"/>
+ </imageobject>
+ <textobject>
+ <phrase>Adding subsequent nodes to the group</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ </para>
+ </para>
+ </listitem>
- <section role="h3" id="Java-Broker-High-Availability-TwoNodeCluster">
- <title>Two Node Cluster</title>
- <section role="h4">
- <title>Overview</title>
- <para>In this HA solution, a cluster is formed with two nodes. one node serves as
- <emphasis role="bold">master</emphasis> and the other is a <emphasis role="bold">replica</emphasis>.
- </para>
- <para>All data and state required for the operation of the virtual host is automatically sent from the
- master to the replica. This is called the replication stream. The master virtual host confirms each
- message is on the replica before the client transaction completes. The exact way the client awaits
- for the master and replica is gorverned by the <link linkend="Java-Broker-High-Availability-DurabilityGuarantee">durability</link>
- configuration, which is discussed later. In this way, the replica remains ready to take over the
- role of the master if the master becomes unavailable.
- </para>
- <para>It is important to note that there is an inherent limitation of two node clusters is that
- the replica node cannot make itself master automatically in the event of master failure. This
- is because the replica has no way to distinguish between a network partition (with potentially
- the master still alive on the other side of the partition) and the case of genuine master failure.
- (If the replica were to elect itself as master, the cluster would run the risk of a
- <ulink url="http://en.wikipedia.org/wiki/Split-brain_(computing)">split-brain</ulink> scenario).
- In the event of a master failure, a third party must designate the replica as primary. This process
- is described in more detail later.
- </para>
- <para>Clients connect to the cluster using a <link linkend="Java-Broker-High-Availability-ClientFailover">failover url</link>.
- This allows the client to maintain a connection to the master in a way that is transparent
- to the client application.</para>
- </section>
- <section role="h4">
- <title>Depictions of cluster operation</title>
- <para>In this section, the operation of the cluster is depicted through a series of figures
- supported by explanatory text.</para>
- <figure>
- <title>Key for figures</title>
+ </orderedlist></para>
+ <para>The group is now formed and is ready for use. Looking at the virtualhost node of any of the
+ nodes shows a complete view of the whole group. <figure>
+ <title>View of group from one node</title>
<mediaobject>
<imageobject>
- <imagedata fileref="images/HA-2N-Key.png" format="PNG" scalefit="1"/>
+ <imagedata fileref="images/HA-Create-3.png" format="PNG" scalefit="1"/>
</imageobject>
<textobject>
- <phrase>Key to figures</phrase>
+ <phrase>View of group from one node</phrase>
</textobject>
</mediaobject>
- </figure>
- <section role="h5" id="Java-Broker-High-Availability-TwoNodeNormalOperation">
- <title>Normal Operation</title>
- <para>The figure below illustrates normal operation. Clients connecting to the cluster by way
- of the failover URL achieve a connection to the master. As clients perform work (message
- production, consumption, queue creation etc), the master additionally sends this data to the
- replica over the network.</para>
- <figure>
- <title>Normal operation of a two-node cluster</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/HA-2N-Normal.png" format="PNG" scalefit="1"/>
- </imageobject>
- <textobject>
- <phrase>Normal operation</phrase>
- </textobject>
- </mediaobject>
- </figure>
- </section>
- <section role="h5" id="Java-Broker-High-Availability-TwoNodeMasterFailure">
- <title>Master Failure and Recovery</title>
- <para>The figure below illustrates a sequence of events whereby the master suffers a failure
- and the replica is made the master to allow the clients to continue to work. Later the
- old master is repaired and comes back on-line in replica role.</para>
- <para>The item numbers in this list apply to the numbered boxes in the figure below.</para>
- <orderedlist>
- <listitem>
- <para>System operating normally</para>
- </listitem>
- <listitem>
- <para>Master suffers a failure and disconnects all clients. Replica realises that it is no
- longer in contact with master. Clients begin to try to reconnect to the cluster, although these
- connection attempts will fail at this point.</para>
- </listitem>
+ </figure></para>
+ </section>
+
+ <section id="Java-Broker-High-Availability-Behaviour">
+ <title>Behaviour of the Group</title>
+ <para>This section first describes the behaviour of the group in its default configuration. It
+ then goes on to talk about the various controls that are available to override it. These
+ controls affect the <ulink
+ url="http://en.wikipedia.org/wiki/ACID#Durability">durability</ulink> of transactions and
+ the data consistency between the master and replicas, and so allow trade-offs to be made between
+ performance and reliability.</para>
+
+ <section id="Java-Broker-High-Availability-Behaviour-Default-Behaviour">
+ <title>Default Behaviour</title>
+ <para>Let's first look at the behaviour of a group in default configuration.</para>
+ <para>In the default configuration, for any messaging work to be done, there must be at least
+ a <emphasis>quorum</emphasis> of nodes present. For example, in a three node group, there
+ must be at least two nodes available.</para>
+ <para>When a messaging client sends a transaction, it can be assured that, before control
+ returns to the application after the commit call, the following is true:</para>
+ <para><itemizedlist>
<listitem>
- <para>A third-party (an operator, a script or a combination of the two) verifies that the master has truely
- failed <emphasis role="bold">and is no longer running</emphasis>. If it has truely failed, the decision is made
- to designate the replica as primary, allowing it to assume the role of master despite the other node being down.
- This primary designation is performed using <link linkend="Java-Broker-High-Availability-JMXAPI">JMX</link>.</para>
+ <para>At the master, the transaction is <emphasis>written to disk and OS level caches
+ are flushed</emphasis> meaning the data is on the storage device.</para>
</listitem>
<listitem>
- <para>Client connections to the new master succeed and the <emphasis role="bold">service is restored
- </emphasis>, albeit without a replica.</para>
+ <para>At least quorum minus one replicas <emphasis>acknowledge receipt of the
+ transaction</emphasis>. The replicas will write the data to the storage device
+ sometime later.</para>
</listitem>
+ </itemizedlist></para>
+ <para>If there were to be a master failure immediately after the transaction was committed,
+ the transaction would be held by at least quorum minus one replicas. For example, if we
+ had a group of three, then we would be assured that at least one replica held the
+ transaction.</para>
+
+ <para>In the event of a master failure, if quorum nodes remain, those nodes hold an election.
+ The nodes will elect as master the node with the most recent transaction. If two or more nodes
+ have the most recent transaction the group makes an arbitrary choice. If a quorum of
+ nodes does not remain, the nodes cannot elect a new master and will wait until nodes rejoin.
+ You will see later that manual controls are available that allow service to be restored from
+ fewer than quorum nodes and that influence which node gets elected in the event of a
+ tie.</para>
+
+ <para>Whenever a group has fewer than quorum nodes present, the virtualhost will be unavailable
+ and messaging connections will be refused. If quorum disappears at the very moment a
+ messaging client sends a transaction that transaction will fail.</para>
+
+ <para>You will have noticed the difference in the synchronization policies applied to the master
+ and the replicas. The replicas send the acknowledgement back before the data is written to
+ disk. The master synchronously writes the transaction to storage. This is an example of a
+ trade-off between durability and performance. We will see more about how to control this
+ trade-off later.</para>
+ </section>
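The quorum arithmetic described in the Default Behaviour section can be sketched as follows. This is an illustrative model only, not Broker code: it assumes that quorum means a simple majority of the group, which is consistent with the three node example in the text, and the function names are hypothetical.

```python
def quorum(group_size: int) -> int:
    """Smallest number of nodes forming a simple majority (assumed meaning of quorum)."""
    if group_size >= 1:
        return group_size // 2 + 1
    raise ValueError("a group needs at least one node")


def replica_acks_required(group_size: int) -> int:
    """Replica acknowledgements needed before commit returns: quorum minus one."""
    return quorum(group_size) - 1


def can_do_messaging_work(group_size: int, nodes_present: int) -> bool:
    """Messaging work is possible only while at least quorum nodes are present."""
    return nodes_present >= quorum(group_size)


if __name__ == "__main__":
    # Print group size, quorum and required replica acknowledgements.
    for size in (2, 3, 5):
        print(size, quorum(size), replica_acks_required(size))
```

Under this assumption a group of two has a quorum of two, which matches the later statement that a two node group stops work when either node fails.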
+ <section id="Java-Broker-High-Availability-Behaviour-SynchronizationPolicy">
+ <title>Synchronization Policy</title>
+ <para>The <emphasis>synchronization policy</emphasis> dictates what a node must do when it
+ receives a transaction before it acknowledges that transaction to the rest of the
+ group.</para>
+ <para>The following options are available: <itemizedlist>
<listitem>
- <para>The old master is repaired and brought back on-line. It automatically rejoins the cluster
- in the <emphasis role="bold">replica</emphasis> role.</para>
+ <para><emphasis>SYNC</emphasis>. The node must write the transaction to disk and flush
+ any OS level buffers before sending the acknowledgement. SYNC offers the highest
+ durability but the lowest performance.</para>
</listitem>
- </orderedlist>
- <figure>
- <title>Failure of master and recovery sequence</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/HA-2N-MasterFail.png" format="PNG" scalefit="1"/>
- </imageobject>
- <textobject>
- <phrase>Failure of master and subsequent recovery sequence</phrase>
- </textobject>
- </mediaobject>
- </figure>
- </section>
- <section role="h5" id="Java-Broker-High-Availability-TwoNodeReplicaFailure">
- <title>Replica Failure and Recovery</title>
- <para>The figure that follows illustrates a sequence of events whereby the replica suffers a failure
- leaving the master to continue processing alone. Later the replica is repaired and is restarted.
- It rejoins the cluster so that it is once again ready to take over in the event of master failure.</para>
- <para>The behavior of the replica failure case is governed by the <varname>designatedPrimary</varname>
- configuration item. If set true on the master, the master will continue to operate solo without outside
- intervention when the replica fails. If false, a third-party must designate the master as primary in order
- for it to continue solo.</para>
- <para>The item numbers in this list apply to the numbered boxes in the figure below. This example assumes
- that <varname>designatedPrimary</varname> is true on the original master node.</para>
- <orderedlist>
<listitem>
- <para>System operating normally</para>
+ <para><emphasis>WRITE_NO_SYNC</emphasis>. The node must write the transaction to disk
+ before sending the acknowledgement. OS level buffers will be flushed at some point
+ later. This typically provides an assurance against failure of the application but not
+ the operating system or hardware.</para>
</listitem>
<listitem>
- <para>Replica suffers a failure. Master realises that replica longer in contact but as
- <varname>designatedPrimary</varname> is true, master continues processing solo and thus client
- connections are uninterrupted by the loss of the replica. System continues operating normally, albeit
- with a single node.</para>
+ <para><emphasis>NO_SYNC</emphasis>. The node immediately sends the acknowledgement. The
+ transaction will be written and OS level buffers flushed at some point later. NO_SYNC
+ offers the highest performance but the lowest durability level. This synchronization
+ policy is sometimes known as <emphasis>commit to the network</emphasis>.</para>
</listitem>
+ </itemizedlist></para>
+ <para>It is possible to assign one policy to the master and a different policy to the
+ replicas. These are configured as <link
+ linkend="Java-Broker-Management-Managing-Virtualhost-Attributes">attributes on the
+ virtualhost</link>. By default the master uses <emphasis>SYNC</emphasis> and replicas use
+ <emphasis>NO_SYNC</emphasis>.</para>
+ </section>
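The three synchronization policies can be summarised as a small model of what a single node does before it acknowledges a transaction. This is a simplified sketch for illustration, not the Broker's implementation: the class, the constants and the `held_locally_after` helper are hypothetical names, and the failure categories are taken from the WRITE_NO_SYNC description above.

```python
from enum import Enum


class SyncPolicy(Enum):
    # value = (written to disk before ack, OS level buffers flushed before ack)
    SYNC = (True, True)
    WRITE_NO_SYNC = (True, False)
    NO_SYNC = (False, False)

    @property
    def writes_before_ack(self) -> bool:
        return self.value[0]

    @property
    def flushes_before_ack(self) -> bool:
        return self.value[1]


# Defaults stated in the text: the master uses SYNC, replicas use NO_SYNC.
DEFAULT_MASTER_POLICY = SyncPolicy.SYNC
DEFAULT_REPLICA_POLICY = SyncPolicy.NO_SYNC


def held_locally_after(policy: SyncPolicy, failure: str) -> bool:
    """Whether this one node would typically still hold an acknowledged transaction."""
    if failure == "application":
        return policy.writes_before_ack
    if failure == "os_or_hardware":
        return policy.flushes_before_ack
    raise ValueError("unknown failure kind: " + failure)
```

The model only considers one node in isolation. In a group, a transaction that one node loses may still be held by the other nodes that acknowledged it.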
+ <section id="Java-Broker-High-Availability-Behaviour-NodePriority">
+ <title>Node Priority</title>
+ <para>Node priority can be used to influence the behaviour of the election algorithm. It is
+ useful in the case where you want to favour some nodes over others, for instance, if you wish
+ to favour nodes located in a particular data centre over those in a remote site. </para>
+ <para>The following options are available: <itemizedlist>
<listitem>
- <para>Replica is repaired.</para>
+ <para><emphasis>Highest</emphasis>. Nodes with this priority will be the most favoured. In
+ the event of two or more nodes having the most recent transaction, the node with this
+ priority will be elected master. If two or more nodes have this priority the algorithm
+ will make an arbitrary choice.</para>
</listitem>
<listitem>
- <para>After catching up with missed work, replica is once again ready to take over in the event of master failure.</para>
+ <para><emphasis>High</emphasis>. Nodes with this priority will be favoured but not as
+ much so as those with Highest.</para>
</listitem>
- </orderedlist>
- <figure>
- <title>Failure of replica and subsequent recovery sequence</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/HA-2N-ReplicaFail.png" format="PNG" scalefit="1"/>
- </imageobject>
- <textobject>
- <phrase>Failure of replica and subsequent recovery sequence</phrase>
- </textobject>
- </mediaobject>
- </figure>
- </section>
- <section role="h5" id="Java-Broker-High-Availability-TwoNodeNetworkPartition">
- <title>Network Partition and Recovery</title>
- <para>The figure below illustrates the sequence of events that would occur if the network between
- master and replica were to suffer a partition, and the nodes were out of contact with one and other.</para>
- <para>As with <link linkend="Java-Broker-High-Availability-TwoNodeReplicaFailure">Replica Failure and Recovery</link>, the
- behaviour is governed by the <varname>designatedPrimary</varname>.
- Only if <varname>designatedPrimary</varname> is true on the master, will the master continue solo.</para>
- <para>The item numbers in this list apply to the numbered boxes in the figure below. This example assumes
- that <varname>designatedPrimary</varname> is true on the original master node.</para>
- <orderedlist>
<listitem>
- <para>System operating normally</para>
+ <para><emphasis>Normal</emphasis>. This is the default election priority.</para>
</listitem>
<listitem>
- <para>Network suffers a failure. Master realises that replica longer in contact but as
- <varname>designatedPrimary</varname> is true, master continues processing solo and thus client
- connections are uninterrupted by the network partition between master and replica.</para>
+ <para><emphasis>Never</emphasis>. The node will never be elected <emphasis>even if the
+ node has the most recent transaction</emphasis>. The node will still keep up to date
+ with the replication stream and will still vote, but can never be
+ elected.</para>
</listitem>
+ </itemizedlist>
+ </para>
+ <para>Node priority is configured as an <link
+ linkend="Java-Broker-Management-Managing-Virtualhost-Nodes-Attributes">attribute on the
+ virtualhost node</link> and can be changed at runtime and is effective immediately.</para>
+ <important>
+ <para>Use of the Never priority can lead to transaction loss. For example, consider a group
+ of three where Replica-2 is marked as Never. Suppose a transaction arrives and is
+ acknowledged only by the Master and Replica-2, while Replica-1 is running behind for some
+ reason (perhaps a full GC). The transaction would succeed. If a Master failure were to occur at
+ that moment, the replicas would elect Replica-1 even though Replica-2 had the most recent
+ transaction.</para>
+ </important>
+ </section>
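The effect of node priority on an election, including the transaction loss risk of the Never priority, can be illustrated with a toy model. This sketch is an assumption-laden simplification rather than the actual election algorithm: the `Node` type, the numeric ranks and `elect_master` are hypothetical, and the quorum requirement is deliberately not modelled.

```python
from typing import List, NamedTuple, Optional

# Hypothetical numeric ranks; "Never" is excluded from election entirely.
PRIORITY_RANK = {"Highest": 3, "High": 2, "Normal": 1}


class Node(NamedTuple):
    name: str
    last_txn_id: int  # higher means a more recent transaction
    priority: str = "Normal"


def elect_master(nodes: List[Node]) -> Optional[Node]:
    """Pick the electable node with the most recent transaction; priority breaks ties."""
    candidates = [n for n in nodes if n.priority != "Never"]
    if not candidates:
        return None
    # Any remaining tie is resolved arbitrarily (here: first candidate encountered).
    return max(candidates, key=lambda n: (n.last_txn_id, PRIORITY_RANK[n.priority]))


if __name__ == "__main__":
    # The scenario from the warning above: Replica-2 is ahead but marked Never.
    survivors = [Node("Replica-1", 41), Node("Replica-2", 42, "Never")]
    print(elect_master(survivors).name)
```

In this model Replica-1 wins despite holding an older transaction, which is exactly the loss scenario the warning describes.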
+ <section id="Java-Broker-High-Availability-Behaviour-MinimumNumberOfNodes">
+ <title>Required Minimum Number Of Nodes</title>
+ <para>This controls the required minimum number of nodes to complete a transaction and to
+ elect a new master. By default, the required number of nodes is set to
+ <emphasis>Default</emphasis> (which signifies quorum).</para>
+ <para>It is possible to reduce the required minimum number of nodes. The rationale for doing
+ this is normally to temporarily restore service from fewer than quorum nodes following an
+ extraordinary failure.</para>
+ <para>For example, consider a group of three. If one node were to fail, as quorum still
+ remained, the system would continue work without any intervention. If the failing node were
+ the master, a new master would be elected.</para>
+ <para>What if a further node were to fail? Quorum no longer remains, and the remaining node
+ would just wait. It cannot elect itself master. What if we wanted to restore service from
+ just this one node?</para>
+ <para>In this case, Required Minimum Number Of Nodes can be reduced to 1 on the remaining node, allowing
+ the node to elect itself and service to be restored from the singleton. Required minimum
+ number of nodes is configured as an <link
+ linkend="Java-Broker-Management-Managing-Virtualhost-Nodes-Attributes">attribute on the
+ virtualhost node</link> and can be changed at runtime and is effective immediately.</para>
+ <important>
+ <para>The attribute must be used cautiously. Careless use will lead to lost transactions and
+ can lead to a <ulink url="http://en.wikipedia.org/wiki/Split-brain_(computing)"
+ >split-brain</ulink> in the event of a network partition. If used to temporarily restore
+ service from fewer than quorum nodes, it is <emphasis>imperative</emphasis> to revert it
+ to the Default value as the failed nodes are restored.</para>
+ </important>
+ </section>
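The way a reduced required minimum number of nodes restores service can be sketched as below. This is a hypothetical model for illustration: it assumes quorum is a simple majority, and the `override` parameter stands in for the attribute described above (with `None` representing the Default value).

```python
from typing import Optional


def quorum(group_size: int) -> int:
    """Simple majority of the group (assumed meaning of quorum)."""
    return group_size // 2 + 1


def required_nodes(group_size: int, override: Optional[int] = None) -> int:
    """Nodes needed to commit or elect: quorum by default, else the reduced value."""
    return quorum(group_size) if override is None else override


def can_elect_master(group_size: int, nodes_present: int,
                     override: Optional[int] = None) -> bool:
    return nodes_present >= required_nodes(group_size, override)


if __name__ == "__main__":
    # Group of three with a single surviving node.
    print(can_elect_master(3, 1))              # default: quorum of two is required
    print(can_elect_master(3, 1, override=1))  # reduced minimum on the survivor
```

The sketch also shows why the override is dangerous: with the minimum set to 1, each side of a network partition would satisfy the check independently.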
+ <section id="Java-Broker-High-Availability-Behaviour-DesignatedPrimary">
+ <title>Designated Primary</title>
+ <para>This attribute applies to groups of two only.</para>
+ <para> In a group of two, if a node were to fail then in default configuration work will cease
+ as quorum no longer exists. A single node cannot elect itself master. </para>
+ <para>The designated primary flag allows a node in a two node group to elect itself master and
+ to operate sole. Designated Primary is configured as an <link
+ linkend="Java-Broker-Management-Managing-Virtualhost-Nodes-Attributes">attribute on the
+ virtualhost node</link> and can be changed at runtime and is effective immediately.</para>
+ <para>For example, consider a group of two where the master fails. Service will be interrupted
+ as the remaining node cannot elect itself master. To allow it to become master, apply the
+ designated primary flag to it. It will elect itself master and work can continue, albeit
+ from one node.</para>
+ <important>
+ <para>It is imperative not to allow designated primary to be set on both nodes at once.
+ Doing so means that, in the event of a network partition, a <ulink
+ url="http://en.wikipedia.org/wiki/Split-brain_(computing)">split-brain</ulink> will
+ occur.</para>
+ </important>
+ </section>
+ </section>
+ <section id="Java-Broker-High-Availability-NodeOperations">
+ <title>Node Operations</title>
+ <section id="Java-Broker-High-Availability-NodeOperations-Lifecycle">
+ <title>Lifecycle</title>
+ <para>Virtualhost nodes can be stopped, started and deleted.</para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis>Stop</emphasis></para>
+ <para>Stopping a master node will cause the node to temporarily leave the group. Any
+ messaging clients will be disconnected and any in-flight transactions rolled back. The
+ remaining nodes will elect a new master if a quorum of nodes still remains.</para>
+ <para>Stopping a replica node will cause the node to temporarily leave the group too.
+ Providing quorum still exists, the current master will continue without interruption. If,
+ as a result of the node leaving the group, quorum no longer exists, all the nodes will
+ begin waiting, disconnecting any messaging clients, and the virtualhost will become
+ unavailable.</para>
+ <para>A stopped virtualhost node is still considered to be a member of the group.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis>Start</emphasis></para>
+ <para>Starting a virtualhost node allows it to rejoin the group.</para>
+ <para>If the group already has a master, the node will catch up from the master and then
+ become a replica once it has done so.</para>
+ <para>If the group did not have quorum and so had no master, but the rejoining of this
+ node means quorum now exists, an election will take place. The node with the most up to
+ date transaction will become master unless influenced by the priority rules described
+ above.</para>
+ <note>
+ <para>The length of time taken to catch up will depend on how long the node has been
+ stopped. The worst case is where the node has been stopped for more than one hour. In
+ this case, the master will perform an automated <literal>network restore</literal>.
+ This involves streaming all the data held by the master over to the replica. This
+ could take considerable time.</para>
+ </note>
+ </listitem>
+ <listitem>
+ <para><emphasis>Delete</emphasis></para>
+ <para>A virtualhost node can be deleted. Deleting a node permanently removes the node
+ from the group. The data stored locally is removed but this does not affect the data
+ held by the remainder of the group.</para>
+ <note>
+ <para>The name of a deleted virtualhost node cannot be reused within a group.</para>
+ </note>
+ </listitem>
+ </itemizedlist>
+ <para>It is also possible to add nodes to an existing group using the procedure described
+ above.</para>
+ </section>
+ <section id="Java-Broker-High-Availability-NodeOperations-TransferMaster">
+ <title>Transfer Master</title>
+ <para>This operation allows the mastership to be moved from node to node. This is useful for
+ restoring a business-as-usual state after a failure.</para>
+ <para>When using this function, the following occurs. <orderedlist>
<listitem>
- <para>Network is repaired.</para>
+ <para>The system first gives time for the chosen new master to become reasonably up to
+ date.</para>
</listitem>
<listitem>
- <para>After catching up with missed work, replica is once again ready to take over in the event of master failure.
- System operating normally again.</para>
+ <para>It then suspends transactions on the old master and allows the chosen node to
+ become up to date.</para>
</listitem>
- </orderedlist>
- <figure>
- <title>Partition of the network separating master and replica</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/HA-2N-NetworkPartition.png" format="PNG" scalefit="1"/>
- </imageobject>
- <textobject>
- <phrase>Network Partition and Recovery</phrase>
- </textobject>
- </mediaobject>
- </figure>
- </section>
- <section role="h5" id="Java-Broker-High-Availability-TwoNodeSplitBrain">
- <title>Split Brain</title>
- <para>A <ulink url="http://en.wikipedia.org/wiki/Split-brain_(computing)">split-brain</ulink>
- is a situation where the two node cluster has two masters. BDB normally strives to prevent
- this situation arising by preventing two nodes in a cluster being master at the same time.
- However, if the network suffers a partition, and the third-party intervenes incorrectly
- and makes the replica a second master a split-brain will be formed and both masters will
- proceed to perform work <emphasis role="bold">independently</emphasis> of one and other.</para>
- <para>There is no automatic recovery from a split-brain.</para>
- <para>Manual intervention will be required to choose which store will be retained as master
- and which will be discarded. Manual intervention will be required to identify and repeat the
- lost business transactions.</para>
- <para>The item numbers in this list apply to the numbered boxes in the figure below.</para>
- <orderedlist>
<listitem>
- <para>System operating normally</para>
+ <para>The suspended transactions are aborted and any messaging clients connected to the
+ old master are disconnected.</para>
</listitem>
<listitem>
- <para>Network suffers a failure. Master realises that replica longer in contact but as
- <varname>designatedPrimary</varname> is true, master continues processing solo. Client
- connections are uninterrupted by the network partition.</para>
- <para>A third-party <emphasis role="bold">erroneously</emphasis> designates the replica as primary while the
- original master continues running (now solo).</para>
+ <para>The chosen master becomes the new master. The old master becomes a replica.</para>
</listitem>
<listitem>
- <para>As the nodes cannot see one and other, both behave as masters. Clients may perform work against
- both master nodes.</para>
+ <para>Messaging clients reconnect to the new master.</para>
</listitem>
- </orderedlist>
- <figure>
- <title>Split Brain</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/HA-2N-SplitBrain.png" format="PNG" scalefit="1"/>
- </imageobject>
- <textobject>
- <phrase>Split Brain</phrase>
- </textobject>
- </mediaobject>
- </figure>
- </section>
- </section>
- </section>
-
- <section role="h3" id="Java-Broker-High-Availability-MultiNodeCluster">
- <title>Multi Node Cluster</title>
- <para>Multi node clusters are now supported. TODO expand</para>
- </section>
-
- <section role="h3" id="Java-Broker-High-Availability-Creating-A-Grouop">
- <title>Creating a Group</title>
- <para>TODO</para>
- </section>
- <section role="h3" id="Java-Broker-High-Availability-DurabilityGuarantee">
- <title>Durability Guarantees</title>
- <para>The term <ulink url="http://en.wikipedia.org/wiki/ACID#Durability">durability</ulink> is used to mean that once a
- transaction is committed, it remains committed regardless of subsequent failures. A highly durable system is one where
- loss of a committed transaction is extermely unlikely, whereas with a less durable system loss of a transaction is likely
- in a greater number of scenarios. Typically, the more highly durable a system the slower and more costly it will be.</para>
- <para>Qpid exposes the all the
- <ulink url="&oracleBdbRepGuideUrl;txn-management.html#durabilitycontrols">durability controls</ulink>
- offered by by BDB JE JA and a Qpid specific optimisation called <emphasis role="bold">coalescing-sync</emphasis> which defaults
- to enabled.</para>
- <section role="h4" id="Java-Broker-High-Availability-DurabilityGuarantee_BDBControls">
- <title>BDB Durability Controls</title>
- <para>BDB expresses durability as a triplet with the following form:</para>
- <programlisting><![CDATA[<master sync policy>,<replica sync policy>,<replica acknowledgement policy>]]></programlisting>
- <para>The sync polices controls whether the thread performing the committing thread awaits the successful completion of the
- write, or the write and sync before continuing. The master sync policy and replica sync policy need not be the same.</para>
- <para>For master and replic sync policies, the available values are:
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.SyncPolicy.html#SYNC">SYNC</ulink>,
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.SyncPolicy.html#WRITE_NO_SYNC">WRITE_NO_SYNC</ulink>,
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.SyncPolicy.html#NO_SYNC">NO_SYNC</ulink>. SYNC
- is offers the highest durability whereas NO_SYNC the lowest.</para>
- <para>Note: the combination of a master sync policy of SYNC and <link linkend="Java-Broker-High-Availability-DurabilityGuarantee_CoalescingSync">coalescing-sync</link>
- true would result in poor performance with no corresponding increase in durability guarantee. It cannot not be used.</para>
- <para>The acknowledgement policy defines whether when a master commits a transaction, it also awaits for the replica(s) to
- commit the same transaction before continuing. For the two-node case, ALL and SIMPLE_MAJORITY are equal.</para>
- <para>For acknowledgement policy, the available value are:
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.ReplicaAckPolicy.html#ALL">ALL</ulink>,
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.ReplicaAckPolicy.html#SIMPLE_MAJORITY">SIMPLE_MAJORITY</ulink>
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/Durability.ReplicaAckPolicy.html#NONE">NONE</ulink>.</para>
- </section>
- <section role="h4" id="Java-Broker-High-Availability-DurabilityGuarantee_CoalescingSync">
- <title>Coalescing-sync</title>
- <para>If enabled (the default) Qpid works to reduce the number of separate
- <ulink url="&oracleJdkDocUrl;java/io/FileDescriptor.html#sync()">file-system sync</ulink> operations
- performed by the <emphasis role="bold">master</emphasis> on the underlying storage device thus improving performance. It does
- this coalescing separate sync operations arising from the different client commits operations occuring at approximately the same time.
- It does this in such a manner not to reduce the ACID guarantees of the system.</para>
- <para>Coalescing-sync has no effect on the behaviour of the replicas.</para>
- </section>
- <section role="h4" id="Java-Broker-High-Availability-DurabilityGuarantee_Default">
- <title>Default</title>
- <para>The default durability guarantee is <constant>NO_SYNC, NO_SYNC, SIMPLE_MAJORITY</constant> with coalescing-sync enabled. The effect
- of this combination is described in the table below. It offers a good compromise between durability guarantee and performance
- with writes being guaranteed on the master and the additional guarantee that a majority of replicas have received the
- transaction.</para>
- </section>
- <section role="h4" id="Java-Broker-High-Availability-DurabilityGuarantee_Examples">
- <title>Examples</title>
- <para>Here are some examples illustrating the effects of the durability and coalescing-sync settings.</para>
- <para>
- <table>
- <title>Effect of different durability guarantees</title>
- <tgroup cols="4">
- <thead>
- <row>
- <entry/>
- <entry>Durability</entry>
- <entry>Coalescing-sync</entry>
- <entry>Description</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>1</entry>
- <entry>NO_SYNC, NO_SYNC, SIMPLE_MAJORITY</entry>
- <entry>true</entry>
- <entry>Before the commit returns to the client, the transaction will be written/sync'd to the Master's disk (effect of
- coalescing-sync) and a majority of the replica(s) will have acknowledged the <emphasis role="bold">receipt</emphasis>
- of the transaction. The replicas will write and sync the transaction to their disk at a point in the future governed by
- <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/rep/ReplicationMutableConfig.html#LOG_FLUSH_TASK_INTERVAL">ReplicationMutableConfig#LOG_FLUSH_INTERVAL</ulink>.
- </entry>
- </row>
- <row>
- <entry>2</entry>
- <entry>NO_SYNC, WRITE_NO_SYNC, SIMPLE_MAJORITY</entry>
- <entry>true</entry>
- <entry>Before the commit returns to the client, the transaction will be written/sync'd to the Master's disk (effect of
- coalescing-sync and a majority of the replica(s) will have acknowledged the <emphasis role="bold">write</emphasis> of
- the transaction to their disk. The replicas will sync the transaction to disk at a point in the future with an upper bound governed by
- ReplicationMutableConfig#LOG_FLUSH_INTERVAL.</entry>
- </row>
- <row>
- <entry>3</entry>
- <entry>NO_SYNC, NO_SYNC, NONE</entry>
- <entry>false</entry>
- <entry>After the commit returns to the client, the transaction is neither guaranteed to be written to the disk of the master
- nor received by any of the replicas. The master and replicas will write and sync the transaction to their disk at a point
- in the future with an upper bound governed by ReplicationMutableConfig#LOG_FLUSH_INTERVAL. This offers the weakest durability guarantee.</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
- </para>
+ </orderedlist></para>
</section>
</section>
<section id="Java-Broker-High-Availability-ClientFailover">
- <title>Client failover configuration</title>
- <para>The details about format of Qpid connection URLs can be found at section
- <ulink url="&qpidjmsdocClientConectionUrl;">Connection URLs</ulink> within the client documentation.</para>
- <para>The failover policy option in the connection URL for the HA Cluster should be set to <emphasis>roundrobin</emphasis>.
- The Master broker should be put into a first place in <emphasis>brokerlist</emphasis> URL option.
- The recommended value for <emphasis>connectdelay</emphasis> option in broker URL should be set to
- the value greater than 1000 milliseconds. If it is desired that clients re-connect automatically after a
- master to replica failure, <varname>cyclecount</varname> should be tuned so that the retry period is longer than
- the expected length of time to perform the failover.</para>
- <example><title>Example of connection URL for the HA Cluster</title><para><![CDATA[
-amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672?connectdelay='2000'&retries='3';tcp://localhost:5671?connectdelay='2000'&retries='3';tcp://localhost:5673?connectdelay='2000'&retries='3''&failover='roundrobin?cyclecount='30''
- ]]></para></example>
+ <title>Client failover</title>
+ <para>As mentioned above, the clients need to be able to find the location of the active
+ virtualhost within the group.</para>
+ <para>Clients can do this using a static technique, for example, utilising the <ulink
+ url="&qpidjmsdocClientConectionUrl;">failover feature of the Qpid connection URL</ulink>
+ where the client has a list of all the nodes, and tries each node in sequence until it
+ discovers the node with the active virtualhost.</para>
+ <para>Another possibility is a dynamic technique utilising a proxy or Virtual IP (VIP). These
+ require other software and/or hardware and are outside the scope of this document.</para>
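+ <para>As an illustration of the static technique, a failover connection URL for a group of
+ three nodes might look like the sketch below. The host names, ports, credentials, virtualhost
+ name and retry values are placeholders assumed for this example, not recommendations. Consult
+ the client documentation for the options supported by the client in use.</para>
+ <example>
+ <title>Illustrative failover connection URL for a three-node group</title>
+ <programlisting><![CDATA[
+amqp://guest:guest@clientid/test?brokerlist='tcp://node1.example.com:5672?connectdelay='2000'&retries='3';tcp://node2.example.com:5672?connectdelay='2000'&retries='3';tcp://node3.example.com:5672?connectdelay='2000'&retries='3''&failover='roundrobin?cyclecount='30''
+ ]]></programlisting>
+ </example>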
</section>
-
-
<section role="h3" id="Java-Broker-High-Availability-JMXAPI">
<title>Qpid JMX API for HA</title>
- <para>Qpid exposes the BDB HA store information via its JMX interface and provides APIs to remove a Node from
- the group, update a Node IP address, and assign a Node as the designated primary.</para>
- <para>An instance of the <classname>BDBHAMessageStore</classname> MBean is instantiated by the broker for the each virtualhost using the HA store.</para>
- <para>The reference to this MBean can be obtained via JMX API using an ObjectName like <emphasis>org.apache.qpid:type=BDBHAMessageStore,name=&quot;&lt;virtualhost name&gt;&quot;</emphasis>
- where &lt;virtualhost name&gt; is the name of a specific virtualhost on the broker.</para>
- <table border="1">
- <title>Mbean <classname>BDBHAMessageStore</classname> attributes</title>
- <thead>
- <tr>
- <td>Name</td>
- <td>Type</td>
- <td>Accessibility</td>
- <td>Description</td>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>GroupName</td>
- <td>String</td>
- <td>Read only</td>
- <td>Name identifying the group</td>
- </tr>
- <tr>
- <td>NodeName</td>
- <td>String</td>
- <td>Read only</td>
- <td>Unique name identifying the node within the group</td>
- </tr>
- <tr>
- <td>NodeHostPort</td>
- <td>String</td>
- <td>Read only</td>
- <td>Host/port used to replicate data between this node and others in the group</td>
- </tr>
- <tr>
- <td>HelperHostPort</td>
- <td>String</td>
- <td>Read only</td>
- <td>Host/port used to allow a new node to discover other group members</td>
- </tr>
- <tr>
- <td>NodeState</td>
- <td>String</td>
- <td>Read only</td>
- <td>Current state of the node</td>
- </tr>
- <tr>
- <td>ReplicationPolicy</td>
- <td>String</td>
- <td>Read only</td>
- <td>Node replication durability</td>
- </tr>
- <tr id="JMXDesignatedPrimary">
- <td>DesignatedPrimary</td>
- <td>boolean</td>
- <td>Read/Write</td>
- <td>Designated primary flag. Applicable to the two node case.</td>
- </tr>
- <tr>
- <td>CoalescingSync</td>
- <td>boolean</td>
- <td>Read only</td>
- <td>Coalescing sync flag. Applicable to the master sync policies NO_SYNC and WRITE_NO_SYNC only.</td>
- </tr>
- <tr>
- <td>getAllNodesInGroup</td>
- <td>TabularData</td>
- <td>Read only</td>
- <td>Get all nodes within the group, regardless of whether currently attached or not</td>
- </tr>
- </tbody>
- </table>
-
- <table border="1">
- <title>Mbean <classname>BDBHAMessageStore</classname> operations</title>
- <thead>
- <tr>
- <td>Operation</td>
- <td>Parameters</td>
- <td>Returns</td>
- <td>Description</td>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>removeNodeFromGroup</td>
- <td>
- <para><emphasis>nodeName</emphasis>, name of node, string</para>
- </td>
- <td>void</td>
- <td>Remove an existing node from the group</td>
- </tr>
- <tr>
- <td>updateAddress</td>
- <td>
- <itemizedlist>
- <listitem>
- <para><emphasis>nodeName</emphasis>, name of node, string</para>
- </listitem>
- <listitem>
- <para><emphasis>newHostName</emphasis>, new host name, string</para>
- </listitem>
- <listitem>
- <para><emphasis>newPort</emphasis>, new port number, int</para>
- </listitem>
- </itemizedlist>
- </td>
- <td>void</td>
- <td>Update the address of another node. The node must be in a STOPPED state.</td>
- </tr>
- </tbody>
- </table>
- <figure>
- <title>BDBHAMessageStore view from jconsole.</title>
- <graphic fileref="images/HA-BDBHAMessageStore-MBean-jconsole.png"/>
- </figure>
- <example>
- <title>Example of java code to get the node state value</title>
- <programlisting language="java"><![CDATA[
-Map<String, Object> environment = new HashMap<String, Object>();
-
-// credentials: user name and password
-environment.put(JMXConnector.CREDENTIALS, new String[] {"admin","admin"});
-JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9001/jmxrmi");
-JMXConnector jmxConnector = JMXConnectorFactory.connect(url, environment);
-MBeanServerConnection mbsc = jmxConnector.getMBeanServerConnection();
-
-ObjectName queueObjectName = new ObjectName("org.apache.qpid:type=BDBHAMessageStore,name=\"test\"");
-String state = (String)mbsc.getAttribute(queueObjectName, "NodeState");
-
-System.out.println("Node state:" + state);
- ]]></programlisting>
- <para>Example system output:</para>
- <screen><![CDATA[Node state:MASTER]]></screen>
- </example>
- </section>
-
- <section id="Java-Broker-High-Availability-Monitoring-cluster">
- <title>Monitoring cluster</title>
- <para>In order to discover potential issues with HA Cluster early, all nodes in the Cluster should be monitored on regular basis
- using the following techniques:</para>
- <itemizedlist>
- <listitem>
- <para>Broker log files scrapping for WARN or ERROR entries and operational log entries like:</para>
- <itemizedlist>
- <listitem>
- <para><emphasis>MST-1007 :</emphasis> Store Passivated. It can indicate that Master virtual host has gone down.</para>
- </listitem>
- <listitem>
- <para><emphasis>MST-1006 :</emphasis> Recovery Complete. It can indicate that a former Replica virtual host is up and became the Master.</para>
- </listitem>
- </itemizedlist>
- </listitem>
- <listitem>
- <para>Disk space usage and system load using system tools.</para>
- </listitem>
- <listitem>
- <para>Berkeley HA node status using <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/rep/util/DbPing.html"><classname>DbPing</classname></ulink> utility.</para>
- <example><title>Using <classname>DbPing</classname> utility for monitoring HA nodes.</title><command>
-java -jar je-&oracleBdbProductVersion;.jar DbPing -groupName TestClusterGroup -nodeName Node-5001 -nodeHost localhost:5001 -socketTimeout 10000
-</command><screen>
-Current state of node: Node-5001 from group: TestClusterGroup
- Current state: MASTER
- Current master: Node-5001
- Current JE version: &oracleBdbProductVersion;
- Current log version: 8
- Current transaction end (abort or commit) VLSN: 165
- Current master transaction end (abort or commit) VLSN: 0
- Current active feeders on node: 0
- Current system load average: 0.35
-</screen></example>
- <para>In the example above <classname>DbPing</classname> utility requested status of Cluster node with name
- <emphasis>Node-5001</emphasis> from replication group <emphasis>TestClusterGroup</emphasis> running on host <emphasis>localhost:5001</emphasis>.
- The state of the node was reported into a system output.
- </para>
- </listitem>
- <listitem>
- <para>Using Qpid broker JMX interfaces.</para>
- <para>Mbean <classname>BDBHAMessageStore</classname> can be used to request the following node information:</para>
- <itemizedlist>
- <listitem>
- <para><emphasis>NodeState</emphasis> indicates whether node is a Master or Replica.</para>
- </listitem>
- <listitem>
- <para><emphasis>Durability</emphasis> replication durability.</para>
- </listitem>
- <listitem>
- <para><emphasis>DesignatedPrimary</emphasis> indicates whether Master node is designated primary.</para>
- </listitem>
- <listitem>
- <para><emphasis>GroupName</emphasis> replication group name.</para>
- </listitem>
- <listitem>
- <para><emphasis>NodeName</emphasis> node name.</para>
- </listitem>
- <listitem>
- <para><emphasis>NodeHostPort</emphasis> node host and port.</para>
- </listitem>
- <listitem>
- <para><emphasis>HelperHostPort</emphasis> helper host and port.</para>
- </listitem>
- <listitem>
- <para><emphasis>AllNodesInGroup</emphasis> lists of all nodes in the replication group including their names, hosts and ports.</para>
- </listitem>
- </itemizedlist>
- <para>For more details about <classname>BDBHAMessageStore</classname> MBean please refer section <link linkend="Java-Broker-High-Availability-JMXAPI">Qpid JMX API for HA</link></para>
- </listitem>
- </itemizedlist>
+ <para>The Qpid JMX API for HA is now deprecated. New users are recommended to use the REST
+ API.</para>
</section>
<section id="Java-Broker-High-Availability-DiskSpace">
<title>Disk space requirements</title>
- <para>Disk space is a critical resource for the HA Qpid broker.</para>
- <para>In case when a Replica goes down (or falls behind the Master in 2 node cluster where the Master is designated primary)
- and the Master continues running, the non-replicated store files are kept on the Masters disk for the period of time
- as specified in <emphasis>je.rep.repStreamTimeout</emphasis> JE setting in order to replicate this data later
- when the Replica is back. This setting is set to 1 hour by default by the broker.</para>
- <para>Depending from the application publishing/consuming rates and message sizes,
- the disk space might become overfull during this period of time due to preserved logs.
- Please, make sure to allocate enough space on your disk to avoid this from happening.
- </para>
+ <para>When nodes in a group are down, the master must retain the data they are missing so
+ that they can return to the replica role quickly.</para>
+ <para>By default, the master will retain up to one hour of missed transactions. In a busy
+ production system, the disk space occupied could be considerable.</para>
+ <para>This setting is controlled by the virtualhost context variable
+ <literal>je.rep.repStreamTimeout</literal>.</para>
</section>
<section id="Java-Broker-High-Availability-Network-Requirements">
<title>Network Requirements</title>
- <para>The HA Cluster performance depends on the network bandwidth, its use by existing traffic, and quality of service.</para>
- <para>In order to achieve the best performance it is recommended to use a separate network infrastructure for the Qpid HA Nodes
- which might include installation of dedicated network hardware on Broker hosts, assigning a higher priority to replication ports,
- installing a cluster in a separate network not impacted by any other traffic.</para>
+ <para>The HA Cluster performance depends on the network bandwidth, its use by existing traffic,
+ and quality of service.</para>
+ <para>In order to achieve the best performance it is recommended to use a separate network
+ infrastructure for the Qpid HA Nodes. This might include installing dedicated network
+ hardware on the Broker hosts, assigning a higher priority to replication ports, or installing
+ the group in a separate network not impacted by any other traffic.</para>
</section>
<section id="Java-Broker-High-Availability-Security">
<title>Security</title>
- <para>At the moment Berkeley replication API supports only TCP/IP protocol to transfer replication data between Master and Replicas.</para>
- <para>As result, the replicated data is unprotected and can be intercepted by anyone having access to the replication network.</para>
- <para>Also, anyone who can access to this network can introduce a new node and therefore receive a copy of the data.</para>
- <para>In order to reduce the security risks the entire HA cluster is recommended to run in a separate network protected from general access.</para>
+ <para>The replication stream between the master and the replicas is insecure and can be
+ intercepted by anyone having access to the replication network.</para>
+ <para>In order to reduce the security risks, it is recommended to run the entire HA group in a
+ separate network protected from general access and/or to utilise SSH tunnels or IPsec.</para>
</section>
<section id="Java-Broker-High-Availability-Backup">
<title>Backups</title>
- <para>In order to protect the entire cluster from some cataclysms which might destroy all cluster nodes,
- backups of the Master store should be taken on a regular basis.</para>
- <para>Qpid Broker distribution includes the "hot" backup utility <emphasis>backup.sh</emphasis> which can be found at broker bin folder.
- This utility can perform the backup when broker is running.</para>
- <para><emphasis>backup.sh</emphasis> script invokes <classname>org.apache.qpid.server.store.berkeleydb.BDBBackup</classname> to do the job.</para>
- <para>You can also run this class from command line like in an example below:</para>
- <example><title>Performing store backup by using <classname>BDBBackup</classname> class directly</title><command>
- java -cp qpid-bdbstore-&qpidCurrentRelease;.jar org.apache.qpid.server.store.berkeleydb.BDBBackup -fromdir path/to/store/folder -todir path/to/backup/folder</command>
- </example>
- <para>In the example above BDBBackup utility is called from qpid-bdbstore-&qpidCurrentRelease;.jar to backup the store at <emphasis>path/to/store/folder</emphasis> and copy store logs into <emphasis>path/to/backup/folder</emphasis>.</para>
- <para>Linux and Unix users can take advantage of <emphasis>backup.sh</emphasis> bash script by running this script in a similar way.</para>
- <example><title>Performing store backup by using <classname>backup.sh</classname> bash script</title>
- <command>backup.sh -fromdir path/to/store/folder -todir path/to/backup/folder</command>
- </example>
- <note>
- <para>Do not forget to ensure that the Master store is being backed up, in the event the Node elected Master changes during
- the lifecycle of the cluster.</para>
- </note>
+ <para>It is recommended to use the hot backup script to periodically back up every node in the
+ group. See <xref linkend="Java-Broker-Backup-And-Recovery-Virtualhost-Node-BDB-HA"/>.</para>
</section>
- <section id="Java-Broker-High-Availability-MigrationFromNonHA">
- <title>Migration of a non-HA store to HA</title>
- <para>Non HA stores starting from schema version 4 (0.14 Qpid release) can be automatically converted into HA store on broker startup if replication is first enabled with the <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/rep/util/DbEnableReplication.html"><classname>DbEnableReplication</classname></ulink> utility from the BDB JE jar.</para>
- <para>DbEnableReplication converts a non HA store into an HA store and can be used as follows:</para>
- <example><title>Enabling replication</title><command>
-java -jar je-&oracleBdbProductVersion;.jar DbEnableReplication -h /path/to/store -groupName MyReplicationGroup -nodeName MyNode1 -nodeHostPort localhost:5001
- </command></example>
- <para>In the examples above, je jar of version &oracleBdbProductVersion; is used to convert store at <emphasis>/path/to/store</emphasis> into HA store having replication group name <emphasis>MyReplicationGroup</emphasis>, node name <emphasis>MyNode1</emphasis> and running on host <emphasis>localhost</emphasis> and port <emphasis>5001</emphasis>.</para>
- <para>After running DbEnableReplication and updating the virtual host store to configuration to be an HA message store, like in example below,
- on broker start up the store schema will be upgraded to the most recent version and the broker can be used as normal.</para>
+
+ <section id="Java-Broker-High-Availability-Reset-Group-Infomational">
+ <title>Reset Group Information</title>
+ <para>BDB JE internally stores details of the group within its database. There are some
+ circumstances when resetting this information is useful.<itemizedlist>
+ <listitem>
+ <para>Copying data between environments (e.g. production to UAT)</para>
+ </listitem>
+ <listitem>
+ <para>Some disaster recovery situations where a group must be recreated on new
+ hardware</para>
+ </listitem>
+ </itemizedlist></para>
+ <para>This is not a normal operation and is not usually required.</para>
+ <para>The following command replaces the group table contained within the JE log files with the
+ provided information.</para>
<example>
- <title>Example of XML configuration for HA message store</title>
- <programlisting language="xml"><![CDATA[
-<store>
- <class>org.apache.qpid.server.store.berkeleydb.BDBHAMessageStore</class>
- <environment-path>/path/to/store</environment-path>
- <highAvailability>
- <groupName>MyReplicationGroup</groupName>
- <nodeName>MyNode1</nodeName>
- <nodeHostPort>localhost:5001</nodeHostPort>
- <helperHostPort>localhost:5001</helperHostPort>
- </highAvailability>
-</store>]]></programlisting>
+ <title>Resetting of replication group with <classname>DbResetRepGroup</classname></title>
+ <command>java -cp je-&oracleBdbProductVersion;.jar com.sleepycat.je.rep.util.DbResetRepGroup
+ -h path/to/jelogfiles -groupName newgroupname -nodeName nodename -nodeHostPort
+ thor:5000</command>
</example>
- <para>The Replica nodes can be started with empty stores. The data will be automatically copied from Master to Replica on Replica start-up.
- This will take a period of time determined by the size of the Masters store and the network bandwidth between the nodes.</para>
- <note>
- <para>Due to existing caveats in Berkeley JE with copying of data from Master into Replica it is recommended to restart the Master node after store schema upgrade is finished before starting the Replica nodes.</para>
- </note>
- </section>
-
- <section id="Java-Broker-High-Availability-DisasterRecovery">
- <title>Disaster Recovery</title>
- <para>This section describes the steps required to restore HA broker cluster from backup.</para>
- <para>The detailed instructions how to perform backup on replicated environment can be found <link linkend="Java-Broker-High-Availability-Backup">here</link>.</para>
- <para>At this point we assume that backups are collected on regular basis from Master node.</para>
- <para>Replication configuration of a cluster is stored internally in HA message store.
- This information includes IP addresses of the nodes.
- In case when HA message store needs to be restored on a different host with a different IP address
- the cluster replication configuration should be reseted in this case</para>
- <para>Oracle provides a command line utility <ulink url="&oracleBdbJavaDocUrl;com/sleepycat/je/rep/util/DbResetRepGroup.html"><classname>DbResetRepGroup</classname></ulink>
- to reset the members of a replication group and replace the group with a new group consisting of a single new member
- as described by the arguments supplied to the utility</para>
- <para>Cluster can be restored with the following steps:</para>
- <itemizedlist>
- <listitem><para>Copy log files into the store folder from backup</para></listitem>
- <listitem>
- <para>Use <classname>DbResetRepGroup</classname> to reset an existing environment. See an example below</para>
- <example>
- <title>Reseting of replication group with <classname>DbResetRepGroup</classname></title><command>
-java -cp je-&oracleBdbProductVersion;.jar com.sleepycat.je.rep.util.DbResetRepGroup -h ha-work/Node-5001/bdbstore -groupName TestClusterGroup -nodeName Node-5001 -nodeHostPort localhost:5001</command>
- </example>
- <para>In the example above <classname>DbResetRepGroup</classname> utility from Berkeley JE of version &oracleBdbProductVersion; is used to reset the store
- at location <emphasis>ha-work/Node-5001/bdbstore</emphasis> and set a replication group to <emphasis>TestClusterGroup</emphasis>
- having a node <emphasis>Node-5001</emphasis> which runs at <emphasis>localhost:5001</emphasis>.</para>
- </listitem>
- <listitem><para>Start a broker with HA store configured as specified on running of <classname>DbResetRepGroup</classname> utility.</para></listitem>
- <listitem><para>Start replica nodes having the same replication group and a helper host port pointing to a new master. The store content will be copied into Replicas from Master on their start up.</para></listitem>
- </itemizedlist>
- </section>
-
- <section id="Java-Broker-High-Availability-Performance">
- <title>Performance</title>
- <para>The aim of this section is not to provide exact performance metrics relating to HA, as this depends heavily on the test
- environment, but rather showing an impact of HA on Qpid broker performance in comparison with the Non HA case.</para>
- <para>For testing of impact of HA on a broker performance a special test script was written using Qpid performance test framework.
- The script opened a number of connections to the Qpid broker, created producers and consumers on separate connections,
- and published test messages with concurrent producers into a test queue and consumed them with concurrent consumers.
- The table below shows the number of producers/consumers used in the tests.
- The overall throughput was collected for each configuration.
- </para>
- <table border="1">
- <title>Number of producers/consumers in performance tests</title>
- <thead>
- <tr>
- <th>Test</th>
- <th>Number of producers</th>
- <th>Number of consumers</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>1</td>
- <td>1</td>
- <td>1</td>
- </tr>
- <tr>
- <td>2</td>
- <td>2</td>
- <td>2</td>
- </tr>
- <tr>
- <td>3</td>
- <td>4</td>
- <td>4</td>
- </tr>
- <tr>
- <td>4</td>
- <td>8</td>
- <td>8</td>
- </tr>
- <tr>
- <td>5</td>
- <td>16</td>
- <td>16</td>
- </tr>
- <tr>
- <td>6</td>
- <td>32</td>
- <td>32</td>
- </tr>
- <tr>
- <td>7</td>
- <td>64</td>
- <td>64</td>
- </tr>
- </tbody>
- </table>
- <para>The test was run against the following Qpid Broker configurations</para>
- <itemizedlist>
- <listitem>
- <para>Non HA Broker</para>
- </listitem>
- <listitem>
- <para>HA 2 Nodes Cluster with durability <emphasis>SYNC,SYNC,ALL</emphasis></para>
- </listitem>
- <listitem>
- <para>HA 2 Nodes Cluster with durability <emphasis>WRITE_NO_SYNC,WRITE_NO_SYNC,ALL</emphasis></para>
- </listitem>
- <listitem>
- <para>HA 2 Nodes Cluster with durability <emphasis>WRITE_NO_SYNC,WRITE_NO_SYNC,ALL</emphasis> and <emphasis>coalescing-sync</emphasis> Qpid mode</para>
- </listitem>
- <listitem>
- <para>HA 2 Nodes Cluster with durability <emphasis>WRITE_NO_SYNC,NO_SYNC,ALL</emphasis> and <emphasis>coalescing-sync</emphasis> Qpid mode</para>
- </listitem>
- <listitem>
- <para>HA 2 Nodes Cluster with durability <emphasis>NO_SYNC,NO_SYNC,ALL</emphasis> and <emphasis>coalescing-sync</emphasis> Qpid option</para>
- </listitem>
- </itemizedlist>
- <para>The evironment used in testing consisted of 2 servers with 4 CPU cores (2x Intel(r) Xeon(R) CPU 5150@2.66GHz), 4GB of RAM
- and running under OS Red Hat Enterprise Linux AS release 4 (Nahant Update 4). Network bandwidth was 1Gbit.
- </para>
- <para>We ran Master node on the first server and Replica and clients(both consumers and producers) on the second server.</para>
- <para>In non-HA case Qpid Broker was run on a first server and clients were run on a second server.</para>
- <para>The table below contains the test results we measured on this environment for different Broker configurations.</para>
- <para>Each result is represented by throughput value in KB/second and difference in % between HA configuration and non HA case for the same number of clients.</para>
- <table border="1">
- <title>Performance Comparison</title>
- <thead>
- <tr>
- <td>Test/Broker</td>
- <td>No HA</td>
- <td>SYNC, SYNC, ALL</td>
- <td>WRITE_NO_SYNC, WRITE_NO_SYNC, ALL</td>
- <td>WRITE_NO_SYNC, WRITE_NO_SYNC, ALL - coalescing-sync</td>
- <td>WRITE_NO_SYNC, NO_SYNC,ALL - coalescing-sync</td>
- <td>NO_SYNC, NO_SYNC, ALL - coalescing-sync</td>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>1 (1/1)</td>
- <td>0.0%</td>
- <td>-61.4%</td>
- <td>117.0%</td>
- <td>-16.02%</td>
- <td>-9.58%</td>
- <td>-25.47%</td>
- </tr>
- <tr>
- <td>2 (2/2)</td>
- <td>0.0%</td>
- <td>-75.43%</td>
- <td>67.87%</td>
- <td>-66.6%</td>
- <td>-69.02%</td>
- <td>-30.43%</td>
- </tr>
- <tr>
- <td>3 (4/4)</td>
- <td>0.0%</td>
- <td>-84.89%</td>
- <td>24.19%</td>
- <td>-71.02%</td>
- <td>-69.37%</td>
- <td>-43.67%</td>
- </tr>
- <tr>
- <td>4 (8/8)</td>
- <td>0.0%</td>
- <td>-91.17%</td>
- <td>-22.97%</td>
- <td>-82.32%</td>
- <td>-83.42%</td>
- <td>-55.5%</td>
- </tr>
- <tr>
- <td>5 (16/16)</td>
- <td>0.0%</td>
- <td>-91.16%</td>
- <td>-21.42%</td>
- <td>-86.6%</td>
- <td>-86.37%</td>
- <td>-46.99%</td>
- </tr>
- <tr>
- <td>6 (32/32)</td>
- <td>0.0%</td>
- <td>-94.83%</td>
- <td>-51.51%</td>
- <td>-92.15%</td>
- <td>-92.02%</td>
- <td>-57.59%</td>
- </tr>
- <tr>
- <td>7 (64/64)</td>
- <td>0.0%</td>
- <td>-94.2%</td>
- <td>-41.84%</td>
- <td>-89.55%</td>
- <td>-89.55%</td>
- <td>-50.54%</td>
- </tr>
- </tbody>
- </table>
- <para>The figure below depicts the graphs for the performance test results</para>
- <figure>
- <title>Test results</title>
- <graphic fileref="images/HA-perftests-results.png"/>
- </figure>
- <para>On using durability <emphasis>SYNC,SYNC,ALL</emphasis> (without coalescing-sync) the performance drops significantly (by 62-95%) in comparison with non HA broker.</para>
- <para>Whilst, on using durability <emphasis>WRITE_NO_SYNC,WRITE_NO_SYNC,ALL</emphasis> (without coalescing-sync) the performance drops by only half, but with loss of durability guarantee, so is not recommended.</para>
- <para>In order to have better performance with HA, Qpid Broker comes up with the special mode called <link linkend="Java-Broker-High-Availability-DurabilityGuarantee_CoalescingSync">coalescing-sync</link>,
- With this mode enabled, Qpid broker batches the concurrent transaction commits and syncs transaction data into Master disk in one go.
- As result, the HA performance only drops by 25-60% for durability <emphasis>NO_SYNC,NO_SYNC,ALL</emphasis> and by 10-90% for <emphasis>WRITE_NO_SYNC,WRITE_NO_SYNC,ALL</emphasis>.</para>
+ <para>The modified log files can then be copied into the
+ <literal>${QPID_WORK}/&lt;nodename&gt;/config</literal> directory of a target Broker. Then
+ start the Broker and add a BDB HA Virtualhost node, specifying the same group name, node name
+ and node address. You will then have a group with a single node, ready to start re-adding
+ additional nodes as described above. </para>
</section>
</chapter>