diff options
| author | Alan Conway <aconway@apache.org> | 2013-08-30 14:11:53 +0000 |
|---|---|---|
| committer | Alan Conway <aconway@apache.org> | 2013-08-30 14:11:53 +0000 |
| commit | 27d31ba355acfef3ec66c23e48864e88a358014b (patch) | |
| tree | 460cdf169eeacc80c71cb604169afb742703e08c /qpid/cpp | |
| parent | 72f706cac68f5e7146ca5facb0e72e9d35b4332b (diff) | |
| download | qpid-python-27d31ba355acfef3ec66c23e48864e88a358014b.tar.gz | |
NO-JIRA: Remove obsolete (never implemented) cluster design docs.
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1518975 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'qpid/cpp')
| -rw-r--r-- | qpid/cpp/design_docs/new-cluster-design.txt | 298 | ||||
| -rw-r--r-- | qpid/cpp/design_docs/new-cluster-plan.txt | 340 |
2 files changed, 0 insertions, 638 deletions
diff --git a/qpid/cpp/design_docs/new-cluster-design.txt b/qpid/cpp/design_docs/new-cluster-design.txt deleted file mode 100644 index d29ecce445..0000000000 --- a/qpid/cpp/design_docs/new-cluster-design.txt +++ /dev/null @@ -1,298 +0,0 @@ --*-org-*- -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -* A new design for Qpid clustering. - -** Issues with old cluster design - -See [[./old-cluster-issues.txt]] - -** A new cluster design. - -1. Clearly defined interface between broker code and cluster plug-in. - -2. Replicate queue events rather than client data. - - Only requires consistent enqueue order. - - Events only need be serialized per-queue, allows concurrency between queues - - Allows for replicated and non-replicated queues. - -3. Use a lock protocol to agree order of dequeues: only the broker - holding the lock can acqiure & dequeue. No longer relies on - identical state and lock-step behavior to cause identical dequeues - on each broker. - -4. Use multiple CPG groups to process different queues in - parallel. Use a fixed set of groups and hash queue names to choose - the group for each queue. - -*** Requirements - -The cluster must provide these delivery guarantees: - -- client sends transfer: message must be replicated and not lost even if the local broker crashes. -- client acquires a message: message must not be delivered on another broker while acquired. -- client accepts message: message is forgotten, will never be delivered or re-queued by any broker. -- client releases message: message must be re-queued on cluster and not lost. -- client rejects message: message must be dead-lettered or discarded and forgotten. -- client disconnects/broker crashes: acquired but not accepted messages must be re-queued on cluster. - -Each guarantee takes effect when the client receives a *completion* -for the associated command (transfer, acquire, reject, accept) - -*** Broker receiving messages - -On recieving a message transfer, in the connection thread we: -- multicast a message-received event. -- enqueue and complete the transfer when it is self-delivered. - -Other brokers enqueue the message when they recieve the message-received event. - -Enqueues are queued up with other queue operations to be executed in the -thread context associated with the queue. - -*** Broker sending messages: moving queue ownership - -Each queue is *owned* by at most one cluster broker at a time. Only -that broker may acquire or dequeue messages. The owner multicasts -notification of messages it acquires/dequeues to the cluster. -Periodically the owner hands over ownership to another interested -broker, providing time-shared access to the queue among all interested -brokers. - -We assume the same IO-driven dequeuing algorithm as the standalone -broker with one modification: queues can be "locked". A locked queue -is not available for dequeuing messages and will be skipped by the -output algorithm. - -At any given time only those queues owned by the local broker will be -unlocked. - -As messages are acquired/dequeued from unlocked queues by the IO threads -the broker multicasts acquire/dequeue events to the cluster. - -When an unlocked queue has no more consumers with credit, or when a -time limit expires, the broker relinquishes ownership by multicasting -a release-queue event, allowing another interested broker to take -ownership. - -*** Asynchronous completion of accept - -In acknowledged mode a message is not forgotten until it is accepted, -to allow for requeue on rejection or crash. The accept should not be -completed till the message has been forgotten. - -On receiving an accept the broker: -- dequeues the message from the local queue -- multicasts an "accept" event -- completes the accept asynchronously when the dequeue event is self delivered. - -NOTE: The message store does not currently implement asynchronous -completions of accept, this is a bug. - -*** Multiple CPG groups. - -The old cluster was bottlenecked by processing everything in a single -CPG deliver thread. - -The new cluster uses a set of CPG groups, one per core. Queue names -are hashed to give group indexes, so statistically queues are likely -to be spread over the set of groups. - -Operations on a given queue always use the same group, so we have -order within each queue, but operations on different queues can use -different groups giving greater throughput sending to CPG and multiple -handler threads to process CPG messages. - -** Inconsistent errors. - -An inconsistent error means that after multicasting an enqueue, accept -or dequeue, some brokers succeed in processing it and others fail. - -The new design eliminates most sources of inconsistent errors in the -old broker: connections, sessions, security, management etc. Only -store journal errors remain. - -The new inconsistent error protocol is similar to the old one with one -major improvement: brokers do not have to stall processing while an -error is being resolved. - -** Updating new members - -When a new member (the updatee) joins a cluster it needs to be brought -up to date with the rest of the cluster. An existing member (the -updater) sends an "update". - -In the old cluster design the update is a snapshot of the entire -broker state. To ensure consistency of the snapshot both the updatee -and the updater "stall" at the start of the update, i.e. they stop -processing multicast events and queue them up for processing when the -update is complete. This creates a back-log of work to get through, -which leaves them lagging behind the rest of the cluster till they -catch up (which is not guaranteed to happen in a bounded time.) - -With the new cluster design only exchanges, queues, bindings and -messages need to be replicated. - -We update individual objects (queues and exchanges) independently. -- create queues first, then update all queues and exchanges in parallel. -- multiple updater threads, per queue/exchange. - -Queue updater: -- marks the queue position at the sync point -- sends messages starting from the sync point working towards the head of the queue. -- send "done" message. - -Queue updatee: -- enqueues received from CPG: add to back of queue as normal. -- dequeues received from CPG: apply if found, else save to check at end of update. -- messages from updater: add to the *front* of the queue. -- update complete: apply any saved dequeues. - -Exchange updater: -- updater: send snapshot of exchange as it was at the sync point. - -Exchange updatee: -- queue exchange operations after the sync point. -- when snapshot is received: apply saved operations. - -Note: -- Updater is active throughout, no stalling. -- Consuming clients actually reduce the size of the update. -- Updatee stalls clients until the update completes. - (Note: May be possible to avoid updatee stall as well, needs thought) - -** Internal cluster interface - -The new cluster interface is similar to the MessageStore interface, but -provides more detail (message positions) and some additional call -points (e.g. acquire) - -The cluster interface captures these events: -- wiring changes: queue/exchange declare/bind -- message enqueued/acquired/released/rejected/dequeued. -- transactional events. - -** Maintainability - -This design gives us more robust code with a clear and explicit interfaces. - -The cluster depends on specific events clearly defined by an explicit -interface. Provided the semantics of this interface are not violated, -the cluster will not be broken by changes to broker code. - -The cluster no longer requires identical processing of the entire -broker stack on each broker. It is not affected by the details of how -the broker allocates messages. It is independent of the -protocol-specific state of connections and sessions and so is -protected from future protocol changes (e.g. AMQP 1.0) - -A number of specific ways the code will be simplified: -- drop code to replicate management model. -- drop timer workarounds for TTL, management, heartbeats. -- drop "cluster-safe assertions" in broker code. -- drop connections, sessions, management from cluster update. -- drop security workarounds: cluster code now operates after message decoding. -- drop connection tracking in cluster code. -- simper inconsistent-error handling code, no need to stall. - -** Performance - -The standalone broker processes _connections_ concurrently, so CPU -usage increases as you add more connections. - -The new cluster processes _queues_ concurrently, so CPU usage increases as you -add more queues. - -In both cases, CPU usage peaks when the number of "units of - concurrency" (connections or queues) goes above the number of cores. - -When all consumers on a queue are connected to the same broker the new -cluster uses the same messagea allocation threading/logic as a -standalone broker, with a little extra asynchronous book-keeping. - -If a queue has multiple consumers connected to multiple brokers, the -new cluster time-shares the queue which is less efficient than having -all consumers on a queue connected to the same broker. - -** Flow control -New design does not queue up CPG delivered messages, they are -processed immediately in the CPG deliver thread. This means that CPG's -flow control is sufficient for qpid. - -** Live upgrades - -Live upgrades refers to the ability to upgrade a cluster while it is -running, with no downtime. Each brokers in the cluster is shut down, -and then re-started with a new version of the broker code. - -To achieve this -- Cluster protocl XML file has a new element <version number=N> attached - to each method. This is the version at which the method was added. -- New versions can only add methods, existing methods cannot be changed. -- The cluster handshake for new members includes the protocol version - at each member. -- Each cpg message starts with header: version,size. Allows new encodings later. -- Brokers ignore messages of higher version. - -- The cluster's version is the lowest version among its members. -- A newer broker can join and older cluster. When it does, it must restrict - itself to speaking the older version protocol. -- When the cluster version increases (because the lowest version member has left) - the remaining members may move up to the new version. - - -* Design debates -** Active/active vs. active passive - -An active-active cluster can be used in an active-passive mode. In -this mode we would like the cluster to be as efficient as a strictly -active-passive implementation. - -An active/passive implementation allows some simplifications over active/active: -- drop Queue ownership and locking -- don't need to replicate message acquisition. -- can do immediate local enqueue and still guarantee order. - -Active/passive introduces a few extra requirements: -- Exactly one broker has to take over if primary fails. -- Passive members must refuse client connections. -- On failover, clients must re-try all known addresses till they find the active member. - -Active/active benefits: -- A broker failure only affects the subset of clients connected to that broker. -- Clients can switch to any other broker on failover -- Backup brokers are immediately available on failover. -- As long as a client can connect to any broker in the cluster, it can be served. - -Active/passive benefits: -- Don't need to replicate message allocation, can feed consumers at top speed. - -Active/passive drawbacks: -- All clients on one node so a failure affects every client in the - system. -- After a failure there is a "reconnect storm" as every client - reconnects to the new active node. -- After a failure there is a period where no broker is active, until - the other brokers realize the primary is gone and agree on the new - primary. -- Clients must find the single active node, may involve multiple - connect attempts. -- No service if a partition separates a client from the active broker, - even if the client can see other brokers. - - diff --git a/qpid/cpp/design_docs/new-cluster-plan.txt b/qpid/cpp/design_docs/new-cluster-plan.txt deleted file mode 100644 index 042e03f177..0000000000 --- a/qpid/cpp/design_docs/new-cluster-plan.txt +++ /dev/null @@ -1,340 +0,0 @@ --*-org-*- - -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -* Status of impementation - -Meaning of priorities: -[#A] Essential for basic functioning. -[#B] Required for first release. -[#C] Can be addressed in a later release. - -The existig prototype is bare bones to do performance benchmarks: -- Implements publish and consumer locking protocol. -- Defered delivery and asynchronous completion of message. -- Optimize the case all consumers are on the same node. -- No new member updates, no failover updates, no transactions, no persistence etc. - -Prototype code is on branch qpid-2920-active, in cpp/src/qpid/cluster/exp/ - -** Similarities to existing cluster. - -/Active-active/: the new cluster can be a drop-in replacement for the -old, existing tests & customer deployment configurations are still -valid. - -/Virtual synchrony/: Uses corosync to co-ordinate activity of members. - -/XML controls/: Uses XML to define the primitives multicast to the -cluster. - -** Differences with existing cluster. - -/Report rather than predict consumption/: brokers explicitly tell each -other which messages have been acquired or dequeued. This removes the -major cause of bugs in the existing cluster. - -/Queue consumer locking/: to avoid duplicates only one broker can acquire or -dequeue messages at a time - while has the consume-lock on the -queue. If multiple brokers are consuming from the same queue the lock -is passed around to time-share access to the queue. - -/Per-queue concurrency/: uses a fixed-size set of CPG groups (reflecting -the concurrency of the host) to allow concurrent processing on -different queues. Queues are hashed onto the groups. - -* Completed tasks -** DONE [#A] Minimal POC: publish/acquire/dequeue protocol. - CLOSED: [2011-10-05 Wed 16:03] - -Defines broker::Cluster interface and call points. -Initial interface commite - -Main classes -Core: central object holding cluster classes together (replaces cluster::Cluster) -BrokerContext: implements broker::Cluster interface. -QueueContext: Attached to a broker::Queue, holds cluster status. -MessageHolder:holds local messages while they are being enqueued. - -Implements multiple CPG groups for better concurrency. - -** DONE [#A] Large message replication. - CLOSED: [2011-10-05 Wed 17:22] -Multicast using fixed-size (64k) buffers, allow fragmetation of messages across buffers (frame by frame) - -* Design Questions -** [[Queue sequence numbers vs. independant message IDs]] - -Current prototype uses queue+sequence number to identify message. This -is tricky for updating new members as the sequence numbers are only -known on delivery. - -Independent message IDs that can be generated and sent as part of the -message simplify this and potentially allow performance benefits by -relaxing total ordering. However they require additional map lookups -that hurt performance. - -- [X] Prototype independent message IDs, check performance. -Throughput worse by 30% in contented case, 10% in uncontended. - -* Tasks to match existing cluster -** TODO [#A] Review old cluster code for more tasks. 1 -** TODO [#A] Put cluster enqueue after all policy & other checks. - -gsim: we do policy check after multicasting enqueue so -could have inconsistent outcome. - -aconway: Multicast should be after enqueue and any other code that may -decide to send/not send the message. - -gsime: while later is better, is moving it that late the right thing? -That will mean for example that any dequeues triggered by the enqueue -(e.g. ring queue or lvq) will happen before the enqueue is broadcast. - -** TODO [#A] Defer and async completion of wiring commands. 5 -Testing requirement: Many tests assume wiring changes are visible -across the cluster once the wiring commad completes. - -Name clashes: avoid race if same name queue/exchange declared on 2 -brokers simultaneously. - -Ken async accept, never merged: http://svn.apache.org/viewvc/qpid/branches/qpid-3079/ - -Clashes with non-replicated: see [[Allow non-replicated]] below. - -** TODO [#A] defer & async completion for explicit accept. - -Explicit accept currently ignores the consume lock. Defer and complete -it when the lock is acquired. - -** TODO [#A] Update to new members joining. 10. - -Need to resolve [[Queue sequence numbers vs. independant message IDs]] -first. -- implicit sequence numbers are more tricky to replicate to new member. - -Update individual objects (queues and exchanges) independently. -- create queues first, then update all queues and exchanges in parallel. -- multiple updater threads, per queue/exchange. -- updater sends messages to special exchange(s) (not using extended AMQP controls) - -Queue updater: -- marks the queue position at the sync point -- sends messages starting from the sync point working towards the head of the queue. -- send "done" message. -Note: updater remains active throughout, consuming clients actually reduce the -size of the update. - -Queue updatee: -- enqueues received from CPG: add to back of queue as normal. -- dequeues received from CPG: apply if found, else save to check at end of update. -- messages from updater: add to the *front* of the queue. -- update complete: apply any saved dequeues. - -Exchange updater: -- updater: send snapshot of exchange as it was at the sync point. - -Exchange updatee: -- queue exchange operations after the sync point. -- when snapshot is received: apply saved operations. - -Updater remains active throughout. -Updatee stalls clients until the update completes. - -Updating queue/exchange/binding objects is via the same encode/decode -that is used by the store. Updatee to use recovery interfaces to -recover? - -** TODO [#A] Failover updates to client. 2 -Implement the amq.failover exchange to notify clients of membership. -** TODO [#A] Passing all existing cluster tests. 5 - -The new cluster should be a drop-in replacement for the old, so it -should be able to pass all the existing tests. - -** TODO [#B] Initial status protocol. 3 -Handshake to give status of each broker member to new members joining. -Status includes -- cluster protocol version. -- persistent store state (clean, dirty) -- make it extensible, so additional state can be added in new protocols - -Clean store if last man standing or clean shutdown. -Need to add multicast controls for shutdown. - -** TODO [#B] Persistent cluster startup. 4 - -Based on existing code: -- Exchange dirty/clean exchanged in initial status. -- Only one broker recovers from store, others update. -** TODO [#B] Replace boost::hash with our own hash function. 1 -The hash function is effectively part of the interface so -we need to be sure it doesn't change underneath us. - -** TODO [#B] Management model. 3 -Alerts for inconsistent message loss. - -** TODO [#B] Management methods that modify queues. 5 - -Replicate management methods that modify queues - e.g. move, purge. -Target broker may not have all messages on other brokers for purge/destroy. -- Queue::purge() - wait for lock, purge local, mcast dequeues. -- Queue::move() - wait for lock, move msgs (mcasts enqueues), mcast dequeues. -- Queue::destroy() - messages to alternate exchange on all brokers. -- Queue::get() - ??? -Need to add callpoints & mcast messages to replicate these? -** TODO [#B] TX transaction support. 5 -Extend broker::Cluster interface to capture transaction context and completion. -Running brokers exchange TX information. -New broker update includes TX information. - - // FIXME aconway 2010-10-18: As things stand the cluster is not - // compatible with transactions - // - enqueues occur after routing is complete - // - no call to Cluster::enqueue, should be in Queue::process? - // - no transaction context associated with messages in the Cluster interface. - // - no call to Cluster::accept in Queue::dequeueCommitted - -Injecting holes into a queue: -- Multicast a 'non-message' that just consumes one queue position. -- Used to reserve a message ID (position) for a non-commited message. -- Also could allow non-replicated messages on a replicated queue if required. - -** TODO [#B] DTX transaction support. 5 -Extend broker::Cluster interface to capture transaction context and completion. -Running brokers exchange DTX information. -New broker update includes DTX information. - -** TODO [#B] Replicate message groups? -Message groups may require additional state to be replicated. -** TODO [#B] Replicate state for Fairshare? -gsim: fairshare would need explicit code to keep it in sync across -nodes; that may not be required however. -** TODO [#B] Timed auto-delete queues? -gsim: may need specific attention? -** TODO [#B] Async completion of accept. 4 -When this is fixed in the standalone broker, it should be fixed for cluster. - -** TODO [#B] Network partitions and quorum. 2 -Re-use existing implementation. - -** TODO [#B] Review error handling, put in a consitent model. 4. -- [ ] Review all asserts, for possible throw. -- [ ] Decide on fatal vs. non-fatal errors. - -** TODO [#B] Implement inconsistent error handling policy. 5 -What to do if a message is enqueued sucessfully on some broker(s), -but fails on other(s) - e.g. due to store limits? -- fail on local broker = possible message loss. -- fail on non-local broker = possible duplication. - -We have more flexibility now, we don't *have* to crash -- but we've lost some of our redundancy guarantee, how to inform user? - -Options to respond to inconsistent error: -- stop broker -- reset broker (exec a new qpidd) -- reset queue -- log critical -- send management event - -Most important is to inform of the risk of message loss. -Current favourite: reset queue+log critical+ management event. -Configurable choices? - -** TODO [#C] Allow non-replicated exchanges, queues. 5 - -3 levels set in declare arguments: -- qpid.replicate=no - nothing is replicated. -- qpid.replicate=wiring - queues/exchanges are replicated but not messages. -- qpid.replicate=yes - queues exchanges and messages are replicated. - -Wiring use case: it's OK to lose some messages (up to the max depth of -the queue) but the queue/exchange structure must be highly available -so clients can resume communication after fail over. - -Configurable default? Default same as old cluster? - -Need to -- save replicated status to stored (in arguments). -- support in management tools. - -Avoid name clashes between replicated/non-replicated: multicast -local-only names as well, all brokers keep a map and refuse to create -clashes. - -** TODO [#C] Handling immediate messages in a cluster. 2 -Include remote consumers in descision to deliver an immediate message. -* Improvements over existing cluster -** TODO [#C] Remove old cluster hacks and workarounds. -The old cluster has workarounds in the broker code that can be removed. -- [ ] drop code to replicate management model. -- [ ] drop timer workarounds for TTL, management, heartbeats. -- [ ] drop "cluster-safe assertions" in broker code. -- [ ] drop connections, sessions, management from cluster update. -- [ ] drop security workarounds: cluster code now operates after message decoding. -- [ ] drop connection tracking in cluster code. -- [ ] simpler inconsistent-error handling code, no need to stall. - -** TODO [#C] Support for live upgrades. - -Allow brokers in a running cluster to be replaced one-by-one with a new version. -(see new-cluster-design for design notes.) - -The old cluster protocol was unstable because any changes in broker -state caused changes to the cluster protocol.The new design should be -much more stable. - -Points to implement in anticipation of live upgrade: -- Prefix each CPG message with a version number and length. - Version number determines how to decode the message. -- Brokers ignore messages that have a higher version number than they understand. -- Protocol version XML element in cluster.xml, on each control. -- Initial status protocol to include protocol version number. - -New member udpates: use the store encode/decode for updates, use the -same backward compatibility strategy as the store. This allows for -adding new elements to the end of structures but not changing or -removing new elements. - -NOTE: Any change to the association of CPG group names and queues will -break compatibility. How to work around this? - -** TODO [#C] Refactoring of common concerns. - -There are a bunch of things that act as "Queue observers" with intercept -points in similar places. -- QueuePolicy -- QueuedEvents (async replication) -- MessageStore -- Cluster - -Look for ways to capitalize on the similarity & simplify the code. - -In particular QueuedEvents (async replication) strongly resembles -cluster replication, but over TCP rather than multicast. - -** TODO [#C] Support for AMQP 1.0. - -* Testing -** TODO [#A] Pass all existing cluster tests. -Requires [[Defer and async completion of wiring commands.]] -** TODO [#A] New cluster tests. -Stress tests & performance benchmarks focused on changes in new cluster: -- concurrency by queues rather than connections. -- different handling shared queues when consuemrs are on different brokers. |
