| author | Alan Conway <aconway@apache.org> | 2013-06-28 18:59:05 +0000 |
|---|---|---|
| committer | Alan Conway <aconway@apache.org> | 2013-06-28 18:59:05 +0000 |
| commit | 891eaaacb213d5fa8f1c626cf5bb648802ff6796 | |
| tree | e4cdb2810e7caa670c989e43e2bcf5b0677b9b2f | /qpid/cpp/design_docs |
| parent | c11bd16a3240f05dea6618b9c572fd3cd3d417f4 | |
| download | qpid-python-891eaaacb213d5fa8f1c626cf5bb648802ff6796.tar.gz | |
QPID-4328: HA support for transactions initial design doc.
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1497883 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'qpid/cpp/design_docs')
| -rw-r--r-- | qpid/cpp/design_docs/ha-transactions.md | 175 |
1 file changed, 175 insertions, 0 deletions
diff --git a/qpid/cpp/design_docs/ha-transactions.md b/qpid/cpp/design_docs/ha-transactions.md
new file mode 100644
index 0000000000..a5b952398f
--- /dev/null
+++ b/qpid/cpp/design_docs/ha-transactions.md
@@ -0,0 +1,175 @@

# Design note: HA with Transactions

Clients can use transactions (TX or DTX) with the current HA module, but:

- the results are not guaranteed to be committed atomically on the backups.
- the backups do not record the transaction in the store for possible recovery.

## Requirements

1. Atomic: transaction results must be atomic on primary and backups.

2. Consistent: backups and primary must agree on whether the transaction committed.

3. Isolated: concurrent transactions must not interfere with each other.

4. Durable: transactional state is written to the store.

5. Recovery: if the cluster crashes while a DTX is in progress, it can be
   restarted and participate in DTX recovery with a transaction coordinator
   (currently via JMS).

## TX and DTX

Both TX and DTX transactions require a DTX-like `prepare` phase to synchronize
the commit or rollback across multiple brokers, so this design note applies to both.

For DTX the transaction is identified by its `xid`. For TX the HA module generates
a unique identifier.

## Intercepting transactional operations

Introduce two new interfaces on the primary to intercept transactional operations:

    TransactionObserverFactory {
        TransactionObserver create(...);
    }

    TransactionObserver {
        publish(queue, msg);
        accept(accumulatedAck, unacked);  # NOTE: translate queue positions to replication ids
        prepare();
        commit();
        rollback();
    }

The primary registers a `TransactionObserverFactory` with the broker to hook
into transactions. On the primary, transactional events are processed as normal
and are additionally passed to the `TransactionObserver`, which replicates the
events to the backups.

## Replicating transaction content

The primary creates a specially-named queue for each transaction (the tx-queue).

TransactionObserver operations:

- publish(queue, msg): push an enqueue event onto the tx-queue.
- accept(accumulatedAck, unacked): push dequeue events onto the tx-queue
  (NOTE: queue positions must be translated to replication ids).

The tx-queue is replicated like a normal queue, with some extensions for transactions:

- Backups create a `TransactionReplicator` (extends `QueueReplicator`), which
  builds up a `TxBuffer` or `DtxBuffer` of transaction events.
- The primary has a `TxReplicatingSubscription` (extends `ReplicatingSubscription`).

Using a tx-queue has the following benefits:

- We already have the tools to replicate it.
- It handles asynchronous fan-out of transactions to multiple backups.
- It keeps the tx events in linear order even if the tx spans multiple queues.
- It keeps tx data available for new backups until the tx is closed.
- By closing the queue (see the next section) it is easy to establish which
  backups are in or out of the transaction.
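To make the event flow concrete, here is a minimal, self-contained sketch of a
primary-side observer mapping transactional operations onto events of a
per-transaction queue. None of these types are the real broker classes:
`TxEvent`, `TxQueue`, `TxObserver` and the `qpid.ha-tx:` name prefix are
invented for illustration only.

    // Sketch only: illustrates the TransactionObserver -> tx-queue mapping.
    // TxEvent, TxQueue and TxObserver are stand-ins, not the real broker classes.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // A replicated event as it would appear on the tx-queue.
    struct TxEvent {
        enum Type { ENQUEUE, DEQUEUE, PREPARE, COMMIT, ROLLBACK };
        Type type;
        std::string queue;       // target queue (empty for control events)
        uint64_t replicationId;  // replication id, not the local queue position
    };

    // Stand-in for the specially-named per-transaction queue that backups subscribe to.
    class TxQueue {
      public:
        explicit TxQueue(const std::string& id) : name("qpid.ha-tx:" + id) {}
        void push(const TxEvent& e) { events.push_back(e); }  // replicated like any queue
        void close() { closed = true; }  // once closed, new backups cannot subscribe
        std::string name;
        std::vector<TxEvent> events;
        bool closed = false;
    };

    // Primary-side observer: sees each transactional operation after normal
    // processing and mirrors it onto the tx-queue.
    class TxObserver {
      public:
        explicit TxObserver(const std::string& xid) : txq(xid) {}
        void publish(const std::string& queue, uint64_t replicationId) {
            txq.push(TxEvent{TxEvent::ENQUEUE, queue, replicationId});
        }
        // Caller must translate queue positions in the accumulated ack to replication ids.
        void accept(const std::string& queue, uint64_t replicationId) {
            txq.push(TxEvent{TxEvent::DEQUEUE, queue, replicationId});
        }
        void prepare()  { txq.close(); txq.push(TxEvent{TxEvent::PREPARE, "", 0}); }
        void commit()   { txq.push(TxEvent{TxEvent::COMMIT, "", 0}); }
        void rollback() { txq.push(TxEvent{TxEvent::ROLLBACK, "", 0}); }
        TxQueue txq;
    };

    int main() {
        TxObserver obs("xid-1234");
        obs.publish("orders", 101);  // transactional enqueue
        obs.accept("orders", 97);    // transactional dequeue (accepted ack)
        obs.prepare();               // close the tx-queue, then add the prepare event
        obs.commit();
        std::cout << obs.txq.name << " carries " << obs.txq.events.size()
                  << " replicated events\n";
        return 0;
    }

In the actual design the tx-queue would be an ordinary broker queue replicated
by the existing HA machinery, and `prepare()` would complete only after the
backups have responded, as described under "DTX Prepare" below.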
## DTX Prepare

Primary receives dtx.prepare:

- "Closes" the tx-queue:
  - A closed queue rejects any attempt to subscribe.
  - Backups subscribed before the close are in the transaction.
- Puts a prepare event on the tx-queue and does a local prepare.
- Returns ok if the outcome of the local prepare and all backup prepares is ok, fail otherwise.

*TODO*: this means we need asynchronous completion of the prepare control so
that we complete only when all backups have responded (or timed out).

Backups receiving the prepare event do a local prepare. The outcome is communicated
to the TxReplicatingSubscription on the primary as follows:

- ok: the backup acknowledges the prepare event message on the tx-queue.
- fail: the backup cancels its tx-queue subscription.

## DTX Commit/Rollback

Primary receives dtx.commit/rollback:

- The primary puts a commit/rollback event on the tx-queue and does a local commit/rollback.
- Backups commit/rollback as instructed and unsubscribe from the tx-queue.
- The tx-queue auto-deletes when the last backup unsubscribes.

## TX Commit/Rollback

Primary receives tx.commit/rollback:

- Do the prepare phase as for DTX.
- Do the commit/rollback phase as for DTX.

## De-duplication

When a tx commits, each broker (backup and primary) pushes the tx messages to its local queue.

On the primary, ReplicatingSubscriptions will see the tx messages on the local
queue. We need to avoid re-sending them to backups that already have the
messages from the transaction.

ReplicatingSubscription has a "skip set" of messages already on the backup;
add the tx messages to the skip set before the primary commits to the local queue.

## Failover

Backups abort all open transactions if the primary fails.

## Atomicity

Keep a tx atomic when backups catch up while the tx is in progress.
A `ready` backup should never contain a subset of the messages in a transaction.

Scenario:

- A guard is placed on queue Q for an expected backup.
- A transaction with messages A, B, C on Q is prepared; the tx-queue is closed.
- The expected backup connects and is declared ready due to the guard.
- The transaction commits; the primary adds A, B to Q and crashes before adding C.
- The backup is ready, so it is promoted to new primary with a partial transaction A, B.

*TODO*: Not happy with the following solution.

Solution: the primary delays `ready` status until the full tx is replicated.

- When a new backup joins, it is considered 'blocked' by any TX in progress:
  - Create a TxTracker per queue, per backup at prepare, before the local commit.
  - The TxTracker holds the set of enqueues and dequeues in the tx.
  - The ReplicatingSubscription removes enqueues and dequeues from the TxTracker as they are replicated.
- The backup starts to catch up as normal but is not granted ready until:
  - it catches up on its guard (as per normal catch-up), AND
  - all its TxTrackers are clear.

NOTE: this needs careful locking to correctly handle

- multiple queues per tx,
- multiple tx per queue,
- multiple TxTrackers per queue per backup,
- synchronizing backups in/out of a transaction with respect to setting trackers.

*TODO*: A blocking TX eliminates the benefit of the queue guard for expected backups.

- The backup won't be ready until it has replicated all guarded positions up to the tx.
- This could severely delay backups becoming ready after fail-over if queues are long.

## Recovery

*TODO*

## Testing

*TODO*

- JMS tests.
- Port the old cluster_tests.py DTX & TX tests to HA.
- tests/src/py/qpid_tests/broker_0_10/dtx.py, tx.py (run all python tests against an HA broker?)
- New tests with the messaging API tx (see the sketch after this list).
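For the last item, a transactional test written against the C++ `qpid::messaging`
API might look roughly like the sketch below. The broker address, queue name and
options are placeholders, and a real test would also verify queue contents on the
backups (for example after killing the primary).

    // Sketch of a "messaging API tx" test. Requires the qpid::messaging C++ client;
    // broker address and queue name are placeholders.
    #include <qpid/messaging/Connection.h>
    #include <qpid/messaging/Session.h>
    #include <qpid/messaging/Sender.h>
    #include <qpid/messaging/Receiver.h>
    #include <qpid/messaging/Message.h>
    #include <qpid/messaging/Duration.h>
    #include <iostream>
    #include <string>

    using namespace qpid::messaging;

    int main() {
        // Assumes an HA cluster with the primary reachable on localhost:5672.
        Connection connection("localhost:5672", "{reconnect: true}");
        try {
            connection.open();
            Session session = connection.createTransactionalSession();
            Sender sender = session.createSender("tx-test-queue; {create: always}");

            // Enqueue a batch in one transaction: nothing should be visible on the
            // primary or on any backup until the commit.
            for (int i = 0; i < 10; ++i)
                sender.send(Message("msg-" + std::to_string(i)));
            session.commit();

            // Dequeue transactionally as well; after this commit the dequeues must
            // also be applied atomically on all backups.
            Receiver receiver = session.createReceiver("tx-test-queue");
            Message m = receiver.fetch(Duration::SECOND * 5);
            session.acknowledge();
            session.commit();
            std::cout << "received: " << m.getContent() << std::endl;

            connection.close();
            return 0;
        } catch (const std::exception& e) {
            std::cerr << e.what() << std::endl;
            connection.close();
            return 1;
        }
    }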
