# Design note: HA with Transactions. Clients can use transactions (TX or DTX) with the current HA module but: - the results are not guaranteed to be comitted atomically on the backups. - the backups do not record the transaction in the store for possible recovery. ## Requirements 1. Atomic: Transaction results must be atomic on primary and backups. 2. Consistent: Backups and primary must agree on whether the transaction comitted. 3. Isolated: Concurrent transactions must not interfere with each other. 4. Durable: Transactional state is written to the store. 5. Recovery: If a cluster crashes while a DTX is in progress, it can be re-started and participate in DTX recovery with a transaction co-ordinater (currently via JMS) ## TX and DTX Both TX and DTX transactions require a DTX-like `prepare` phase to synchronize the commit or rollback across multiple brokers. This design note applies to both. For DTX the transaction is identified by the `xid`. For TX the HA module generates a unique identifier. ## Intercepting transactional operations Introduce 2 new interfaces on the Primary to intercept transactional operations: TransactionObserverFactory { TransactionObserver create(...); } TransactionObserver { publish(queue, msg); accept(accumulatedAck, unacked) # NOTE: translate queue positions to replication ids prepare() commit() rollback() } The primary will register a `TransactionObserverFactory` with the broker to hook into transactions. On the primary, transactional events are processed as normal and additionally are passed to the `TransactionObserver` which replicates the events to the backups. ## Replicating transaction content The primary creates a specially-named queue for each transaction (the tx-queue) TransactionObserver operations: - publish(queue, msg): push enqueue event onto tx-queue - accept(accumulatedAck, unacked) push dequeue event onto tx-queue (NOTE: must translate queue positions to replication ids) The tx-queue is replicated like a normal queue with some extensions for transactions. - Backups create a `TransactionReplicator` (extends `QueueReplicator`) - `QueueReplicator` builds up a `TxBuffer` or `DtxBuffer` of transaction events. - Primary has `TxReplicatingSubscription` (extends `ReplicatingSubscription` Using a tx-queue has the following benefits: - Already have the tools to replicate it. - Handles async fan-out of transactions to multiple Backups. - keeps the tx events in linear order even if the tx spans multiple queues. - Keeps tx data available for new backups until the tx is closed - By closing the queue (see next section) its easy to establish what backups are in/out of the transaction. ## DTX Prepare Primary receives dtx.prepare: - "closes" the tx-queue - A closed queue rejects any attempt to subscribe. - Backups subscribed before the close are in the transaction. - Puts prepare event on tx-queue, and does local prepare - Returns ok if outcome of local and all backup prepares is ok, fail otherwise. *TODO*: this means we need asynchronous completion of the prepare control so we complete only when all backups have responded (or time out.) Backups receiving prepare event do a local prepare. Outcome is communicated to the TxReplicatingSubscription on the primary as follows: - ok: Backup acknowledges prepare event message on the tx-queue - fail: Backup cancels tx-queue subscription ## DTX Commit/Rollback Primary receives dtx.commit/rollback - Primary puts commit/rollback event on the tx-queue and does local commit/rollback. - Backups commit/rollback as instructed and unsubscribe from tx-queue. - tx-queue auto deletes when last backup unsubscribes. ## TX Commit/Rollback Primary receives tx.commit/rollback - Do prepare phase as for DTX. - Do commit/rollback phase as for DTX. ## De-duplication When tx commits, each broker (backup & primary) pushes tx messages to the local queue. On the primary, ReplicatingSubscriptions will see the tx messages on the local queue. Need to avoid re-sending to backups that already have the messages from the transaction. ReplicatingSubscriptions has a "skip set" of messages already on the backup, add tx messages to the skip set before Primary commits to local queue. ## Failover Backups abort all open tx if the primary fails. ## Atomicity Keep tx atomic when backups catch up while a tx is in progress. A `ready` backup should never contain a proper subset of messages in a transaction. Scenario: - Guard placed on Q for an expected backup. - Transaction with messages A,B,C on Q is prepared, tx-queue is closed. - Expected backup connects and is declared ready due to guard. - Transaction commits, primary publishes A, B to Q and crashes before adding C. - Backup is ready so promoted to new primary with a partial transaction A,B. *TODO*: Not happy with the following solution. Solution: Primary delays `ready` status till full tx is replicated. - When a new backup joins it is considered 'blocked' by any TX in progress. - create a TxTracker per queue, per backup at prepare, before local commit. - TxTracker holds set of enqueues & dequeues in the tx. - ReplicatingSubscription removes enqueus & dequeues from TxTracker as they are replicated. - Backup starts to catch up as normal but is not granted ready till: - catches up on it's guard (as per normal catch-up) AND - all TxTrackers are clear. NOTE: needs thougthful locking to correctly handle - multiple queues per tx - multiple tx per queue - multiple TxTrackers per queue per per backup - synchronize backups in/out of transaction wrt setting trackers. *TODO*: A blocking TX eliminates the benefit of the queue guard for expected backups. - It won't be ready till it has replicated all guarded positions up to the tx. - Could severely delay backups becoming ready after fail-over if queues are long. ## Recovery *TODO* ## Testing New tests: - TX transaction tests in python. - TX failover stress test (c.f. `test_failover_send_receive`) Existing tests: - JMS transaction tests (DTX & TX) - `qpid/tests/src/py/qpid_tests/broker_0_10/dtx.py,tx.py` run against HA broker.