summaryrefslogtreecommitdiff
path: root/qpid/cpp/src/tests
diff options
context:
space:
mode:
authorAlan Conway <aconway@apache.org>2014-08-08 09:23:54 +0000
committerAlan Conway <aconway@apache.org>2014-08-08 09:23:54 +0000
commitd06ee666b50104bcd7cc42f656a68cce8636f79c (patch)
treedc4050b0b94a49bb10d01db78dfb3f56499736c1 /qpid/cpp/src/tests
parent9f797837921732d538d2331e8018125d3a6eaf2a (diff)
downloadqpid-python-d06ee666b50104bcd7cc42f656a68cce8636f79c.tar.gz
QPID-5973: HA cluster state may get stuck in recovering
A backup queue is considered "ready" when all messages up to the first guarded position have either been replicated and acknowledged or dequeued. Previously this was implemented by waiting for the replicating subscription to advance to the first guarded position and waiting for all expected acks. However if messages are dequeued out-of-order (which happens with transactions) there can be a gap at the tail of the queue. The replicating subscription will not advance past this gap because it only advances when there are messages to consume. This resulted in backups stuck in catch-up. The recovering primary has a time-out for backups that never re-connect, but if they connect successfully and don't disconnect, the primary assumes they will become ready and waits - causing the primary to be stuck in "recovering". The fix is to notify a replicating subscription if it becomes "stopped" because there are no more messages available on the queue. This implies that either it is at the tail OR there are no more messages until the tail. Either way we should consider this "ready" from the point of view of HA catch-up. git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1616702 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'qpid/cpp/src/tests')
-rwxr-xr-xqpid/cpp/src/tests/ha_tests.py5
1 files changed, 3 insertions, 2 deletions
diff --git a/qpid/cpp/src/tests/ha_tests.py b/qpid/cpp/src/tests/ha_tests.py
index dfb65318a9..f71560dffb 100755
--- a/qpid/cpp/src/tests/ha_tests.py
+++ b/qpid/cpp/src/tests/ha_tests.py
@@ -20,7 +20,6 @@
import os, signal, sys, time, imp, re, subprocess, glob, random, logging, shutil, math, unittest
import traceback
from qpid.datatypes import uuid4, UUID
-from qpid.harness import Skipped
from brokertest import *
from ha_test import *
from threading import Thread, Lock, Condition
@@ -363,7 +362,9 @@ class ReplicationTests(HaBrokerTest):
cluster[0].wait_status("ready")
cluster.bounce(1)
# FIXME aconway 2014-02-20: pr does not fail over with 1.0/swig
- if qm == qpid_messaging: raise Skipped("FIXME SWIG client failover bug")
+ if qm == qpid_messaging:
+ print "WARNING: Skipping SWIG client failover bug"
+ return
self.assertEqual("a", pr.fetch().content)
pr.session.acknowledge()
backup.assert_browse_backup("q", ["b"])