| Commit message | Author | Age | Files | Lines |
Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Use new rabbitmqctl features for monitoring
This will stop wasting network bandwidth for monitoring.
E.g. a 200-node OpenStack installation produces around 10k queues and
10k channels. A single cluster-wide list_queues/list_channels call in this
environment results in 27k TCP packets and around 12 megabytes of
network traffic. Given that these calls happen ~10 times a minute with 3
controllers, this adds up to pretty significant overhead.
To enable these features you should have a RabbitMQ build containing the
following patches:
- https://github.com/rabbitmq/rabbitmq-server/pull/883
- https://github.com/rabbitmq/rabbitmq-server/pull/911
- https://github.com/rabbitmq/rabbitmq-server/pull/915
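With those patches in place, monitoring can query each node only for the entities it hosts instead of issuing one cluster-wide listing. A hypothetical sketch (node names are made up; this is not the actual monitoring code):

```shell
# Ask each controller for its own queues only; the --local option added
# by the PRs above avoids the cluster-wide fan-out on every poll.
for node in rabbit@node-1 rabbit@node-2 rabbit@node-3; do
    rabbitmqctl -n "$node" list_queues --local name messages
done
```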
[OCF HA] Change master score computation & split-brain detection logic
The previous split-brain logic worked as follows: each slave checked
that it was connected to the master. If the check failed, the slave
restarted. The ultimate flaw in that logic is that there is little
guarantee the master is alive at that moment. Moreover, if the master
dies, it is very probable that during the next monitor check the slaves
will detect its death and restart, causing complete RabbitMQ cluster
downtime.
With the new approach the master node checks that the slaves are
connected to it and orders them to restart if they are not. The check
is performed after the master node's own health check, meaning that at
least that node survives. Also, orders expire in one minute, and a
freshly started node ignores orders to restart for three minutes, to
give the cluster time to stabilize.
Also corrected a problem when a node starts and is already clustered:
in that case the OCF script did not start the RabbitMQ app, causing a
subsequent restart. Now we ensure that the RabbitMQ app is running.
The two introduced attributes, rabbit-start-phase-1-time and
rabbit-ordered-to-restart, are made private. To allow the master to set
a node's order to restart, both the ocf_update_private_attr and
ocf_get_private_attr signatures are expanded to accept a node name.
Finally, a bug is fixed in ocf_get_private_attr. Unlike crm_attribute,
attrd_updater returns an empty string instead of "(null)" when an
attribute is not defined on the requested node but is defined on some
other node. The code was changed correspondingly to expect an empty
string, not "(null)".
This fixes the following Fuel bugs:
https://bugs.launchpad.net/fuel/+bug/1559136
https://bugs.launchpad.net/mos/+bug/1561894
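The "(null)" vs. empty-string distinction can be sketched with a small helper. The helper names and the attrd_updater output format below are assumptions for illustration, not code from the actual script:

```shell
# Parse an attrd_updater -Q line such as:  name="foo" host="n1" value="42"
# attrd_updater prints value="" (not "(null)") when the attribute is
# unset on the queried node, so the check must be against the empty string.
attr_value_from_query() {
    printf '%s\n' "$1" | sed -e 's/.*value="\(.*\)"/\1/'
}

is_attr_unset() {
    [ -z "$(attr_value_from_query "$1")" ]
}
```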
Right now we assign 1000 to the oldest node and 1 to the others. That
creates a problem when the master restarts: no node is promoted until
that node comes back. In that case the returning node will have a score
of 1, like all other slaves, and Pacemaker will select it for promotion
again. That node is clean and empty, and afterwards the other slaves
join it, wiping their data as well. As a result, we lose all the
messages.
The new algorithm actually ranks nodes instead of just selecting the
oldest one. It also maintains the invariant that if node A started
later than node B, then node A's score must be smaller than node B's.
As a result, a freshly started node has no chance of being selected in
preference to an older node. If several nodes start simultaneously, an
older node among them might temporarily receive a lower score than a
younger one, but that is negligible.
Also remove any action on demote or demote notification - all of these
duplicate the actions done on stop or stop notification. With these
removed, changing the master on a running cluster does not affect the
RabbitMQ cluster in any way - we just declare another node master and
that is it. This matters for the current change, because the master
score might change after initial cluster start-up, causing master
migration from one node to another.
This fix is a prerequisite for the fix to the following Fuel bugs:
https://bugs.launchpad.net/fuel/+bug/1559136
https://bugs.launchpad.net/mos/+bug/1561894
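The ranking invariant can be illustrated with a minimal sketch (node names, timestamps, and the base score of 1000 are illustrative only, not taken from the script):

```shell
# Read "node start_time" pairs on stdin, sort by start time, and assign
# strictly decreasing scores, so a node that started later always ends
# up with a lower score than any node that started earlier.
rank_scores() {
    sort -n -k2 | {
        score=1000
        while read -r node _start; do
            echo "$node $score"
            score=$((score - 1))
        done
    }
}

printf '%s\n' 'node-2 1100' 'node-1 1000' 'node-3 1200' | rank_scores
# → node-1 1000
#   node-2 999
#   node-3 998
```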
* Also solves deadlocks when the leader aborts autoheal on node down
https://github.com/Ayanda-D/rabbitmq-server into Ayanda-D-rabbitmq-server-914
dead pids from queue
from the gm process' neighbours
[OCF HA] Add ocf_get_private_attr function to RabbitMQ OCF script
The function is extracted from check_timeouts so that it can be re-used
later in other parts of the script. Also, switch check_timeouts to use the
existing ocf_update_private_attr function.
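A minimal sketch of what such a getter could look like, assuming attrd_updater's flags (-p for private, -Q for query) and its query output format; the real script's implementation may differ:

```shell
# Read a private node attribute and print its value. Parses attrd_updater
# query output of the form:  name="foo" host="n1" value="7"
ocf_get_private_attr() {
    attr_name="$1"
    attrd_updater -p -n "$attr_name" -Q 2>/dev/null \
        | sed -e 's/.*value="\(.*\)"/\1/'
}
```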
Fix bashisms in rabbitmq OCF RA
Change the "printf %b" usage so that it passes checkbashisms.
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
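As background, the printf utility's %b directive is specified by POSIX and expands backslash escapes in its argument, which makes it the portable alternative to bash's `echo -e`. A tiny illustration (not the RA's actual code):

```shell
# %b expands the \n in the argument, so this prints two lines
# without relying on any bash-specific behaviour.
printf '%b\n' 'first line\nsecond line'
```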
Discard any unexpected messages, such as late replies from gen_server
https://github.com/binarin/rabbitmq-server into binarin-rabbitmq-server-health-check-node-monitor
Tests + a comment outlining the problem. The check itself is in a separate
commit to `rabbitmq-common`.
Update iptables calls with --wait
If iptables is invoked outside of the OCF script at the same time, the
script's iptables call will fail because it cannot get the xtables lock.
This change adds the -w (--wait) flag to the iptables call, so it waits
until the lock can be acquired instead of just exiting with an error.
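The change can be sketched as follows; the function name and the rule itself are hypothetical, and the sketch only prints the command it would run rather than executing it:

```shell
# Build an iptables command with -w (--wait) so it blocks on the xtables
# lock instead of failing when another process holds it. Echoed, not
# executed, since this is only an illustration.
build_block_rule() {
    port="$1"
    echo "iptables -w -I INPUT -p tcp --dport $port -j REJECT"
}

build_block_rule 5672
# → iptables -w -I INPUT -p tcp --dport 5672 -j REJECT
```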
Add support for listing only local queues
Partially implements https://github.com/rabbitmq/rabbitmq-server/issues/851
- Made the old `--online`/`--offline` options mutually exclusive with
  each other and with the new `--local` option
- Added documentation for both the old and the new options
- Fixed some ugly indentation in the generated usage (only the
  `set_policy` wrapped line remains unfixed)
- Added integration test suite for `rabbitmqctl list_queues`
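Hypothetical usage of the new option (the node name is made up):

```shell
# List only the queues hosted on this node; per the change above,
# --local cannot be combined with --online or --offline.
rabbitmqctl -n rabbit@node-1 list_queues --local name messages
```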
Fix longname-mode on hosts without detectable FQDN
Server startup and the CLI tools fail in longnames mode if Erlang is not
able to determine the host FQDN (one containing at least one dot).
E.g. this can happen when you want to assemble a cluster using only
IP addresses and you don't care about FQDNs at all.
It was also not possible to alleviate this situation using any options
from http://erlang.org/doc/apps/erts/inet_cfg.html
Fixes #890
Added resume after flow
Fix some type specs
Forgot to update specs in #868
Bump default VM atom table limit to 5M
See #895 for background and reasoning.
Fixes #895.
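For reference, the atom table limit is controlled by the Erlang emulator's +t flag, and RabbitMQ passes extra emulator flags through a standard environment variable; treat the exact value below as illustrative:

```shell
# Raise the Erlang VM atom table limit to 5M atoms for the server
# via RabbitMQ's additional-emulator-args mechanism.
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+t 5000000"
```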
Don't die in case of faulty node
This patch fixes a TOCTOU issue introduced in the following commit:
* rabbitmq/rabbitmq-server@93b9e37c3ea0cade4e30da0aa1f14fa97c82e669
If the node was just removed from the cluster, then there is a small
window during which it is still listed locally as a member of the Mnesia
cluster. We retrieve the list of nodes by locally calling
```erlang
unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```
However, retrieving the status of that particular failed node is no
longer possible and throws an exception. See the `alarms_by_node(Name)`
function, which simply calls `unsafe_rpc(Name, rabbit, status, [])` for
this node.
This `unsafe_rpc/4` function is basically a wrapper over
`rabbit_misc:rpc_call/4` which translates `{badrpc, nodedown}` into an
exception. The exception generated by the `alarms_by_node(Name)` call
surfaces at a very high level, so rabbitmqctl thinks that the entire
cluster is down, while printing a rather bizarre message:
```
Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@overcloud-controller-0']

rabbit@overcloud-controller-0:
  * connected to epmd (port 4369) on overcloud-controller-0
  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

current node details:
- node name: 'rabbitmq-cli-31@overcloud-controller-0'
- home dir: /var/lib/rabbitmq
- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==
```
See - it reports that it failed to connect to node
'rabbit@overcloud-controller-0' (because it catches the exception from
`alarms_by_node(Name)`), even though the attempt to connect to this node
was successful (the 'rabbit' application is running).
To fix this we should not throw an exception during the subsequent
calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only
during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`).
Moreover, we don't need to change `nodes_in_cluster(Node)`, because it
is called locally. The only function which must use
`rabbit_misc:rpc_call/4` is `alarms_by_node(Name)`, because it is
executed remotely.
See this issue for further details and real world example:
* https://bugzilla.redhat.com/1356169
Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Tune scheduling bind flags for Erlang VM