diff options
| author | Peter Lemenkov <lemenkov@gmail.com> | 2016-07-22 17:15:02 +0200 |
|---|---|---|
| committer | Peter Lemenkov <lemenkov@gmail.com> | 2016-07-28 15:07:07 +0300 |
| commit | 5bcf3bafa558ec361002e87c89b3f0b6b3a14069 (patch) | |
| tree | e7d5bcf25225a8cd1aa9b476d6bf9e4f87e850c9 | |
| parent | 0c316fc3b6c135236fb0a267b31595dc5762f9eb (diff) | |
| download | rabbitmq-server-git-5bcf3bafa558ec361002e87c89b3f0b6b3a14069.tar.gz | |
Don't die in case of faulty node
This patch fixes TOCTOU issue introduced in the following commit:
* rabbitmq/rabbitmq-server@93b9e37c3ea0cade4e30da0aa1f14fa97c82e669
If the node was just removed from the cluster, then there is a small
window when it is still listed as a member of a Mnesia cluster
locally. We retrieve list of nodes by calling locally
```erlang
unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```
However retrieving status from that particular failed node is no longer
available and throws an exception. See `alarms_by_node(Name)` function,
which is simply calls `unsafe_rpc(Name, rabbit, status, [])` for this
node.
This `unsafe_rpc/4` function is basically a wrapper over `rabbit_misc:rpc_call/4`
which translates `{badrpc, nodedown}` into exception. This exception
generated by `alarms_by_node(Name)` function call emerges on a very high
level, so rabbitmqct thinks that the entire cluster is down, while
generating a very bizarre message:
Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: unable to connect to node 'rabbit@overcloud-controller-0':
nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@overcloud-controller-0']
rabbit@overcloud-controller-0:
* connected to epmd (port 4369) on overcloud-controller-0
* node rabbit@overcloud-controller-0 up, 'rabbit' application running
current node details:
- node name: 'rabbitmq-cli-31@overcloud-controller-0'
- home dir: /var/lib/rabbitmq
- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==
See - it reports that it failed to connect to node
'rabbit@overcloud-controller-0' (because it catches an exception from
`alarms_by_node(Name)`), but attempt to connect to this node was
successful ('rabbit' application running).
In order to fix that we should not throw exception during consequent
calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only
during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`).
Even more - we don't need to change `nodes_in_cluster(Node)`, because it
is called locally. The only function which must use
`rabbit_misc:rpc_call/4` is `alarms_by_node(Name)` because it is
executed remotely.
See this issue for further details and real world example:
* https://bugzilla.redhat.com/1356169
Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
| -rw-r--r-- | src/rabbit_control_main.erl | 9 |
1 files changed, 6 insertions, 3 deletions
diff --git a/src/rabbit_control_main.erl b/src/rabbit_control_main.erl index d2f0e8bcb0..ea9d6a2030 100644 --- a/src/rabbit_control_main.erl +++ b/src/rabbit_control_main.erl @@ -953,6 +953,9 @@ nodes_in_cluster(Node) -> unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]). alarms_by_node(Name) -> - Status = unsafe_rpc(Name, rabbit, status, []), - {_, As} = lists:keyfind(alarms, 1, Status), - {Name, As}. + case rpc_call(Name, rabbit, status, []) of + {badrpc,nodedown} -> {Name, [nodedown]}; + Status -> + {_, As} = lists:keyfind(alarms, 1, Status), + {Name, As} + end. |
