Commit message  Author  Age  Files  Lines
* Nobody is using mochinum in this repo. Move it to common.  Peter Lemenkov  2016-08-26  1  -358/+0
|
|     Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
|
* Merge branch 'rabbitmq-server-928' into stable  Michael Klishin  2016-08-24  3  -51/+138
|\
| * Naming, wording  Michael Klishin  2016-08-24  1  -7/+7
| |
| * Handle late autoheal_finished message  Diana Corbacho  2016-08-24  2  -0/+17
| |
| * Merge branch 'stable' into rabbitmq-server-928  Michael Klishin  2016-08-23  1  -231/+209
| |\
| |/
|/|
* | Merge pull request #916 from binarin/rabbitmq-server-new-shiny-ocf-health-check  Michael Klishin  2016-08-23  1  -22/+110
|\ \
| | |     Use new rabbitmqctl features for monitoring
| | |
| * | Monitor rabbitmq from OCF with less overhead  Alexey Lebedeff  2016-08-23  1  -22/+110
|/ /
| |     This will stop wasting network bandwidth on monitoring. E.g. a 200-node OpenStack installation produces around 10k queues and 10k channels. Doing a single list_queues/list_channels call in a cluster in this environment results in 27k TCP packets and around 12 megabytes of network traffic. Given that these calls happen ~10 times a minute with 3 controllers, this results in pretty significant overhead.
| |
| |     To enable these features you should have a rabbitmq containing the following patches:
| |     - https://github.com/rabbitmq/rabbitmq-server/pull/883
| |     - https://github.com/rabbitmq/rabbitmq-server/pull/911
| |     - https://github.com/rabbitmq/rabbitmq-server/pull/915
| |
* | Merge pull request #929 from dmitrymex/start-sequence  Michael Klishin  2016-08-22  1  -222/+112
|\ \
| | |     [OCF HA] Change master score computation & split-brain detection logic
| | |
| * | [OCF HA] Enhance split-brain detection logic  Dmitry Mescheryakov  2016-08-22  1  -56/+64
| | |     Previous split-brain logic worked as follows: each slave checked that it was connected to the master. If the check failed, the slave restarted. The fundamental flaw in that logic is that there is little guarantee that the master is alive at that moment. Moreover, if the master dies, it is very probable that during the next monitor check the slaves will detect its death and restart, causing complete RabbitMQ cluster downtime.
| | |
| | |     With the new approach the master node checks that slaves are connected to it and orders them to restart if they are not. The check is performed after the master node health check, meaning that at least that node survives. Also, orders expire in one minute, and a freshly started node ignores orders to restart for three minutes to give the cluster time to stabilize.
| | |
| | |     Also corrected a problem that occurred when a node starts and is already clustered. In that case the OCF script forgot to start the RabbitMQ app, causing a subsequent restart. Now we ensure that the RabbitMQ app is running.
| | |
| | |     The two introduced attributes, rabbit-start-phase-1-time and rabbit-ordered-to-restart, are made private. In order to allow the master to set a node's order to restart, both the ocf_update_private_attr and ocf_get_private_attr signatures are expanded to allow passing a node name.
| | |
| | |     Finally, a bug is fixed in ocf_get_private_attr. Unlike crm_attribute, attrd_updater returns an empty string instead of "(null)" when an attribute is not defined on the needed node but is defined on some other node. The code is changed accordingly to expect an empty string, not "(null)".
| | |
| | |     This is a fix for Fuel bugs
| | |     https://bugs.launchpad.net/fuel/+bug/1559136
| | |     https://bugs.launchpad.net/mos/+bug/1561894
| | |
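
The restart-order rules described in this commit message can be sketched in a few lines. This is only an illustration in Python, not the OCF shell script itself: the function names, the timestamp arguments and the exact boundary handling are assumptions; only the one-minute expiry and the three-minute grace period come from the message.

```python
ORDER_TTL_SECONDS = 60        # "orders expire in one minute"
STARTUP_GRACE_SECONDS = 180   # fresh nodes ignore orders for three minutes


def nodes_to_order(cluster_members, connected_to_master, master):
    """Master side (hypothetical helper): members not connected to the master."""
    return sorted(n for n in cluster_members
                  if n != master and n not in connected_to_master)


def should_obey_order(now, node_start_time, order_time):
    """Node side (hypothetical helper): is a restart order still binding?

    order_time is None when no order has been recorded for this node.
    """
    if order_time is None:
        return False
    if now - order_time > ORDER_TTL_SECONDS:
        return False          # the order has expired
    if now - node_start_time < STARTUP_GRACE_SECONDS:
        return False          # node is fresh; give the cluster time to stabilize
    return True
```

Because the master only issues orders after passing its own health check, at least the master keeps running while disconnected slaves restart.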
| * | [OCF HA] Rank master score based on start time  Dmitry Mescheryakov  2016-08-22  1  -166/+48
|/ /
| |     Right now we assign 1000 to the oldest nodes and 1 to others. That creates a problem when the master restarts and no node is promoted until that node starts back up. In that case the returned node will have a score of 1, like all other slaves, and Pacemaker will choose to promote it again. The node is clean and empty, and afterwards the other slaves join it, wiping their data as well. As a result, we lose all the messages.
| |
| |     The new algorithm actually ranks nodes, not just selects the oldest one. It also maintains the invariant that if node A started later than node B, then node A's score must be smaller than that of node B. As a result, a freshly started node has no chance of being selected in preference to an older node. If several nodes start simultaneously, an older node among them might temporarily receive a lower score than a younger one, but that is negligible.
| |
| |     Also remove any action on demote or demote notification - all of these duplicate actions done in stop or stop notification. With these removed, changing master on a running cluster does not affect the RabbitMQ cluster in any way - we just declare another node master and that is it. This is important for the current change because the master score might change after initial cluster start-up, causing master migration from one node to another.
| |
| |     This fix is a prerequisite for the fix to Fuel bugs
| |     https://bugs.launchpad.net/fuel/+bug/1559136
| |     https://bugs.launchpad.net/mos/+bug/1561894
| |
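
The ranking invariant stated above (a node that started later always scores lower) can be illustrated with a short sketch. This is Python rather than the OCF script's shell, and the function name, the `max_score` parameter and the tie-break by node name are assumptions made for the example; the commit message does not say how the real script computes the numbers.

```python
def master_scores(start_times, max_score=1000):
    """Rank nodes by start time: the oldest node gets the highest score.

    start_times maps node name -> start timestamp. Ties are broken by
    node name (an assumption), so every node gets a distinct score.
    max_score should exceed the number of nodes to keep scores positive.
    """
    ordered = sorted(start_times, key=lambda node: (start_times[node], node))
    return {node: max_score - rank for rank, node in enumerate(ordered)}
```

With this shape, a master that restarts comes back with the newest start time and therefore the lowest score, so it cannot be promoted ahead of slaves that still hold data.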
| * Improve tolerance to partial partitions in autoheal  Diana Corbacho  2016-08-23  3  -51/+121
|/
|     * Also solves deadlocks when leader aborts autoheal in node down
|
* Merge branch 'Ayanda-D-rabbitmq-server-914' into stable  [rabbitmq_v3_6_6_milestone1]  Diana Corbacho  2016-08-19  5  -8/+74
|\
| * Test GM crash when group is deleted while processing a DOWN message  Diana Corbacho  2016-08-19  1  -1/+37
| |
| * Merge branch 'rabbitmq-server-914' of https://github.com/Ayanda-D/rabbitmq-server into Ayanda-D-rabbitmq-server-914  Diana Corbacho  2016-08-19  4  -7/+37
| |\
| | * Handle unexpected gm group alterations prior to removal of dead pids from queue  Ayanda Dube  2016-08-15  3  -6/+31
| | |
| | * Adds check_membership/2 clause for handling non-existent gm group  Ayanda Dube  2016-08-15  1  -1/+3
| | |
| | * Safely handle (and log) anonymous info messages, most likely from the gm process' neighbours  Ayanda Dube  2016-08-15  1  -1/+8
| | |
* | | Merge pull request #926 from dmitrymex/get-private-attr  Michael Klishin  2016-08-18  1  -10/+15
|\ \ \
| |/ /
|/| |
| | |     [OCF HA] Add ocf_get_private_attr function to RabbitMQ OCF script
| | |
| * | [OCF HA] Add ocf_get_private_attr function to RabbitMQ OCF script  Dmitry Mescheryakov  2016-08-18  1  -10/+15
|/ /
| |     The function is extracted from check_timeouts to be reused later in other parts of the script. Also, switch check_timeouts to use the existing ocf_update_private_attr function.
| |
* | Merge pull request #925 from bogdando/fix_bashisms  Michael Klishin  2016-08-18  1  -4/+4
|\ \
| | |     Fix bashisms in rabbitmq OCF RA
| | |
| * | Fix bashisms in rabbitmq OCF RA  Bogdan Dobrelya  2016-08-18  1  -4/+4
|/ /
| |     Change "printf %b" so that it passes checkbashisms.
| |
| |     Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
| |
* | Merge pull request #923 from rabbitmq/rabbitmq-server-922  Michael Klishin  2016-08-16  2  -2/+14
|\ \
| | |     Discard any unexpected messages, such as late replies from gen_server
| | |
| * | Typo  Michael Klishin  2016-08-16  1  -1/+1
| | |
| * | Discard any unexpected messages, such as late replies from gen_server  Diana Corbacho  2016-08-16  2  -2/+14
|/ /
* | Merge branch 'binarin-rabbitmq-server-health-check-node-monitor' into stable  Michael Klishin  2016-08-16  2  -0/+30
|\ \
| * \ Merge branch 'rabbitmq-server-health-check-node-monitor' of https://github.com/binarin/rabbitmq-server into binarin-rabbitmq-server-health-check-node-monitor  Michael Klishin  2016-08-16  2  -0/+30
| |\ \
|/ / /
| * | Check rabbit_node_monitor during health-check  Alexey Lebedeff  2016-08-10  2  -0/+30
| | |
| | |     Tests + comment outlining the problem. The check itself is in separate commit to `rabbitmq-common`.
| | |
* | | Merge pull request #920 from mwhahaha/iptables-stable  Michael Klishin  2016-08-16  1  -4/+4
|\ \ \
| | | |     Update iptables calls with --wait
| | | |
| * | | Update iptables calls with --wait  Alex Schultz  2016-08-15  1  -4/+4
|/ / /
| | |     If iptables is currently being called outside of the OCF script, the iptables call will fail because it cannot get a lock. This change updates the iptables calls to include the -w (--wait) flag, which waits until the lock can be acquired instead of exiting with an error.
| | |
* | | Docs wording  Michael Klishin  2016-08-15  1  -3/+2
| |/
|/|
* | Merge pull request #911 from binarin/rabbitmq-server-851  D Corbacho  2016-08-15  7  -44/+257
|\ \
| | |     Add support for listing only local queues
| | |
| * | Add support for listing only local queues  Alexey Lebedeff  2016-08-12  7  -44/+257
| | |     Partially implements https://github.com/rabbitmq/rabbitmq-server/issues/851
| | |
| | |     - Made old `--online`/`--offline` options mutually exclusive between themselves and the new `--local` option
| | |     - Added documentation both for the old and the new option
| | |     - Fixed some ugly indentation in generated usage (only `set_policy` wrapped line remains unfixed)
| | |     - Added integration test suite for `rabbitmqctl list_queues`
| | |
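
The mutual exclusion between `--online`, `--offline` and `--local` can be illustrated with a generic option parser. The sketch below uses Python's `argparse` purely as an analogy: `rabbitmqctl` is implemented in Erlang and does not use this code, and the program name `list_queues_demo` is made up for the example.

```python
import argparse


def build_parser():
    """Build a demo parser where at most one of the three filters is allowed."""
    parser = argparse.ArgumentParser(prog="list_queues_demo")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--online", action="store_true",
                       help="list only queues that are currently available")
    group.add_argument("--offline", action="store_true",
                       help="list only queues that are currently unavailable")
    group.add_argument("--local", action="store_true",
                       help="list only queues hosted on the target node")
    return parser
```

Passing two of these flags together makes `argparse` print a usage error and exit, which mirrors the behaviour the commit describes for the real options.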
* | | Fix trivial typo noticed in error message.  jerryk  2016-08-12  1  -1/+1
| | |
* | | Merge pull request #892 from binarin/rabbitmq-server-890  Jean-Sébastien Pédron  2016-08-10  2  -3/+30
|\ \ \
| |_|/
|/| |
| | |     Fix longname-mode on hosts without detectable FQDN
| | |
| * | Fix longname-mode on hosts without detectable FQDN  Alexey Lebedeff  2016-07-22  2  -3/+30
| | |     Server startup and CLI tools fail in longnames mode if Erlang is not able to determine the host FQDN (with at least one dot in it). E.g. this can happen when you want to assemble a cluster using only IP addresses and don't care about FQDNs at all. It was not possible to alleviate this situation using any options from http://erlang.org/doc/apps/erts/inet_cfg.html
| | |
| | |     Fixes #890
| | |
* | | Merge pull request #910 from rabbitmq/rabbitmq-server-850  Michael Klishin  2016-08-10  1  -1/+1
|\ \ \
| |_|/
|/| |
| | |     Added resume after flow
| | |
| * | add resume after flow  Gabriele Santomaggio  2016-08-08  1  -1/+1
| | |
| * | fix loop in flow control state  Gabriele Santomaggio  2016-08-05  1  -1/+1
| | |
* | | Commit .deb and .rpm change logs  Michael Klishin  2016-08-05  2  -0/+9
|/ /
* | Merge branch 'rabbitmq-server-904' into stable  [rabbitmq_v3_6_5_milestone2] [rabbitmq_v3_6_5_milestone1] [rabbitmq_v3_6_5]  Michael Klishin  2016-08-02  1  -1/+3
|\ \
| * | added rabbit_registry require  Gabriele Santomaggio  2016-08-02  1  -1/+3
|/ /
* | Merge pull request #898 from binarin/rabbitmq-server-868-secs  Michael Klishin  2016-07-30  1  -1/+1
|\ \
| | |     Fix some type specs
| | |
| * | Fix some type specs  Alexey Lebedeff  2016-07-29  1  -1/+1
|/ /
| |     Forgot to update specs in #868
| |
* | Commit .deb and .rpm change logs  Michael Klishin  2016-07-29  2  -0/+9
| |
* | Merge pull request #896 from rabbitmq/rabbitmq-server-895  [rabbitmq_v3_6_4]  D Corbacho  2016-07-29  2  -2/+2
|\ \
| | |     Bump default VM atom table limit to 5M
| | |
| * | Bump default VM atom table size to 5M  Michael Klishin  2016-07-28  2  -2/+2
| |/
| |     See #895 for background and reasoning.
| |
| |     Fixes #895.
| |
* | Merge pull request #894 from lemenkov/toctou_in_cluster_status  Michael Klishin  2016-07-28  1  -3/+6
|\ \
| |/
|/|
| |     Don't die in case of faulty node
| |
| * Don't die in case of faulty node  Peter Lemenkov  2016-07-28  1  -3/+6
|/
|     This patch fixes a TOCTOU issue introduced in the following commit:
|
|     * rabbitmq/rabbitmq-server@93b9e37c3ea0cade4e30da0aa1f14fa97c82e669
|
|     If the node was just removed from the cluster, then there is a small window when it is still listed as a member of a Mnesia cluster locally. We retrieve the list of nodes by calling locally
|
|     ```erlang
|     unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
|     ```
|
|     However, retrieving status from that particular failed node is no longer possible and throws an exception. See the `alarms_by_node(Name)` function, which simply calls `unsafe_rpc(Name, rabbit, status, [])` for this node. This `unsafe_rpc/4` function is basically a wrapper over `rabbit_misc:rpc_call/4` which translates `{badrpc, nodedown}` into an exception. This exception generated by the `alarms_by_node(Name)` function call emerges at a very high level, so rabbitmqctl thinks that the entire cluster is down, while generating a very bizarre message:
|
|         Cluster status of node 'rabbit@overcloud-controller-0' ...
|         Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown
|
|         DIAGNOSTICS
|         ===========
|
|         attempted to contact: ['rabbit@overcloud-controller-0']
|
|         rabbit@overcloud-controller-0:
|           * connected to epmd (port 4369) on overcloud-controller-0
|           * node rabbit@overcloud-controller-0 up, 'rabbit' application running
|
|         current node details:
|         - node name: 'rabbitmq-cli-31@overcloud-controller-0'
|         - home dir: /var/lib/rabbitmq
|         - cookie hash: PB31uPq3vzeQeZ+MHv+wgg==
|
|     See - it reports that it failed to connect to node 'rabbit@overcloud-controller-0' (because it catches an exception from `alarms_by_node(Name)`), but the attempt to connect to this node was successful ('rabbit' application running).
|
|     In order to fix that we should not throw an exception during subsequent calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`). Even more - we don't need to change `nodes_in_cluster(Node)`, because it is called locally. The only function which must use `rabbit_misc:rpc_call/4` is `alarms_by_node(Name)` because it is executed remotely.
|
|     See this issue for further details and a real-world example:
|
|     * https://bugzilla.redhat.com/1356169
|
|     Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
|
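
The idea behind the fix, tolerating a failure on each per-node call instead of letting one unreachable node abort the whole status report, can be sketched as follows. This is an analogy in Python, not the Erlang patch: `NodeDown`, the `rpc_status` callback and the `None` marker for an unreachable node are all hypothetical names chosen for the example, and the real patch may represent a down node differently.

```python
class NodeDown(Exception):
    """Stand-in for the error raised when a remote node is unreachable."""


def alarms_by_node(name, rpc_status):
    """Return (name, alarms); an unreachable node yields (name, None)."""
    try:
        status = rpc_status(name)
    except NodeDown:
        # The node vanished between listing it and querying it (TOCTOU):
        # report it as unknown rather than failing the whole command.
        return (name, None)
    return (name, status.get("alarms", []))


def cluster_alarms(nodes, rpc_status):
    """Collect alarms from every listed node, skipping over dead ones."""
    return [alarms_by_node(name, rpc_status) for name in nodes]
```

Only the first call, the one that contacts the node the user asked about, still needs to fail loudly; the per-member calls degrade gracefully.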
* Update rabbitmq-components.mk  [rabbitmq_v3_6_4_rc1] [rabbitmq_v3_6_4_milestone2] [rabbitmq_v3_6_4_milestone1]  Michael Klishin  2016-07-14  1  -0/+1
|
* Merge pull request #873 from rabbitmq/rabbitmq-server-612  Michael Klishin  2016-07-14  4  -12/+18
|\
| |     Tune scheduling bind flags for Erlang VM