Commit message | Author | Age | Files | Lines
* Merge branch 'master' into stable | Jean-Sébastien Pédron | 2015-03-11 | 65 | -1081/+3284
|\
| * Deb changelogs for release 3.5.0 (rerolled) [tag: rabbitmq_v3_5_0] | Jean-Sébastien Pédron | 2015-03-11 | 1 | -1/+1
| |
| * Merge branch 'bug26638' | Simon MacMullen | 2015-03-11 | 1 | -2/+1
| |\
| | * More reliable way to find the erl executable. | Simon MacMullen | 2015-03-11 | 1 | -2/+1
| |/
| |     The old way did not work if erl was not in ${ERLANG_HOME}/bin. The new
| |     way looks like it should only work if erl is on the ${PATH} - but in
| |     fact the Erlang VM arranges things so that is always true!
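The lookup described in that commit message can be sketched in portable shell. This is an illustration only, not the actual rabbitmq-server script; the function name `find_erl` is hypothetical:

```shell
# Hypothetical sketch of resolving erl via the PATH rather than
# ${ERLANG_HOME}/bin. POSIX `command -v` prints the full path of the
# first matching executable on the PATH, or fails if there is none.
find_erl() {
    command -v erl 2>/dev/null
}

if ERL_BIN=$(find_erl); then
    echo "using $ERL_BIN"
else
    echo "erl not found on PATH" >&2
fi
```

Because the Erlang VM puts its own bin directory on the PATH of processes it starts, this lookup succeeds even when ERLANG_HOME is unusual.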
| * RPM/deb changelogs for release 3.5.0 | Jean-Sébastien Pédron | 2015-03-11 | 2 | -0/+9
| |
| * Ignore ./etc/* | Michael Klishin | 2015-03-11 | 1 | -0/+1
| |
| * Merge branch 'bug26636' | Simon MacMullen | 2015-03-10 | 2 | -6/+20
| |\
| | * Rename things to be slightly clearer | Simon MacMullen | 2015-03-10 | 1 | -3/+3
| | |
| | * rabbit_node_monitor: Exclude Mnesia-synced nodes when clearing partitions | Jean-Sébastien Pédron | 2015-03-10 | 1 | -1/+3
| | |
| | |     Otherwise, in handle_dead_rabbit(), if the specified node is already
| | |     back in the cluster, "partitions" will not be emptied.
| | * rabbit_node_monitor: In handle_dead_rabbit(), consider node down once | Jean-Sébastien Pédron | 2015-03-09 | 1 | -4/+12
| | |
| | |     ... in pause conditions, especially if the remote node is back in the
| | |     cluster already. This gives the node a chance to enter a pause mode.
| | |     Without this, partitions may never be fixed.
| | * rabbit_networking: In on_node_down(), don't remove listeners if node is back | Jean-Sébastien Pédron | 2015-03-09 | 1 | -1/+5
| |/
| |     If the node is already back in the cluster at the time
| |     rabbit_networking:on_node_down() is called, don't remove the specified
| |     node's listeners. Otherwise, we would lose the record from the entire
| |     cluster, creating an inconsistency between the running listeners and
| |     the recorded ones.
| * Merge branch 'bug26633' | Simon MacMullen | 2015-03-06 | 1 | -2/+16
| |\
| | * Partial partition: Don't act if pause_if_all_down is about to pause | Jean-Sébastien Pédron | 2015-03-05 | 1 | -2/+16
| | |
| | |     At least, this fixes the following crash, when the partial partition
| | |     detection tries to close the connection while RabbitMQ is stopping:
| | |
| | |     ** Reason for termination ==
| | |     ** {timeout,{gen_server,call,
| | |                     [application_controller,
| | |                      {set_env,kernel,dist_auto_connect,never}]}}
| * | Merge branch 'bug26632' | Jean-Sébastien Pédron | 2015-03-06 | 2 | -19/+40
| |\ \
| | |/
| |/|
| | * Autoheal: Wait for current outside app. process to finish | Jean-Sébastien Pédron | 2015-03-05 | 2 | -19/+40
| | |
| | |     rabbit_node_monitor:run_outside_applications/2 gains a new argument
| | |     to indicate whether the caller is willing to let a possible current
| | |     outside application process finish before its own function can run.
| | |     If this argument is false, the current behaviour is kept: the
| | |     function is not executed.
| | |
| | |     This fixes a bug where autoheal's outside application function was
| | |     not executed if pause_if_all_down's outside application function was
| | |     still running (i.e. rabbit:start() had not returned yet). The bug
| | |     caused the loser not to stop, and thus the autoheal process never
| | |     completed. Now, the autoheal outside application function waits for
| | |     the pause_if_all_down function to finish before stopping the loser.
| * | Merge branch 'bug26628' | Simon MacMullen | 2015-03-05 | 1 | -43/+147
| |\ \
| | * | Autoheal: Add a restart_loser/2 function | Jean-Sébastien Pédron | 2015-03-05 | 1 | -34/+38
| | | |
| | | |     ... and simplify the handle_msg({winner_is, _}, ...) clause.
| | * | Autoheal: Document the protocol change | Jean-Sébastien Pédron | 2015-03-04 | 1 | -3/+66
| | | |
| | | |     While here, document the message flow.
| | * | Autoheal: The leader waits for "done!" message from the winner | Jean-Sébastien Pédron | 2015-03-04 | 1 | -30/+67
| | | |
| | | |     Before, the leader was monitoring the losers itself (exactly like
| | | |     the winner). When they were all down, it was going back to the
| | | |     "not_healing" state. Therefore, there was a possibility that the
| | | |     leader and winner went out-of-sync regarding the autoheal state.
| | | |
| | | |     Now, the leader simply waits for a confirmation from the winner
| | | |     that the autoheal process is over. If the leader is a loser too,
| | | |     the autoheal state is saved in the application environment to
| | | |     survive the restart. When the leader is back up, it asks the winner
| | | |     to possibly notify it again.
| * | | Add more explanatory text for timeout_waiting_for_tables | Simon MacMullen | 2015-03-05 | 1 | -2/+8
| | |/
| |/|
| * | Merge branch 'bug26631' | Jean-Sébastien Pédron | 2015-03-04 | 2 | -1/+67
| |\ \
| | |/
| |/|
| | * Workaround "global" hang | Jean-Sébastien Pédron | 2015-03-04 | 2 | -1/+67
| |/
| |     In rabbit_node_monitor:disconnect/1, we change the "dist_auto_connect"
| |     kernel parameter to force a disconnection and give all components some
| |     time to handle the subsequent "nodedown" event.
| |
| |     "global" doesn't handle this situation very well. With an unfortunate
| |     sequence of messages and bad timings, this can trigger an
| |     inconsistency in its internal state. When this happens, global:sync()
| |     never returns. See bug 26556, comment #5 for a detailed description.
| |
| |     The workaround consists of a process that parses the "global" internal
| |     state if global:sync/0 didn't return within 15 seconds. If the state
| |     contains an in-progress synchronisation older than 10 seconds, the
| |     spawned process sends fake nodedown/nodeup events to "global" on both
| |     inconsistent nodes so they restart their synchronisation.
| |
| |     This workaround will be removed once the real bugs are fixed and
| |     "dist_auto_connect" is left untouched.
| * Merge branch 'stable' | Jean-Sébastien Pédron | 2015-03-03 | 1 | -21/+25
| |\
| |/
|/|
* | Merge branch 'bug26622' into stable | Jean-Sébastien Pédron | 2015-03-03 | 1 | -10/+12
|\ \
| * | rabbit_node_monitor: Cache pause_minority_guard() return value | Jean-Sébastien Pédron | 2015-03-03 | 1 | -10/+12
| | |
| | |     If the list returned by `nodes()` didn't change since the last call
| | |     to `pause_minority_guard()`, return the previous state, not `ok`.
| | |
| | |     This fixes a bug where the first call to `pause_minority_guard()`
| | |     could return `pausing` but subsequent calls would return `ok`,
| | |     leading channels to resume sending confirms even though the node is
| | |     about to enter pause mode.
| | * Merge branch 'stable' | Simon MacMullen | 2015-02-26 | 2 | -44/+29
| | |\
| |_|/
|/| |
* | | Fix tests, the exceptions are a bit cleaner now (bug26610) | Simon MacMullen | 2015-02-26 | 1 | -9/+7
| | |
* | | Further cleanup of boot exception handling (bug26610) | Simon MacMullen | 2015-02-26 | 1 | -35/+22
| | |
| | |     OK, the previous attempt was a bit misguided. Rather than catch any
| | |     exception "inside" starting an app, log it and rethrow a token
| | |     saying no further logging is needed, let's just let the exception
| | |     bubble out and catch it once at the top level.
| | * Merge branch 'stable' | Simon MacMullen | 2015-02-26 | 1 | -42/+35
| | |\
| |_|/
|/| |
* | | Merge branch 'bug26610' into stable | Simon MacMullen | 2015-02-26 | 1 | -42/+35
|\ \ \
| * | | Detangle boot error handling, step 2 | Simon MacMullen | 2015-02-26 | 1 | -16/+24
| | | |
| | | |     We used to have a weird situation where an error inside a boot
| | | |     step would cause us to go through error logging twice. The
| | | |     exception would get caught in run_step/3, then go through
| | | |     boot_error() etc., and basic_boot_error/3 would exit again, leading
| | | |     to more stuff being logged as an application start failure.
| | | |
| | | |     So to fix that, let's exit with a special atom which prevents
| | | |     further logging. Also rename functions to be a bit more meaningful.
| * | | Detangle boot error handling, step 1 | Simon MacMullen | 2015-02-26 | 1 | -20/+4
| | | |
| | | |     We want to clean up basic_boot_error/3, but these call sites make
| | | |     it harder. Really, trying to produce nicely formatted errors here
| | | |     is a waste of time; these errors can only be caused when adding
| | | |     boot steps. Normal users should never see them. So don't complicate
| | | |     things by handling them specially.
| * | | Fix clearer timeout_waiting_for_tables error message | Simon MacMullen | 2015-02-23 | 1 | -10/+11
|/ / /
| | |     ...and replace {boot_step, _, _} wrapping with an extra parameter
| | |     which might make it harder to make similar mistakes in future.
| | * Merge branch 'stable' | Simon MacMullen | 2015-02-26 | 0 | -0/+0
| | |\
| |_|/
|/| |
* | | Ignore etags | Michael Klishin | 2015-02-21 | 1 | -0/+3
|/ /
| |     Conflicts:
| |         .gitignore
| * Merge pull request #52 from jeckersb/sd_notify-for-upstream | Michael Klishin | 2015-02-26 | 1 | -0/+5
| |\
| | |     Add systemd notification support
| | * Add systemd notification support | John Eckersberg | 2015-02-25 | 1 | -0/+5
| | |
| * | Merge pull request #55 from rabbitmq/bug26614 | Michael Klishin | 2015-02-25 | 3 | -3/+3
| |\ \
| | | |     When ERLANG_HOME is not set, exit with code 1
| | * | When ERLANG_HOME is not set, exit with code 1 | Michael Klishin | 2015-02-25 | 3 | -3/+3
| |/ /
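The check described in bug26614 can be sketched as a shell fragment. This is an illustration only, not the actual rabbitmq-server wrapper script; it is written as a function returning 1 so it can be reused, whereas the real scripts would exit the shell directly with code 1:

```shell
# Hypothetical sketch: fail fast when ERLANG_HOME is unset or empty,
# rather than continuing and producing a confusing error later.
check_erlang_home() {
    if [ -z "${ERLANG_HOME:-}" ]; then
        echo "ERLANG_HOME is not set; cannot locate Erlang" >&2
        return 1
    fi
}
```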
| * | Merge branch 'bug26465' | Simon MacMullen | 2015-02-23 | 3 | -29/+77
| |\ \
| | * | pause_if_all_down: Remove configuration check | Jean-Sébastien Pédron | 2015-02-20 | 1 | -32/+5
| | | |
| | | |     No part of RabbitMQ does this type of checking, as pointed out by
| | | |     Simon.
| | * | pause_if_all_down: Lower priority of "listed nodes not in the cluster" message | Jean-Sebastien Pedron | 2015-02-03 | 1 | -2/+2
| | | |
| | | |     An admin could add nodes to the pause_if_all_down list before
| | | |     creating the nodes.
| | * | How to recover from partitioning after 'pause_if_all_down' is configurable | Jean-Sebastien Pedron | 2014-12-24 | 2 | -6/+15
| | | |
| | | |     Now that 'pause_if_all_down' accepts a list of preferred nodes, it
| | | |     is possible that these nodes are spread across multiple partitions.
| | | |     For example, suppose we have nodes A and B in datacenter #1 and
| | | |     nodes C and D in datacenter #2, and we set
| | | |     {pause_if_all_down, [A, C]}. If the link between both datacenters
| | | |     is lost, A/B and C/D form two partitions. RabbitMQ continues to
| | | |     run at both sites because all nodes see at least one node from the
| | | |     preferred nodes list. When the link comes back, we need to handle
| | | |     the recovery. Therefore, a user can specify the strategy:
| | | |
| | | |     o {pause_if_all_down, [...], ignore} (default)
| | | |     o {pause_if_all_down, [...], autoheal}
| | | |
| | | |     This third parameter is mandatory.
| | | |
| | | |     If the strategy is 'ignore', RabbitMQ is started again on paused
| | | |     nodes as soon as they see another node from the preferred nodes
| | | |     list. This is the default behaviour.
| | | |
| | | |     If the strategy is 'autoheal', RabbitMQ is started again, as in
| | | |     'ignore' mode, but when all nodes are up, autohealing kicks in as
| | | |     well. Compared to plain 'autoheal' mode, the chance of losing data
| | | |     is low because paused nodes never drifted away from the cluster.
| | | |     When they start again, they join the cluster and resume operations
| | | |     as any starting node.
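For reference, the two-datacenter scenario from that commit message might be configured like this. This is a sketch, not text from the commit: the node names are illustrative, and it assumes the strategy tuple lives under the same cluster_partition_handling key used by RabbitMQ's other partition-handling modes:

```erlang
%% Hypothetical rabbitmq.config fragment: rabbit@a (dc #1) and
%% rabbit@c (dc #2) are the preferred nodes; once the partition is
%% resolved and all nodes are up, autoheal kicks in as well.
[
  {rabbit, [
    {cluster_partition_handling,
      {pause_if_all_down, ['rabbit@a', 'rabbit@c'], autoheal}}
  ]}
].
```

With `ignore` as the third element instead, paused nodes would simply restart as soon as they see a preferred node again, with no autohealing afterwards.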
| | * | Rename 'keep_preferred' to 'pause_if_all_down' and accept a list of nodes | Jean-Sebastien Pedron | 2014-12-19 | 1 | -30/+49
| | | |
| | | |     Now, a partition is paused if all nodes from the pause_if_all_down
| | | |     list are seen as down. If a fraction of the list is alive, the
| | | |     nodes in the partition remain up. Compared to the previous
| | | |     version, some of the listed nodes can be taken down for
| | | |     maintenance without risking service interruption.
| | | |
| | | |     However, this raises the problem of listed nodes distributed in
| | | |     multiple partitions: we need to handle recovery. This will be
| | | |     addressed in a followup commit.
| | * | Add a new "keep_preferred" cluster partition handling method | Jean-Sebastien Pedron | 2014-12-02 | 2 | -27/+74
| | | |
| | | |     The syntax is:
| | | |         {cluster_partition_management, {keep_preferred, node@domain}}
| | | |
| | | |     The specified node name is used to determine which partition
| | | |     should run or be suspended. Nodes which can still reach the
| | | |     specified node continue to run. Nodes which can't are suspended.
| | | |
| | | |     Compared to pause_minority, this allows the admin to determine
| | | |     which nodes to prioritize in case of partitions with an equal
| | | |     number of nodes.
| * | | Ignore etags | Michael Klishin | 2015-02-21 | 1 | -0/+1
| | | |
| * | | Sync CONTRIBUTING.md with the template one | Michael Klishin | 2015-02-20 | 1 | -1/+1
| | | |
| * | | Merge branch 'stable' | Jean-Sébastien Pédron | 2015-02-19 | 2 | -37/+35
| |\ \ \
| |/ / /
|/| | |
* | | | Convert .hgignore to .gitignore | Jean-Sébastien Pédron | 2015-02-19 | 2 | -35/+33
| | | |
| * | | Merge remote-tracking branch 'origin/master' | Simon MacMullen | 2015-02-19 | 1 | -1/+1
| |\ \ \