Commit message  Author  Age  Files  Lines
* Merge branch 'bug26632'  Jean-Sébastien Pédron  2015-03-06  2  -19/+40
|\
| * Autoheal: Wait for current outside app. process to finish  Jean-Sébastien Pédron  2015-03-05  2  -19/+40
| |
| |     rabbit_node_monitor:run_outside_applications/2 gains a new argument to
| |     indicate whether the caller is willing to let a currently running
| |     outside application process finish before its own function runs. If
| |     this argument is false, the current behaviour is kept: the function is
| |     not executed.
| |
| |     This fixes a bug where autoheal's outside application function was not
| |     executed if pause_if_all_down's outside application function was still
| |     running (i.e. rabbit:start() hadn't returned yet). The bug caused the
| |     loser to not stop and thus the autoheal process to never complete.
| |
| |     Now, the autoheal outside application function waits for the
| |     pause_if_all_down function to finish before stopping the loser.
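The "wait or skip" behaviour described in this commit message can be modelled with a small Python sketch. This is a toy model and not the Erlang code in rabbit_node_monitor: the class name OutsideRunner, the method run and the flag name wait_for_existing are all hypothetical, and threads stand in for Erlang processes.

```python
import threading


class OutsideRunner:
    """Toy model: at most one "outside application" job is current at a time.

    All names here are hypothetical; this is not RabbitMQ's implementation.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None  # thread running the current outside job, if any

    def run(self, fun, wait_for_existing):
        """Start fun in a helper thread; return the thread, or None if skipped."""
        with self._lock:
            previous = self._current
            if previous is not None and previous.is_alive():
                if not wait_for_existing:
                    return None  # previous behaviour: the function is not executed
            else:
                previous = None

            def job():
                if previous is not None:
                    previous.join()  # let the current job finish first
                fun()

            worker = threading.Thread(target=job)
            self._current = worker
            worker.start()
            return worker
```

In this model, a caller that passes wait_for_existing=False while another job is running gets None back and its function never runs, which mirrors the bug scenario. A caller that passes True is queued behind the running job instead.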
* | Merge branch 'bug26628'  Simon MacMullen  2015-03-05  1  -43/+147
|\ \
| * | Autoheal: Add a restart_loser/2 function  Jean-Sébastien Pédron  2015-03-05  1  -34/+38
| | |
| | |     ... and simplify the handle_msg({winner_is, _}, ...) clause.
| | |
| * | Autoheal: Document the protocol change  Jean-Sébastien Pédron  2015-03-04  1  -3/+66
| | |
| | |     While here, document the message flow.
| | |
| * | Autoheal: The leader waits for "done!" message from the winner  Jean-Sébastien Pédron  2015-03-04  1  -30/+67
| | |
| | |     Before, the leader was monitoring the losers itself (exactly like the
| | |     winner). When they were all down, it was going back to the
| | |     "not_healing" state. Therefore, there was a possibility that the
| | |     leader and winner went out-of-sync regarding the autoheal state.
| | |
| | |     Now, the leader simply waits for a confirmation from the winner that
| | |     the autoheal process is over.
| | |
| | |     If the leader is a loser too, the autoheal state is saved in the
| | |     application environment to survive the restart. When the leader is
| | |     back up, it asks the winner to possibly notify it again.
* | | Add more explanatory text for timeout_waiting_for_tables  Simon MacMullen  2015-03-05  1  -2/+8
| |/
|/|
* | Merge branch 'bug26631'  Jean-Sébastien Pédron  2015-03-04  2  -1/+67
|\ \
| |/
|/|
| * Workaround "global" hang  Jean-Sébastien Pédron  2015-03-04  2  -1/+67
|/
|     In rabbit_node_monitor:disconnect/1, we change the "dist_auto_connect"
|     kernel parameter to force a disconnection and give some time to all
|     components to handle the subsequent "nodedown" event.
|
|     "global" doesn't handle this situation very well. With an unfortunate
|     sequence of messages and bad timings, this can trigger an
|     inconsistency in its internal state. When this happens, global:sync()
|     never returns. See bug 26556, comment #5 for a detailed description.
|
|     The workaround consists of a process that parses the "global" internal
|     state if global:sync/0 has not returned within 15 seconds. If the
|     state contains an in-progress synchronisation older than 10 seconds,
|     the spawned process sends fake nodedown/nodeup events to "global" on
|     both inconsistent nodes so they restart their synchronisation.
|
|     This workaround will be removed once the real bugs are fixed and
|     "dist_auto_connect" is left untouched.
* Merge branch 'stable'  Jean-Sébastien Pédron  2015-03-03  1  -21/+25
|\
| * Merge branch 'bug26622' into stable  Jean-Sébastien Pédron  2015-03-03  1  -10/+12
| |\
| | * rabbit_node_monitor: Cache pause_minority_guard() return value  Jean-Sébastien Pédron  2015-03-03  1  -10/+12
| | |
| | |     If the list returned by `nodes()` didn't change since the last call
| | |     to `pause_minority_guard()`, return the previous state, not `ok`.
| | |
| | |     This fixes a bug where the first call to `pause_minority_guard()`
| | |     could return `pausing` but the subsequent calls would return `ok`,
| | |     leading to channels resuming the sending of confirms even though the
| | |     node is about to enter pause mode.
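The caching fix described above can be illustrated with a minimal Python sketch. It is not the Erlang code in rabbit_node_monitor: the class MinorityGuard and the callables list_nodes and compute_state are hypothetical stand-ins for `nodes()` and for whatever logic decides between `ok` and `pausing`.

```python
class MinorityGuard:
    """Toy model of a guard that caches its state per node-list snapshot.

    Names are hypothetical; this only illustrates the caching rule.
    """

    def __init__(self, list_nodes, compute_state):
        self._list_nodes = list_nodes        # stand-in for nodes()
        self._compute_state = compute_state  # returns "ok" or "pausing"
        self._last_nodes = None
        self._last_state = "ok"

    def check(self):
        nodes = sorted(self._list_nodes())
        if nodes != self._last_nodes:
            # The node list changed: recompute and remember the new state.
            self._last_nodes = nodes
            self._last_state = self._compute_state(nodes)
        # Unchanged node list: return the remembered state, not a blanket "ok".
        return self._last_state
```

The point of the sketch is the last line of check: when the node list is unchanged, the previously computed state is returned, so a `pausing` answer is not silently replaced by `ok` on the next call.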
* | | Merge branch 'stable'  Simon MacMullen  2015-02-26  2  -44/+29
|\ \ \
| |/ /
| * | Fix tests, the exceptions are a bit cleaner now (bug26610)  Simon MacMullen  2015-02-26  1  -9/+7
| | |
| * | Further cleanup of boot exception handling (bug26610)  Simon MacMullen  2015-02-26  1  -35/+22
| | |
| | |     OK, the previous attempt was a bit misguided. Rather than catch any
| | |     exception "inside" starting an app, log it and rethrow a token saying
| | |     no further logging is needed, let's just let the exception bubble out
| | |     and catch it once at the top level.
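The "catch it once at the top level" idea from this commit message can be sketched in Python. This is an illustration of the pattern only, not RabbitMQ's boot code: start_app, boot and the log list are hypothetical names.

```python
def start_app(name, steps):
    """Run the boot steps of one app; deliberately no try/except here."""
    for step in steps:
        step()


def boot(apps, log):
    """Start all apps; any failure is caught and logged exactly once, here."""
    try:
        for name, steps in apps:
            start_app(name, steps)
    except Exception as exc:
        log.append(f"boot failed: {exc!r}")  # the single logging site
        return False
    return True
```

Because the inner function does not catch anything, there is no need for a special "already logged" token: the exception reaches the single handler in boot untouched.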
* | | Merge branch 'stable'  Simon MacMullen  2015-02-26  1  -42/+35
|\ \ \
| |/ /
| * | Merge branch 'bug26610' into stable  Simon MacMullen  2015-02-26  1  -42/+35
| |\ \
| | * | Detangle boot error handling, step 2  Simon MacMullen  2015-02-26  1  -16/+24
| | | |
| | | |     We used to have a weird situation where an error inside a boot step
| | | |     would cause us to go through error logging twice. The exception
| | | |     would get caught in run_step/3, then go through boot_error() etc,
| | | |     and basic_boot_error/3 would exit again, leading to more stuff being
| | | |     logged as an application start failure.
| | | |
| | | |     So to fix that, let's exit with a special atom which prevents
| | | |     further logging.
| | | |
| | | |     Also rename functions to be a bit more meaningful.
| | | |
| | * | Detangle boot error handling, step 1  Simon MacMullen  2015-02-26  1  -20/+4
| | | |
| | | |     We want to clean up basic_boot_error/3, but these call sites make it
| | | |     harder. Really, trying to produce nicely formatted errors here is a
| | | |     waste of time, these errors can only be caused when adding boot
| | | |     steps. Normal users should never see them. So don't complicate
| | | |     things by handling them specially.
| | | |
| | * | Fix clearer timeout_waiting_for_tables error message  Simon MacMullen  2015-02-23  1  -10/+11
| |/ /
| | |     ...and replace {boot_step, _, _} wrapping with an extra parameter
| | |     which might make it harder to make similar mistakes in future.
* | | Merge branch 'stable'  Simon MacMullen  2015-02-26  0  -0/+0
|\ \ \
| |/ /
| * | Ignore etags  Michael Klishin  2015-02-21  1  -0/+3
| |/
| |     Conflicts: .gitignore
* | Merge pull request #52 from jeckersb/sd_notify-for-upstream  Michael Klishin  2015-02-26  1  -0/+5
|\ \
| | |     Add systemd notification support
| * | Add systemd notification support  John Eckersberg  2015-02-25  1  -0/+5
| | |
* | | Merge pull request #55 from rabbitmq/bug26614  Michael Klishin  2015-02-25  3  -3/+3
|\ \ \
| | | |     When ERLANG_HOME is not set, exit with code 1
| * | | When ERLANG_HOME is not set, exit with code 1  Michael Klishin  2015-02-25  3  -3/+3
|/ / /
* | | Merge branch 'bug26465'  Simon MacMullen  2015-02-23  3  -29/+77
|\ \ \
| * | | pause_if_all_down: Remove configuration check  Jean-Sébastien Pédron  2015-02-20  1  -32/+5
| | | |
| | | |     No part of RabbitMQ does this type of checking, as pointed out by
| | | |     Simon.
| | | |
| * | | pause_if_all_down: Lower priority of "listed nodes not in the cluster" message  Jean-Sebastien Pedron  2015-02-03  1  -2/+2
| | | |
| | | |     An admin could add nodes to the pause_if_all_down list before
| | | |     creating the nodes.
| * | | How to recover from partitioning after 'pause_if_all_down' is configurable  Jean-Sebastien Pedron  2014-12-24  2  -6/+15
| | | |
| | | |     Now that 'pause_if_all_down' accepts a list of preferred nodes, it
| | | |     is possible that these nodes are spread across multiple partitions.
| | | |     For example, suppose we have nodes A and B in datacenter #1 and
| | | |     nodes C and D in datacenter #2, and we set
| | | |     {pause_if_all_down, [A, C]}. If the link between both datacenters
| | | |     is lost, A/B and C/D form two partitions. RabbitMQ continues to run
| | | |     at both sites because all nodes see at least one node from the
| | | |     preferred nodes list. When the link comes back, we need to handle
| | | |     the recovery.
| | | |
| | | |     Therefore, a user can specify the strategy:
| | | |       o  {pause_if_all_down, [...], ignore} (default)
| | | |       o  {pause_if_all_down, [...], autoheal}
| | | |
| | | |     This third parameter is mandatory.
| | | |
| | | |     If the strategy is 'ignore', RabbitMQ is started again on paused
| | | |     nodes, as soon as they see another node from the preferred nodes
| | | |     list. This is the default behaviour.
| | | |
| | | |     If the strategy is 'autoheal', RabbitMQ is started again, like in
| | | |     'ignore' mode, but when all nodes are up, autohealing kicks in as
| | | |     well. Compared to plain 'autoheal' mode, the chance of losing data
| | | |     is low because paused nodes never drifted away from the cluster.
| | | |     When they start again, they join the cluster and resume operations
| | | |     as any starting node.
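The two-datacenter example in this commit message can be checked with a few lines of Python. This is only a sketch of the stated rule (a node pauses when every node in the preferred list is seen as down); the function should_pause and the lowercase node names are hypothetical and not part of RabbitMQ.

```python
def should_pause(preferred, reachable):
    """Return True when none of the preferred nodes is reachable.

    `reachable` is assumed to include the node doing the check itself.
    """
    return not (set(preferred) & set(reachable))


# Hypothetical node names mirroring the example: A and B in datacenter #1,
# C and D in datacenter #2, with [A, C] as the preferred list.
preferred = ["a", "c"]
partitions = {
    "datacenter_1": ["a", "b"],
    "datacenter_2": ["c", "d"],
}
paused = {site: should_pause(preferred, nodes) for site, nodes in partitions.items()}
```

With the link between the datacenters lost, both entries in paused are False, matching the observation that RabbitMQ keeps running at both sites. A partition containing only "b" would be paused, since it sees neither "a" nor "c".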
| * | | Rename 'keep_preferred' to 'pause_if_all_down' and accept a list of nodes  Jean-Sebastien Pedron  2014-12-19  1  -30/+49
| | | |
| | | |     Now, a partition is paused if all nodes from the pause_if_all_down
| | | |     list are seen as down. If a fraction of the list is alive, the
| | | |     nodes in the partition remain up.
| | | |
| | | |     Compared to the previous version, some of the listed nodes can be
| | | |     taken down for maintenance without risking service interruption.
| | | |
| | | |     However, this raises the problem of listed nodes distributed in
| | | |     multiple partitions: we need to handle recovery. This will be
| | | |     addressed in a followup commit.
| | | |
| * | | Add a new "keep_preferred" cluster partition handling method  Jean-Sebastien Pedron  2014-12-02  2  -27/+74
| | | |
| | | |     The syntax is:
| | | |       {cluster_partition_management, {keep_preferred, node@domain}}
| | | |
| | | |     The specified node name is used to determine which partition should
| | | |     run or be suspended. Nodes which can still reach the specified node
| | | |     continue to run. Nodes which can't are suspended.
| | | |
| | | |     Compared to pause_minority, this allows the admin to determine
| | | |     which nodes to prioritize in case of partitions with an equal
| | | |     number of nodes.
* | | | Ignore etags  Michael Klishin  2015-02-21  1  -0/+1
| | | |
* | | | Sync CONTRIBUTING.md with the template one  Michael Klishin  2015-02-20  1  -1/+1
| | | |
* | | | Merge branch 'stable'  Jean-Sébastien Pédron  2015-02-19  2  -37/+35
|\ \ \ \
| | |_|/
| |/| |
| * | | Convert .hgignore to .gitignore  Jean-Sébastien Pédron  2015-02-19  2  -35/+33
| | | |
* | | | Merge remote-tracking branch 'origin/master'  Simon MacMullen  2015-02-19  1  -1/+1
|\ \ \ \
| * \ \ \ Merge pull request #53 from nilcons-contrib/fix-typo  Michael Klishin  2015-02-19  1  -1/+1
| |\ \ \ \
| | |_|_|/
| |/| | |     Fix a hilarious typo in the example config
| | * | | Fix a hilarious typo in the example config  Mihaly Barasz  2015-02-19  1  -1/+1
| |/ / /
* | | | Merge branch 'bug26603'  Simon MacMullen  2015-02-19  1  -2/+4
|\ \ \ \
| |/ / /
|/| | |
| * | | Fix O(n^2) time to ack / requeue multiple messages.  Simon MacMullen  2015-02-19  1  -2/+4
|/ / /
| | |     orddict:append/2 uses ++ internally to build the list in the correct
| | |     order, which is O(length). So it's O(n^2) to call it n times. Instead
| | |     let's build the list backwards then reverse it - O(n).
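The complexity argument in this commit message can be reproduced in Python. This is a sketch, not the Erlang change itself: both function names are hypothetical, `acc + [item]` stands in for Erlang's `++` (it copies the accumulated list on every step), and a chain of (head, tail) tuples stands in for Erlang's cons cells.

```python
def append_each_time(items):
    """Quadratic: every step copies the whole accumulated list."""
    acc = []
    for item in items:
        acc = acc + [item]  # O(len(acc)) copy, so O(n^2) over n items
    return acc


def cons_then_reverse(items):
    """Linear: prepend in O(1) per item, then reverse once at the end."""
    acc = None  # empty linked list; each cell is a (head, tail) tuple
    for item in items:
        acc = (item, acc)  # O(1) prepend, like [Item | Acc] in Erlang
    out = []
    while acc is not None:
        head, acc = acc
        out.append(head)  # collected newest-first
    out.reverse()  # one O(n) pass restores insertion order
    return out
```

Both functions return the items in their original order. Only the second one keeps the per-item cost constant.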
* | | Merge branch 'bug26602'  Simon MacMullen  2015-02-18  2  -11/+13
|\ \ \
| * | | Record the routing decision on published messages.  Simon MacMullen  2015-02-18  2  -11/+13
|/ / /
* | | Sync CONTRIBUTING.md with the template one  Michael Klishin  2015-02-18  1  -5/+6
| | |
* | | Sync CONTRIBUTING.md with the template one  Michael Klishin  2015-02-18  1  -0/+50
| | |
* | | Merge branch 'stable'  Jean-Sébastien Pédron  2015-02-17  1  -3/+0
|\ \ \
| |/ /
| * | Remove the "moved to GitHub" warning.  Jean-Sébastien Pédron  2015-02-17  1  -3/+0
| | |
* | | Merge branch 'stable'  Jean-Sébastien Pédron  2015-02-17  1  -2/+3
|\ \ \
| |/ /
| * | Merge branch 'bug25547' into stable  Jean-Sébastien Pédron  2015-02-17  1  -2/+3
| |\ \
| | * | rabbitmq.config.example: Send people to GitHub, not hg.rabbitmq.com  Jean-Sebastien Pedron  2015-02-16  1  -2/+3
| | | |