store more info about the cluster on disc, check other nodes before clustering

Now the `cluster_nodes.config' file doesn't store a "config" anymore, but all the information we need about the cluster nodes. This file is updated whenever a new node comes up. Moreover, this file should *always* be present, and it's set up in `rabbit:prepare/0'.

Now that we have this file, the various functions regarding the status of the cluster (`all_clustered_nodes/0', `running_clustered_nodes/0', etc.) can "work" even when mnesia is down (and when the node is a RAM node).

Given the assumption that the status file is up to date with the status of the cluster (as of just before the node went down, if the node is down), a lot of operations become easier.

I'd like a review at this point. The best thing is probably to diff directly against default, since a lot of previous commits are not relevant anymore.

The most noticeable changes compared to default:

* `join_cluster' works differently. The nodes provided will be used to "discover" the cluster, and are not used to determine whether the node should be RAM or not. The node will be a disc node by default, and can be initialized as RAM with the `--ram' flag. The old `cluster' command is preserved, adapted to work with the new functions while preserving similar semantics (we decide whether the node is disc or not by looking at the list of nodes provided).
* `force_cluster' has been removed.
* The `join_cluster' operation will fail if:
  - The node is currently the only disc node of its cluster
  - We can't connect to any of the nodes provided
  - The node is already clustered with the cluster of the nodes provided
* On default, RAM nodes that restart try to connect to the nodes that were first given by the user, and are not aware of the changes that might have occurred in the cluster while they were online. Since the cluster status is now kept updated on disc, the RAM node will be aware of changes in the cluster when restarted.
* Before starting up mnesia, the node contacts the nodes it thinks it's clustered with, and if those nodes are not clustered with the node anymore the startup procedure fails. We fail only when we know for sure that something is wrong, e.g. it won't fail if it doesn't find any online node.

Things to do:

* Implement `uncluster'/`leave_cluster' to kick out a node from a cluster from another node. This is easy.
* Implement something like `change_node_type', since given how `join_cluster' works it is not possible right now.
* Rewrite the tests regarding clustering.
* Think hard about what can go wrong regarding the cluster status file and the relevant functions. The stuff in `rabbit_upgrade' is particularly worrying, and I need to make sure that things will work when upgrading rabbitmq, by reading old files or upgrading them.
* Split `init_db/4' into various functions. We have much stronger assumptions now, for example we should never need to reset or wipe the schema in there. In general it's an ugly function, especially the optional upgrade part.
* Probably something else...
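
To make the status file and the pre-mnesia consistency check more concrete, here is a minimal sketch of the idea. It is not the code in `rabbit_mnesia'. The module name, the function names, the `AskFun' callback (a stand-in for a remote call to a peer) and the `inconsistent_cluster' error term are all hypothetical. The diff only suggests that the status is a 3-tuple whose first element is the list of all nodes; treating the other two elements as disc nodes and running nodes is an assumption.

```erlang
-module(cluster_status_sketch).
-export([write_status/2, read_status/1, check_consistency/3]).

%% Assumed on-disc format: one Erlang term {AllNodes, DiscNodes, RunningNodes}.
%% Only the 3-tuple shape comes from the diff; the field meanings are assumed.
write_status(File, {_All, _Disc, _Running} = Status) ->
    file:write_file(File, io_lib:format("~p.~n", [Status])).

%% Reads the status back; crashes if the file is missing or malformed,
%% in line with the "file should always be present" assumption.
read_status(File) ->
    {ok, [{_All, _Disc, _Running} = Status]} = file:consult(File),
    Status.

%% Self is this node's name. AskFun(Peer) is a hypothetical stand-in for a
%% remote call; it returns {ok, NodesKnownToPeer} or the atom unreachable.
%% We fail only when a reachable peer no longer lists Self as a member.
check_consistency(Self, {AllNodes, _Disc, _Running}, AskFun) ->
    check_peers(Self, AllNodes -- [Self], AskFun).

check_peers(_Self, [], _AskFun) ->
    %% No reachable peer disagreed (possibly none was reachable at all).
    ok;
check_peers(Self, [Peer | Rest], AskFun) ->
    case AskFun(Peer) of
        unreachable ->
            check_peers(Self, Rest, AskFun);
        {ok, PeerNodes} ->
            case lists:member(Self, PeerNodes) of
                true  -> check_peers(Self, Rest, AskFun);
                false -> {error, {inconsistent_cluster, Peer}}
            end
    end.
```

Passing the peer lookup in as a fun keeps the sketch testable without a live cluster. Returning `ok' when every peer is unreachable mirrors the "fail only when we know for sure that something is wrong" rule described above.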
author: Francesco Mazzoli <francesco@rabbitmq.com> 2012-05-15 18:04:32 +0100
committer: Francesco Mazzoli <francesco@rabbitmq.com> 2012-05-15 18:04:32 +0100
commit: bdd62ebd05e37de6bf38eef43d03c4bbca63602f (patch)
tree: d4ab45498714379e82bce2aafdc05b7e91d1a54f /src/rabbit.erl
parent: 44db2ff0480de4ba4708f24fcdf2cc0c88003f26 (diff)
download: rabbitmq-server-git-bdd62ebd05e37de6bf38eef43d03c4bbca63602f.tar.gz
1 files changed, 7 insertions, 3 deletions
diff --git a/src/rabbit.erl b/src/rabbit.erl
index bff7af97d2..eff2fac282 100644
--- a/src/rabbit.erl
+++ b/src/rabbit.erl
@@ -285,6 +285,9 @@ split0([I | Is], [L | Ls]) -> split0(Is, Ls ++ [[I | L]]).
 
 prepare() ->
     ok = ensure_working_log_handlers(),
+    ok = rabbit_mnesia:ensure_mnesia_dir(),
+    ok = rabbit_mnesia:initialize_cluster_nodes_status(),
+    ok = rabbit_mnesia:check_cluster_consistency(),
     ok = rabbit_upgrade:maybe_upgrade_mnesia().
 
 start() ->
@@ -380,7 +383,7 @@ start(normal, []) ->
     end.
 
 stop(_State) ->
-    ok = rabbit_mnesia:record_running_nodes(),
+    ok = rabbit_mnesia:update_cluster_nodes_status(),
     terminated_ok = error_logger:delete_report_handler(rabbit_error_logger),
     ok = rabbit_alarm:stop(),
     ok = case rabbit_mnesia:is_clustered() of
@@ -509,11 +512,12 @@ sort_boot_steps(UnsortedSteps) ->
     end.
 
 boot_step_error({error, {timeout_waiting_for_tables, _}}, _Stacktrace) ->
+    {AllNodes, _, _} = rabbit_mnesia:read_cluster_nodes_status(),
     {Err, Nodes} =
-        case rabbit_mnesia:read_previously_running_nodes() of
+        case AllNodes -- [node()] of
             [] -> {"Timeout contacting cluster nodes. Since RabbitMQ was"
                    " shut down forcefully~nit cannot determine which nodes"
-                   " are timing out.~n"};
+                   " are timing out.~n", []};
             Ns -> {rabbit_misc:format(
                      "Timeout contacting cluster nodes: ~p.~n", [Ns]),
                    Ns}