| author | Gerhard Lazu <gerhard@lazu.co.uk> | 2017-09-27 16:29:20 +0100 |
|---|---|---|
| committer | Gerhard Lazu <gerhard@lazu.co.uk> | 2017-09-28 09:54:52 +0100 |
| commit | 155eb6b0bffe3126ab18ab228296821ce0dc1f8c (patch) | |
| tree | e8ff76d720798a7c636b27a1fd6482bd8e018461 /src | |
| parent | 1c81095486f56ca9dcfa19177594d6e5be1fbe0a (diff) | |
| download | rabbitmq-server-git-155eb6b0bffe3126ab18ab228296821ce0dc1f8c.tar.gz | |
Remove consumer bias & allow queues under max load to drain quickly
Given a queue process under max load, with both publishers & consumers,
if consumers are not **always** prioritised over publishers, a queue
can take 1 day (or more) to fully drain.
Even without consumer bias, queues can drain fast (e.g. 10
minutes in our case) or slow (e.g. 1 hour or more). For
example, this is what a slow drain looks like:
```
___ <- 2,000,000 messages
/ \__
/ \___ _ _
/ \___/ \_____/ \___
/ \
|-------------- 1h --------------|
```
And this is what a fast drain looks like:
```
_ <- 1,500,000 messages
/ \_
/ \___
/ \
|- 10 min -|
```
We are still trying to understand the reason behind different drain
rates, but without removing consumer bias, this would **always** happen:
```
______________ <- 2,000,000 messages
/ \_______________
/ \______________ ________
/ \__/ \______
/ \
|----------------------------- 1 day ---------------------------------|
```
Other observations worth capturing:
```
| PUBLISHERS | CONSUMERS | READY MESSAGES | PUBLISH MSG/S | CONSUME ACK MSG/S |
| ---------- | --------- | -------------- | --------------- | ----------------- |
| 3 | 3 | 0 | 22,000 - 23,000 | 22,000 - 23,000 |
| 3 | 3 | 1 - 2,000,000 | 5,000 - 8,000 | 7,000 - 11,000 |
| 3 | 0 | 1 - 2,000,000 | 21,000 - 25,000 | 0 |
| 3 | 0 | 2,000,000 | 5,000 - 15,000 | 0 |
```
* Empty queues are the fastest since messages are delivered straight to
consuming channels
* With 3 publishing channels, a single queue process gets saturated at
  22,000 msg/s. The client that we used for this benchmark would max out
  at 10,000 msg/s, meaning that we needed 3 clients, each with 1
  connection & 1 channel, to max out the queue process. It is possible
  that a single fast client using 1 connection & 1 channel would achieve
  a slightly higher throughput, but we didn't measure that on this
  occasion. It's highly unrealistic for a production, high-throughput
  RabbitMQ deployment to use 1 publisher running 1 connection & 1
  channel. If anything, there would be many more publishers with many
  connections & channels.
* When a queue process gets saturated, publishing channels & their
connections will enter flow state, meaning that the publishing rates
will be throttled. This allows the consuming channels to keep up with
the publishing ones. This is a good thing! A message backlog slows
both publishers & consumers, as the above table captures.
* Adding more publishers or consumers slows down both publishing &
  consuming. The queue process, and ultimately the Erlang VM's
  schedulers (typically 1 per CPU core), have more work to do, so it's
  expected for message throughput to decrease.
Most relevant properties that we used for this benchmark:
```
| ERLANG | 19.3.6.2 |
| RABBITMQ | 3.6.12 |
| GCP INSTANCE TYPE | n1-standard-4 |
| -------------------- | ------------ |
| QUEUE | non-durable |
| MAX-LENGTH | 2,000,000 |
| -------------------- | ------------ |
| PUBLISHERS | 3 |
| PUBLISHER RATE MSG/S | 10,000 |
| MSG SIZE | 1KB |
| -------------------- | ------------ |
| CONSUMERS | 3 |
| PREFETCH | 100 |
| MULTI-ACK | every 10 msg |
```
Worth mentioning, `vm_memory_high_watermark_paging_ratio` was set to a
really high value so that messages would not be paged to disc. When
messages are paged out, all other queue operations are blocked,
including all publishes and consumes.
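For reference, this is where that knob lives in the classic Erlang-terms config file used by RabbitMQ 3.6 (the `0.99` value is illustrative; the commit message only says "a really high value", not the exact figure used):

```erlang
%% rabbitmq.config -- raise the paging ratio so messages stay in RAM
%% instead of being paged to disc under memory pressure.
[
  {rabbit, [
    {vm_memory_high_watermark_paging_ratio, 0.99}
  ]}
].
```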
Artefacts attached to rabbitmq/rabbitmq-server#1378 :
- [ ] RabbitMQ management screenshots
- [ ] Observer load charts
- [ ] OS metrics
- [ ] RabbitMQ definitions
- [ ] BOSH manifest with all RabbitMQ deployment properties
- [ ] benchmark app CloudFoundry manifests.yml
[#151499632]
Diffstat (limited to 'src')
| -rw-r--r-- | src/rabbit_amqqueue_process.erl | 18 |
1 file changed, 5 insertions, 13 deletions
```
diff --git a/src/rabbit_amqqueue_process.erl b/src/rabbit_amqqueue_process.erl
index 16a5a70e13..cdadb5b1d4 100644
--- a/src/rabbit_amqqueue_process.erl
+++ b/src/rabbit_amqqueue_process.erl
@@ -22,7 +22,6 @@

 -define(SYNC_INTERVAL,                 200). %% milliseconds
 -define(RAM_DURATION_UPDATE_INTERVAL, 5000).
--define(CONSUMER_BIAS_RATIO,           1.1). %% i.e. consume 10% faster

 -export([info_keys/0]).

@@ -969,18 +968,18 @@ emit_consumer_deleted(ChPid, ConsumerTag, QName) ->

 %%----------------------------------------------------------------------------

-prioritise_call(Msg, _From, _Len, State) ->
+prioritise_call(Msg, _From, _Len, _State) ->
     case Msg of
         info                                       -> 9;
         {info, _Items}                             -> 9;
         consumers                                  -> 9;
         stat                                       -> 7;
-        {basic_consume, _, _, _, _, _, _, _, _, _} -> consumer_bias(State);
-        {basic_cancel, _, _, _}                    -> consumer_bias(State);
+        {basic_consume, _, _, _, _, _, _, _, _, _} -> 1;
+        {basic_cancel, _, _, _}                    -> 1;
         _                                          -> 0
     end.

-prioritise_cast(Msg, _Len, State) ->
+prioritise_cast(Msg, _Len, _State) ->
     case Msg of
         delete_immediately                   -> 8;
         {set_ram_duration_target, _Duration} -> 8;
@@ -988,7 +987,7 @@ prioritise_cast(Msg, _Len, State) ->
         {run_backing_queue, _Mod, _Fun}      -> 6;
         {ack, _AckTags, _ChPid}              -> 3; %% [1]
         {resume, _ChPid}                     -> 2;
-        {notify_sent, _ChPid, _Credit}       -> consumer_bias(State);
+        {notify_sent, _ChPid, _Credit}       -> 1;
         _                                    -> 0
     end.

@@ -1001,13 +1000,6 @@ prioritise_cast(Msg, _Len, State) ->
 %% about. Finally, we prioritise ack over resume since it should
 %% always reduce memory use.

-consumer_bias(#q{backing_queue = BQ, backing_queue_state = BQS}) ->
-    case BQ:msg_rates(BQS) of
-        {0.0,          _} -> 0;
-        {Ingress, Egress} when Egress / Ingress < ?CONSUMER_BIAS_RATIO -> 1;
-        {_,            _} -> 0
-    end.
-
 prioritise_info(Msg, _Len, #q{q = #amqqueue{exclusive_owner = DownPid}}) ->
     case Msg of
         {'DOWN', _, process, DownPid, _}   -> 8;
```
