summaryrefslogtreecommitdiff
path: root/include/haproxy/thread.h
Commit message (Collapse)AuthorAgeFilesLines
* MEDIUM: quic: use a global CID trees listAmaury Denoyelle2023-04-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | Previously, quic_connection_id were stored in a per-thread tree list. Datagram were first dispatched to the correct thread using the encoded TID before a tree lookup was done. Remove these trees and replace it with a global trees list of 256 entries. A CID is using the list index corresponding to its first byte. On datagram dispatch, CID is lookup on its tree and TID is retrieved using new member quic_connection_id.tid. As such, a read-write lock protects each list instances. With 256 entries, it is expected that contention should be reduced. A new structure quic_cid_tree served as a tree container associated with its read-write lock. An API is implemented to ensure lock safety for insert/lookup/delete operation. This patch is a step forward to be able to break the affinity between a CID and a TID encoded thread. This is required to be able to migrate a quic_conn after accept to select thread based on their load. This should be backported up to 2.7 after a period of observation.
* CLEANUP: quic: remove unused QUIC_LOCK labelAmaury Denoyelle2023-04-181-1/+0
| | | | | | | QUIC_LOCK label is never used. Indeed, lock usage is minimal on QUIC as every connection is pinned to its owned thread. This should be backported up to 2.7.
* BUG/MEDIUM: event_hdl: clean soft-stop handlingAurelien DARRAGON2023-04-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | soft-stop was not explicitly handled in event_hdl API. Because of this, event_hdl was causing some leaks on deinit paths. Moreover, a task responsible for handling events could require some additional cleanups (ie: advanced async task), and as the task was not protected against abort when soft-stopping, such cleanup could not be performed unless the task itself implements the required protections, which is not optimal. Consider this new approach: 'jobs' global variable is incremented whenever an async subscription is created to prevent the related task from being aborted before the task acknowledges the final END event. Once the END event is acknowledged and freed by the task, the 'jobs' variable is decremented, and the deinit process may continue (including the abortion of remaining tasks not guarded by the 'jobs' variable). To do this, a new global mt_list is required: known_event_hdl_sub_list This list tracks the known (initialized) subscription lists within the process. sub_lists are automatically added to the "known" list when calling event_hdl_sub_list_init(), and are removed from the list with event_hdl_sub_list_destroy(). This allows us to implement a global thread-safe event_hdl deinit() function that is automatically called on soft-stop thanks to signal(0). When event_hdl deinit() is initiated, we simply iterate against the known subscription lists to destroy them. event_hdl_subscribe_ptr() was slightly modified to make sure that a sub_list may not accept new subscriptions once it is destroyed (removed from the known list) This can occur between the time the soft-stop is initiated (signal(0)) and haproxy actually enters in the deinit() function (once tasks are either finished or aborted and other threads already joined). It is safe to destroy() the subscription list multiple times as long as the pointer is still valid (ie: first on soft-stop when handling the '0' signal, then from regular deinit() path): the function does nothing if the subscription list is already removed. We partially reverted "BUG/MINOR: event_hdl: make event_hdl_subscribe thread-safe" since we can use parent mt_list locking instead of a dedicated lock to make the check gainst duplicate subscription ID. (insert_lock is not useful anymore) The check in itself is not changed, only the locking method. sizeof(event_hdl_sub_list) slightly increases: from 24 bits to 32bits due to the additional mt_list struct within it. With that said, having thread-safe list to store known subscription lists is a good thing: it could help to implement additional management logic for subcription lists and could be useful to add some stats or debugging tools in the future. If 68e692da0 ("MINOR: event_hdl: add event handler base api") is being backported, then this commit should be backported with it.
* BUG/MINOR: event_hdl: make event_hdl_subscribe thread-safeAurelien DARRAGON2023-04-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | List insertion in event_hdl_subscribe() was not thread-safe when dealing with unique identifiers. Indeed, in this case the list insertion is conditional (we check for a duplicate, then we insert). And while we're using mt lists for this, the whole operation is not atomic: there is a race between the check and the insertion. This could lead to the same ID being registered multiple times with concurrent calls to event_hdl_subscribe() on the same ID. To fix this, we add 'insert_lock' dedicated lock in the subscription list struct. The lock's cost is nearly 0 since it is only used when registering identified subscriptions and the lock window is very short: we only guard the duplicate check and the list insertion to make the conditional insertion "atomic" within a given subscription list. This is the only place where we need the lock: as soon as the item is properly inserted we're out of trouble because all other operations on the list are already thread-safe thanks to mt lists. A new lock hint is introduced: LOCK_EHDL which is dedicated to event_hdl The patch may seem quite large since we had to rework the logic around the subscribe function and switch from simple mt_list to a dedicated struct wrapping both the mt_list and the insert_lock for the event_hdl_sub_list type. (sizeof(event_hdl_sub_list) is now 24 instead of 16) However, all the changes are internal: we don't break the API. If 68e692da0 ("MINOR: event_hdl: add event handler base api") is being backported, then this commit should be backported with it.
* BUILD: thread: implement thread_harmless_end_sig() for threadless buildsWilly Tarreau2023-03-221-0/+4
| | | | | | | Building without thread support was broken in 2.8-dev2 with commit 7e70bfc8c ("MINOR: threads: add a thread_harmless_end() version that doesn't wait") that forgot to define the function for the threadless cases. No backport is needed.
* CLEANUP: listener/thread: remove now unused bind_conf's bind_tgroup/bind_threadWilly Tarreau2023-02-031-1/+1
| | | | Not needed anymore since last commit, let's get rid of it.
* MEDIUM: listener/config: make the "thread" parser rely on thread_setsWilly Tarreau2023-02-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of reading and storing a single group and a single mask for a "thread" directive on a bind line, we now store the complete range in a thread set that's stored in the bind_conf. The bind_parse_thread() function now just calls parse_thread_set() to complete the current set, which starts empty, and thread_resolve_group_mask() was updated to support retrieving thread group numbers or absolute thread numbers directly from the pre-filled thread_set, and continue to feed bind_tgroup and bind_thread. The CLI parsers which were pre-initialized to set the bind_tgroup to 1 cannot do it anymore as it would prevent one from restricting the thread set. Instead check_config_validity() now detects the CLI frontend and passes the info down to thread_resolve_group_mask() that will automatically use only the group 1's threads for these listeners. The same is done for the peers listeners for now. At this step it's already possible to start with all previous valid configs as well as extended ones supporting comma-delimited thread sets. In addition the parser already accepts large ranges spanning multiple groups, but since the underlying listeners infrastructure is not read, for now we're maintaining a specific check against this at the higher level of the config validity check. The patch is a bit large because thread resolution is performed in multiple steps, so we need to adjust all of them at once to preserve functional and technical consistency.
* MINOR: thread: add a simple thread_set APIWilly Tarreau2023-02-031-1/+2
| | | | | | | The purpose is to be able to store large thread sets, defined by ranges that may cross group boundaries, as well as define lists of groups and masks. The thread_set struct implements the storage, and the parser is in parse_thread_set(), with a focus on "bind" lines, but not only.
* MINOR: threads: add a thread_harmless_end() version that doesn't waitWilly Tarreau2023-01-191-0/+12
| | | | | | | | | | | | | | | | | thread_harmless_end() needs to wait for rdv_requests to disappear so that we're certain to respect a harmless promise that possibly allowed another thread to proceed under isolation. But this doesn't work in a signal handler because a thread could be interrupted by the debug handler while already waiting for isolation and with rdv_request>0. As such this function could cause a deadlock in such a signal handler. Let's implement a specific variant for this, thread_harmless_end_sig(), that just resets the thread's bit and doesn't wait. It must of course not be used past a check point that would allow the isolation requester to return and see the thread as temporarily harmless then turning back on its promise. This will be needed to fix a race in the debug handler.
* MINOR: ssl: Add a lock to the OCSP response treeRemi Tricot-Le Breton2022-12-211-0/+1
| | | | | | | | | | | | | | The tree that contains OCSP responses is never locked despite being used at runtime for OCSP stapling as well as the CLI through "set ssl cert" and "set ssl ocsp-response" commands. Everything works though because the certificate_ocsp structure is refcounted and the tree's entries are cleaned up when SSL_CTXs are destroyed (thanks to an ex_data entry in which the certificate_ocsp pointer is stored). This new lock will come to use when the OCSP auto update mechanism is fully implemented because this new feature will be based on another tree that stores the same certificate_ocsp members and updates their contents periodically.
* CLEANUP: threads: remove the now unused all_threads_mask and tid_bitWilly Tarreau2022-07-151-5/+0
| | | | | | Since these are not used anymore, let's now remove them. Given the number of places where we're using ti->ldit_bit, maybe an equivalent might be useful though.
* MINOR: fd/thread: get rid of thread_mask()Willy Tarreau2022-07-151-6/+0
| | | | | | | | | | Since commit d2494e048 ("BUG/MEDIUM: peers/config: properly set the thread mask") there must not remain any single case of a receiver that is bound nowhere, so there's no need anymore for thread_mask(). We're adding a test in fd_insert() to make sure this doesn't happen by accident though, but the function was removed and its rare uses were replaced with the original value of the bind_thread msak.
* MINOR: thread: add is_thread_harmless() to know if a thread already is harmlessWilly Tarreau2022-07-011-0/+11
| | | | | | | The harmless status is not re-entrant, so sometimes for signal handling it can be useful to know if we're already harmless or not. Let's add a function doing that, and make the debugger use it instead of manipulating the harmless mask.
* MAJOR: threads: change thread_isolate to support inter-group synchronizationWilly Tarreau2022-07-011-25/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | thread_isolate() and thread_isolate_full() were relying on a set of thread masks for all threads in different states (rdv, harmless, idle). This cannot work anymore when the number of threads increases beyond LONGBITS so we need to change the mechanism. What is done here is to have a counter of requesters and the number of the current isolated thread. Threads which want to isolate themselves increment the request counter and wait for all threads to be marked harmless (or idle) by scanning all groups and watching the respective masks. This is possible because threads cannot escape once they discover this counter, unless they also want to isolate and possibly pass first. Once all threads are harmless, the requesting thread tries to self-assign the isolated thread number, and if it fails it loops back to checking all threads. If it wins it's guaranted to be alone, and can drop its harmless bit, so that other competing threads go back to the loop waiting for all threads to be harmless. The benefit of proceeding this way is that there's very little write contention on the thread number (none during work), hence no cache line moves between caches, thus frozen threads do not slow down the isolated one. Once it's done, the isolated thread resets the thread number (hence lets another thread take the place) and decrements the requester count, thus possibly releasing all harmless threads. With this change there's no more need for any global mask to synchronize any thread, and we only need to loop over a number of groups to check 64 threads at a time per iteration. As such, tinfo's threads_want_rdv could be dropped. This was tested with 64 threads spread into 2 groups, running 64 tasks (from the debug dev command), 20 "show sess" (thread_isolate()), 20 "add server blah/blah" (thread_isolate()), and 20 "del server blah/blah" (thread_isolate_full()). The load remained very low (limited by external socat forks) and no stuck nor starved thread was found.
* MEDIUM: tinfo: add a dynamic thread-group contextWilly Tarreau2022-07-011-15/+13
| | | | | | | | | | | | | | | | | | The thread group info is not sufficient to represent a thread group's current state as it's read-only. We also need something comparable to the thread context to represent the aggregate state of the threads in that group. This patch introduces ha_tgroup_ctx[] and tg_ctx for this. It's indexed on the group id and must be cache-line aligned. The thread masks that were global and that do not need to remain global were moved there (want_rdv, harmless, idle). Given that all the masks placed there now become group-specific, the associated thread mask (tid_bit) now switches to the thread's local bit (ltid_bit). Both are the same for nbtgroups 1 but will differ for other values. There's also a tg_ctx pointer in the thread so that it can be reached from other threads.
* CLEANUP: thread: remove thread_sync_release() and thread_sync_maskWilly Tarreau2022-07-011-15/+1
| | | | | | | | | | | | This function was added in 2.0 when reworking the thread isolation mechanism to make it more reliable. However it if fundamentally incompatible with the full isolation mechanism provided by thread_isolate_full() since that one will wait for all threads to become idle while the former will wait for all threads to finish waiting, causing a deadlock. Given that it's not used, let's just drop it entirely before it gets used by accident.
* MINOR: thread: add a new all_tgroups_mask variable to know about active tgroupsWilly Tarreau2022-07-011-0/+2
| | | | | | In order to kill all_threads_mask we'll need to have an equivalent for the thread groups. The all_tgroups_mask does just this, it keeps one bit set per enabled group.
* MINOR: tinfo: add the tgid to the thread_info structWilly Tarreau2022-07-011-2/+2
| | | | | | | At several places we're dereferencing the thread group just to catch the group number, and this will become even more required once we start to use per-group contexts. Let's just add the tgid in the thread_info struct to make this easier.
* MINOR: tinfo: make tid temporarily still reflect global IDWilly Tarreau2022-07-011-1/+1
| | | | | | For now we still set tid_bit to (1UL << tid) because FDs will not work with more than one group without this, but once FDs start to adopt local masks this must change to thr->ltid_bit.
* MINOR: tinfo: remove the global thread ID bit (tid_bit)Willy Tarreau2022-06-141-2/+2
| | | | | | | | Each thread has its own local thread id and its own global thread id, in addition to the masks corresponding to each. Once the global thread ID can go beyond 64 it will not be possible to have a global thread Id bit anymore, so better start to remove it and use only the local one from the struct thread_info.
* MINOR: threads: add a new function to resolve config groups and masksWilly Tarreau2021-10-081-0/+1
| | | | | | | | | | | | | | | | | | | In the configuration sometimes we'll omit a thread group number to designate a global thread number range, and sometimes we'll mention the group and designate IDs within that group. The operation is more complex than it seems due to the need to check for ranges spanning between multiple groups and determining groups from threads from bit masks and remapping bit masks between local/global. This patch adds a function to perform this operation, it takes a group and mask on input and updates them on output. It's designed to be used by "bind" lines but will likely be usable at other places if needed. For situations where specified threads do not exist in the group, we have the choice in the code between silently fixing the thread set or failing with a message. For now the better option seems to return an error, but if it turns out to be an issue we can easily change that in the future. Note that it should only happen with "x/even" when group x only has one thread.
* MINOR: threads: add the current group ID in thread-local "tgid" variableWilly Tarreau2021-10-081-0/+5
| | | | | | This is the equivalent of "tid" for ease of access. In the future if we make th_cfg a pure thread-local array (not a pointer), it may make sense to move it there.
* MEDIUM: threads: replace ha_set_tid() with ha_set_thread()Willy Tarreau2021-10-081-14/+34
| | | | | | | | | | | | ha_set_tid() was randomly used either to explicitly set thread 0 or to set any possibly incomplete thread during boot. Let's replace it with a pointer to a valid thread or NULL for any thread. This allows us to check that the designated threads are always valid, and to ignore the thread 0's mapping when setting it to NULL, and always use group 0 with it during boot. The initialization code is also cleaner, as we don't pass ugly casts of a thread ID to a pointer anymore.
* MEDIUM: threads: automatically assign threads to groupsWilly Tarreau2021-10-081-0/+1
| | | | | | | This takes care of unassigned threads groups and places unassigned threads there, in a more or less balanced way. Too sparse allocations may still fail though. For now with a maximum group number fixed to 1 nothing can really fail.
* MINOR: threads: make tg point to the current thread's groupWilly Tarreau2021-10-081-1/+7
| | | | | | | | A the "tg" thread-local variable now always points to the current thread group. It's pre-initializd to the first one during boot and is set to point to the thread's one by ha_set_tid(). This last one takes care of checking whether the thread group was assigned or not because it may be called during boot before threads are initialized.
* REORG: thread/sched: move the task_per_thread stuff to thread_ctxWilly Tarreau2021-10-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The scheduler contains a lot of stuff that is thread-local and not exclusively tied to the scheduler. Other parts (namely thread_info) contain similar thread-local context that ought to be merged with it but that is even less related to the scheduler. However moving more data into this structure isn't possible since task.h is high level and cannot be included everywhere (e.g. activity) without causing include loops. In the end, it appears that the task_per_thread represents most of the per-thread context defined with generic types and should simply move to tinfo.h so that everyone can use them. The struct was renamed to thread_ctx and the variable "sched" was renamed to "th_ctx". "sched" used to be initialized manually from run_thread_poll_loop(), now it's initialized by ha_set_tid() just like ti, tid, tid_bit. The memset() in init_task() was removed in favor of a bss initialization of the array, so that other subsystems can put their stuff in this array. Since the tasklet array has TL_CLASSES elements, the TL_* definitions was moved there as well, but it's not a problem. The vast majority of the change in this patch is caused by the renaming of the structures.
* CLEANUP: thread: uninline ha_tkill/ha_tkillall/ha_cpu_relax()Willy Tarreau2021-10-071-36/+3
| | | | | | These ones are rarely used or only to waste CPU cycles waiting, and are the last ones requiring system includes in thread.h. Let's uninline them and move them to thread.c.
* REORG: thread: move ha_get_pthread_id() to thread.cWilly Tarreau2021-10-071-0/+6
| | | | | It's the last function which directly accesses the pthread_t, let's move it to thread.c and leave a static inline for non-thread.
* REORG: thread: move the thread init/affinity/stop to thread.cWilly Tarreau2021-10-071-0/+15
| | | | | | | | | haproxy.c still has to deal with pthread-specific low-level stuff that is OS-dependent. We should not have to deal with this there, and we do not need to access pthread anywhere else. Let's move these 3 functions to thread.c and keep empty inline ones for when threads are disabled.
* REORG: thread: uninline the lock-debugging codeWilly Tarreau2021-10-071-549/+36
| | | | | | | | | | | | | | | | | | | The lock-debugging code in thread.h has no reason to be inlined. the functions are quite fat and perform a lot of operations so there's no saving keeping them inlined. Worse, most of them are in fact not inlined, resulting in a significantly bigger executable. This patch moves all this part from thread.h to thread.c. The functions are still exported in thread.h of course. This results in ~166kB less code: text data bss dec hex filename 3165938 99424 897376 4162738 3f84b2 haproxy-before 2991987 99424 897376 3988787 3cdd33 haproxy-after In addition the build time with thread debugging enabled has shrunk from 19.2 to 17.7s thanks to much less code to be parsed in thread.h that is included virtually everywhere.
* MINOR: quic: Add a lock for RX packetsFrédéric Lécaille2021-09-231-0/+2
| | | | | We must protect from concurrent the tree which stores the QUIC packets received by the dgram I/O handler, these packets being also parsed by the xprt task.
* REORG: threads: move ha_get_pthread_id() to tinfo.hWilly Tarreau2021-09-171-36/+0
| | | | | This solely manipulates the thread_info struct, it ought to be in tinfo.h, not in thread.h.
* CLEANUP: Remove prototype for non-existent thread_get_default_count()Tim Duesterhus2021-09-151-1/+0
| | | | This is the only location of `thread_get_default_count` within the codebase.
* CLEANUP: tree-wide: fix prototypes for functions taking no arguments.Tim Duesterhus2021-09-151-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "f(void)" is the correct and preferred form for a function taking no argument, while some places use the older "f()". These were reported by clang's -Wmissing-prototypes, for example: src/cpuset.c:111:5: warning: no previous prototype for function 'ha_cpuset_size' [-Wmissing-prototypes] int ha_cpuset_size() include/haproxy/cpuset.h:42:5: note: this declaration is not a prototype; add 'void' to make it a prototype for a zero-parameter function int ha_cpuset_size(); ^ void This aggregate patch fixes this for the following functions: ha_backtrace_to_stderr(), ha_cpuset_size(), ha_panic(), ha_random64(), ha_thread_dump_all_to_trash(), get_exec_path(), check_config_validity(), mworker_child_nb(), mworker_cli_proxy_(create|stop)(), mworker_cleantasks(), mworker_cleanlisteners(), mworker_ext_launch_all(), mworker_reload(), mworker_(env|proc_list)_to_(proc_list|env)(), mworker_(un|)block_signals(), proxy_adjust_all_maxconn(), proxy_destroy_all_defaults(), get_tainted(), pool_total_(allocated|used)(), thread_isolate(_full|)(), thread(_sync|)_release(), thread_harmless_till_end(), thread_cpu_mask_forced(), dequeue_all_listeners(), next_timer_expiry(), wake_expired_tasks(), process_runnable_tasks(), init_acl(), init_buffer(), (de|)init_log_buffers(), (de|)init_pollers(), fork_poller(), pool_destroy_all(), pool_evict_from_local_caches(), pool_total_failures(), dump_pools_to_trash(), cfg_run_diagnostics(), tv_init_(process|thread)_date(), __signal_process_queue(), deinit_signals(), haproxy_unblock_signals()
* BUILD: threads: fix -Wundef for _POSIX_PRIORITY_SCHEDULING on libmuslWilly Tarreau2021-09-151-2/+2
| | | | | | | | | | | | Building with an old musl-based toolchain reported this warning: include/haproxy/thread.h: In function 'ha_thread_relax': include/haproxy/thread.h:256:5: warning: "_POSIX_PRIORITY_SCHEDULING" is not defined [-Wundef] #if _POSIX_PRIORITY_SCHEDULING ^ There were indeed two "#if" insteadd of #ifdef" for this macro, let's fix them.
* MEDIUM: threads: add a stronger thread_isolate_full() callWilly Tarreau2021-08-041-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current principle of running under isolation was made to access sensitive data while being certain that no other thread was using them in parallel, without necessarily having to place locks everywhere. The main use case are "show sess" and "show fd" which run over long chains of pointers. The thread_isolate() call relies on the "harmless" bit that indicates for a given thread that it's not currently doing such sensitive things, which is advertised using thread_harmless_now() and which ends usings thread_harmless_end(), which also waits for possibly concurrent threads to complete their work if they took this opportunity for starting something tricky. As some system calls were notoriously slow (e.g. mmap()), a bunch of thread_harmless_now() / thread_harmless_end() were placed around them to let waiting threads do their work while such other threads were not able to modify memory contents. But this is not sufficient for performing memory modifications. One such example is the server deletion code. By modifying memory, it not only requires that other threads are not playing with it, but are not either in the process of touching it. The fact that a pool_alloc() or pool_free() on some structure may call thread_harmless_now() and let another thread start to release the same object's memory is not acceptable. This patch introduces the concept of "idle threads". Threads entering the polling loop are idle, as well as those that are waiting for all others to become idle via the new function thread_isolate_full(). Once thread_isolate_full() is granted, the thread is not idle anymore, and it is released using thread_release() just like regular isolation. Its users have to keep in mind that across this call nothing is granted as another thread might have performed shared memory modifications. But such users are extremely rare and are actually expecting this from their peers as well. Note that that in case of backport, this patch depends on previous patch: MINOR: threads: make thread_release() not wait for other ones to complete
* MEDIUM: queue: use a dedicated lock for the queues (v2)Willy Tarreau2021-06-241-0/+2
| | | | | | | | | | | | | | | | | | | | | Till now whenever a server or proxy's queue was touched, this server or proxy's lock was taken. Not only this requires distinct code paths, but it also causes unnecessary contention with other uses of these locks. This patch adds a lock inside the "queue" structure that will be used the same way by the server and the proxy queuing code. The server used to use a spinlock and the proxy an rwlock, though the queue only used it for locked writes. This new version uses a spinlock since we don't need the read lock part here. Tests have not shown any benefit nor cost in using this one versus the rwlock so we could change later if needed. The lower contention on the locks increases the performance from 362k to 374k req/s on 16 threads with 20 servers and leastconn. The gain with roundrobin even increases by 9%. This is tagged medium because the lock is changed, but no other part of the code touches the queues, with nor without locking, so this should remain invisible.
* Revert "MEDIUM: queue: use a dedicated lock for the queues"Willy Tarreau2021-06-241-2/+0
| | | | | | | | | | This reverts commit fcb8bf8650ec6b5614d1b88db54f1200ebd96cbd. The recent changes since 5304669e1 MEDIUM: queue: make pendconn_process_next_strm() only return the pendconn opened a tiny race condition between stream_free() and process_srv_queue(), as the pendconn is accessed outside of the lock, possibly while it's being freed. A different approach is required.
* MEDIUM: queue: use a dedicated lock for the queuesWilly Tarreau2021-06-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Till now whenever a server or proxy's queue was touched, this server or proxy's lock was taken. Not only this requires distinct code paths, but it also causes unnecessary contention with other uses of these locks. This patch adds a lock inside the "queue" structure that will be used the same way by the server and the proxy queuing code. The server used to use a spinlock and the proxy an rwlock, though the queue only used it for locked writes. This new version uses a spinlock since we don't need the read lock part here. Tests have not shown any benefit nor cost in using this one versus the rwlock so we could change later if needed. The lower contention on the locks increases the performance from 491k to 507k req/s on 16 threads with 20 servers and leastconn. The gain with roundrobin even increases by 6%. The performance profile changes from this: 13.03% haproxy [.] fwlc_srv_reposition 8.08% haproxy [.] fwlc_get_next_server 3.62% haproxy [.] process_srv_queue 1.78% haproxy [.] pendconn_dequeue 1.74% haproxy [.] pendconn_add to this: 11.95% haproxy [.] fwlc_srv_reposition 7.57% haproxy [.] fwlc_get_next_server 3.51% haproxy [.] process_srv_queue 1.74% haproxy [.] pendconn_dequeue 1.70% haproxy [.] pendconn_add At this point the differences are mostly measurement noise. This is tagged medium because the lock is changed, but no other part of the code touches the queues, with nor without locking, so this should remain invisible.
* BUG/MEDIUM: shctx: use at least thread-based locking on USE_PRIVATE_CACHEWilly Tarreau2021-06-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Since threads were introduced in 1.8, the USE_PRIVATE_CACHE mode of the shctx was not updated to use locks. Originally it was meant to disable sharing between processes, so it removes the lock/unlock instructions. But with threads enabled, it's not possible to work like this anymore. It's easy to see that once built with private cache and threads enabled, sending violent SSL traffic to the the process instantly makes it die. The HTTP cache is very likely affected as well. This patch addresses this by falling back to our native spinlocks when USE_PRIVATE_CACHE is used. In practice we could use them also for other modes and remove all older implementations, but this patch aims at keeping the changes very low and easy to backport. A new SHCTX_LOCK label was added to help with debugging, but OTHER_LOCK might be usable as well for backports. An even lighter approach for backports may consist in always declaring the lock (or reusing "waiters"), and calling pl_take_s() for the lock() and pl_drop_s() for the unlock() operation. This could even be used in all modes (process and threads), even when thread support is disabled. Subsequent patches will further clean up this area. This patch must be backported to all supported versions since 1.8.
* MEDIUM: pools: remove the locked pools implementationWilly Tarreau2021-06-101-2/+0
| | | | | | Now that the modified lockless variant does not need a DWCAS anymore, there's no reason to keep the much slower locked version, so let's just get rid of it.
* REORG: threads: move all_thread_mask() to thread.hWilly Tarreau2021-05-081-0/+6
| | | | | It was declared in global.h, forcing plenty of source files to include it only for this while it's only based on definitions from thread.h.
* MINOR: thread: implement the detection of forced cpu affinityAmaury Denoyelle2021-04-231-0/+4
| | | | | | | Create a function thread_cpu_mask_forced. Its purpose is to report if a restrictive cpu mask is active for the current proces, for example due to a taskset invocation. It is only implemented for the linux platform currently.
* MINOR: threads: Only consider running threads to end a thread harmeless periodChristopher Faulet2021-04-171-1/+1
| | | | | | | | | | | When a thread ends its harmeless period, we must only consider running threads when testing threads_want_rdv_mask mask. To do so, we reintroduce all_threads_mask mask in the bitwise operation (It was removed to fix a deadlock). Note that for now it is useless because there is no way to stop threads or to have threads reserved for another task. But it is safer this way to avoid bugs in the future.
* BUG/MEDIUM: threads: Ignore current thread to end its harmless periodChristopher Faulet2021-04-171-1/+1
| | | | | | | | | | | A previous patch was pushed to fix a deadlock when an isolated thread ends its harmless period (a9a9e9aac ["BUG/MEDIUM: thread: Fix a deadlock if an isolated thread is marked as harmless"]). But, unfortunately, the fix is incomplete. The same must be done in the outer loop, in thread_harmless_end() function. The current thread must be ignored when threads_want_rdv_mask mask is tested. This patch must also be backported as far as 2.0.
* CLEANUP: atomic/tree-wide: replace single increments/decrements with inc/decWilly Tarreau2021-04-071-17/+17
| | | | | | This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1) or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
* MEDIUM: streams: do not use the streams lock anymoreWilly Tarreau2021-02-241-2/+0
| | | | | | | | | | | | | The lock was still used exclusively to deal with the concurrency between the "show sess" release handler and a stream_new() or stream_free() on another thread. All other accesses made by "show sess" are already done under thread isolation. The release handler only requires to unlink its node when stopping in the middle of a dump (error, timeout etc). Let's just isolate the thread to deal with this case so that it's compatible with the dump conditions, and remove all remaining locking on the streams. This effectively kills the streams lock. The measured gain here is around 1.6% with 4 threads (374krps -> 380k).
* DEBUG: thread: add 5 extra lock labels for statistics and debuggingWilly Tarreau2021-02-181-0/+13
| | | | | | | | | Since OTHER_LOCK is commonly used it's become much more difficult to profile lock contention by temporarily changing a lock label. Let's add DEBUG1..5 to serve only for debugging. These ones must not be used in committed code. We could decide to only define them when DEBUG_THREAD is set but that would complicate attempts at measuring performance with debugging turned off.
* MEDIUM: connection: protect idle conn lists with locksAmaury Denoyelle2021-02-121-0/+2
| | | | | | | | | | | | | | | | | This is a preparation work for connection reuse with sni/proxy protocol/specific src-dst addresses. Protect every access to idle conn lists with a lock. This is currently strictly not needed because the access to the list are made with atomic operations. However, to be able to reuse connection with specific parameters, the list storage will be converted to eb-trees. As this structure does not have atomic operation, it is mandatory to protect it with a lock. For this, the takeover lock is reused. Its role was to protect during connection takeover. As it is now extended to general idle conns usage, it is renamed to idle_conns_lock. A new lock section is also instantiated named IDLE_CONNS_LOCK to isolate its impact on performance.
* MINOR: ssl: add SSL_SERVER_LOCK label in threads.hWilliam Lallemand2021-02-101-0/+1
| | | | | | | | | | | Amaury reported that the commit 3ce6eed ("MEDIUM: ssl: add a rwlock for SSL server session cache") introduced some warning during compilation: include/haproxy/thread.h|411 col 2| warning: enumeration value 'SSL_SERVER_LOCK' not handled in switch [-Wswitch] This patch fix the issue by adding the right entry in the switch block. Must be backported where 3ce6eed is backported. (2.4 only for now)