Nginx's thundering herd solution

Summary of the thundering herd effect

The thundering herd effect occurs in both multi-process and multi-threaded servers. This article analyzes the multi-process case.


Since Linux 2.6, the thundering herd effect of the accept system call has been solved (provided no event notification mechanism such as select, poll, or epoll is used).


Linux has now partially solved the thundering herd effect for epoll (when epoll_create is called before fork), although Linux 2.6 had not.


An epoll instance created after fork still suffers from the thundering herd effect, and Nginx uses its own mutex to solve it.


What is the thundering herd effect?

The thundering herd occurs when multiple processes (or threads) block waiting for the same event (in the sleep state). When that event occurs, the kernel wakes up all the waiting processes (or threads), but in the end only one process (thread) can obtain "control" of the event and handle it, while the others fail to obtain control and can only go back to sleep. This phenomenon, and the performance it wastes, is called the thundering herd effect.


What does the thundering herd effect cost?

The Linux kernel performs frequent, useless scheduling and context switches on user processes (threads), which significantly reduces system performance. When the context-switch rate is too high, the CPU behaves like a porter, constantly shuttling between registers and the run queue; more time is spent switching processes (threads) than doing the real work inside them. The direct cost is saving and reloading CPU registers (for example, the program counter) and executing the scheduler's code; the indirect cost lies in shared data bouncing between per-core caches.
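
To make this cost observable, the sketch below (an illustration added here, not part of the kernel or Nginx code) uses getrusage() to print a process's voluntary and involuntary context-switch counters before and after a workload; extra wakeups show up as additional switches in these counters.

/* Minimal sketch: reading the per-process context-switch counters
 * that Linux exposes via getrusage(). ru_nvcsw counts voluntary
 * switches (e.g. blocking in a wait), ru_nivcsw counts involuntary
 * ones (the scheduler preempted us). */
#include <stdio.h>
#include <sys/resource.h>

static void print_ctx_switches(const char *tag)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("%s: voluntary=%ld involuntary=%ld\n",
               tag, ru.ru_nvcsw, ru.ru_nivcsw);
    }
}

int main(void)
{
    print_ctx_switches("at start");
    /* ... run the workload being measured here ... */
    print_ctx_switches("after workload");
    return 0;
}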


To ensure that only one process (thread) gets the resource, access to the resource must be protected by a lock, which adds overhead. Some common server software solves the problem with a lock mechanism, such as Nginx (its lock is enabled by default and can be turned off); others consider the thundering herd to have little impact on system performance and do not deal with it, such as Lighttpd.


Linux's solution for accept

Before Linux 2.6, processes blocked on the same listening socket all slept on the same wait queue, and when a request arrived, every waiting process was woken up.


Since Linux 2.6, the accept thundering herd has been solved by introducing the WQ_FLAG_EXCLUSIVE flag.


The detailed analysis is in the code comments. The accept implementation snippet is as follows:

// When accept() is called and there is no pending connection, it blocks
// (unless the socket is non-blocking). The blocking function is
// inet_csk_accept (the kernel function behind accept).
struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
{
    ...
    // Wait for a connection.
    error = inet_csk_wait_for_connect(sk, timeo);
    ...
}

static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
    ...
    for (;;) {
        // Only one process will be woken up.
        // Non-exclusive entries are added at the head of the wait queue;
        // exclusive entries are added after all non-exclusive entries.
        prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
    }
    ...
}

void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
    unsigned long flags;

    // Mark the wait-queue entry as EXCLUSIVE, meaning only one process
    // will be woken at a time. Note this flag: the wakeup path below
    // relies on it.
    wait->flags |= WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&q->lock, flags);
    if (list_empty(&wait->task_list))
        // Add to the tail of the wait queue.
        __add_wait_queue_tail(q, wait);
    set_current_state(state);
    spin_unlock_irqrestore(&q->lock, flags);
}

The code snippet that wakes up the blocked accept is as follows:

// When a TCP handshake completes, the socket is moved from the half-open
// queue to the accept queue; at that point the blocked accept can be woken.
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    ...
    // Note this function.
    if (tcp_child_process(sk, nsk, skb)) {
        rsk = nsk;
        goto reset;
    }
    ...
}

int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb)
{
    ...
    // Wake up the parent, send SIGIO.
    if (state == TCP_SYN_RECV && child->sk_state != state)
        // Call sk_data_ready to notify the parent process.
        // In TCP this callback is sock_def_readable, which calls
        // wake_up_interruptible_sync_poll to wake the wait queue.
        parent->sk_data_ready(parent, 0);
    ...
}

void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key)
{
    ...
    // Note this function.
    __wake_up_common(q, mode, nr_exclusive, wake_flags, key);
    spin_unlock_irqrestore(&q->lock, flags);
    ...
}

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key)
{
    ...
    // nr_exclusive is passed in as 1, so when flags & WQ_FLAG_EXCLUSIVE is
    // true, one waiter is woken and the loop breaks.
    // Recall that the entries added for accept are WQ_FLAG_EXCLUSIVE.
    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;
        if (curr->func(curr, mode, wake_flags, key)
            && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    ...
}

Linux's solution for epoll

When I/O multiplexing such as select, poll, epoll, or kqueue is used, the multi-process (thread) processing path is more complicated.


Therefore, the epoll thundering herd needs to be discussed in two cases:

epoll_create called before fork
epoll_create called after fork

epoll_create called before fork

For the same reason as the accept thundering herd, when an event occurs, all processes (threads) waiting on the same file descriptor are woken up, and the solution follows the same idea as accept.


Why wake them all? Because the kernel does not know whether you are waiting on the file descriptor in order to call accept(), or to do something else (signal handling, timer events).


The thundering herd effect in this case has been resolved.


epoll_create called after fork

Recall that if epoll_create is called before fork, all processes share one epoll instance (the same red-black tree).


If we only had to handle accept events, that would be fine. But epoll does not only handle accept events: the read and write events that follow accept must be processed as well, along with timer and signal events.


When a connection arrives, one process must be chosen to accept it, and any process may do so. Once the connection is established, its subsequent read and write events are tied to that process: after a request establishes a connection with process A, its later reads and writes should also be handled by process A.


When a read or write event occurs, which process should be notified? epoll does not know, so the event may be delivered to the wrong process. Therefore, each process (thread) usually creates its own epoll event loop, and each process registers its read and write events only in its own epoll instance.


We know that the kernel's fix for the epoll thundering herd relies on all processes sharing the same epoll structure. When epoll_create is executed after fork, each process has its own epoll red-black tree, wait queue, and ready-event list, so the thundering herd reappears. Sometimes all processes are woken, sometimes only some; this may be because the event has already been handled by some process, so there is no need to notify the processes that have not yet been notified.
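
As a hedged toy reproduction of this case (an illustration, not Nginx code; the port and the worker count are arbitrary choices here): each worker below creates its own epoll instance after fork and watches the same inherited listening socket, so a single incoming connection typically wakes more than one worker.

/* Toy reproduction of the "epoll_create after fork" thundering herd.
 * Illustrative only; error handling omitted, port 8080 is arbitrary. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
    listen(listen_fd, 128);
    fcntl(listen_fd, F_SETFL, O_NONBLOCK);     /* losers' accept won't block */

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* epoll is created AFTER fork: each worker gets its own
             * red-black tree, wait queue and ready list. */
            int ep = epoll_create(1);
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
            epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

            for (;;) {
                struct epoll_event out;
                if (epoll_wait(ep, &out, 1, -1) > 0) {
                    /* Several workers may print this for one connection. */
                    printf("worker %d woken\n", (int) getpid());
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn >= 0)
                        close(conn);           /* only one accept succeeds */
                }
            }
        }
    }

    for (;;)
        pause();                               /* parent just idles */
}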


The design of the lock in Nginx's solution

First, we need to understand how an inter-process lock is implemented in user space. The basic principle is simple: make something that all processes share, such as mmap'ed memory or a file, and then use it to enforce mutual exclusion between processes.
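
As a hedged illustration of that principle (an added sketch, not Nginx source; names such as shm_lock_t are made up here), the snippet below places a lock word in anonymous shared memory created with mmap before fork(), and uses a GCC __sync compare-and-swap so that parent and child contend on the same word.

/* Illustrative sketch: an inter-process lock word living in anonymous
 * shared memory, with compare-and-swap for mutual exclusion. */
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    volatile long lock;            /* 0 = free, otherwise the holder's pid */
} shm_lock_t;

int main(void)
{
    /* MAP_SHARED | MAP_ANONYMOUS memory remains shared across fork(),
     * so parent and child see the same lock word. */
    shm_lock_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    m->lock = 0;

    if (fork() == 0) {
        /* Child: succeeds only if the word goes from 0 to our pid atomically. */
        if (__sync_bool_compare_and_swap(&m->lock, 0, (long) getpid()))
            printf("child %d acquired the lock\n", (int) getpid());
        else
            printf("child %d found the lock held by %ld\n",
                   (int) getpid(), (long) m->lock);
        _exit(0);
    }

    /* Parent: contends on the same word in shared memory. */
    if (__sync_bool_compare_and_swap(&m->lock, 0, (long) getpid()))
        printf("parent %d acquired the lock\n", (int) getpid());
    else
        printf("parent %d lost the race\n", (int) getpid());

    wait(NULL);

    /* Release only if we are still the holder (compare against our pid). */
    (void) __sync_bool_compare_and_swap(&m->lock, (long) getpid(), 0);
    return 0;
}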


The lock used by Nginx is implemented by Nginx itself, and the implementation covers two cases. When atomic operations are supported, controlled by the macro NGX_HAVE_ATOMIC_OPS, the lock is built on atomics; when they are not supported, it is implemented with file locks.


Lock structure

If atomic operations are supported, mmap can be used directly, and lock stores the address of the mmap'ed memory region.


If atomic operations are not supported, file locks are used instead, where fd is the file descriptor shared between processes and name is the file name.

typedef struct {  
    #if (NGX_HAVE_ATOMIC_OPS)  
        ngx_atomic_t  *lock;  
    #else  
        ngx_fd_t       fd;  
        u_char        *name;  
    #endif  
} ngx_shmtx_t;

Atomic lock creation

// If atomic operations are supported, creation is very simple: the address
// of the shared memory is assigned to the lock field.
ngx_int_t ngx_shmtx_create(ngx_shmtx_t *mtx, void *addr, u_char *name)
{
    mtx->lock = addr;
    return NGX_OK;
}

Atomic lock acquisition

trylock is non-blocking: it attempts to acquire the lock and, if it fails, returns an error immediately.


The blocking lock also tries to acquire the lock, but when it fails it does not return immediately. Instead, it enters a loop and keeps trying until the lock is acquired. Nginx uses a trick here: on each failed attempt, the current process is moved to the back of the CPU's run queue, that is, it voluntarily gives up the CPU.
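
A rough sketch of that spin-then-yield loop follows (an added illustration, not ngx_shmtx_lock itself; the lock word here is a plain C11 atomic rather than Nginx's shared-memory word, and the function names are invented).

/* Rough sketch of the "keep trying, but give up the CPU on each failure"
 * acquire loop described above. Not Nginx source code. */
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_long lock_word;              /* 0 = free, otherwise holder's pid */

static int lock_try(void)
{
    long expected = 0;
    return atomic_compare_exchange_strong(&lock_word, &expected, (long) getpid());
}

static void lock_acquire(void)
{
    for (;;) {
        if (lock_try())
            return;                        /* acquired */
        /* Failed: move to the back of the run queue instead of spinning,
         * i.e. voluntarily give up the CPU before retrying. */
        sched_yield();
    }
}

static void lock_release(void)
{
    long self = (long) getpid();
    /* Release only if we are the current holder. */
    atomic_compare_exchange_strong(&lock_word, &self, 0);
}

int main(void)
{
    lock_acquire();
    /* ... critical section ... */
    lock_release();
    return 0;
}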


Atomic lock implementation

If the system library provides a CAS primitive (here OSAtomicCompareAndSwap32Barrier, on Darwin), it is called directly:

#define ngx_atomic_cmp_set(lock, old, new)                                    \
    OSAtomicCompareAndSwap32Barrier(old, new, (int32_t *) lock)

If the system library does not provide one, Nginx implements it itself in assembly.

static ngx_inline ngx_atomic_uint_t
ngx_atomic_cmp_set(ngx_atomic_t *lock, ngx_atomic_uint_t old, ngx_atomic_uint_t set)
{
    u_char  res;

    __asm__ volatile (

         NGX_SMP_LOCK
    "    cmpxchgl  %3, %1;   "
    "    sete      %0;       "

    : "=a" (res) : "m" (*lock), "a" (old), "r" (set) : "cc", "memory");

    return res;
}

Atomic lock release

Unlocking is relatively simple: compare the lock value with the current process id, and if they are equal, set the lock to 0, indicating the lock has been released.

#define ngx_shmtx_unlock(mtx)                                                 \
    (void) ngx_atomic_cmp_set((mtx)->lock, ngx_pid, 0)

How Nginx solves the thundering herd effect

Variable analysis
// If master/worker mode is used, the number of workers is greater than 1,
// and accept_mutex is enabled in the configuration, set ngx_use_accept_mutex.
if (ccf->master && ccf->worker_processes > 1 && ecf->accept_mutex) {
    ngx_use_accept_mutex = 1;
    // The two variables below are explained later.
    ngx_accept_mutex_held = 0;
    ngx_accept_mutex_delay = ecf->accept_mutex_delay;
} else {
    ngx_use_accept_mutex = 0;
}

ngx_use_accept_mutex: if set, Nginx needs to use the accept mutex; it is initialized in ngx_event_process_init.
ngx_accept_mutex_held: whether the current process already holds the lock.
ngx_accept_mutex_delay: how long to wait before requesting the lock again after failing to acquire it; it can be set in the configuration file.

ngx_accept_disabled = ngx_cycle->connection_n / 8
                      - ngx_cycle->free_connection_n;

ngx_accept_disabled: a threshold; if it is greater than 0, the current process is already handling too many connections.

Whether to use the lock

// Only do this processing if the accept mutex is in use.
if (ngx_use_accept_mutex) {
    // If greater than 0, skip the lock handling below and just decrement.
    if (ngx_accept_disabled > 0) {
        ngx_accept_disabled--;
    } else {
        // Try to acquire the lock; return on error.
        if (ngx_trylock_accept_mutex(cycle) == NGX_ERROR) {
            return;
        }
        // If ngx_accept_mutex_held is 1, the lock has been acquired;
        // set the flag, which is explained below.
        if (ngx_accept_mutex_held) {
            flags |= NGX_POST_EVENTS;
        } else {
            // Otherwise set the timer; this is also explained below.
            if (timer == NGX_TIMER_INFINITE
                 || timer > ngx_accept_mutex_delay) {
                timer = ngx_accept_mutex_delay;
            }
        }
    }
}

// If ngx_posted_accept_events is not NULL, there are accept events for nginx to handle.
if (ngx_posted_accept_events) {
    ngx_event_process_posted(cycle, &ngx_posted_accept_events);
}

NGX_POST_EVENTS flag: setting this flag means that when a socket is woken by data, Nginx does not accept or read immediately; instead it saves the event and performs the accept or read on the handle after the lock is released.


If the NGX_POST_EVENTS flag is not set, nginx will immediately accept or read the handle.


Timer: if Nginx fails to obtain the lock, it does not keep retrying immediately; instead it sets the timer and sleeps in epoll (unless something else wakes it). If a connection arrives during that time, the sleeping process is woken up early and accepts immediately; otherwise it sleeps for ngx_accept_mutex_delay and then tries the lock again.
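
Putting the pieces together, the order of operations in a worker's event-loop iteration is roughly as follows (a simplified sketch reconstructed from the snippets above, not verbatim Nginx source):

// 1. Wait for events. With NGX_POST_EVENTS set, handlers are not run here;
//    the events are queued instead.
(void) ngx_process_events(cycle, timer, flags);

// 2. Handle the queued accept events first, while still holding the lock.
if (ngx_posted_accept_events) {
    ngx_event_process_posted(cycle, &ngx_posted_accept_events);
}

// 3. Release the accept mutex so other workers can accept new connections.
if (ngx_accept_mutex_held) {
    ngx_shmtx_unlock(&ngx_accept_mutex);
}

// 4. Only then handle the queued read/write events, outside the lock.
if (ngx_posted_events) {
    ngx_event_process_posted(cycle, &ngx_posted_events);
}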

Acquiring the lock to solve the thundering herd

ngx_int_t ngx_trylock_accept_mutex(ngx_cycle_t *cycle)
{
    // Try to acquire the lock.
    if (ngx_shmtx_trylock(&ngx_accept_mutex)) {
        // If we already held the lock, just return OK.
        if (ngx_accept_mutex_held
            && ngx_accept_events == 0
            && !(ngx_event_flags & NGX_USE_RTSIG_EVENT))
        {
            return NGX_OK;
        }

        // Reaching here means we have newly acquired the lock, so the
        // listening handles that were removed earlier must be re-enabled.
        if (ngx_enable_accept_events(cycle) == NGX_ERROR) {
            ngx_shmtx_unlock(&ngx_accept_mutex);
            return NGX_ERROR;
        }

        ngx_accept_events = 0;
        // Mark the lock as held.
        ngx_accept_mutex_held = 1;

        return NGX_OK;
    }

    // If we held the lock before but failed to acquire it this time, the
    // listening handles are now being watched by another process, so we
    // must remove the registered listening handles from our epoll.
    // This also gives good load balancing across the worker processes.
    if (ngx_accept_mutex_held) {
        if (ngx_disable_accept_events(cycle) == NGX_ERROR) {
            return NGX_ERROR;
        }
        // Mark the lock as no longer held.
        ngx_accept_mutex_held = 0;
    }

    return NGX_OK;
}

As the code above shows, when a connection arrives, only the process that currently holds the lock has the listening fd in its epoll event list. The process that grabs the lock posts the event, releases the lock first, and then accepts; the processes that did not grab it have removed the fd from their event lists, so they do not make wasted accept calls. At the same time, because of the lock (and the timer used when acquiring it), each process gets to accept connections relatively fairly, which nicely solves load balancing among the worker processes.