====== 随机数发生器的性能问题(proposal；已实现) ====== ===== 现象 ===== 创建多个线程之后，Chacha20 PRNG （arc4random_buf）只吃了一个CPU核心的计算能力。 ===== 分析 ===== arc4random 在进行较为耗时的操作，如arc4random_buf时是持锁进行的。因此，两个线程同时调用 arc4random_buf 生成较大尺寸的数据时，调用是串行执行的。 ===== 目前想到的解法 ===== 总体来说是将目前只有一个随机状态改为可以有最多CPU数量的随机状态。将这些随机状态放在一个队列中，每次使用时取队头（如果不存在已知的随机状态，将 rs0 填入队列，如果存在且队列为空，则尝试分配新的随机状态并将其放进队列；如果队列仍为空，则等待其他线程唤醒cv_queue）。放回的逻辑比较简单，将取出第一步，需要将直接引用 Chacha20 状态的代码改为使用指针而不是全局变量，并将现有的这些全局变量作为 rs0。增加一对用于获取和放回随机状态的函数，其中get部分拿锁并返回rs0，put部分放锁。将原有的拿锁、放锁改为获取和放回随机状态。第二步是改写获取和放回随机状态的函数。增加一个全局队列 rsq、一个计数器 rscount。大致逻辑如下，获取部分：


    pthread_mutex_lock(&rsq->mutex);
    
    while ((qe = TAILQ_FIRST(&rsq->queue_head)) == NULL) {
        if (rscount < NCPU) {
            /*
             * special case for first one.
             *
             * We use the statically allocated state
             * to make sure that we have at least one
             * available state.
             */
            if (rscount == 0)
                qe = &rs0;
            else
                qe = new rstate;
            if (qe != NULL) {
                rscount++;
                break;
            }
        }
        /* All (NCPU) states are in use, sleep and wait for a wakeup */
        pthread_cond_wait(&rsq->cv_queue, &rsq->mutex);
    }
    /* assert(qe != NULL); */
    
    /*
     * If the queue is not empty, we must got the state from
     * there, and the state has to be removed from the queue.
     */
    if (!TAILQ_EMPTY(&rsq->queue_head))
        TAILQ_REMOVE(&rsq->queue_head, qe, entries);
    pthread_mutex_unlock(&rsq->mutex);
    
    return (qe);

放回部分：


    pthread_mutex_lock(&rsq->mutex);
    TAILQ_INSERT_HEAD(&rsq->queue_head, qe, entries);
    pthread_cond_signal(&rsq->cv_queue);
    pthread_mutex_unlock(&rsq->mutex);

预计需要大约1-1.5个小时左右的工作量。