Most Powerful Open Source ERP

Multi-threaded coroutines with work stealing in Cython

After making a proof of concept multi-core HTTP server for Python using Lwan and Cython, we decided to extract Lwan's coroutines and evaluate them on their own. This experiment shows that system programming in Cython is possible.
  • Last Update:2019-12-28
  • Version:001
  • Language:en

Quite recently, we made a proof of concept multi-core HTTP server for Python using Lwan and Cython. Initially, the goal was to combine a performant C/C++ coroutine library with Cython's parallel module to produce multi-threaded coroutines. During that process, we evaluated several popular coroutine libraries and realized that none could fulfill our needs. This is how we finally found Lwan, a powerful HTTP server written in C, and it was the perfect fit.

LWAN coroutines

Lwan uses its own crafted coroutines. Since we were looking for a performant coroutine library, we decided to extract them and evaluate them on their own.

Work stealing

Coroutines on their own are slightly useless without a scheduler, we thus needed to implement a suitable scheduling algorithm. Work stealing is the algorithm used by Go's scheduler, which explains its ability to scale efficiently beyond 8 cores.

Some variants of the work stealing algorithm exist. I am going to describe the one in our case (heavily inspired from Juliusz Chroboczek's system programming project):

The scheduler contains n threads (called workers). Each worker has a double-ended queue (deque) of coroutines. Deques sides are labelled by "front" and "back".

Each worker resumes its coroutines one by one, by popping them from the front of its deque. If its own deque is empty, the worker will try to steal one coroutine from another worker (hence the algorithm's name). If it still has not managed to find work, it will fall asleep for 1 ms.

A worker that is in the work stealing phase behaves as follows. It starts by randomly picking a worker k, and tries to steal one coroutine from the back of the victim's deque. If this operation succeeds, it resumes the stolen coroutine and then continues running the normal algorithm. Otherwise, it iterates through all the other workers starting at k + 1 (modulo the number of workers) and tries to steal work from one of them. If all the deques were empty, the work stealing failed.

When a coroutine yields, it is pushed to the back of its associated worker's deque. The scheduler ends when all the workers are idle. Note that when a coroutine is stolen, data about its associated worker must be updated.

Implementation

Cython already provides parallelism through OpenMP. But OpenMP is not the best solution for threads in this context because it poses the following problems:

  • When an idle worker (i.e. thread) does not find a coroutine to steal, it falls asleep for a constant time t (this falling asleep time could be optimized in several ways). The scheduler ends when all the workers are idle: they are all sleeping after having failed to steal work. But OpenMP does not offer a trivial way to put to sleep / wake up a thread; it thus becomes complicated to know when the scheduler can stop.
  • OpenMP also does not propose to create threads "on the fly": this can be a problem if we want to detach a coroutine to a new thread (for a blocking call, e.g. I/O).

This is why we chose to use posix threads instead because they solve the problems mentioned above. However, pthread is not currently integrated with Cython and I had to interface it myself. Semaphores were used to find out how many threads are asleep, and thus to know when to end the scheduler.

To implement the double-ended queues, I decided to use the standard C++ deques. Cython already provides wrappers for most (all?) C++ standard data structures, including deques.

The source code is available in this repository.

Here is an example snippet showing how to use the coroutines, in Cython:

def main():
  cdef:
    scheduler_t s
    unsigned int i

  with nogil:
    scheduler_init(&s)

    for i in range(1000):
      scheduler_coro_add(&s, foo)

    scheduler_run(&s)
    scheduler_destroy(&s)

# Example task
cdef int foo(coro_t *coroutine, void *arg) nogil:
  cdef int foo = 5
  coro_yield(coroutine, coro_yield_value.MAY_RESUME)
  foo *= 2
  coro_yield(coroutine, coro_yield_value.FINISHED)

By default, scheduler_init() will spawn as many workers (threads) as they are CPU cores. Note that this function can receive a second argument to control the number of workers it should spawn:

# Initialize the scheduler with 12 workers.
scheduler_init(&s, 12)

Comparison and benchmarks

Having a true reference point to compare our experiment to is quite difficult. To our knowledge, they has never been an efficient attempt to multithreaded coroutines in Cython-only before. There might be gevent, but although it uses Cython, it is meant to be used with Python code.

Nevertheless, we compared our coroutines and work-stealing scheduler to several technologies with a concurrent model. The following table shows the time it takes to spawn 500 000 empty coroutines (or similar concurrent components, e.g. goroutines for Go) and execute all of them. This test was performed on my chromebook with a 2-core CPU and 4 GB of memory.

500 000 empty coroutines in several technologies with a concurrency model
Name Language Spawn time (sec) Run time (sec) Total time (sec)
asyncio Python 1.11 9.70 10.82
asyncio (with uvloop) Python 1.11 6.83 7.91
gevent Python X X 8.30
goroutines Go X X 0.39
Lwan coroutines with work-stealing Cython 1.15 0.27 1.49

Here is a bit more precision on what the actual measured times are:

  • The spawn time is the time measured to only spawn the coroutines (e.g. allocate memory);
  • The run time is the time it takes to execute all of them;
  • The total time counts both previous times plus the potential time needed to initialize the loop or scheduler.

Go and gevent are a little different from others: goroutines and greenlets are directly executed in parallel after being created, while in the other cases, coroutines are waiting for the whole batch to be created before running.

As we can see in the case of Lwan, 75% percent of the total time is spent to only spawn the coroutines. Now, let me explain why this is. In Lwan, coroutines are individually allocated: spawning 500 000 coroutines means calling a malloc 500 000 times in a row. Allocating a lot of small chunks of memory in such a way is really inefficient. To solve this issue, a memory management system could be used to allocate a big chunk of memory, and redistribute it in smaller pieces to the coroutines. If such a thing is implemented, and combined with the ability to run coroutines directly in parallel when they are spawned, Lwan's coroutines with Cython could obtain similar performances to Go's goroutines.

In this example, Go and Cython were limited to 1 thread to run goroutines/coroutines. To see how they both scale when multiple cores are available, let's increase the number of threads they're allowed to use. For an example workload, I chose to compute the 100 000th Fibonacci's number in each goroutine/coroutine. The following test was done on a more powerful machine: a desktop with a 4-core CPU and 8 GB of memory:

Closing words

Note that the idea behind this project was not to simply wrap a coroutine library and make it multi-threaded, but instead to add a notion of coroutines to Cython itself. The idea was to give the illusion of serial coroutines to the user, while in fact, they can be spread on multiple cores. Before starting this project, I had never used Cython and I didn't know what a coroutine was. This project is still immature and could be improved in a lot of different ways. However, it demonstrates that system programming in Cython is possible and must not be negliged.