
Why cooperative multitasking is a bad idea

August 19, 2019 - lutz

Maybe you are already familiar with coroutines and cooperative multitasking as they exist in Python. The upcoming C++20 standard will include coroutines, with which something equivalent can be implemented quite easily. In my recreational C++ playtime I’ve recently played with coroutines and generators and discovered that cooperative multitasking is unlikely to outperform threaded multitasking.

Some History

In the old times™ (before the invention of preemptive schedulers), applications had to hand control back to the kernel (yield) every once in a while to create the illusion of multitasking. The kernel would then choose one of the currently runnable applications to execute next. If every application hands control back to the kernel quickly enough, it feels as if the system executes multiple processes simultaneously. However, since the introduction of preemptive schedulers - think of a recurring hardware interrupt that allows the kernel to suspend a process - application developers are not exactly encouraged to implement their software in a fashion where they have to yield regularly.

Python’s memory management relies heavily on reference counting, and unfortunately the reference counting is not implemented like C++’s std::shared_ptr: no atomic operations are used to modify the reference counts, which renders multithreaded operations on references tricky. If no atomic operations are at hand, other means of synchronization have to be employed. Mutexes (or futexes) would be an option, but a mutex lock has the reputation of being overly costly, and (pthread-)mutexes are 40 bytes in size, which is out of proportion given that one would have to be embedded into every reference-counted object. Also, the reasoning of Python’s designers was not to impose this level of platform specifics onto the language (whether that was a smart decision is not for me to judge; for all I know they could simply have left that part out of the language specification and let the interpreter designers decide). In the end, the designers decided to employ a single mutex, aka the Global Interpreter Lock (GIL), that allows only one thread at a time to execute the interpreter. The detailed story behind it can be read here. In short: even with multithreaded (standard) Python, your code will never be executed concurrently.

Anyway, even with the GIL in place, Python enables programmers to do something similar to multitasking: cooperative multitasking.

What is Cooperative Multitasking?

Here is a very basic Python example that should clarify what I’m talking about:

from asyncio import Event, run, gather

async def foo(event):
    print("foo start")
    await event.wait()
    print("foo end")

async def bar(event):
    print("bar start")
    event.set()
    print("bar end")

async def main():
    event = Event()
    await gather(foo(event), bar(event))

if __name__ == '__main__':
    run(main())

The important piece of this code is the await keyword. await suspends the current task and returns control to the currently active event loop (which is hidden inside the call to run in main). Returning control here means that the current context is saved and execution switches to the context of the event loop (i.e., to the root context of the script).
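For what it’s worth, running this script should print:

foo start
bar start
bar end
foo end

bar’s event.set() merely marks foo as runnable; foo only resumes after bar has finished its step and control has returned to the event loop. At no point do two tasks run simultaneously.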

There are two obviously nice things about this type of multitasking:

1. Tasks are only ever suspended at explicit await points, so there is no preemption to guard against and barely any need for synchronization primitives.
2. The context switches happen entirely in userland, so no (supposedly expensive) kernel calls are needed to switch between tasks.

The first point is quite obvious but does not really affect the savvy coder, as synchronization is not that complicated after all. But the second point is worth looking into!

The Costs of a Kernel Call

A kernel call involves an interrupt that switches the CPU into privileged mode. This is necessary to change the address mapping and access kernel memory. Also, all registers need to be saved so that the interrupted application can be resumed later; this includes the FPU and possibly other CPU extension registers, since the interrupt could happen in the middle of a sequence of floating point or fancy SIMD operations. On top of that, the CPU caches are likely to be stale after the switch, as kernel and application memory are disjoint, which adds yet more overhead. That reads very scary, BUT: kernel developers put a lot of sweat into optimizing exactly this part of their OSes. It sounds like a lot of overhead, yet on modern systems a simple kernel call takes in the ballpark of a few hundred nanoseconds (some 100e-9 s). You can easily do more than a million kernel calls per second.
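If you want to verify that ballpark figure yourself, a crude benchmark might look like this (a minimal sketch, assuming Linux; the raw syscall sidesteps any libc-level shortcuts):

#include <chrono>
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>

int main() {
    constexpr long N = 1'000'000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        syscall(SYS_getpid); // one cheap kernel call per iteration
    }
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.1f ns per kernel call\n", static_cast<double>(ns) / N);
}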

When the context switch happens inside the application, there is no need to change the address mapping or to save the fancy registers. So this is where you might expect a performance gain. At least I did.

Coroutines in C++

Coroutines will be included in C++ itself with the next language iteration, but there is no need to wait: the POSIX ucontext API (getcontext, makecontext, swapcontext and friends) already implements userland context switches.

Note that context switches with ucontext also save all CPU registers, so they do more than necessary and are not ideal for actual performance optimizations. For that, refer to Boost.Fiber’s implementation of cooperative multitasking: Boost has a custom implementation (fcontext) in place of ucontext which is significantly faster. But for the sake of this article, the performance of the context switching itself is not important.
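To get a feeling for the API, here is a minimal self-contained sketch of a userland context switch with ucontext (a toy example, not taken from the repository below):

#include <cstdio>
#include <ucontext.h>

static ucontext_t mainContext, coroContext;

static void coro() {
    std::puts("coro: started");
    swapcontext(&coroContext, &mainContext); // yield back to main
    std::puts("coro: resumed");
    // returning from coro() continues at uc_link, i.e., mainContext
}

int main() {
    static char stack[64 * 1024]; // the coroutine runs on its own stack
    getcontext(&coroContext);
    coroContext.uc_stack.ss_sp = stack;
    coroContext.uc_stack.ss_size = sizeof(stack);
    coroContext.uc_link = &mainContext;
    makecontext(&coroContext, coro, 0);

    std::puts("main: switching to coro");
    swapcontext(&mainContext, &coroContext); // run coro until it yields
    std::puts("main: resuming coro");
    swapcontext(&mainContext, &coroContext); // run coro to completion
    std::puts("main: done");
}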

Here is what I’ve implemented until I hit a wall:
https://github.com/nerdmaennchen/recreational-asyncio

And here are the important bits and pieces:

main.cpp

#include "asyncio/AsyncIO.h"

int main() {
    asyncio::Scheduler scheduler;
    using namespace std::chrono_literals;
    
    simplyfile::ServerSocket ss{simplyfile::makeUnixDomainHost("mysock")};
    ss.listen();
    auto server = [](simplyfile::ServerSocket ss) -> void {
        while (true) {
            asyncio::await(asyncio::readable(ss));
            simplyfile::ClientSocket cs = ss.accept();
            cs.setFlags(O_NONBLOCK);
            std::cout << "client connected" << std::endl;
            std::vector<std::byte> buf{4096};
            while (true) {
                try {
                    asyncio::await(asyncio::readable(cs));
                    auto r = read(cs, buf.data(), buf.size());
                    if (r <= 0) {
                        break;
                    }
                    // write may be partial; keep an offset so the remaining
                    // bytes are sent instead of the same prefix again
                    decltype(r) offset = 0;
                    while (r) {
                        asyncio::await(asyncio::writable(cs));
                        auto written = write(cs, buf.data() + offset, r);
                        if (written == -1) {
                            throw std::runtime_error("client gone away");
                        }
                        offset += written;
                        r -= written;
                    }
                } catch (std::exception const& e) {
                    std::cout << e.what() << std::endl;
                    break;
                }
            }
            std::cout << "client disconnected" << std::endl;
        }
    };

    auto client = [&]() -> void {
        while (true) {
            {
                simplyfile::ClientSocket cs{simplyfile::makeUnixDomainHost("mysock")};
                cs.setFlags(O_NONBLOCK);
                cs.connect();
                asyncio::await(asyncio::writable(cs));
                std::string buf = "Hello World";
                write(cs, buf.data(), buf.size());

                asyncio::await(asyncio::readable(cs));
                auto r = read(cs, buf.data(), buf.size());
                buf.resize(r);
                std::cout << buf << std::endl;
            }
            asyncio::await(asyncio::readable(simplyfile::Timer{1s}));
        }
    };

    auto servTask = scheduler.run(server, std::move(ss));
    auto clientTask = scheduler.run(client);

    auto tasks_done = [](auto const&... t) {
        return (... and (t.wait_until(std::chrono::system_clock::time_point{}) == std::future_status::ready));
    };

    while (not tasks_done(servTask, clientTask)) {
        scheduler.work();
    }
	std::cout << "done" << std::endl;
	return 0;
}

The code sets up an echoing Unix domain server socket and has a client connect to it every second to shout Hello World (of course) into it. The lambdas implementing the server and client logic seemingly never return, but a context switch happens whenever asyncio::await is called.
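When run, the output should look something like this, repeating once per second:

client connected
Hello World
client disconnected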

asyncio.h

namespace asyncio {
namespace detail {
void await(TesterBase const& tester);
}

template<typename RetType>
RetType await(std::future<RetType> future) {
    detail::await(detail::Tester{[&]() {
        auto wait_res = future.wait_until(std::chrono::system_clock::time_point{});
        return wait_res != std::future_status::timeout;
    }});
    return future.get();
}
}
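To illustrate the interface: any std::future can be awaited from within a task. A hypothetical usage snippet (assuming it runs inside a task spawned via scheduler.run):

// suspends the current task until the future is ready, then yields its value
std::future<int> f = std::async(std::launch::async, [] { return 42; });
int answer = asyncio::await(std::move(f));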

asyncio.cpp

namespace asyncio {
void detail::await(TesterBase const& tester) {
    if (not stack_guard_valid()) {
        // not called from within a task: nothing to suspend
        return;
    }
    if (tester.test()) {
        // the awaited condition already holds: no context switch needed
        return;
    }
    stackGuard.state->enque_tester(tester, stackGuard.info);
    swapcontext(&stackGuard.info->context, stackGuard.callerContext);
}
}

Under the hood, asyncio::await creates a detail::Tester, which is just a glorified type-erased mutable lambda that tests a condition; here it simply tests whether the future has a value yet. Note that when passing a time point of 0 (std::chrono::system_clock::time_point{}) to wait_until and having optimizations enabled, this call does not block and is significantly faster than future.wait_for(0s). Don’t ask me about the details here.

Why There is no Benefit in Cooperative Multitasking

The only thing that needs to be passed to asyncio::await is an std::future. But what is it that we are waiting for? Usually, applications await IO operations. Timeouts can be implemented using timerfds (see the sketch below); asyncio::await then waits for the timerfd to become readable. Likewise with sockets: a server waits for an incoming connection. When a client connects, the server socket becomes readable and accept is called. Then the server waits for the client to send data (asyncio::await(asyncio::readable(cs));) and echoes the received data back once the client socket has become writable (asyncio::await(asyncio::writable(cs));). There are very few surprises here.
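A timer like simplyfile::Timer can be built on exactly that mechanism. A minimal sketch (assuming Linux; makeTimeoutFd is a hypothetical helper, not part of the repository):

#include <sys/timerfd.h>

// returns a file descriptor that becomes readable after `seconds` seconds
int makeTimeoutFd(long seconds) {
    int fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    itimerspec spec{}; // zero-initialized: it_interval == 0 means one-shot
    spec.it_value.tv_sec = seconds;
    timerfd_settime(fd, 0, &spec, nullptr);
    return fd; // awaiting readability of fd now doubles as a timeout
}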

The point is:
Off the top of my head, anything that can be implemented with asyncio::await relies to some degree on file descriptors. Calls that work on file descriptors have to go through the kernel and might block there. Using two threads instead of cooperative multitasking in the above example does not reduce the application’s performance in any way. Here is why: every point in the code where await is called is followed by an IO call for which the kernel performs a context switch anyway. In the cooperative-multitasking code, just as in a threaded one, read and write are called. In my implementation it is even worse: all futures passed to await are tested before each context switch, and each test might involve a kernel call. This introduces more overhead than a simple implementation using pthreads.

But what if your synchronization needs are like those in the Python example above, where no IO was involved? Well, technically this is the situation where cooperative multitasking shines. BUT: equivalent implementations can be realized with callbacks, eliminating any context switches altogether (see the sketch below).
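A minimal sketch of the callback approach (toy code, assuming a single-threaded setting): an event that stores continuations and invokes them on set() reproduces the behavior of the Python example without any context switch:

#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// a toy single-threaded event: waiters register a continuation
// instead of suspending a task
struct Event {
    bool isSet{false};
    std::vector<std::function<void()>> waiters;

    void wait(std::function<void()> continuation) {
        if (isSet) {
            continuation();
        } else {
            waiters.push_back(std::move(continuation));
        }
    }

    void set() {
        isSet = true;
        for (auto& waiter : waiters) {
            waiter(); // continuations run inline; no context is switched
        }
        waiters.clear();
    }
};

int main() {
    Event event;
    std::cout << "foo start\n";
    event.wait([] { std::cout << "foo end\n"; });
    std::cout << "bar start\n";
    event.set(); // prints "foo end" right here
    std::cout << "bar end\n";
}

The trade-off: everything after the wait has to be packed into a closure, and the continuation runs inline inside set(), so "foo end" now appears before "bar end".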

TL;DR

If you have access to threads: Use them. Don’t expect any speedups from cooperative multitasking.