Simplyfile

August 09, 2019 - lutz

To simplify and RAII-fy Linux file handles, I created a super small C++ library called “simplyfile” some time ago. The library is not particularly special, and you can probably imagine what the code looks like (close on destruction and some moves). However, there are two reasons I prefer dealing with files encapsulated in simplyfile’s handles:

They are closed upon destruction (no shit… it’s RAII).
File handles in Linux are ints, which have arithmetic semantics. That just feels wrong.

Ever since I built the wrappers, they have proved very useful for a number of applications. Therefore I’d like to introduce simplyfile here.
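To make "close on destruction and some moves" concrete, here is a minimal sketch of such a handle. The class name Fd and its members are hypothetical and only illustrate the idea; simplyfile's actual FileDescriptor may look different.

```cpp
#include <cassert>
#include <unistd.h>   // close, pipe
#include <utility>    // std::exchange, std::move

// Hypothetical minimal handle, not simplyfile's actual FileDescriptor.
class Fd {
    int fd_{-1};

public:
    Fd() = default;
    explicit Fd(int fd) noexcept : fd_{fd} {}

    // Copying is forbidden: two owners would close the same descriptor twice.
    Fd(Fd const&) = delete;
    Fd& operator=(Fd const&) = delete;

    // Moving transfers ownership and leaves the source invalid.
    Fd(Fd&& other) noexcept : fd_{std::exchange(other.fd_, -1)} {}
    Fd& operator=(Fd&& other) noexcept {
        if (this != &other) {
            reset();
            fd_ = std::exchange(other.fd_, -1);
        }
        return *this;
    }

    ~Fd() { reset(); }

    bool valid() const noexcept { return fd_ >= 0; }
    int get() const noexcept { return fd_; }  // explicit access instead of int arithmetic

    void reset() noexcept {
        if (fd_ >= 0) {
            ::close(fd_);
            fd_ = -1;
        }
    }
};

// Usage sketch: both pipe ends are closed automatically at scope exit.
inline bool pipeEndsMoveCleanly() {
    int ends[2];
    if (::pipe(ends) != 0) return false;
    Fd reader{ends[0]};
    Fd writer{ends[1]};
    Fd moved{std::move(writer)};
    assert(!writer.valid());  // ownership moved away
    return reader.valid() && moved.valid();
}
```

Because the type is not an int, expressions such as fd + 1 no longer compile, which addresses the second point above.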

Special file types

Quite a number of functions are mapped to file descriptors in the Linux world. You can use file descriptors to realize communication (sockets), semaphores (eventfd), timers (timerfd) and many other things. The point is that a file descriptor does not necessarily refer to a “file” in the sense of something stored on disk. The purpose of most file descriptor types is (IMHO) easily explained:

timerfd:
A file that represents a timer and becomes readable as soon as the timer expires. A read (with a buffer of at least 8 bytes) on the file descriptor returns, as a 64-bit unsigned integer, how often the timer has expired since the last read. If the timer has not expired yet, a read on a non-blocking timerfd fails with EAGAIN; otherwise it blocks until the timer expires.
eventfd:
A file that stores a 64-bit unsigned integer that can be increased (as if it were atomic) by a call to write or decreased by a call to read. Example: if the internal value is 1 and a write with 2 is called, then the value will be 3 when write returns. A read either returns the value of the eventfd and resets it to 0 (this is the default behavior) or returns 1 and decrements the value by 1 if the eventfd was created with EFD_SEMAPHORE. If the value is 0, a read on an eventfd in EFD_NONBLOCK mode fails with EAGAIN; otherwise the read blocks until the value becomes nonzero.
epollfd:
Epoll is very similar to poll in that it allows monitoring of multiple file descriptors. A file descriptor can be added to an epollfd with flags indicating how the monitoring shall behave. As an example, think about a timerfd that was set up to expire in a second and is running. When that timerfd is added to an epollfd (with the flag EPOLLIN) and epoll_wait is then called, the call blocks until the timerfd becomes readable, that is, until the timer has expired. The main difference between poll and epoll is that epoll allows edge-triggered (as well as level-triggered) monitoring of a file’s state. The way epoll is implemented also makes some quite cool multithreaded load balancing applications possible.
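The behavior described above can be tried out with the raw Linux calls. The following sketch uses no simplyfile code at all; the function names are made up for illustration and error handling is kept to a minimum.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <ctime>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>
#include <unistd.h>

// eventfd: a write adds to the counter, a default-mode read returns the
// counter and resets it to 0. Returns the value read, or 0 on failure.
inline std::uint64_t eventfdWriteThenRead() {
    int efd = ::eventfd(1, 0);  // counter starts at 1
    if (efd < 0) return 0;
    std::uint64_t add = 2;
    std::uint64_t value = 0;
    if (::write(efd, &add, sizeof add) != static_cast<ssize_t>(sizeof add) ||
        ::read(efd, &value, sizeof value) != static_cast<ssize_t>(sizeof value)) {
        value = 0;
    }
    ::close(efd);
    return value;
}

// A non-blocking eventfd whose counter is 0: read fails and errno is EAGAIN.
inline bool emptyNonBlockingReadFailsWithEagain() {
    int efd = ::eventfd(0, EFD_NONBLOCK);
    if (efd < 0) return false;
    std::uint64_t value = 0;
    ssize_t n = ::read(efd, &value, sizeof value);
    bool eagain = (n == -1 && errno == EAGAIN);
    ::close(efd);
    return eagain;
}

// timerfd + epoll: arm a one-shot timer, block in epoll_wait until it is
// readable, then read the expiration count. Returns 0 on failure.
inline std::uint64_t waitForOneShotTimer(long millis) {
    assert(millis > 0);  // an all-zero it_value would disarm the timer
    int tfd = ::timerfd_create(CLOCK_MONOTONIC, 0);
    int epfd = ::epoll_create1(0);
    std::uint64_t expirations = 0;
    if (tfd >= 0 && epfd >= 0) {
        itimerspec spec{};  // it_interval stays zero: no repetition
        spec.it_value.tv_sec = millis / 1000;
        spec.it_value.tv_nsec = (millis % 1000) * 1000000L;
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = tfd;
        epoll_event ready{};
        if (::timerfd_settime(tfd, 0, &spec, nullptr) == 0 &&
            ::epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev) == 0 &&
            ::epoll_wait(epfd, &ready, 1, -1) == 1 && ready.data.fd == tfd) {
            if (::read(tfd, &expirations, sizeof expirations) !=
                static_cast<ssize_t>(sizeof expirations)) {
                expirations = 0;
            }
        }
    }
    if (tfd >= 0) ::close(tfd);
    if (epfd >= 0) ::close(epfd);
    return expirations;
}
```

Every descriptor here has to be closed by hand on every path, which is exactly the kind of boilerplate an RAII wrapper removes.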

Anyway, simplyfile comes with convenience functions for all file types it wraps. It should allow for a significant decrease in boilerplate file descriptor code and gives you the certainty of not leaking resources. Two wrappers proved to be especially useful and had the greatest impact on code reduction.

Sockets

Unix already comes with a great deal of helpers that abstract the most common usages of sockets. A call to socket is capable of creating a server socket or a client socket (over TCP, UDP, UNIX domain or something super obscure), and the interface remains the same. However, when searching the internet for example code on how to set up sockets properly, you come across a wild variety of boilerplate C code to get it done. Almost all of that boilerplate deals with setting up host related information, that is, information about how (and where) the socket connects. For that, UNIX comes with the very neat but very underappreciated getaddrinfo. This function looks up how a service can be reached and generates appropriate values to pass to socket. Host names as well as service names (as strings) can be looked up. The resulting host information works for IPv4 as well as for IPv6, so dual stack has never been easier. The only thing getaddrinfo cannot be used for is to create host information for UNIX domain sockets.
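A lookup with getaddrinfo can be sketched in a few lines. The Endpoint struct and the helpers resolve and portOf are hypothetical names chosen for this illustration; they are not part of simplyfile.

```cpp
#include <arpa/inet.h>   // ntohs
#include <cassert>
#include <cstring>       // std::memcpy
#include <netdb.h>       // getaddrinfo, freeaddrinfo
#include <netinet/in.h>  // sockaddr_in, sockaddr_in6
#include <string>
#include <sys/socket.h>
#include <vector>

// Hypothetical value type holding everything socket() and connect() need.
struct Endpoint {
    int family{};
    int socktype{};
    int protocol{};
    sockaddr_storage addr{};
    socklen_t addrlen{};
};

// Resolves node/service into all matching endpoints (IPv4 and IPv6 alike).
// Returns an empty vector if the lookup fails.
inline std::vector<Endpoint> resolve(std::string const& node, std::string const& service,
                                     int socktype = SOCK_STREAM, int family = AF_UNSPEC) {
    addrinfo hints{};
    hints.ai_family = family;
    hints.ai_socktype = socktype;

    std::vector<Endpoint> result;
    addrinfo* list = nullptr;
    if (::getaddrinfo(node.c_str(), service.c_str(), &hints, &list) != 0) {
        return result;
    }
    for (addrinfo const* ai = list; ai != nullptr; ai = ai->ai_next) {
        assert(ai->ai_addrlen <= sizeof(sockaddr_storage));
        Endpoint e;
        e.family = ai->ai_family;
        e.socktype = ai->ai_socktype;
        e.protocol = ai->ai_protocol;
        std::memcpy(&e.addr, ai->ai_addr, ai->ai_addrlen);
        e.addrlen = ai->ai_addrlen;
        result.push_back(e);
    }
    ::freeaddrinfo(list);
    return result;
}

// Port in host byte order, or -1 for non-IP families.
inline int portOf(Endpoint const& e) {
    if (e.family == AF_INET) {
        sockaddr_in in{};
        std::memcpy(&in, &e.addr, sizeof in);
        return ntohs(in.sin_port);
    }
    if (e.family == AF_INET6) {
        sockaddr_in6 in6{};
        std::memcpy(&in6, &e.addr, sizeof in6);
        return ntohs(in6.sin6_port);
    }
    return -1;
}
```

A caller would typically iterate over the result and use the first endpoint for which socket and connect succeed, for example the entries returned by resolve("localhost", "http").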

Simplyfile comes with a wrapper around host related information. This wrapper simply encapsulates the socktype, family, protocol and a struct sockaddr_storage so that all types of sockets work. Simplyfile also provides two helpers to create host related information (basically a wrapper around getaddrinfo and a custom tailored equivalent for UNIX sockets):

std::vector<Host> getHosts(std::string const& node, std::string const& service, int socktype=SOCK_STREAM, int family=AF_UNSPEC);
Host makeUnixDomainHost(std::string const& path, int socktype=SOCK_STREAM);
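For UNIX domain sockets the "custom tailored equivalent" mostly has to fill a sockaddr_un. The following is a guess at how such a helper could work, using a hypothetical UnixHost struct as a stand-in for simplyfile's Host; the real implementation may differ.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>

// Hypothetical stand-in for simplyfile's Host, restricted to AF_UNIX.
struct UnixHost {
    int family{AF_UNIX};
    int socktype{SOCK_STREAM};
    int protocol{0};
    sockaddr_storage addr{};
    socklen_t addrlen{0};
};

// Builds the address information for a UNIX domain socket at `path`.
inline UnixHost makeUnixHost(std::string const& path, int socktype = SOCK_STREAM) {
    static_assert(sizeof(sockaddr_un) <= sizeof(sockaddr_storage),
                  "sockaddr_storage must be able to hold a sockaddr_un");
    sockaddr_un un{};
    un.sun_family = AF_UNIX;
    if (path.size() >= sizeof un.sun_path) {
        throw std::length_error("socket path too long");
    }
    std::memcpy(un.sun_path, path.c_str(), path.size() + 1);  // include the terminating '\0'

    UnixHost host;
    host.socktype = socktype;
    std::memcpy(&host.addr, &un, sizeof un);
    host.addrlen = sizeof un;
    assert(host.addrlen <= sizeof host.addr);
    return host;
}

// The three integer fields map directly onto the arguments of socket().
inline int openSocket(UnixHost const& host) {
    return ::socket(host.family, host.socktype, host.protocol);
}
```

Storing the address in a sockaddr_storage is what lets one type cover IPv4, IPv6 and UNIX domain addresses alike.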

The actual sockets are then created simply by passing a Host to the constructor.

struct ClientSocket : FileDescriptor {
	using FileDescriptor::FileDescriptor;

	ClientSocket(Host const& host);
[...]
	void connect();
[...]
};

struct ServerSocket : FileDescriptor {
	ServerSocket(Host const& host, bool reusePort=true);
[...]
	ClientSocket accept();
	void listen();
};

With that, a server is very easily implemented. Here is an example of an echo server that serves a single client at a time.

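simplyfile::ServerSocket ss{simplyfile::makeUnixDomainHost("mysock")};
ss.listen();
std::vector<char> buf(4096);  // receive buffer; the size is an arbitrary choice
while (true) {
    simplyfile::ClientSocket cs = ss.accept();
    while (true) {
        auto r = read(cs, buf.data(), buf.size());
        if (r <= 0) {
            break;  // 0: the client closed the connection, -1: read error
        }
        // write may send fewer bytes than requested, so continue after the part already sent
        ssize_t sent = 0;
        while (sent < r) {
            auto written = write(cs, buf.data() + sent, r - sent);
            if (written == -1) {
                break;
            }
            sent += written;
        }
        if (sent < r) {
            break;  // write failed: drop this client and accept the next one
        }
    }
}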

The client side can be implemented just as easily:

simplyfile::ClientSocket cs{simplyfile::makeUnixDomainHost("mysock")};
cs.connect();
std::string buf = "hallo Welt";
write(cs, buf.data(), buf.size());
auto r = read(cs, buf.data(), buf.size());
if (r > 0) {  // -1 signals an error, 0 a closed connection
    buf.resize(r);
    std::cout << buf << std::endl;
}

Epoll

While implementing the wrapper for epollfd I realized that using epoll follows a rather simple pattern: add a file descriptor with flags to be monitored, have a thread wait for any file descriptor to become ready, and then process it. The processing part is easily realized by callbacks. What remained to be implemented was to have the wrapper manage a std::map that remembers the association of file descriptors to callbacks. There is very little magic going on here.
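The pattern can be sketched as follows. MiniEpoll is a hypothetical, single-threaded illustration of the idea; simplyfile's Epoll has a different interface (for example the second argument of rmFD) and has to deal with concurrent callers.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <sys/epoll.h>
#include <unistd.h>
#include <utility>

// Hypothetical single-threaded sketch, not simplyfile's actual Epoll class.
class MiniEpoll {
    int epfd_;
    std::map<int, std::function<void(int)>> callbacks_;  // fd -> callback

public:
    MiniEpoll() : epfd_{::epoll_create1(EPOLL_CLOEXEC)} {
        if (epfd_ < 0) throw std::runtime_error("epoll_create1 failed");
    }
    MiniEpoll(MiniEpoll const&) = delete;
    MiniEpoll& operator=(MiniEpoll const&) = delete;
    ~MiniEpoll() { ::close(epfd_); }

    // Starts monitoring fd and remembers which callback belongs to it.
    void addFD(int fd, std::function<void(int)> cb, unsigned events = EPOLLIN) {
        assert(static_cast<bool>(cb));
        epoll_event ev{};
        ev.events = events;
        ev.data.fd = fd;
        if (::epoll_ctl(epfd_, EPOLL_CTL_ADD, fd, &ev) != 0) {
            throw std::runtime_error("epoll_ctl(ADD) failed");
        }
        callbacks_[fd] = std::move(cb);
    }

    // Stops monitoring fd; must happen before fd is closed.
    void rmFD(int fd) {
        ::epoll_ctl(epfd_, EPOLL_CTL_DEL, fd, nullptr);
        callbacks_.erase(fd);
    }

    // Waits for one ready descriptor and dispatches its callback.
    // Returns the number of callbacks invoked (0 on timeout or error).
    int work(int timeoutMs = -1) {
        epoll_event ev{};
        if (::epoll_wait(epfd_, &ev, 1, timeoutMs) != 1) return 0;
        auto it = callbacks_.find(ev.data.fd);
        if (it == callbacks_.end()) return 0;
        auto cb = it->second;  // copy, so the callback may call rmFD on itself
        cb(static_cast<int>(ev.events));
        return 1;
    }
};
```

The callback receives the event flags reported by epoll_wait, so it can distinguish EPOLLIN from EPOLLERR the same way the server example further down does.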

However, epoll is a very nice complement to the sockets as it easily allows managing multiple connections. The above server example can be improved into a server that handles an arbitrary number of clients like so:

simplyfile::ServerSocket ss{simplyfile::makeUnixDomainHost("mysock")};
simplyfile::Epoll epoll;
std::map<int, simplyfile::ClientSocket> clients;
bool work{true};
epoll.addFD(ss, [&](int){
    // client connected
    simplyfile::ClientSocket _cs = ss.accept();
    int cs = static_cast<int>(_cs);
    clients.emplace(cs, std::move(_cs));
    epoll.addFD(cs, [&, cs](int flags) {
        if (flags & EPOLLERR) {
            epoll.rmFD(cs, false);
            clients.erase(cs);
            return;
        }
        std::vector<std::byte> buf(4096);
        auto r = read(cs, buf.data(), buf.size());
        if (r <= 0) {  // -1: error, 0: the client closed the connection
            epoll.rmFD(cs, false);
            clients.erase(cs);
            return;
        }
        ssize_t sent = 0;
        while (sent < r) {  // write may send fewer bytes than requested
            auto written = write(cs, buf.data() + sent, r - sent);
            if (written == -1) {
                epoll.rmFD(cs, false);
                clients.erase(cs);
                return;
            }
            sent += written;
        }
    }, EPOLLIN|EPOLLET);
}, EPOLLIN|EPOLLET);

ss.listen();
while (true) {
    epoll.work();
}

The call to epoll.work() contains the call to epoll_wait and the dispatch logic. In case an error happened (or when a file is about to be closed), the file descriptor needs to be removed from epoll’s watch list before it is closed (the documentation emphasizes this). For that purpose the wrapper offers Epoll::rmFD. The bool argument indicates whether Epoll::rmFD shall block until all active calls to the callback have finished. When a callback removes itself, it (obviously) should not wait for its own termination, hence the second parameter is false.

Also please note that in this example a std::map stores all client connections. This is because there is no std::function-like wrapper for move-only lambdas, so a callback cannot own its ClientSocket. To still have a single point of storage for the connections, a std::map is used.
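The restriction can be reproduced in isolation. In this sketch Connection is a hypothetical stand-in for a move-only socket handle and ownerInMapWorkaround is a made-up name; the point is only that a lambda owning a move-only object is not copyable, while a lambda that merely refers to the owning map is.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <type_traits>

// Stand-in for a move-only resource such as a socket handle.
struct Connection {
    int id;
};

inline int ownerInMapWorkaround() {
    // A lambda that owns a move-only object is itself move-only.
    auto owning = [c = std::make_unique<Connection>(Connection{7})] { return c->id; };
    static_assert(!std::is_copy_constructible<decltype(owning)>::value,
                  "a lambda capturing a unique_ptr by value cannot be copied");
    // std::function<int()> f{std::move(owning)};
    // The line above is rejected: std::function requires a copy-constructible callable.
    assert(owning() == 7);

    // Workaround: the map owns the resource, the callback only captures the key
    // and a reference to the map, so it stays copyable.
    std::map<int, std::unique_ptr<Connection>> connections;
    connections.emplace(7, std::make_unique<Connection>(Connection{7}));
    std::function<int()> callback = [&connections, key = 7] {
        return connections.at(key)->id;
    };
    std::function<int()> copy = callback;  // copying works
    return copy();
}
```

The price of the workaround is that the map has to outlive every callback that refers to it, which holds in the server example because both live in the same scope.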

Eventually I realized that the true power of epoll comes from the behavior of edge-triggered monitoring in multithreaded environments. If multiple threads are waiting for an edge-triggered file descriptor to become ready, then only one thread is woken up per edge. If no thread is waiting when an edge happens, then the next call to epoll_wait returns the file descriptor on which that edge occurred. Some file descriptors exhibit special behaviors when used with epoll. This might appear weird because it is sparsely documented (if at all!).

The server example can be made multithreaded like so:

std::vector<std::future<void>> futs;
for (int i{0}; i < 5; ++i) {
    futs.emplace_back(std::async(std::launch::async, [&]{
        while (true) {
            epoll.work();
        }
    }));
}

Or, stated simply: call epoll.work() from as many threads as you like, or as many as your platform allows. The actual workload is automatically distributed among the threads by epoll’s scheduling policy.
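The sharing of work between threads can be observed without any sockets. The sketch below uses the raw Linux calls and a made-up function shareJobs: several threads wait on one epollfd that monitors an edge-triggered eventfd in semaphore mode, and every token written to the eventfd is consumed by exactly one of them. How the tokens are distributed among the threads is up to the kernel and not predictable.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Posts `jobs` tokens to an eventfd and lets `workers` threads consume them
// through one shared epollfd. Returns how many tokens each thread handled.
inline std::vector<int> shareJobs(int workers, int jobs) {
    assert(workers > 0 && jobs >= 0);
    std::vector<int> perThread(static_cast<std::size_t>(workers), 0);
    int efd = ::eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    int epfd = ::epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;  // edge-triggered
    ev.data.fd = efd;
    if (efd < 0 || epfd < 0 || ::epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) != 0) {
        if (efd >= 0) ::close(efd);
        if (epfd >= 0) ::close(epfd);
        return perThread;
    }

    std::atomic<int> posted{0};
    std::atomic<int> handled{0};
    std::atomic<bool> postingDone{false};
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&, i] {
            while (!postingDone.load() || handled.load() < posted.load()) {
                epoll_event ready{};
                // The 10 ms timeout only exists so the loop can re-check its exit condition.
                if (::epoll_wait(epfd, &ready, 1, 10) != 1) continue;
                std::uint64_t token = 0;
                // Edge-triggered: drain until read fails with EAGAIN.
                while (::read(efd, &token, sizeof token) ==
                       static_cast<ssize_t>(sizeof token)) {
                    ++perThread[static_cast<std::size_t>(i)];
                    ++handled;
                }
            }
        });
    }

    std::uint64_t one = 1;
    for (int j = 0; j < jobs; ++j) {
        if (::write(efd, &one, sizeof one) == static_cast<ssize_t>(sizeof one)) {
            ++posted;
        }
    }
    postingDone = true;
    for (auto& t : pool) t.join();
    ::close(epfd);
    ::close(efd);
    return perThread;
}
```

Calling shareJobs(4, 200) returns four counters whose sum is 200; the individual values differ from run to run.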

The epoll wrapper started as a utility to ease the implementation of servers but ended up being a very versatile tool for building a powerful load balancer for entire applications.

A word about Buffering

Simplyfile does no buffering whatsoever. The reason is that libc already comes with very reasonable buffering strategies in fread and fwrite. In the context of simplyfile those mechanisms were ignored, since a reimplementation would not benefit anybody and simplyfile actually focuses on file descriptors that do not represent classical files (i.e., files on disk).