I recently taught myself some concurrent programming as part of my operating systems studies. In this essay, I will illustrate it using an example problem.
I will be illustrating concurrent programming using operating system primitives. [1] There are several ways to do concurrent programming that do not involve interacting with the operating system kernel, and even if we restrict ourselves to kernel primitives, there are many variants. I will mainly be concerned with concurrency using mutexes and condition variables. In the literature, this is sometimes called concurrent programming using shared objects.
I developed this program on a Vagrant VM using VirtualBox, running FreeBSD 10.2. Even though I have a lot more experience with Linux, [2] I am more comfortable using FreeBSD for this, despite my limited experience with it. There are a couple of reasons for that.
I am choosing a contrived problem so that we can focus on the concurrency without getting distracted by incidental details. The problem is this.
We have a fixed number of integers. Our objective is to calculate the Fibonacci number of each of these integers. [3] The constraint is that, for each integer, we calculate the Fibonacci number exactly once.
We can solve this problem in many different ways. A straightforward way is to put all the integers in an array, loop over the array, and calculate all the Fibonacci numbers. This is fine, but we won't be taking advantage of multiple cores. [4] If we make our program concurrent using threads, it might increase its throughput, the number of jobs processed per second.
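For concreteness, a sequential version might look roughly like this. It is only a sketch: the job values are arbitrary, and fib here is the naive recursive definition.

    #include <stdio.h>

    /* Naive recursive Fibonacci; deliberately slow for larger inputs. */
    int fib(int n)
    {
        if (n < 2)
            return n;
        return fib(n - 1) + fib(n - 2);
    }

    int main(void)
    {
        int jobs[] = {1, 5, 10, 20, 32};
        int n_jobs = sizeof(jobs) / sizeof(jobs[0]);
        int i;

        /* Each integer is processed exactly once. */
        for (i = 0; i < n_jobs; i++)
            printf("fib(%d) = %d\n", jobs[i], fib(jobs[i]));
        return 0;
    }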
Concurrent programming is hard because the data structures are shared between multiple threads. We need to ensure that the threads access the shared data structures safely without causing any inconsistency or corruption.
Since the shared data structure is the central problem here, we will tackle that first. What data structure should we choose for this problem? Remember, the problem is that we have integers, and each integer has to be processed exactly once.
For this problem, we can use a simple circular queue. We can have producer threads enqueuing integers and consumer threads dequeuing integers from the queue and processing them.
The struct for a sequential queue looks like this. [5][6]
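Here is a minimal sketch; the capacity and the field names are mine, chosen for illustration.

    #define QUEUE_CAPACITY 16

    struct queue {
        int elements[QUEUE_CAPACITY]; /* circular buffer storage */
        int front;                    /* index of the next element to dequeue */
        int rear;                     /* index where the next element goes */
        int size;                     /* current number of elements */
    };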
Of course, we will have problems with this struct if multiple threads access it simultaneously. We will address this later.
The code that inserts and deletes elements from this queue, mutating its state variables, would look like this.
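A sketch of those functions, using the struct above. Callers are expected to check the predicates before inserting or deleting; error handling is omitted.

    int is_queue_full(struct queue *queue_object)
    {
        return queue_object->size == QUEUE_CAPACITY;
    }

    int is_queue_empty(struct queue *queue_object)
    {
        return queue_object->size == 0;
    }

    void enqueue(struct queue *queue_object, int element)
    {
        queue_object->elements[queue_object->rear] = element;
        queue_object->rear = (queue_object->rear + 1) % QUEUE_CAPACITY;
        queue_object->size++;
    }

    int dequeue(struct queue *queue_object)
    {
        int element = queue_object->elements[queue_object->front];
        queue_object->front = (queue_object->front + 1) % QUEUE_CAPACITY;
        queue_object->size--;
        return element;
    }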
Now, it is very important that all state changes happen inside these functions and nowhere else. It’s the responsibility of these functions to insert and delete elements and to make the state changes (for example, updating the front and rear fields) safely. For simplicity, I will call these functions “public functions” from now on. [7]
Also, note that we have two simple predicate functions, is_queue_full and is_queue_empty. They will come in handy in a bit.
The requirements of concurrency are as follows. We will refine these as we go along.

1. Threads must access the shared data structure under mutual exclusion.
2. A thread that cannot make progress must be able to wait.
3. Waiting threads must be woken up when they can make progress again.
First, we will look at how to create and manipulate threads. The standard function to create a thread in C is pthread_create. It is used like this.
Here, thread_id is a variable of type pthread_t, thread_attr is a variable of type pthread_attr_t, thread_routine is a function that takes an argument of type void * and returns a void *, and thread_arg is a variable of type void * that is passed to the thread_routine function.
Let’s take an example. Suppose we want to execute a function fib in a separate thread. The declaration of fib looks like this.
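In its simplest form, something like:

    int fib(int n);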
Suppose we want to calculate fib(3). We will call it in a separate thread like this.
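A sketch; the explicit casts just spell out the coercions discussed below.

    pthread_t thread_id;

    /* Cast fib and its argument to the types pthread_create expects. */
    pthread_create(&thread_id, NULL, (void *(*)(void *))fib, (void *)3);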
Notice that the second argument, which was supposed to be a pointer to a pthread_attr_t, is NULL. This is what we do when we want default attributes. Also, notice that the pthread_create spec says we should pass a function that takes a void * and returns a void *, and I am instead passing a function that takes an int and returns an int. I want to keep things simple, so I am relying on the compiler to do the type coercions for me. In my experience, these particular coercions are safe and do not lead to any corruption or information loss. The same holds for the fourth argument, thread_arg.
Suppose in our main function, we create a thread that executes fib. It will look like this.
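A minimal sketch:

    int main(void)
    {
        pthread_t thread_id;

        pthread_create(&thread_id, NULL, (void *(*)(void *))fib, (void *)3);
        return 0; /* main may return before the fib thread has run */
    }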
The problem is that main can return before the thread finishes execution. To wait for a thread to finish, we use the function pthread_join. pthread_join returns immediately if the thread has finished its execution. Otherwise, it blocks until the thread finishes, and then it returns. By extension, if you create n separate threads, you can wait for all n of them to complete by calling pthread_join on all the thread ids. This is what I did for all the producer and consumer threads.
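For instance, a sketch with an illustrative thread count, reusing the thread_routine name from above:

    #define N_THREADS 4

    pthread_t thread_ids[N_THREADS];
    int i;

    for (i = 0; i < N_THREADS; i++)
        pthread_create(&thread_ids[i], NULL, thread_routine, NULL);

    /* Block until every thread has finished. */
    for (i = 0; i < N_THREADS; i++)
        pthread_join(thread_ids[i], NULL);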
There is no guarantee that a thread created by pthread_create will start immediately. Whenever we call pthread_create, the operating system puts the thread into something called the ready list of the scheduler. The kernel scheduler then picks a thread to run from the ready list based on some scheduling policy. So, pthread_create creates a thread and puts it into the ready list to be run later.
Now let’s get back to our queue struct. We want to make some changes so that this data structure can be used by multiple threads safely. Let’s revisit our criteria for concurrency: mutual exclusion, waiting when a thread cannot make progress, and waking up waiting threads.
Now let’s discuss these requirements in more detail and implement them.
The primitive that provides mutual exclusion is called the mutex. It’s a variable of type pthread_mutex_t. We will consider two functions that operate on mutexes: pthread_mutex_lock and pthread_mutex_unlock. They are used like this.
Now, /* Some code */ can only be executed by the one thread that has called pthread_mutex_lock and hasn’t called pthread_mutex_unlock yet. If a thread calls pthread_mutex_lock while another thread holds the mutex, the calling thread is stopped and put into a waitlist associated with my_mutex. Once my_mutex becomes free, one thread is popped from the waitlist of my_mutex and made available to run by putting it back into the ready list.
Our queue struct will look like this now. We add a new field called mutex to the struct.
We have two functions that modify the state variables of the queue: enqueue and dequeue. There is no state change outside of these functions. So we acquire the lock at the beginning of these functions and release it at the end. We modify our enqueue and dequeue functions like this.
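A sketch of the locked versions:

    void enqueue(struct queue *queue_object, int element)
    {
        pthread_mutex_lock(&queue_object->mutex);
        queue_object->elements[queue_object->rear] = element;
        queue_object->rear = (queue_object->rear + 1) % QUEUE_CAPACITY;
        queue_object->size++;
        pthread_mutex_unlock(&queue_object->mutex);
    }

    int dequeue(struct queue *queue_object)
    {
        pthread_mutex_lock(&queue_object->mutex);
        int element = queue_object->elements[queue_object->front];
        queue_object->front = (queue_object->front + 1) % QUEUE_CAPACITY;
        queue_object->size--;
        pthread_mutex_unlock(&queue_object->mutex);
        return element;
    }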
This is where restricting state changes to these functions starts paying off: we only need to worry about mutual exclusion in these functions. It is very important that we acquire the lock at the beginning and release it at the end of these functions. That is, don’t acquire and release the locks anywhere else in the code.
You might think that this is inefficient and coarse-grained, or that holding locks like this will increase lock contention (“It changes state only in those three lines, so let’s just wrap those inside the lock”). The problem with that approach is that it makes it very hard to understand and modify existing code safely. Furthermore, it is not inefficient and won’t lead to lock contention, as we will see now.
Suppose a thread is executing enqueue and finds that the queue is full. What should we do here?
We should give up the mutex we are holding and go to the waiting state. The object that helps us achieve this is called a condition variable.
Condition variables give us the mechanism to atomically release the lock and go into the waiting state. A condition variable is a variable of type pthread_cond_t. Each condition variable has a waitlist of threads associated with it.
We have a function, pthread_cond_wait, that takes a mutex and a condition variable as arguments, releases the mutex, and puts the running thread on the waitlist of the condition variable. The greatest value provided by pthread_cond_wait is that it does these two operations atomically.
So if we have code like this.
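A sketch; code_mutex and code_cond_var are assumed to be initialized elsewhere.

    pthread_mutex_lock(&code_mutex);
    pthread_cond_wait(&code_cond_var, &code_mutex);
    pthread_mutex_unlock(&code_mutex);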
The thread running this snippet will give up code_mutex and go to the waitlist of code_cond_var.
When would we want to call pthread_cond_wait? We want to call it whenever we can’t make progress. Whenever we design a shared object and decide on its condition variables, we ask this question: when can a thread wait? For our problem, the answers are:

1) A thread calls enqueue and the queue is full.
2) A thread calls dequeue and the queue is empty.

Based on that, we need two condition variables, queue_has_space and queue_has_element, for 1) and 2) respectively.
Our queue struct looks like this now.
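A sketch, extending the earlier struct once more:

    struct queue {
        int elements[QUEUE_CAPACITY];
        int front;
        int rear;
        int size;
        pthread_mutex_t mutex;
        pthread_cond_t queue_has_space;   /* wait here when the queue is full */
        pthread_cond_t queue_has_element; /* wait here when the queue is empty */
    };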
We added two new fields of type pthread_cond_t: queue_has_space and queue_has_element. Our state-changing functions look like this now.
Let’s look at enqueue. pthread_cond_wait atomically releases queue_object->mutex and goes to the waitlist of queue_object->queue_has_space. It won’t return until we choose to wake up the threads in the waitlist of queue_object->queue_has_space, and when it does return, it has reacquired the mutex. We will discuss how to wake up waiting threads in the next section.
Notice that we wrapped the pthread_cond_wait calls inside a while loop. This is important: we should always wait on a condition variable inside a while loop. This is because when we eventually return from pthread_cond_wait, there is no guarantee that the condition will still allow us to make progress. Let’s consider an example.
Suppose we only have space for one element in our queue, and we have an enqueue thread E1 and two dequeue threads D1 and D2. D1 runs first, sees that there is no element to dequeue, releases the lock, and goes to the waitlist. Now E1 acquires the lock and fills up the queue with one element. Suppose that after that, instead of D1, D2 runs and dequeues the element, and then we switch to D1. If we have our pthread_cond_wait inside a while loop, we will check the condition again and go back to the waitlist. If instead we had our pthread_cond_wait call inside an if statement, we would have progressed further and might have caused data corruption.
Finally, we want to notify the threads in the waitlist of a condition variable when they can make progress. For example, after a dequeue, there is a chance that some enqueue thread was waiting for space. For this, we use the function pthread_cond_signal. It wakes up one thread from the waitlist of a condition variable and puts it into the ready list. [8] Again, as above, there is no guarantee that the signaled thread will run immediately. It is put into the ready list, and it’s the scheduler’s job to run it according to the scheduling policy. It’s called like this.
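For example, after a successful dequeue:

    pthread_cond_signal(&queue_object->queue_has_space);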
Our enqueue and dequeue functions will look like this now.
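A sketch of the final versions. I signal while still holding the mutex; signaling after unlocking is also legal, but this ordering is easier to reason about.

    void enqueue(struct queue *queue_object, int element)
    {
        pthread_mutex_lock(&queue_object->mutex);
        while (is_queue_full(queue_object))
            pthread_cond_wait(&queue_object->queue_has_space, &queue_object->mutex);
        queue_object->elements[queue_object->rear] = element;
        queue_object->rear = (queue_object->rear + 1) % QUEUE_CAPACITY;
        queue_object->size++;
        /* An element is now available; hint a waiting dequeuer. */
        pthread_cond_signal(&queue_object->queue_has_element);
        pthread_mutex_unlock(&queue_object->mutex);
    }

    int dequeue(struct queue *queue_object)
    {
        pthread_mutex_lock(&queue_object->mutex);
        while (is_queue_empty(queue_object))
            pthread_cond_wait(&queue_object->queue_has_element, &queue_object->mutex);
        int element = queue_object->elements[queue_object->front];
        queue_object->front = (queue_object->front + 1) % QUEUE_CAPACITY;
        queue_object->size--;
        /* Space is now available; hint a waiting enqueuer. */
        pthread_cond_signal(&queue_object->queue_has_space);
        pthread_mutex_unlock(&queue_object->mutex);
        return element;
    }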
Concurrent programming has a reputation for requiring us to reason about the global structure of the program. If we wrap our condition variable checks inside a while loop, as we did above, we avoid some of that. In our case, pthread_cond_signal can be regarded as a hint for other threads to check whether they can run again.
Thus, we can liberally call pthread_cond_signal whenever we think we have changed something that might allow others to progress, and trust that they will check the condition again on their end.
The full implementation can be found here.
Thread programming using locks has gotten a bad reputation in recent years. Especially for I/O concurrency: if you propose a solution that involves threads, others will probably think you have gone nuts. Or maybe fire you while they are at it.
However, I still think I would use threads even for I/O concurrency. One reason is portability: the event notification interfaces differ from platform to platform, from epoll to kqueue. [9] I think this whole sentiment that threads suck originated from Dan Kegel’s landmark C10K paper. [10] It has been updated from time to time, but I think kernel implementations have come a long way since it was written. For example, how much do you think it costs to acquire an uncontended lock? It can be as fast as 10 instructions. In some implementations, it is as low as 2 instructions. [11]
What about processes? fork is definitely expensive, right? After all, you have to copy the parent’s entire address space for the child process. Well, with Copy-On-Write (COW) implementations, [12] you don’t even copy those pages until one of the two processes writes to them.
It seems to me that despite all the modernity of developers (using the latest frameworks, latest libraries, etc.), they are still programming with outdated ideas about the platforms their applications run on.
That’s for I/O concurrency. For CPU concurrency, I don’t think there is any debate. Thread programming is the way to go.
However, these abstractions of mutexes and condition variables were developed for operating system purposes. Are they suitable for application programs? I am not sure. For example, Erlang doesn’t build its concurrency model on operating system threads or processes, and that model sounds promising. There have been successful companies and projects built on it. I would love to learn Erlang, and maybe after that I will conclude that all these locks, condition variables, etc. are not good abstractions for normal programs. Or maybe I will conclude that there is a place for both kinds of abstractions. I am certainly looking forward to that.
Incidentally, while I was writing this program, I spent much more time implementing the sequential queue than the concurrency aspects. It is pretty clear to me how much I suck at data structures and algorithms. In fact, until last year, I used to think of algorithms as a topic people study just to crack job interviews. Oh well, better late than never.
If you are interested, these are good books to learn more about the topics covered here.
[1] By kernel primitives, I mean C standard library functions that are
interfaces to system calls.
[2] I basically used it in all my previous projects, official or personal.
[3] The Fibonacci function is useful for getting a wide range of computation times. For example, there’s a noticeable time difference between calculating the Fibonacci of 1 and the Fibonacci of 32. I got this idea from one of David Beazley’s talks.
[4] There are a number of gotchas with using synchronization variables on multiple cores as well. But I will delve into that in some other essay.
[5] It’s often said that you cannot turn a sequential program into a concurrent one without changing the program design in a major way. I haven’t faced that problem here.
[6] I am using C here. I have omitted details about memory allocation to focus on the ideas. I will give a link to the full implementation at the end.
[7] If you are using an object-oriented language, you can think of it as a queue class having a set of public methods and private state variables.
[8] There is another function, pthread_cond_broadcast, that wakes up all the threads in the waiting list at once. We won’t be using it for this problem, but it is useful in some problems, for example in a Readers-Writers lock implementation.
[9] I really like kqueue, though, and I am looking forward to learning and using it.
[10] http://www.kegel.com/c10k.html
[11] I have not been able to find the specific code in FreeBSD that achieves this. However, the basic idea is that if the lock is uncontended, we don’t have to go through the scheduler, acquire the scheduler’s spinlock, or disable interrupts in order to acquire the lock. If the lock is busy, though, we pay a scheduling overhead.
[12] They might be pretty old now.