Most code written today is sequential. What do we mean by the term
sequential or serialized? Simply put, code is executed one instruction
after the next in a monolithic fashion, with no regard to the many possible
resources available to the program. Overall performance can be serverely
degraded if the program performs a blocking call.
Why is it that most programs are sequential? One guess could be the
relative dominance of uniprocessing machines available to programmers.
Multithreading a program on a uniprocessor machine in most cases does not
yield enough performance gains to merit days, weeks, or months worth of
work to thread code. Another guess is that most of us think in a
sequential manner. Parallelizing our thoughts does not come naturally
nor is it an easy task.
However, times have changed and many papers have been written on
multithreading. Some advocate the use of threads, while others do not.
With the increasing popularity of Symmetric-Multiprocessing machines,
programming multithreaded code is a skill worth learning.
We will dive into the world of threads with some a little bit of "theory"
first. We will examine thread synchronization primitives and then a
tutorial on how to use POSIX pthreads. Finally, we will finish off with
thread performance and a brief overview of multiprocess programming.
Isn't that something you put through an eye of a sewing needle?
Yes.
How does it relate to programming then?
Think of sewing needles as the CPUs (or LWPs) and the threads in a program as the fiber. If you had two needles but only one thread, it would take longer to finish the job than if you split the thread into two and used both needles at the same time. Taking this analogy a little further, if one needle had to sew on a button (blocking I/O), the other needle could continue doing other useful work even if the other needle took 4 hours to sew on a single button. If you only used one needle, you would be ~4 hours behind!
Now that we have a real world analogy, let's establish something more concrete. A thread is a sequence of instructions that can be executed in parallel with other threads [wikipedia.com]. They are not processes, but rather lightweight threads of execution. Threads of a program are not full-blown processes, but are smaller portions of the process running concurrently (or in parallel). Hence, the term lightweight is used.
You cannot expect a multithreaded program to run on a kernel that does
not support threads. Fortunately most modern Operating Systems support
threads, either with their own thread library or through POSIX pthreads.
Sun Solaris, FreeBSD, Linux, AIX, HP-UX, IRIX, and Windows NT, just to
name a few, support multithreaded programs. However, each Operating
System uses a different technique to support threads.
Before we can dive into the details of how threads are supported, we need
to get familiarized with a few terms.
Solaris uses the many-to-many model. All CPUs are mapped to any
number of LWPs which are then mapped to any number of threads. The kernel
schedules the LWPs for slices of CPU time.
Linux uses the one-to-one model. Each thread is mapped to a single
LWP. Why is this approach favored in Linux over the many-to-many
model? Linux LWPs are really lightweight and thus LWP creation is not as
expensive as it is in Solaris. Another bonus to make your program
multithreaded rather than multiprocess in Linux is that the scheduler
(2.4) gives a 1 point boost to "processes" scheduled which are in the
same thread family as the currently running process.
Moreover, creating bound or unbound threads can greatly impact performance
of your multithreaded program. There is no general rule when it comes to
using either one. Each scenario demands a thorough analysis to select
the right thread type for the job.
Yes, threads are great... for the right tasks! Don't waste your time multithreading a program that isn't worth multithreading. Sometimes just plain, sequential code can do the job just right.
Now that we have a little basic background on threads, let's discuss how we can correctly use threads that best suits our task at hand.
One thread dispatches other threads to do useful work which are usually part of a worker thread pool. This thread pool is usually pre-allocated before the boss begins dispatching threads to work. Although threads are lightweight, they still incur overhead when they are created. This is the classic and one of the more popular thread models.
The peer model is similar to the boss/worker model except once the worker pool has been created, the boss becomes the another thread in the thread pool, and is thus, a peer to the other threads.
Similar to how pipelining works in a processor, each thread is part of a long chain in a processing factory. Each thread works on data processed by the previous thread and hands it off to the next thread. You must be careful to equally distribute work and take extra steps to ensure non-blocking behavior in this thread model or you could experience pipeline "stalls."
Because most programs are more complex than a matrix-multiplication problem (which can be completely parallelized with speedup close to 100%, assuming you have enough processors) we must synchronize our threads and protect globally shared data across multiple threads.
Enter mutal exclusion. Mutual exclusion is the method of
serializing access to shared resources. You do not want a thread
to be modifying a variable that is already in the process of being
modified by another thread! Another scenario would be a dirty read
where the value is in the process of being updated and another thread
reads an old value.
Mutual exclusion (most often referred to as mutex) allows the
programmer to "attach" locks to resources. If a thread wishes
to modify or read a value from a shared resource, the thread must first
gain the lock. Once it has the lock it may do what it wants with the
shared resource while it has the lock because no other thread should
have access to that variable. Once the thread finishes using the shared
resource, it unlocks the mutex, which allows other threads to
access the resource. This is referred to as serializing access to the
shared resource. You can think of a mutex as a treasure chest, and the
resource it is protecting lies within the chest. Only one person can
have the key to the chest at any time, therefore, is the only person
allowed to look or modify the contents of the chest at that time.
The code between the lock and unlock calls to the mutex, is referred to
as the critical section. Minimizing time spent in the critical
section allows for greater concurrency because it reduces the time other
threads must wait to gain the lock. Therefore, it is important for a
thread programmer to minimize critical sections.
An important problem associated with mutexes is the possibility of
deadlock. A program can deadlock if two (or more) threads have
stopped execution or are spinning permanently. The simplest deadlock
situation: thread 1 locks lock A, thread 2 locks lock B, thread 1 wants
lock B and thread 2 wants lock A. Instant deadlock. You can prevent
this from happening by making sure threads acquire locks in an agreed
order (lock ordering). Deadlock can also happen if threads do
not unlock mutexes properly.
Race conditions occur when multiple threads share data and at least
one of the threads accesses the data without going through a defined
synchronization mechanism (Nichols 203). This could result in erroneous
results even in an inconsistent manner which makes race conditions
particularly difficult to debug. Library calls outside of your program's
control are common culprits. Make sure you take steps within your
program to enforce serial access to shared file descriptors and other
external resources. On most Solaris man pages, you can find out if
your library call is safe to use in reentrant code. Towards the
bottom of the man page, you will see Categories of MT Library Calls.
MT Safe means that the function can be called concurrently
from different threads. MT Hot are "fast" MT Safe functions
(usually not found on man pages). MT Unsafe means that the
function cannot be called concurrently. Alternative means that
there are MT Safe equivalents (e.g. gethostbyname() and
gethostbyname_r()).
Another problem with mutexes is that contention for a mutex can lead to
priority inversion. A higher priority thread can wait behind a lower
priority thread if the lower priority thread holds a lock for which the
higher priority thread is waiting. This can be eliminated/reduced by
limiting the number of shared mutexes between different priority threads.
Mutexes are one method of synchronizing threads, however, there are many other ways.
A popular method of synchronizing multiple threads is through the use of
condition variables. Condition variables allow threads to synchronize to
a value of a shared resource. Condition variables provide a kind of
notification system among threads (Nichols 79).
For example, you could have a global counter, and once it reaches a certain
count a thread activates. The thread that activates once the counter
reaches the limit would wait on the condition variable. Other
threads signal this condition variable if you want threads
waiting/sleeping on this condition variable to wakeup. You can also use
broadcast if you want to signal all threads waiting on
the condition variable to wakeup. This sounds a bit confusing, but the
pthread example below will clarify how condition variables work.
When waiting on condition variables, the wait should be inside a loop, not
in a simple if statement because of spurious wakeups. You are not
guaranteed that if a thread wakes up, it is the result of a signal or
broadcastcall.
It is sometimes useful to allow multiple readers (threads) to read a shared
resource variable without having to wait for locks if no thread is writing
to the resource. With a multiple reader lock this is possible. Readers can
enter the lock and do their business while writers wait until there are no
readers left in the lock. Then the writer can modify the value while
blocking all new readers from entering the lock.
Writer starvation is possible with the many reader/single writer lock
technique. A priority system can eliminate writer starvation. For more
information on r/w locks, see Nichols pgs. 84-89.
Spinlocks are less commonly used at the user-level. Spinlocks are used
frequently in the Linux kernel itself. A spinlock will basically spin on
a mutex. If a thread cannot obtain the mutex, it will keep polling the lock
until it is free. The advantages to this is that if a thread is about to
give up a mutex, you don't have to context switch to another thread. This
situation is a bit tricky because you don't know when this might occur, and
long spin times will result in poor performance.
Spinlocks should never be used on uniprocessor machines.
Why is this?
Semaphores are another type of synchronization primitive that come in two
flavors: binary and counting. Binary semaphores act much like mutexes,
while counting semaphores can behave as recursive mutexes.
Counting semaphores can be initialized to any arbitrary value which
should depend on how many resources you have available for that particular
shared data. Many threads can obtain the lock simultaneously until the
limit is reached. This is referred to as lock depth.
Semaphores are more common in multiprocess programming (i.e. it's usually
used as a synch primitive between processes).
Note: It is assumed that you have a good understanding of the C
programming language. If you do not or need to brush up, please review
basic C (including pointers and dynamic memory allocation). Here are
some resources.
Now that we have a good foundation of thread concepts, lets talk about a
particular thread library, POSIX pthreads. The pthread library can be found
on almost any modern OS.
A few preliminary steps you should take before beginning any pthread
coding is to:
#include <pthread.h> in your .c or .h
header file(s)#define _REENTRANT macro
somewhere in a common .h or .c fileMakefile make sure gcc links against
-lpthread-D_POSIX_PTHREAD_SEMANTICS to
your Makefile (gcc flag)
for certain function calls like sigwait()
Now let's begin our journey into pthreads...
A thread is represented by the type pthread_t. Let's begin
by examining most of the pthread creation and initializing functions:
int pthread_create(pthread_t *thread, pthread_attr_t *attr,
void *(*start_routine)(void *), void *arg);
int pthread_attr_init(pthread_attr_t *attr);
int pthread_mutex_init(pthread_mutex_t *mutex,
const pthread_mutexattr_t *mutexattr);
int pthread_cond_init(pthread_cond_t *cond,
pthread_condattr_t *cond_attr);
pthread_create() example:
pthread_create(&pt_worker, &thread_attributes,
thread_function, (void *)thread_args);
The above will create a pthread pt_worker with thread
attributes defined in thread_attributes (this argument can
be NULL if you want default thread attributes). The thread
code is contained in the function thread_function and is
passed in a arguments stored in thread_args. The
thread_function prototype would look like this:
void *thread_function(void *args);
Immediately after the pthread_create call completes, the
thread_function will begin executing.
pthread_XXXX_init() functions initialize thread attributes,
mutexes, and condition variables. mutexattr and
cond_attr can be NULL if you are using defaults. Mutexes
and condition variables can be initialized to default values using the
INITIALIZER macros as well. For example:
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t count_cond = PTHREAD_COND_INITIALIZER;
To perform locks and unlocks the following functions are available for you:
int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_trylock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex);
pthread_mutex_lock() is a blocking call. If the thread
cannot gain the lock, then the thread will block the thread from
proceeding until it obtains the lock. pthread_mutex_trylock()
will return immediately if the mutex cannot be locked. To unlock
a mutex, simply call pthread_mutex_unlock(). An example
on using Pthread mutexes:
pthread_mutex_lock(&count_lock); count++; pthread_mutex_unlock(&count_lock);
In the above example, we are incrementing the globally shared
count variable. The code between the lock and unlock calls
is the critical section. Always try to minimize this section!
Here are the pthread condition variable function prototypes:
int pthread_cond_wait(pthread_cond_t *cond,
pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
pthread_cond_wait() puts the current thread to sleep. It
requires a mutex of the associated shared resource value it is waiting
on. pthread_cond_signal() signals one thread out of the
possibly many sleeping threads to wakeup.
pthread_cond_broadcast() signals all threads waiting
on the cond condition variable to wakeup. Here is an
example on using pthread condition variables:
pthread_mutex_lock(&count_lock);
while (count < MAX_COUNT) {
pthread_cond_wait(&count_cond, &count_lock);
}
pthread_mutex_unlock(&count_lock);
We are locking the count_lock mutex so we can read the
value of count without entering a potential race condition. Notice how
the we use a while loop instead of an if
statement. This is because of spurious wakeups problem previously mentioned.
Just because a thread has been woken does not mean it was due to a
pthread_cond_signal() or
pthread_cond_broadcast() call. The
pthread_cond_wait() call takes the count mutex and
condition variable. Why does pthread_cond_wait() require
the mutex as well as the conditon variable? It's because it needs to
unlock the mutex when going to sleep or you would potentially enter into
a deadlock! pthread_cond_wait() if awoken, automatically
tries to reacquire the mutex, and will block if it cannot. Using the
signal and broadcast functions is self-explanatory. It is recommended
that you release any locks that other threads could be waiting on
before you signal or broadcast.
Here are some suggestions and issues you should consider when creating using pthreads:
pthread_XXXX_destroy() calls.pthread_join() can be used to wait on other threads
to finish (if the JOINABLE attribute is set). This is useful in creating
barriers or other synchronization windows/points.pthread_attr_setXXXX() functions. scope, schedpolicy, and
detachstate are only some of the useful attributes you can set on your
threads.pthread_kill() can be used to deliver signals to
specific threads.pthread_self() returns a handle on the calling
thread.pthread_once() can be used to ensure that an
initializing function within a thread is only run once.Try google or either of the recommended books below :) Better yet, get your feet wet! Try coding a multithreaded program.
The performance gains from using threads is great, but can it be even better? You should consider the following when analyzing your program for potential bottlenecks:
We have explored the very basics of multithreaded programming. What
about multiprocess programming? For example, Apache for Linux is a
multiprocess program (however Apache for Windows is multithreaded). How
are you supposed to synchronize between processes?
These topics are beyond the scope of this document, but to perform
cross-process synchronization, one would use some form of IPC: pipes,
semaphores, message queues, or shared memory. Of all of the forms of
IPC, shared memory is the fastest (excluding doors). You can use
either POSIX or System V semantics when dealing with cross-process
resource management, IPC, and synchronization.
So you may be thinking. What's the better solution, multithreaded,
multiprocess, or mixed multithreaded/multiprocess? Again it all depends!
Pick the right tool for the task!
It is impossible to cover more than an introduction to threads with this
short tutorial and overview. For more in-depth coverage on threads (like
thread scheduling-classes, thread-specific data (TSD), and thread
cancelling) and pthread programming I recommend these books:
Lewis, Bill and Daniel J. Berg.
Multithreaded Programming with Pthreads. California: Prentice
Hall, 1998.
Nichols, Bradford, et. al. Pthreads Programming. Beijing:
O'Reilly & Associates, Inc., 1998.
GNU Pth (portable threads) is the "next generation" threads library and
may be the future of multithreaded event-driven programming. Take a look
here.
