Anything Technical

Aug 09, 2015

[intro-to] Bash Parallel Programming

Parallel programming basically means writing scripts that can process multiple tasks at the same time. This is in contrast to running tasks in serial order, which requires each task to be completed before moving on to the next.

Parallelization can be useful for boosting performance, but it requires the tasks being parallelized to be independent of one another. Throughout this tutorial, we will treat the function do-something as a task that can safely run in parallel. For didactic purposes, we define do-something as follows:

# sleep for the given number of seconds, then print it
function do-something(){
    sleep "$1"
    echo "$1"
}

There are several ways of parallelizing a bash script, and each has its own pros and cons.

1.

for i in $(seq 5 -1 1); do
    do-something "$i" &
done

This is the simplest form of parallelization in bash. The trailing & forks each do-something call into the background, so the loop moves on to the next iteration immediately instead of waiting for the task to finish.
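As a side note, every time a task is sent to the background with &, bash records its process ID in the special variable $!, and wait can also be given a specific PID. Here is a minimal sketch of that mechanism (the pid variable name is just for illustration):

do-something 3 &      # fork a single task into the background
pid=$!                # $! holds the PID of the most recent background job
echo "started job $pid"
wait "$pid"           # block until that specific job finishes
echo "job $pid done"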

If you run the parallel loop above, your output should look something like this:

1
2
3
4
5

Even though our loop counts down from 5 to 1, all five tasks start at essentially the same time thanks to parallelization, so the do-something task with the shortest sleep finishes first.

Also, note the speed improvement. The serial version runs in 5+4+3+2+1=15 seconds, while the parallel version should theoretically finish in about 5 seconds, the length of the longest task. That's a 3x performance improvement just by attaching &!
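For reference, the serial version being compared against would look something like this (a minimal sketch that simply drops the &):

# serial version: each task must finish before the next one starts
for i in $(seq 5 -1 1); do
    do-something "$i"
done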

However, there is one critical problem. Imagine that, instead of looping through 5 iterations, your script has to process 5 billion different tasks, and do-something is far more intensive. When you run your script, it will attempt to launch all of those tasks at once, which will easily overload the machine. Our next method helps prevent that.

ADVANTAGE: simple

DISADVANTAGE: can potentially overload the system

2.

wait_turn=4
j=1
for i in $(seq 10 -1 1); do
    do-something "$i" &
    # after every wait_turn tasks, pause until the current batch finishes
    if ((j++ % wait_turn == 0)); then
        wait
    fi
done
wait
echo "after loop"

Running this, our result should look something like this:

7
8
9
10
3
4
5
6
1
2
after loop

Can you see the pattern?

Once wait is called, the shell "waits" until all background tasks have finished. In this case, a batch of four tasks is forked into the background at a time, as specified by wait_turn, and the script waits until all four processes are complete.

By doing this, we limit the number of tasks that run at any one time. While this limits how quickly all the tasks can be processed, it won't take down your machine.

The value of wait_turn has been chosen arbitrarily, so you can fiddle with it to find the right number for your own script. Usually, the value is a multiple of the number of cores you have.
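If you would rather tie wait_turn to the hardware than hard-code it, one option is to derive it from the core count. This is a small sketch assuming a system where nproc (GNU coreutils) is available; getconf _NPROCESSORS_ONLN is a more portable alternative:

# size each batch relative to the number of CPU cores
cores=$(nproc)            # or: getconf _NPROCESSORS_ONLN
wait_turn=$((cores * 2))  # arbitrary multiple; tune for your workload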

A keen observer might have noticed the second wait after the loop. To illustrate why it is needed, consider a similar script without the after-loop wait:

wait_turn=4
j=1
for i in $(seq 10 -1 1); do
    do-something "$i" &
    if ((j++ % wait_turn == 0)); then
        wait
    fi
done
echo "after loop"

Before scrolling down, think about what this would output.

7
8
9
10
3
4
5
6
after loop
1
2

Notice how our "after loop" string is printed before do-something 1 and do-something 2 have finished. This could be critical if the code after the loop needs the for-loop to have completely finished. Likewise, add a wait after the for-loop in the first method if necessary.

Yet, even this method can be restrictive. Consider a case where task completion times vary greatly. For our example above, say every fifth task (do-something 5, do-something 10), for some odd reason, takes 5 or 10 minutes to complete instead of a mere 5 or 10 seconds. If we run our script again under this new condition, it becomes very inefficient: in the first batch, even though tasks 7, 8, and 9 are already completed, the script still has to wait for a single task, do-something 10, to finish. Likewise, in the second batch, the processes that handled tasks 3, 4, and 6 sit idle because the script cannot move forward until do-something 5 completes. In other words, if a batch contains an unusually slow task, that task becomes the bottleneck and slows down the entire process, almost defeating the purpose of parallelism.
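If you want to reproduce this bottleneck yourself, one way is to swap in a variant of do-something in which every fifth task sleeps for minutes instead of seconds (a hypothetical sketch, not part of the original setup):

# variant of do-something: every fifth task is dramatically slower
function do-something(){
    if (( $1 % 5 == 0 )); then
        sleep $(( $1 * 60 ))   # tasks 5 and 10 take 5 and 10 minutes
    else
        sleep "$1"
    fi
    echo "$1"
}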

ADVANTAGE: parallelization without overloading the system

DISADVANTAGE: bottlenecks occur and can potentially slow down the entire process

3.

Fortunately, UNIX/Linux comes with a command called xargs which whips those idling slots back to work. xargs is not specifically designed for parallelization, but it comes with a handy -P flag which defines the maximum number of parallel processes that can run at a time. xargs is extremely flexible, so I will not go into too much detail in this article, but I will cover the basics:

seq 10 -1 1 | xargs -n1 -P4 -I {} bash -c "sleep {}; echo {};"

To break it down, seq supplies the input ranging from 10 down to 1. xargs then splits the input into individual pieces (as specified by -n1, meaning use at most one argument per command). The -P flag, as mentioned, is what gives the parallelization ability; without it, xargs runs the commands serially. The -I flag sets {} as the placeholder, so every occurrence of {} in the command is replaced with the input chunk, which in our case is an individual number. Finally, xargs runs bash, which in turn executes the string with all occurrences of {} replaced.
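Note that splicing {} directly into a bash -c string works here because the inputs are plain numbers, but with arbitrary input it is safer to pass each value as a positional argument instead. A variant of the same command might look like this:

# pass each number as $1 instead of splicing it into the command string
seq 10 -1 1 | xargs -n1 -P4 bash -c 'sleep "$1"; echo "$1"' _

The behavior is otherwise identical.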

The above snippet produces:

7
8
9
10
5
4
3
6
1
2

If you take a look at the first batch, as we have seen with #2, only up to 4 tasks are processed at a time since we have limited xargs to four processes. However, if you take a look at the rest of the output, you will see that the numbers are not ordered as before. This is because xargs assigns a new task as soon as a process becomes available, rather than waiting for a whole batch to finish.

Indeed, if we allow as many processes as there are inputs, all tasks are handled at the same time, as we would expect.

seq 10 -1 1 | xargs -n1 -P10 -I {} bash -c "sleep {}; echo {};"

which rightly produces

1
2
3
4
5
6
7
8
9
10

xargs should be sufficient for most users in most cases. Other than being a tad more complicated than the first two methods introduced above, xargs provides an efficient, easy way of implementing a parallel script.

ADVANTAGE: eliminates bottleneck

DISADVANTAGE: none for most users. See #4 for more

4.

However, some users might need something more powerful, in which case I introduce you to GNU Parallel. Thankfully, the GNU Parallel project has already detailed the reasons why you might want to switch over from xargs.

The parallel equivalent of

seq 10 -1 1 | xargs -P4 -I {} bash -c "sleep {}; echo {};"

is

seq 10 -1 1 | parallel -P4 "sleep {}; echo {};"

You can see that parallel's syntax is a lot simpler than that of xargs.
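As a small taste of that flexibility, GNU Parallel can, for example, keep the output in the same order as the input with its --keep-order (-k) flag, something that takes noticeably more work with xargs (this assumes a reasonably recent version of GNU Parallel):

# -k / --keep-order prints results in input order (10 down to 1),
# even though the jobs themselves still run in parallel
seq 10 -1 1 | parallel -k -P4 "sleep {}; echo {};"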

ADVANTAGE: refer to the comparison between xargs and GNU Parallel; to name a few, flexibility and simplicity

DISADVANTAGE: external dependency