OpenMP: improving reduction techniques
Introduction
The `reduction` keyword is one of the many clauses of the OpenMP programming paradigm, and it is often a source of degradation rather than acceleration. This tutorial describes how to avoid the common mistakes and how and when to use it.
What is the problem?
Each `reduction` is translated by the compiler into an `atomic` section, which causes a small overhead. The problem starts when this small overhead begins to grow dramatically.
Example of Standard Use
The standard use looks like this:
void standard_reduction() {
    <type> my_global_sum = 0;
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel for reduction(+: my_global_sum)
    for (int i = 0; i < N; ++i) {
        <type> formula = f(i);
        my_global_sum += formula;
    }
    /* The serial part of your algorithm: 1 thread works here */
    my_final_result = g(my_global_sum);
}
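As a concrete, compilable sketch of the pattern above (assuming, purely for illustration, that `<type>` is `double` and that `f(i)` simply returns `i`):

```c
#include <assert.h>

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Standard reduction over N elements. The pragma is simply
   ignored when the code is compiled without OpenMP support,
   so the function stays correct either way. */
double standard_reduction(int N) {
    double my_global_sum = 0.0;
    #pragma omp parallel for reduction(+: my_global_sum)
    for (int i = 0; i < N; ++i) {
        my_global_sum += f(i);
    }
    return my_global_sum;
}
```

With OpenMP enabled, each thread gets a private copy of `my_global_sum` and the copies are combined at the end of the loop.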
The `standard_reduction()` above may be the source of an overall degradation of your application.
If `N` is 'big', the inner atomic operation will produce unexpected slowdowns: each atomic operation has a cost that does not pay off when the synchronization mechanism is executed at every step of an algorithm with a high complexity order.
Note on 'big': the 'big' concept is relative to the architecture and algorithms involved, but the idea is that 'big' means a high order of magnitude.
Note on 'small': everything is relative, so the `atomic` operation has a 'small' overhead with respect to the lock/unlock mechanism that could be used to replace it.
Proposed Improvement
Using a local variable in each thread removes the innermost overhead of the algorithm.
The improved version of `standard_reduction()` is:
void improve_reduction() {
    <type> my_global_sum = 0;
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel
    {
        <type> my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            <type> formula = f(i);
            my_local_sum += formula;
        }
        #pragma omp atomic
        my_global_sum += my_local_sum;
    }
    /* The serial part of your algorithm: 1 thread works here */
    my_final_result = g(my_global_sum);
}
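A compilable sketch of the improved pattern, under the same illustrative assumptions (`double` accumulator, `f(i) = i`):

```c
#include <assert.h>

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Improved version: each thread accumulates into a private
   my_local_sum, then performs exactly one atomic update at
   the end of its chunk instead of one per iteration. */
double improve_reduction(int N) {
    double my_global_sum = 0.0;
    #pragma omp parallel
    {
        double my_local_sum = 0.0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            my_local_sum += f(i);
        }
        #pragma omp atomic
        my_global_sum += my_local_sum;
    }
    return my_global_sum;
}
```

Compiled without OpenMP, the pragmas are ignored and the block runs once serially, so the result is unchanged.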
Improvement Factor
Let's call H the average overhead of a single atomic operation.
In `standard_reduction()` the total overhead is O(H * N * number_threads).
In `improve_reduction()` the formula becomes O(H * number_threads). This is negligible because, in practice, number_threads is never on the order of 10^6.
So the improvement factor is exactly O(N), and that is significant.
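Under this counting, a quick numeric check (with the constant factor H dropped) confirms that the ratio between the two overheads is exactly N:

```c
#include <assert.h>

/* Overhead counts under the article's model, with the constant
   factor H dropped: the standard version pays an atomic cost
   proportional to N * number_threads, the improved one a single
   atomic per thread. */
long long standard_overhead(long long N, long long threads) {
    return N * threads;
}

long long improved_overhead(long long N, long long threads) {
    (void)N; /* independent of N */
    return threads;
}
```

For instance, with N = 10^6 and 8 threads the ratio is 10^6: exactly the O(N) factor claimed above.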
A Real Example
Suppose you write a library whose OpenMP support is enabled by a configuration option.
Your code may look like this:
sum = 0;
#ifdef OPENMP_ENABLE
#pragma omp parallel for reduction(+:sum)
#endif
for (i = 0; i < N; ++i) {
    int formula = f(i);
    sum += formula;
}
That's why OpenMP is cool: it's not intrusive!
With two lines (`#ifdef`, `#endif`) we add the configuration option for the OpenMP feature, and with one line (`#pragma omp parallel for reduction(+:sum)`) we create the parallelism.
Suppose now that you want to improve the version above by removing the reduction and introducing a local variable; the code then looks like this:
sum = 0;
#ifdef OPENMP_ENABLE
#pragma omp parallel
{
    <type> my_local_sum = 0;
    #pragma omp for nowait
#endif
    for (i = 0; i < N; ++i) {
        int formula = f(i);
#ifdef OPENMP_ENABLE
        my_local_sum += formula;
#else
        sum += formula;
#endif
    }
#ifdef OPENMP_ENABLE
    #pragma omp atomic
    sum += my_local_sum;
}
#endif
The above code is not readable, hard to change, and not elegant...
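For completeness, one possible arrangement (a sketch, not the only option) keeps each build path whole by duplicating the loop instead of interleaving `#ifdef`s; `OPENMP_ENABLE` is the same hypothetical configuration flag as above:

```c
#include <assert.h>

#define OPENMP_ENABLE /* hypothetical configuration flag */

/* Hypothetical per-element function: here simply f(i) = i. */
static int f(int i) { return i; }

int conditional_sum(int N) {
    int sum = 0;
#ifdef OPENMP_ENABLE
    /* Parallel path: per-thread local sum, single atomic update. */
    #pragma omp parallel
    {
        int my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            my_local_sum += f(i);
        #pragma omp atomic
        sum += my_local_sum;
    }
#else
    /* Serial path: plain accumulation, no OpenMP at all. */
    for (int i = 0; i < N; ++i)
        sum += f(i);
#endif
    return sum;
}
```

The trade-off is a duplicated loop body, but each path can now be read and modified on its own.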
Changing Strategy
There is an alternative way to remove the overhead of the `reduction` or `atomic` clauses entirely. Here is the code:
void advanced_reduction() {
    <type> my_global_sum = 0;
    <type> *vectorization;
    int threads;
    #pragma omp parallel
    #pragma omp master
    threads = omp_get_num_threads();
    vectorization = (<type>*) _mm_malloc(threads * sizeof(<type>), 64);
    assert(vectorization != NULL);
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel
    {
        <type> my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            <type> formula = f(i);
            my_local_sum += formula;
        }
        vectorization[omp_get_thread_num()] = my_local_sum;
    }
    /* The serial part of your algorithm: 1 thread works here */
    for (int i = 0; i < threads; ++i)
        my_global_sum += vectorization[i];
    _mm_free(vectorization); /* release the per-thread buffer */
    my_final_result = g(my_global_sum);
}
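A compilable sketch of this strategy (using plain `calloc` instead of `_mm_malloc` for portability, `f(i) = i` as an illustrative element function, and a serial fallback when OpenMP is disabled):

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Each thread writes its partial sum into its own slot of a
   heap array; the slots are summed serially afterwards, so no
   reduction clause and no atomic operation are needed. */
double advanced_reduction(int N) {
    int threads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    #pragma omp master
    threads = omp_get_num_threads();
#endif
    double *partial = calloc((size_t)threads, sizeof *partial);
    if (partial == NULL) return 0.0;

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        double my_local_sum = 0.0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            my_local_sum += f(i);
        partial[tid] = my_local_sum;
    }

    /* Serial combination of the per-thread partial sums. */
    double my_global_sum = 0.0;
    for (int t = 0; t < threads; ++t)
        my_global_sum += partial[t];
    free(partial);
    return my_global_sum;
}
```

Without OpenMP, `threads` stays 1 and the block runs once serially, so the result is the same either way.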
Keep in mind: the code above shows an alternative to `improve_reduction()`, but it does not always offer better performance, because:
- the `atomic` clause has a small overhead
- the `atomic` clause is outside the inner loop
- as mentioned before, O(H * number_threads) is negligible
Threads Overhead
Every time a group of threads is created by `#pragma omp parallel`, `#pragma omp parallel for`, etc., there is an overhead due to the creation of the threads by the OpenMP runtime. That's why it is important to know the order of complexity of your algorithms: a serial version may often be more efficient than the parallel implementation of the same algorithm, simply because it avoids the thread-creation overhead.
Conclusion
Use the `reduction` clause only when you need code readability and your problem has a low order of magnitude; otherwise, spend a few minutes and improve your algorithm.