OpenMP: improving reduction techniques
Introduction
The `reduction` keyword is one of the many clauses of the OpenMP programming paradigm, and it is often a source of degradation rather than acceleration. This tutorial describes how to avoid the common mistakes and how and when to use it.
What is the problem?
Each `reduction` is translated by the compiler into an `atomic` section, which causes a small overhead. The problem starts when this small overhead begins to grow dramatically.
Example of Standard Use
The standard use looks like this:
void standard_reduction() {
    <type> my_global_sum = 0;
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel for reduction(+: my_global_sum)
    for (int i = 0; i < N; ++i) {
        <type> formula = f(i);
        my_global_sum += formula;
    }
    /* The serial part of your algorithm: 1 thread works here */
    my_final_result = g(my_global_sum);
}
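As a concrete, compilable sketch of the pattern above (assuming, purely for illustration, that `<type>` is `double` and that `f(i)` simply returns `i`):

```c
#include <assert.h>

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Standard reduction over N elements. The pragma is simply
   ignored when the code is compiled without OpenMP support,
   so the function stays correct either way. */
double standard_reduction(int N) {
    double my_global_sum = 0.0;
    #pragma omp parallel for reduction(+: my_global_sum)
    for (int i = 0; i < N; ++i) {
        my_global_sum += f(i);
    }
    return my_global_sum;
}
```

With OpenMP enabled, each thread gets a private copy of `my_global_sum` and the copies are combined at the end of the loop.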
The `standard_reduction()` above may be the source of an overall degradation of your application.
If `N` is 'big', the inner atomic operation will produce unexpected slowdowns: each atomic operation has a cost that does not pay off when the synchronization mechanism is executed at every step of an algorithm with a high complexity order.
Note on 'big': the 'big' concept is relative to the architecture and algorithms involved, but the idea is that 'big' means a high order of magnitude.
Note on 'small': everything is relative, so the `atomic` operation has a 'small' overhead with respect to the lock/unlock mechanism that could be used to replace it.
Proposed Improvement
Using a local variable in each thread removes the innermost overhead of the algorithm.
The improved version of `standard_reduction()` is:
void improve_reduction() {
    <type> my_global_sum = 0;
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel
    {
        <type> my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            <type> formula = f(i);
            my_local_sum += formula;
        }
        #pragma omp atomic
        my_global_sum += my_local_sum;
    }
    /* The serial part of your algorithm: 1 thread works here */
    my_final_result = g(my_global_sum);
}
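A compilable sketch of the improved pattern, under the same illustrative assumptions (`double` accumulator, `f(i) = i`):

```c
#include <assert.h>

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Improved version: each thread accumulates into a private
   my_local_sum, then performs exactly one atomic update at
   the end of its chunk instead of one per iteration. */
double improve_reduction(int N) {
    double my_global_sum = 0.0;
    #pragma omp parallel
    {
        double my_local_sum = 0.0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            my_local_sum += f(i);
        }
        #pragma omp atomic
        my_global_sum += my_local_sum;
    }
    return my_global_sum;
}
```

Compiled without OpenMP, the pragmas are ignored and the block runs once serially, so the result is unchanged.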
Improvement Factor
Let's call H the average overhead of a single atomic operation.
In `standard_reduction()` the total overhead is O(H * N * number_threads).
In `improve_reduction()` the formula becomes O(H * number_threads). This is negligible because, in practice, number_threads is never on the order of 10^6.
So the improvement factor is exactly O(N), and that is significant.
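Under this counting, a quick numeric check (with the constant factor H dropped) confirms that the ratio between the two overheads is exactly N:

```c
#include <assert.h>

/* Overhead counts under the article's model, with the constant
   factor H dropped: the standard version pays an atomic cost
   proportional to N * number_threads, the improved one a single
   atomic per thread. */
long long standard_overhead(long long N, long long threads) {
    return N * threads;
}

long long improved_overhead(long long N, long long threads) {
    (void)N; /* independent of N */
    return threads;
}
```

For instance, with N = 10^6 and 8 threads the ratio is 10^6: exactly the O(N) factor claimed above.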
A Real Example
Suppose you write a library whose OpenMP support is enabled by a configuration option.
Your code may look like this:
sum = 0;
#ifdef OPENMP_ENABLE
#pragma omp parallel for reduction(+:sum)
#endif
for (i = 0; i < N; ++i) {
    int formula = f(i);
    sum += formula;
}
That's why OpenMP is cool: it's not intrusive!
With two lines (`#ifdef`, `#endif`) we add the configuration option for the OpenMP feature, and with one line (`#pragma omp parallel for reduction(+:sum)`) we create the parallelism.
Suppose now that you want to improve the version above by removing the reduction and introducing a local variable; the code then looks like this:
sum = 0;
#ifdef OPENMP_ENABLE
#pragma omp parallel
{
    <type> my_local_sum = 0;
    #pragma omp for nowait
#endif
    for (i = 0; i < N; ++i) {
        int formula = f(i);
#ifdef OPENMP_ENABLE
        my_local_sum += formula;
#else
        sum += formula;
#endif
    }
#ifdef OPENMP_ENABLE
    #pragma omp atomic
    sum += my_local_sum;
}
#endif
The above code is not readable, hard to change, and not elegant...
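For completeness, one possible arrangement (a sketch, not the only option) keeps each build path whole by duplicating the loop instead of interleaving `#ifdef`s; `OPENMP_ENABLE` is the same hypothetical configuration flag as above:

```c
#include <assert.h>

#define OPENMP_ENABLE /* hypothetical configuration flag */

/* Hypothetical per-element function: here simply f(i) = i. */
static int f(int i) { return i; }

int conditional_sum(int N) {
    int sum = 0;
#ifdef OPENMP_ENABLE
    /* Parallel path: per-thread local sum, single atomic update. */
    #pragma omp parallel
    {
        int my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            my_local_sum += f(i);
        #pragma omp atomic
        sum += my_local_sum;
    }
#else
    /* Serial path: plain accumulation, no OpenMP at all. */
    for (int i = 0; i < N; ++i)
        sum += f(i);
#endif
    return sum;
}
```

The trade-off is a duplicated loop body, but each path can now be read and modified on its own.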
Changing Strategy
There is an alternative way to remove the overhead of the `reduction` or `atomic` clauses entirely. Here is the code:
void advanced_reduction() {
    <type> my_global_sum = 0;
    <type> *vectorization;
    int threads;
    #pragma omp parallel
    #pragma omp master
    threads = omp_get_num_threads();
    vectorization = (<type>*) _mm_malloc(threads * sizeof(<type>), 64);
    assert(vectorization != NULL);
    /* The parallel part of your algorithm: k threads work here */
    #pragma omp parallel
    {
        <type> my_local_sum = 0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i) {
            <type> formula = f(i);
            my_local_sum += formula;
        }
        vectorization[omp_get_thread_num()] = my_local_sum;
    }
    /* The serial part of your algorithm: 1 thread works here */
    for (int i = 0; i < threads; ++i)
        my_global_sum += vectorization[i];
    _mm_free(vectorization); /* release the per-thread buffer */
    my_final_result = g(my_global_sum);
}
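A compilable sketch of this strategy (using plain `calloc` instead of `_mm_malloc` for portability, `f(i) = i` as an illustrative element function, and a serial fallback when OpenMP is disabled):

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical per-element function: here simply f(i) = i. */
static double f(int i) { return (double)i; }

/* Each thread writes its partial sum into its own slot of a
   heap array; the slots are summed serially afterwards, so no
   reduction clause and no atomic operation are needed. */
double advanced_reduction(int N) {
    int threads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    #pragma omp master
    threads = omp_get_num_threads();
#endif
    double *partial = calloc((size_t)threads, sizeof *partial);
    if (partial == NULL) return 0.0;

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        double my_local_sum = 0.0;
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            my_local_sum += f(i);
        partial[tid] = my_local_sum;
    }

    /* Serial combination of the per-thread partial sums. */
    double my_global_sum = 0.0;
    for (int t = 0; t < threads; ++t)
        my_global_sum += partial[t];
    free(partial);
    return my_global_sum;
}
```

Without OpenMP, `threads` stays 1 and the block runs once serially, so the result is the same either way.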
Keep in mind: the code above shows an alternative to `improve_reduction()`, but it does not always offer better performance, because:
- the `atomic` clause has a small overhead
- the `atomic` clause is outside the inner loop
- as mentioned before, O(H * number_threads) is negligible
Threads Overhead
Every time a group of threads is created by `#pragma omp parallel`, `#pragma omp parallel for`, etc., there is an overhead due to the creation of the threads by the OpenMP runtime. That's why it is important to know the order of complexity of your algorithms: a serial version may often be more efficient than the parallel implementation of the same algorithm, simply because it avoids the thread-creation overhead.
Conclusion
Use the `reduction` clause only when you need code readability and your problem has a low order of magnitude; otherwise, spend a few minutes and improve your algorithm.