Last Updated: February 25, 2016

Use std::function for all your function-passing needs.

But don't forget to turn on your compiler optimizations! Or use C++11's native lambda support.

While architecting a user-space filesystem that will run on Mac OS (using libfuse) and Windows (using CBFS), I wanted an easy way to bind the filesystem callbacks (createFile, openFile, readFile, writeFile, and so on). My first thought was "Hey, let's use C++11's awesome std::function and std::bind." These would let me pass in callback methods from arbitrary classes very easily.
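
As a rough sketch of that idea, the snippet below stores member functions from an arbitrary class in std::function objects via std::bind. The names FsCallbacks, MyFs, and makeCallbacks, and the callback signatures, are hypothetical and invented for illustration; they are not the real libfuse or CBFS APIs.

```cpp
#include <functional>
#include <string>

// Hypothetical callback table; the signatures are invented for illustration
// and do not match libfuse or CBFS.
struct FsCallbacks {
    std::function<int(const std::string&)> openFile;
    std::function<int(const std::string&, int)> readFile;
};

// Hypothetical class implementing the callbacks as ordinary member functions.
class MyFs {
public:
    int openFile(const std::string& path) {
        ++opens_;
        return static_cast<int>(path.size());
    }
    int readFile(const std::string& path, int bytes) {
        return path.empty() ? -1 : bytes;
    }
    int opens() const { return opens_; }
private:
    int opens_ = 0;
};

// Bind the member functions of one MyFs instance into the callback table.
// std::bind stores the object pointer; the placeholders forward the call arguments.
FsCallbacks makeCallbacks(MyFs& fs) {
    using namespace std::placeholders;
    FsCallbacks cb;
    cb.openFile = std::bind(&MyFs::openFile, &fs, _1);
    cb.readFile = std::bind(&MyFs::readFile, &fs, _1, _2);
    return cb;
}
```

The caller only ever sees the std::function objects, so the same table could be filled from any class with matching member functions. The bound object must outlive the table, since std::bind holds only a pointer to it here.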

My second thought was "Hey, wouldn't adding extra layers between the callee and caller on every filesystem access be slow?" So I decided to test it out and see. You can find the code in this gist. My testing framework is quite simple: in a loop call the function 1 billion times and see how long that takes. I did that with 4 different types of functions, 4 different ways of executing the function, and 2 different compiler optimization levels.

The function performed some arithmetic that isn't too easy for the compiler to completely optimize away (mainly the square root).

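#include <cmath> // std::sqrt

int func( int a, int b ){
    int c = a * b;
    c *= a;
    // The square root keeps the compiler from folding the whole body away.
    double root = std::sqrt( (double)c );
    return (int)root;
}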

The loop looked like this:

const int ITERATIONS = 1000000000; // 1 billion
for( int i = 0; i < ITERATIONS; ++i ){
    func( i, ITERATIONS - 1 );
}
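
The timing code itself isn't shown above, so the following is only a sketch of how such a loop could be timed with std::chrono. The helper name timeLoopMs and the volatile sink are assumptions made for illustration; the sink is there so that each result is used and the compiler can't discard the calls.

```cpp
#include <chrono>

// Hypothetical helper: calls f(i, iterations - 1) `iterations` times and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
long long timeLoopMs(F f, int iterations) {
    volatile int sink = 0; // keeps every result observable to the optimizer
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = f(i, iterations - 1);
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Passing a different callable (a function pointer, a std::function, or a lambda) to the same helper would time each calling method with an otherwise identical loop.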

The four different function types were:

Inline function from header file.
External function from another compilation unit.
Inline class member function from header file.
External class member function from another compilation unit.

The four different methods of calling were:

Directly call the function.
Call the function through a pointer to it.
Wrap the function in std::function and call that.
Wrap the function in a lambda and call that.
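
To make the four calling methods concrete, here is a small sketch that applies each one to a free function and to a member function. The function add and its trivial body are stand-ins chosen for illustration, not the benchmarked code; Test and tester mirror the names that appear in the test output.

```cpp
#include <functional>

// Stand-in for the benchmarked function; the body is simplified for illustration.
inline int add(int a, int b) { return a + b; }

struct Test {
    int addMember(int a, int b) { return a + b; }
};

int callAllFourWays() {
    Test tester;
    int total = 0;

    // 1. Direct call.
    total += add(1, 2);
    total += tester.addMember(1, 2);

    // 2. Call through a pointer (free function and pointer-to-member).
    int (*fp)(int, int) = &add;
    int (Test::*mp)(int, int) = &Test::addMember;
    total += fp(1, 2);
    total += (tester.*mp)(1, 2);

    // 3. Wrapped in std::function (std::bind supplies the object for the member).
    using namespace std::placeholders;
    std::function<int(int, int)> f1 = add;
    std::function<int(int, int)> f2 = std::bind(&Test::addMember, &tester, _1, _2);
    total += f1(1, 2);
    total += f2(1, 2);

    // 4. Wrapped in a lambda.
    auto l1 = [&](int a, int b) { return add(a, b); };
    auto l2 = [&](int a, int b) { return tester.addMember(a, b); };
    total += l1(1, 2);
    total += l2(1, 2);

    return total; // eight calls, each returning 3
}
```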

The two different optimization levels were:

-O0 No optimizations.
-O3 All non-experimental optimizations.

The results were a bit unexpected, but quite nice in the optimized case. These tests were run on my work-issued MacBook Pro running OS X 10.9 with a 2.3 GHz Intel Core i7 and 8 GiB of DDR3 RAM. They were compiled with g++ version 4.7.1 from MacPorts.

natalie@WorkBook:funcSpeed$ g++ -std=c++11 test.cpp main.cpp -o funcSpeed && ./funcSpeed
   ---   Direct Call Tests   ---   
testInline                                                     9317ms.
testExternal                                                   9433ms.
tester.testInlineMember                                        9252ms.
tester.testExternalMember                                      9328ms.
   ---   Pointer Call Tests   ---   
(&testInline)                                                  9401ms.
(&testExternal)                                                9642ms.
(tester.*(&Test::testInlineMember))                            9515ms.
(tester.*(&Test::testExternalMember))                          9382ms.
   ---   std::function Call Tests   ---   
funcTestInline                                               134101ms.
funcTestExternal                                             134797ms.
funcTestInlineMember                                         153735ms.
funcTestExternalMember                                       154390ms.
   ---   Lambda Call Tests   ---   
[&]( int a, int b ){ testInline( a, b ); }                    10586ms.
[&]( int a, int b ){ testExternal( a, b ); }                  10441ms.
[&]( int a, int b ){ tester.testInlineMember( a, b ); }       10996ms.
[&]( int a, int b ){ tester.testExternalMember( a, b ); }     11231ms.
natalie@WorkBook:funcSpeed$ g++ -std=c++11 -O3 test.cpp main.cpp -o funcSpeed && ./funcSpeed # Optimized!
   ---   Direct Call Tests   ---   
testInline                                                     7693ms.
testExternal                                                   8297ms.
tester.testInlineMember                                        7790ms.
tester.testExternalMember                                      7986ms.
   ---   Pointer Call Tests   ---   
(&testInline)                                                  7617ms.
(&testExternal)                                                8031ms.
(tester.*(&Test::testInlineMember))                            7752ms.
(tester.*(&Test::testExternalMember))                          8119ms.
   ---   std::function Call Tests   ---   
funcTestInline                                                 9866ms.
funcTestExternal                                              10128ms.
funcTestInlineMember                                           8346ms.
funcTestExternalMember                                         8331ms.
   ---   Lambda Call Tests   ---   
[&]( int a, int b ){ testInline( a, b ); }                     7793ms.
[&]( int a, int b ){ testExternal( a, b ); }                   8211ms.
[&]( int a, int b ){ tester.testInlineMember( a, b ); }        7957ms.
[&]( int a, int b ){ tester.testExternalMember( a, b ); }      7920ms.

The first set is the unoptimized case. The unoptimized direct-call results were about what I expected: they all take the same time, since with no optimizations enabled the testInline function isn't inlined. The same goes for the pointer-based calls. One small surprise was that the pointer-to-external-member test always performed 200 to 300 milliseconds faster than the pointer-to-inline-member test in the unoptimized code (an unimportant difference, since over 1 billion calls it works out to only 0.2 to 0.3 nanoseconds per call).

The first real point of interest comes with the std::function tests. The combination of std::function and std::bind increases execution time ~16x for member functions and ~14x for non-member functions. A lot of that overhead is coming from std::bind. Running the non-member tests with just std::function gave me results of 33693ms and 33591ms for the inline and external tests respectively. That's only a ~3.6x multiplier. The full std::function and std::bind wrapper adds roughly 125 nanoseconds to every non-member call and roughly 144 nanoseconds to every member call. While that is far less than anything a human would notice on a single call, it obviously adds up.
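
The per-call figure comes from dividing the difference in total run time by the 1 billion calls. A minimal sketch of that arithmetic, using the unoptimized timings from the output above (the function name overheadNsPerCall is just an illustrative choice):

```cpp
// Per-call overhead in nanoseconds, given two total run times in milliseconds
// and the number of calls made. 1 ms = 1,000,000 ns.
double overheadNsPerCall(double wrappedMs, double directMs, double calls) {
    return (wrappedMs - directMs) * 1e6 / calls;
}
```

For the non-member inline case, overheadNsPerCall(134101, 9317, 1e9) comes to about 124.8 nanoseconds; for the inline member case, overheadNsPerCall(153735, 9252, 1e9) comes to about 144.5 nanoseconds.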

The lambda functions, however, are considerably faster, which rather surprised me. I had always assumed that under the hood the compiler was just translating the lambdas into std::function objects, but clearly this is not the case. Here the difference between the direct call and the lambda-wrapped call is a mere 1.2 nanoseconds or so per call. That is a measurable overhead, but not bad considering the programmer productivity and project maintainability gains of these features.
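
A lambda expression produces an object of its own unnamed closure type; std::function only gets involved if the lambda is assigned to one. The sketch below illustrates the distinction. The names callDirect, callErased, and demoLambdaTypes are hypothetical, and this shows the language rule rather than what the benchmark code does.

```cpp
#include <functional>
#include <type_traits>

// F is deduced as the lambda's own closure type, so the compiler knows
// exactly which operator() is called and is free to inline it.
template <typename F>
int callDirect(F f, int a, int b) { return f(a, b); }

// Taking std::function forces type erasure: the target is stored behind
// an indirection that is resolved when the call is made.
int callErased(const std::function<int(int, int)>& f, int a, int b) {
    return f(a, b);
}

int demoLambdaTypes() {
    auto lambda = [](int a, int b) { return a * b; };

    // The closure type is distinct from std::function.
    static_assert(!std::is_same<decltype(lambda), std::function<int(int, int)>>::value,
                  "a lambda is not a std::function");

    // The same lambda works both ways; only the second call converts it
    // to a std::function first.
    return callDirect(lambda, 3, 4) + callErased(lambda, 2, 5);
}
```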

Things may seem grim for std::bind, but compiler optimizations come in to save the day. First, we see that with the -finline-functions flag set (it's implicitly set with -O3), inline functions are a decent clip faster than external ones. No big surprise there. The pointer call test also looks remarkably like the direct call test, likely because the compiler is optimizing away my dereferencing, but that is not important for this discussion.

More interestingly, we see that the std::function and std::bind combination optimizes beautifully. We are now down to a ~1.08x and ~1.23x multiplier for the member and non-member functions respectively. It seems that with optimizations enabled the std::bind/std::function combo performs better for member functions than it does for non-member ones. Also interestingly, the non-member optimized performance is the same with and without std::bind, suggesting that it is entirely optimized out in these cases.

The lambda case tells us little in the optimized code. The differences from the direct calls are so minor that the lambda wrapper may have been optimized out entirely by the compiler.
