Use std::function for all your function-passing needs.
But don't forget to turn on your compiler optimizations! Or use C++11's native lambda support.
While architecting a user-space filesystem that will run on Mac OS (using libfuse) and Windows (using CBFS), I wanted an easy way to bind the filesystem callbacks (e.g. createFile, openFile, readFile, writeFile). My first thought was "Hey, let's use C++11's awesome std::function and std::bind." These would let me very easily pass in callback methods from arbitrary classes.
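To make the idea concrete, here is a minimal sketch of binding member functions into a table of std::function callbacks. The FsCallbacks struct, the MemoryFs class and the makeCallbacks helper are hypothetical names invented for illustration; they are not part of libfuse or CBFS, and the real callback signatures differ.

```cpp
#include <functional>
#include <string>

// Hypothetical callback table; names and signatures are illustrative only.
struct FsCallbacks {
    std::function<int(const std::string&)> openFile;
    std::function<int(const std::string&, int)> readFile;
};

// Hypothetical class that implements the filesystem operations.
class MemoryFs {
public:
    int open(const std::string& path) {
        ++openCount_;
        return path.empty() ? -1 : 0;
    }
    int read(const std::string& path, int bytes) {
        return path.empty() ? -1 : bytes;
    }
    int openCount() const { return openCount_; }

private:
    int openCount_ = 0;
};

// Bind member functions of one object into the callback table.
// The bound object is held by pointer, so it must outlive the callbacks.
FsCallbacks makeCallbacks(MemoryFs& fs) {
    using namespace std::placeholders;
    FsCallbacks cb;
    cb.openFile = std::bind(&MemoryFs::open, &fs, _1);
    cb.readFile = std::bind(&MemoryFs::read, &fs, _1, _2);
    return cb;
}
```

Calling cb.openFile("a.txt") then forwards to fs.open("a.txt"), and any other class with matching member signatures could be bound the same way.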
My second thought was "Hey, wouldn't adding extra layers between the callee and caller on every filesystem access be slow?" So I decided to test it out and see. You can find the code in this gist. My testing framework is quite simple: in a loop call the function 1 billion times and see how long that takes. I did that with 4 different types of functions, 4 different ways of executing the function, and 2 different compiler optimization levels.
The function performed some arithmetic that isn't too easy for the compiler to completely optimize away (mainly the square root).
int func( int a, int b ){
    int c = a * b;
    c *= a;
    double root = std::sqrt( (double)c );
    return (int)root;
}
The loop looked like this:
const int ITERATIONS = 1000000000; // 1 billion
for( int i = 0; i < ITERATIONS; ++i ){
    func( i, ITERATIONS - 1 );
}
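The actual timing code lives in the gist. As a rough sketch of how such a loop could be timed, assuming std::chrono is used, something like the following would work. The timeCalls helper and its checksum parameter are hypothetical names for this example. Summing the return values is one way to keep an optimizing compiler from discarding the calls.

```cpp
#include <chrono>

// Hypothetical helper: calls f(i, iterations - 1) `iterations` times and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
long long timeCalls(F&& f, int iterations, long long* checksum = nullptr) {
    using Clock = std::chrono::steady_clock;
    long long sum = 0;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum += f(i, iterations - 1);  // use the result so the call is not dropped
    }
    const Clock::time_point stop = Clock::now();
    if (checksum) {
        *checksum = sum;  // hand the sum back so it counts as observable work
    }
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

The same helper accepts a plain function, a function pointer, a std::function or a lambda, so each calling method can be timed with identical loop code.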
The four different function types were:
- Inline function from header file.
- External function from another compilation unit.
- Inline class member function from header file.
- External class member function from another compilation unit.
The four different methods of calling were:
- Directly call the function.
- Call the function through a pointer to it.
- Wrap the function in std::function and call that.
- Wrap the function in a lambda and call that.
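For reference, the four calling methods could be sketched as follows. The multiply function and the Tester struct are stand-ins invented for this example, not the code from the gist. The std::function variant uses std::bind to attach the object, which matches the combination discussed in the results.

```cpp
#include <functional>

// Stand-in for the benchmarked function; the name is illustrative.
inline int multiply(int a, int b) { return a * b; }

// Stand-in for the class with member functions under test.
struct Tester {
    int multiplyMember(int a, int b) { return a * b; }
};

// 1. Direct call.
int callDirect(int a, int b) { return multiply(a, b); }

// 2. Call through a pointer to a free function.
int callViaPointer(int a, int b) {
    int (*fp)(int, int) = &multiply;
    return fp(a, b);
}

// 2b. Call through a pointer to a member function.
int callViaMemberPointer(Tester& t, int a, int b) {
    int (Tester::*mp)(int, int) = &Tester::multiplyMember;
    return (t.*mp)(a, b);
}

// 3. Wrap the member function in std::function (via std::bind) and call that.
int callViaStdFunction(Tester& t, int a, int b) {
    using namespace std::placeholders;
    std::function<int(int, int)> f = std::bind(&Tester::multiplyMember, &t, _1, _2);
    return f(a, b);
}

// 4. Wrap the member function in a lambda and call that.
int callViaLambda(Tester& t, int a, int b) {
    auto f = [&](int x, int y) { return t.multiplyMember(x, y); };
    return f(a, b);
}
```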
The two different optimization levels were:
- -O0: No optimizations.
- -O3: All non-experimental optimizations.
The results were a bit unexpected, but quite nice in the optimized case. These tests were run on my work-issued MacBook Pro running OS X 10.9 with a 2.3 GHz Intel Core i7 and 8 GiB of DDR3 RAM. They were compiled with g++ version 4.7.1 from MacPorts.
natalie@WorkBook:funcSpeed$ g++ -std=c++11 test.cpp main.cpp -o funcSpeed && ./funcSpeed
--- Direct Call Tests ---
testInline 9317ms.
testExternal 9433ms.
tester.testInlineMember 9252ms.
tester.testExternalMember 9328ms.
--- Pointer Call Tests ---
(&testInline) 9401ms.
(&testExternal) 9642ms.
(tester.*(&Test::testInlineMember)) 9515ms.
(tester.*(&Test::testExternalMember)) 9382ms.
--- std::function Call Tests ---
funcTestInline 134101ms.
funcTestExternal 134797ms.
funcTestInlineMember 153735ms.
funcTestExternalMember 154390ms.
--- Lambda Call Tests ---
[&]( int a, int b ){ testInline( a, b ); } 10586ms.
[&]( int a, int b ){ testExternal( a, b ); } 10441ms.
[&]( int a, int b ){ tester.testInlineMember( a, b ); } 10996ms.
[&]( int a, int b ){ tester.testExternalMember( a, b ); } 11231ms.
natalie@WorkBook:funcSpeed$ g++ -std=c++11 -O3 test.cpp main.cpp -o funcSpeed && ./funcSpeed # Optimized!
--- Direct Call Tests ---
testInline 7693ms.
testExternal 8297ms.
tester.testInlineMember 7790ms.
tester.testExternalMember 7986ms.
--- Pointer Call Tests ---
(&testInline) 7617ms.
(&testExternal) 8031ms.
(tester.*(&Test::testInlineMember)) 7752ms.
(tester.*(&Test::testExternalMember)) 8119ms.
--- std::function Call Tests ---
funcTestInline 9866ms.
funcTestExternal 10128ms.
funcTestInlineMember 8346ms.
funcTestExternalMember 8331ms.
--- Lambda Call Tests ---
[&]( int a, int b ){ testInline( a, b ); } 7793ms.
[&]( int a, int b ){ testExternal( a, b ); } 8211ms.
[&]( int a, int b ){ tester.testInlineMember( a, b ); } 7957ms.
[&]( int a, int b ){ tester.testExternalMember( a, b ); } 7920ms.
The first set is the unoptimized case. The unoptimized direct-call results were about what I expected: all four take roughly the same time, since with no optimizations enabled the testInline function isn't inlined. The same goes for the pointer-based calls. One small surprise there was that the pointer-to-external-member test always performed 200-300 milliseconds faster than the pointer-to-inline-member test in the unoptimized code (an unimportant difference, as over a billion calls that works out to only 0.2-0.3 nanoseconds per call).
The first real point of interest comes with the std::function tests. The combination of std::function and std::bind increases execution time ~16x for member functions and ~14x for non-member functions. A lot of that overhead is coming from std::bind. Running the non-member tests with just std::function gave me results of 33693ms and 33591ms for the inline and external tests respectively. That's only a ~3.6x multiplier. With std::bind in the mix, the wrapper is adding roughly 125 to 145 nanoseconds (about 0.00014 milliseconds) to every call. While this is still considerably less than anything noticeable by a human, it obviously adds up.
The lambda functions, however, are considerably faster, which rather surprised me. I had always assumed that under the hood the compiler was just translating the lambdas into std::function objects, but clearly this is not the case. Here the difference between the direct call and the lambda-wrapped call is a mere 1 to 2 nanoseconds per call. That is a measurable overhead of roughly 10-20%, but not bad considering the programmer productivity and project maintainability gains of these features.
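The language rules back this up: each lambda expression has its own unnamed closure type, while std::function is a separate type-erasing wrapper that a lambda can be converted to. A small sketch of the distinction, with helper names invented for this example:

```cpp
#include <functional>
#include <type_traits>

inline int add(int a, int b) { return a + b; }

// Takes the callable by its concrete type. With a lambda, the compiler
// knows exactly which operator() is being called.
template <typename F>
int applyDirect(F f, int a, int b) {
    return f(a, b);
}

// Takes a type-erased wrapper. The target is only known at run time,
// so the call goes through an extra layer of indirection.
inline int applyErased(const std::function<int(int, int)>& f, int a, int b) {
    return f(a, b);
}

// Reports whether a lambda's type is std::function<int(int, int)>.
inline bool lambdaIsStdFunction() {
    auto lambda = [](int a, int b) { return add(a, b); };
    return std::is_same<decltype(lambda), std::function<int(int, int)>>::value;
}
```

lambdaIsStdFunction() returns false. Passing a lambda to applyErased still compiles, but only because the lambda is converted into a std::function at the call site. That conversion is presumably where the extra cost in the std::function tests comes from.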
Things may seem grim for std::bind, but compiler optimizations come in to save the day. First, we see that when the -finline-functions flag is set (it's implicitly set with -O3), inline functions are a decent clip faster than external ones. No big surprise there. The pointer call test also looks remarkably like the direct call test, which is likely because the compiler is optimizing away my dereferencing, but that is not important for this discussion.
More interestingly, we see that the std::function and std::bind combination optimizes beautifully. We are now down to multipliers of roughly 1.04-1.07x for the member functions and 1.22-1.28x for the non-member functions. It seems that with optimizations enabled the std::bind/std::function combo performs better for member functions than it does for non-member ones. Also interestingly, the non-member optimized performance is the same with and without the std::bind, suggesting that it is entirely optimized out in these cases.
The lambda case tells us little in the optimized code. The differences from the direct calls are so minor that the lambda wrapper may have been optimized out entirely by the compiler.
Written by Natalie Wolfe
1 Response
But what about build times?