Performance Optimization

Back

The rules of the game.

The two rules of optimization:

  1. Don't do it.
  2. (For experts only!) Don't do it yet.
- Michael A. Jackson

Avoid premature optimization.

Before you attempt performance optimization for a project:

Avoid pessimism.

Don't write code that is obviously inefficient.

False Optimizations

There are many coding practices which are false optimizations. These may seem to help performance, but are actually have no effect or are even detrimental.

Macros

Some developers use macros to avoid the overhead of a function call. Macros are the bluntest tool for defining functions. Use the C++ language feature of inline functions instead. Instead of
#define DISTANCE(x,y) (std::sqrt((x - y) * (x - y)))
use
inline
double
computeDistance(const double x, const double y) {
  return std::sqrt((x - y) * (x - y));
}

Psychological Optimizations

Using single-character variable names will not improve performance! If you are guilty of this sin, then write the previous sentence on a board and bang it against your head until the idea sinks in. Short formulas look more efficient. When you use descriptive variable names you may subconsiously assume that the computer will take longer to "read" them when it is executing the program. Of course, the variable name has no effect on performance. Also, formatting your program to fit on fewer lines will not help performance. Omitting braces only makes your program harder to read and makes the programming more error prone. The following two examples are functionally equivalent, but the former is easier to understand.

template<typename ForwardIterator>
ForwardIterator
min_element(ForwardIterator first, ForwardIterator last) {
  if (first == last) {
    return first;
  }
  ForwardIterator result = first;
  while (++first != last) {
    if (*first < *result) {
      result = first;
    }
  }
  return result;
}
template<typename I>
I min_element(I a, I b) {
  if (a == b) return a;
  I r = a;
  while (++a != b)
    if (*a < *r) r = a;
  return r;
}

Iterators Versus Indexing

Using iterators is usually not more or less efficient than using array indexing. You should prefer one over the other for ease of implementation and clarity of presentation, but not for efficiency. Consider the following three examples.

sum = 0;
for (int i = 0; i != array.size(); ++i) {
  sum += array[i];
}
sum = 0;
for (ConstIterator i = array.begin(); i != array.end(); ++i) {
  sum += *i;
}
sum = std::accumulate(array.begin(), array.end(), 0);

The three examples have essentially the same efficiency. The final one is prefered because of its simplicity and use of a standard idiom.

Repeated Expressions

Optimizing compilers recognize repeated expressions. They will store temporary values as necessary to most efficiently execute a section of code. Don't try to tell the compiler how to do this. It is better at optimizing an expression than you. Don't limit its flexibility by introducing unecessary temporary variables. Use

c = (a + b) * (a + b);
instead of
temp = a + b;
c = temp * temp;

Unrolling Loops of Known Length

Optimizing compilers will unroll small loops of fixed length to avoid the overhead of the loop. That is, they replace the expression

for (int i = 0; i != 5; ++i) {
  x[i] = i;
}
with
x[0] = 0;
x[1] = 1;
x[2] = 2;
x[3] = 3;
x[4] = 4;
Thus, there is no need for you to do this by hand. Let the compiler do its job.

Polynomial Evaluation

Sometimes you can rewrite an arithmetic expression in a form that is more efficient. In these cases, an optimizing compiler will not transform the expression by itself. An example of this is Horner's method of evaluating polynomials. One can evaluate a polynomial term-by-term in the following manner.

result = a0 + a1 * x + a2 * x * x + a3 * x * x * x;
Each additional term requires two multiplication and one addition. Horner's method is more efficient.
result = ((a3 * x + a2) * x + a1) * x + a0;
Now each additional term requires one multiplication and one addition. Some architectures can perform a fused multiply-add in one clock cycle. (An FMA is an expression of the form y = a * x + b.) For these architectures, Horner's method is especially efficient.

Mathematical Functions

Addition, subtraction, and multiplication are cheap. Each costs the same and most processors can perform a couple operations per cycle. (With pen and paper, multiplication takes more work than addition. But for modern processors each takes a singe cycle.) Mathematical functions such as exp(), log(), and sqrt() are much more expensive. This is sensible, but what may be surprising is that floor(), ceil() and division are also expensive. They are not as expensive as exp(), but they are much more expensive than the intrinsic operations of addition and multiplication.

Note that although most processors can perform a couple additions or multiplications per cycle, you should expect a speed of a couple cycles per operation. There is a significant gap between what is theoretically possible and what performance you can obtain.

Mixing Types

Addition and multiplication are fast for both integer types and floating-point types. Changing the size of the number type typically has little impact on performance. Whether you use a 8, 16, 32, or 64-bit integer will probably not affect the running time. Likewise for float and double. (Of course the size of your data structures will be affected. If you are getting cache misses, using a smaller type could help.)

You can mix integer types in an expression without hurting performance. Note however you will pay a significant penalty for mixing integers and floating-point numbers or for mixing floats and doubles. The penalty comes from converting an integer to a floating-point number and from converting a float to a double, respectively. A mixed-type operation costs several times more than a single-type operation.

Inline Functions

Declaring functions to be inline can save on overhead for function calls. Note that the inline qualification is only a suggestion to the compiler. You will have to use the appropriate optimization flags for the compiler to actually inline the functions. For the GNU compiler, -O3 will turn on the -finline-functions option. You can check which functions are inlined with Shark, or another trace application. Note that you can make more functions inline by using the -finline-limit option.

Branch Prediction and Speculation

Most modern super-scalar processors perform branch prediction and speculation. Based on previous evaluations, the processor predicts whether the condition in an if statement will be true or false. Then it fills the pipeline with the instructions that it expects to execute. If it guesses incorrectly, it has to flush the pipeline which incurs a performance penalty. Branch prediction and speculation usually improves the overall performance of a code. With it, predictable branches become less expensive and unpredictable branches more expensive. When you are considering potential optimizations, keep in mind that unpredictable branches may incur a significant penalty.

Default Arguments

Don't use default arguments with performance-critical functions. By performance-critical I mean functions which are called many times and comprise a significant portion of the computational cost of the program. Consider the following function:

double
f(const double x, const double r = 0);
You will get better performance when calling the functon with a single argument if you split the function into two versions:
double
f(const double x, const double r);
double
f(const double x);
Note that the former method is cleaner and avoids code duplication. Ideally there would not be a performance penalty for using the former method. Unless you have an ideal compiler, use the latter method.

STL

Sorted Associative Containers

Each of set, multi_set, map, and multi_map are sorted associative containers. You can improve the performance of insertions if you know where the element should go. Use the insert member function that takes a position as the first argument.

iterator
insert(iterator position, const value_type& x);

The position should be the location that immediately precedes the insert location.


http://www.its.caltech.edu/~sean/ / at(sean, dot(caltech, edu)) / Last Modified: September 3, 2009