The two rules of optimization:
Before you attempt performance optimization for a project:
Don't write code that is obviously inefficient.
#define DISTANCE(x,y) (std::sqrt((x - y) * (x - y)))use
inline
double
computeDistance(const double x, const double y) {
return std::sqrt((x - y) * (x - y));
}
Using single-character variable names will not improve performance! If you are guilty of this sin, then write the previous sentence on a board and bang it against your head until the idea sinks in. Short formulas look more efficient. When you use descriptive variable names you may subconsiously assume that the computer will take longer to "read" them when it is executing the program. Of course, the variable name has no effect on performance. Also, formatting your program to fit on fewer lines will not help performance. Omitting braces only makes your program harder to read and makes the programming more error prone. The following two examples are functionally equivalent, but the former is easier to understand.
template<typename ForwardIterator>
ForwardIterator
min_element(ForwardIterator first, ForwardIterator last) {
if (first == last) {
return first;
}
ForwardIterator result = first;
while (++first != last) {
if (*first < *result) {
result = first;
}
}
return result;
}
template<typename I>
I min_element(I a, I b) {
if (a == b) return a;
I r = a;
while (++a != b)
if (*a < *r) r = a;
return r;
}
Using iterators is usually not more or less efficient than using array indexing. You should prefer one over the other for ease of implementation and clarity of presentation, but not for efficiency. Consider the following three examples.
sum = 0;
for (int i = 0; i != array.size(); ++i) {
sum += array[i];
}
sum = 0;
for (ConstIterator i = array.begin(); i != array.end(); ++i) {
sum += *i;
}
sum = std::accumulate(array.begin(), array.end(), 0);
The three examples have essentially the same efficiency. The final one is prefered because of its simplicity and use of a standard idiom.
Optimizing compilers recognize repeated expressions. They will store temporary values as necessary to most efficiently execute a section of code. Don't try to tell the compiler how to do this. It is better at optimizing an expression than you. Don't limit its flexibility by introducing unecessary temporary variables. Use
c = (a + b) * (a + b);instead of
temp = a + b; c = temp * temp;
Optimizing compilers will unroll small loops of fixed length to avoid the overhead of the loop. That is, they replace the expression
for (int i = 0; i != 5; ++i) {
x[i] = i;
}
with
x[0] = 0; x[1] = 1; x[2] = 2; x[3] = 3; x[4] = 4;Thus, there is no need for you to do this by hand. Let the compiler do its job.
Sometimes you can rewrite an arithmetic expression in a form that is more efficient. In these cases, an optimizing compiler will not transform the expression by itself. An example of this is Horner's method of evaluating polynomials. One can evaluate a polynomial term-by-term in the following manner.
result = a0 + a1 * x + a2 * x * x + a3 * x * x * x;Each additional term requires two multiplication and one addition. Horner's method is more efficient.
result = ((a3 * x + a2) * x + a1) * x + a0;Now each additional term requires one multiplication and one addition. Some architectures can perform a fused multiply-add in one clock cycle. (An FMA is an expression of the form
y = a * x + b.)
For these architectures, Horner's method is especially efficient.
Addition, subtraction, and multiplication are cheap. Each costs the same and
most processors can perform a couple operations per cycle.
(With pen and paper, multiplication takes more work than addition. But
for modern processors each takes a singe cycle.)
Mathematical functions such as exp(), log(), and
sqrt() are much more expensive. This is sensible, but what
may be surprising is that floor(), ceil() and
division are also expensive. They are not as expensive as exp(),
but they are much more expensive than the intrinsic operations of
addition and multiplication.
Note that although most processors can perform a couple additions or multiplications per cycle, you should expect a speed of a couple cycles per operation. There is a significant gap between what is theoretically possible and what performance you can obtain.
Addition and multiplication are fast for both integer types and floating-point types. Changing the size of the number type typically has little impact on performance. Whether you use a 8, 16, 32, or 64-bit integer will probably not affect the running time. Likewise for float and double. (Of course the size of your data structures will be affected. If you are getting cache misses, using a smaller type could help.)
You can mix integer types in an expression without hurting performance. Note however you will pay a significant penalty for mixing integers and floating-point numbers or for mixing floats and doubles. The penalty comes from converting an integer to a floating-point number and from converting a float to a double, respectively. A mixed-type operation costs several times more than a single-type operation.
Declaring functions to be inline can save on overhead for function calls. Note that the inline qualification is only a suggestion to the compiler. You will have to use the appropriate optimization flags for the compiler to actually inline the functions. For the GNU compiler, -O3 will turn on the -finline-functions option. You can check which functions are inlined with Shark, or another trace application. Note that you can make more functions inline by using the -finline-limit option.
Most modern super-scalar processors perform branch prediction and speculation.
Based on previous evaluations, the processor predicts whether the condition
in an if statement will be true or false. Then it fills the
pipeline with the instructions that it expects to execute. If it guesses
incorrectly, it has to flush the pipeline which incurs a performance penalty.
Branch prediction and speculation usually improves the overall performance
of a code. With it, predictable branches become less expensive and
unpredictable branches more expensive. When you are considering
potential optimizations, keep in mind that unpredictable branches
may incur a significant penalty.
Don't use default arguments with performance-critical functions. By performance-critical I mean functions which are called many times and comprise a significant portion of the computational cost of the program. Consider the following function:
double f(const double x, const double r = 0);You will get better performance when calling the functon with a single argument if you split the function into two versions:
double f(const double x, const double r); double f(const double x);Note that the former method is cleaner and avoids code duplication. Ideally there would not be a performance penalty for using the former method. Unless you have an ideal compiler, use the latter method.
Each of set, multi_set, map, and
multi_map are sorted associative containers. You can improve the
performance of insertions if you know where the element should go. Use the
insert member function that takes a position as the first
argument.
iterator insert(iterator position, const value_type& x);
The position should be the location that immediately precedes the insert location.