
The purpose of this brief article is to illustrate how GNU GCC handles optimizations flags (-O0, -O1, -O2, -O3, -O4) by looking at the assembler code it produces for simple mathematical functions.
I chose to keep things simple: so no printf, no I/O, no real function calls.
My GCC version: gcc (GCC) 4.5.2 20110127 (prerelease)
Platform: Linux x86_64
Here's the code used for testing:
#include <math.h>
// Compile with gcc -S and -O0, -O1, -O2 ...
unsigned int multiply(unsigned int a, unsigned int b) {
return (unsigned int) a*b;
}
unsigned int half(unsigned int x) {
return (unsigned int) x/2;
}
unsigned int twice(unsigned int x) {
return (unsigned int) x*2;
}
float radice(float x) {
return sqrt(x);
}
int main() {
return 0;
}
Optimizations >= -O2 returned the exact same ASM.
-O0:
.globl multiply
.type multiply, @function
multiply:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
imull -8(%rbp), %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size multiply, .-multiply
Lots of garbage, here. Explicit operations on the stack. Not good.
Interesting fact: GCC uses IMUL (intended for signed integers) even for unsigned ones. I suppose because Intel and AMD worked harder on optimizing IMUL instead of MUL: IMUL is more flexible (allows arbitrary destination), so more feaseble for compilers.
The fact is that IMUL seems to be faster.
.globl half
.type half, @function
half:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
shrl %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size half, .-half
Explicit operations on the stack, but right shift implemented with -O0, too :).
.globl twice
.type twice, @function
twice:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
addl %eax, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size twice, .-twice
No left shift with -O0, using sum (addl).
.globl radice
.type radice, @function
radice:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movss %xmm0, -4(%rbp)
movss -4(%rbp), %xmm0
cvtps2pd %xmm0, %xmm0
movapd %xmm0, %xmm1
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L6
ucomisd %xmm0, %xmm0
je .L5
.L6:
movapd %xmm1, %xmm0
call sqrt
.L5:
unpcklpd %xmm0, %xmm0
cvtpd2ps %xmm0, %xmm0
leave
.cfi_def_cfa 7, 8
ret
As you can see, lots of interesting things, here. Let's analyze in detail.
First of all, we are using Intel SSE (128-bits XMM registers).
We are using cvtps2pd to convert packed single-precision floating-point values to double precision, then invoking sqrtsd, which computes the square root of a double precision float. We could have used sqrtps, which computes the square root of a single precision float, without converting it.
Then we are using the compare instruction ucomisd (Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS) ucomisd %xmm0, %xmm0, then a JP (Jump if Parity) that overwrites the previous calculation and calls external sqrt (defined in math.h). Then retests, if equals, unpacks the double-precision float with unpcklpd and finally converts it to a single precision float with cvtpd2ps.
Wow.
-O1:
.globl multiply
.type multiply, @function
multiply:
.LFB3:
.cfi_startproc
movl %esi, %eax
imull %edi, %eax
ret
.cfi_endproc
.LFE3:
.size multiply, .-multiply
Much cleaner! :) . No wasted stack arithmetics, still using IMUL.
.globl half
.type half, @function
half:
.LFB4:
.cfi_startproc
movl %edi, %eax
shrl %eax
ret
.cfi_endproc
.LFE4:
.size half, .-half
No wasted stack arithmetics, still using SHR.
.globl twice
.type twice, @function
twice:
.LFB5:
.cfi_startproc
leal (%rdi,%rdi), %eax
ret
.cfi_endproc
.LFE5:
.size twice, .-twice
No wasted stack arithmetics, GCC prefers the LEA instruction, which can multiply registers content by small powers of two, too.
.globl radice
.type radice, @function
radice:
.LFB6:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movaps %xmm0, %xmm1
sqrtss %xmm0, %xmm0
ucomiss %xmm0, %xmm0
jnp .L5
movaps %xmm1, %xmm0
call sqrtf
.L5:
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE6:
.size radice, .-radice
Still using SSE registers, finally no "silly" Single Precision -> Double Precision and then Double Precision -> Single Precision conversions. SQRTSS is used directly, without float packing. Same test as with -O0, but this time external sqrtf is called.
-O2 and superior optimizations:
Same as -O1.
Same as -O1, but with JNP/JP inversions.