Changing default CFLAGS on i386

Thu Mar 6 20:14:04 UTC 2014

hi,

On Thu, Mar 6, 2014, at 2:23, Adam Conrad wrote:
> I wouldn't be entirely against this option, if the performance hit is
> measurably not awful in general purpose usage.

So I did some measurements against the 'radial-perf-test' in pixman,
compiled with all of the special asm/mmx/sse2/etc. backends disabled
(i.e. plain C floating-point code).  I have no idea what this code is
doing, but I figured it might be a good test.  I might have accidentally
picked something hideously non-representative.  I only wanted to get a
rough idea, without spending too much time on this.

The baseline for 32-bit with -march=i686 is "Average time to composite:
0.037647".

Adding -fexcess-precision=standard gives 0.040273 (+7%).  That's a
reasonable hit on FP-heavy code.

SSE2 beats -fexcess-precision=standard but it doesn't really improve on
the baseline -- in fact, '-march=pentium4 -mfpmath=sse -mtune=generic'
gives almost exactly the same result as where we are today: Average time
to composite: 0.037669.  The advantage here is that we now have a
standards-compliant C compiler.
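
For anyone wanting to check which evaluation mode a given set of flags
produces: C99 reports it through FLT_EVAL_METHOD in <float.h>.  The
following is only a sketch and was not part of the measurements above;
the function name is made up for illustration.  With x87 math the value
should be 2 (intermediates held in long double), and with -mfpmath=sse
it should be 0, but verify this on your own toolchain.

```c
#include <float.h>

/* Hypothetical helper: describe how the compiler says it evaluates
 * float/double expressions, based on C99's FLT_EVAL_METHOD. */
const char *fp_eval_description(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "each operation is evaluated in its own type";
    case 1:
        return "float and double are evaluated as double";
    case 2:
        return "float and double are evaluated as long double";
    default:
        return "indeterminate or implementation-defined";
    }
}
```

Printing the returned string from a small main() built with each set of
CFLAGS shows which mode is in effect.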

We get a slight improvement if we turn on '-march=pentium4
-mtune=generic' without forcing the compiler into SSE for math: Average
time to composite: 0.036601.  That's ~3% better than today.

I'm slightly surprised that pentium4+sse2 only ties the existing
-march=i686 flags (although it beats it by actually being
standards-correct) and in particular I'm surprised that forcing SSE math
slows things down vs. -march=pentium4 alone.  I'm not sure of the reason
for this.  It could be that the SSE2 instructions are truly a slower way
of doing the math.  It could also be that the compiler has received less
optimisation attention here due to it being a non-default option.

I did another test with a simple program that approximates the tight
inner loop in a Mandelbrot set calculation.  It saw similar results in
terms of i686 vs. pentium4 and sse (i686 ~= sse, plain pentium4 ~2%
faster).  In this case the performance hit of
-fexcess-precision=standard was much worse, though: +40%.
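
The test program itself is not reproduced here.  As a rough sketch of
what such a tight Mandelbrot inner loop looks like in plain C (the
function name, the use of double, and the escape radius of 2 are
assumptions for illustration, not the code that was measured):

```c
/* Count iterations of z = z*z + c, starting from z = 0, until |z| > 2
 * or max_iter is reached.  Returns the number of iterations completed. */
int mandel_iters(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int i;

    for (i = 0; i < max_iter; i++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;

        /* |z|^2 > 4 is the same test as |z| > 2, without the sqrt. */
        if (zr2 + zi2 > 4.0)
            break;

        /* z*z + c, split into real and imaginary parts. */
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
    }
    return i;
}
```

Calling this over a grid of points and timing the run under each set of
flags would be one way to repeat the comparison.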

In short: I'm dismayed to report that turning on '-march=pentium4
-mfpmath=sse -mtune=generic' gives no performance improvement over the
current -march=i686 baseline in either of these tests.

If we approach this problem from the standpoint of "we must provide a C
compiler that adheres to standards" then using these options does give a
substantial improvement on FP-heavy code over the alternative of using
-fexcess-precision=standard.

Cheers