The compilers and options used in the 2012 benchmarks were:
Version: Debian clang version 3.0-6 (tags/RELEASE_30/final) (based on LLVM 3.0)
-Wall -Werror -ftemplate-depth-1000 -Drestrict= -O3 -ffast-math
There is no official list of options for Clang, so I had to guess what to use. -funroll-loops did not help.
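Most of the option lists here define restrict away with -Drestrict= (or /D "restrict=" for Microsoft), while the PGI list enables it with --restrict. The following is a minimal sketch of what the empty definition does, assuming source code that marks pointers with the C99 restrict qualifier, which is not a keyword in standard C++. The axpy function is hypothetical and is not taken from FTensor; the #ifndef block only simulates the command-line option.

```cpp
#include <cstddef>

// Simulates passing -Drestrict= on the command line: the preprocessor
// replaces every occurrence of `restrict` with nothing.
#ifndef restrict
#define restrict
#endif

// Hypothetical kernel (not from FTensor). After preprocessing, the
// parameters are plain pointers, so any standard C++ compiler accepts
// them, at the cost of losing the no-aliasing hint.
void axpy(double *restrict y, const double *restrict x, double a,
          std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```

Calling axpy(y, x, 2.0, 3) with x = {1, 2, 3} and y = {10, 20, 30} leaves y = {12, 24, 36} whether or not the qualifier is honored; only the optimizer's freedom changes.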
Version: g++-4.7 (Debian 4.7.0-8) 4.7.0
-Wall -Werror -ftemplate-depth-1000 -Drestrict= -Ofast
The -Ofast option was crucial for performance. With only -O3, performance was more like Clang.
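One plausible reason for the difference, offered as an assumption rather than something measured here: in GCC, -Ofast enables -ffast-math on top of -O3, which allows the compiler to reorder floating-point arithmetic. That reordering can change results, which is why it is not allowed by default. A small sketch with two hypothetical helper functions shows that floating-point addition is not associative:

```cpp
// Two mathematically equivalent groupings of the same sum.
// Under strict IEEE 754 double arithmetic they can give different
// answers, so a compiler may only rewrite one into the other when
// options such as -ffast-math permit it.
double sum_left(double a, double b, double c)
{
  return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
  return a + (b + c);
}
```

Compiled without fast-math options, sum_left(1e20, -1e20, 1.0) returns 1.0, while sum_right(1e20, -1e20, 1.0) returns 0.0, because 1.0 is lost when added to -1e20.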
For profile guided optimization, I used
Version: icpc (ICC) 12.1.4 20120410
For profile-guided optimization, I used -ipo -prof-use. The training executables ran up to 20 times slower, but the final executables gave much better results for FTensor.
Version: Microsoft Windows SDK for Windows 7 and .NET Framework 4
/D "restrict=" /O2 /W3 /TP /fp:fast
I could not install the compilers directly from the SDK, but instead had to install an update to Visual Studio. Note that Visual Studio 2010 Express does not come with 64 bit compilers.
With these options, the compiler gives warnings about exceptions. Adding /EHsc silences the warnings, but increases the run time substantially. I did not try out profile-guided optimization because it is not included with the free compiler. It is also interesting to note that in 2003 Microsoft's compiler could not compile FTensor at all.
Version: Open64 Compiler Suite: Version 5.0
-Wall -Werror -Drestrict= -Ofast
I tried out the options -LNO:full_unroll=10, but they did not provide any improvement in my limited tests.
I tried out profile-guided optimizations with -fb-opt, but the training executable was too slow to be practical.
Version: Nightly build 2012-06-19
-Wall -Werror -Drestrict= -Ofast
Profile guided optimizations were not working in this version.
Version: pgcpp 12.5-0 64-bit target on x86-64 Linux -tp sandybridge
-fast -Mipa=fast,inline --restrict -I/usr/include/x86_64-linux-gnu
The -I option is present because PGI could not find the right standard headers without it. PGI is the only compiler that could not fully optimize even the simplest FTensor expressions. It also did a poor job of optimizing the simplest C-tran expressions, though it was competitive when the C-tran expressions got more complicated. The PGI website suggests adding -Msmartalloc --zc_eh, but this did not seem to affect the results. Adding -Mprefetch, or replacing -fast with -fastsse, also did not change the results.
While the 2002 benchmarks were run on a wide
variety of operating systems and hardware, the 2012
tests were all run on my Lenovo Thinkpad x220T with
an Intel Core i7-2640M CPU running at 2.8 GHz. For
all of the compilers except the Microsoft compiler,
the operating system was 64 bit Debian Linux testing
"wheezy". Total run times on Linux were measured with the
time command. Total run times
under 64 bit Windows 7 for Microsoft Visual C++ were
measured with the
One possible source of bias is that the benchmarks were run under two different operating systems. However, the only system resource that the code uses is the output stream, to print the result at the end. No exceptions are thrown or caught. Also, all of the compilers except PGI optimize the simplest C-tran and FTensor expressions to the same run time of about 1.45 s.
Each benchmark was run 5 times, and I took the median value. I would estimate the run-to-run variation to be less than 10%.
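The median-of-five procedure can be sketched in code. This is an illustration only: the benchmarks above were timed externally, not with an in-process timer, and the names median and median_seconds are hypothetical. The sketch uses std::chrono::steady_clock from C++11 as a stand-in for an external timer.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a non-empty set of timings. The vector is taken by value
// so that sorting does not disturb the caller's data.
double median(std::vector<double> v)
{
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Runs `work` `runs` times (runs must be at least 1) and returns the
// median wall-clock time in seconds.
template <typename F>
double median_seconds(F work, int runs = 5)
{
  std::vector<double> times;
  for (int i = 0; i < runs; ++i)
    {
      const auto start = std::chrono::steady_clock::now();
      work();
      const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
      times.push_back(elapsed.count());
    }
  return median(times);
}
```

The median is less sensitive than the mean to a single slow run caused by other activity on the machine: for the made-up timings 1.50, 1.45, 1.44, 1.47 and 1.46, median returns 1.46.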