FTensor Compilers 2012

The compilers and options used in the 2012 benchmarks were:

Clang
Version: Debian clang version 3.0-6 (tags/RELEASE_30/final) (based on LLVM 3.0)
Options: -Wall -Werror -ftemplate-depth-1000 -Drestrict= -O3 -ffast-math
There is no official list of options for Clang, so I had to guess what to use. -funroll-loops did not help.
GCC
Version: g++-4.7 (Debian 4.7.0-8) 4.7.0
Options: -Wall -Werror -ftemplate-depth-1000 -Drestrict= -Ofast
The -Ofast option was crucial for performance. With only -O3, performance was closer to Clang's.

For profile-guided optimization, I used -flto -fprofile-generate for the training build and -flto -fprofile-use for the final build.
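The two-pass build described above can be sketched as a small Python helper that only assembles the command lines. This is an illustration, not the script used for the benchmarks: the source file name benchmark.cpp, the executable name ./benchmark, and the helper name gcc_pgo_steps are all hypothetical, and only the flags are taken from the text.

```python
import shlex

# Base GCC options quoted in the text above.
BASE_FLAGS = ["-Wall", "-Werror", "-ftemplate-depth-1000", "-Drestrict=", "-Ofast"]


def gcc_pgo_steps(source="benchmark.cpp", exe="./benchmark", compiler="g++-4.7"):
    """Return the three commands of a two-pass PGO build, in order.

    The file names and the compiler name are placeholders (assumptions).
    """
    # Pass 1: build an instrumented executable that records a profile.
    instrumented = [compiler, *BASE_FLAGS, "-flto", "-fprofile-generate",
                    source, "-o", exe]
    # Training run: executing the instrumented binary writes the profile data.
    training_run = [exe]
    # Pass 2: rebuild, letting the compiler use the recorded profile.
    optimized = [compiler, *BASE_FLAGS, "-flto", "-fprofile-use",
                 source, "-o", exe]
    return [instrumented, training_run, optimized]


if __name__ == "__main__":
    for step in gcc_pgo_steps():
        print(shlex.join(step))
```

Each returned list can be passed to subprocess.run in order. The training run has to finish before the second compile, because that is when the profile data is written.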
Intel
Version: icpc (ICC) 12.1.4 20120410
Options: -restrict -fast
For profile-guided optimization I used -ipo -prof-gen and -ipo -prof-use. The training executables ran up to 20 times slower, but the final optimized build gave much better results for FTensor.
Microsoft
Version: Microsoft Windows SDK for Windows 7 and .NET Framework 4
Options: /D "restrict=" /O2 /W3 /TP /fp:fast
I could not install the compilers directly from the SDK, but instead had to install an update to Visual Studio. Note that Visual Studio 2010 Express does not come with 64 bit compilers.

With these options, the compiler gives warnings about exceptions. Adding /EHsc silences the warnings, but increases the run time substantially. I did not try out profile-guided optimization because it is not included with the free compiler. It is also interesting to note that in 2003 Microsoft's compiler could not compile FTensor at all.
Open64
Version: Open64 Compiler Suite: Version 5.0
Options: -Wall -Werror -Drestrict= -Ofast
I tried out the options -funsafe-math-optimizations and -LNO:full_unroll=10, but they did not provide any improvement in my limited tests.

I tried out profile-guided optimizations with -fb-create and -fb-opt, but the training executable was too slow to be practical.
ENZO
Version: Nightly build 2012-06-19
Options: -Wall -Werror -Drestrict= -Ofast
Profile-guided optimization was not working in this version.
PGI
Version: pgcpp 12.5-0 64-bit target on x86-64 Linux -tp sandybridge
Options: -fast -Mipa=fast,inline --restrict -I/usr/include/x86_64-linux-gnu
The -I option is present because PGI could not find the right standard headers without it. PGI is the only compiler that could not fully optimize even the simplest FTensor expressions. It also did a poor job of optimizing the simplest C-tran expressions, though it was competitive when the C-tran expressions got more complicated. The PGI website suggests adding -Msmartalloc --zc_eh, but this did not seem to affect the results. Adding -O4, -Mfprelaxed, -Mprefetch, or replacing -fast with -fastsse also did not change the results.

While the 2002 benchmarks were run on a wide variety of operating systems and hardware, the 2012 tests were all run on my Lenovo ThinkPad x220T with an Intel Core i7-2640M CPU running at 2.8 GHz. For all of the compilers except the Microsoft compiler, the operating system was 64 bit Debian Linux testing "wheezy". Total run times on Linux were measured with the time command. Total run times for Microsoft Visual C++ under 64 bit Windows 7 were measured with Measure-Command in PowerShell.

One possible source of bias is that the benchmarks were run under two different operating systems. However, the only system service the code uses is printing the result at the end. No exceptions are thrown or caught. Also, all of the compilers except PGI optimize the simplest C-tran and FTensor expressions to the same run time of about 1.45 s.

Each benchmark was run 5 times, and I took the median. I would estimate the run-to-run variation to be less than 10%.
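The run-five-times-and-take-the-median procedure can be sketched as a small Python harness. This is not the script used for these benchmarks, which were timed with the time command and Measure-Command. The function name median_runtime is hypothetical, and the spread it reports (max minus min, relative to the median) is just one simple way to put a number on the variation.

```python
import statistics
import subprocess
import sys
import time


def median_runtime(cmd, runs=5):
    """Run cmd `runs` times; return (median seconds, relative spread)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so printing does not clutter the terminal.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    # Spread of the samples relative to the median, e.g. 0.05 means 5%.
    spread = (max(samples) - min(samples)) / median
    return median, spread


if __name__ == "__main__":
    # Stand-in command (an assumption); replace it with the benchmark executable.
    median, spread = median_runtime([sys.executable, "-c", "pass"])
    print(f"median {median:.3f} s, spread {spread:.1%}")
```

The median is less sensitive than the mean to a single slow run caused by background activity, which matters when only five samples are taken.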

Walter Landry
