What is the GNU toolchain?
In this blog we will focus on two components of the GNU toolchain: the GNU Compiler Collection (GCC) and the GNU C Library (glibc). A full toolchain contains several other vital components, such as assemblers, linkers and debuggers, but in this blog we are focusing on the compiler and the C library.
How important is it?
Very! GCC is the platform compiler for major Linux distributions such as Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Ubuntu Linux and many more. That means it is used to compile the Linux kernel, all the supporting system components, and the software packages that constitute a modern Linux distribution. It is also the default compiler for developers using these distributions for software engineering. Correspondingly, glibc is the default C library on these systems, providing the backbone for the extraordinary diversity of functionality, performance and security required by modern software.
Given the above, we are hard at work making sure the GNU toolchain is the best it can be on Arm platforms. While some of the work presented here is by Arm engineers, we must emphasize that all of it is only possible because of our collaboration with the strong GNU toolchain community. Check out the various blogs throughout the community to get a feel for the breadth of work that is being done!
Toolchain performance
One of the areas we focus on is improving the performance of applications built with the GNU toolchain. There are many ways to do this, and in this blog we present the highlights from our work in GCC and glibc, as these are the two toolchain components that affect performance the most.
Improvements in GCC
The GNU Tools team at Arm has been hard at work doing our share to make this release the best version of GCC for Arm platforms to date. The project follows an annual release cadence, and the 2018 release, GCC 8, has too many improvements to list in this blog! I would, however, like to highlight some of the many optimisation improvements that GCC gained over the last development cycle:
- GCC gains a new loop interchange pass. This pass transforms loop nests to improve use of the data cache and makes memory access patterns more friendly for crucial subsequent optimisations like auto-vectorisation. It is a well-studied transform that has been missing a good implementation in GCC. Until now! It is enabled by default at high optimisation levels and has already shown its utility by accelerating multiple benchmarks, with a highlight of more than 10% on the 503.bwaves benchmark from the popular SPEC CPU 2017 benchmark suite. This is a phenomenal performance improvement, reproducible across all Arm processors and provided as part of the default toolchain for all users of GCC 8. Consider the loop:
for (int j = 0; j < N; j++)
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
The loop interchange pass can transform this into:
for (int i = 0; i < N; i++)      // i, j, k interchanged
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
We can see that the memory access pattern for c[i][j] has changed to a more cache-friendly iteration. When the elements of a row of the array c, previously strided through by the fast-moving i index, lie in the same cache line, the interchanged access pattern makes much better use of the data locality.
- The loop distribution pass in GCC is extended to handle more complex situations present in real code. Complex loops that contain vectorisable sequences mixed with non-vectorisable ones (for example, due to loop-carried dependencies or complex data aliasing layouts) can be separated into their own loops. The parts that are vectorisable can then be vectorised independently of the rest of the code, giving the expected performance uplift. Again, this is not an academic prototype implementation but production-ready functionality that is enabled by default in the compiler at high optimisation levels, giving an improvement of over 25% on the 456.hmmer benchmark from the SPEC CPU 2006 benchmark suite. This pass is a very powerful tool: the analysis it does can be used for many exciting optimisations in the compiler. For example, the code below:
#define M (256)
#define N (512)

struct st
{
  int a[M][N];
  int c[M];
  int b[M][N];
};

void
foo (struct st *p)
{
  for (unsigned i = 0; i < M; ++i)
    {
      p->c[i] = 0;
      for (unsigned j = N; j > 0; --j)
        {
          p->a[i][j - 1] = 0;
          p->b[i][j - 1] = 0;
        }
    }
}
is now optimised into a single call to the standard memset function instead of initialising each field of the struct separately:
foo:
        mov     x2, 1024
        movk    x2, 0x10, lsl 16  // size of memory to initialise is size of whole 'st' struct in bytes
        mov     w1, 0             // initialise memory with zero
        b       memset
We take our role in the GNU developer community very seriously, and all such impactful improvements are presented to the community, co-designed when possible and iterated through cycles of feedback until we have a solution that works not only for our convenience but is maintainable, scalable and usable by as many consumers of the toolchain as possible. We encourage strong participation at developer conferences and present on all kinds of topics, from Bin Cheng presenting the above loop optimisation work to James Greenhalgh presenting our performance tracking methodology.
Improvements in glibc
The glibc project has been pretty active as well. Many real-world applications spend large portions of their execution time in the library. Arm collaborated with the excellent glibc community to deliver some truly exciting improvements for the 2.27 release in February 2018 and the preceding 2.26 release:
- The most frequently used single-precision floating-point math routines expf, powf, logf and their derivatives were rewritten from the ground up. The new approach uses double-precision hardware to accelerate single-precision arithmetic, combined with improvements to the approximation algorithms, to achieve massive improvements in latency and throughput, of the order of 200% and 300% over the previous implementations. On top of that, the new implementations achieve better precision and are written in completely portable standard C, replacing hard-to-maintain assembly implementations on some targets and improving the maintainability of the codebase as well. Szabolcs Nagy provided the new implementations and collaborated with the community to integrate this awesome work into the upstream glibc release. Thanks to these new routines, using glibc 2.27 gives a whopping 60% improvement on the 521.wrf benchmark from the SPEC CPU 2017 suite! That by itself pushes the entire aggregate SPEC fprate 2017 score up by 3%.
- In response to a customer observation about inconsistent performance of the standard input/output function getchar, we investigated and improved the locking sequence, giving upwards of 400% improvement in single-threaded code that uses this common function heavily.
- Wilco Dijkstra added an optimised implementation of the memcmp function, improving its performance by 25% on aligned memory arguments and by more than 500% on unaligned arguments.
- Unnecessary synchronisation was removed when accessing Thread Local Storage (TLS) variables from a shared library. This roughly halves the access time to these variables on AArch64 platforms.
- Memory allocation and deallocation is one of the core functions of a C library and is tricky to get right because so many workloads depend on it. Finding the right balance between memory use and execution speed, measured in single-threaded and multi-threaded environments across the whole gamut of supported architectures, is not a task for the faint-hearted! The glibc community (and a call-out here to our friends at Red Hat) put a lot of effort into improving the algorithms used for memory allocation, and everyone benefits. From the malloc improvements in glibc 2.26 we see gains of 3% and above in benchmarks like 523.xalancbmk from SPEC CPU 2017 and other malloc-heavy workloads.
Putting it all together
Users of Linux distributions that ship these newer versions of GCC and glibc get these and many more improvements as part of their out-of-the-box experience. Our performance tracking metrics show that using the 2018 state-of-the-art components of the GNU toolchain against the equivalent early-2017 releases gives an uplift of at least 1.5% on the aggregate SPEC intrate score of the SPEC CPU 2017 suite and around 8% on the SPEC fprate aggregate score. A pretty good uplift from just upgrading the software stack! The SPEC CPU benchmarks are derived from real-world software packages that have, in some cases, been optimisation targets for decades. And remember, these are just the aggregate scores in one benchmark suite. Individual applications, depending on their execution profile, may gain much more.
This post focuses on performance improvements, but the GNU toolchain is about so much more. Check out the long list of new features and improvements in GCC 8 on the main project page: support for bleeding-edge language standards, novel architecture features like the Arm Scalable Vector Extension (SVE), the Armv8.4-A architecture, the latest processors spanning from the smallest embedded applications to the largest HPC behemoths, and much more.
What's next?
The wheels of progress never stop turning. The GNU toolchain community and our team here at Arm are already hard at work improving the toolchain for the 2019 releases. We've got some very exciting projects in flight that we hope to share with you throughout the year.
We will be providing more visibility into the work we do to improve the GNU software ecosystem, as well as ways you can get involved, give us feedback, and tell us which areas you'd like to see improved.
Thank you for reading, and watch this space: this will be an exciting year for the GNU toolchain on Arm.