accULL release 0.2

It’s been a while since we last published an update, but that does not mean we have not been working!

Thanks to the benchmark codes provided by Nick Johnson (EPCC), we have been able to detect several situations that were not properly handled by our implementation. We also took the opportunity to do some “house cleaning”, and we added a set of tests with an auto-verification script to ease the maintenance of the code. Since the 0.1a release that we published on October 11, 2012, we have committed around 50 changesets to the compiler and more than 20 to the runtime, so we believe it is that time of the year when we pack everything up, write some documentation and release a new version: release 0.2. Still far from version 1.0, but getting closer.

Many thanks to all the people who have contributed to this release, in particular Juan J. Fumero and José L. Grillo, from the University of La Laguna, who have done an incredible job getting everything ready for this release!

The new version can be downloaded here. Below is a list of the relevant features added and issues fixed in this new release:

  • Added 20 new validation tests
  • Added a script to run and check the tests automatically
  • Added support for the acc parallel directive, including the num_gangs and num_workers clauses (see the sketch after this list)
  • Improved support for the if clause
  • Many minor/weird bugs fixed in both the compiler and the runtime
  • Added support for the firstprivate clause (including arrays)
  • Removed the requirement of putting reduction variables in a copy clause before using them
  • Script to ease the compilation of source code (just type ./accull source.c)
  • Some cleanup in the Frangollo code generator
  • Added support for Kepler cards to the runtime
  • Code generation should be slightly faster than before
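
As a taste of these new features, here is a minimal sketch of a single OpenACC region combining several of them; the gang/worker counts and variable names are illustrative only, not taken from our test suite:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], sum = 0.0f;
    float scale = 2.0f;   /* passed into the region via firstprivate */
    int   use_gpu = 1;    /* runtime switch honoured by the if clause */

    for (int i = 0; i < N; i++)
        a[i] = 1.0f;

    /* acc parallel with explicit gang/worker counts; the reduction
       variable no longer needs a copy clause of its own. */
    #pragma acc parallel loop if(use_gpu) num_gangs(32) num_workers(8) \
            copyin(a) firstprivate(scale) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += scale * a[i];

    printf("sum = %f\n", sum);   /* expect 2048.0 */
    return 0;
}
```

A file like this can be compiled directly with the new driver script: ./accull example.c.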

As you can see, the majority of the changes have affected the compiler. We expect to publish a new release shortly, with many changes to the runtime addressing cleanup and performance. Stay tuned!


HPC Europa2 visit

DISCLAIMER: The comparisons shown in the following figures do not illustrate the general performance of the different compilers, but the performance of particular versions of those compilers on these particular test cases, implemented with the knowledge of an MSc student. Performance on real applications, or on codes implemented by experts in the area, could differ dramatically across implementations. We discourage users from using the information published here for other purposes. In addition, the accULL compiler is a research implementation which should not be used for professional purposes. The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of the accULL project, the GCAP group or the University of La Laguna.

During June and July 2012, one of our master's students (Iván López) visited EPCC in Edinburgh.

In this incredible Scottish city, you can find one of the most relevant centres for High Performance Computing.


During my stay, we carried out a study of the status, at the time, of the different OpenACC implementations, using the resources that EPCC and the HPC Europa2 programme provided to us.

We chose three codes from the Rodinia benchmark suite: HotSpot, Path Finder and a non-blocked LU decomposition, plus a blocked matrix multiplication implementation for exploratory purposes. We used an Nvidia C2050 GPU with 3 GB of memory, attached to a quad-core Intel i7 (2.8 GHz). The CUDA version available at the time was 4.1.

As usual in our comparisons, we try to illustrate the “time-to-solution”; thus, we include memory transfer times, but not initialisation, since it is possible to hide this cost by using an external program to open the device (as the PGI compiler does).
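
As an illustration, here is a sketch of how such a measurement can be set up with plain OpenACC, assuming an Nvidia device and using the standard acc_init() runtime call to open it before the clock starts:

```c
#include <stdio.h>
#include <sys/time.h>
#include <openacc.h>

#define N (1 << 20)

static float a[N];

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    /* Open the device up front so that one-off initialisation is
       excluded from the measurement. */
    acc_init(acc_device_nvidia);

    double t0 = wtime();          /* clock starts: transfers + kernel */
    #pragma acc parallel loop copy(a)
    for (int i = 0; i < N; i++)
        a[i] = 2.0f * i;
    double t1 = wtime();          /* clock stops */

    printf("time-to-solution: %.6f s\n", t1 - t0);
    return 0;
}
```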

We used version 12.5 of the PGI compiler tool-kit, which features initial OpenACC support. The version of the CAPS HMPP compiler with OpenACC support was 3.2.0, the latest available at that time.

[Figure: performance of each compiler relative to the native CUDA implementation]

The results, shown in the previous image, are expressed as a percentage of the performance of the native CUDA implementation. For the HotSpot (HS) test case, the generated code almost reaches 70% of the native CUDA performance. However, the performance for the blocked matrix multiplication is barely 5% of the native one. It is worth noting that the native implementation chosen for the MxM is the DGEMM routine from the CUBLAS library, which is highly optimised.

One of the most important aspects affecting performance is the choice of thread and kernel block configuration. OpenACC provides the gang, worker and vector clauses to enable users to manually tune the shape of a kernel.
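
For instance, a simple two-dimensional kernel can be shaped explicitly from the parallel directive; the counts below are placeholders rather than tuned values:

```c
#include <stdio.h>

#define N 512

int main(void)
{
    static float a[N][N], b[N][N], c[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0f;
            b[i][j] = 2.0f;
        }

    /* Shape the kernel explicitly: rows spread over gangs and workers,
       columns over the vector dimension.  The counts here are
       placeholders; finding the sweet spot is per-compiler work. */
    #pragma acc parallel loop gang worker \
            num_gangs(64) num_workers(4) vector_length(128) \
            copyin(a, b) copyout(c)
    for (int i = 0; i < N; i++) {
        #pragma acc loop vector
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
    }

    printf("c[0][0] = %f\n", c[0][0]);   /* expect 3.0 */
    return 0;
}
```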

The following graph illustrates the effect that varying the gang, worker and vector values has on the overall performance, and how this effect differs from one compiler implementation to another.

[Figure: effect of the gang, worker and vector values on overall performance for each compiler]

It is important to use an appropriate combination of the different scheduling clauses in order to extract the maximum performance from each implementation, particularly with the CAPS compiler. And finally, despite the cold of Scotland and its strange animals, we can say that the time spent at EPCC was really worth it.
