Optimizing a stencil code with OpenACC

A couple of weeks ago we received an OpenACC/OpenMP code from one of our users that our compiler could not parse properly. Our current release (0.3) did not handle the mixed pragmas correctly and crashed badly.

There was also a problem with the way array indexes were interpreted by our kernel generator when using kernels regions, a problem that did not occur when using the parallel directive!

While fixing these bugs we took the opportunity to do some cleanup on the kernel generation path, and we added new _ReverseVisitor_ iterators that can explore the AST from the leaves to the root faster than the previous implementation.

Once we fixed these issues, we realised that the code could be written in a different way to improve performance. accULL follows the OpenACC 1.0 standard closely, and does not magically add or transform pieces of code the way some other compilers do. For example, the following is commonly used with the PGI compiler:

double A[N][M];
...
#pragma acc data copyin(A)
...

However, in both accULL and CAPS 3.3.4, it is required to specify the size of the data to transfer:

...
#pragma acc data copyin(A[0:n*m])
...

This seems unnecessary in this particular case, but in more complex situations, inferring the size of the variable requires complex source analysis.

A similar problem occurs with scalar variables. A code like the following will not print "2" if compiled with accULL or CAPS 3.3.4.

double error = 1.0;
#pragma acc kernels loop copy(error)
...
   error = 2.0;
...
printf("Error %g \n", error);

Scalars are passed by value, so their final value cannot be restored. To force copying the value back, one should write the following:

double error = 1.0;
double * ptr = & error;
...
#pragma acc kernels loop copy(ptr[0:1])

Or, more simply, using array notation:

double error[1] = { 1.0 };
...
#pragma acc kernels loop copy(error[0:1])
...
   error[0] = ...

To optimize the transfer time, which is one of the main performance bottlenecks in this kind of code, it is necessary to reduce the amount of data transferred. The code we are working on is a stencil-like application and contains two regions suitable to run on the GPU. The first region computes data in a temporary matrix (Anew) and computes the error:

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      Anew[j][i] = (  A[j][i] + ... ) * 0.25;
      error = max( error, Anew[j][i] - A[j][i] );
   }

After this, another kernel region copies the information back to the original matrix, updating its elements individually.

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      A[j][i] = Anew[j][i];
   }

Both regions reside within a while loop which runs until a particular error criterion is met.

while ( error[0] > tol ... )
{
   // Region 1
   ...
   // Region 2
   ...
}

When running the code with a 4096x4096 problem size, accULL required 4m13.250s whereas PGI required 3m14.513s. A closer look at the profiling output below reveals that nearly 80% of the time is spent transferring data when using accULL.

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
45.36,105.4921,6000,17.58202,2.40e-03,89.08226,"[CUDA memcpy DtoH]"
33.69,78.35808,7000,11.19401,8.64e-04,84.84136,"[CUDA memcpy HtoD]"
12.84,29.86713,1000,29.86713,29.51034,30.36115,"_laplace2d_random_1"
8.10,18.84783,1000,18.84783,18.51467,19.28088,"_laplace2d_random_2"

Although the code performs more than 900 iterations for this problem size, it is not strictly necessary to transfer the data in and out in each iteration, and it is possible to reduce the overall amount of data copied in and out. Using a single data region encapsulating the while loop accomplishes this. A sketch of the code is shown below.

// Code simplified for clarity
#pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
{
   while ( error[0] > tol ... ) {

      #pragma acc kernels loop copy(error[0:1])
      for ( int j = 1; ... )
      for ( int i = 1; ... )
      {
         Anew[j][i] = (  A[j][i] + ... ) * 0.25;
         error[0] = max( error[0], Anew[j][i] - A[j][i] );
      }

      #pragma acc kernels loop
      for ( int j = 1; ... )
         for ( int i = 1; ... )
            A[j][i] = Anew[j][i];

   }
}

This greatly improves performance with both PGI and accULL (~40s and ~56s respectively). Now the time is spent on kernel computation rather than on memory transfers:

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
61.54,30.32356,1000,30.32356,30.12888,30.55630,"_laplace2d_random_1"
38.32,18.88398,1000,18.88398,18.59863,19.33076,"_laplace2d_random_2"
0.13,0.06502,1003,0.06482,8.64e-04,64.11577,"[CUDA memcpy HtoD]"
0.00,2.08e-03,1000,2.08e-03,2.02e-03,2.50e-03,"[CUDA memcpy DtoH]"

We estimate that there is an excess of 8s in accULL due to the nature of the initialization process (where we check every time for CUDA and OpenCL support, the kinds of devices available and other platform details). We expect that, by reducing this initialization time, our performance figures will get closer to those offered by commercial compilers.

Keep watching this space for updates on accULL. If you are trying to run a code using accULL, do not hesitate to contact us; we can help each other!

J. Lucas Grillo

accULL release 0.3

Well, after a summer break and some Hangout meetings with the team, it is time again to publish a new release of accULL. Release 0.3 (codename: Lucas) fixes many bugs in the compiler and in the toolchain. Although we circulated an alpha a while ago, we have since been adding features and fixing bugs thanks to the feedback we have received from users.

Some people reported problems with automake 1.4 while building the runtime, so we updated the autoconf and automake files so that they no longer use subdirs but a cleaner way to produce the Frangollo library. It is, however, a static library. It should not be difficult to build a dynamic library, and we may add that option in a future release.

We have improved the implementation of the parallel directive. In the previous release it used a default kernel launch configuration of 16 threads per grid dimension. Now we use the same estimator we use for kernels regions, and users should see at least some performance improvement.

We now have an accULL repository and a project webpage (on Bitbucket, of course!) where you can get the latest versions and information about the project. You can get the latest released package from the Downloads section, or if you are feeling in the mood for an adventure, you can clone the development repository, which will fetch the latest versions of Frangollo and YaCF from their respective repositories. Feel free to experiment and report issues or feature requests!

accULL release 0.2

It’s been a while since we last published an update, but it does not mean we have not been working!

Thanks to the benchmark codes provided by Nick Johnson (EPCC), we have been able to detect several situations that were not properly addressed by our implementation. We also took the opportunity to do some “house cleaning”, and we added a set of tests with an auto-verification script to help the maintenance of the code. Since the 0.1a release that we published on October 11 2012, we have committed around 50 changesets to the compiler and more than 20 to the runtime, so we believe it is that time of the year when we pack everything, write some documentation and release a new version, release 0.2. Still far from version 1.0, but getting closer.

Many thanks to all the people who have contributed to this release, in particular Juan J. Fumero and José L. Grillo, from the University of La Laguna, who have done an incredible job getting everything ready for this release!

The new version can be downloaded here. The following is a list of relevant issues added or fixed in this release:

  • Added 20 new validation tests
  • Added a script to run and check the tests automatically
  • Added support for the acc parallel directive, including num_gangs and num_workers
  • Improved support for the if clause
  • Many minor/weird bugs fixed, both in the compiler and in the runtime
  • Added support for the firstprivate clause (including arrays)
  • Removed the requirement to put reduction variables in a copy clause before using them
  • Added a script to ease the compilation of source code (just type ./accull source.c)
  • Some cleanup in the Frangollo code generator
  • Added support for Kepler cards to the runtime
  • Code generation should be slightly faster than before

As you can see, the majority of the changes affect the compiler. We expect a new release shortly with many changes to the runtime, addressing cleanup and performance. Stay tuned!

HPC Europa2 visit

DISCLAIMER: The comparisons shown in the following figures do not illustrate the general performance of the different compilers, but the performance of particular versions of the compilers used, with these particular test cases, implemented with the knowledge of an MSc student. Performance on real applications, or on codes implemented by experts in the area, could differ dramatically across implementations. We discourage readers from using the information published here for other purposes. In addition, the accULL compiler is a research implementation which should not be used for professional purposes. The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of the accULL project, the GCAP group or the University of La Laguna.

During June and July 2012, one of our master students (Iván López) visited the EPCC at Edinburgh.

In this incredible Scottish city, you can find one of the most relevant centres of High Performance Computing.

[Image]

During my stay, we carried out a study of the status, at the time, of different OpenACC implementations, using the resources that EPCC and HPC-Europa2 provided to us.

We chose three codes from the Rodinia benchmark suite: HotSpot, PathFinder and a non-blocked LU decomposition, plus a blocked matrix multiplication implementation for exploratory purposes. We used an Nvidia C2050 GPU with 3 GB of memory, connected to a quad-core Intel i7 (2.8 GHz). The CUDA version available at the time was 4.1.

As usual in our comparisons, we try to illustrate the "time-to-solution"; thus, we include memory transfer times but not initialisation, since it is possible to hide this cost using an external program to open the device (as the PGI compiler does).

The results were obtained with version 12.5 of the PGI compiler tool-kit, which features initial OpenACC support. The version of the CAPS HMPP compiler with OpenACC support was 3.2.0, the latest available at that time.

[Image]

The results are shown in the previous image as a percentage of the performance relative to the native CUDA implementation. For the HotSpot (HS) test case, the generated code almost reaches 70% of the native CUDA performance. However, the performance of the blocked matrix multiplication is barely 5% of the total. It is worth noting that the native implementation chosen for the MxM is the DGEMM routine from the CUBLAS library, which is highly optimised.

One of the most important aspects affecting performance is the choice of thread and kernel block configuration. OpenACC provides the gang, worker and vector clauses to enable users to manually tune the shape of the kernel.

The following graph illustrates the effect that varying the number of gangs, workers and vector lanes has on the overall performance, and how this effect varies from one compiler implementation to another.

[Image]

It is important to use an appropriate combination of the different scheduling clauses in order to extract the maximum performance from the different implementations, particularly with the CAPS compiler. And finally, despite the cold of Scotland and its strange animals, we can say that the time spent at EPCC was really worth it.

[Image]

Running OpenACC without a GPU: a sneak peek of accULL

In our research group (GCAP) we have been working on directives for GPUs for the last three years (MICPRO-2011, JoS-2011). When the OpenACC standard was announced at SC11, we found plenty of similarities with the language we were working on at the time, and we immediately focused our work on supporting and improving the new standard.
Although the amount of work required to fully implement the OpenACC standard is not negligible, our compiler framework, yacf, provided us with the flexibility to build a parser and a lowering phase to our runtime in a couple of weeks.

Most of our recent work will appear at conferences in the upcoming months. We will be presenting contributions about accULL at the following conferences:

Feel free to speak with us there. We will provide slides and detailed information in the upcoming weeks.

The source code of the compiler framework is already available, and if you are interested we can provide the development version of accULL. A public repository will be available in the near future. In the meantime, we show here a short "teaser" of the OpenCL support that we've implemented in the runtime.

The picture below shows the execution time of a Molecular Dynamics simulation implemented in OpenACC, running on top of one of our development servers, M4. M4 is a shared memory system (that is, no GPU at all) with 4 Intel(R) Xeon(R) E7 4850 processors.

The usual approach to implementing algorithms for these architectures is to use OpenMP; however, using the Intel OpenCL platform it is also possible to run OpenACC on top of the server.

Red bars show the execution time of OpenMP (provided by GCC 4.4.5) and green bars show the execution time of the same code using the accULL OpenCL backend.

In this case, the runtime library detects that it is possible to avoid transfers and uses memory mapping instead, thus avoiding unnecessary copies. The Intel OpenCL platform takes advantage of the AVX instruction set, and the kernel is nicely transformed into vector instructions, which execute natively on the CPU.

accULL can also be used to run OpenACC programs while you are travelling (which is sometimes useful!). The following figure shows the same molecular dynamics simulation running on my laptop.

Using an environment variable, the runtime can switch between the internal GPU and the CPU (again using the Intel OpenCL platform). The largest problem size froze my laptop; the problem was too big for the 512Mb of graphics memory.

I hope this gives you an idea of the kind of work we are doing within accULL. More information will be available in the following weeks!