Optimizing a stencil code with OpenACC

A couple of weeks ago we received an OpenACC/OpenMP code from one of our users that our compiler could not parse properly. Our current release (0.3) did not handle the mixed pragmas correctly and crashed badly.

There was also a problem with the way array indices were interpreted by our kernel generator when using kernels regions, a problem that did not appear when using the parallel directive!

While fixing these bugs we took the opportunity to clean up the kernel generation path, and we added new _ReverseVisitor_ iterators that can traverse the AST from the leaves to the root faster than the previous implementation.

Once we fixed these issues, we realised that the code could be written in a different way to improve performance. accULL follows the OpenACC 1.0 standard closely, and does not magically add or transform pieces of code the way other compilers do. For example, the following is commonly used with the PGI compiler:

double A[N][M];
...
#pragma acc data copyin(A)
...

However, in both accULL and CAPS 3.3.4, it is required to specify the size of the data to transfer:

...
#pragma acc data copyin(A[0:n*m])
...

This seems unnecessary in this particular case, but in more complex situations guessing the size of the variable requires a complex source analysis.
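
For instance, if the array only reaches the data region as a plain pointer, its extent is simply not visible at the point of the directive. The sketch below (a hypothetical routine, not taken from the user code) shows such a case, where the bounds have to be supplied by hand:

// Hypothetical example: "a" is just a pointer here, so the compiler
// cannot infer how much data to transfer and the bounds must be given.
void smooth(double *a, double *anew, int n, int m)
{
   #pragma acc data copyin(a[0:n*m]) create(anew[0:n*m])
   {
      ...
   }
}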

A similar problem occurs with scalar variables. With code like the following, the printf will not show "2" when using accULL or CAPS 3.3.4.

double error = 1.0;
#pragma acc kernels loop copy(error)
...
   error = 2.0;
...
printf("Error %g \n", error);

Scalars are passed by value and cannot be restored. To force copying the value, one should write the following:

double error = 1.0;
double * ptr = & error;
...
#pragma acc kernels loop copy(ptr[0:1])
...
   ptr[0] = 2.0;
...

Or, more simply, using array notation:

double error[1] = { 1.0 };
...
#pragma acc kernels loop copy(error[0:1])
...
   error[0] = ...
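
Put together, a minimal self-contained sketch of this workaround (our own toy example, not the user's code) would be:

// Minimal sketch of the scalar copy-back workaround using array notation
#include <stdio.h>

int main(void)
{
   double error[1] = { 1.0 };

   #pragma acc kernels loop copy(error[0:1])
   for ( int i = 0; i < 1; i++ )
      error[0] = 2.0;

   printf("Error %g \n", error[0]);   // now prints 2 as expected
   return 0;
}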

In order to optimize the transfer time, which is one of the main performance bottlenecks in this kind of code, we need to reduce the amount of data transferred in. The code we are working on is a stencil-like application and contains two regions suitable to run on the GPU. The first region computes new values on a temporary matrix (Anew) and computes the error:

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      Anew[j][i] = (  A[j][i] + ... ) * 0.25;
      error = max( error, Anew[j][i] - A[j][i] );
   }

After this, another kernel region copies the information back to the original matrix, updating its elements individually.

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      A[j][i] = Anew[j][i];
   }

Both regions reside within a while loop that runs until a given error criterion is met.

while ( error[0] > tol ... )
{
   // Region 1
   ...
   // Region 2
   ...
}

When running the code with a 4096×4096 problem size, accULL required 4m13.250s whereas PGI required 3m14.513s. A closer look at the profiling output below reveals that nearly 80% of the time is spent transferring data when using accULL.

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
45.36,105.4921,6000,17.58202,2.40e-03,89.08226,"[CUDA memcpy DtoH]"
33.69,78.35808,7000,11.19401,8.64e-04,84.84136,"[CUDA memcpy HtoD]"
12.84,29.86713,1000,29.86713,29.51034,30.36115,"_laplace2d_random_1"
8.10,18.84783,1000,18.84783,18.51467,19.28088,"_laplace2d_random_2"

Although the code performs more than 900 iterations for this problem size, it is not strictly necessary to transfer the data in and out on each iteration, so the overall amount of data copied in and out can be reduced. Using a single data region that encapsulates the while loop accomplishes this. A sketch of the code is shown below.

// Code simplified for clarity
#pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
{
   while ( error[0] > tol ... ) {

      #pragma acc kernels loop copy(error[0:1])
      for ( int j = 1; ... )
      for ( int i = 1; ... )
      {
         Anew[j][i] = (  A[j][i] + ... ) * 0.25;
         error[0] = max( error[0], Anew[j][i] - A[j][i] );
      }

      #pragma acc kernels loop
      for ( int j = 1; ... )
         for ( int i = 1; ... )
            A[j][i] = Anew[j][i];

   }
}
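
For completeness, a self-contained version of the optimized code might look as follows. This is only a sketch under our own assumptions: the matrices are stored as flat 1D arrays so that the [0:n*m] clauses match the indexing, the sizes, tolerance, iteration limit and boundary initialization are placeholders, and fmax/fabs are used where the simplified snippets just say max.

// Illustrative, self-contained sketch; sizes and initialization are placeholders
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
   const int n = 4096, m = 4096, iter_max = 1000;
   const double tol = 1.0e-6;

   double *A    = (double *) calloc(n * m, sizeof(double));
   double *Anew = (double *) malloc(n * m * sizeof(double));
   double error[1] = { 1.0 };
   int iter = 0;

   // Simple boundary condition on the first column
   for ( int j = 0; j < n; j++ )
      A[j * m] = 1.0;

   #pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
   {
      while ( error[0] > tol && iter < iter_max )
      {
         error[0] = 0.0;

         // Region 1: compute the new values and the maximum change
         #pragma acc kernels loop copy(error[0:1])
         for ( int j = 1; j < n - 1; j++ )
            for ( int i = 1; i < m - 1; i++ )
            {
               Anew[j * m + i] = ( A[j * m + i + 1] + A[j * m + i - 1]
                                 + A[(j - 1) * m + i] + A[(j + 1) * m + i] ) * 0.25;
               error[0] = fmax(error[0], fabs(Anew[j * m + i] - A[j * m + i]));
            }

         // Region 2: copy the new values back into the original matrix
         #pragma acc kernels loop
         for ( int j = 1; j < n - 1; j++ )
            for ( int i = 1; i < m - 1; i++ )
               A[j * m + i] = Anew[j * m + i];

         iter++;
      }
   }

   printf("Finished after %d iterations, error %g\n", iter, error[0]);
   free(A);
   free(Anew);
   return 0;
}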

This greatly improves performance in both PGI and accULL (~40s and ~56s respectively). Now the time is spent on kernel computation rather than memory transfers:

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
61.54,30.32356,1000,30.32356,30.12888,30.55630,"_laplace2d_random_1"
38.32,18.88398,1000,18.88398,18.59863,19.33076,"_laplace2d_random_2"
0.13,0.06502,1003,0.06482,8.64e-04,64.11577,"[CUDA memcpy HtoD]"
0.00,2.08e-03,1000,2.08e-03,2.02e-03,2.50e-03,"[CUDA memcpy DtoH]"

We estimate that there is an overhead of around 8s in accULL due to the nature of the initialization process, where we check every time for CUDA and OpenCL support, the kinds of devices available and other platform details. We expect that, by reducing this initialization time, our performance figures will get closer to those offered by commercial compilers.

Keep watching this space for updates on accULL. If you are trying to run a code using accULL, do not hesitate to contact us; we can help each other!

J. Lucas Grillo

accULL release 0.3

Well, after a summer break and some Hangout meetings with the team, it is time again to publish a new release of accULL. Release 0.3 (codename: Lucas) fixes many bugs in the compiler and in the toolchain. Although we circulated an alpha a while ago, we have kept adding features and fixing bugs thanks to the feedback we have received from users.

Some people reported problems when using automake 1.4 while building the runtime, so we updated the autoconf and automake files so that the build no longer uses subdirs but a cleaner way to produce the Frangollo library. It is, however, a static library. It should not be difficult to build a dynamic library, and we may add that option in a future release.

We have improved the implementation of the parallel directive. In the previous release it used a default kernel launch configuration of 16 threads per grid dimension. Now we use the same estimator we use for the kernels region, so users should see at least some performance improvement.

We now have an accull repository and a project webpage (on Bitbucket, of course!) where you can get the latest versions and information about the project. You can get the latest released package from the Downloads section, or, if you are feeling in the mood for an adventure, you can clone the development repository, which will fetch the latest versions of Frangollo and YaCF from their respective repositories. Feel free to experiment and report issues or feature requests!