Optimizing a stencil code with OpenACC

A couple of weeks ago we received an OpenACC/OpenMP code from one of our users that our compiler could not parse properly. Our current release (0.3) did not handle the mixed pragmas correctly and crashed badly.

There was also a problem with the way array indexes were interpreted by our kernel generator when using kernels regions – a problem that did not appear when using the parallel directive!

While fixing these bugs we took the opportunity to do some cleanup on the kernel generation path, and we added new _ReverseVisitor_ iterators that traverse the AST from the leaves to the root faster than the previous implementation.

Once we fixed these issues, we realised that the code could be written in a different way to improve performance. accULL follows the OpenACC 1.0 standard closely, and does not magically add or transform pieces of code the way other compilers do. For example, the following is commonly used with the PGI compiler:

double A[N][M];
...
#pragma acc data copyin(A)
...

However, both accULL and CAPS 3.3.4 require the size of the data to transfer to be specified:

...
#pragma acc data copyin(A[0:n*m])
...

This may seem unnecessary in this particular case, but in more complex situations, inferring the size of the variable requires non-trivial source analysis.
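To make the difference concrete, here is a minimal sketch of a situation where the explicit range is genuinely needed: a matrix allocated dynamically and accessed through a plain pointer, whose type carries no extent information the compiler could use to size the transfer. The names n, m and A below are illustrative, not taken from the user's code.

// Minimal sketch: dynamically allocated matrix stored as a flat array.
// The declared type (double *) says nothing about its size, so the
// range of elements to transfer must be stated in the clause.
int n = 4096, m = 4096;
double *A = (double *) malloc(n * m * sizeof(double));
...
#pragma acc data copyin(A[0:n*m])
{
   // kernels regions indexing A[j*m + i]
}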

A similar problem occurs with scalar variables. Code like the following will not show "2" in the printf when using accULL or CAPS 3.3.4.

double error = 1.0;
#pragma acc kernels loop copy(error)
...
   error = 2.0;
...
printf("Error %g \n", error);

Scalars are passed by value, so their updated values cannot be restored on the host. To force the value to be copied back, one should write the following:

double error = 1.0;
double * ptr = & error;
...
#pragma acc kernels loop copy(ptr[0:1])

Or, more simply, using array notation:

double error[1] = { 1.0 };
...
#pragma acc kernels loop copy(error[0:1])
...
   error[0] = ...
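Putting the whole workaround together, a minimal self-contained version would look like this. The loop body is only a placeholder (the original kernel is not shown here), and the array v and its size are our own invention for illustration:

#include <stdio.h>

int main(void)
{
   double error[1] = { 1.0 };
   double v[128];

   // error is an array of one element, so it can appear in a copy clause
   #pragma acc kernels loop copy(error[0:1], v[0:128])
   for ( int i = 0; i < 128; i++ )
   {
      v[i] = 2.0 * i;
      if (i == 0)
         error[0] = 2.0;   // written on the device, copied back on region exit
   }

   printf("Error %g \n", error[0]);   // now prints 2
   return 0;
}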

In order to optimize the transfer time, which is one of the main performance bottlenecks for this kind of code, it is necessary to reduce the amount of data transferred. The code we are working on is a stencil-like application and contains two regions suitable for running on the GPU. The first region updates a temporary matrix (Anew) and computes the error:

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      Anew[j][i] = (  A[j][i] + ... ) * 0.25;
      error = max( error, Anew[j][i] - A[j][i] );
   }

After this, another kernels region copies the information back to the original matrix, updating its elements individually.

// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
   for ( int i = 1; ... )
   {
      A[j][i] = Anew[j][i];
   }

Both regions reside within a while loop which runs until a particular error criterion is met.

while ( error[0] > tol ... )
{
   // Region 1
   ...
   // Region 2
   ...
}
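Before looking at the numbers, here is a rough sketch of the complete iteration as it stood at this point. The loop bounds, the tol and iter_max names and the exact stencil (the standard four-neighbour Jacobi update) are assumptions on our part, since the listings above are simplified:

// Sketch of the unoptimized iteration (bounds and names are assumed):
// each pass through the while loop opens fresh kernels regions, so A,
// Anew and error move between host and device on every iteration.
while ( error[0] > tol && iter < iter_max )
{
   error[0] = 0.0;

   // Region 1: stencil update and error computation
   #pragma acc kernels loop copyin(A[0:n*m]) copy(Anew[0:n*m], error[0:1])
   for ( int j = 1; j < n - 1; j++ )
      for ( int i = 1; i < m - 1; i++ )
      {
         Anew[j][i] = ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i] ) * 0.25;
         error[0] = max( error[0], Anew[j][i] - A[j][i] );
      }

   // Region 2: copy the updated values back into the original matrix
   #pragma acc kernels loop copyin(Anew[0:n*m]) copy(A[0:n*m])
   for ( int j = 1; j < n - 1; j++ )
      for ( int i = 1; i < m - 1; i++ )
         A[j][i] = Anew[j][i];

   iter++;
}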

When running the code with a 4096×4096 problem size, accULL required 4m13.250s whereas PGI required 3m14.513s. A closer look at the profiling output below reveals that nearly 80% of the time is spent transferring data when using accULL.

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
45.36,105.4921,6000,17.58202,2.40e-03,89.08226,"[CUDA memcpy DtoH]"
33.69,78.35808,7000,11.19401,8.64e-04,84.84136,"[CUDA memcpy HtoD]"
12.84,29.86713,1000,29.86713,29.51034,30.36115,"_laplace2d_random_1"
8.10,18.84783,1000,18.84783,18.51467,19.28088,"_laplace2d_random_2"
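(Output in this comma-separated form can be produced with the CUDA profiler, e.g. with something along the lines of the command below; the executable name is illustrative.)

nvprof --csv ./laplace2d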

Although the code performs more than 900 iterations for this problem size, it is not strictly necessary to transfer the data in and out on each iteration, and the overall amount of data copied in and out can be reduced. Using a single data region enclosing the while loop accomplishes this. A sketch of the code is shown below.

// Code simplified for clarity
#pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
{
   while ( error[0] > tol ... ) {

      #pragma acc kernels loop copy(error[0:1])
      for ( int j = 1; ... )
      for ( int i = 1; ... )
      {
         Anew[j][i] = (  A[j][i] + ... ) * 0.25;
         error[0] = max( error[0], Anew[j][i] - A[j][i] );
      }

      #pragma acc kernels loop
      for ( int j = 1; ... )
         for ( int i = 1; ... )
            A[j][i] = Anew[j][i];

   }
}
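Note that with copyin(A) the converged matrix stays on the device when the data region ends; the simplified listing omits how the result gets back to the host. One way to do it (our assumption, not necessarily what the original code does) is an update directive after the while loop, still inside the data region, or alternatively a copy(A[0:n*m]) clause instead of copyin:

// Assumed addition: inside the data region, after the while loop finishes
#pragma acc update host(A[0:n*m])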

This greatly improves performance with both PGI and accULL (~40s and ~56s respectively). Now the time is spent on kernel computation rather than on memory transfers:

======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,s,,ms,ms,ms,
61.54,30.32356,1000,30.32356,30.12888,30.55630,"_laplace2d_random_1"
38.32,18.88398,1000,18.88398,18.59863,19.33076,"_laplace2d_random_2"
0.13,0.06502,1003,0.06482,8.64e-04,64.11577,"[CUDA memcpy HtoD]"
0.00,2.08e-03,1000,2.08e-03,2.02e-03,2.50e-03,"[CUDA memcpy DtoH]"

We estimate that around 8s of the accULL time is overhead from the initialization process (where we check every time for CUDA and OpenCL support, the kinds of devices available and other platform details). We expect that, by reducing this initialization time, our performance figures will get closer to those offered by commercial compilers.

Keep watching this space for updates on accULL. If you are trying to run a code using accULL, do not hesitate to contact us; we can help each other!

J. Lucas Grillo
