A couple of weeks ago we received an OpenACC/OpenMP code from one of our users that our compiler could not parse properly. Our current release (0.3) did not handle the mixed pragmas properly and crashed badly.
There was also a problem with the way array indexes were interpreted by our kernel generator when using kernels regions – a problem that did not occur when using the parallel directive!
While fixing these bugs we took the opportunity to do some cleanup on the kernel generation path, and we added new _ReverseVisitor_ iterators that can explore from the leaves of the AST to the root faster than the previous implementation.
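To give an idea of what these iterators do, here is an illustrative sketch of ours (not accULL's actual AST classes): each node keeps a pointer to its parent, so a reverse traversal can walk from a leaf straight up to the root without re-scanning the whole tree.

/* Illustrative sketch only: accULL's real AST classes are different. */
#include <stdio.h>

struct ast_node {
    const char *label;
    struct ast_node *parent;   /* NULL for the root */
};

/* Visit 'leaf' and then every ancestor, ending at the root. */
static void reverse_visit(struct ast_node *leaf,
                          void (*visit)(struct ast_node *))
{
    for (struct ast_node *n = leaf; n != NULL; n = n->parent)
        visit(n);
}

static void print_label(struct ast_node *n)
{
    printf("%s\n", n->label);
}

int main(void)
{
    struct ast_node root = { "function", NULL };
    struct ast_node loop = { "for",      &root };
    struct ast_node expr = { "A[j][i]",  &loop };

    reverse_visit(&expr, print_label);   /* prints: A[j][i], for, function */
    return 0;
}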
Once we fixed these issues, we realised that the code could be written in a different way to improve performance. accULL follows the OpenACC 1.0 standard closely, and does not magically add or transform pieces of code the way other compilers do. For example, the following is commonly used with the PGI compiler:
double A[N][M];
...
#pragma acc data copyin(A)
...
However, in both accULL and CAPS 3.3.4, it is required to specify the size of the data to transfer:
...
#pragma acc data copyin(A[0:n*m])
...
This seems unnecessary in this particular case, but in more complex situations, guessing the size of the variable requires complex source analysis.
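As a hypothetical illustration (our own example, not the user's code), consider a routine that only receives bare pointers together with their dimensions: the extent of the arrays is only known through n and m at run time, so the compiler cannot deduce the transfer size from the declarations and the programmer has to state it explicitly.

/* Hypothetical example: A and Anew arrive as bare pointers, so the
 * transfer size must be given explicitly in the data clauses. */
void smooth(double *restrict A, double *restrict Anew, int n, int m)
{
    #pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
    {
        /* ... kernels regions working on A and Anew ... */
    }
}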
A similar problem occurs with scalar variables. A code like the following will not show "2" in the printf when using accULL or CAPS 3.3.4.
double error = 1.0;
#pragma acc kernels loop copy(error)
...
error = 2.0;
...
printf("Error %g \n", error);
Scalars are passed by value and cannot be restored after the region. To force the value to be copied back, one should write the following:
double error = 1.0;
double *ptr = &error;
...
#pragma acc kernels loop copy(ptr[0:1])
Or, more simply, using array notation:
double error[1] = { 1.0 };
...
#pragma acc kernels loop copy(error[0:1])
...
error[0] = ...
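Putting the pieces together, a minimal self-contained version of the scalar example might look like the sketch below (the loop body is a placeholder of ours). With the array notation, the value updated on the device is visible on the host once the region finishes.

#include <stdio.h>

int main(void)
{
    /* One-element array instead of a scalar, so the device value is
     * copied back to the host when the region ends. */
    double error[1] = { 1.0 };

    #pragma acc kernels loop copy(error[0:1])
    for (int i = 0; i < 1; ++i)
        error[0] = 2.0;

    printf("Error %g\n", error[0]);   /* prints 2 */
    return 0;
}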
In order to optimize the transfer time, which is one of the main performance bottlenecks of this kind of code, it is necessary to reduce the amount of data transferred in. The code we are working on is a stencil-like application and contains two regions suitable to run on the GPU. The first region computes data on a temporary matrix (Anew) and computes the error:
// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
    for ( int i = 1; ... ) {
        Anew[j][i] = ( A[j][i] + ... ) * 0.25;
        error = max( error, Anew[j][i] - A[j][i] );
    }
After this, another kernels region copies the information back to the original matrix, updating its elements individually.
// Code simplified for clarity
#pragma acc kernels loop
for ( int j = 1; ... )
    for ( int i = 1; ... ) {
        A[j][i] = Anew[j][i];
    }
Both regions reside within a while loop that runs until a particular error criterion is met.
while ( error[0] > tol ... ) {
    // Region 1
    ...
    // Region 2
    ...
}
When running the code with a 4096×4096 problem size, accULL required 4m13.250s whereas PGI required 3m14.513s. A closer look at the profiling output below reveals that nearly 80% of the time is spent transferring data when using accULL.
======== Profiling result:
Time(%)   Time (s)    Calls   Avg (ms)    Min (ms)    Max (ms)    Name
 45.36    105.4921     6000   17.58202    2.40e-03    89.08226    [CUDA memcpy DtoH]
 33.69     78.35808    7000   11.19401    8.64e-04    84.84136    [CUDA memcpy HtoD]
 12.84     29.86713    1000   29.86713    29.51034    30.36115    _laplace2d_random_1
  8.10     18.84783    1000   18.84783    18.51467    19.28088    _laplace2d_random_2
Although the code performs more than 900 iterations for this problem size, it is not strictly necessary to transfer the data in and out at each iteration, so the overall amount of data copied in and out can be reduced. Using a single data region encapsulating the while loop accomplishes this. A sketch of the code is shown below.
// Code simplified for clarity
#pragma acc data copyin(A[0:n*m]) create(Anew[0:n*m])
{
    while ( ... ) {
        #pragma acc kernels loop copy(error[0:1])
        for ( int j = 1; ... )
            for ( int i = 1; ... ) {
                Anew[j][i] = ( A[j][i] + ... ) * 0.25;
                error[0] = max( error[0], Anew[j][i] - A[j][i] );
            }

        #pragma acc kernels loop
        for ( int j = 1; ... )
            for ( int i = 1; ... )
                A[j][i] = Anew[j][i];
    }
}
This greatly improves performance in both PGI and accULL (~40s and ~56s respectively). Now the time is spent on kernel computation rather than on memory transfers:
======== Profiling result:
Time(%)   Time (s)    Calls   Avg (ms)    Min (ms)    Max (ms)    Name
 61.54    30.32356     1000   30.32356    30.12888    30.55630    _laplace2d_random_1
 38.32    18.88398     1000   18.88398    18.59863    19.33076    _laplace2d_random_2
  0.13     0.06502     1003    0.06482    8.64e-04    64.11577    [CUDA memcpy HtoD]
  0.00     2.08e-03    1000    2.08e-03   2.02e-03    2.50e-03    [CUDA memcpy DtoH]
We estimate that there is an excess of about 8s in accULL due to the nature of the initialization process, where we check every time for CUDA and OpenCL support, the kinds of devices available and other platform details. We expect that, by reducing this initialization time, our performance figures will get closer to those offered by commercial compilers.
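As a rough illustration only (accULL's actual runtime is organised differently, and this is not its code), one generic way to avoid repeating such checks is to cache the platform probe behind a one-time initialization guard:

/* Illustrative sketch: probe the platform once and cache the result,
 * instead of repeating the checks before every region. */
#include <stdbool.h>
#include <pthread.h>

struct platform_info {
    bool has_cuda;      /* hypothetical cached capability flags */
    bool has_opencl;
    int  num_devices;
};

static struct platform_info g_platform;
static pthread_once_t g_platform_once = PTHREAD_ONCE_INIT;

static void probe_platform(void)
{
    /* Expensive discovery (driver queries, device enumeration, ...)
     * would happen here; placeholder values are used in this sketch. */
    g_platform.has_cuda    = true;
    g_platform.has_opencl  = true;
    g_platform.num_devices = 1;
}

/* Called at every region launch; the probe itself runs only once. */
static const struct platform_info *get_platform(void)
{
    pthread_once(&g_platform_once, probe_platform);
    return &g_platform;
}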
Keep watching this space for updates on accULL. If you are trying to run a code using accULL, do not hesitate to contact us; we can help each other!
J. Lucas Grillo