Wednesday, April 24, 2013

CUDA: constants, registers, local arrays

 

Constants

A constant variable (e.g. one declared __constant__) has its
value set at run-time.
But code also often has plain constants whose value is
known at compile-time:

#define PI 3.1415926f
a = b / (2.0f * PI);

Leave these as they are – they appear to be embedded into
the executable code, so they don’t use up any registers
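As a brief illustration of the run-time case (the symbol name A and its size here are my own assumptions, not from the notes), a __constant__ array is filled from the host with cudaMemcpyToSymbol before any kernel launch:

```cuda
// Hypothetical coefficient array in constant memory; every thread
// reads the same values, broadcast via the constant cache.
__constant__ float A[9];

__global__ void scale(float *u) {
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  u[tid] *= A[0];   // a run-time constant, not known at compile-time
}

// Host side: set the values at run-time, before launching the kernel.
void setup(const float *hostA) {
  cudaMemcpyToSymbol(A, hostA, 9*sizeof(float));
}
```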

 

Registers

Within each kernel, by default, individual variables are
assigned to registers:
__global__ void lap(int I, int J,
                    float *u1, float *u2) {
  int i  = threadIdx.x + blockIdx.x*blockDim.x;
  int j  = threadIdx.y + blockIdx.y*blockDim.y;
  int id = i + j*I;

  if (i==0 || i==I-1 || j==0 || j==J-1) {
    u2[id] = u1[id];            // Dirichlet b.c.’s
  }
  else {
    u2[id] = 0.25f * ( u1[id-1] + u1[id+1]
                     + u1[id-I] + u1[id+I] );
  }
}
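For completeness, a sketch of a matching host-side launch (the 16×16 block size is my own assumption; it requires I and J to be multiples of 16):

```cuda
// Launch one thread per grid point of the I x J domain.
void run_lap(int I, int J, float *d_u1, float *d_u2) {
  dim3 block(16, 16);        // 256 threads per block
  dim3 grid(I/16, J/16);     // assumes I, J are multiples of 16
  lap<<<grid, block>>>(I, J, d_u1, d_u2);
}
```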

 

- 32K 32-bit registers per SM
- up to 63 registers per thread
- up to 1536 threads (at most 1024 per thread block)
- at the maximum of 63 registers per thread ⇒ at most 520 threads
- at the maximum of 1536 threads ⇒ at most 21 registers per thread
- not much difference between “fat” and “thin” threads

What happens if your application needs more registers?
They “spill” over into L1 cache, and from there to device
memory – the application then suffers the latency and
bandwidth implications of using device memory.

Avoiding register spill is now one of my main concerns in
big applications, but remember:

- with 1024 threads, 400-600 cycle latency of device
memory is usually OK because some warps can do
useful work while others wait for data

- provided there are 20 flops per variable read from (or
written to) device memory, the bandwidth is not a
limiting issue
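One way to steer this trade-off (my addition, not from the notes) is the __launch_bounds__ qualifier, which tells the compiler what block size to budget registers for; the related nvcc flag -maxrregcount imposes a hard per-thread limit, and --ptxas-options=-v reports the actual register usage:

```cuda
// Sketch: ask the compiler to keep register use low enough that
// blocks of 256 threads, with at least 4 blocks resident per SM,
// can be scheduled. The kernel body is a placeholder.
__global__ void __launch_bounds__(256, 4)
busy_kernel(float *u) {
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  u[tid] = 2.0f * u[tid];
}
```

Forcing the register count down too far simply trades registers for spill, so it is worth checking the ptxas report after any change.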

Local arrays

What happens if your application uses a little array?

// (A is assumed to be a 3x3 coefficient array, e.g. in constant memory)
__global__ void lap(float *u) {
  float ut[3];
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  for (int k=0; k<3; k++)
    ut[k] = u[tid+k*gridDim.x*blockDim.x];
  for (int k=0; k<3; k++)
    u[tid+k*gridDim.x*blockDim.x] =
      A[3*k]*ut[0] + A[3*k+1]*ut[1] + A[3*k+2]*ut[2];
}

In simple cases like this (quite common) compiler converts
to scalar registers:


__global__ void lap(float *u) {
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  float ut0 = u[tid+0*gridDim.x*blockDim.x];
  float ut1 = u[tid+1*gridDim.x*blockDim.x];
  float ut2 = u[tid+2*gridDim.x*blockDim.x];
  u[tid+0*gridDim.x*blockDim.x] =
    A[0]*ut0 + A[1]*ut1 + A[2]*ut2;
  u[tid+1*gridDim.x*blockDim.x] =
    A[3]*ut0 + A[4]*ut1 + A[5]*ut2;
  u[tid+2*gridDim.x*blockDim.x] =
    A[6]*ut0 + A[7]*ut1 + A[8]*ut2;
}

In more complicated cases, it puts the array into device
memory:

- still referred to in the documentation as a “local array”
because each thread has its own private copy

- held in L1 cache by default, so it may never be transferred
to device memory

- 16kB of L1 cache equates to 4096 32-bit variables, which
is only 4 per thread when using 1024 threads

- beyond this, it will have to spill to device memory
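A sketch of such a “more complicated case” (the kernel and index array are illustrative assumptions): when the array index is only known at run-time, registers cannot be used, because registers are not indexable, so the array stays in local memory:

```cuda
__global__ void pick(float *u, const int *idx) {
  float ut[8];
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  for (int k=0; k<8; k++)
    ut[k] = u[tid + k*gridDim.x*blockDim.x];
  // idx[tid] is not a compile-time constant, so ut[] cannot be
  // promoted to scalar registers; it becomes a per-thread
  // "local array" held in L1 cache / device memory.
  u[tid] = ut[idx[tid] & 7];
}
```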
