https://developer.nvidia.com/content/using-shared-memory-cuda-cc
In this case the shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter, as in the following excerpt.
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
The dynamic shared memory kernel, dynamicReverse(), declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and use of the extern specifier). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.
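As a minimal sketch of how the pieces fit together, the kernel body below is reconstructed from the description above (load into shared memory, synchronize, write back in reverse order). The host helper reverseOnDevice() is a hypothetical name added only to show the launch. It assumes a single block, so n must not exceed the device's maximum threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reverses d[0..n-1] in place using dynamically sized shared memory.
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];   // size comes from the third launch parameter
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();             // every load must finish before any thread reads s[tr]
    d[t] = s[tr];
}

// Hypothetical host helper: returns true if the device produced the reversed input.
bool reverseOnDevice(int n)
{
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d_d = nullptr;
    if (cudaMalloc(&d_d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d_d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of n threads, n*sizeof(int) bytes of dynamic shared memory.
    dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n);

    cudaError_t err = cudaMemcpy(h.data(), d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != n - 1 - i) return false;
    return true;
}
```

Calling reverseOnDevice(64) from main() should return true on a working CUDA device. If the third launch parameter is omitted, s[] has zero bytes and the kernel reads and writes out of bounds.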
What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before, and use pointers into it to divide it into multiple arrays, as in the following excerpt.
extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF]; // nC chars
In the kernel launch, specify the total shared memory needed, as in the following.
myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);
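The excerpt elides the kernel's arguments and body, so the following is only an illustrative sketch: the parameter list of myKernel, the values written, and the host helper runPartitioned() are all assumptions made for the example. It shows the three sub-arrays being carved out of one extern array and the matching byte count at launch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: fills the three sub-arrays, then thread 0 sums them into *out.
__global__ void myKernel(int nI, int nF, int nC, float *out)
{
    extern __shared__ int s[];
    int   *integerData = s;                               // nI ints
    float *floatData   = (float *)&integerData[nI];       // nF floats
    char  *charData    = (char *)&floatData[nF];          // nC chars

    int t = threadIdx.x;
    if (t < nI) integerData[t] = 1;
    if (t < nF) floatData[t]   = 0.5f;
    if (t < nC) charData[t]    = 2;
    __syncthreads();

    if (t == 0) {
        float sum = 0.0f;
        for (int i = 0; i < nI; ++i) sum += integerData[i];
        for (int i = 0; i < nF; ++i) sum += floatData[i];
        for (int i = 0; i < nC; ++i) sum += charData[i];
        *out = sum;
    }
}

// Hypothetical host helper: launches one block and returns the sum, or -1 on failure.
// blockSize must be at least max(nI, nF, nC) so every element gets written.
float runPartitioned(int nI, int nF, int nC, int blockSize)
{
    float *d_out = nullptr;
    float h_out = -1.0f;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) return -1.0f;

    size_t bytes = nI * sizeof(int) + nF * sizeof(float) + nC * sizeof(char);
    myKernel<<<1, blockSize, bytes>>>(nI, nF, nC, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Ordering the sub-arrays from the widest element type to the narrowest, as here, keeps each pointer naturally aligned without any padding.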
http://stackoverflow.com/a/5531640/2041023
“Also be aware when using pointers that shared memory uses 32 bit words, and all allocations must be 32 bit word aligned, irrespective of the type of the shared memory allocation.”
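One way to respect that alignment when a narrower type has to come first is to round each sub-array's byte offset up before casting. The sketch below is an assumption-laden illustration, not part of either quoted source: alignUp(), floatOffset(), charsThenFloats(), and runCharsThenFloats() are hypothetical names, and the layout (chars first, then floats) is chosen only to force padding.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Round offset up to the next multiple of align (align must be a power of two).
__host__ __device__ inline size_t alignUp(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

// Byte offset of the float sub-array when nC chars are stored in front of it.
__host__ __device__ inline size_t floatOffset(int nC)
{
    return alignUp(nC * sizeof(char), sizeof(float));
}

// Hypothetical kernel: nC chars, padding, then nF floats; thread 0 sums both into *out.
__global__ void charsThenFloats(int nC, int nF, float *out)
{
    extern __shared__ int s[];
    char  *charData  = (char *)s;
    float *floatData = (float *)(charData + floatOffset(nC));  // 4-byte aligned

    int t = threadIdx.x;
    if (t < nC) charData[t]  = 1;
    if (t < nF) floatData[t] = 0.25f;
    __syncthreads();

    if (t == 0) {
        float sum = 0.0f;
        for (int i = 0; i < nC; ++i) sum += charData[i];
        for (int i = 0; i < nF; ++i) sum += floatData[i];
        *out = sum;
    }
}

// Hypothetical host helper: the launch size includes the padding bytes.
float runCharsThenFloats(int nC, int nF, int blockSize)
{
    float *d_out = nullptr;
    float h_out = -1.0f;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) return -1.0f;

    size_t bytes = floatOffset(nC) + nF * sizeof(float);
    charsThenFloats<<<1, blockSize, bytes>>>(nC, nF, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With nC = 3 the floats start at byte 4 rather than byte 3, so the launch requests 4 + nF*sizeof(float) bytes instead of 3 + nF*sizeof(float).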