Wednesday, April 24, 2013

CUDA: More About Shared Memory

 

 

http://stackoverflow.com/a/7041976/2041023

On Fermi, you can use up to 16 KB or 48 KB (depending on the configuration you select) of shared memory per block - the number of blocks that will run concurrently on a multiprocessor is determined by how much shared memory and how many registers each block requires, up to a maximum of 8. If you use 48 KB, then only a single block can run concurrently. If you use 1 KB per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.
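To make that arithmetic concrete, here is a minimal kernel sketch (the kernel name and buffer size are illustrative, not from the quoted answer) whose static shared memory footprint is 1 KB per block - small enough that on Fermi the 8-block cap or register pressure, rather than shared memory, limits how many blocks are resident:

    __global__ void smallFootprintKernel(float *out)
    {
        // 256 floats * 4 bytes = 1 KB of static shared memory per block
        // (assumes a launch with 256 threads per block). With 48 KB per
        // Fermi multiprocessor, shared memory alone would allow 48
        // resident blocks, so the hardware cap of 8 (or register usage)
        // becomes the binding limit instead.
        __shared__ float buf[256];

        int tid = threadIdx.x;
        buf[tid] = (float)tid;          // stage data in shared memory
        __syncthreads();
        out[blockIdx.x * blockDim.x + tid] = buf[255 - tid];  // reversed copy out
    }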

 

http://stackoverflow.com/a/11507311/2041023

Yes, blocks on the same multiprocessor share the same pool of shared memory, which is 48 KB per multiprocessor for your GPU card (compute capability 2.0). So if you have N blocks on the same multiprocessor, the maximum size of shared memory per block is (48/N) KB.
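A hedged host-side sketch for checking this limit on a given card (the program is mine, not from the answer); cudaGetDeviceProperties() reports the per-block shared memory limit directly:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // On compute capability 2.0 this prints 49152 bytes (48 KB); with
        // N resident blocks per multiprocessor, each block can use at most
        // (48/N) KB of that pool.
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }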

 

https://developer.nvidia.com/content/using-shared-memory-cuda-cc

SHARED MEMORY BANK CONFLICTS

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
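A small sketch of the broadcast case (kernel and variable names are illustrative, assuming a single block of at least 32 threads): every thread in a warp reads the same shared memory word, which is serviced as a single broadcast rather than 32 serialized accesses:

    __global__ void broadcastRead(float *out, const float *in)
    {
        __shared__ float coeff[32];
        if (threadIdx.x < 32)
            coeff[threadIdx.x] = in[threadIdx.x];
        __syncthreads();
        // All threads read coeff[0]: one address, one bank, served as a
        // single broadcast transaction rather than a serialized conflict.
        out[threadIdx.x] = coeff[0] * (float)threadIdx.x;
    }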

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.
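The classic illustration of this mapping is padding a shared memory tile so that column accesses hit distinct banks. The sketch below (names and dimensions are mine; it assumes a 32x32 thread block on a device supporting 1024 threads per block, and a square matrix whose side is a multiple of 32) pads each tile row by one word, which keeps both 16-bank and 32-bank devices conflict-free:

    __global__ void transposeTile(float *odata, const float *idata, int width)
    {
        // 33-column padding: the stride between rows is 33 words, so the
        // column read tile[threadIdx.x][threadIdx.y] below touches
        // distinct banks instead of hitting one bank repeatedly.
        __shared__ float tile[32][33];

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
        __syncthreads();

        x = blockIdx.y * 32 + threadIdx.x;   // swap block offsets for output
        y = blockIdx.x * 32 + threadIdx.y;
        odata[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }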

Devices of compute capability 3.x have a configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double-precision data.
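A hedged host-side sketch (the kernel name and sizes are mine) of selecting eight-byte banks before launching a kernel that stages doubles in shared memory:

    #include <cuda_runtime.h>

    __global__ void scaleDoubles(double *data)
    {
        __shared__ double tile[256];
        tile[threadIdx.x] = data[threadIdx.x] * 2.0;
        __syncthreads();
        data[threadIdx.x] = tile[255 - threadIdx.x];
    }

    int main()
    {
        double *d;
        cudaMalloc(&d, 256 * sizeof(double));
        // With eight-byte banks, consecutive doubles map to consecutive
        // banks, avoiding the two-way conflicts possible under the
        // four-byte default.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        scaleDoubles<<<1, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }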

 

CONFIGURING THE AMOUNT OF SHARED MEMORY

On devices of compute capability 2.x and 3.x, each multiprocessor has 64 KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings: 48 KB shared memory / 16 KB L1 cache, and 16 KB shared memory / 48 KB L1 cache. By default the 48 KB shared memory setting is used. This can be configured at runtime from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than is available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32 KB shared memory / 32 KB L1 cache, which can be obtained using the option cudaFuncCachePreferEqual.
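A hedged sketch of both calls (the kernel name is illustrative):

    #include <cuda_runtime.h>

    __global__ void reverseBlock(float *data)
    {
        __shared__ float buf[256];
        buf[threadIdx.x] = data[threadIdx.x];
        __syncthreads();
        data[threadIdx.x] = buf[255 - threadIdx.x];
    }

    int main()
    {
        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        // Device-wide preference: 48 KB shared memory / 16 KB L1 cache.
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
        // Per-kernel override: prefer the larger L1 cache for this kernel.
        cudaFuncSetCacheConfig(reverseBlock, cudaFuncCachePreferL1);
        reverseBlock<<<1, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }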
