Whatever Goes: CUDA: Registers, Shared Memory and Occupancy 1

Par Lab Boot Camp @ UC Berkeley 2010

1. Registers:

Each CUDA thread has private access to a configurable number of registers.

– The SM register file (128 KB on Fermi, 64 KB on GT200) is partitioned among all resident threads.

– A CUDA program can trade the degree of thread-block concurrency for the amount of per-thread state.

– Registers and the stack spill into “local” DRAM (cached on Fermi) if necessary.
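To make the register tradeoff concrete, here is a minimal sketch that asks the runtime how many registers a kernel uses and estimates how many threads the register file alone could hold. The kernel `scale` and the helper `threads_by_registers` are hypothetical names made up for illustration. The estimate ignores register allocation granularity and every other occupancy limit, so treat it as an upper bound rather than a real occupancy figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. __launch_bounds__(256, 4) asks the compiler to keep
// register use low enough for 4 blocks of 256 threads to be resident per SM.
__global__ void __launch_bounds__(256, 4) scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Threads that fit in the register file alone (an upper bound: ignores
// allocation granularity and all other per-SM limits).
int threads_by_registers(int regs_per_sm, int regs_per_thread) {
    return regs_per_thread > 0 ? regs_per_sm / regs_per_thread : 0;
}

int main() {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    // Local memory holds spilled registers as well as per-thread arrays.
    printf("local memory bytes per thread: %zu\n", attr.localSizeBytes);
    printf("register-limited threads per SM: %d\n",
           threads_by_registers(prop.regsPerMultiprocessor, attr.numRegs));
    return 0;
}
```

With a 128 KB register file (32768 four-byte registers), a kernel using 32 registers per thread could keep at most 1024 threads resident by this estimate. Compiling with nvcc's `-maxrregcount` flag is another way to cap per-thread registers, at the risk of more spilling to local memory.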

http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable

http://stackoverflow.com/questions/12207533/increasing-per-thread-register-usage-in-cuda

2. Shared Memory:

Each thread block has private access to a configurable amount of scratchpad memory.

– The Fermi SM’s 64 KB SRAM can be configured as 16 KB L1 cache + 48 KB scratchpad, or vice versa.

– Pre-Fermi SMs have a 16 KB scratchpad only.

– The available scratchpad space is partitioned among resident thread blocks, providing another concurrency-state tradeoff.
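The scratchpad tradeoff can be sketched the same way. In the example below each block stages 1 KB in shared memory, and a helper estimates how many such blocks one SM's scratchpad could hold. The kernel `reverse_tile`, the constant `TILE`, and the helper `blocks_by_shared` are hypothetical names chosen for illustration, and the estimate ignores register, thread-count, and block-count limits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // hypothetical block size: 256 floats = 1 KB of scratchpad

// Each block reverses its own TILE-element slice through shared memory.
__global__ void reverse_tile(float *x) {
    __shared__ float tile[TILE];  // private to this thread block
    int base = blockIdx.x * TILE;
    tile[threadIdx.x] = x[base + threadIdx.x];
    __syncthreads();  // every load must finish before another thread's slot is read
    x[base + threadIdx.x] = tile[TILE - 1 - threadIdx.x];
}

// Blocks that fit in the scratchpad alone. Returns 0 when the kernel uses no
// shared memory, i.e. when this resource imposes no limit to compute.
int blocks_by_shared(size_t smem_per_sm, size_t smem_per_block) {
    return smem_per_block ? static_cast<int>(smem_per_sm / smem_per_block) : 0;
}

int main() {
    constexpr int kBlocks = 4, kN = kBlocks * TILE;
    float h[kN];
    for (int i = 0; i < kN; ++i) h[i] = static_cast<float>(i);

    float *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_tile<<<kBlocks, TILE>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] after reversal: %.0f\n", h[0]);
    printf("scratchpad-limited blocks per SM at 48 KB: %d\n",
           blocks_by_shared(48 * 1024, sizeof(float) * TILE));
    printf("scratchpad-limited blocks per SM at 16 KB: %d\n",
           blocks_by_shared(16 * 1024, sizeof(float) * TILE));
    return 0;
}
```

A block that asks for more scratchpad leaves room for fewer resident blocks, which is the same concurrency-for-state trade as with registers.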

http://stackoverflow.com/questions/11274853/is-cuda-shared-memory-also-cached

Section G.4.1 states:

"The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory with 16 KB of L1 cache (default setting)"

Configure the split with cudaFuncSetCacheConfig().
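A minimal sketch of that call follows. `cudaFuncSetCacheConfig()` and the `cudaFuncCache` enumerators are part of the CUDA runtime; the kernel `add_one` and the helper `choose_cache_config` are hypothetical, and the 16 KB threshold in the helper is an assumption based on the Fermi split described above. The setting is a preference, so the runtime may pick a different configuration if the kernel needs it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses no shared memory, so a larger L1 may help it.
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical heuristic (assumes the Fermi 16/48 KB split): a block needing
// more than 16 KB of scratchpad must prefer shared memory; otherwise favor L1.
cudaFuncCache choose_cache_config(size_t smem_per_block) {
    return smem_per_block > 16 * 1024 ? cudaFuncCachePreferShared
                                      : cudaFuncCachePreferL1;
}

int main() {
    // Set the per-kernel preference before launching the kernel.
    cudaError_t err = cudaFuncSetCacheConfig(add_one, choose_cache_config(0));
    if (err != cudaSuccess) {
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int n = 256;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));
    add_one<<<1, n>>>(d, n);
    err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

On devices whose L1 and shared memory sizes are fixed, the call has no effect.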

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__HIGHLEVEL_ge0969184de8a5c2d809aa8d7d2425484.html

https://devtalk.nvidia.com/default/topic/469086/how-to-use-cudafuncsetcacheconfig-correctly-one-of-the-most-anticipating-features-does-not-seem-/

Friday, May 3, 2013
