Par Lab Boot Camp @ UC Berkeley 2010
1. Registers:
Each CUDA thread has private access to a
configurable number of registers
– The 128 KB (64 KB) SM register file is partitioned
among all resident threads
– The CUDA program can trade degree of thread
block concurrency for amount of per‐thread state
– Registers, stack spill into (cached, on Fermi)
“local” DRAM if necessary
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
http://stackoverflow.com/questions/12207533/increasing-per-thread-register-usage-in-cuda
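The concurrency/state tradeoff described above can be sketched in code. This is a minimal illustration, not taken from the slides: the kernel `scale` and the helper `maxThreadsByRegisters` are hypothetical names, and the helper ignores the hardware's register allocation granularity. `__launch_bounds__` and `cudaFuncGetAttributes` are standard CUDA facilities; the 128 KB register file size is the figure quoted above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on resident threads per SM imposed by the
// register file alone (32-bit registers, allocation granularity ignored).
__host__ __device__ inline int maxThreadsByRegisters(int regFileBytes, int regsPerThread) {
    return (regFileBytes / 4) / regsPerThread;
}

// __launch_bounds__(256) tells the compiler that blocks never exceed 256 threads,
// which lets it budget registers per thread accordingly.
__global__ void __launch_bounds__(256) scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // numRegs: registers used per thread.
    // localSizeBytes: per-thread "local" memory, which holds spills and stack.
    printf("registers/thread: %d, local bytes/thread: %zu\n",
           attr.numRegs, attr.localSizeBytes);
    if (attr.numRegs > 0) {
        printf("register-limited threads per SM (128 KB file): %d\n",
               maxThreadsByRegisters(128 * 1024, attr.numRegs));
    }
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` prints the register and spill counts per kernel, and `-maxrregcount=N` caps registers per thread for the whole compilation unit. A lower cap allows more resident threads but may push values into local memory.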
2. Shared Memory:
Each Thread Block has private access to a
configurable amount of scratchpad memory
– The Fermi SM’s 64 KB SRAM can be
configured as 16 KB L1 cache + 48 KB
scratchpad, or vice‐versa*
– Pre‐Fermi SMs have 16 KB scratchpad only
– The available scratchpad space is partitioned
among resident thread blocks, providing
another concurrency‐state tradeoff
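As a sketch of how a thread block uses its private scratchpad, the example below has each block sum 256 inputs through a `__shared__` array. The names `blockSum`, `BLOCK`, and `maxBlocksBySharedMemory` are hypothetical and not from the slides. The helper only reflects the shared-memory limit; other hardware limits on resident blocks are ignored.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two for the tree reduction

// Hypothetical helper: resident blocks per SM allowed by scratchpad capacity alone.
__host__ __device__ inline int maxBlocksBySharedMemory(int smemBytes, int bytesPerBlock) {
    return smemBytes / bytesPerBlock;
}

// Each block sums BLOCK inputs through an array visible only to that block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];  // BLOCK * 4 bytes = 1 KB of scratchpad per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1024, blocks = n / BLOCK;
    float h_in[n], h_out[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(d_in, d_out, n);

    // The copy back synchronizes with the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int b = 0; b < blocks; ++b) printf("block %d partial sum: %.1f\n", b, h_out[b]);
    printf("blocks per SM by 48 KB scratchpad alone: %d\n",
           maxBlocksBySharedMemory(48 * 1024, BLOCK * (int)sizeof(float)));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The larger the per-block `__shared__` allocation, the fewer blocks fit on an SM at once, which is the concurrency‐state tradeoff the bullets describe.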
http://stackoverflow.com/questions/11274853/is-cuda-shared-memory-also-cached
Section G.4.1 states:
"The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory with 16 KB of L1 cache (default setting)"
Select the split per kernel with cudaFuncSetCacheConfig().
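A minimal sketch of requesting a different L1/shared split per kernel follows. The kernels `stagedCopy` and `gather` and the helper `cacheConfigName` are hypothetical, made up for illustration. `cudaFuncSetCacheConfig` and the `cudaFuncCache` enum values are part of the CUDA runtime API. The call expresses a preference, which the runtime may override if a launch requires a different configuration.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel that stages data in shared memory, so it may prefer
// the larger scratchpad. Assumes blockDim.x <= 256.
__global__ void stagedCopy(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

// Hypothetical kernel with scattered reads and no shared memory, so it may
// prefer the larger L1 cache.
__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Hypothetical helper: readable label for a cache preference.
const char *cacheConfigName(cudaFuncCache c) {
    switch (c) {
        case cudaFuncCachePreferShared: return "prefer shared";
        case cudaFuncCachePreferL1:     return "prefer L1";
        default:                        return "no preference";
    }
}

int main() {
    // The preference applies to subsequent launches of the named kernel.
    cudaError_t err = cudaFuncSetCacheConfig(stagedCopy, cudaFuncCachePreferShared);
    printf("stagedCopy: %s (%s)\n",
           cacheConfigName(cudaFuncCachePreferShared), cudaGetErrorString(err));

    err = cudaFuncSetCacheConfig(gather, cudaFuncCachePreferL1);
    printf("gather: %s (%s)\n",
           cacheConfigName(cudaFuncCachePreferL1), cudaGetErrorString(err));
    return 0;
}
```

On devices where the L1 and shared memory sizes are fixed, the call has no effect.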