As illustrated in CUDA C Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:
- Load data from device memory to shared memory,
- Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads,
- Process the data in shared memory,
- Synchronize again if necessary to make sure that shared memory has been updated with the results,
- Write the results back to device memory.
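The steps above can be sketched as a small kernel. This is only a minimal illustration, not code from the programming guide: the kernel name `reverseChunks`, the host helper `runReverseChunks`, and the block size `BLOCK` are hypothetical names chosen for this example. Each block reverses its own chunk of the input, so every thread reads a shared memory location that a different thread populated, which is exactly why the barrier is needed.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

#define BLOCK 256  // assumed block size for this sketch

// Reverses each BLOCK-sized chunk of `in` into `out`, staging through shared memory.
__global__ void reverseChunks(const int* in, int* out, int n)
{
    __shared__ int tile[BLOCK];
    int base  = blockIdx.x * BLOCK;
    int t     = threadIdx.x;
    int count = min(BLOCK, n - base);  // valid elements in this chunk

    // Load data from device memory to shared memory.
    if (t < count) tile[t] = in[base + t];

    // Synchronize so that locations written by other threads are visible.
    __syncthreads();

    // Process and write back: each thread reads an element loaded by another thread.
    // No second barrier is needed here because results go straight to device memory.
    if (t < count) out[base + t] = tile[count - 1 - t];
}

// Hypothetical host helper: returns true if the GPU result matches a CPU reference.
bool runReverseChunks(int n)
{
    std::vector<int> hIn(n), hOut(n, -1);
    for (int i = 0; i < n; ++i) hIn[i] = i;

    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void**)&dIn, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, n * sizeof(int)) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    int blocks = (n + BLOCK - 1) / BLOCK;
    reverseChunks<<<blocks, BLOCK>>>(dIn, dOut, n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    // CPU reference: element i maps to the mirrored position inside its chunk.
    for (int i = 0; i < n; ++i) {
        int base  = (i / BLOCK) * BLOCK;
        int count = std::min(BLOCK, n - base);
        if (hOut[i] != hIn[base + count - 1 - (i - base)]) return false;
    }
    return true;
}
```

Calling `runReverseChunks(1000)` exercises three full chunks and one partial chunk. If the processing step wrote its results back into shared memory instead, a second `__syncthreads()` would be needed before the final write to device memory.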
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
Global Memory
Size and Alignment Requirement
Two-Dimensional Arrays
…
In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
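A minimal sketch of pitched allocation follows, assuming a 2D array of `float` whose width is deliberately not a round number. The kernel `addOne` and the host helper `runPitchedAddOne` are hypothetical names for this example; `cudaMallocPitch()` and `cudaMemcpy2D()` are the runtime calls. The key point is that rows are `pitch` bytes apart, so the row pointer has to be computed with byte arithmetic rather than `row * width`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adds 1 to every element of a pitched 2D array.
__global__ void addOne(float* data, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // Rows are `pitch` bytes apart, not width * sizeof(float).
        float* rowPtr = (float*)((char*)data + row * pitch);
        rowPtr[col] += 1.0f;
    }
}

// Hypothetical host helper: returns true if every element was incremented.
bool runPitchedAddOne(int width, int height)
{
    std::vector<float> h(width * height);
    for (int i = 0; i < width * height; ++i) h[i] = (float)i;

    float* d = nullptr;
    size_t pitch = 0;                               // chosen by the runtime
    size_t hostPitch = width * sizeof(float);       // host rows are tightly packed
    if (cudaMallocPitch((void**)&d, &pitch, hostPitch, height) != cudaSuccess)
        return false;

    // The 2D copy translates between the packed host layout and the padded device layout.
    cudaMemcpy2D(d, pitch, h.data(), hostPitch, hostPitch, height, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    addOne<<<grid, block>>>(d, pitch, width, height);

    cudaError_t err = cudaMemcpy2D(h.data(), hostPitch, d, pitch, hostPitch, height,
                                   cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess || pitch < hostPitch) return false;

    for (int i = 0; i < width * height; ++i)
        if (h[i] != (float)i + 1.0f) return false;
    return true;
}
```

For example, `runPitchedAddOne(100, 37)` allocates rows of 100 floats. The actual pitch value is up to the runtime and the device, which is what keeps the code hardware-independent.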
Local Memory
…
The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses.
Shared Memory
…
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
…
To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
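One common way to avoid bank conflicts is to pad a shared memory tile by one column. The sketch below is an assumption-laden illustration rather than anything from the guide: it assumes 32 banks of 4-byte words (which should be checked against the target compute capability), a square `n` x `n` matrix, and the hypothetical names `transposeTile` and `runTranspose`. Without the padding, a warp reading one column of a 32 x 32 `float` tile would hit the same bank 32 times; with a row length of 33 words, consecutive rows start in different banks.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32  // assumed tile size, matching an assumed 32 banks

// Transposes an n x n matrix tile by tile through padded shared memory.
__global__ void transposeTile(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding shifts each row by one bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    // Swap the block indices so that writes to `out` stay row-contiguous.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    // Column-wise read of the tile: conflict-free only because of the padding.
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: returns true if out is the transpose of in.
bool runTranspose(int n)
{
    std::vector<float> hIn(n * n), hOut(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) hIn[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = (size_t)n * n * sizeof(float);
    if (cudaMalloc((void**)&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposeTile<<<grid, block>>>(dIn, dOut, n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (hOut[c * n + r] != hIn[r * n + c]) return false;
    return true;
}
```

The result is the same with or without the padding; only the number of serialized shared memory accesses changes, so the effect shows up in timing rather than in correctness.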
About bank conflicts:
http://space.itpub.net/22785983/viewspace-619794
About CPU-GPU bandwidth:
http://blog.csdn.net/jubincn/article/details/6624854
Constant Memory
Texture and Surface Memory