Tuesday, April 9, 2013

CUDA Memory Recap

 

As illustrated in the CUDA C Runtime section of the programming guide, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:

  • Load data from device memory to shared memory,
  • Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads,
  • Process the data in shared memory,
  • Synchronize again if necessary to make sure that shared memory has been updated with the results,
  • Write the results back to device memory.
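
A minimal sketch of this pattern, using a block-local reversal as the "processing" step (the tile size, kernel name, and computation are illustrative choices, not from the guide):

#include <cuda_runtime.h>

#define TILE 256   // threads per block; illustrative choice

// Each block stages one tile of the input in shared memory, reverses the
// tile, and writes the result back to device memory. Assumes the input
// length is a multiple of TILE and the kernel is launched with TILE
// threads per block, e.g. reverse_tiles<<<n / TILE, TILE>>>(d_in, d_out);
__global__ void reverse_tiles(const float *in, float *out)
{
    __shared__ float tile[TILE];
    int idx = blockIdx.x * TILE + threadIdx.x;

    // 1. Load data from device memory to shared memory.
    tile[threadIdx.x] = in[idx];

    // 2. Synchronize so elements written by other threads are visible.
    __syncthreads();

    // 3. Process the data in shared memory: each thread reads an element
    //    that was loaded by a different thread of the same block, and
    // 4./5. writes the result back to device memory. (A second
    //    __syncthreads() would only be needed if the tile were updated
    //    in place and re-read before the write-back.)
    out[idx] = tile[TILE - 1 - threadIdx.x];
}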

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Global Memory

Size and Alignment Requirement

Two-Dimensional Arrays

For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size. In particular, this means that an array whose width is not a multiple of the warp size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of that size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and the associated memory copy functions described in the reference manual enable programmers to write hardware-independent code that allocates arrays conforming to these constraints.
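
A rough sketch of how cudaMallocPitch() is typically used together with a kernel that steps through rows by the returned pitch (the array dimensions, kernel, and names are illustrative):

#include <cuda_runtime.h>

__global__ void scale_rows(float *data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Rows are pitch *bytes* apart, so index through a char* first.
        float *row = (float *)((char *)data + y * pitch);
        row[x] *= 2.0f;
    }
}

int main(void)
{
    const int width = 500, height = 100;  // width deliberately not a round multiple
    float *d_data;
    size_t pitch;                         // padded row size in bytes, chosen by the runtime

    // Allocates height rows of width floats, each row padded so that it
    // satisfies the alignment constraints described above.
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_rows<<<grid, block>>>(d_data, pitch, width, height);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}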

Local Memory

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses.
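
As a hedged illustration (not from the post): a per-thread array indexed with values only known at run time usually cannot be kept in registers, so the compiler tends to place it in local memory; compiling with nvcc -Xptxas -v reports the per-thread local memory actually used.

// A per-thread array indexed by a runtime value is likely to be placed in
// local memory, which physically lives in device memory.
// Compile with: nvcc -Xptxas -v local_demo.cu   to see the local memory usage.
__global__ void histogram_per_thread(const unsigned char *in, int n, int *out)
{
    int bins[16] = {0};                       // likely spilled to local memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = idx; i < n; i += gridDim.x * blockDim.x)
        bins[in[i] & 0xF]++;                  // runtime index -> local memory access

    for (int b = 0; b < 16; ++b)
        atomicAdd(&out[b], bins[b]);
}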

Shared Memory

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
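
One standard illustration of this is the padded shared-memory tile used for matrix transposition. The sketch below (names and tile size are illustrative, and it assumes a square matrix whose side is a multiple of 32) pads each row of the tile by one element so that reading a column touches 32 different banks instead of hitting the same bank 32 times.

#define TILE_DIM 32

// Transposes one 32x32 tile per block, launched with a 32x32 thread block.
// The "+ 1" padding makes consecutive rows start in different banks, so
// reading a column of the tile no longer causes a 32-way bank conflict.
__global__ void transpose_tile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced load of one tile into shared memory.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap block coordinates for the output tile, then read the shared tile
    // transposed; without the padding this column read would be a bank conflict.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}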

About bank conflicts:

http://space.itpub.net/22785983/viewspace-619794

About CPU/GPU bandwidth:

http://blog.csdn.net/jubincn/article/details/6624854

Constant Memory

Texture and Surface Memory
