Monday, April 29, 2013

CUDA: cudaStreamQuery(0), you son of a dirty little trick

 

If you're like me, you'll wonder how you missed this one.

cudaStreamQuery(0) forces the CUDA driver to flush its CPU-side launch buffer and push the queued work to the GPU.

So when you want to queue up many kernels in sequence (e.g. because you don't want to blow up GPU memory by doing everything at once), and you want to make sure they are actually sent off to the GPU while the CPU does other work before you eventually call, say, cudaDeviceSynchronize() again, use it!
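For instance, a minimal sketch of the ordering (the kernel name, its arguments, and the CPU-side work are all hypothetical):

// queue a batch of kernel launches; these sit in the driver's CPU-side buffer
for (int i = 0; i < numChunks; ++i)
    myKernel<<<grid, block>>>(d_data, i);

cudaStreamQuery(0);        // the trick: flush the buffer so the GPU actually starts working

doSomeCpuWork();           // overlap CPU work with the running kernels

cudaDeviceSynchronize();   // finally wait for the GPU to finish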

 

http://nvidia.fullviewmedia.com/gtc2013/0321-230C-S3382.html

13:55

Saturday, April 27, 2013

c++: convert to std::string

 

http://stackoverflow.com/a/332126

 

#include <sstream>
#include <string>

template <typename T>
std::string to_string(T const& value) {
    std::stringstream sstr;
    sstr << value;
    return sstr.str();
}
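Usage is then just (a trivial example):

std::string s = to_string(3.14159);   // "3.14159"
std::string n = to_string(42);        // "42"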

c++: header include order, std::string and <string>

 

A friend of mine wrote these 3 files

Singleton.h:

#pragma once
#include <fstream>
#include <string>

class Singleton
{
private:
    std::ofstream m_stream;
    Singleton(void);
    static bool instance;
    static Singleton* s;
public:
    ~Singleton(void);
    static Singleton* getInstance();
    void write(std::string c);
    void close();
};

Singleton.cpp

#pragma once

#include "Singleton.h"
#include "Constants.h"

Singleton* Singleton::s = NULL;
bool Singleton::instance = false;

Singleton::Singleton(void)
{
    const char* cs = LOG_ALL_FILE_PATH.c_str();
    m_stream.open(cs);
}

Singleton::~Singleton(void)
{
    m_stream.close();
}

Singleton* Singleton::getInstance() {
    if (!instance){
        s = new Singleton();
        instance = true;
    }
    return s;
}

void Singleton::write(std::string logline){
    m_stream << logline << std::endl;
}

void Singleton::close(){
    if (instance)
        m_stream.close();
}

Constants.h:

/** Log file path */
const std::string LOG_ALL_FILE_PATH = "file.log";

If we change

#include "Singleton.h"
#include "Constants.h"

to

#include "Constants.h”

#include "Singleton.h"

it will not compile.

Which led me to believe that

std::string cannot appear before the first #include <string>
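In other words, Constants.h uses std::string but never includes <string>, so it only compiles when an earlier #include happens to pull <string> in first. A minimal sketch of the fix is to make the header self-contained:

// Constants.h -- include what you use
#pragma once
#include <string>

/** Log file path */
const std::string LOG_ALL_FILE_PATH = "file.log";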

c\c++: <strings.h> <string.h> <string> <cstring> recap

 

<string.h> v.s. <string>

string.h contains the old C functions like strcpy and strlen;

string primarily contains the std::string class.

It should also be noted that using string.h is deprecated within C++. If you need the functionality contained within, you should use the header cstring. This more or less completely bypasses the issue of "What's the difference between these two" because it's very obvious that one is from the C library. – Mike Bantegui

<cstring> v.s. <string>

http://www.cplusplus.com/forum/general/38801/

"<cstring> is basically a header containing a set of functions for dealing with C-style strings (char*). <string>, on the other hand, is a header that allows you to use C++-style strings (std::string), which can do a lot of if not all of the functions provided in <cstring> on their own." - Albatross

"Use string. cstring is so 1970's. string is a modern way to represent strings in c++. You'll need to learn cstring because you will run into code that uses it. But cstring is responsible for lots of unsafe code." - PanGalactic

http://stackoverflow.com/a/12824665

<strings.h> v.s <string.h>

http://stackoverflow.com/a/4291176

Typically <strings.h> just adds some useful but non-standard additional string functions to the standard header <string.h>. For maximum portability you should only use <string.h> but if you need the functions in <strings.h> more than you need portability then you can use <strings.h> instead of <string.h>.

tinyxml: how to retrieve the value in <![CDATA[ ]]>

 

http://www.codelast.com/?p=883

 

<?xml version="1.0" ?>
<config>
    <property>
        <![CDATA[if(a<b){return 0;}]]>
    </property>
</config>

-------------------------------

TiXmlDocument doc;
doc.LoadFile("/root/test.xml");
TiXmlElement* pRoot = doc.RootElement();
TiXmlElement* pProperty = pRoot->FirstChildElement();
std::string strCDATA = pProperty->FirstChild()->Value();    // strCDATA now holds: if(a<b){return 0;}

c++/c: How to parse a scientific numeric string

 

Very crude method:

float eMod_d1;   
sscanf("2e+020","%e",&eMod_d1);

float eMod_d2;   
sscanf("20","%e",&eMod_d2);

float eMod_d3;   
sscanf("0.01","%e",&eMod_d3);

Friday, April 26, 2013

C++: Good old for_each again

 

http://en.cppreference.com/w/cpp/algorithm/for_each

Example

The following example uses a lambda function to increment all of the elements of a vector and then computes a sum of them:


#include <vector>
#include <algorithm>
#include <iostream>

struct Sum {
    Sum() { sum = 0; }
    void operator()(int n) { sum += n; }
    int sum;
};

int main()
{
    std::vector<int> nums{3, 4, 2, 9, 15, 267};

    std::cout << "before: ";
    for (auto n : nums) {
        std::cout << n << " ";
    }
    std::cout << '\n';

    std::for_each(nums.begin(), nums.end(), [](int &n){ n++; });

    Sum s = std::for_each(nums.begin(), nums.end(), Sum());

    std::cout << "after: ";
    for (auto n : nums) {
        std::cout << n << " ";
    }
    std::cout << '\n';
    std::cout << "sum: " << s.sum << '\n';
}

Wednesday, April 24, 2013

CUDA: workflow recap

 

http://stackoverflow.com/a/10467342/2041023

Two of the best references are

  1. NVIDIA Fermi Compute Architecture Whitepaper
  2. GF104 Reviews

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.

D'. There is a mapping between laneid (a thread's index within its warp) and a core.

E'. If a warp contains less than 32 threads it will in most cases be executed the same as if it has 32 threads. Warps can have less than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp exited.

F'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize. There is no requirement for the warp schedulers to select two warps from the same thread block.

G'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched, the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 warps will be assigned to different SMs. If 2 warps are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions

In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.

3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time or 8 SM * 48 warps = 512 warps if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on SMs. Each SM has multiple execution units (more than CUDA cores). Which resources are in use at any given time is dependent on the warp schedulers and instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do a special floating point operation the SFU units will idle.

4'. Parallel Nsight and the Visual Profiler show:
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC assume 2 for GF100 (GTX480) and 4 for GF10x (GTX560), though 3 is a better target.

CUDA: constants, registers, local arrays

 

Constants

A constant variable has its value set at run-time. But code also often has plain constants whose value is known at compile-time:

#define PI 3.1415926f
a = b / (2.0f * PI);

Leave these as they are; they seem to be embedded into the executable code, so they don't use up any registers.

 

Registers

Within each kernel, by default, individual variables are assigned to registers:

__global__ void lap(int I, int J, float *u1, float *u2) {
    int i  = threadIdx.x + blockIdx.x*blockDim.x;
    int j  = threadIdx.y + blockIdx.y*blockDim.y;
    int id = i + j*I;
    if (i==0 || i==I-1 || j==0 || j==J-1) {
        u2[id] = u1[id];    // Dirichlet b.c.'s
    }
    else {
        u2[id] = 0.25f * ( u1[id-1] + u1[id+1] + u1[id-I] + u1[id+I] );
    }
}

 

- 32K 32-bit registers per SM
- up to 63 registers per thread
- up to 1536 threads (at most 1024 per thread block)
- at the maximum of 63 registers per thread => at most 520 threads per SM
- at the maximum of 1536 threads => about 21 registers per thread
- not much difference between "fat" and "thin" threads

What happens if your application needs more registers? They "spill" over into L1 cache, and from there to device memory, and the application then suffers from the latency and bandwidth implications of using device memory.

Avoiding register spill is now one of my main concerns in big applications, but remember:

- with 1024 threads, the 400-600 cycle latency of device memory is usually OK because some warps can do useful work while others wait for data

- provided there are 20 flops per variable read from (or written to) device memory, bandwidth is not a limiting issue

Local arrays

What happens if your application uses a little array?

__global__ void lap(float *u) {
    float ut[3];
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    for (int k=0; k<3; k++)
        ut[k] = u[tid+k*gridDim.x*blockDim.x];
    for (int k=0; k<3; k++)
        u[tid+k*gridDim.x*blockDim.x] =
            A[3*k]*ut[0] + A[3*k+1]*ut[1] + A[3*k+2]*ut[2];
}

In simple cases like this (quite common) the compiler converts the array to scalar registers:

__global__ void lap(float *u) {
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    float ut0 = u[tid+0*gridDim.x*blockDim.x];
    float ut1 = u[tid+1*gridDim.x*blockDim.x];
    float ut2 = u[tid+2*gridDim.x*blockDim.x];
    u[tid+0*gridDim.x*blockDim.x] = A[0]*ut0 + A[1]*ut1 + A[2]*ut2;
    u[tid+1*gridDim.x*blockDim.x] = A[3]*ut0 + A[4]*ut1 + A[5]*ut2;
    u[tid+2*gridDim.x*blockDim.x] = A[6]*ut0 + A[7]*ut1 + A[8]*ut2;
}

In more complicated cases, the compiler puts the array into device memory:

- still referred to in the documentation as a "local array" because each thread has its own private copy
- held in L1 cache by default, and may never be transferred to device memory
- 16kB of L1 cache equates to 4096 32-bit variables, which is only 4 per thread when using 1024 threads
- beyond this, it will have to spill to device memory

CUDA: More about shared Memory

 

 

http://stackoverflow.com/a/7041976/2041023

On Fermi, you can use up to 16kb or 48kb (depending on the configuration you select) of shared memory per block - the number of blocks which will run concurrently on a multiprocessor is determined by how much shared memory and registers each block requires, up to a maximum of 8. If you use 48kb, then only a single block can run concurrently. If you use 1kb per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.

 

http://stackoverflow.com/a/11507311/2041023

Yes, blocks on the same multiprocessor shared the same amount of shared memory, which is 48KB per multiprocessor for your GPU card (compute capability 2.0). So if you have N blocks on the same multiprocessor, the maximum size of shared memory per block is (48/N) KB.

 

https://developer.nvidia.com/content/using-shared-memory-cuda-cc

SHARED MEMORY BANK CONFLICTS

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.

Devices of compute capability 3.x have configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.
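For example, a minimal sketch of switching to eight-byte banks around double-precision kernels (using only the enum values named above):

// ask for 8-byte shared memory banks before launching kernels that index __shared__ double arrays
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
// ... launch the double-precision kernels ...
// restore the 4-byte default for kernels that mostly work on 32-bit data
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);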

 

CONFIGURING THE AMOUNT OF SHARED MEMORY

On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings, 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. This can be configured at runtime using the runtime API from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache which can be obtained using the option cudaFuncCachePreferEqual.
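A minimal sketch of both calls (myKernel is a placeholder for your own kernel):

__global__ void myKernel(float* data);             // placeholder kernel declared elsewhere

// device-wide preference: favour shared memory over L1 for all kernels
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

// per-kernel override: this particular kernel prefers a larger L1 cache
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
// myKernel<<<grid, block>>>(d_data);              // launched afterwards as usual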

Tuesday, April 23, 2013

C/C++ data types – when we were young

 

http://www.cplusplus.com/doc/tutorial/variables/

char
Character or small integer.
1byte
signed: -128 to 127
unsigned: 0 to 255

short int(short)
Short Integer.
2bytes
signed: -32768 to 32767
unsigned: 0 to 65535

int
Integer.
4bytes
signed: -2147483648 to 2147483647
unsigned: 0 to 4294967295

long int (long)
Long integer.
4bytes
signed: -2147483648 to 2147483647
unsigned: 0 to 4294967295

bool
Boolean value. It can take one of two values: true or false.
1byte
true or false

float
Floating point number.
4bytes
+/- 3.4e +/- 38 (~7 digits)

double
Double precision floating point number.
8bytes
+/- 1.7e +/- 308 (~15 digits)

long double
Long double precision floating point number.
8bytes
+/- 1.7e +/- 308 (~15 digits)

wchar_t
Wide character.
2 or 4 bytes
1 wide character

 

* The values of the columns Size and Range depend on the system the program is compiled for. The values shown above are those found on most 32-bit systems. But for other systems, the general specification is that int has the natural size suggested by the system architecture (one "word") and the four integer types char, short, int and long must each one be at least as large as the one preceding it, with char being always one byte in size. The same applies to the floating point types float, double and long double, where each one must provide at least as much precision as the preceding one.
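Since the sizes depend on the platform, the quickest sanity check is to print them on your own compiler (a trivial sketch):

#include <iostream>

int main() {
    std::cout << "char:        " << sizeof(char)        << '\n';
    std::cout << "short:       " << sizeof(short)       << '\n';
    std::cout << "int:         " << sizeof(int)         << '\n';
    std::cout << "long:        " << sizeof(long)        << '\n';
    std::cout << "bool:        " << sizeof(bool)        << '\n';
    std::cout << "float:       " << sizeof(float)       << '\n';
    std::cout << "double:      " << sizeof(double)      << '\n';
    std::cout << "long double: " << sizeof(long double) << '\n';
    std::cout << "wchar_t:     " << sizeof(wchar_t)     << '\n';
    return 0;
}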

CUDA: Dynamic Shared Memo Kernel

 

https://developer.nvidia.com/content/using-shared-memory-cuda-cc

In this case the shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter, as in the following excerpt.

dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

The dynamic shared memory kernel, dynamicReverse(), declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and use of the extern specifier). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.

What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before, and use pointers into it to divide it into multiple arrays, as in the following excerpt.

extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = &integerData[nI]; // nF floats
char *charData = &floatData[nF]; // nC chars

In the kernel launch, specify the total shared memory needed, as in the following.

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);
http://stackoverflow.com/a/5531640/2041023
“Also be aware when using pointers that shared memory uses 32 bit words, and all allocations must be 32 bit word aligned, irrespective of the type of the shared memory allocation.”
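What that warning means in practice: if a char array has to come before a float or int array in the carve-out, the next pointer must be rounded up to a 4-byte boundary by hand. A sketch (nC and nF are hypothetical element counts):

extern __shared__ char smem[];
char*  charData  = smem;                                    // nC chars first
size_t offset    = (nC * sizeof(char) + 3) & ~size_t(3);    // round up to the next 4-byte boundary
float* floatData = reinterpret_cast<float*>(smem + offset); // nF floats, now correctly aligned
// ...and the third launch parameter must include the padding bytes as well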

Saturday, April 20, 2013

Opengl/Glut: Glut keyboard codes/keys for glutSpecialFunc() only

 

http://www.lighthouse3d.com/tutorials/glut-tutorial/keyboard/

GLUT_KEY_F1		F1 function key
GLUT_KEY_F2 F2 function key
GLUT_KEY_F3 F3 function key
GLUT_KEY_F4 F4 function key
GLUT_KEY_F5 F5 function key
GLUT_KEY_F6 F6 function key
GLUT_KEY_F7 F7 function key
GLUT_KEY_F8 F8 function key
GLUT_KEY_F9 F9 function key
GLUT_KEY_F10 F10 function key
GLUT_KEY_F11 F11 function key
GLUT_KEY_F12 F12 function key
GLUT_KEY_LEFT Left arrow key
GLUT_KEY_RIGHT Right arrow key
GLUT_KEY_UP Up arrow key
GLUT_KEY_DOWN Down arrow key
GLUT_KEY_PAGE_UP Page Up key
GLUT_KEY_PAGE_DOWN Page Down key
GLUT_KEY_HOME Home key
GLUT_KEY_END End key
GLUT_KEY_INSERT Insert key
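These codes arrive in the callback registered with glutSpecialFunc(); a minimal sketch of such a handler (the actions are placeholders):

void specialKeys(int key, int x, int y) {
    switch (key) {
        case GLUT_KEY_F1:   /* e.g. toggle help overlay */  break;
        case GLUT_KEY_LEFT: /* e.g. pan camera left */      break;
        case GLUT_KEY_UP:   /* e.g. pan camera up */        break;
    }
    glutPostRedisplay();   // request a redraw after handling the key
}

// in main(), after glutCreateWindow(...):
//     glutSpecialFunc(specialKeys);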

c++, java: hash (x, y, z)

 

c++:

http://stackoverflow.com/a/13389744

Here's a baseline hash function:

unsigned long long h = (n << 24) | (a << 16) | (b << 8) | c;
return std::hash<unsigned long long>()(h);

I.e., just pack the members into an unsigned long long, then offload the work to std::hash. In the common case that int is 32 bits wide and long long is 64 bits, and assuming your chars are not negative, this uses all the information in your objects for the hash.


http://stackoverflow.com/a/1820504

unsigned int hash = in[0];
hash *= 37;
hash += in[1];
hash *= 37;
hash += in[2];
hash *= 37;
hash += in[3];

java:


http://stackoverflow.com/a/5730232


public int hashCode()
{
    int result = (int) (x ^ (x >>> 32));
    result = 31 * result + (int) (y ^ (y >>> 32));
    result = 31 * result + (int) (z ^ (z >>> 32));
    return result;
}

 



http://stackoverflow.com/a/5730337

(int)(x ^ (x >> 32) ^ y ^ (y >> 32) ^ z ^ (z >> 32));
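If the goal is to use (x, y, z) as a key in a std::unordered_map, the ingredients above can be wrapped in a small functor; a sketch (C++11, type and names are my own):

#include <cstddef>
#include <functional>
#include <unordered_map>

struct Key3 {
    long long x, y, z;
    bool operator==(const Key3& o) const { return x == o.x && y == o.y && z == o.z; }
};

struct Key3Hash {
    std::size_t operator()(const Key3& k) const {
        // combine the three coordinates with the multiply-by-prime scheme quoted above
        std::size_t h = std::hash<long long>()(k.x);
        h = h * 37 + std::hash<long long>()(k.y);
        h = h * 37 + std::hash<long long>()(k.z);
        return h;
    }
};

// usage: std::unordered_map<Key3, int, Key3Hash> grid;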

Wednesday, April 17, 2013

CUDA: Impact of not having the correct compute_xx and sm_xx values – Debug Error!

 

image

cudaErrorInvalidDeviceFunction

image

 

First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: cudaError at memory location 0x00564d84..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x00564f0c..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: cudaError at memory location 0x00564044..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x005641cc..

Solution:

- Check your GPU:

image

- Check NVIDIA's table of GPU – CUDA compute capability (Code Generation) versions

https://developer.nvidia.com/cuda-gpus

- In MSVS 2010, go to Project Properties –> CUDA C/C++ –> Device –> Code Generation

change Compute_xx and SM_xx to the number in that table without the decimal point

image

NOTE: it is possible that the exact compute capability is not supported by your version of Nsight/CUDA toolkit; in that case, go back one version number and try again.

Install/setup glut, glew recap

 

Download files

http://glew.sourceforge.net/

http://freeglut.sourceforge.net/

Copy FreeGlut files

Copy the contents from FreeGLUT’s lib folder into the Windows SDK’s Lib folder.
Copy the contents from FreeGLUT’s include\GL folder into the Windows SDK’s Include\gl folder.

Copy Glew files

Copy the contents from GLEW’s lib folder into the Windows SDK’s Lib folder.
Copy the contents from GLEW’s include\GL folder into the Windows SDK’s Include\gl folder.

Copy freeglut.dll from \freeglut\bin:

‘C:\Windows\System32\’ (32bit)

OR

C:\Windows\SysWOW64 (64bit)

Link in MSVS Project properties:

Go to the Linker property page, and click on the Input sub-page. Expand the drop-down list next to the Additional Dependencies option and click on Edit…. In the window, add glew32.lib followed by a newline, and freeglut.lib

 

http://www.cplusplus.com/forum/beginner/51225/

http://openglbook.com/setting-up-opengl-glew-and-freeglut-in-visual-c/
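To verify the whole setup, a minimal smoke-test program that links against both libraries (window size and title are arbitrary):

#include <GL/glew.h>     // GLEW header must come before the GLUT header
#include <GL/glut.h>
#include <cstdio>

void display() {
    glClear(GL_COLOR_BUFFER_BIT);
    glutSwapBuffers();
}

int main(int argc, char** argv) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowSize(640, 480);
    glutCreateWindow("glew/freeglut smoke test");

    GLenum err = glewInit();             // must be called after a GL context exists
    if (err != GLEW_OK) {
        fprintf(stderr, "glewInit failed: %s\n", glewGetErrorString(err));
        return 1;
    }
    printf("OpenGL version: %s\n", (const char*)glGetString(GL_VERSION));

    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}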

Saturday, April 13, 2013

CUDA: unresolved external function …

 

When you see this build error, with mangled names full of random-looking characters such as '@' and '&' wrapped around your function/variable names, check:

1. Whether the method signature is consistent between the .cu and .cuh files, and between the definition and every caller of that function.

2. Whether the method is declared in a .h but defined in a .cpp file and not inline; if you only #include the .h file, you have to either move the definition into the header file or make it inline.

CUDA: NVidia Driver/NSight Installation failure: NVidia installation failed…

 

If you see that some components install and some do not, even after manually removing the previous CUDA toolkit version,

it is most likely a registry/WMI issue: try the steps mentioned in Method 1 of this post (Method 2 may work too, but I have not tried it).

 

http://answers.microsoft.com/en-us/windows/forum/windows_7-performance/cannot-access-the-windows-management/8849f190-cf4b-e011-8dfc-68b599b31bf5

 

Method 1

You may follow these steps and check if the issue persists.

a. Click Start -> Type CMD -> Right click on CMD from the result -> Click Run as Administrator

b. Run the following command one at a time and press enter to execute

• cd /d %windir%\system32\wbem

• for %i in (*.dll) do regsvr32 -s %i

• for %i in (*.exe) do %i /regserver

c. Close all windows and reboot the computer and now try opening the system information

 

 

Method 2

You may also try rebuilding WMI repository and check if the issue persists.

a. Click Start -> Type CMD -> Right click on CMD from the result -> Click Run as Administrator

b. Type the command ‘net stop winmgmt’ and press enter

c. Leave the command prompt open and click Start and type system32 -> Open the folder system32 from the result pane

d. From the system32 folder open Wbem and look for the folder Repository

e. Right click on Repository and click rename

f. Change the folder name from Repository to Repository.old

g. Switch back to command prompt and type net start winmgmt and hit enter to execute

h. Close all the windows and reboot the computer and now try opening the system information

 

Friday, April 12, 2013

CUDA: VS2010, ‘Build’ may not cover .cu and cuh changes

 

Safest is to rebuild the solution or project.

CUDA Thrust: An example of conversion between device_vector, device_ptr, host_vector and raw pointer

 

#pragma once

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>
#include "Particle.h"
#include "ParticleSystem.h"
#include <vector>

// Template structure to pass to kernel
template < typename T >
struct KernelArray
{
    T*  _array;
    int _size;
};

// Function to convert device_vector to structure
template < typename T >
KernelArray< T > vecToKernelArr( thrust::device_vector< T >& dVec )
{
    KernelArray< T > kArray;
    kArray._array = thrust::raw_pointer_cast( &dVec[0] );
    kArray._size  = ( int ) dVec.size();
    return kArray;
}

// Function to convert a KernelArray to a thrust::device_vector
template <typename T>
thrust::device_vector<T> kernelArrToDevVec( KernelArray<T>& kArray )
{
    thrust::device_ptr<T> dev_ptr = thrust::device_pointer_cast(kArray._array);
    thrust::device_vector<T> dVec(dev_ptr, dev_ptr + kArray._size);
    return dVec;
}

// Function to convert a KernelArray to a thrust::host_vector
template <typename T>
thrust::host_vector<T> kernelArrToHostVec( KernelArray<T>& kArray )
{
    thrust::device_ptr<T> dev_ptr = thrust::device_pointer_cast(kArray._array);
    thrust::host_vector<T> hVec(dev_ptr, dev_ptr + kArray._size);
    return hVec;
}

// Function to convert a KernelArray to a thrust::device_ptr
template <typename T>
thrust::device_ptr<T> kernelArrToDevPtr( KernelArray<T>& kArray )
{
    thrust::device_ptr<T> dev_ptr = thrust::device_pointer_cast(kArray._array);
    return dev_ptr;
}
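For completeness, a sketch of how the struct might be consumed; since KernelArray is a plain struct holding a raw pointer and a size, it can be passed to a kernel by value (the kernel below is hypothetical):

// hypothetical kernel taking the helper struct by value
template <typename T>
__global__ void scaleKernel( KernelArray<T> arr, T factor )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < arr._size)
        arr._array[i] *= factor;
}

// host-side usage sketch:
//   thrust::device_vector<float> dVec(1024, 1.0f);
//   scaleKernel<<<(1024 + 255) / 256, 256>>>(vecToKernelArr(dVec), 2.0f);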

CUDA Thrust: Summary: host_vector, device_vector, device_ptr, raw pointer conversions

 

http://stackoverflow.com/questions/7678995/from-thrustdevice-vector-to-raw-pointer-and-back 

// our host vector
thrust::host_vector<dbl2> hVec;

// pretend we put data in it here
...

// get a device_vector
thrust::device_vector<dbl2> dVec = hVec;

// get the device ptr
thrust::device_ptr<dbl2> devPtr = &dVec[0];

// if you want to pass it to the kernel, need to convert to a raw pointer
dbl2* ptrDVec = thrust::raw_pointer_cast(&dVec[0]);

// To get back from the raw pointer to device_ptr so that host code can access it
thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(raw_ptr);

// Now we can, for example:

// use device_ptr in Thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

// access device memory transparently through device_ptr
dev_ptr[0] = 1;

// On a side note, if you have a device_vector, you can get its device_ptr
thrust::device_vector<double> v1(10);           // create a vector of size 10
thrust::device_ptr<double> dp = v1.data();      // or &v1[0]

// Now: from thrust::device_ptr<T> we can construct thrust::device_vector<T>
thrust::device_vector<double> v2(v1);                     // from copy
thrust::device_vector<double> v3(dp, dp + 10);            // from iterator range
thrust::device_vector<double> v4(v1.begin(), v1.end());   // from iterator range

Wednesday, April 10, 2013

Lattice based Valuation Methods

 

 

http://en.wikipedia.org/wiki/Lattice_model_(finance)

http://en.wikipedia.org/wiki/Binomial_options_pricing_model

 

http://www.columbia.edu/~mnb2/broadie/Assets/JFQA-422-Broadie-Kaya-Proofs.pdf

 

http://faculty.weatherhead.case.edu/ritchken/documents/Beyond_BS.pdf

http://www.scientificjournals.org/journals2007/articles/1291.pdf

CUDA: back to basic object passing

 

 

http://stackoverflow.com/questions/13185221/cuda-host-object-to-device

 

Host:

MyKernel<<<grid_dim, block_dim>>>(my_object);

Device:

__global__ void MyKernel(MyObject my_object) {

If you need to pass an array of objects, an easy way is to use thrust::device_vector. For instance:

Host:

#include <thrust/device_vector.h>
thrust::device_vector<MyObject> my_objects;
...
MyObject* my_objects_d = thrust::raw_pointer_cast(&my_objects[0]);
MyKernel<<<grid_dim, block_dim>>>(my_objects_d);

Device:

__global__ void MyKernel(MyObject* my_objects) {

Tuesday, April 9, 2013

C++: for people who love STL deque

 

http://www.codeproject.com/Articles/5425/An-In-Depth-Study-of-the-STL-Deque-Container

CUDA: element of thrust::device_vector can be modified directly

 

http://code.google.com/p/thrust/wiki/QuickStartGuide#Vectors

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>


int main(void)
{

// H has storage for 4 integers
thrust::host_vector<int> H(4);


// initialize individual elements
H[0] = 14;
H[1] = 20;
H[2] = 38;
H[3] = 46;


// H.size() returns the size of vector H
std::cout << "H has size " << H.size() << std::endl;


// print contents of H
for(int i = 0; i < H.size(); i++)
std::cout << "H[" << i << "] = " << H[i] << std::endl;


// resize H
H.resize(2);
std::cout << "H now has size " << H.size() << std::endl;


// Copy host_vector H to device_vector D
thrust::device_vector<int> D = H;

// elements of D can be modified
D[0] = 99;
D[1] = 88;

// print contents of D
for(int i = 0; i < D.size(); i++)
std::cout << "D[" << i << "] = " << D[i] << std::endl;

// H and D are automatically deleted when the function returns
return 0;


}

CUDA: Convert all STL containers (vector, deque, list, etc) to thrust::device_vector

 

http://developer.download.nvidia.com/CUDA/training/introductiontothrust.pdf

// list container on host
std::list<int> h_list;
h_list.push_back(13);
h_list.push_back(27);


// copy list to device vector
thrust::device_vector<int> d_vec(h_list.size());
thrust::copy(h_list.begin(), h_list.end(), d_vec.begin());

// alternative method
thrust::device_vector<int> d_vec(h_list.begin(), h_list.end());

CUDA Memory Recap

 

As illustrated in CUDA C Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:

  • Load data from device memory to shared memory,
  • Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads,
  • Process the data in shared memory,
  • Synchronize again if necessary to make sure that shared memory has been updated with the results,
  • Write the results back to device memory.
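A minimal sketch of that pattern (kernel, tile size, and the per-element operation are all hypothetical; assumes blockDim.x == TILE):

#define TILE 256

__global__ void scaleViaShared(const float* in, float* out, int n, float factor)
{
    __shared__ float tile[TILE];                 // on-chip staging buffer

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];             // 1. load from device memory into shared memory

    __syncthreads();                             // 2. make all loads visible to the whole block

    float result = tile[threadIdx.x] * factor;   // 3. process the data in shared memory

    // 4. a second __syncthreads() would go here if shared memory were written again

    if (idx < n)
        out[idx] = result;                       // 5. write the results back to device memory
}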

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Global Memory

Size and Alignment Requirement

Two-Dimensional Arrays

In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
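A minimal sketch of what that looks like in code (the dimensions are arbitrary, and rowPtr/value are just illustrative names):

float* d_data = NULL;
size_t pitch  = 0;                         // returned pitch is in bytes
size_t width  = 500, height = 100;         // arbitrary example dimensions

cudaMallocPitch((void**)&d_data, &pitch, width * sizeof(float), height);

// element (row, col) then lives at an offset of row * pitch bytes:
//     float* rowPtr = (float*)((char*)d_data + row * pitch);
//     float  value  = rowPtr[col];

cudaFree(d_data);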

Local Memory

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses.

Shared Memory

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.

about bank conflict:

http://space.itpub.net/22785983/viewspace-619794

about CPU GPU bandwidth

http://blog.csdn.net/jubincn/article/details/6624854

Constant Memory

Texture and Surface Memory

compute_20, sm_20 for CUDA in VS2010

 

If you see many CUDA macros failing to compile at the same time:

In VS2010, change

Project -> {Project Name} Properties -> CUDA C/C++ -> Device -> [Code Generation] from compute_10,sm_10 to compute_20,sm_20

Monday, April 8, 2013

.cpp files have to be renamed to .cu if they use CUDA syntax

 

syntax such as: myKernel<<<1, 2>>>(my_objects_d);

This is true even if you have already set the file's 'Item Type' to CUDA C/C++.

VS2010 Nsight non-CUDA to CUDA project

 

Remember to check your project properties

-> Linker –> Input –> Additional Dependencies

If "cudart.lib;" is not there, add it; this will 'pull in' the extra dependencies for the CUDA compiler once and for all.

 

Sometimes it may seem to work even without it; that is only because those dependencies were already 'pulled in' for some other reason.

Sunday, April 7, 2013

NSight, CUDA Toolkit, and driver version matches for VS2010

 

NSight 2.2 – CUDA SDK 4.0, 4.1 and 4.2 – Driver 301.42 or newer

https://developer.nvidia.com/rdp/nsight-visual-studio-edition-downloads

 

NSight 3.0 – CUDA SDK 5.0, 4.2 and 4.1 – Driver 306.94 or newer

https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access

 

Otherwise, it won’t build, even after you add the seemingly correct Build Customization options