Tuesday, May 14, 2013

opengl: glutMainLoopEvent() only loops once

 

http://www.gamedev.net/topic/582980-freeglut-glutmainloopevent-wont-return/

while(1) glutMainLoopEvent();
seems to run only once.
 
Reason:
glutMainLoopEvent() seems to keep looping only if
there is a GL event to process in the same loop with it.
E.g., the following code will appear to run only once,
instead of the 100 times specified:
int main(int argc, char **argv) {

    // init GLUT and create window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200, 0);
    glutInitWindowSize(600, 600);
    glutCreateWindow("Particle Simulator");
    glutDisplayFunc(display);
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j = 0; j < 100; j++)
    {
        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }

    return 1;
}
but the following will behave as expected
(display() calls many GL draw functions, so it appears to work):
int main(int argc, char **argv) {

    // init GLUT and create window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200, 0);
    glutInitWindowSize(600, 600);
    glutCreateWindow("Particle Simulator");
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j = 0; j < 100; j++)
    {
        // display to the screen
        display();

        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }

    return 1;
}
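
A third variant (untested here, but consistent with the reasoning above): instead of calling display() directly, post a redisplay each iteration so glutMainLoopEvent() always has an event to process. glutPostRedisplay() is standard GLUT; this assumes glutDisplayFunc(display) is registered, as in the first example:

    for (int j = 0; j < 100; j++)
    {
        // mark the window for redisplay so the registered display
        // callback runs inside the event processing below
        glutPostRedisplay();

        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }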


 

Monday, May 13, 2013

CUDA: Debugger will ‘freeze’ when block size is (significantly) larger than the data size

 

The program still works, but the debugger looks ‘frozen’, potentially just because there are many ‘empty’ threads it is still trying to loop through?
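
A minimal sketch of the situation (the kernel and sizes are made up for illustration): threads past the end of the data exit immediately via the usual bounds check, but the debugger still has to account for them:

// illustrative kernel: more threads launched than elements
__global__ void scale(float *data, int n, float k)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;      // 'empty' threads beyond the data do nothing
    data[idx] *= k;
}

// e.g. n = 10 but 512 threads per block: 502 threads are 'empty'
// scale<<<1, 512>>>(d_data, 10, 2.0f);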

CUDA: debugger doesn’t stop at breakpoint

 

Try rebuilding your solution first if you haven’t already done so.

Friday, May 3, 2013

CUDA: Registers, Shared Memory and Occupancy 1

 

Par Lab Boot Camp @ UC Berkeley 2010

1. Registers:

Each CUDA thread has private access to a
configurable number of registers.

– The 128 KB (64 KB) SM register file is partitioned
among all resident threads.

– The CUDA program can trade degree of thread-block
concurrency for amount of per-thread state.

– Registers and stack spill into (cached, on Fermi)
“local” DRAM if necessary.

 

http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable

 

http://stackoverflow.com/questions/12207533/increasing-per-thread-register-usage-in-cuda
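
The concurrency-vs-registers tradeoff above can be steered from code. A minimal sketch, assuming a made-up saxpy kernel; __launch_bounds__ and the -maxrregcount nvcc flag are both standard CUDA:

// cap threads per block at 256 and ask for at least 4 resident
// blocks per SM, which limits the registers available per thread
__global__ void __launch_bounds__(256, 4)
saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// or cap register usage for the whole file at compile time:
//   nvcc -maxrregcount=32 kernel.cu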

 

2. Shared Memory

Each thread block has private access to a
configurable amount of scratchpad memory.

– The Fermi SM’s 64 KB SRAM can be
configured as 16 KB L1 cache + 48 KB
scratchpad, or vice versa*

– Pre-Fermi SMs have a 16 KB scratchpad only.

– The available scratchpad space is partitioned
among resident thread blocks, providing
another concurrency-state tradeoff.

http://stackoverflow.com/questions/11274853/is-cuda-shared-memory-also-cached
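
A minimal sketch of using the scratchpad (__shared__ is standard CUDA; the block-sum reduction is illustrative and assumes a block size of 256):

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];            // per-block scratchpad
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                      // make all loads visible block-wide

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}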

 

Section G.4.1 of the CUDA C Programming Guide states:

"The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory with 16 KB of L1 cache (default setting)"

Configure it using cudaFuncSetCacheConfig():

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__HIGHLEVEL_ge0969184de8a5c2d809aa8d7d2425484.html

https://devtalk.nvidia.com/default/topic/469086/how-to-use-cudafuncsetcacheconfig-correctly-one-of-the-most-anticipating-features-does-not-seem-/
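
A minimal sketch of the call (cudaFuncSetCacheConfig and cudaDeviceSetCacheConfig are the real runtime APIs; myKernel is a placeholder):

// per-kernel: prefer 48 KB shared memory + 16 KB L1
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// per-kernel: prefer 48 KB L1 + 16 KB shared memory
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

// or set a device-wide default
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);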

Thursday, May 2, 2013

CUDA: most crucial thing for optimization

 

http://www.youtube.com/watch?v=hG1P8k4xqR0

0:19:06

If threads in a warp access aligned, contiguous blocks of DRAM, the accesses will be coalesced into a single high-bandwidth access.
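
A minimal sketch of the difference (both kernels are illustrative):

// coalesced: consecutive threads in a warp touch consecutive
// addresses, so the warp's loads merge into few wide transactions
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// strided: consecutive threads touch addresses far apart, so the
// warp's loads hit many separate DRAM segments
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}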

Wednesday, May 1, 2013

CUDA: Avoid conditional statement in kernel

 

http://stackoverflow.com/a/1645126/2041023

From section 6.1 of the CUDA Best Practices Guide:

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

http://stackoverflow.com/a/13397496/2041023

The thread warp is a hardware group of threads that execute on the same Streaming Multiprocessor (SM). Threads of a warp can be compared to sharing a common program counter between the threads, hence all threads must execute the same line of program code. If the code has some branching statements such as if ... then ... else the warp must first execute the threads that enter the first block, while the other threads of the warp wait, next the threads that enter the next block will execute while the other threads wait and so on. Because of this behaviour conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if ... then ... else block and mitigate this limitation.

The while and for statements are also branching statements, so divergence is not limited to if.
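
A minimal sketch of the reordering idea (illustrative; note the compiler often predicates small branches like this on its own):

// divergent form: the two sides of the branch serialize when a
// warp contains threads taking each path
__global__ void clampDivergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)
            x[i] *= 2.0f;
        else
            x[i] = 0.0f;
    }
}

// branchless form: every thread executes the same instructions
__global__ void clampBranchless(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (x[i] > 0.0f) ? x[i] * 2.0f : 0.0f;
}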