Tuesday, May 14, 2013
OpenGL: glutMainLoopEvent() only loops once
http://www.gamedev.net/topic/582980-freeglut-glutmainloopevent-wont-return/
while(1) glutMainLoopEvent();
seems to run only once.
Reason:
glutMainLoopEvent() only dispatches events that are already pending. Unless something in the loop generates a GL event (such as a posted redisplay), the display callback is never invoked again, so the loop appears to run only once.
E.g. the following code will appear to run only once instead of the 100 iterations specified:
#include <GL/freeglut.h>   // glutMainLoopEvent() is a freeglut extension

int main(int argc, char **argv) {
    // init GLUT and create window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200, 0);
    glutInitWindowSize(600, 600);
    glutCreateWindow("Particle Simulator");
    glutDisplayFunc(display);
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j = 0; j < 100; j++)
    {
        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }
    return 1;
}
but the following will behave as expected (display() issues many GL draw calls each iteration, so the window visibly updates):
int main(int argc, char **argv) {
    // init GLUT and create window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200, 0);
    glutInitWindowSize(600, 600);
    glutCreateWindow("Particle Simulator");
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j = 0; j < 100; j++)
    {
        // display to the screen
        display();
        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }
    return 1;
}
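An alternative worth noting (untested sketch, same setup and callbacks as the first listing): since glutMainLoopEvent() only dispatches events that are already pending, posting a redisplay each iteration gives it something to dispatch, so the registered display callback should fire every pass without calling display() by hand:

    for (int j = 0; j < 100; j++)
    {
        glutPostRedisplay();   // queue a redisplay event for the current window
        glutMainLoopEvent();   // process pending events, now including the redraw
    }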
Monday, May 13, 2013
CUDA: debugger appears to 'freeze' when the block size is significantly larger than the data size
The program still works, but the debugger looks frozen, possibly because it is still stepping through the many 'empty' threads that have no data to work on.
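A related note: the usual way to cope with a block size larger than the data is an early-exit bounds check, so the surplus threads do nothing. The kernel below is a hypothetical sketch of that pattern, not code from this program:

__global__ void scaleKernel(float *data, int n, float factor) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)        // threads beyond the data size exit immediately
        return;
    data[idx] *= factor;
}

// e.g. launch with a block far larger than the data (d_data is a hypothetical device pointer):
// scaleKernel<<<1, 256>>>(d_data, 10, 2.0f);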
CUDA: debugger doesn't stop at a breakpoint
Try rebuilding your solution first if you haven't already done so.
Friday, May 3, 2013
CUDA: Registers, Shared Memory and Occupancy 1
Par Lab Boot Camp @ UC Berkeley 2010
1. Registers:
Each CUDA thread has private access to a configurable number of registers.
– The 128 KB (64 KB) SM register file is partitioned among all resident threads.
– The CUDA program can trade degree of thread-block concurrency for amount of per-thread state.
– Registers and stack spill into (cached, on Fermi) “local” DRAM if necessary.
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
http://stackoverflow.com/questions/12207533/increasing-per-thread-register-usage-in-cuda
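One way to exercise that registers-vs-concurrency tradeoff from inside the code (a sketch; the kernel itself is made up) is __launch_bounds__, which tells the compiler how many threads per block to expect and how many blocks per SM to keep resident, forcing it to use fewer registers per thread (spilling to local memory if needed). The coarser alternative is nvcc's -maxrregcount flag.

// Hypothetical kernel: fit at least 4 resident blocks of 256 threads per SM,
// trading per-thread registers for more thread-block concurrency.
__global__ void __launch_bounds__(256, 4)
squareKernel(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * in[idx];
}

// Compile-time alternative: nvcc -maxrregcount=32 file.cu
// caps every kernel in the file at 32 registers per thread.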
2. Shared Memory
Each thread block has private access to a configurable amount of scratchpad memory.
– The Fermi SM’s 64 KB SRAM can be configured as 16 KB L1 cache + 48 KB scratchpad, or vice versa*.
– Pre-Fermi SMs have 16 KB scratchpad only.
– The available scratchpad space is partitioned among resident thread blocks, providing another concurrency-state tradeoff.
http://stackoverflow.com/questions/11274853/is-cuda-shared-memory-also-cached
Section G.4.1 states:
"The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory with 16 KB of L1 cache (default setting)"
Configure it using cudaFuncSetCacheConfig().
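A sketch of both halves of that (kernel name and sizes are illustrative, not from any particular program): the scratchpad is declared with __shared__ inside the kernel, and cudaFuncSetCacheConfig() asks the runtime to prefer the 48 KB shared / 16 KB L1 split for that kernel.

#include <cuda_runtime.h>

// Stage data through per-block shared memory (assumes a launch with 256 threads per block).
__global__ void tileKernel(const float *in, float *out) {
    __shared__ float tile[256];                    // per-block scratchpad
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    __syncthreads();                               // tile is now visible to the whole block
    out[idx] = tile[blockDim.x - 1 - threadIdx.x]; // e.g. reverse within the block
}

int main() {
    // Prefer 48 KB shared memory / 16 KB L1 for this kernel (Fermi and later).
    cudaFuncSetCacheConfig(tileKernel, cudaFuncCachePreferShared);
    // ... allocate device buffers, launch tileKernel<<<blocks, 256>>>(d_in, d_out), etc.
    return 0;
}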
Thursday, May 2, 2013
CUDA: most crucial thing for optimization
http://www.youtube.com/watch?v=hG1P8k4xqR0
0:19:06
If threads in a warp access aligned, contiguous blocks of DRAM, the accesses will be coalesced into a single high-bandwidth access.
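A sketch of what that means in a kernel (names and the strided pattern are illustrative): in the first version consecutive threads of a warp read consecutive floats, so one warp's 32 loads fall in a single aligned, contiguous block of DRAM; in the second, they are spread across many segments and cannot be coalesced.

// Coalesced: thread k of a warp reads element base + k.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];                  // consecutive threads -> consecutive addresses
}

// Uncoalesced: consecutive threads read addresses `stride` elements apart,
// so one warp's reads span many separate memory segments.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[(idx * stride) % n];   // wraps to stay in bounds; illustrative only
}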
Wednesday, May 1, 2013
CUDA: Avoid conditional statements in kernels
http://stackoverflow.com/a/1645126/2041023
From section 6.1 of the CUDA Best Practices Guide:
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
http://stackoverflow.com/a/13397496/2041023
The thread warp is a hardware group of threads that execute on the same Streaming Multiprocessor (SM). Threads of a warp can be compared to sharing a common program counter between the threads; hence all threads must execute the same line of program code. If the code has some branching statements such as if ... then ... else, the warp must first execute the threads that enter the first block while the other threads of the warp wait, then the threads that enter the next block execute while the others wait, and so on. Because of this behaviour, conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if ... then ... else block and mitigate this limitation.
The while and for statements are branching statements too, so the issue is not limited to if.
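A minimal sketch of the reordering idea (kernel and condition are made up): in the first kernel, odd and even threads of the same warp take different paths and may be serialized; the second computes both results and selects one arithmetically, so every thread follows the same instruction stream.

// Divergent: threads of one warp split between the two branches.
__global__ void divergentKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (idx % 2 == 0)
        data[idx] = data[idx] * 2.0f;
    else
        data[idx] = data[idx] + 1.0f;
}

// Branch-free alternative: same instructions for every thread,
// result chosen by a select rather than by control flow.
__global__ void branchFreeKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float ifEven = data[idx] * 2.0f;
    float ifOdd  = data[idx] + 1.0f;
    data[idx] = (idx % 2 == 0) ? ifEven : ifOdd;
}

Another common trick is to branch on a warp-uniform value (e.g. something derived from threadIdx.x / warpSize) so that every thread of a given warp takes the same path.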