Tuesday, November 5, 2013

java: toArray() throws ClassCastException

 

 

The following code (run on Android) always gives me a ClassCastException on the third line:

final String[] v1 = i18nCategory.translation.get(id);
final ArrayList<String> v2 = new ArrayList<String>(Arrays.asList(v1));
String[] v3 = (String[])v2.toArray();

 


This is because toArray() returns an Object[], which can't be cast to a String[] (even though the contents are Strings). The toArray() method only sees a List, not a List<String>: generics are a source-code-only thing and are not available at runtime, so toArray() can't determine what type of array to create.

Use

toArray(new String[v2.size()]);

instead, which allocates an array of the right type (String[]) and the right size.

Wednesday, October 9, 2013

Java: generic static method

 

http://stackoverflow.com/a/4409139

You need to move the type parameter to the method level to indicate that you have a generic method rather than a generic class:

e.g.

static <E> void swapInList(List<E> listToSort, int idxOne, int idxTwo) 
{
    E tmpEle = listToSort.get(idxOne);
    listToSort.set(idxOne, listToSort.get(idxTwo));
    listToSort.set(idxTwo, tmpEle);        
}

Friday, October 4, 2013

Algorithm: need the Grey (or ‘Visiting’) node in DFS?

 

In short, the answer is No. But when we ‘traverse’ to a grey (visiting) node, we can immediately deduce that there is a cycle there, simply because we know that in this round of traversal we started at (i.e. are still ‘visiting’) that grey node.

http://cs.stackexchange.com/a/9681 

When doing a DFS, any node is in one of three states - before being visited, during recursively visiting its descendants, and after all its descendants have been visited (returning to its parent, i.e., wrap-up phase). The three colors correspond to each of the three states. One of the reasons for mentioning colors and time of visit and return is to explicitly make these distinctions for better understanding.

Of course, there are actual uses of these colors. Consider a directed graph G. Suppose you want to check G for the existence of cycles. In an undirected graph, if the node under consideration has a black or grey neighbor, it indicates a cycle (and the DFS does not visit it as you mention). However, in case of a directed graph, a black neighbor does not mean a cycle. For example, consider a graph with 3 vertices - A,B, and C, with directed edges as A→B, B→C, A→C. Suppose the DFS starts at A, then visits B, then C. When it has returned to A, it then checks that C has already been visited and is black. But there is no cycle in the graph.

In a directed graph, a cycle is present if and only if a node is seen again before all its descendants have been visited. In other words, if a node has a neighbor which is grey, then there is a cycle (and not when the neighbor is black). A grey node means we are currently exploring its descendants - and if one such descendant has an edge to this grey node, then there is a cycle. So, for cycle detection in directed graphs, you need to have 3 colors. There could be other examples too, but you should get the idea.
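A minimal sketch of this three-color idea (my own illustration, not code from the linked answer): the DFS reports a cycle exactly when it reaches a grey node, while a black neighbor implies nothing in a directed graph.

#include <vector>

// 0 = white (unvisited), 1 = grey (visiting), 2 = black (finished)
static bool dfsHasCycle(int u, const std::vector<std::vector<int>>& adj, std::vector<int>& color)
{
    color[u] = 1;                                    // grey: u is on the current DFS path
    for (int v : adj[u]) {
        if (color[v] == 1) return true;              // edge back to a grey node => cycle
        if (color[v] == 0 && dfsHasCycle(v, adj, color)) return true;
        // color[v] == 2 (black) is fine: all of v's descendants are already finished
    }
    color[u] = 2;                                    // black: wrap-up phase done
    return false;
}

bool hasCycle(const std::vector<std::vector<int>>& adj)
{
    std::vector<int> color(adj.size(), 0);
    for (int u = 0; u < (int)adj.size(); ++u)
        if (color[u] == 0 && dfsHasCycle(u, adj, color))
            return true;
    return false;
}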

Wednesday, October 2, 2013

c/c++: the good old returning local variable

 

http://stackoverflow.com/questions/423186/since-i-cant-return-a-local-variable-whats-the-best-way-to-return-a-string-fr

From what I've seen, this should work as expected:

void greet() {
    char c[] = "Hello";
    greetWith(c);
    return;
}

but this will cause undefined behavior:

char *greet() {
    char c[] = "Hello";
    return c;
}
 
 

You're absolutely right. Your c array in the second example is being allocated on the stack, and thus the memory will get reused immediately following. In particular, if you had code like

 printf("%s\n",greet());

you'd get weird results, because the call to printf would have reused some of the space of your array.

The solution is to allocate the memory somewhere else. For example:

char c[] = "Hello";

char *greet() {
    return c;
}

Would work. Another choice would be to allocate it statically in scope:

char *greet() {
    static char c[] = "Hello";
    return c;
}

because static memory is allocated separately from the stack in data space.

Your third choice is to allocate it on the heap via malloc:

char *greet() {
    char *c = (char *) malloc(strlen("Hello")+1); /* +1 for the null */
    strcpy(c, "Hello");
    return c;
}

but now you have to make sure that memory is freed somehow, or else you have a memory leak.
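For completeness, a sketch of the caller's side for that last (malloc) version: the caller takes ownership of the buffer and has to release it.

char *msg = greet();   // greet() malloc'd the buffer
printf("%s\n", msg);
free(msg);             // caller frees it, otherwise it leaks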

Tuesday, October 1, 2013

Java: Can’t create generic array of type T

 

http://stackoverflow.com/questions/2927391/whats-the-reason-i-cant-create-generic-array-types-in-java
 
private T[] elements = new T[initialCapacity];

 


 


It's because Java's arrays (unlike generics) contain, at runtime, information about their component type. So you must know the component type when you create the array. Since you don't know what T is at runtime, you can't create the array.

Java: The meaning of T extends Comparable<T>

 

Review this:

http://stackoverflow.com/questions/8537500/java-the-meaning-of-t-extends-comparablet

 

This means that the type parameter must support comparison with other instances of its own type, via the Comparable interface.

An example of such a class is provided in the Oracle tutorial Object Ordering.

Friday, September 20, 2013

JAVA: Static Nested Class and Normal Nested Class (Inner Class) example code

 

package general_threadCreation;

public class CreateThread_RunnableTest 
{    
    // Static Nested Class
    public static class MyRunnableStatic implements Runnable
    {
        private int runnableIdx = -1;
        
        public MyRunnableStatic(int p_runnableIdx)
        {
            this.runnableIdx = p_runnableIdx;
        }

        @Override
        public void run() 
        {
            int i = 0; 
            while(true)
            {
                System.out.println("Static Runnable " + this.runnableIdx + ": " + i);
                i++;
            }
        }        
    }
    
    // Nested Class (Inner Class)
    public class MyRunnableInner implements Runnable
    {
        private int runnableIdx = -1;
        
        public MyRunnableInner(int p_runnableIdx)
        {
            this.runnableIdx = p_runnableIdx;
        }

        @Override
        public void run() 
        {
            int i = 0; 
            while(true)
            {
                System.out.println("Inner Runnable " + this.runnableIdx + ": " + i);
                i++;
            }
        }        
    }
    
    public static void main(String[] args)
    {        
        // ---- Use the normal nested class (inner class)
        CreateThread_RunnableTest.MyRunnableInner myRunnableInner 
            = new CreateThread_RunnableTest().new MyRunnableInner(1);
        //myRunnableInner.run(); // Not parallel if not putting it into a thread
        Thread myRunnableInnerThread = new Thread(myRunnableInner);
        myRunnableInnerThread.start();
        
        // ---- Use the static nested class
        MyRunnableStatic myRunnableStatic = new MyRunnableStatic(2);
        //myRunnableStatic.run(); // Not parallel if not putting it into a thread
        Thread myRunnableStaticThread = new Thread(myRunnableStatic);
        myRunnableStaticThread.start();
    }
}

Tuesday, September 17, 2013

Inherit from a class with generic to a class without

 

I can’t believe I didn’t remember this

 

This is no good:

public class MyIntegerTreeNode<Integer> extends MyComparableTreeNode<T>

(here <Integer> declares a brand-new type parameter that merely shadows java.lang.Integer, and T is never declared anywhere, so it does not compile).

This is good:

public class MyPopulatedIntegerTreeNode extends MyComparableTreeNode<Integer>

Sunday, September 8, 2013

java: Arrays.copyOfRange ‘to’ is EXCLUSIVE

 

java.util.Arrays.copyOfRange(Integer[] original, int from, int to)

 

 

original - the array from which a range is to be copied
from - the initial index of the range to be copied, inclusive
to - the final index of the range to be copied, exclusive. (This index may lie outside the array.)

Thursday, August 15, 2013

c++ managed and unmanaged new

 

Hashtable^ tempHash = gcnew Hashtable(iterators_);

IDictionaryEnumerator^ enumerator = tempHash->GetEnumerator();

 


http://stackoverflow.com/a/202464/2041023


gcnew is for .NET reference objects; objects created with gcnew are automatically garbage-collected; it is important to use gcnew with CLR types


 


http://stackoverflow.com/a/202473/2041023


This is C++/CLI and the caret is the managed equivalent of a * (pointer) which in C++/CLI terminology is called a 'handle' to a 'reference type' (since you can still have unmanaged pointers).

(Thanks to Aardvark for pointing out the better terminology.)

http://stackoverflow.com/a/14378351/2041023

// here normal pointer
P* ptr = new P; // usual pointer allocated on heap
P& nat = *ptr; // object on heap bind to native object
//.. here CLI managed
MO^ mngd = gcnew MO; // allocate on CLI heap
MO% rr = *mngd; // object on CLI heap reference to gc-lvalue

// In general, the punctuator % is to ^ as the punctuator & is to *. The unary & operator in C++ is the unary % operator in C++/CLI. While &ptr yields a P*, %mngd yields an MO^.

c++: CLI vs. CLR

 

http://stackoverflow.com/a/480755/2041023

 

The CLR is Microsoft's implementation of the CLI standard.

Sunday, July 21, 2013

java: pass-by-value vs. pass-by-reference

 

package cci_chap1_arrayString;

public class RemoveDuplicate {
    String removeDuplicate(String p_str)
    {
        int len = p_str.length();
        for (int i=0; i<len; i++)
        {
            for (int j=i+1; j<len; j++)
            {
                if (p_str.charAt(j)==p_str.charAt(i))
                {
                    p_str = ((new StringBuilder(p_str)).deleteCharAt(j)).toString();
                    j--;
                    len--;
                }
            }
        }
        return p_str;
    }
    public static void main(String[] args)
    {
        String str = "abbbecettt";
        RemoveDuplicate rdObj = new RemoveDuplicate();
        String resString = rdObj.removeDuplicate(str);
        System.out.println(resString);
    }

}

http://stackoverflow.com/a/8528764/2041023

 

When you pass an object variable to a method in Java, the reference is passed by value. If you assign a new object to the parameter inside the method, you only overwrite the method's local copy of the reference; the calling code still holds the original reference and sees no change.

However, if you do the following then the value will be updated:

public class StringRef
{
    public String someString;
}

static void f(StringRef s)
{
    s.someString = "x";
}

public static void main(String[] args)
{
    StringRef ref = new StringRef();
    ref.someString = "original";
    f(ref);
    // ref.someString will be "x" here.
}

http://stackoverflow.com/a/40523/2041023

Monday, June 10, 2013

64-bit DLLs go to System32 and 32-bit DLLs to SysWoW64 on 64-bit Windows

 

 

http://stackoverflow.com/a/950011/2461653

 

This article explains a bit:

"Windows x64 has a directory System32 that contains 64-bit DLLs (sic!). Thus native processes with a bitness of 64 find “their” DLLs where they expect them: in the System32 folder. A second directory, SysWOW64, contains the 32-bit DLLs. The file system redirector does the magic of hiding the real System32 directory for 32-bit processes and showing SysWOW64 under the name of System32."

Edit: If you're talking about an installer, you should really not hard-code the path to the system folder. Instead, let Windows take care of it for you based on whether or not your installer is running on the emulation layer.

Thursday, June 6, 2013

x86, amd64, ia64, EM64T

 

http://blogs.msdn.com/b/heaths/archive/2005/02/17/x86-and-ia64-and-x64-oh-my.aspx

So what is difference between x86, AMD64, IA64, and x64?

- x86 is what most everyone is running now - 32-bit processes on 32-bit Windows.

- AMD64 is Advanced Micro Devices, Inc.'s answer to 64-bit computing that runs 32-bit code natively as well. This means that you can install 32-bit Windows on an AMD64 machine. These machines have already begun shipping with 32-bit Windows XP and a friend of mine in MN is already running one happily.

- IA64 - or Intel Itanium - processors run 64-bit natively and offer 32-bit emulation, but you cannot install 32-bit Windows on it. You need to run Windows Server 2003 for 64-bit Itanium-based Systems.

- Intel has also introduced EM64T - or Extended Memory 64 Technology - for Intel Xeon processors. This processor also supports running 32-bit processes natively like the AMD64.

Sunday, June 2, 2013

Virtual Box: share a folder in Ubuntu (guest) with Windows 7 (host)

 

http://www.virtualbox.org/manual/ch04.html#sharedfolders

http://www.giannistsakiris.com/index.php/2008/04/09/virtualbox-access-windows-host-shared-folders-from-ubuntu-guest/

With Guest Additions installed you may now go ahead and define the shared folder(s). From the VirtualBox's menu go to Devices → Shared Folders. A dialog will show up. In this dialog you can specify which folder from your Windows system you want to share with your Ubuntu. Press the button with the + symbol to add a new shared folder in the list. You will have to specify a Folder Name for each folder you add. Make sure you memorize that name because you will need it very soon.

Manual

When done with your shared folder(s) specification, you may now go ahead and actually mount these folders from Ubuntu. First you have to create a mountpoint, that is, a directory in your Ubuntu which will reflect the shared folder from Windows:

# sudo mkdir /media/windows-share

Of course you may choose an alternative path for your mountpoint. With your mountpoint created you can now mount the shared folder, like this:

# sudo mount -t vboxsf folder-name /media/windows-share

Where folder-name will be the name you assigned for this folder when you were adding it in the shared folders list.

Automatic Mounting

http://www.virtualbox.org/manual/ch04.html#sf_mount_auto

With Linux guests, auto-mounted shared folders are mounted into the /media directory, along with the prefix sf_. For example, the shared folder myfiles would be mounted to /media/sf_myfiles on Linux and /mnt/sf_myfiles on Solaris.

The guest property /VirtualBox/GuestAdd/SharedFolders/MountPrefix determines the prefix that is used. Change that guest property to a value other than "sf" to change that prefix; see the section called “Guest properties” for details.

Access to auto-mounted shared folders is only granted to the user group vboxsf, which is created by the VirtualBox Guest Additions installer. Hence guest users have to be member of that group to have read/write access or to have read-only access in case the folder is not mapped writable. Use:

sudo adduser <username> vboxsf

For convenience

sudo ln -s /media/sf_VbShared ~/

VirtualBox: install Guest Additions, VBoxGuestAdditions.iso

 

Absolutely recommend installing this add-on:

- mouse middle wheel movement

- shared folders

- many more

 

If you installed the VM from an .iso file, e.g. Ubuntu's .iso from the official website http://www.ubuntu.com/download/desktop , you will need to

- go to the file system and eject the virtual drive corresponding to this .iso

- then, on your VM's menu, Devices –> Install Guest Additions; the Guest Additions .iso should then show up in your file system as your new virtual drive

=> it is as if only one of the Ubuntu installation .iso and the VBoxGuestAdditions.iso can be mounted as the virtual drive at a time.

You should then see the Guest Additions start installing and showing progress in a terminal.

 

http://askubuntu.com/questions/150491/cant-mount-or-unmount-virtual-discs-gives-option-for-force-unmounting

Tuesday, May 14, 2013

opengl: glutMainLoopEvent() only loops once

 

http://www.gamedev.net/topic/582980-freeglut-glutmainloopevent-wont-return/

while(1) glutMainLoopEvent();

seems to run only once.

Reason: glutMainLoopEvent() seems to keep looping only if there is a GL drawing call issued in the same loop with it. E.g. the following code will appear to loop only once, instead of 100 times as specified:
int main(int argc, char **argv) {

    // init GLUT and create Window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200,0);
    glutInitWindowSize(600,600);
    glutCreateWindow("Particle Simulator");
    glutDisplayFunc(display);
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j=0; j < 100; j++)
    {
        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }

    return 1;
}
but the following will behave as expected (display() issues many GL draw calls, hence the loop appears to work):
int main(int argc, char **argv) {

    // init GLUT and create Window
    addPoints();
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH | GLUT_DOUBLE | GLUT_RGBA);
    glutInitWindowPosition(200,0);
    glutInitWindowSize(600,600);
    glutCreateWindow("Particle Simulator");
    glutIdleFunc(animation);
    glutKeyboardFunc(keyboard);
    glutSpecialFunc(keyboardSpecial);

    for (int j=0; j < 100; j++)
    {
        // display to the screen
        display();

        // enter GLUT event processing cycle
        glutMainLoopEvent();
    }

    return 1;
}


 

Monday, May 13, 2013

CUDA: Debugger will ‘freeze’ when block size is significantly larger than the data size

 

The program still works, but the debugger looks ‘frozen’, potentially just because there are many ‘empty’ threads it is still trying to loop through?

CUDA: debugger doesn’t stop at break point

 

Try rebuilding your solution first if you haven’t already done so.

Friday, May 3, 2013

CUDA: Registers, Shared Memory and Occupancy 1

 

Par Lab Boot Camp @ UC Berkeley 2010

1. Registers:

Each Cuda Thread has private access to a configurable number of registers

– The 128 KB (64 KB) SM register file is partitioned among all resident threads

– The Cuda program can trade degree of thread block concurrency for amount of per-thread state

– Registers, stack spill into (cached, on Fermi) “local” DRAM if necessary

 

http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable

 

http://stackoverflow.com/questions/12207533/increasing-per-thread-register-usage-in-cuda

 

2. Shared Memory

Each Thread Block has private access to a configurable amount of scratchpad memory

– The Fermi SM’s 64 KB SRAM can be configured as 16 KB L1 cache + 48 KB scratchpad, or vice versa*

– Pre-Fermi SMs have 16 KB scratchpad only

– The available scratchpad space is partitioned among resident thread blocks, providing another concurrency-state tradeoff

http://stackoverflow.com/questions/11274853/is-cuda-shared-memory-also-cached

 

Section G.4.1 states:

"The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory with 16 KB of L1 cache (default setting)"

Configure it using cudaFuncSetCacheConfig(); a minimal usage sketch follows the links below.

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__HIGHLEVEL_ge0969184de8a5c2d809aa8d7d2425484.html

https://devtalk.nvidia.com/default/topic/469086/how-to-use-cudafuncsetcacheconfig-correctly-one-of-the-most-anticipating-features-does-not-seem-/
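A minimal host-side sketch of how the call is typically used (myKernel here is just a placeholder for your own __global__ function):

#include <cuda_runtime.h>

__global__ void myKernel() { /* ... */ }   // placeholder kernel

int main()
{
    // prefer 48 KB shared memory / 16 KB L1 for this kernel only
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // or set a device-wide default for all kernels instead
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}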

Thursday, May 2, 2013

CUDA: most crucial thing for optimization

 

http://www.youtube.com/watch?v=hG1P8k4xqR0

0:19:06

if threads in a warp access aligned, contiguous blocks of DRAM, the accesses will be coalesced into a single high bandwidth access
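A tiny illustration of the point (my own sketch, not from the talk; copyCoalesced/copyStrided are made-up names): in the first kernel consecutive threads of a warp read consecutive words, so the warp’s loads coalesce into a single transaction; in the second they stride through memory and cannot be coalesced.

// Coalesced: thread i reads element i, so a warp touches one contiguous,
// aligned segment of DRAM.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read elements 'stride' apart, so the warp's
// accesses are scattered over many segments and each needs its own transaction.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}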

Wednesday, May 1, 2013

CUDA: Avoid conditional statement in kernel

 

http://stackoverflow.com/a/1645126/2041023

From section 6.1 of the CUDA Best Practices Guide:

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

http://stackoverflow.com/a/13397496/2041023

The thread warp is a hardware group of threads that execute on the same Streaming Multiprocessor (SM). Threads of a warp can be compared to sharing a common program counter between the threads, hence all threads must execute the same line of program code. If the code has some branching statements such as if ... then ... else the warp must first execute the threads that enter the first block, while the other threads of the warp wait, next the threads that enter the next block will execute while the other threads wait and so on. Because of this behaviour conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if ... then ... else block and mitigate this limitation.

The while and for statements are branching statements, so it is not limited to if.
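A toy sketch of the effect (my own example, not from the quoted answers): both kernels branch, but only the first actually diverges, because in the second every thread of a warp evaluates the condition the same way.

// Divergent: within every warp, half the threads take each branch,
// so the two paths are serialized.
__global__ void divergent(float *d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) d[i] *= 2.0f;
    else                      d[i] += 1.0f;
}

// Warp-uniform: the condition is constant across each warp of 32 threads,
// so no warp diverges even though the code still contains an if/else.
__global__ void warpUniform(float *d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) d[i] *= 2.0f;
    else                             d[i] += 1.0f;
}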

Monday, April 29, 2013

CUDA: cudaStreamQuery(0), you son of a dirty little trick

 

If you were like me, you’d wonder how you missed this one.

cudaStreamQuery(0) will force the driver to flush its CPU-side buffer and send the queued work to the GPU.

Therefore, in cases where you want to fire up many kernels in sequence (e.g. because you don’t want to blow up the GPU memory), send them off (and make sure they really are sent off), and then do some CPU work before you call e.g. cudaDeviceSynchronize() again, use it! (A rough usage sketch follows the link below.)

 

http://nvidia.fullviewmedia.com/gtc2013/0321-230C-S3382.html

13:55
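A rough sketch of the pattern (myKernel, numKernels, doSomeCpuWork, grid, block and devData are placeholders):

// queue a batch of kernels on the default stream
for (int i = 0; i < numKernels; ++i)
    myKernel<<<grid, block>>>(devData, i);

// force the driver to flush its CPU-side buffer so the queued work is
// actually submitted to the GPU now, without blocking the CPU
cudaStreamQuery(0);

doSomeCpuWork();           // overlap CPU work with the running GPU work

cudaDeviceSynchronize();   // block only when the results are really needed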

Saturday, April 27, 2013

c++: convert to std::string

 

http://stackoverflow.com/a/332126

 

#include <sstream>
#include <string>

template <typename T>
std::string to_string(T const& value) {
    std::stringstream sstr;
    sstr << value;
    return sstr.str();
}

c++: header include order, std::string and <string>

 

A friend of mine wrote these 3 files

Singleton.h:

#pragma once
#include <fstream>
#include <string>

class Singleton
{
private:
    std::ofstream m_stream;
    Singleton(void);
    static bool instance;
    static Singleton* s;
public:
    ~Singleton(void);
    static Singleton* getInstance();
    void write(std::string c);
    void close();
};

Singleton.cpp

#pragma once

#include "Singleton.h"
#include "Constants.h"

Singleton* Singleton::s = NULL;
bool Singleton::instance = false;

Singleton::Singleton(void)
{
    const char* cs = LOG_ALL_FILE_PATH.c_str();
    m_stream.open(cs);
}

Singleton::~Singleton(void)
{
    m_stream.close();
}

Singleton* Singleton::getInstance() {
    if (!instance){
        s = new Singleton();
        instance = true;
    }
    return s;
}

void Singleton::write(std::string logline){
    m_stream << logline << std::endl;
}

void Singleton::close(){
    if (instance)
        m_stream.close();
}

Constants.h

/** Log file path */
const std::string LOG_ALL_FILE_PATH = "file.log";

If we change

#include "Singleton.h"
#include "Constants.h"

to

#include "Constants.h"
#include "Singleton.h"

it will not compile.

Which leads me to believe that std::string cannot appear before the first #include <string>. Concretely: Constants.h uses std::string but never includes <string> itself, so it only compiles when a header that does include <string> (here Singleton.h) has already been included before it.
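The simplest fix (my suggestion, not from the original post) is to make Constants.h self-contained, so the include order no longer matters:

// Constants.h
#pragma once
#include <string>   // make the header self-contained

/** Log file path */
const std::string LOG_ALL_FILE_PATH = "file.log";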

c/c++: <strings.h> <string.h> <string> <cstring> recap

 

<string.h> vs. <string>

string.h contains the old C functions like strcpy and strlen;

string primarily contains the std::string class.

It should also be noted that using string.h is deprecated within C++. If you need the functionality contained within, you should use the header cstring. This more or less completely bypasses the issue of "What's the difference between these two" because it's very obvious that one is from the C library. – Mike Bantegui

<cstring> vs. <string>

http://www.cplusplus.com/forum/general/38801/

“<cstring> is basically a header containing a set of functions for dealing with C-style strings (char*). <string>, on the other hand, is a header that allows you to use C++-style strings (std::string), which can do a lot of, if not all of, the functions provided in <cstring> on their own.” - Albatross

“Use string. cstring is so 1970's. string is a modern way to represent strings in C++. You'll need to learn cstring because you will run into code that uses it. But cstring is responsible for lots of unsafe code.” - PanGalactic

http://stackoverflow.com/a/12824665

<strings.h> vs. <string.h>

http://stackoverflow.com/a/4291176

Typically <strings.h> just adds some useful but non-standard additional string functions to the standard header <string.h>. For maximum portability you should only use <string.h> but if you need the functions in <strings.h> more than you need portability then you can use <strings.h> instead of <string.h>.

tinyxml: how to retrieve the value in <![CDATA[]]>

 

http://www.codelast.com/?p=883

 

<?xml version="1.0" ?>

<config>

    <property>

        <![CDATA[if(a<b){return 0;}]]>

    </property>

</config>

-------------------------------

TiXmlDocument doc;

doc.LoadFile("/root/test.xml");

TiXmlElement* pRoot = doc.RootElement();

TiXmlElement* pProperty = pRoot->FirstChildElement();

std::string strCDATA = pProperty->FirstChild()->Value();    // strCDATA now contains if(a<b){return 0;}

c++/c: How to parse a scientific numeric string

 

Very crude method:

float eMod_d1;   
sscanf("2e+020","%e",&eMod_d1);

float eMod_d2;   
sscanf("20","%e",&eMod_d2);

float eMod_d3;   
sscanf("0.01","%e",&eMod_d3);

Friday, April 26, 2013

C++: Good old for_each again

 

http://en.cppreference.com/w/cpp/algorithm/for_each

Example

The following example uses a lambda function to increment all of the elements of a vector and then computes a sum of them:


#include <vector>
#include <algorithm>
#include <iostream>

struct Sum {
    Sum() { sum = 0; }
    void operator()(int n) { sum += n; }
    int sum;
};

int main()
{
    std::vector<int> nums{3, 4, 2, 9, 15, 267};

    std::cout << "before: ";
    for (auto n : nums) {
        std::cout << n << " ";
    }
    std::cout << '\n';

    std::for_each(nums.begin(), nums.end(), [](int &n){ n++; });

    Sum s = std::for_each(nums.begin(), nums.end(), Sum());

    std::cout << "after: ";
    for (auto n : nums) {
        std::cout << n << " ";
    }
    std::cout << '\n';
    std::cout << "sum: " << s.sum << '\n';
}

Wednesday, April 24, 2013

CUDA: workflow recap

 

http://stackoverflow.com/a/10467342/2041023

Two of the best references are

  1. NVIDIA Fermi Compute Architecture Whitepaper
  2. GF104 Reviews

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.

D'. There is a mapping between laneid (threads index in a warp) and a core.

E'. If a warp contains less than 32 threads it will in most cases be executed the same as if it has 32 threads. Warps can have less than 32 active threads for several reasons: number of threads per block is not divisible by 32, the program execute a divergent block so threads that did not take the current path are marked inactive, or a thread in the warp exited.

F'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize. There is no requirement for the warp schedulers to select two warps from the same thread block.

G'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched, the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 warps will be assigned to different SMs. If 2 warps are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions

In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.

3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time or 8 SM * 48 warps = 512 warps if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do a special floating point operation the SFU units will idle.

4'. Parallel Nsight and the Visual Profiler show: a. executed IPC, b. issued IPC, c. active warps per active cycle, d. eligible warps per active cycle (Nsight only), e. warp stall reasons (Nsight only), f. active threads per instruction executed. The profilers do not show the utilization percentage of any of the execution units. For GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC, assume 2 for GF100 (GTX480) and 4 for GF10x (GTX560), but 3 is a better target.

CUDA: constants, registers, local arrays

 

Constants

A constant variable has its value set at run-time. But code also often has plain constants whose value is known at compile-time:

#define PI 3.1415926f
a = b / (2.0f * PI);

Leave these as they are – they seem to be embedded into the executable code so they don’t use up any registers.

 

Registers

Within each kernel, by default, individual variables are assigned to registers:

__global__ void lap(int I, int J, float *u1, float *u2) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    int id = i + j*I;
    if (i==0 || i==I-1 || j==0 || j==J-1) {
        u2[id] = u1[id]; // Dirichlet b.c.’s
    }
    else {
        u2[id] = 0.25f * ( u1[id-1] + u1[id+1]
                         + u1[id-I] + u1[id+I] );
    }
}

 

- 32K 32-bit registers per SM
- up to 63 registers per thread
- up to 1536 threads (at most 1024 per thread block)
- max registers per thread ⇒ 520 threads
- max threads ⇒ 21 registers per thread
- not much difference between “fat” and “thin” threads

What happens if your application needs more registers? They “spill” over into L1 cache, and from there to device memory; the application then suffers from the latency and bandwidth implications of using device memory.

Avoiding register spill is now one of my main concerns in big applications, but remember:

- with 1024 threads, the 400-600 cycle latency of device memory is usually OK because some warps can do useful work while others wait for data

- provided there are 20 flops per variable read from (or written to) device memory, the bandwidth is not a limiting issue
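One way to experiment with this trade-off (a sketch; the kernel and the numbers are arbitrary): cap register usage with __launch_bounds__, or globally with the --maxrregcount compiler option, and check whether the occupancy gain outweighs the extra spills.

// ask the compiler to keep register usage low enough that 256-thread blocks
// can run with at least 4 resident blocks per SM (numbers are arbitrary)
__global__ void __launch_bounds__(256, 4) scaleKernel(float *u)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    u[tid] = 2.0f * u[tid];   // placeholder body
}

// alternatively, compile with: nvcc --maxrregcount=32 ...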

Local arrays

What happens if your application uses a little array?

__global__ void lap(float *u) {
    float ut[3];
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    for (int k=0; k<3; k++)
        ut[k] = u[tid+k*gridDim.x*blockDim.x];
    for (int k=0; k<3; k++)
        u[tid+k*gridDim.x*blockDim.x] =
            A[3*k]*ut[0] + A[3*k+1]*ut[1] + A[3*k+2]*ut[2];
}

In simple cases like this (quite common) the compiler converts the array to scalar registers:


__global__ void lap(float *u) {
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    float ut0 = u[tid+0*gridDim.x*blockDim.x];
    float ut1 = u[tid+1*gridDim.x*blockDim.x];
    float ut2 = u[tid+2*gridDim.x*blockDim.x];
    u[tid+0*gridDim.x*blockDim.x] = A[0]*ut0 + A[1]*ut1 + A[2]*ut2;
    u[tid+1*gridDim.x*blockDim.x] = A[3]*ut0 + A[4]*ut1 + A[5]*ut2;
    u[tid+2*gridDim.x*blockDim.x] = A[6]*ut0 + A[7]*ut1 + A[8]*ut2;
}

In more complicated cases, it puts the array into device memory:

- still referred to in the documentation as a “local array” because each thread has its own private copy
- held in L1 cache by default, may never be transferred to device memory
- 16 kB of L1 cache equates to 4096 32-bit variables, which is only 8 per thread when using 1024 threads
- beyond this, it will have to spill to device memory

CUDA: More about shared Memory

 

 

http://stackoverflow.com/a/7041976/2041023

On Fermi, you can use up to 16kb or 48kb (depending on the configuration you select) of shared memory per block - the number of blocks which will run concurrently on a multiprocessor is determined by how much shared memory and registers each block requires, up to a maximum of 8. If you use 48kb, then only a single block can run concurrently. If you use 1kb per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.

 

http://stackoverflow.com/a/11507311/2041023

Yes, blocks on the same multiprocessor share the same amount of shared memory, which is 48 KB per multiprocessor for your GPU card (compute capability 2.0). So if you have N blocks on the same multiprocessor, the maximum size of shared memory per block is (48/N) KB.

 

https://developer.nvidia.com/content/using-shared-memory-cuda-cc

SHARED MEMORY BANK CONFLICTS

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.

Devices of compute capability 3.x have configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.
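A classic illustration of avoiding bank conflicts (my own sketch, not from the article): pad a shared-memory tile by one column so that a warp reading down a column hits 32 different banks instead of the same one. Assumes a square matrix whose width is a multiple of TILE and a (TILE, TILE) thread block.

#define TILE 32

// Without the "+ 1" padding, s[threadIdx.x][y] for a whole warp maps to a
// single bank (32 floats per row == number of banks), i.e. a 32-way conflict.
// The extra column shifts each row by one bank.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float s[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    s[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed block offset
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = s[threadIdx.x][threadIdx.y];
}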

 

CONFIGURING THE AMOUNT OF SHARED MEMORY

On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings, 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. This can be configured during runtime API from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache which can be obtained using the option cudaFuncCachePreferEqual.

Tuesday, April 23, 2013

C/C++ data types – when we were young

 

http://www.cplusplus.com/doc/tutorial/variables/

char
Character or small integer.
1byte
signed: -128 to 127
unsigned: 0 to 255

short int(short)
Short Integer.
2bytes
signed: -32768 to 32767
unsigned: 0 to 65535

int
Integer.
4bytes
signed: -2147483648 to 2147483647
unsigned: 0 to 4294967295

long int (long)
Long integer.
4bytes
signed: -2147483648 to 2147483647
unsigned: 0 to 4294967295

bool
Boolean value. It can take one of two values: true or false.
1byte
true or false

float
Floating point number.
4bytes
+/- 3.4e +/- 38 (~7 digits)

double
Double precision floating point number.
8bytes
+/- 1.7e +/- 308 (~15 digits)

long double
Long double precision floating point number.
8bytes
+/- 1.7e +/- 308 (~15 digits)

wchar_t
Wide character.
2 or 4 bytes
1 wide character

 

* The values of the columns Size and Range depend on the system the program is compiled for. The values shown above are those found on most 32-bit systems. But for other systems, the general specification is that int has the natural size suggested by the system architecture (one "word") and the four integer types char, short, int and long must each one be at least as large as the one preceding it, with char being always one byte in size. The same applies to the floating point types float, double and long double, where each one must provide at least as much precision as the preceding one.
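A quick way to check the actual sizes on the system you are compiling for (a trivial sketch):

#include <iostream>

int main()
{
    std::cout << "char:        " << sizeof(char)        << '\n'
              << "short:       " << sizeof(short)       << '\n'
              << "int:         " << sizeof(int)         << '\n'
              << "long:        " << sizeof(long)        << '\n'
              << "bool:        " << sizeof(bool)        << '\n'
              << "float:       " << sizeof(float)       << '\n'
              << "double:      " << sizeof(double)      << '\n'
              << "long double: " << sizeof(long double) << '\n'
              << "wchar_t:     " << sizeof(wchar_t)     << '\n';
}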

CUDA: Dynamic Shared Memory Kernel

 

https://developer.nvidia.com/content/using-shared-memory-cuda-cc

In this case the shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter, as in the following excerpt.

dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

The dynamic shared memory kernel, dynamicReverse(), declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and use of the extern specifier). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.

What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before, and use pointers into it to divide it into multiple arrays, as in the following excerpt.

extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = &integerData[nI]; // nF floats
char *charData = &floatData[nF]; // nC chars

In the kernel launch, specify the total shared memory needed, as in the following.

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);
http://stackoverflow.com/a/5531640/2041023
“Also be aware when using pointers that shared memory uses 32 bit words, and all allocations must be 32 bit word aligned, irrespective of the type of the shared memory allocation.”
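The dynamicReverse() kernel itself is not reproduced above; a minimal sketch of what such a kernel looks like, reconstructed from the description (reverse n ints in place using dynamically sized shared memory):

__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];     // size set by the 3rd launch parameter
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}

// launched as: dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);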

Saturday, April 20, 2013

Opengl/Glut: Glut keyboard codes/keys for glutSpecialFunc() only

 

http://www.lighthouse3d.com/tutorials/glut-tutorial/keyboard/

GLUT_KEY_F1		F1 function key
GLUT_KEY_F2 F2 function key
GLUT_KEY_F3 F3 function key
GLUT_KEY_F4 F4 function key
GLUT_KEY_F5 F5 function key
GLUT_KEY_F6 F6 function key
GLUT_KEY_F7 F7 function key
GLUT_KEY_F8 F8 function key
GLUT_KEY_F9 F9 function key
GLUT_KEY_F10 F10 function key
GLUT_KEY_F11 F11 function key
GLUT_KEY_F12 F12 function key
GLUT_KEY_LEFT Left function key
GLUT_KEY_RIGHT Right function key
GLUT_KEY_UP Up function key
GLUT_KEY_DOWN Down function key
GLUT_KEY_PAGE_UP Page Up function key
GLUT_KEY_PAGE_DOWN Page Down function key
GLUT_KEY_HOME Home function key
GLUT_KEY_END End function key
GLUT_KEY_INSERT Insert function key

c++, java: hash (x, y, z)

 

c++:

http://stackoverflow.com/a/13389744

Here's a baseline hash function:

unsigned long long h = (n << 24) | (a << 16) | (b << 8) | c;
return std::hash<unsigned long long>()(h);

I.e., just pack the members into an unsigned long long, then offload the work to std::hash. In the common case that int is 32 bits wide and long long is 64 bits, and assuming your chars are not negative, this uses all the information in your objects for the hash.


http://stackoverflow.com/a/1820504

unsigned int hash = in[0];
hash *= 37;
hash += in[1];
hash *= 37;
hash += in[2];
hash *= 37;
hash += in[3];

java:


http://stackoverflow.com/a/5730232


public int hashCode()
{
    int result = (int) (x ^ (x >>> 32));
    result = 31 * result + (int) (y ^ (y >>> 32));
    result = 31 * result + (int) (z ^ (z >>> 32));
    return result;
}

 



http://stackoverflow.com/a/5730337

(int)(x ^ (x >> 32) ^ y ^ (y >> 32) ^ z ^ (z >> 32));

Wednesday, April 17, 2013

CUDA: Impact of not having the correct Compute_xx and SM_xx value – Debug Error!

 

[image]

cudaErrorInvalidDeviceFunction

[image]

 

First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: cudaError at memory location 0x00564d84..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x00564f0c..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: cudaError at memory location 0x00564044..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..
First-chance exception at 0x7664c41f in ParticleSystemJames.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x005641cc..

Solution:

- Check your GPU:

[image]

- Check NVIDIA’s table of GPU – CUDA compute capability (Code Generation) versions

https://developer.nvidia.com/cuda-gpus

- In MSVS 2010, go to Project Properties –> CUDA C/C++ –> Device –> Code Generation

change Compute_xx and SM_xx to the number in that table without the decimal point

[image]

NOTE: it is possible that the exact version is not supported by your version of Nsight / the CUDA toolkit, in which case go back one version number and try again.
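For reference, the equivalent nvcc flag looks like the following (example values for a compute-capability 2.1 card; substitute the number from the table, and note there is no compute_21 virtual architecture, so the arch part stays at compute_20):

nvcc -gencode arch=compute_20,code=sm_21 kernel.cu -o app

In the MSVS Code Generation field the same pair is typically written as compute_20,sm_21.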