The examples used in this chapter are based on examples in Programming Massively Parallel Processors: A Hands-on Approach, written by David B. Kirk and Wen-mei W. Hwu, and published by Morgan Kaufmann publishers.

Copyright 2010 David B. Kirk/NVIDIA Corporation and Wen-mei Hwu. Published by Elsevier Inc. All rights reserved.

This copy of code is a derivative based on the original code and designed for educational purposes only. It contains source code provided by the book above.

In this chapter, we will learn more about GPU computing on multi-dimensional problems and really experience the advantage of GPU computing over CPU computing. Before we proceed to the next section, this section introduces a matrix multiplication program that is written in standard C and uses only the CPU for computation. We hope the result obtained here can serve as a baseline for the following sections and provide a clearer view of how much faster GPU computing can be.

CPU Matrix Multiplication Program source file:
`MM-CPU.c`

- We conducted 5 tests and the results are below.
- 41045.35 ms
- 40277.44 ms
- 40369.30 ms
- 40385.85 ms
- 40446.14 ms
- Average: 40504.82 ms
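The chapter does not reproduce `MM-CPU.c`, so here is a minimal sketch of what the baseline's core computation might look like, assuming row-major *Width x Width* matrices. The function name and structure are illustrative, not the book's actual code, and the timing logic is omitted.

```c
#include <stdlib.h>

/* Illustrative CPU baseline: a plain triple loop over row-major
   Width x Width matrices. P = M * N. */
void MatrixMultiplicationCPU(const float *M, const float *N,
                             float *P, int Width) {
    for (int y = 0; y < Width; y++) {
        for (int x = 0; x < Width; x++) {
            float sum = 0;
            // dot product of row y of M with column x of N
            for (int k = 0; k < Width; k++)
                sum += M[y * Width + k] * N[k * Width + x];
            P[y * Width + x] = sum;
        }
    }
}
```

With Width = 1024 this inner loop runs over a billion multiply-adds on a single core, which is why the baseline takes around 40 seconds.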

Starting from this example, we will look at how to solve problems in a two-dimensional domain using a two-dimensional grid and blocks. As we know, threads can be organized into multi-dimensional blocks, and blocks can in turn be organized into a multi-dimensional grid. This feature of the CUDA architecture enables us to create a two-dimensional or even three-dimensional thread hierarchy, so that solving two- or three-dimensional problems becomes easier and more efficient.

In this example, we will perform square matrix multiplication. The two input matrices, M and N, have size *Width x Width*, and the output matrix P has the same size. If you have learned linear algebra before, you will know that the product of two square matrices is a square matrix of the same size. For example, to calculate entry (A, B) of the output matrix, we use row A of one input matrix and column B of the other. We first take the leftmost element of row A and multiply it by the top element of column B. Next, we take the second element of row A and multiply it by the second element of column B. We do this for all the elements of row A and column B, and then take the sum of the products. The result is the value at entry (A, B) of the output matrix. As you can see, this kind of operation is highly parallel, making it a perfect fit for CUDA. We exploit this by assigning each entry of the output matrix a thread of its own. That thread fetches the data, does all the calculations, and then writes the result back to the output matrix.

Matrix Multiplication with Global Memory source file:
`MM-GPU-GM.cu`

```
__global__ void Kernel(float *Md, float *Nd, float *Pd, int Width) {
    // Calculate the column index of the Pd element, denoted by x
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    // Calculate the row index of the Pd element, denoted by y
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    float Pvalue = 0;
    // each thread computes one element of the output matrix Pd
    for (int k = 0; k < Width; ++k) {
        Pvalue += Md[y*Width + k] * Nd[k*Width + x];
    }
    // write the result back to global memory
    Pd[y*Width + x] = Pvalue;
}
```

This is the complete device code.

```
// Calculate the column index of the Pd element, denoted by x
int x = threadIdx.x + blockIdx.x * blockDim.x;
// Calculate the row index of the Pd element, denoted by y
int y = threadIdx.y + blockIdx.y * blockDim.y;
```

These 4 lines of code assign indices to each thread so that it matches up with one entry of the output matrix. As you may notice, we introduced a new CUDA built-in variable, **blockDim**, in this code. **blockDim** has the variable type dim3, which is a 3-component integer vector type used to specify dimensions. This variable contains the dimensions of the block, and we can access its components as blockDim.x, blockDim.y, and blockDim.z.

Each thread within a specific block is identified by threadIdx.x and threadIdx.y. Each block within a specific grid is identified by blockIdx.x and blockIdx.y. Therefore, given threadIdx.x, threadIdx.y, blockIdx.x, and blockIdx.y, we can locate one specific thread.
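To make the index arithmetic concrete, the same computation can be checked on the host. The helper below is purely illustrative (not part of the book's code): with blockDim.x = 32, the thread with threadIdx.x = 3 in block blockIdx.x = 1 gets global column index 3 + 1 * 32 = 35.

```c
/* Host-side mirror of the device expression
   x = threadIdx.x + blockIdx.x * blockDim.x
   (illustrative helper; names are hypothetical). */
int global_index(int threadIdx_x, int blockIdx_x, int blockDim_x) {
    return threadIdx_x + blockIdx_x * blockDim_x;
}
```

The same formula applies independently in the y dimension, which is why two such lines are enough to place each thread on a unique matrix entry.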

```
dim3 dimBlock(32, 32);
dim3 dimGrid(Width/32, Width/32);
Kernel<<<dimGrid, dimBlock>>>( Md, Nd, Pd, Width);
```

The 3 lines of code above declare and initialize the dim3 variables that give the grid dimensions and block dimensions. In each initialization, we passed only two parameters as components. The CUDA runtime initializes any component left unspecified to 1, so technically we are initializing dimBlock as (32, 32, 1) and dimGrid as (Width/32, Width/32, 1).
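Note that dimGrid(Width/32, Width/32) assumes Width is an exact multiple of 32, which holds for our Width of 1024. For an arbitrary Width, a common idiom is to round the grid size up (the kernel would then also need a bounds check so extra threads do nothing). The host-side helper below is an illustrative sketch of that rounding, not part of the book's code.

```c
/* Round up: smallest number of blocks of size blockSize that
   covers Width elements (illustrative helper). */
int grid_dim(int Width, int blockSize) {
    return (Width + blockSize - 1) / blockSize;
}
```

For Width = 1024 this still yields 32 blocks per dimension, so it is a drop-in generalization of Width/32.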

The rest of the host code is similar to examples we have seen before. Here is the complete version of the host code.

```
int main(void) {
    void MatrixMultiplication(float *, float *, float *, int);
    const int Width = 1024;
    int size = Width * Width * sizeof(float);
    float *M, *N, *P;
    // allocate memory on the CPU
    M = (float*)malloc(size);
    N = (float*)malloc(size);
    P = (float*)malloc(size);
    // initialize the matrices
    for (int y = 0; y < Width; y++) {
        for (int x = 0; x < Width; x++) {
            M[y*Width + x] = x + y*Width;
            N[y*Width + x] = x + y*Width;
        }
    }
    MatrixMultiplication(M, N, P, Width);
    // free the memory allocated on the CPU
    free(M);
    free(N);
    free(P);
    return 0;
}
void MatrixMultiplication(float *M, float *N, float *P, int Width) {
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    // capture start time
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    // allocate memory on the GPU
    HANDLE_ERROR( cudaMalloc((void**)&Md, size) );
    HANDLE_ERROR( cudaMalloc((void**)&Nd, size) );
    HANDLE_ERROR( cudaMalloc((void**)&Pd, size) );
    // transfer M and N to device memory
    HANDLE_ERROR( cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice) );
    HANDLE_ERROR( cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice) );
    // kernel invocation code
    dim3 dimBlock(32, 32);
    dim3 dimGrid(Width/32, Width/32);
    Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // transfer P from device
    HANDLE_ERROR( cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost) );
    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );
    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree(Md) );
    HANDLE_ERROR( cudaFree(Nd) );
    HANDLE_ERROR( cudaFree(Pd) );
    // destroy events to free memory
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
}
```

At the very top of the source file you can define the size of the matrices: just change the Width definition to a number of your choice. While testing performance, we used 1024 as Width, the same number used in the CPU baseline program. We conducted 5 tests and the results are below.

- 52.5 ms
- 52.4 ms
- 52.4 ms
- 52.4 ms
- 52.6 ms
- Average: 52.46 ms

Compared to the CPU program, our GPU program is **772** times faster.

The reason CUDA architecture has many memory types is to increase the memory accessing speed so that data transfer speed can match data processing speed. The following example will show you why matching these two speeds is so important to GPU computation.

One of the most important measures of a processor's computational ability is its floating-point throughput (FLOPS). We assume that in order to perform one floating-point operation, the runtime needs to transfer one single-precision floating-point datum from global memory to the computational kernel. With this in mind, we can proceed to our example.

The NVIDIA Tesla C2075 companion processor supports 144 gigabytes per second (GB/s) of global memory access bandwidth. With 4 bytes in each single-precision floating-point datum, we can load no more than 36 (144/4) giga single-precision data per second. Since the computational kernel cannot process more floating-point data than global memory has loaded, it will execute no more than 36 gigaflops. The actual computational capability of our Tesla card is 1 teraflops (1000 gigaflops), but due to the limited memory access speed, we can achieve less than 4 percent of that peak. In other words, the highest achievable floating-point throughput is limited by the rate at which input data can be transferred from global memory to the computational kernel.
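The arithmetic in this argument is simple enough to check with a tiny host-side calculation; the function name below is illustrative, not from the book.

```c
/* Bandwidth-limited throughput: with one load per floating-point
   operation, the memory system caps the kernel at
   (bandwidth in GB/s) / (bytes per datum) giga-operations per second. */
double bandwidth_limited_gflops(double gb_per_second, double bytes_per_datum) {
    return gb_per_second / bytes_per_datum;
}
```

Plugging in 144 GB/s and 4-byte floats gives 36 gigaflops, i.e. 3.6% of the card's 1000-gigaflop peak, matching the "less than 4 percent" figure above.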

To address this problem, the CUDA architecture provides several types of memory that can speed up the data loading process. We will see how to use them in later examples. For now, we first need to know the specifications of the different memory types.

There are in total 4 types of memory designed for GPU cards with the CUDA architecture. Global memory, located in the grid, has a large storage capacity but limited speed, and can be read and written by all blocks within the CUDA system. Shared memory, located in each block, has a small storage capacity (16 KB per block) but fast access, and can be read and written by all threads within its block. Constant memory, also located in the grid, has a very small storage capacity (64 KB per GPU) but very fast access, and can be read (but not written) by any thread. There is also local memory, located in each thread.

Memory | Scope of Access | Lifetime | R/W ability | Speed | Declaration |
---|---|---|---|---|---|
Register | Thread | Kernel | R/W | Fast | Automatic variables |
Local | Thread | Kernel | R/W | Fast | Automatic arrays |
Shared | Block | Kernel | R/W | Fast | `__shared__` |
Global | Grid | Host | R/W | Slow | `__device__` |
Constant | Grid | Host | Read only | Fast | `__constant__` |
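The Declaration column can be illustrated with a short CUDA fragment showing how a variable of each kind is declared. The names here are placeholders for illustration, not part of the book's code.

```
// Global memory: file-scope __device__ variable, lives as long as
// the application (host lifetime)
__device__ float globalArray[256];

// Constant memory: read-only from device code; the host fills it
// with cudaMemcpyToSymbol
__constant__ float constTable[64];

__global__ void ScopesKernel(void) {
    // Register: automatic scalar variable, private to each thread
    float regVal = 0.0f;

    // Local memory: automatic array, private to each thread
    float localArr[8];

    // Shared memory: one copy per block, visible to all its threads
    __shared__ float tile[32];

    // (suppress unused-variable warnings in this illustration)
    (void)regVal; (void)localArr; (void)tile;
}
```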