Vector Add with OpenMP

Computers with multicore processors and a single shared memory space are the norm, including not only laptops and desktops, but also most phones and tablets. Using multiple cores concurrently on these machines can be done in several programming languages; we will demonstrate the use of C with a set of compiler directives and library functions known as OpenMP. The OpenMP standard is built into many C compilers, including gcc on Unix machines.

OpenMP on shared memory multicore machines creates threads that execute concurrently. The creation of these threads is implicit: the compiler generates the thread-management code when you insert special directives, called pragmas, into the C code. The code that begins executing main() is considered thread 0. At certain points in the code, you can designate that more threads should be used in parallel and execute concurrently. This is called forking threads.
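Before looking at the vector addition code, the following minimal sketch (not one of the accompanying example files; the name fork-example is hypothetical) shows the basic fork-join mechanism: the parallel pragma forks a team of threads, each thread executes the block that follows, and the team joins back into thread 0 at the end of the block.

#include <stdio.h>    //printf
#include <omp.h>      //OpenMP

int main (int argc, char *argv[])
{
    // Only thread 0 exists here.
    printf("Before the parallel region: thread %d\n", omp_get_thread_num());

    // Fork: a team of threads is created, and every thread runs this block.
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    // Join: only thread 0 continues past the end of the block.

    printf("After the parallel region: thread %d\n", omp_get_thread_num());
    return 0;
}

This sketch compiles the same way as the main example, e.g. gcc -o fork-example fork-example.c -fopenmp.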

In the code below, you will see this pragma, which implicitly forks the threads to complete the computation on equal chunks of the original array:

#pragma omp parallel for shared(a, b, c) private(i) schedule(static, n_per_thread)

The shared clause indicates that the arrays are shared in the same memory space for all threads, and the private clause indicates that each thread will have its own copy of the loop index i that it will increment.

The schedule clause indicates how many consecutive iterations of the loop, and thus computations on consecutive elements of the arrays, each thread will execute. In data decomposition, we like to call this the chunk size assigned to each thread (not necessarily a universal term, but hopefully it conveys the idea). To mimic our simple 8-element example, the code shown below sets the number of threads to 4 and the chunk size, computed as n_per_thread in the code, to 2.
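Because the schedule is static, the mapping of iterations to threads is deterministic: the 8 iterations are handed out in consecutive chunks of 2, in order of thread number, so the decomposition is:

Thread 0: i = 0, 1
Thread 1: i = 2, 3
Thread 2: i = 4, 5
Thread 3: i = 6, 7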

The syntax of the OpenMP code example below is very similar to the original sequential version. In fact, it was derived from the sequential version by adding this pragma, including the OpenMP header file, omp.h, setting how many threads to use and the chunk size just before the forking, and adding some print statements to illustrate the decomposition and verify the results.
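For comparison, the heart of that sequential version is just the loop below (a sketch, assuming the same arrays a, b, c and length n that appear in the OpenMP code):

// Sequential version of the same computation: one thread computes every element.
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}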

This pragma around for loops is built into OpenMP because this "repeat N times" pattern occurs so frequently in a great deal of code. This simplicity can be deceiving, however: this particular example lends itself well to having the threads share the arrays, but other types of problems are not this simple. This type of data decomposition example is sometimes called embarrassingly parallel, because each thread can read and update data that no other thread should ever touch.
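In fact, the clauses shown above can be left off entirely. Variables declared outside the parallel region are shared by default, the loop index of the parallel for loop is made private automatically, and without a schedule clause the compiler's default schedule divides the iterations among the threads (typically in roughly equal contiguous blocks, though the exact default is implementation defined). A minimal sketch of that simplest form is:

// Simplest form: rely on OpenMP's default data-sharing and scheduling rules.
#pragma omp parallel for
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}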

This code is the file VectorAdd/OpenMP/VA-OMP-simple.c in the compressed tar file of examples that accompanies this reading.

#include <stdlib.h>   //malloc and free
#include <stdio.h>    //printf
#include <omp.h>      //OpenMP

// Very small values for this simple illustrative example
#define ARRAY_SIZE 8     //Size of arrays whose elements will be added together.
#define NUM_THREADS 4    //Number of threads to use for vector addition.

/*
 *  Classic vector addition using OpenMP default data decomposition.
 *
 *  Compile using gcc like this:
 *      gcc -o va-omp-simple VA-OMP-simple.c -fopenmp
 *
 *  Execute:
 *      ./va-omp-simple
 */
int main (int argc, char *argv[])
{
    // elements of arrays a and b will be added
    // and placed in array c
    int * a;
    int * b;
    int * c;

    int n = ARRAY_SIZE;                 // number of array elements
    int n_per_thread;                   // elements per thread
    int total_threads = NUM_THREADS;    // number of threads to use
    int i;                              // loop index

    // allocate space for the arrays
    a = (int *) malloc(sizeof(int)*n);
    b = (int *) malloc(sizeof(int)*n);
    c = (int *) malloc(sizeof(int)*n);

    // initialize arrays a and b with consecutive integer values
    // as a simple example
    for(i=0; i<n; i++) {
        a[i] = i;
    }
    for(i=0; i<n; i++) {
        b[i] = i;
    }

    // Additional work to set the number of threads.
    // We hard-code to 4 for illustration purposes only.
    omp_set_num_threads(total_threads);

    // determine how many elements each thread will work on
    n_per_thread = n/total_threads;

    // Compute the vector addition
    // Here is where the 4 threads are specifically 'forked' to
    // execute in parallel. This is directed by the pragma and
    // thread forking is compiled into the resulting executable.
    // Here we use a 'static schedule' so each thread works on
    // a 2-element chunk of the original 8-element arrays.
    #pragma omp parallel for shared(a, b, c) private(i) schedule(static, n_per_thread)
    for(i=0; i<n; i++) {
        c[i] = a[i]+b[i];
        // Which thread am I? Show who works on what for this small example
        printf("Thread %d works on element %d\n", omp_get_thread_num(), i);
    }

    // Check for correctness (only plausible for small vector size)
    // A test we would eventually leave out
    printf("i\ta[i]\t+\tb[i]\t=\tc[i]\n");
    for(i=0; i<n; i++) {
        printf("%d\t%d\t\t%d\t\t%d\n", i, a[i], b[i], c[i]);
    }

    // clean up memory
    free(a);  free(b);  free(c);

    return 0;
}
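When you compile and run the program as described in its comments, it first prints one "Thread X works on element Y" line per element; those eight lines may appear in a different order from run to run, but with the static schedule above each thread should report only its own 2-element chunk. The correctness table that follows is deterministic, since a[i] = b[i] = i and therefore c[i] = 2*i, and should look roughly like this (tab spacing will vary):

i   a[i]  +  b[i]  =  c[i]
0   0        0        0
1   1        1        2
2   2        2        4
3   3        3        6
4   4        4        8
5   5        5        10
6   6        6        12
7   7        7        14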
