Warning
The Unified Memory parts of this lab may not work on your machine. Run
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
and check that your device's CUDA compute capability is >= 3.0. If the command gives you an error, you may need to compile the samples first:
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
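If you prefer to check programmatically, a minimal sketch using the CUDA runtime's cudaGetDeviceProperties() can report the compute capability directly (this is not part of the lab's code, just a convenience):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    // Unified Memory requires compute capability >= 3.0
    if (prop.major >= 3)
        printf("Unified Memory should be supported.\n");
    return 0;
}
```

Compile it with nvcc and run it; the last line tells you whether the Unified Memory parts of this lab should work.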
Download vectorAdd.tar.gz and extract it with
tar -xzvf vectorAdd.tar.gz
Open vectorAdd.cu and vectorAdd6.cu and familiarize yourself with the code. Compile and run the programs.
Note
Don’t compile vectorAdd6.cu if your machine is incompatible.
Run the programs and see what happens.
Using omp_get_wtime(), modify vectorAdd.cu so that it reports the time taken by the sequential (CPU) computation and the CUDA computation, with the CUDA time broken into the data transfers to and from the device and the kernel computation itself.
None of these times should include any I/O, so make sure you comment out the printf() statements.
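One way to instrument the code — a sketch, not the lab's exact solution, assuming the variable names used in the stock vectorAdd sample (h_A, d_A, size, numElements, threadsPerBlock, blocksPerGrid) — is to bracket each phase with omp_get_wtime():

```cuda
#include <omp.h>   // for omp_get_wtime()

// Timing pattern: take a timestamp before and after each phase.
double start, transferTo, compute, transferBack;

start = omp_get_wtime();
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
transferTo = omp_get_wtime() - start;          // host-to-device transfer time

start = omp_get_wtime();
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
cudaDeviceSynchronize();    // kernel launches are asynchronous; wait for it
compute = omp_get_wtime() - start;             // kernel computation time

start = omp_get_wtime();
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
transferBack = omp_get_wtime() - start;        // device-to-host transfer time
```

Note the cudaDeviceSynchronize() call: without it, the kernel timing would only measure the launch, not the computation. To use omp_get_wtime() you may need to pass an OpenMP flag through nvcc (e.g. -Xcompiler -fopenmp) if the Makefile does not already do so.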
Use the Makefile to build your modified version of the program. When it compiles successfully, run it as follows:
./vectorAdd
The program’s default array size is 50,000 elements.
In a spreadsheet, record your times in a column labeled 50,000. Which is faster, the CUDA version or the CPU version?
Repeat the process with a larger array by running the program with 500,000 elements:
./vectorAdd 500000
Record your results. Repeat the process again with 5,000,000, 50,000,000, and 500,000,000 elements. How do these times compare? Were you able to run all of them successfully? If not, why not?
Create a line chart with one line for the sequential code’s times and one line for the CUDA code’s total times. Your X-axis should be labeled with 50,000; 500,000; 5,000,000; and 50,000,000, and your Y-axis should be the time.
Then create a “stacked” bar chart of the CUDA times with the same X and Y axes as your first chart. For each X-axis value, this chart should “stack” the CUDA computation’s data-transfer time on top of its computation time.
What observations can you make about the CUDA vs the sequential computations? How much time does the CUDA computation spend transferring data compared to computing? What is the answer to our first research question?
Note
Skip this section if your device is not compatible with Unified Memory.
Using omp_get_wtime(), modify vectorAdd6.cu so that it reports the same set of times you measured for vectorAdd.cu.
Again, none of these times should include any I/O, so make sure you comment out the printf() statements.
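Assuming vectorAdd6.cu allocates its arrays with cudaMallocManaged() (the usual Unified Memory approach), there are no explicit cudaMemcpy() calls to time; data migration happens implicitly when the kernel touches the managed arrays. A sketch of the timing, using hypothetical names A, B, C for the managed pointers:

```cuda
#include <omp.h>   // for omp_get_wtime()

// With Unified Memory, page migration is folded into kernel execution,
// so time the kernel launch through cudaDeviceSynchronize() as one phase.
double start = omp_get_wtime();
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, numElements);
cudaDeviceSynchronize();   // waits for the kernel and any page migration
double cudaTime = omp_get_wtime() - start;
```

Because migration cost is hidden inside the kernel phase, compare this single CUDA time against the separate transfer and compute times you recorded for the cudaMemcpy() version.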
Run your program using
./vectorAdd6
Record your results using 50,000; 500,000; 5,000,000; and 50,000,000 elements. How do these times compare?
Add this new data to the line chart and stacked bar chart from part one. How does using unified memory compare to using cudaMemcpy? What is the bottleneck for the cudaMemcpy version? What about the unified memory version?