Last Updated: 2023-04-26 Wed 11:38

CSCI 5451 Assignment 3: Shared Memory and GPU Programming

CODE DISTRIBUTION: a3-code.zip

CHANGELOG:

Wed Apr 26 11:33:50 AM CDT 2023

A bug in test_kmeans_omp.org has been corrected; as reported in Post 84, Test 8 expected incorrect output. The code pack now contains a corrected version of the test file, which you can also download here: test_kmeans_omp.org

Some notes on accessing the nvcc compiler have been added, including how to initialize the module command if it is not available in your shell.

1 Overview

Like the previous assignment, this one involves coding parallel versions of the K-Means application according to 2 different parallel paradigms.

  1. Shared Memory Parallelism via OpenMP
  2. GPU Parallelism via CUDA

For both problems, after finishing your code, you will need to run some timing experiments and describe the results in a short text file.

2 Download Code and Setup

Download the code pack linked at the top of the page. Unzipping it will create a project folder; create new files in this folder. Ultimately you will re-zip this folder to submit it.

  ----------------------------------------------------------------------
   File                  State     Notes
  ----------------------------------------------------------------------
   ALL PROBLEMS
   A3-WRITEUP.txt        EDIT      Fill in timing tables and write
                                   answers to discussion questions
   Makefile              Provided  Build file to compile all programs
   testy                 Testing   Test running script
   kmeans_serial.c       CREATE    Serial version of K-means clustering
                                   to write in C
   kmeans.py             Provided  Serial version of K-means clustering
                                   in Python
   PROBLEM 1
   kmeans_omp.c          CREATE    Shared memory OMP version of K-means
                                   clustering to write in C
   test_kmeans_omp.org   Testing   Problem 1 tests for K-means clustering
   PROBLEM 2
   kmeans_cuda.cu        CREATE    GPU parallel version of K-means
                                   clustering to write in CUDA
   test_kmeans_cuda.org  Testing   Problem 2 tests for K-means clustering
  ----------------------------------------------------------------------

3 A3-WRITEUP.txt Writeup File

Below is a blank copy of the writeup document included in the code pack. Fill in answers directly into this file as you complete your programs and submit it as part of your upload.

                              ____________

                               A3 WRITEUP
                              ____________





GROUP MEMBERS
-------------

  - Member 1: <NAME> <X500>
  - Member 2: <NAME> <X500>

  Up to 2 people may collaborate on this assignment. Write names/x.500
  above. If working alone, leave off Member 2.

  ONLY ONE GROUP MEMBER SHOULD SUBMIT TO GRADESCOPE THEN ADD THEIR
  PARTNER ACCORDING TO INSTRUCTIONS IN THE ASSIGNMENT WEB PAGE.


Problem 1: kmeans_omp
=====================

kmeans_omp Timing Table
~~~~~~~~~~~~~~~~~~~~~~~

  Fill in the following table on measuring the performance of your
  `kmeans_omp' program on `cudaNN.cselabs.umn.edu' where `NN' is `01' to
  `05'. Replace 00.00 entries with your actual run times. You can use
  the provided `kmeans-omp.sh' script to ease this task.

  The columns are for each of 3 data files that are provided and run in
  the job script.

  -------------------------------------------------------------------
                                       Data File                     
   Procs  digits_all_5e3.txt  digits_all_1e4.txt  digits_all_3e4.txt 
  -------------------------------------------------------------------
       1               00.00               00.00               00.00 
       2               00.00               00.00               00.00 
       4               00.00               00.00               00.00 
       8               00.00               00.00               00.00 
      10               00.00               00.00               00.00 
      13               00.00               00.00               00.00 
      16               00.00               00.00               00.00 
      32               00.00               00.00               00.00 
  -------------------------------------------------------------------


kmeans_omp Discussion Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Analyze your table of results and answer the following questions.
  1. Did using more processors result in speedups?
  2. Describe any trends or anomalies you see in the timings and
     speculate on their causes - e.g. was there a steady increase in
     runtimes, a steady decrease, or jagged changes in timing?
  3. Try to explain how number of processors and problem size seem to
     affect runtimes/speedup in the problem.


Problem 2: kmeans_cuda
======================

kmeans_cuda Timing Table
~~~~~~~~~~~~~~~~~~~~~~~~

  Fill in the following table on measuring the performance of your
  `kmeans_cuda' program on `cudaNN.cselabs.umn.edu' where `NN' is `01'
  to `05'. Replace 00.00 entries with your actual run times. You can use
  the provided `kmeans-omp.sh' script to ease this task.

  The columns are for each of 3 data files that are provided and run in
  the job script.

  ------------------------------------------------------------------------
                                            Data File                     
   Version     digits_all_5e3.txt  digits_all_1e4.txt  digits_all_3e4.txt 
  ------------------------------------------------------------------------
   CPU Serial               00.00               00.00               00.00 
   GPU                      00.00               00.00               00.00 
  ------------------------------------------------------------------------


kmeans_cuda Discussion Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Analyze your table of results and answer the following questions.
  1. Did using the GPU result in speedups?
  2. Describe your general approach on how you used GPU threads/blocks
     to parallelize the algorithm.
     - How did you subdivide the Cluster Center Calculation Phase?
     - How did you subdivide the Data Assignment Phase?

4 Benchmarking on CUDA Machines

CSE Labs has a series of servers named cuda01.cselabs.umn.edu to cuda05.cselabs.umn.edu. Each of these has a large-ish number of CPU cores and an Nvidia GPU, along with the CUDA compilation toolchain including the nvcc compiler. You may develop your code anywhere you like but should perform timing evaluations on these machines.

The nvcc compiler is available on these machines via the module system. The following two commands initialize the module system and load the CUDA software.

cuda01>> source /usr/local/modules-tcl/init/bash
cuda01>> module load soft/cuda

Adding these lines to an init file like .bashrc is a good idea, as the setup will then happen automatically on every login.
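
If the load succeeds, nvcc should be on your PATH; a quick sanity check is to print its version.

cuda01>> nvcc --version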

5 Problem 1: OpenMP K-Means Clustering

Implement a parallel version of the K-means algorithm for shared memory using OpenMP. Name your program kmeans_omp.c so that it is compatible with the provided Makefile. We will discuss the basic strategy for parallelizing the code during lecture, so review that discussion if you need guidance.

5.1 Implementation Requirements

  1. kmeans_omp will accept the same command line arguments as the serial version.
  2. The behavior and output of the OpenMP version will match the serial version in format, e.g. printed messages and produced output files.
  3. The program will be compiled with OpenMP features enabled, so there is no requirement that it compile/work in the absence of OpenMP libraries or compiler support.
  4. As before, follow the general flow of the provided Python implementation. The OpenMP version should produce identical results, hopefully faster.
  5. Your program should honor the environment variable OMP_NUM_THREADS, which sets the number of threads used by OpenMP and is used in the provided timing script. The variable is honored automatically unless you do something special in your code like calling omp_set_num_threads(), so avoid doing that and all will be well. A quick check of this behavior appears below.
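
As a sanity check, the following stand-alone snippet (hypothetical, not part of the assignment code) shows the OpenMP runtime picking up OMP_NUM_THREADS on its own:

/* omp_check.c: hypothetical stand-alone example; compile with
   gcc -fopenmp omp_check.c -o omp_check */
#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel
  {
    #pragma omp single   /* one thread reports the team size */
    printf("running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}

Running OMP_NUM_THREADS=8 ./omp_check should report 8 threads with no calls to omp_set_num_threads() anywhere.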

5.2 Implementation Notes

  • Parallelize the two major phases of the algorithm: Cluster Center Calculation and Data Assignment to Clusters.
  • Note that the Cluster Center Calculation involves a reduction in the original form of the code. Take some care here to preserve correctness, as the built-in OpenMP reduction facility is likely insufficient. A locking mechanism or a "manual reduction" involving a thread-local copy of some data is recommended. If using a locking mechanism, create a lock for each cluster so that only one thread modifies a given new cluster center at a time. A sketch of the manual-reduction approach follows the list.
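
Below is a minimal sketch of the manual-reduction approach under assumed names: ndata, dim, nclust, data, assigns, counts, and centers are hypothetical stand-ins, not the actual provided code.

/* Hypothetical sketch of a manual reduction for the Cluster Center
   Calculation phase; all names are stand-ins for the real code. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void calc_centers(int ndata, int dim, int nclust,
                  float *data,     /* ndata x dim features, row-major */
                  int *assigns,    /* cluster id of each datum        */
                  int *counts,     /* number of data in each cluster  */
                  float *centers)  /* out: nclust x dim centers       */
{
  memset(centers, 0, sizeof(float) * nclust * dim);
  #pragma omp parallel
  {
    /* each thread sums into its own private copy of the centers */
    float *local = calloc(nclust * dim, sizeof(float));
    #pragma omp for
    for (int i = 0; i < ndata; i++) {
      int c = assigns[i];
      for (int d = 0; d < dim; d++)
        local[c*dim + d] += data[i*dim + d];
    }
    /* merge private copies; one thread at a time keeps this safe */
    #pragma omp critical
    for (int j = 0; j < nclust * dim; j++)
      centers[j] += local[j];
    free(local);
  }
  /* finish the average; cheap enough to leave serial */
  for (int c = 0; c < nclust; c++)
    if (counts[c] > 0)
      for (int d = 0; d < dim; d++)
        centers[c*dim + d] /= counts[c];
}

The lock-based alternative would instead keep an omp_lock_t per cluster and acquire the corresponding lock around each update of that cluster's running sum.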

5.3 Grading Criteria for Problem 1 (50 points)

Test cases are provided in test_kmeans_omp.org. The Makefile contains a target, make test-prob1, which runs the tests in that file.
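
For example, on a lab machine:

cuda01>> make prob1
cuda01>> make test-prob1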

  CRITERIA
10 Code compiles via make prob1, honors command line parameters on
   interactive runs.
10 Passes automated tests via make test-prob1.
10 Cleanly written code with good documentation that demonstrates
   appropriate use of OpenMP directives and functions to facilitate
   parallelism.
10 Written report includes the timing table described above.
10 Written report includes answers to the discussion questions above.

6 Problem 2: CUDA K-Means Clustering

Implement a parallel version of the K-means algorithm for Nvidia GPU acceleration using CUDA. Name your program kmeans_cuda.cu so that it is compatible with the provided Makefile. We will discuss the basic strategy for parallelizing the code during lecture, so review that discussion if you need guidance.

6.1 Implementation Requirements

  1. kmeans_cuda will accept the same command line arguments as the serial version.
  2. The behavior and output of the CUDA version will match the serial version in format, e.g. printed messages and produced output files.
  3. The program will be compiled with the CUDA compiler, so there is no requirement that it compile/work in the absence of CUDA libraries or compiler support.
  4. As before, follow the general flow of the provided Python implementation. The CUDA version should produce identical results, hopefully faster.
  5. No runtime parameters to set the number of threads/blocks will be provided. Instead, determine the number of threads/blocks to use from the data size. The timing tables to report reflect this: report times only on the data files indicated.

6.2 Implementation Notes

  • The nvcc compiler is a C++ compiler and so has slightly different conventions than gcc. Errors for assigning pointers without explicit casts are common, and .c files yield different compile/link behavior than .cu files. You may need to adjust the provided Makefile as it uses the files kmeans_cuda.cu and kmeans_util.cu during compilation. Only a kmeans_util.c file is provided, but symbolically linking it via ln -s kmeans_util.c kmeans_util.cu allows its dual use for both C and C++ compilation.
  • Make use of cudaMalloc() / cudaMemcpy() to allocate memory for the data and cluster structs / arrays on the GPU and transfer the dataset to it. You may assume that the entire dataset fits in GPU memory. Keep in mind that data must be moved between the Host / Device through explicit calls: anything computed on the GPU must be copied back to the CPU.
  • The two major phases of the algorithm should be parallelized

    1. Computation of cluster centers based on assignment of data to clusters.
    2. Assignment of data to clusters based on cluster centers.

    These two phases should each be (at least) their own kernel, though it may be wise to subdivide further.

  • There are no specific requirements on how to subdivide the tasks into Blocks / Threads and you are encouraged to experiment somewhat to try to get better performance. Keep in mind that there is no ability to synchronize threads between blocks and atomic operations are costly. Consider carefully how to subdivide the loops of the problem to get correct results and good performance.
  • Calculating Cluster Centers: This phase calculates new cluster centers based on the existing assignment of data to clusters. It is essentially a reduction (sum) over features followed by a division by the cluster counts; alternatively, the division can be done during the reduction at the risk of small numerical differences. Since reductions in CUDA are tricky, take care to find a work distribution among threads that enables good performance and avoids the need to synchronize. Options to consider are 1 thread per feature dimension, 1 thread per cluster, and combinations thereof. This is the trickier phase to parallelize well.
  • Assignment to Clusters: This phase calculates the assignment of each data element to a cluster based on the existing cluster centers. It should also calculate the total number of data in each cluster based on the new assignment. Since the number of data points is expected to be large, an easy way to parallelize the loop is to assign one thread per data element. With this parallelization, the only major coordination is totaling the cluster counts. A sketch combining these notes follows the list.
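
The following is a minimal sketch of both phases under assumed names: ndata, dim, nclust, data, centers, assigns, and counts are hypothetical stand-ins for whatever your structs contain, not the provided code, and error checking is omitted.

// Hypothetical sketch only; all names are stand-ins, not provided code.
#include <math.h>

// Assignment phase: one thread per datum. atomicAdd() totals the
// per-cluster counts, the only cross-thread coordination needed.
__global__ void assign_kernel(int ndata, int dim, int nclust,
                              const float *data, const float *centers,
                              int *assigns, int *counts)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= ndata) return;
  int best = 0;
  float best_dist = INFINITY;
  for (int c = 0; c < nclust; c++) {   // nearest center, squared distance
    float dist = 0.0f;
    for (int d = 0; d < dim; d++) {
      float diff = data[i*dim + d] - centers[c*dim + d];
      dist += diff * diff;
    }
    if (dist < best_dist) { best_dist = dist; best = c; }
  }
  assigns[i] = best;
  atomicAdd(&counts[best], 1);
}

// Center Calculation phase using one of the distributions mentioned
// above: one thread per (cluster, feature) pair. Each thread owns a
// single output coordinate, so no synchronization is required.
__global__ void centers_kernel(int ndata, int dim, int nclust,
                               const float *data, const int *assigns,
                               const int *counts, float *centers)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // j encodes (c, d)
  if (j >= nclust * dim) return;
  int c = j / dim, d = j % dim;
  float sum = 0.0f;
  for (int i = 0; i < ndata; i++)
    if (assigns[i] == c)
      sum += data[i*dim + d];
  if (counts[c] > 0)
    centers[j] = sum / counts[c];       // divide after the reduction
}

// One iteration on the device; launches are sized from the data as the
// requirements specify. Setup such as
//   cudaMalloc((void **) &d_data, sizeof(float) * ndata * dim);
//   cudaMemcpy(d_data, h_data, sizeof(float) * ndata * dim,
//              cudaMemcpyHostToDevice);
// happens once before iterating, and results return to the host
// afterwards via cudaMemcpy(..., cudaMemcpyDeviceToHost).
void kmeans_iteration(int ndata, int dim, int nclust,
                      float *d_data, float *d_centers,
                      int *d_assigns, int *d_counts)
{
  int threads = 256;                    // arbitrary block size choice
  cudaMemset(d_counts, 0, sizeof(int) * nclust);
  assign_kernel<<<(ndata + threads - 1) / threads, threads>>>
      (ndata, dim, nclust, d_data, d_centers, d_assigns, d_counts);
  centers_kernel<<<(nclust * dim + threads - 1) / threads, threads>>>
      (ndata, dim, nclust, d_data, d_assigns, d_counts, d_centers);
}

Kernels launched on the default stream execute in order, so centers_kernel safely reads the counts produced by assign_kernel. With a small number of clusters the per-(cluster, feature) kernel may underutilize the GPU, which is where further subdivision and experimentation can pay off.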

6.3 Grading Criteria for Problem 2 (50 points)

Test cases are provided in test_kmeans_cuda.org. The Makefile contains a target, make test-prob2, which runs the tests in that file.

  CRITERIA
10 Code compiles via make prob2, honors command line parameters on
   interactive runs.
10 Passes automated tests via make test-prob2 (will only work if CUDA is
   present, such as on the lab machines).
   NOTE: Gradescope will not run test-prob2 as it does not have a GPU;
   this score will be added during grading.
10 Cleanly written code with good documentation that demonstrates
   appropriate use of:
   - code that transfers data between Host / Device
   - several CUDA kernels which parallelize the tasks present in the
     clustering algorithm
10 Written report includes the timing table described above.
10 Written report includes answers to the discussion questions above.

7 Project Submission on Gradescope

Submission is identical to the previous programming assignment. See the A2 Description if you need additional details.

Late Policies

You may wish to review the policy on late project submission which will cost 1 Engagement Point per day late. No projects will be accepted more than 48 hours after the deadline.

https://www-users.cs.umn.edu/~kauffman/5451/syllabus.html


Author: Chris Kauffman (kauffman@umn.edu)
Date: 2023-04-26 Wed 11:38