Last Updated: 2023-05-03 Wed 14:45

CSCI 5451 Assignment 2: Distributed and Shared Memory Programming

CODE DISTRIBUTION: a2-code.zip

GRADESCOPE TESTS: a2-gradescope-tests.zip

CHANGELOG:

Thu Apr 27 05:05:38 PM CDT 2023
The weight on this assignment was incorrectly stated as 30%; it is worth 12.5% of the overall grade.
Wed Apr 12 08:20:16 AM CDT 2023

The deadline for the project has been extended to Wed 12-Apr.

The Gradescope submission link is open. The automated tests on Gradescope have no memory checking via Valgrind as this proved intractable to make work across systems. The Gradescope tests are linked here: a2-gradescope-tests.zip. Running them locally should reflect the behavior on Gradescope. Add the Makefile-target file's contents to your Makefile and run make test-gradescope.

Fri Apr 7 02:57:14 PM CDT 2023
The data file digits_all_3e4.txt was initially missing and has been added into the codepack. You can also download a stand-alone copy of it here: digits_all_3e4.txt
Wed Apr 5 03:16:50 PM CDT 2023
Minor typo fixes to change to heat_mpi rather than mpi_heat
Thu Mar 30 02:04:39 PM CDT 2023
An updated version of the file test_kmeans_serial.org is now available which uses the proper name for the data directory; it should be mnist-data but was wrong in the original version.

1 Overview

The assignment involves programming in MPI and describing the results of running your programs with several different parameter sets. It is a programming assignment so dust off your C skills. We have spent some class time discussing issues related to the assignment, but it may be a good idea to review the lecture videos from when this took place earlier in the semester. It will pay to start early and get oriented as debugging parallel programs can be difficult and requires time.

There are 3 problems to solve.

  1. Parallelize the heat program from A1 using MPI
  2. Ensure your serial version of K-Means is intact.
  3. Parallelize K-Means clustering using MPI.

For several of the problems, after finishing your code, you will need to run some timing experiments and describe the results in a short text file.

2 Download Code and Setup

Download the code pack linked at the top of the page. Unzipping it will create a project folder. Create your new files in this folder; ultimately you will re-zip the folder to submit it.

File                    State     Notes
ALL PROBLEMS
A2-WRITEUP.txt          EDIT      Fill in timing tables and write answers to discussion questions
Makefile                Provided  Build file to compile all programs
testy                   Testing   Test running script
test_mpi.supp           Testing   Suppression file to get Valgrind to hide some library errors
msi-setup.sh            Provided  Source this to set the path for MPI on MSI machines
mpi_hello.c             Provided  Demo MPI program along with a debug printing function
PROBLEM 1
heat_mpi.c              CREATE    Problem 1 parallel version of the heat program
heat_serial.c           Provided  Problem 1 serial version of the heat problem
heat-slurm.sh           Provided  Problem 1 script to generate timings for heat on MSI
test_heat.org           Testing   Problem 1 tests for the heat program
PROBLEM 2
kmeans_serial.c         CREATE    Problem 2 serial version of K-means clustering to write in C
kmeans.py               Provided  Problem 2 serial version of K-means clustering in Python
test_kmeans_serial.org  Testing   Problem 2 tests for serial K-means clustering
PROBLEM 3
kmeans_mpi.c            CREATE    Problem 3 parallel (MPI) version of K-means clustering
test_kmeans_mpi.org     Testing   Problem 3 tests for MPI K-means clustering
                                  NOTE: The test_kmeans_mpi.org file will be released later

3 A2-WRITEUP.txt Writeup File

Below is a blank copy of the writeup document included in the code pack. Fill in answers directly into this file as you complete your programs and submit it as part of your upload.

                              ____________

                               A2 WRITEUP
                              ____________





GROUP MEMBERS
-------------

  - Member 1: <NAME> <X500>
  - Member 2: <NAME> <X500>

  Up to 2 people may collaborate on this assignment. Write names/x.500
  above. If working alone, leave off Member 2.

  ONLY ONE GROUP MEMBER SHOULD SUBMIT TO GRADESCOPE THEN ADD THEIR
  PARTNER ACCORDING TO INSTRUCTIONS IN THE ASSIGNMENT WEB PAGE.


Problem 1: heat_mpi
===================

heat_mpi Timing Table
~~~~~~~~~~~~~~~~~~~~~

  Fill in the following table on measuring the performance of your
  `heat_mpi' program on MSI's Mesabi machine. Replace 00.00 entries with
  your actual run times. You can use the provided `heat-slurm.sh' script
  to ease this task. Submit it using `sbatch heat-slurm.sh' and extract
  the lines marked `runtime:'.

  -----------------------------
                 Width         
   Procs   6400  25600  102400 
  -----------------------------
       1  00.00  00.00   00.00 
       2  00.00  00.00   00.00 
       4  00.00  00.00   00.00 
       8  00.00  00.00   00.00 
      10  00.00  00.00   00.00 
      16  00.00  00.00   00.00 
      32  00.00  00.00   00.00 
      64  00.00  00.00   00.00 
     128  00.00  00.00   00.00 
  -----------------------------


heat_mpi Discussion Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Analyze your table of results and answer the following questions.
  1. Did using more processors result in speedups?
  2. Describe any trends or anomalies you see in the timings and
     speculate on their causes - e.g. was there a steady increase in
     runtimes, steady decrease, or jagged changes in timing?
  3. Try to explain how number of processors and problem size seem to
     affect runtimes/speedup in the problem. Consider that most Mesabi
     Nodes have 24 cores on them. Comment on whether this number seems
     to affect the performance of the runs you see.


Problem 2: kmeans_serial vs kmeans_mpi
======================================

  Discuss how you chose to parallelize your serial version of K-means in
  the program `kmeans_mpi.c'. Answer the following questions briefly.
  1. How is the input and output data partitioned among the processors?
  2. What communication is required throughout the algorithm?
  3. Which MPI collective communication operations did you employ?


Problem 3: kmeans_mpi
=====================

kmeans_mpi Timing Table
~~~~~~~~~~~~~~~~~~~~~~~

  Fill in the following table on measuring the performance of your
  `kmeans_mpi' program on MSI's Mesabi machine. Replace 00.00 entries
  with your actual run times. You can use the provided `kmeans-slurm.sh'
  script to ease this task.

  The columns are for each of 3 data files that are provided and run in
  the job script.

  -------------------------------------------------------------------
                                       Data File                     
   Procs  digits_all_5e3.txt  digits_all_1e4.txt  digits_all_3e4.txt 
  -------------------------------------------------------------------
       1               00.00               00.00               00.00 
       2               00.00               00.00               00.00 
       4               00.00               00.00               00.00 
       8               00.00               00.00               00.00 
      10               00.00               00.00               00.00 
      16               00.00               00.00               00.00 
      32               00.00               00.00               00.00 
      64               00.00               00.00               00.00 
     128               00.00               00.00               00.00 
  -------------------------------------------------------------------


kmeans_mpi Discussion Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Analyze your table of results and answer the following questions.
  1. Did using more processors result in speedups?
  2. Describe any trends or anomalies you see in the timings and
     speculate on their causes - e.g. was there a steady increase in
     runtimes, steady decrease, or jagged changes in timing?
  3. Try to explain how number of processors and problem size seem to
     affect runtimes/speedup in the problem. Consider that most Mesabi
     Nodes have 24 cores on them. Comment on whether this number seems
     to affect the performance of the runs you see.

4 mpi_hello.c: A Sample MPI Program

A simple sample program called mpi_hello.c is provided as part of the code distribution. This program includes a useful utility for debugging purposes.

  • dprintf(fmt,...) is like printf() except that it only prints if the environment variable DEBUG is set, and it displays processor information in a prefix beginning with |DEBUG. This enables debugging messages to be printed during development but disabled during normal runs.

Compiling and running the program can be done locally on any machine with MPI installed as follows.

>> . mpiopts.sh                       # set the MPIOPTS environment variable

>> mpirun $MPIOPTS -np 4 ./mpi_hello   # run the code normally
Hello world from process 0 of 4 (host: val)
Hello from the root processor 0 of 4 (host: val)
Hello world from process 2 of 4 (host: val)
Hello world from process 1 of 4 (host: val)
Hello world from process 3 of 4 (host: val)

>> DEBUG=1 mpirun -np 4 ./mpi_hello  # enable debug messages for this run
Hello world from process 2 of 4 (host: val)
Hello world from process 3 of 4 (host: val)
Hello world from process 1 of 4 (host: val)
|DEBUG Proc 002 / 4 PID 1484871 Host val| Debug message from processor 2
|DEBUG Proc 003 / 4 PID 1484872 Host val| Debug message from processor 3
|DEBUG Proc 001 / 4 PID 1484870 Host val| Debug message from processor 1
Hello world from process 0 of 4 (host: val)
Hello from the root processor 0 of 4 (host: val)
|DEBUG Proc 000 / 4 PID 1484869 Host val| Debug message from processor 0

Debug printing takes time and should be turned off when reporting runtimes for programs. Using the shell commands below ensures that the DEBUG environment variable is unset so debug printing is turned off for further runs.

>> echo $DEBUG                  # check value of DEBUG env var
1                               # currently defined

>> unset DEBUG                  # unset it to remove it
>> echo $DEBUG                  # now has no value

>> mpirun -np 4 ./mpi_hello     # this run has no debug output
P 0: Hello world from process 0 of 4 (host: val)
Hello from the root processor 0 of 4 (host: val)
P 2: Hello world from process 2 of 4 (host: val)
P 3: Hello world from process 3 of 4 (host: val)
P 1: Hello world from process 1 of 4 (host: val)

NOTE: the dprintf() function is somewhat inefficient even when debug output is turned off as it requires calls to getenv(). There are more efficient alternatives that involve macros, but those require recompiling the code. Since this is a learning exercise we can tolerate some performance hits in the name of easier debugging. In the wild you may want to consider alternative debug printing techniques.
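
To make the idea concrete, below is a minimal sketch of how such a DEBUG-gated printer might be structured. The function name dprintf_sketch and the exact prefix format are assumptions for illustration only; the real dprintf() in mpi_hello.c may differ in its details.

#include <mpi.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Sketch of a DEBUG-gated printer; dprintf_sketch is a made-up name so it
// does not collide with the provided dprintf()
void dprintf_sketch(const char *fmt, ...){
  if(getenv("DEBUG") == NULL){                 // silent unless DEBUG is set
    return;
  }
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  char host[256];
  gethostname(host, sizeof(host));
  printf("|DEBUG Proc %03d / %d PID %d Host %s| ",
         rank, nprocs, (int)getpid(), host);
  va_list args;                                // forward the format arguments
  va_start(args, fmt);
  vprintf(fmt, args);
  va_end(args);
}

int main(int argc, char *argv[]){
  MPI_Init(&argc, &argv);
  dprintf_sketch("Debug message from a processor\n");
  MPI_Finalize();
  return 0;
}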

5 MPI Jobs on MSI Machines

The Minnesota Supercomputing Institute has facilities to run parallel programs which we will use to evaluate our code. Jobs are handled in two ways

  1. Interactive Jobs which can be run on login nodes. SSH to a machine, compile your code, and run it. You may use only a limited number of processors this way but it is extremely useful while developing and testing. Login nodes are shared among all logged-in users so performance numbers are unreliable (e.g. affected by the activity of other users).
  2. Batch Jobs which are submitted to a job queue via the sbatch job.sh command. When a job is selected to be run, it is assigned to run exclusively on some compute nodes so that it may achieve reliable and maximal performance. This is where timing results will be reported.

We will use the mesabi MSI system which can be reached via SSH connections:

>> ssh kauffman@mesabi.msi.umn.edu
password: ..........
Duo two-factor login for kauffman

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-4971
 2. Phone call to XXX-XXX-4971

Passcode or option (1-2): 1
...

ln0006 [~]% 

It is suggested that you set up SSH keys to ease login to Mesabi though you will need to continue Duo-authenticating during login.

5.1 Software Modules on MSI

Additionally, MSI uses a software modules system to specify use of different libraries and code packages. To access the most current version of MPI, you will need to run the following command in your login shell.

# load mpi software into path
ln0006 [~]% module load ompi/4.0.0/gnu-8.2.0-centos7

# alternative using script in code pack directory
ln0006 [~]% source a2-code/msi-setup.sh

# check version of MPI
ln0006 [~]% which mpicc
/panfs/roc/msisoft/openmpi/el6/4.0.0/gnu-8.2.0.CentOS7/bin/mpicc

ln0006 [~]% which mpirun
/panfs/roc/msisoft/openmpi/el6/4.0.0/gnu-8.2.0.CentOS7/bin/mpirun

Note that the msi-setup.sh script is provided as an alternative and can be sourced.

The list of all modules available is seen via module avail. Note that the version of MPI we will use is the most current but not the default.

5.2 Batch Jobs on MSI

Use the sbatch job.sh command to submit a job to the SLURM job queue. To ease timing evaluations, a job script with required parameters has been provided for all of the problems. For instance, for the Heat problem, use the script heat-slurm.sh. Checking on the status of jobs in the queue is done via squeue -u username. The provided scripts have the following properties:

  • Require you to set the directory of your code in them but otherwise do not need to be modified. Look for the line

      # ADJUST: location of executable
      cd ~/mesabi-a2-5451/
    

    in the script and change the target directory to wherever you have placed your code. Ensure it is compiled ahead of time.

  • Will run MPI jobs for the required programs on the required number of processors
  • Output for the job will be left in a file like heat-slurm.sh.job-156425531.out which starts with the job script name
  • Produce output lines marked runtime: which report the wall clock time along with the problem parameters; these appear like

    runtime: procs 16 width 102400 realtime 5.15
    

    and make it possible to use grep or other tools to quickly fish out the lines reporting runtimes for construction of the results.

  • Will charge compute time to the CSCI 5451 class account via the --account option to SLURM
  • The scripts load the correct MPI software module at the beginning and then call mpirun on the program.

The session below demonstrates use of sbatch and squeue

ln0006 [mesabi-a2-5451]% sbatch heat-slurm.sh    # submit script to run for heat jobs
Submitted batch job 156504577

ln0006 [mesabi-a2-5451]% squeue -u kauffman      # check on the submitted jobs by user kauffman
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         156504535     small kmeans-s kauffman PD       0:00      6 (Priority)
         156504577     small heat-slu kauffman PD       0:00      6 (None)

# jobs are PD - Pending but not being run yet


# a useful command to "watch" the queue, press Ctl-c to kill the watch
ln0006 [mesabi-a2-5451]% watch 'squeue -u kauffman'

# later....
ln0006 [mesabi-a2-5451]% squeue -u kauffman      # check on the submitted jobs by user kauffman
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

ln0006 [mesabi-a2-5451]%                         # no jobs listed, all jobs done with results

ln0006 [mesabi-a2-5451]% ls *156504577*          # check for output job
heat-slurm.sh.job-156504577.out

6 Testing MPI Codes

6.1 On MSI Machines

Testing your code on the MSI machines is ideal as it is the same environment that the timing evaluation will use. The same commands can be used (load the MPI module, compile, run with mpirun -np 4), with the minor limitation that there is a cap on the number of processors that can be used interactively.

6.2 On CSE Labs Machines

CSE Labs machines have an MPI installation (sort of) set up on the machines cuda01-cselabs.umn.edu to cuda05-cselabs.umn.edu. By logging in, one will have access on those machines to the mpicc and mpirun commands. However, the configuration is a bit unwieldy and requires some setup. Here are some gotchas to keep in mind; they may well push you towards the MSI machines.

  1. You must use a Hostfile via mpirun -hostfile hf.txt and due to the configuration problems on these systems, the hosts will need to be listed via their IP address rather than their symbolic name.
  2. You will need to have SSH keys set up for password-free login. For details on how to do this see this tutorial.
  3. All of the cuda-NN machines will need to be in your authorized host file. The easiest way to do this is to log in to cuda01 then, from that machine, SSH to cuda02, then cuda03, and so forth up to cuda05. This only needs to be done once but if you fail to do so, MPI runs beyond the local number of processors will hang silently.

Surmounting these factors is not difficult but creates enough friction that MSI is easier and preferred for testing.

6.3 On Home Machines

Setting up a home MPI installation is another option. In most cases a Linux or Unix system can install OpenMPI with a package manager, and by using mpirun --oversubscribe one can test with more processes than there are physical cores on the system. Do not expect speedup in these cases but it is extremely useful for debugging.

7 Problem 1: MPI Heat

7.1 The Heat Problem

A slightly modified version of the heat propagation simulation from HW1 and in-class discussion is in the code pack and called heat_serial.c. This program can be compiled and run with the provided Makefile as follows.

>> make heat_serial             # build program
gcc -g -Wall -o heat_serial heat_serial.c

>> ./heat_serial                # run with no args to show help info
usage: ./heat_serial max_time width print
  max_time: int
  width: int
  print: 1 print output, 0 no printing

>> ./heat_serial 10 8 1         # run for 10 timesteps with 8 "elements"
   |     0     1     2     3     4     5     6     7 
---+-------------------------------------------------
  0|  20.0  50.0  50.0  50.0  50.0  50.0  50.0  10.0 
  1|  20.0  35.0  50.0  50.0  50.0  50.0  30.0  10.0 
  2|  20.0  35.0  42.5  50.0  50.0  40.0  30.0  10.0 
  3|  20.0  31.2  42.5  46.2  45.0  40.0  25.0  10.0 
  4|  20.0  31.2  38.8  43.8  43.1  35.0  25.0  10.0 
  5|  20.0  29.4  37.5  40.9  39.4  34.1  22.5  10.0 
  6|  20.0  28.8  35.2  38.4  37.5  30.9  22.0  10.0 
  7|  20.0  27.6  33.6  36.3  34.7  29.8  20.5  10.0 
  8|  20.0  26.8  32.0  34.1  33.0  27.6  19.9  10.0 
  9|  20.0  26.0  30.5  32.5  30.9  26.5  18.8  10.0 

>> ./heat_serial 10 8 0         # same run but don't print output, useful for timing as output takes a while

>> ./heat_serial 12 5 1         # run for 12 timesteps with 5 columns / elements
   |     0     1     2     3     4 
---+-------------------------------
  0|  20.0  50.0  50.0  50.0  10.0 
  1|  20.0  35.0  50.0  30.0  10.0 
  2|  20.0  35.0  32.5  30.0  10.0 
  3|  20.0  26.2  32.5  21.2  10.0 
  4|  20.0  26.2  23.8  21.2  10.0 
  5|  20.0  21.9  23.8  16.9  10.0 
  6|  20.0  21.9  19.4  16.9  10.0 
  7|  20.0  19.7  19.4  14.7  10.0 
  8|  20.0  19.7  17.2  14.7  10.0 
  9|  20.0  18.6  17.2  13.6  10.0 
 10|  20.0  18.6  16.1  13.6  10.0 
 11|  20.0  18.0  16.1  13.0  10.0 
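
Judging from the sample runs above, each interior element at the next time step appears to be the average of its two neighbors at the current step, while the two end elements stay fixed. The sketch below is inferred only from that output; heat_serial.c in the code pack is the authoritative version (argument handling, the printed row/column labels, and other details are simplified here).

#include <stdio.h>
#include <stdlib.h>

int main(void){
  int max_time = 10, width = 8;                   // hard-coded for the sketch
  double *cur  = malloc(width*sizeof(double));
  double *next = malloc(width*sizeof(double));
  cur[0] = 20.0; cur[width-1] = 10.0;             // fixed end temperatures
  for(int j=1; j<width-1; j++){ cur[j] = 50.0; }  // interior starts hot

  for(int t=0; t<max_time; t++){
    for(int j=0; j<width; j++){ printf("%6.1f", cur[j]); }
    printf("\n");
    next[0] = cur[0];                             // ends stay fixed
    next[width-1] = cur[width-1];
    for(int j=1; j<width-1; j++){                 // interior: neighbor average
      next[j] = (cur[j-1] + cur[j+1]) / 2.0;
    }
    double *tmp = cur; cur = next; next = tmp;    // swap time steps
  }
  free(cur); free(next);
  return 0;
}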

7.2 MPI Heat

In our Thu 02-Feb Lecture we discussed an MPI version of the heat transfer problem and it may be worthwhile to review this lecture to recall some details of data distribution and communication.

The central task of this problem is to create an MPI version of this program named heat_mpi which performs the same task but uses MPI calls to perform the heat calculations on distributed memory machines. Once completed, this program can be run as follows.

>> make heat_mpi                               # build MPI version of heat program
mpicc -g -Wall -o heat_mpi heat_mpi.c

>> source mpiopts.sh                           # set the MPIOPTS env variable, used to suppress warnings

>> mpirun $MPIOPTS -np 2 ./heat_mpi 10 8 1     # run using 2 procs, 10 steps, 8 elements = 4 per proc
   |     0     1     2     3     4     5     6     7 
---+-------------------------------------------------
  0|  20.0  50.0  50.0  50.0  50.0  50.0  50.0  10.0 
  1|  20.0  35.0  50.0  50.0  50.0  50.0  30.0  10.0 
  2|  20.0  35.0  42.5  50.0  50.0  40.0  30.0  10.0 
  3|  20.0  31.2  42.5  46.2  45.0  40.0  25.0  10.0 
  4|  20.0  31.2  38.8  43.8  43.1  35.0  25.0  10.0 
  5|  20.0  29.4  37.5  40.9  39.4  34.1  22.5  10.0 
  6|  20.0  28.8  35.2  38.4  37.5  30.9  22.0  10.0 
  7|  20.0  27.6  33.6  36.3  34.7  29.8  20.5  10.0 
  8|  20.0  26.8  32.0  34.1  33.0  27.6  19.9  10.0 
  9|  20.0  26.0  30.5  32.5  30.9  26.5  18.8  10.0 

>> mpirun $MPIOPTS -np 4 ./heat_mpi 6 12 1     # run using 4 procs, 6 steps, 12 elements = 3 per proc 
   |     0     1     2     3     4     5     6     7     8     9    10    11 
---+-------------------------------------------------------------------------
  0|  20.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  10.0 
  1|  20.0  35.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  30.0  10.0 
  2|  20.0  35.0  42.5  50.0  50.0  50.0  50.0  50.0  50.0  40.0  30.0  10.0 
  3|  20.0  31.2  42.5  46.2  50.0  50.0  50.0  50.0  45.0  40.0  25.0  10.0 
  4|  20.0  31.2  38.8  46.2  48.1  50.0  50.0  47.5  45.0  35.0  25.0  10.0 
  5|  20.0  29.4  38.8  43.4  48.1  49.1  48.8  47.5  41.2  35.0  22.5  10.0 

>> time mpirun $MPIOPTS -np 4 ./heat_mpi 6 12 0 # same as above but suppress output and time the run
real	0m0.168s    # wall clock time to report for the run
user	0m0.102s
sys	0m0.073s

7.3 Features of heat_mpi

  • Name your program heat_mpi.c to be compatible with the provided Makefile. It has a target to build both heat_serial and heat_mpi if you name the source file heat_mpi.c.
  • The serial version of the program provided accepts 3 command line arguments:

    1. Number of time steps (rows in output)
    2. Width of the rod in elements (columns of output)
    3. 1 or 0 to indicate whether final output should be printed or suppressed.

    The MPI version should allow for the same arguments so that runs like the following will work.

      >> mpirun $MPIOPTS -np 4 ./heat_mpi 10 40 1   # 4 procs, 10 timesteps, width 40, show output 
      ...                                           # output for the run
    
      >> mpirun $MPIOPTS -np 4 ./heat_mpi 10 40 0   # same but no output
      >> 
    
  • There is a small script called mpiopts.sh which can set options for MPI runs to suppress unnecessary warning messages. While experimenting with your programs in a shell, you can source this script via the following.

      >> source mpiopts.sh            # sets variable MPIOPTS
      
      >> echo $MPIOPTS                # show value of MPIOPTS
      --mca opal_warn_on_missing_libcuda 0
      
      >> . mpiopts.sh                 # same as using "source" to execute script in current shell
    
  • Divide the problem data so that each processor owns only a portion of the columns of the heat matrix as discussed in class.
  • Utilize sends and receives or the combined MPI_Sendrecv to allow processors to communicate with their neighbors (a sketch of this communication pattern appears after this list).
  • Utilize a collective communication operation at the end of the computation to gather all results on Processor 0 and have it print out the entire results matrix if command line args indicate this is necessary.
  • Verify that the output of your MPI version is identical to the output of the serial version which is provided. There are a series of automated tests that help with this which are described later.
  • To be compatible with the automated tests, heat_mpi must produce an exit code of 0; e.g. return 0 at the end of main() as is done in heat_serial.c.
  • Your MPI version is only required to work correctly in the following situations:

    • The width of the rod in elements is evenly divisible by the number of processors being run.
    • The width of the rod is at least three times the number of processors so that each processor would have at least 3 columns associated with it.

    That means the following configurations should work or fail as indicated.

    #Procs  Width  Works?  Notes
         1      1  no      not enough cols
         1      2  no      not enough cols
         1      3  yes     take special care for 1 proc
         4      4  no      only 1 column per proc
         4      8  no      only 2 columns per proc
         4     12  yes     at least 3 cols per proc
         4     16  yes     at least 3 cols per proc
         4     15  no      uneven cols
         3      9  yes     3 cols per proc, evenly divisible
         4     40  yes     evenly divisible, >= 3 cols per proc

    Runs that are marked with "no" in the "Works?" column will not be tested so are free to do anything (segfault, work correctly, print an error and exit immediately, etc.).
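
As mentioned in the list above, here is a minimal sketch of the column-partitioned neighbor-exchange and final-gather pattern. It is not a full solution: the width, step count, and initial values are hard-coded placeholders, the update rule is the one inferred from the sample output in Section 7.1, only the final row is gathered and printed (the real heat_mpi must reproduce every row of heat_serial's output), and no command line handling is shown.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  int width = 16;                              // assume width % nprocs == 0
  int cols  = width / nprocs;                  // columns owned by this proc
  // local slice with one ghost column on each side: indices 0 and cols+1
  double *cur  = calloc(cols+2, sizeof(double));
  double *next = calloc(cols+2, sizeof(double));
  for(int j=1; j<=cols; j++){
    int g = rank*cols + j - 1;                 // global column index
    cur[j] = (g==0) ? 20.0 : (g==width-1) ? 10.0 : 50.0;
  }

  int left  = (rank > 0)        ? rank-1 : MPI_PROC_NULL;
  int right = (rank < nprocs-1) ? rank+1 : MPI_PROC_NULL;

  for(int t=0; t<10; t++){
    // exchange ghost columns; MPI_PROC_NULL turns the boundary procs'
    // sends/receives into no-ops so no special cases are needed here
    MPI_Sendrecv(&cur[1],      1, MPI_DOUBLE, left,  0,
                 &cur[cols+1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&cur[cols],   1, MPI_DOUBLE, right, 1,
                 &cur[0],      1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for(int j=1; j<=cols; j++){
      int g = rank*cols + j - 1;
      if(g==0 || g==width-1){ next[j] = cur[j]; }          // fixed ends
      else { next[j] = 0.5*(cur[j-1] + cur[j+1]); }        // neighbor average
    }
    double *tmp = cur; cur = next; next = tmp;
  }

  // collect every proc's columns on the root and print the final row
  double *all = (rank==0) ? malloc(width*sizeof(double)) : NULL;
  MPI_Gather(&cur[1], cols, MPI_DOUBLE, all, cols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if(rank==0){
    for(int j=0; j<width; j++){ printf("%6.1f", all[j]); }
    printf("\n");
    free(all);
  }
  free(cur); free(next);
  MPI_Finalize();
  return 0;
}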

7.4 Written Summary of the heat_mpi Results

Included with the project code is the file A2-WRITEUP.txt which has a timing table to fill in and a few discussion questions which should be answered.

Time your runs on MSI's Mesabi Cluster. You can SSH to it via ssh myX500@mesabi.msi.umn.edu. Gathering data for the timing table is eased via the provided heat-slurm.sh job script which will run jobs for each of the parameter combinations listed in the timing table. Timing results are printed with the string runtime: prepended to each line. See the earlier examples of how to use sbatch, check the queue via squeue, and find the output files for the completed jobs.

7.5 Automated Tests for heat_mpi

A battery of automated tests is provided to evaluate whether heat_mpi produces correct results on some small examples. The tests are in the file test_heat.org and are run via the testy script, either manually or via make test-prob1. Compliant programs will give results like the following.

>> unset DEBUG                  # ensure that DEBUG output is disabled

>> make test-prob1              # build prob1 program and run tests
mpicc -g -Wall -o heat_mpi heat_mpi.c
./testy test_heat.org
============================================================
== testy test_heat.org
== Running 10 / 10 tests
1)  Procs=1 Width=20           : ok
2)  Procs=1 Width=20 Valgrind  : ok
3)  Procs=2 Width=20           : ok
4)  Procs=2 Width=20 Valgrind  : ok
5)  Procs=2 Width=20 No output : ok
6)  Procs=2 Width=6            : ok
7)  Procs=2 Width=6 Valgrind   : ok
8)  Procs=4 Width=20           : ok
9)  Procs=4 Width=20 Valgrind  : ok
10) Procs=4 Steps=30 Width=40  : ok
============================================================
RESULTS: 10 / 10 tests passed

Failed tests will provide a results file with information that can be studied to gain insight into detected problems with the programs.

Tests are limited to 4 processors max. Some tests run codes under Valgrind to detect memory problems and help diagnose segmentation faults.

7.6 Grading Criteria for Problem 1   grading 40

  CRITERIA
10 Code compiles via make prob1 and passes automated tests via make test-prob1
10 Cleanly written code with good documentation according to a Manual Inspection
10 Written report includes timings table described above
10 Written report includes answers to discussion questions written above.

8 Problem 2: Serial K-Means Clustering

Assignment 1 introduced the K-Means clustering algorithm and provided a Python implementation. If you did not finish your C port of this program, do so now as the MPI version should be based on it.

Make sure to name your program kmeans_serial.c to be compatible with the Makefile provided in the codepack.

Some automated tests are provided for the serial program which check that the output it produces and the clusters it generates match expectations. The tests also evaluate whether there are any memory problems with the program.

Grading Criteria for Problem 2   grading 10

  CRITERIA
5 Code compiles via make prob2 and passes automated tests via make test-prob2
5 Manual Inspection of kmeans_serial.c to assess code quality, errors, etc.
5 Written report includes a description of how the serial code was parallelized

9 Problem 3: MPI K-Means Clustering

Implement a parallel version of the K-means algorithm for distributed memory using MPI calls. Name your program kmeans_mpi.c to be compatible with the provided Makefile. We discussed the basic strategy for parallelizing the code during the lecture on Thu 16-Feb so review the discussion there if you need guidance.

9.1 Implementation Requirements

  1. kmeans_mpi will accept the same command line arguments as the serial version. Like all MPI programs, it will be launched via mpirun such as in

    mpirun -np 4 kmeans_mpi mnist-data/digits_all_1e2.txt 10 outdir1
    
  2. The behavior and output of the MPI version will match the serial version in format; e.g. printed messages and produced output files.
  3. All I/O will be performed only by the root processor. Only the root Processor 0 will read the data file specified for the program. It will send the other processors the information they need to contribute to the computation. At the end of the computation, all cluster assignments and cluster centers will be sent back to the root processor, which will write them to files.

    This requirement is for learning purposes only: in a production setting, one would look to have each processor perform I/O operations in parallel if the system supported it. Such parallel I/O would be facilitated by a binary storage format rather than our text-based version. However, this assignment is designed to give practice using MPI calls, and distributing data then accumulating results presents a good opportunity for this.

  4. The algorithm implemented in the MPI version will follow an input partitioning approach as was discussed in lecture. Other possibilities exist but input partitioning is the most likely to be generally useful and lead to speedups (a sketch of one way to distribute the input rows appears just after this list).
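
As noted in item 4, below is a minimal sketch of distributing rows of the data among processors with MPI_Scatterv. The data here is generated on the root rather than read from a file, DIM is an assumed placeholder for the number of features per row, and the counts/displs arrays also handle a row count that does not divide evenly among the processors (relevant to one of the grading criteria in Section 9.2). It is only one possible approach, not the required implementation.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define DIM 4   // features per row; placeholder for the sketch

int main(int argc, char *argv[]){
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  int total_rows = 10;                         // need not divide evenly
  int *counts = malloc(nprocs*sizeof(int));    // elements sent to each proc
  int *displs = malloc(nprocs*sizeof(int));    // starting offsets
  int offset = 0;
  for(int p=0; p<nprocs; p++){
    int rows = total_rows/nprocs + (p < total_rows%nprocs ? 1 : 0);
    counts[p] = rows*DIM;                      // counts are in elements, not rows
    displs[p] = offset;
    offset += counts[p];
  }

  double *all = NULL;
  if(rank==0){                                 // only the root holds the full data
    all = malloc(total_rows*DIM*sizeof(double));
    for(int i=0; i<total_rows*DIM; i++){ all[i] = i; }
  }
  double *mine = malloc(counts[rank]*sizeof(double));
  MPI_Scatterv(all, counts, displs, MPI_DOUBLE,
               mine, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);
  printf("proc %d of %d received %d rows\n", rank, nprocs, counts[rank]/DIM);

  free(counts); free(displs); free(mine);
  if(rank==0){ free(all); }
  MPI_Finalize();
  return 0;
}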

Implementation Notes

  • As before, follow the general flow provided in the Python implementation. The MPI version should produce identical results, hopefully faster.
  • In our Thu 16-Feb lecture we had a class discussion on how to implement K-Means clustering for distributed memory machines. It may be worthwhile to review this discussion as it lays out information on the data distribution and communication required; a rough sketch of the iteration's communication pattern also appears below.
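
Below is a minimal sketch of the iteration structure for an input-partitioned MPI K-means: each processor assigns its own points to the nearest center, partial sums and counts are combined with MPI_Allreduce, and every processor recomputes the centers. The data is randomly generated in place of real input, K, DIM, and the fixed iteration count are placeholders, and the final gather of labels and the root's file output are omitted. It is meant only to show one possible communication pattern, not the required implementation.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define DIM   4      // features per point; placeholder for the sketch
#define K     3      // number of clusters; placeholder
#define ITERS 10     // fixed iteration count; the real program may differ

int main(int argc, char *argv[]){
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  int local_n = 100;                           // points owned by this proc
  double *local_x = malloc(local_n*DIM*sizeof(double));
  int    *label   = malloc(local_n*sizeof(int));
  srand(rank+1);                               // random stand-in for real data
  for(int i=0; i<local_n*DIM; i++){ local_x[i] = rand()/(double)RAND_MAX; }

  double centers[K*DIM];
  if(rank==0){                                 // root picks initial centers
    for(int c=0; c<K*DIM; c++){ centers[c] = rand()/(double)RAND_MAX; }
  }
  MPI_Bcast(centers, K*DIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  for(int iter=0; iter<ITERS; iter++){
    double sums[K*DIM] = {0};                  // local per-cluster feature sums
    long   counts[K]   = {0};                  // local per-cluster point counts
    for(int i=0; i<local_n; i++){
      int best = 0; double best_d = -1.0;
      for(int c=0; c<K; c++){                  // nearest center by squared distance
        double d = 0.0;
        for(int f=0; f<DIM; f++){
          double diff = local_x[i*DIM+f] - centers[c*DIM+f];
          d += diff*diff;
        }
        if(best_d < 0 || d < best_d){ best_d = d; best = c; }
      }
      label[i] = best;
      counts[best]++;
      for(int f=0; f<DIM; f++){ sums[best*DIM+f] += local_x[i*DIM+f]; }
    }
    // combine partial sums/counts across procs, then recompute centers everywhere
    MPI_Allreduce(MPI_IN_PLACE, sums,   K*DIM, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, counts, K,     MPI_LONG,   MPI_SUM, MPI_COMM_WORLD);
    for(int c=0; c<K; c++){
      if(counts[c] > 0){
        for(int f=0; f<DIM; f++){ centers[c*DIM+f] = sums[c*DIM+f]/counts[c]; }
      }
    }
  }
  // a real kmeans_mpi would now gather labels to the root (e.g. via MPI_Gatherv)
  // and have the root write the output files
  if(rank==0){ printf("final first center coordinate: %f\n", centers[0]); }
  free(local_x); free(label);
  MPI_Finalize();
  return 0;
}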

9.2 Grading Criteria for Problem 3   grading 50

Test cases for kmeans_mpi.c will be released some time after the initial project is distributed. The Makefile already contains a target for them; it just requires the missing test_kmeans_mpi.org file which will contain the tests.

  CRITERIA
10 Code compiles via make prob3, honors command line parameters on interactive runs.
10 Passes automated tests via make test-prob3
5 Cleanly written code with good documentation and demonstrates appropriate use of
  MPI function calls to implement the algorithm such as Broadcast, Scatter, Reduction, Gather, etc.
5 Code handles data size that is not evenly divided by the number of processors by using appropriate
  "vector" versions of MPI calls like MPI_Scatterv()
5 Written report includes timings table described above
10 Written report includes answers to discussion questions written above.

10 Project Submission on Gradescope

NOTE 1: Submission to Gradescope is not yet open; follow the instructions in this section when it opens.

NOTE 2: The instructions below pertain to another class and some of the pictures mention "project" and p1-code which in our case is "assignment" and a2-code. The instructions apply nonetheless and boil down to:

  1. Create a zip of your assignment code via make zip
  2. Upload the code to Gradescope
  3. Check that the automated tests that run on Gradescope match your expectations.
  4. Add your partner to your submission. Only one partner should submit the code.

10.1 Submit to Gradescope

Some of the pictures below may mention a different assignment/project name and files that are not part of the current assignment. The process of uploading a submission is otherwise the same.

  1. In a terminal, change to your project code directory and type make zip which will create a zip file of your code. A session should look like this:

       > cd Desktop/5451/a2-code      # location of assignment code
    
       > ls 
       Makefile    dense_pagerank_mpi.c    heat_serial.c
       ...
    
       > make zip                     # create a zip file using Makefile target
       rm -f a2-code.zip
       cd .. && zip "a2-code/p1-code.zip" -r "a2-code"
         adding: a2-code/ (stored 0%)
         adding: a2-code/Makefile (deflated 68%)
         adding: a2-code/dense_pagerank_mpi.c (deflated 69%)
         adding: a2-code/test_dense_pagerank_mpi.org (deflated 71%)
         ...
       Zip created in a2-code.zip
    
       > ls a2-code.zip
       a2-code.zip
    
  2. Log into Gradescope and locate and click 'Assignment 2' which will open up submission

    gradescope01.png

  3. Click on the 'Drag and Drop' text which will open a file selection dialog; locate and choose your a2-code.zip file

    gradescope02.png

  4. This will show the contents of the Zip file and should include your C source files along with testing files and directories.

    gradescope03.png

  5. Click 'Upload' which will show progress uploading files. It may take a few seconds before this dialog closes to indicate that the upload is successful. Note: there is a limit of 256 files per upload; normal submissions are not likely to have problems with this but you may want to make sure that nothing has gone wrong such as infinite loops creating many files or incredibly large files.

    WARNING: There is a limit of 256 files per zip. Doing make zip will warn if this limit is exceeded but uploading to Gradescope will fail without any helpful messages if you upload more than 256 files in a zip.

    gradescope04.png

  6. Once files have successfully uploaded, the Autograder will begin running the command line tests and recording results. These are the same tests that are run via make test.

    gradescope05.png

  7. When the tests have completed, results will be displayed summarizing scores along with output for each batch of tests.

    gradescope06.png

  8. Don't forget to add your partner to your submission after uploading. Only one partner needs to submit the code.

10.2 Late Policies

You may wish to review the policy on late project submission which will cost 1 Engagement Point per day late. No projects will be accepted more than 48 hours after the deadline.

https://www-users.cs.umn.edu/~kauffman/5451/syllabus.html


Author: Chris Kauffman (kauffman@umn.edu)
Date: 2023-05-03 Wed 14:45