Last Updated: 2023-04-10 Mon 11:33

CSCI 2021 HW12: Micro-Optimization Techniques

CODE DISTRIBUTION: hw12-code.zip

CHANGELOG: Empty

1 Rationale

Program optimization is an important skill and basic familiarity with hand-coded optimization techniques is handy in some contexts. More importantly, good programs recognize which optimization techniques impact their programs most and usually focus their attention on optimizing in the following order:

  1. Algorithms and Data Structure Selection
  2. Elimination of unneeded work/hidden costs
  3. Memory Utilization
  4. Micro-optimizations

This HW explores several micro-optimization techniques and compares them to optimizing memory utilization.

Associated Reading / Preparation

Review the following sections from Bryant and O'Hallaron:

  • Ch 4 at a high level; processor architecture is covered but we are primary concerned with features that pertain to efficient code
  • Ch 5 on code optimizations
  • Ch 6 on the memory hierarchy especially writing cache friendly code

Grading Policy

Credit for this HW is earned by taking the associated HW Quiz which is linked under Gradescope. The quiz will ask similar questions as those that are present in the QUESTIONS.txt file and those that complete all answers in QUESTIONS.txt should have no trouble with the quiz.

Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.

See the full policies in the course syllabus.

2 Codepack

The codepack for the HW contains the following files:

File Description
QUESTIONS.txt Questions to answer
Makefile Makefile to build the colmins_main program
matvec.h Header file defining some types and functions for
matvec_util.c Utility functions to manipulate matrices/vectors
colmins_funcs.c Various versions of column min-finding
colmins_main.c Main function that times column min-finding techinques
reversal_benchmark.c Problem 2 C file for analysis and editing
warsim/ Optional Problem 4 directory with application to profile

3 What to Understand

  • Ensure you have a good understanding of how C programs can exploit features of the processor and memory hierarchy to improve program performance.
  • Knowledge of basic optimizing code transformations and cache effects are a must.
  • Know how to use basic timing functions like clock() to time specific portions of code and report run duration.

4 Questions

Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.

                           _________________

                            HW 12 QUESTIONS
                           _________________


Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.


Note on Experimentation: Run on csel-kh1250-NN
==============================================

  As has been the case in the past, timing execution of code is always
  influenced by the specific machine the code is run on. While you are
  free to run the benchmark codes anywhere on HWs, TAs will be familiar
  with the answers for runs on csel-kh1250-NN.cselabs.umn.edu.  For the
  most consistent results, run the codes there.


PROBLEM 1: colmins_main.c
=========================

(A) Timing
~~~~~~~~~~

  Compile and run the provided `colmins_main' program as indicated
  below.

  ,----
  | > make
  | gcc -Wall -g -Og -c colmins_main.c		
  | gcc -Wall -g -Og -c colmins_funcs.c		
  | gcc -Wall -g -Og -c matvec_util.c
  | gcc -Wall -g -Og -o colmins_main colmins_main.o colmins_funcs.o matvec_util.o
  | 
  | > ./colmins_main 8000 16000
  `----

  Notice that the size of the matrix being used is quite large: 8000
  rows by 16000 columns. You may time other sizes but 8000x16000 is
  usually large enough to get beyond obvious cache effects on most
  machines.

  Run the program several times to ensure that you get a good sense of
  what it's normal behavior is like: there should be timing differences
  between the different functions reported on.

  Paste the timing results produced below for `./colmins_main 8000
  16000'


(B) Tricks
~~~~~~~~~~

  Examine the source code for `colmins_main.c'.  Identify the technique
  that is used to avoid a large amount of repeated code to time the
  multiple functions.


PROBLEM 2: Comparisons in colmins_funcs.c
=========================================

(A) col_mins1 baseline
~~~~~~~~~~~~~~~~~~~~~~

  Examine the `col_mins1' function in `colmins_funcs.c' and describe the
  basic approach it uses to solve the problem of finding the minimum of
  each column of a matrix.
  - What pattern of access is used? Is this advantageous for the layout
    of the matrix?
  - What local variables are used versus main memory gets/sets?


(B) col_mins2 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the differences between `col_mins1' and `col_mins2'.
  Particularly comment on
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed


(C) col_mins3 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins3' implements an optimization called loop unrolling.  In this
  technique, rather than iterate by single increments, larger steps are
  taken. Since each iteration uses multiple local variables to store
  partial results that must eventually be combined. All this is meant to
  allow the processor to execute more instructions in sequence before
  looping back which may enable more efficient pipelined and superscalar
  operations.

  Examine the differences between `col_mins2' and `col_mins3'.
  Particularly comment on
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations


(D) col_mins4 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins4' also loop unrolling but in a different way than
  `col_mins3'.

  Examine the differences between `col_mins3' and `col_mins4'.
  Particularly comment on
  - What loops are "unrolled" in each and how does this affect the
    remaining code?
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations


(E) col_mins5 Comparison: The Real Lesson
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins5' is inherently different than all of the other routines.
  Examine its structure carefully and ensure that you understand it as
  it may prove useful in an assignment.  Particularly comment on
  - Any differences in memory access pattern from the others
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations

Author: Chris Kauffman (kauffman@umn.edu)
Date: 2023-04-10 Mon 11:33