CSCI 2021 HW12: Micro-Optimization Techniques
- Due: 11:59pm 18-Apr-2023
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw12-code.zip
CHANGELOG: Empty
1 Rationale
Program optimization is an important skill and basic familiarity with hand-coded optimization techniques is handy in some contexts. More importantly, good programs recognize which optimization techniques impact their programs most and usually focus their attention on optimizing in the following order:
- Algorithms and Data Structure Selection
- Elimination of unneeded work/hidden costs
- Memory Utilization
- Micro-optimizations
This HW explores several micro-optimization techniques and compares them to optimizing memory utilization.
Associated Reading / Preparation
Review the following sections from Bryant and O'Hallaron:
- Ch 4 at a high level; processor architecture is covered but we are primary concerned with features that pertain to efficient code
- Ch 5 on code optimizations
- Ch 6 on the memory hierarchy especially writing cache friendly code
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz which is
linked under Gradescope. The quiz will ask similar questions as
those that are present in the QUESTIONS.txt file and those that
complete all answers in QUESTIONS.txt should have no trouble with
the quiz.
Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
See the full policies in the course syllabus.
2 Codepack
The codepack for the HW contains the following files:
| File | Description |
|---|---|
QUESTIONS.txt |
Questions to answer |
Makefile |
Makefile to build the colmins_main program |
matvec.h |
Header file defining some types and functions for |
matvec_util.c |
Utility functions to manipulate matrices/vectors |
colmins_funcs.c |
Various versions of column min-finding |
colmins_main.c |
Main function that times column min-finding techinques |
reversal_benchmark.c |
Problem 2 C file for analysis and editing |
warsim/ |
Optional Problem 4 directory with application to profile |
3 What to Understand
- Ensure you have a good understanding of how C programs can exploit features of the processor and memory hierarchy to improve program performance.
- Knowledge of basic optimizing code transformations and cache effects are a must.
- Know how to use basic timing functions like
clock()to time specific portions of code and report run duration.
4 Questions
Analyze the files in the provided codepack and answer the questions
given in QUESTIONS.txt.
_________________
HW 12 QUESTIONS
_________________
Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.
Note on Experimentation: Run on csel-kh1250-NN
==============================================
As has been the case in the past, timing execution of code is always
influenced by the specific machine the code is run on. While you are
free to run the benchmark codes anywhere on HWs, TAs will be familiar
with the answers for runs on csel-kh1250-NN.cselabs.umn.edu. For the
most consistent results, run the codes there.
PROBLEM 1: colmins_main.c
=========================
(A) Timing
~~~~~~~~~~
Compile and run the provided `colmins_main' program as indicated
below.
,----
| > make
| gcc -Wall -g -Og -c colmins_main.c
| gcc -Wall -g -Og -c colmins_funcs.c
| gcc -Wall -g -Og -c matvec_util.c
| gcc -Wall -g -Og -o colmins_main colmins_main.o colmins_funcs.o matvec_util.o
|
| > ./colmins_main 8000 16000
`----
Notice that the size of the matrix being used is quite large: 8000
rows by 16000 columns. You may time other sizes but 8000x16000 is
usually large enough to get beyond obvious cache effects on most
machines.
Run the program several times to ensure that you get a good sense of
what it's normal behavior is like: there should be timing differences
between the different functions reported on.
Paste the timing results produced below for `./colmins_main 8000
16000'
(B) Tricks
~~~~~~~~~~
Examine the source code for `colmins_main.c'. Identify the technique
that is used to avoid a large amount of repeated code to time the
multiple functions.
PROBLEM 2: Comparisons in colmins_funcs.c
=========================================
(A) col_mins1 baseline
~~~~~~~~~~~~~~~~~~~~~~
Examine the `col_mins1' function in `colmins_funcs.c' and describe the
basic approach it uses to solve the problem of finding the minimum of
each column of a matrix.
- What pattern of access is used? Is this advantageous for the layout
of the matrix?
- What local variables are used versus main memory gets/sets?
(B) col_mins2 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~
Examine the differences between `col_mins1' and `col_mins2'.
Particularly comment on
- Any differences in memory access pattern
- Any differences in use of local variables/main memory
- Any differences in speed
(C) col_mins3 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~
`col_mins3' implements an optimization called loop unrolling. In this
technique, rather than iterate by single increments, larger steps are
taken. Since each iteration uses multiple local variables to store
partial results that must eventually be combined. All this is meant to
allow the processor to execute more instructions in sequence before
looping back which may enable more efficient pipelined and superscalar
operations.
Examine the differences between `col_mins2' and `col_mins3'.
Particularly comment on
- Any differences in memory access pattern
- Any differences in use of local variables/main memory
- Any differences in speed that might be due to the new optimizations
(D) col_mins4 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~
`col_mins4' also loop unrolling but in a different way than
`col_mins3'.
Examine the differences between `col_mins3' and `col_mins4'.
Particularly comment on
- What loops are "unrolled" in each and how does this affect the
remaining code?
- Any differences in memory access pattern
- Any differences in use of local variables/main memory
- Any differences in speed that might be due to the new optimizations
(E) col_mins5 Comparison: The Real Lesson
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`col_mins5' is inherently different than all of the other routines.
Examine its structure carefully and ensure that you understand it as
it may prove useful in an assignment. Particularly comment on
- Any differences in memory access pattern from the others
- Any differences in use of local variables/main memory
- Any differences in speed that might be due to the new optimizations