CSCI 2021 HW12: Micro-Optimization Techniques
- Due: 11:59pm 18-Apr-2023
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw12-code.zip
CHANGELOG: Empty
1 Rationale
Program optimization is an important skill, and basic familiarity with hand-coded optimization techniques is handy in some contexts. More importantly, good programmers recognize which optimization techniques impact their programs most and usually focus their attention on optimizing in the following order:
- Algorithms and Data Structure Selection
- Elimination of unneeded work/hidden costs
- Memory Utilization
- Micro-optimizations
This HW explores several micro-optimization techniques and compares them to optimizing memory utilization.
Associated Reading / Preparation
Review the following sections from Bryant and O'Hallaron:
- Ch 4 at a high level; processor architecture is covered but we are primarily concerned with features that pertain to efficient code
- Ch 5 on code optimizations
- Ch 6 on the memory hierarchy, especially writing cache-friendly code
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz, which is linked under Gradescope. The quiz will ask questions similar to those in the QUESTIONS.txt file, and those who complete all answers in QUESTIONS.txt should have no trouble with the quiz.
See the full policies in the course syllabus.
2 Codepack
The codepack for the HW contains the following files:
File | Description |
---|---|
QUESTIONS.txt | Questions to answer |
Makefile | Makefile to build the colmins_main program |
matvec.h | Header file defining some types and functions for |
matvec_util.c | Utility functions to manipulate matrices/vectors |
colmins_funcs.c | Various versions of column min-finding |
colmins_main.c | Main function that times column min-finding techniques |
reversal_benchmark.c | Problem 2 C file for analysis and editing |
warsim/ | Optional Problem 4 directory with application to profile |
3 What to Understand
- Ensure you have a good understanding of how C programs can exploit features of the processor and memory hierarchy to improve program performance.
- Knowledge of basic optimizing code transformations and cache effects is a must.
- Know how to use basic timing functions like clock() to time specific portions of code and report run duration.
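As a minimal sketch of the last two points, the code below times two traversals of the same row-major matrix with clock(). The function names (sum_by_rows, sum_by_cols, elapsed_secs, compare_traversals) and the fill pattern are hypothetical and are not taken from the codepack; they only illustrate the timing pattern and the effect of access order.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

// Sum a rows x cols matrix stored row-major in a flat array, walking
// along each row so consecutive accesses touch adjacent memory.
long sum_by_rows(const int *data, int rows, int cols) {
  long sum = 0;
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      sum += data[(size_t)i * cols + j];
    }
  }
  return sum;
}

// Same sum, walking down each column: consecutive accesses are
// cols elements apart, which tends to use the cache poorly.
long sum_by_cols(const int *data, int rows, int cols) {
  long sum = 0;
  for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
      sum += data[(size_t)i * cols + j];
    }
  }
  return sum;
}

// Convert a pair of clock() readings to elapsed CPU seconds.
double elapsed_secs(clock_t begin, clock_t end) {
  return ((double)(end - begin)) / CLOCKS_PER_SEC;
}

// Build a matrix (rows and cols assumed positive), time both
// traversals, and report the times through the pointer arguments.
// Returns 1 if both traversals computed the same sum, 0 otherwise.
int compare_traversals(int rows, int cols,
                       double *row_secs, double *col_secs) {
  size_t count = (size_t)rows * cols;
  int *data = malloc(count * sizeof(int));
  assert(data != NULL);
  for (size_t k = 0; k < count; k++) {
    data[k] = (int)(k % 10);          // arbitrary small values
  }

  clock_t begin = clock();            // start of the timed region
  long row_sum = sum_by_rows(data, rows, cols);
  clock_t end = clock();              // end of the timed region
  *row_secs = elapsed_secs(begin, end);

  begin = clock();
  long col_sum = sum_by_cols(data, rows, cols);
  end = clock();
  *col_secs = elapsed_secs(begin, end);

  free(data);
  return row_sum == col_sum;
}
```

Calling compare_traversals(8000, 16000, &r, &c) from a main function and printing r and c would report the two durations. On many machines the column-wise traversal is expected to be slower, but the actual numbers depend on the machine and compiler flags.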
4 Questions
Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.
                           _________________

                            HW 12 QUESTIONS
                           _________________


Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.


Note on Experimentation: Run on csel-kh1250-NN
==============================================

  As has been the case in the past, timing execution of code is always
  influenced by the specific machine the code is run on. While you are
  free to run the benchmark codes anywhere on HWs, TAs will be familiar
  with the answers for runs on csel-kh1250-NN.cselabs.umn.edu. For the
  most consistent results, run the codes there.


PROBLEM 1: colmins_main.c
=========================

(A) Timing
~~~~~~~~~~

  Compile and run the provided `colmins_main' program as indicated
  below.

  ,----
  | > make
  | gcc -Wall -g -Og -c colmins_main.c
  | gcc -Wall -g -Og -c colmins_funcs.c
  | gcc -Wall -g -Og -c matvec_util.c
  | gcc -Wall -g -Og -o colmins_main colmins_main.o colmins_funcs.o matvec_util.o
  |
  | > ./colmins_main 8000 16000
  `----

  Notice that the size of the matrix being used is quite large: 8000
  rows by 16000 columns. You may time other sizes but 8000x16000 is
  usually large enough to get beyond obvious cache effects on most
  machines.

  Run the program several times to ensure that you get a good sense of
  what its normal behavior is like: there should be timing differences
  between the different functions reported on.

  Paste the timing results produced for `./colmins_main 8000 16000'
  below.


(B) Tricks
~~~~~~~~~~

  Examine the source code for `colmins_main.c'. Identify the technique
  that is used to avoid a large amount of repeated code to time the
  multiple functions.


PROBLEM 2: Comparisons in colmins_funcs.c
=========================================

(A) col_mins1 baseline
~~~~~~~~~~~~~~~~~~~~~~

  Examine the `col_mins1' function in `colmins_funcs.c' and describe the
  basic approach it uses to solve the problem of finding the minimum of
  each column of a matrix.
  - What pattern of access is used? Is this advantageous for the layout
    of the matrix?
  - What local variables are used versus main memory gets/sets?


(B) col_mins2 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the differences between `col_mins1' and `col_mins2'.
  Particularly comment on
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed


(C) col_mins3 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins3' implements an optimization called loop unrolling. In this
  technique, rather than iterate by single increments, larger steps are
  taken. Each iteration uses multiple local variables to store partial
  results that must eventually be combined. All this is meant to allow
  the processor to execute more instructions in sequence before looping
  back, which may enable more efficient pipelined and superscalar
  operations.

  Examine the differences between `col_mins2' and `col_mins3'.
  Particularly comment on
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations


(D) col_mins4 Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins4' also uses loop unrolling but in a different way than
  `col_mins3'.

  Examine the differences between `col_mins3' and `col_mins4'.
  Particularly comment on
  - What loops are "unrolled" in each and how does this affect the
    remaining code?
  - Any differences in memory access pattern
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations


(E) col_mins5 Comparison: The Real Lesson
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  `col_mins5' is inherently different from all of the other routines.
  Examine its structure carefully and ensure that you understand it as
  it may prove useful in an assignment.

  Particularly comment on
  - Any differences in memory access pattern from the others
  - Any differences in use of local variables/main memory
  - Any differences in speed that might be due to the new optimizations
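The loop unrolling idea described in Problem 2 (C) can be sketched on a simpler problem: finding the minimum of a one-dimensional array. The functions array_min_simple and array_min_unrolled2 below are hypothetical illustrations and are not the col_mins functions from the codepack. The unrolled version steps by 2, keeps two independent partial minimums in local variables, handles a leftover element in a cleanup loop, and combines the partial results at the end.

```c
#include <assert.h>
#include <limits.h>

// Baseline: one element per iteration, one accumulator.
int array_min_simple(const int *arr, int len) {
  assert(len >= 0);
  int min = INT_MAX;
  for (int i = 0; i < len; i++) {
    if (arr[i] < min) {
      min = arr[i];
    }
  }
  return min;
}

// Unrolled by 2: two elements per iteration, each tracked by its own
// local variable so the two comparisons do not depend on each other.
int array_min_unrolled2(const int *arr, int len) {
  assert(len >= 0);
  int min0 = INT_MAX;
  int min1 = INT_MAX;
  int i;
  for (i = 0; i + 1 < len; i += 2) {
    if (arr[i] < min0) {
      min0 = arr[i];
    }
    if (arr[i + 1] < min1) {
      min1 = arr[i + 1];
    }
  }
  // Cleanup loop: picks up the last element when len is odd.
  for (; i < len; i++) {
    if (arr[i] < min0) {
      min0 = arr[i];
    }
  }
  // Combine the partial results.
  return (min0 < min1) ? min0 : min1;
}
```

Both functions return INT_MAX for an empty array. Whether the unrolled version is actually faster depends on the processor and on what the compiler already does at the chosen optimization level, so timing both is the only reliable check.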