CSCI 2021 Lab12: Functions and Macros and Optimization
- Due: 11:59pm Wed 18-Apr-2023 on Gradescope
- Approximately 1.00% of total grade
CODE DISTRIBUTION: lab12-code.zip
CHANGELOG: Empty
1 Rationale
Optimizing a program often has to happen in a sequence in which each step "unlocks" further performance. That is, certain optimizations have no effect, or even a detrimental effect, on their own and only provide a benefit once other optimizations have been performed. This lab guides students through one such series of optimizations that is common and useful for the current project. It involves the following sequence of changes:
- Replace repeated memory references with local variables
- Replace repeated function calls with inlined Macro versions
- Unroll an inner loop to exploit the processor pipeline
The 2nd optimization allows some exploration of C's Macro system, a programmatic "copy-paste" facility that is used for a variety of purposes such as inserting the contents of another source file (using header files via #include) or defining a sort of global constant as in #define MAX 123. Here it is used to ensure that short fragments of complex code are presented in a readable fashion while still attaining high performance. While a macro use may look like a function call, the underpinnings are different and akin to the Function Inlining optimization performed by compilers.
Grading Policy
Credit for this Lab is earned by completing the exercises here and submitting a Zip of the work to Gradescope. Students are responsible for checking that the results produced locally via make test are reflected on Gradescope after submitting their completed Zip. Successful completion earns 1 Engagement Point.
Lab Exercises are open resource/open collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.
See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
File | State | Description |
---|---|---|
QUESTIONS.txt | EDIT | Questions to answer: fill in the multiple choice selections in this file. |
func_v_macro.c | EDIT | C code to complete and analyze by filling in TODO items |
more_macros.c | Optional | Optional file to analyze to see additional uses for preprocessor macros. |
Makefile | Build | Enables make test and make zip |
QUESTIONS.txt.bk | Backup | Backup copy of the original file to help revert if needed |
QUESTIONS.md5 | Testing | Checksum for answers in questions file |
test_quiz_filter | Testing | Filter to extract answers from Questions file, used in testing |
test_lab12.org | Testing | Tests for this lab |
testy | Testing | Test running scripts |
3 QUESTIONS.txt File Contents
Below are the contents of the QUESTIONS.txt file for the lab. Follow the instructions in it to complete the QUIZ and CODE questions for the lab.
                           __________________

                            LAB 12 QUESTIONS
                           __________________


Lab Instructions
================

  Follow the instructions below to experiment with topics related to
  this lab.
  - For sections marked QUIZ, fill in an (X) for the appropriate
    response in this file. Use the command `make test-quiz' to see if
    all of your answers are correct.
  - For sections marked CODE, complete the code indicated. Use the
    command `make test-code' to check if your code is complete.
  - DO NOT CHANGE any parts of this file except the QUIZ sections as it
    may interfere with the tests otherwise.
  - If your `QUESTIONS.txt' file seems corrupted, restore it by copying
    over the `QUESTIONS.txt.bk' backup file.
  - When you complete the exercises, check your answers with `make test'
    and if all is well, create a zip file with `make zip' and upload it
    to Gradescope. Ensure that the Autograder there reflects your local
    results.
  - IF YOU WORK IN A GROUP only one member needs to submit and then add
    the names of their group.


NOTE: Time on loginNN.cselabs.umn.edu
=====================================

  Timing comparisons below reflect the behavior of the benchmark on the
  machines
  ,----
  | login01.cselabs.umn.edu OR csel-remote-lnx-01.cselabs.umn.edu
  | ...
  | login06.cselabs.umn.edu OR csel-remote-lnx-06.cselabs.umn.edu
  `----
  Run the benchmark there so that your timing allows you to answer quiz
  questions correctly.


CODE: `func_v_macro.c' Program
==============================

  There is nearly complete code provided in the `func_v_macro.c' file
  which implements 4 variants of a `row_sums_XXX()' function. Complete
  the TODO items in these functions so that the program compiles and
  reports run times for all of the variants.

  Correct execution will produce output that looks like the following:
  ,----
  | > ./func_v_macro 100 100000
  | 1.2345e+00 secs : V1 row_sums_func_p
  | 1.2345e+00 secs : V2 row_sums_func_s
  | 1.2345e+00 secs : V3 row_sums_macro
  | 1.2345e+00 secs : V4 row_sums_unroll4
  `----

  NOTE: the times above are not accurate but reflect the format of the
  output. You will analyze several aspects of the timing and reasons for
  the different variants of the `row_sums_XXX()' functions.


QUIZ: Analyzing `func_v_macro.c' Runs
=====================================

  After completing the code in `func_v_macro.c', compile it via `make'
  and then examine the timing results for the 4 variants by running on
  the following parameters.

  ,----
  | # RUN ON csel-kh1250-NN machines like in project 4
  | > ./func_v_macro 100 100000
  | ...
  `----


ORDERING
~~~~~~~~

  Which of the following indicates the relative speed ordering of the 4
  variants (slowest to fastest).
  ,----
  | SLOWEST ................................................................FASTEST
  | - ( ) V4 row_sums_unroll4 / V3 row_sums_macro   / V2 row_sums_func_s / V1 row_sums_func_p
  | - ( ) V4 row_sums_unroll4 / V1 row_sums_func_p  / V3 row_sums_macro  / V2 row_sums_func_s
  | - ( ) V1 row_sums_func_p  / V3 row_sums_macro   / V2 row_sums_func_s / V4 row_sums_unroll4
  | - ( ) V1 row_sums_func_p  / V2 row_sums_func_s  / V3 row_sums_macro  / V4 row_sums_unroll4
  `----


V1 to V2
~~~~~~~~

  Examine the V1 and V2 versions of the `row_sums_XXX()' functions.
  Which of the following best describes the difference between them and
  its effect on performance.
  - ( ) V1 uses pointers to structs while V2 dereferences them to have
    local copies of the structs; V2 runs slightly SLOWER due to
    requiring more overall memory to store a 2nd copy of each struct
  - ( ) V1 uses pointers to structs while V2 dereferences them to have
    local copies of the structs; V2 runs slightly FASTER due to some
    data being cached in registers rather than main memory
  - ( ) V1 uses a Function call while V2 uses a Macro call. Since macro
    calls inline code, V2 runs modestly FASTER.
  - ( ) Trick question: these two versions are identical as they both
    use structs and there is no difference in behavior between pointers
    to structs and local / actual structs. They run at the SAME speed.


Preprocessor Macro Expansion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the results of running ONLY the compiler's preprocessor on the
  program which can be done via the command:
  ,----
  | > gcc -E func_v_macro.c > preprocessed.c
  `----

  After running this command the file `preprocessed.c' contains all the
  text transformations made by the preprocessor to the original source.
  The file will be quite long, ~2500 lines of code.

  What appears for the first few thousand lines of code in the
  preprocessed file?
  - ( ) Lots of C type declarations and function prototypes for standard
    functions like `atoi()' and `malloc()'; these are the results of
    #include'ing header files. Some of the original C code appears after
    the declarations.
  - ( ) A long sequence of assembly instructions. These instructions are
    what allow the C code to be loaded and run. The original C code
    appears after the initial assembly.
  - ( ) The translation of the original C code into assembly but before
    optimization phases in the compiler.

  Examine the code near the end of the `preprocessed.c' file. Which of
  the following best describes how the V2 `row_sums_func_s()' code has
  changed?
  - ( ) It has not changed much; only comments have been removed.
  - ( ) The bodies of the `mget() / vset()' functions have been inserted
    at the points where they were called.
  - ( ) An optimized assembly code version of these functions appears.

  Which of the following best describes how the `row_sums_macro()' code
  has changed?
  - ( ) It has not changed much; only comments have been removed.
  - ( ) The bodies of the `MGET() / VSET()' macros have been inserted at
    the points where they were used.
  - ( ) An optimized assembly code version of these functions appears.


Function vs Macros Calls and Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Which of the following are valid reasons that calling a function in a
  tight computational loop might interfere with the compiler's ability
  to produce fast code?
  - ( ) Calling a function requires specific registers to be used to
    pass arguments
  - ( ) Calling a function means that all caller save registers might
    change thus reducing the number of registers available for use in
    the calling function.
  - ( ) Calling a function jumps control to a different part of code
    which may put more pressure on the instruction and data cache.
  - ( ) Actually all of these are reasons that function calls mess up
    optimization and this relation explains why the Macro versions
    perform better as they force inlining of the function body enabling
    further optimizations by the compiler.


Loop Unrolling
~~~~~~~~~~~~~~

  Which of the following best describes the differences between the code
  in the V3 `row_sums_macro()' and V4 `row_sums_unroll4()' functions?
  - ( ) V3 iterates through each matrix row by 1 element at a time while
    V4 iterates 4 elements at a time
  - ( ) V3 adds a single row element to a single sum per iteration while
    V4 adds 4 different elements to 4 different sums
  - ( ) Because of the looping pattern in V4, it requires a second loop
    to "finish" elements at the ends of rows when the length is not
    evenly divisible by 4
  - ( ) All of these items are true
  - ( ) None of these apply but there are other differences

  Which of the following best explains the speed differences between V3
  and V4?
  - ( ) V4 is FASTER than V3 because its looping pattern favors cache
    more effectively thereby improving throughput: the processor has
    more available data to work on in V4 than in V3
  - ( ) V4 is FASTER than V3 because each loop iteration has more
    independent arithmetic operations that can be executed; this favors
    efficient execution in pipelined / superscalar processors
  - ( ) V4 is SLOWER than V3 because it must add a second loop which
    creates more operations leading to worse performance
  - ( ) V4 is SLOWER than V3 because the additional complexities and
    conditionals in its code create hazards in the processor pipeline
    while V3 has more straight-forward code for the architecture


OPTIONAL: more_macros.c
~~~~~~~~~~~~~~~~~~~~~~~

  You can observe some other uses for `#define' macros in the file
  `more_macros.c'. Again, one can preprocess the C file and observe the
  results using a compiler invocation like
  ,----
  | >> gcc -E more_macros.c > preprocessed.c
  `----

  Examining `preprocessed.c' will show where various capitalized macros
  have been substituted for their definitions. This includes the useful
  `__FILE__' and `__LINE__' macros that are provided in the C standard
  to help print useful debugging information during program runs.
4 Submission
Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed lab zip to Gradescope.