Last Updated: 2021-03-10 Wed 11:34

CSCI 4061 HW07: Recursive Directory Traversal / pmap

CODE DISTRIBUTION: hw07-code.zip

CHANGELOG:

Wed Mar 10 11:33:39 AM CST 2021
Some minor errors such as a missing Makefile for memory_parts and a mis-numbered problem have been corrected.

1 Rationale

A variety of applications require code to traverse an entire directory structure visiting all files present in to perform some operation. At the shell level, utilities like find are good at doing this. For programs requiring lower-level interactions, such as version control software (e.g. git, svn, cvs), the C and system call interfaces to traversing directories are usually used. As a tree-like structure, code that traverses directories will almost always have a recursive character to it whether explicit or implicit.

This HW explores code to visit all files an a directory hierarchy using standard system calls. The nftw() library call which encodes this functionality is also examined. Completing the HW will inform on how to accomplish this task without breaking the brain.

In addition, the pmap utility is useful to show portions of the page table associated with a running program and is explored in one problem.

Associated Reading / Preparation

Stevens/Rago Ch 4.20-4.23 discuss functions like opendir() / readdir() in the context of traversing a directory hierarchy. They allude to the nftw() function and demonstrate their own implementation of it.

Consult the manual page on pmap for more information on it.

Grading Policy

Credit for this HW is earned by taking the associate Quiz which is linked under Gradescope. The quiz will ask similar questions as those that are present in the QUESTIONS.txt file and those that complete all answers in QUESTIONS.txt should have no trouble with the quiz.

See the full policy in the syllabus.

2 Codepack

The codepack contains the following files:

File Description
QUESTIONS.txt Questions to answer
nftw/ Directory for Problems 1-3
Makefile Run make to set up the required directory structure and build programs for this HW
recursive_listing.c Problem 1 code to analyze
print_file_info.c Problem 2 code to analyze
nftw_listing.c Problem 3 code to analyze
qsort_demo.c Optional file demonstrating qsort() function
pmap-memory-parts/ Directory for Problem 4
memory_parts.c Problem 4 code to analyze

3 What to Understand

Ensure that you understand

  • How to perform a recursive traversal of data in a hierarchy using system calls
  • Why it is usually not necessary to write your own recursive traversal code as library functions often provide this functionality if you know how to access it.
  • How to determine types and other data about files with the stat() / lstat() calls along with associated functions/macros like
  • The type of information that pmap displays and what OS data structures it corresponds to

4 Questions

Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.

                           _________________

                            HW 07 QUESTIONS
                           _________________


- Name: (FILL THIS in)
- NetID: (THE kauf0095 IN kauf0095@umn.edu)

Write your answers to the questions below directly in this text file.
HW quiz questions will be related to the questions in this file.


Overview
========

  Typing `make' will now create the subdirectory `subdir-small' which is
  used in recursive traversals.

  Compile and run the programs in the provide directory. A Makefile is
  present to ease this task. After compiling, experiment with the
  compiled programs as shown below and note the results they print.

  ,----
  | > make
  | chmod u+x make-subdir-small.sh
  | ./make-subdir-small.sh
  | gcc -Wall -g -c nftw_listing.c
  | gcc -Wall -g -c print_file_info.c
  | gcc -Wall -g -o nftw_listing nftw_listing.o print_file_info.o
  | gcc -Wall -g -c recursive_listing.c
  | gcc -Wall -g -o recursive_listing recursive_listing.o print_file_info.o
  | 
  | > ./recursive_listing 
  | T MTIME                    BYTES    FILENAME
  | = ======================== ======== ================
  | D Fri Mar 27 14:32:34 2020     4096 .
  | F Thu Mar 26 14:29:47 2020    21912 ./nftw_demo
  | ...
  | 
  | > ./recursive_listing recursive_listing.c
  | ...
  | 
  | > ./recursive_listing subdir-small
  | ...
  | 
  | > ./nftw_listing 
  | ...
  | > ./nftw_listing recursive_listing.c
  | ...
  | > ./nftw_listing subdir-small
  | ...
  `----

  The bulk of this problem is to examine the provided code to determine
  how library and system calls are used to carry out their task: listing
  the contents of an entire directory hierarchy.


PROBLEM 1: Recursive Directory Traversal
========================================

  Examine the code in `recursive_listing.c'. A number of system calls
  are used within it which are worth studying.


(A) lstat()
~~~~~~~~~~~

  Near the top, a call to `lstat()' is made. As the name implies, this
  call is related to the `stat()' system call.  Describe what these two
  do and how they differ. You man need to consult the manual page for
  them which is obtained via `man 2 stat' (section 2 for system calls).


(B) opendir() / readdir()
~~~~~~~~~~~~~~~~~~~~~~~~~

  Later in the code, the system calls `opendir() / readdir()' are
  employed.  Describe what these system calls do and how they are
  employed in the code. Also mention a set of files that should be
  "ignored" and how this is accomplished.


(C) Base and Recursive Cases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  As the name of the file and constituent functions implies,
  `recursive_listing.c' takes a recursive approach. Some actions are
  performed for all situations in `recursive_print()'. State these. Then
  briefly describe what the Base and Recursive cases are for this code
  are and how they work together to list all files in a directory tree.


PROBLEM 2: Printing File Info
=============================

  Examine the `print_file_info.c' file which contains the function used
  in `recursive_listing.c' to print information on files.


(A) File Kinds
~~~~~~~~~~~~~~

  Describe the technique that is used in `print_file_info()' to
  determine the kind of file (regular, directory, symlink, etc.).
  - Where is this information stored?
  - How can one set a variable to a character like 'D' if the file is a
    directory?

  Note that there are two control mechanisms shown to demonstrate
  determining file kinds, one is used while the other is commented. Both
  work similarly.


(B) Time Printing
~~~~~~~~~~~~~~~~~

  From our previous discussion of the `stat()' system call, recall that
  it produces several pieces of time information about files. One of
  these is the Last Modification Time (`mtime').  This field is stored
  in a in `sb->st_mtime' but is in binary format.

  What function is used to produce a human-readable version of `mtime'
  and what is an annoyning "gotchya" associated with this formatting
  function which is demonstrated `print_file_info()'?


PROBLEM 3: nftw() Library Call
==============================

  Coders have been known to "reinvent the wheel" at times, often due to
  their being unaware of pre-established alternatives.  The code in
  `recursive_listng.c' is simple enough but has a number of subtle
  issues that can be crippling: what if it runs out of file descriptors?
  how should it behave if the files in the directory are changing? what
  if a different operation than print/total are desired?

  To rectify these issues, most modern Unix systems have a library
  function that performs a "File Tree Walk" recursively visiting every
  file in a directory structure and allowing arbitrary actions to be
  performed on it. One such function available on many Unixes is
  `nftw()' with the `n' indicating a limit on the total file descriptors
  used during the file tree walk.

  This problem explores this interesting and useful function.


(A) nftw_listing.c
~~~~~~~~~~~~~~~~~~

  Examine the `nftw_listing.c' program.  Describe the similarities and
  differences you see in this program to `recursive_listing.c'.
  Describe what you infer to be the purpose of the `nftw()' function.


(B) Arguments to nftw()
~~~~~~~~~~~~~~~~~~~~~~~

  One extremely interesting aspect of the `nftw()' function is 2nd
  argument.
  - Describe what is unusual about this argument and relate it to how
    the `print_file_info()' function is written.  You may wish to
    consult `man nftw'.
  - If one wanted to do something other than print file information
    (such as just sum up the total files visited), how could one do this
    with `nftw()'?


(C) Additional Information on nftw()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Several alternatives to `nftw()' exist including the older `ftw()'
  which is a simplified version of it.  Some systems (including Linux)
  feature a `fts()' function which provides an interface similar to
  `opendir() / readdir()' but which recursively visits an entire
  directory tree rather than a single directory.

  While C is not widely known for it Functional Programming facilities,
  there are several places where functions are passed as arguments. The
  C syntax for this is often confusing compared to functional languages
  like Ocaml and Lisp, but it is usually worth the effort as
  higher-order functions that take other functions as arguments can
  provide tremendous flexibility.

  Aside from `nftw()' taking a function to perform on files, a much more
  common application of higher-order functions in C is the standard
  `qsort()' (quick sort) function. It sorts arbitrary types of data
  using a "comparison" function associated with the function. A
  demonstration appears of this appears in `qsort_demo.c' and is worth
  examining.


PROBLEM 4: Virtual Memory and pmap
==================================

(A) memory_parts memory areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the source code for the provided `memory_parts.c'
  program. Identify what region of program memory you expect the
  following variables to be allocated into:
  - global_arr[]
  - local_arr[]
  - malloc_arr


(B) Running memory_parts and pmap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Compile the `memory_parts' using the provided Makefile.
  ,----
  | > make memory_parts
  `----
  Run the program and note that it prints several pieces of information
  - The addresses of several of the variables allocated
  - Its Process ID (PID) which is a unique number used to identify the
    running program. This is an integer.
  For example, the output might be
  ,----
  | > ./memory-parts
  | 0x5605a7c271e9 : main()
  | 0x5605a7c2a0c0 : global_arr
  | 0x7ffe5ff7d600 : local_arr
  | 0x5605a92442a0 : malloc_arr
  | 0x7f1fa7303000 : mmap'd file
  | 0x600000000000 : mmap'd block1
  | 0x600000001000 : mmap'd block2
  | my pid is 8406
  | press any key to continue
  `----
  so the programs PID is 11160

  The program will also stop at this point until a key is pressed. DO
  NOT PRESS A KEY YET.

  Open another terminal and type the following command in that new
  terminal.
  ,----
  | > pmap THE-PID-NUMBER-THAT-WAS-PRINTED-EARLIER
  `----

  Paste the output of pmap below.


(C) Program Addresses vs Mapped Addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  pmap prints out the virtual address space table for the program. The
  leftmost column is a virtual address mapped by the OS for the program
  to some physical location.  The next column is the size of the area of
  memory associated with that starting address. The 3rd column contains
  permissions of the program has for the memory area: r for read, w for
  read, x for execute. The final column is contains any identifying
  information about the memory area that pmap can discern.

  Compare the addresses of variables and functions from the paused
  program to the output. Try to determine the virtual address space in
  which each variable resides and what region of program memory that
  virtual address must belong to (stack, heap, globals, text).  In some
  cases, the identifying information provided by pmap may make this
  obvious.


(D) Min Size of Mapped Areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The minimum size of any virtual area of memory appears to be 4K. Why
  is this the case?


(E) Additional Observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Notice that in addition to the "normal" variables that are mapped,
  there is also an entry for the mmap()'d file 'gettysburg.txt' in the
  virtual address table.  The mmap() function is explored in the next
  problem but note its calling sequence which involves use of a couple
  system calls:
  1. `open()' which is a low level file opening call which returns a
     numeric file descriptor.
  2. `fstat()' which obtains information such as size for an open file
     based on its numeric file descriptor. The `stat()' system call was
     explored earlier in the class and does the same thing provided the
     name of a file.

  There are additional calls to mmap() which allocate memory to the
  program at a specific virtual address. Similar code to this is often
  used to allocate and expand the heap area of memory for programs in
  implementations of `malloc()'.

  Some Unix implementations, Linux among them, include a `mremap()'
  function. This function is related to `mmap()' as `realloc()' is
  related to `malloc()'
  - `mmap()' makes requests to map virtual memory into the address space
    of the program, possibly associated with a file but sometimes just
    DRAM. A specific size is specified for the mapping which remains
    fixed independent of any underlying file.
  - `mremap()' allows the size of the virtual memory mapping to be
    changed, often to expand it if needed. This is useful in cases where
    a file has been `mmap()''d but changes size: `mremap()' can be used
    to grow the memory mapping to account for additions to the file.
    `mremap()' may fail if there is insufficient space to grow a
    specific region of mapping. However, additional options can be
    passed to have the mapping moved to a new location with sufficient
    size in a similar way to `realloc()'. Like `realloc()' though, this
    movement invalidates all existing pointers to and into that area of
    memory.

Author: Chris Kauffman (kauffman@umn.edu)
Date: 2021-03-10 Wed 11:34