Last Updated: 2020-03-12 Thu 11:41

Tool Time Session 3: Unix Text Tools

Table of Contents

1 Metadata

Session Synopsis: Looking for phone numbers in hundreds of HTML files? Need to rename a variable in an entire source tree? Unix is full of small sharp text processing programs for just such occasions, making them essential tools in any power user's utility belt.

2 What's About to Happen?

  • We'll talk about some timeless Unix tools
  • Focus on common tasks that they solve, see some spots where they can be combined, discuss where to find more information
  • Try to surmount the difficulty of getting acquainted with very old but still relevant pieces of software

Thank Yous

  • Joe Finnegan: for making the recordings possible and enshrining all my mistakes permanently in the clogged tubes of the Internet
  • Computer Science Dept: for supporting and advertising the series
  • Institute of Mathematics and its Applications: for lending us Keller 3-180 to do this session
  • Students Past and Present: for showing interest in these tools, pestering me to show them how they work, and showing up today

3 A Drunken Blog Rant

From " The Five Essential Phone-Screen Questions" by Steve Yegge

Let's say you're on my team, and we have to identify the pages having probable U.S. phone numbers in them. To simplify the problem slightly, assume we have 50,000 HTML files in a Unix directory tree, under a directory called "/website". We have 2 days to get a list of file paths to the editorial staff. You need to give me a list of the .html files in this directory tree that appear to contain phone numbers in the following two formats:

(xxx)-xxx-xxxx AND xxx-xxx-xxxx.

How would you solve this problem? Keep in mind our team is on a short (2-day) timeline.

– Steve Yegge

4 Solutions

Here are some facts for you to ponder:

Our Contact Reduction team really did have exactly this problem in 2003. This isn't a made-up example.

Someone on our team produced the list within an hour, and the list supported more than just the 2 formats above.

About 25% to 35% of all software development engineer candidates, independent of experience level, cannot solve this problem, even given the entire interview hour and lots of hints.

Here's one of many possible solutions to the problem:

grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt

If they say, after hearing the question,

"Um… grep?"

then they're probably OK… Heck, if they can tell me where they'd look to find the syntax [for the regular expression], I'm fine with it.

– Steve Yegge

5 Unix Bread and Butter: Text Tools

Unix abounds with text tools such as…

Tool    General Use
        FOCUS ON…
grep    Search files for patterns (regexs)
find    Find files with certain properties in directory trees
sed     Make small transforms to files
awk     Make small to medium transforms to files
        MORE SPECIALIZED BUT ALSO USEFUL…
cat     Show entire contents of files
head    Show first few lines of a file
tail    Show last few lines of a file
tr      Transform chars to other chars in files
cut     Extract columns from columnar files
paste   Combine files in a column-wise fashion
sort    Print files in sorted order
uniq    Show unique lines in sorted files
split   Break file into chunks
diff    Compare two files and show differences

In a terminal try info coreutils to see a giant list of standard Unix text tools 1
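
Many of the more specialized tools shine when chained together with pipes. A minimal sketch, assuming a comma-separated grades.csv with name,email,id,score columns like the one used later in this session:

# pull out the 4th (score) column, sort numerically descending, show the top 3 scores
> cut -d , -f 4 grades.csv | sort -rn | head -3

# count how many entries fall in each email domain (column 2, text after the '@')
> cut -d , -f 2 grades.csv | cut -d @ -f 2 | sort | uniq -c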

6 grep: print lines that match patterns

  • Classic search tool
  • Takes a regular expression
  • Searches file(s) for matches to it

6.1 Anatomy of a Regex

Phone number pattern: (xxx)-xxx-xxxx AND xxx-xxx-xxxx

  • Progressively build up a regex
  • Often done on a test file or two and then broadly applied

Regex 0

123-456-7890

  • Matches the exact characters indicated
  • No special regex chars used to broaden

Regex 1

[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

  • [0-9] means chars in range 0 to 9
  • Also [a-z] or [A-Z] or [abcd] or [aeiou]

Regex 2

[0-9]{3}-[0-9]{3}-[0-9]{4}

  • {3} means repeated 3 times
  • Also {1,3}, {0,10}, {5,}

Regex 3

[0-9]{3}-[0-9]{3}-[0-9]{4}|apple|banana
                          OR    OR

  • Matches 123-456-7890 OR 321-654-0987 OR apple OR banana
  • Pipe symbol as in 'this|that' means this OR that

Regex 4

[0-9]{3}-[0-9]{3}-[0-9]{4}|\([0-9]{3}\)-[0-9]{3}-[0-9]{4}
                          OR          

  • Matches 123-456-7890 OR (321)-654-0987
  • Escape '(' as '\(' since '(' is a special regex char like '['

Regex 5

\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}
^^^        ^^^

  • 'x?' means 0 or 1 'x'
  • Above will match 123-456-7890 OR (123)-456-7890 OR (123-456-7890 OR 123)-456-7890
  • But badly used parens likely don't matter here
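
A handy way to check each stage is to pipe a made-up sample through grep -E; grep echoes back a line that matches and stays silent otherwise (a sketch):

> echo '(123)-456-7890' | grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'
(123)-456-7890
> echo '123 456 7890' | grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'
>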

6.2 Sample greps on Phone Numbers

Basic grep invocation

  • Regex characters like '?' and '{' must be "escaped" via '\?' and '\{' to take on special meaning
  • Otherwise characters like '(' match exactly
> grep '(\?[0-9]\{3\})\?-[0-9]\{3\}-[0-9]\{4\}' phone-numbers.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...

grep With -E: Extended regexs

  • Regex characters like '?' and '{' interpreted specially
  • Escape characters like '(' via '\(' to interpret them literally
> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers.txt 
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
(218)-781-1788
...

grep prints whole lines when it finds a match

> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt 
(218)-589-6764 Landline, from Dalton, MN(state),USA (507)-209-5649
952-474-0698 Landline, from Minneapolis, MN(state),USA 612-266-0909
Landline, from Saint Paul, MN(state),USA (218)-781-1788 Landline, from
(507)-510-6175 Landline, from Sherburn, MN(state),USA 952-843-4789
Landline, from Minneapolis, MN(state),USA 320-254-3105 Landline, from
...

Option -o will print only the text that matches the regex

> grep -o -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...

When searching multiple files, use -l to show names of files that match rather than lines.

> grep -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers* gettysburg.txt 
phone-numbers-irregular.txt
phone-numbers.txt
> 

When searching whole directories, use recursive -r searches, often with -l in conjunction.

> grep -r -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' .
./phone-numbers.txt
./phone-numbers-irregular.txt
./search-dir/subdir/phone-numbers-irregular.txt
./search-dir/phone-numbers.txt
./search-dir/phone-numbers-irregular.txt

Grep has many more options that are useful in certain contexts such as:

  • -c: count how many matches
  • -n: show line number of matches
  • -v: invert matches (lines that don't match)
  • -L: show file names that don't have a matching line
  • -i: case insensitive search (capitalization doesn't matter)
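
A couple of these in action on the gettysburg.txt file used later (a sketch):

# count lines containing 'nation', ignoring capitalization
> grep -i -c 'nation' gettysburg.txt

# show matching lines with their line numbers
> grep -i -n 'nation' gettysburg.txt

# show lines that do NOT contain 'nation'
> grep -i -v 'nation' gettysburg.txt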

6.3 Example: Finding a Student From a Previous Class

  • I get asked for recommendation letters by students and need to find what classes they took and how they scored
  • Often I grep all class directories for student name/email to quickly figure this out
>  ls 
cs123  cs456  cs789  names-files

>  grep -r -i farrar cs*
cs456/grades-CS456.csv:Mi Farrar,farrar@college.edu,1269,28.64
cs789/grades-CS789.csv:Mi Farrar,farrar@college.edu,3708,68.23

>  grep -r -i blea cs*
cs789/grades-CS789.csv:Meri Blea,blea@college.edu,155,3.24

>  grep -r -i mcnelly cs*
cs123/grades-CS123.csv:Olympia Mcnelly,mcnelly@college.edu,1628,97.64

6.4 Regex Non-uniformity

Some people, when confronted with a problem, think

"I know, I'll use regular expressions."

Now they have two problems.

– Attributed to Jamie Zawinski

Yegge's Solution is:

grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt
             ^^^^^^^^^^^  boundary    whitespace  digit      boundary

  • Regexes are a family of mini-languages without much standardization
  • Each program tends to have its own subtle regex variants and tricks
  • Even between grep / sed / awk there are some subtle variations of what is accepted
  • Emacs, Vi, Java, etc. have their own versions
  • Perl Compatible Regular Expressions (PCRE), descended from the Perl language's regex implementation, offer a large amount of power and appear as a variant in some places, such as grep's --perl-regexp / -P option
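
A small illustration of the dialects (a sketch; -P requires a GNU grep built with PCRE support):

# the same idea in three dialects: basic (escaped braces), extended, Perl-compatible
> grep    '[0-9]\{3\}-[0-9]\{4\}' phone-numbers.txt
> grep -E '[0-9]{3}-[0-9]{4}'     phone-numbers.txt
> grep -P '\d{3}-\d{4}'           phone-numbers.txt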

7 find: Finding Files with Properties

7.1 find Basics

  • grep is good for searching for text patterns in files
  • May also want to search for files with other properties
  • Extension (type), Size, Modification Date, etc.
  • The find utility allows for this, doing recursive searches of a directory tree
  • Its simplest invocation reports all files recursively in a directory

      > cd grades
      > find .                    # show current dir recursively
      .
      ./cs789
      ./cs789/grades-CS789.csv
      ./cs456
      ./cs456/grades-CS456.csv
      ./cs123
      ./cs123/grades-CS123.csv
      ./names-files
      ./names-files/names2.txt
      ./names-files/names3.txt
      ./names-files/names-to-csv.awk
      ./names-files/names1.txt
    
  • Simple invocations can limit which file names/extensions are reported

      > find -name '*.csv'
      ./cs789/grades-CS789.csv
      ./cs456/grades-CS456.csv
      ./cs123/grades-CS123.csv
    
  • Use of *.csv is a Shell Glob, another pattern language separate from regexs, supported by many tools like shells and find
  • find has tons of options to filter results, as shown in the next few examples
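
Beyond -name, a few other commonly used tests (a sketch; -type, -size, and -mtime are standard find options):

# regular files only, skipping directories
> find . -type f

# files modified within the last 7 days
> find . -type f -mtime -7

# CSV files modified in the last 30 days (tests combine with an implicit AND)
> find . -name '*.csv' -mtime -30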

7.2 Examples on Klobuchar Web Site

Following are examples from a web scrape of Amy Klobuchar's web site https://amyklobuchar.com/ on Tue 3/3/2020 (Super Tuesday). It was retrieved using

wget https://amyklobuchar.com/ -r -k -p 

Filter Extensions then Grep

Find all files which end in the .html extension

> find . -name '*.html'
./amyklobuchar.com/feed/atom/index.html
./amyklobuchar.com/feed/index.html
./amyklobuchar.com/policies/amys-plan-for-economic-justice-and-opportunity-for-communities-of-color/index.html
./amyklobuchar.com/policies/senator-klobuchars-criminal-justice-reform-plan/index.html
./amyklobuchar.com/policies/senator-klobuchars-plan-for-comprehensive-immigration-reform/index.html
./amyklobuchar.com/policies/index.html
...

Find HTML files and run grep on them

> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;
800-452-7570
800-452-7570
800-452-7570
515-214-8933
202-662-7452
202-662-7452
202-662-7452
603-283-0797
603-352-1234
603-668-4321
603-668-4321
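
An efficiency aside (not part of the original run): terminating -exec with + instead of \; hands many file names to a single grep invocation instead of starting grep once per file; when given multiple files, grep also prefixes each match with the file it came from.

# same search, batched into as few grep runs as possible
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} +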

Filter on Large Files

Find files sized 1 megabyte or larger

klobuchar> find . -size +1M
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...

Find files over 1MB and show their sizes by exec'ing du -h on them.

klobuchar> find . -size +1M -exec du -h {} \;
1.8M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
2.0M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
1.5M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
2.2M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
2.0M	./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...

Deleting all Large Files

Good for when you suspect there are old useless files that are too large to be worth keeping (but be careful).

klobuchar> find . -size +1M -delete
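
A cautious habit (a suggestion, not part of the original cleanup): run the same tests with -print first, eyeball the list, and only then swap in -delete.

# dry run: list what would be deleted
klobuchar> find . -size +1M -print

# if the list looks right, delete for real
klobuchar> find . -size +1M -delete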

8 sed: the Stream Editor

8.1 Motivation

  • Grep and Find allow location and reporting
  • Sed introduces some limited 'editing' or alteration of files
  • Short for 'stream editor', will see it used to transform text
  • Works line-by-line and most operations work on single lines
  • Sed 'programs' are usually small and specify how to change the text
  • Advice: Don't write sed scripts that are too long; Awk (or Python) is probably better for complex tasks

8.2 Anatomy of a sed program

> sed 'pattern1 operation1; pattern2 operation2; ...' file1.txt file2.txt
  • If pattern is present, operation will be applied only to matching lines
  • Patterns are optional, if not specified, operation applied to all lines of input files

8.3 Gettysburg Examples

# default no transformations, original lines
> sed '' gettysburg.txt |head -3
Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches only line 1, replace the first 'F' with 'P'
> sed '1 s/F/P/' gettysburg.txt |head -3
Pour score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches only line 1, replace 'F' or 'f' with 'P' globally (all occurrences on that line)
> sed '1 s/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches all lines, replace 'F' or 'f' with 'P' globally (all occurrences)
> sed 's/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# 2 actions, replace 'F'/'f' with 'P' globally, then 'P'/'p' with 'F' globally
> sed 's/[Ff]/P/g; s/[Pp]/F/g' gettysburg.txt |head -3
Four score and seven years ago our Fathers brought Forth on this continent, a
new nation, conceived in Liberty, and dedicated to the FroFosition that all men
are created equal.

# don't print by default -n, print lines 6-12
> sed -n '6,12p' gettysburg.txt 
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as a
final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,

# print only lines 13 and 19
> sed -n '13p; 19p' gettysburg.txt 
have consecrated it, far above our poor power to add or detract. The world will
they gave the last full measure of devotion -- that we here highly resolve that

# print only lines which match regex 'dead' : grep-like behavior
> sed -n '/dead/p' gettysburg.txt 
not hallow -- this ground. The brave men, living and dead, who struggled here,
that from these honored dead we take increased devotion to that cause for which
these dead shall not have died in vain -- that this nation, under God, shall

8.4 Regexs to Capture Results

  • Recall phone number example with phone patterns as

    (xxx)-xxx-xxxx AND xxx-xxx-xxxx.
    
    
  • Suppose want to convert all of 1st form to 2nd form as in

    (123)-456-7890 becomes 123-456-7890
    (321)-654-0987 becomes 321-654-0987
    
    
  • Possibly eliminate all '(' and ')' characters

      > sed -E '' phone-numbers.txt |head -3
      (218)-589-6764
      
      Landline, from Dalton, MN(state),USA
      > sed -E 's/\(|\)//g;' phone-numbers.txt |head -3
      218-589-6764
      
      Landline, from Dalton, MNstate,USA
                             ^^^^^^^
    
  • Unfortunately this changes other text in the file as well
  • Need to use part of the matched text in the output
  • Use the following regex

      s/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;
         ( Group1  )   ( Group2 ) ( Group3 ) G1-G2-G3
    
  • Special Parentheses chars '(stuff)' set up a Match Group in regexs
  • Can use the Match Group in the substitution text
  • Full example

      > sed -E 's/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;' phone-numbers.txt |head -3
      218-589-6764
      
      Landline, from Dalton, MN(state),USA
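
A related trick (a small sketch): in the replacement text, '&' stands for whatever the whole regex matched, which is handy for wrapping or annotating matches without setting up match groups.

# surround every phone number with << >> markers, leave other text alone
> sed -E 's/\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}/<<&>>/g' phone-numbers.txt | head -3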
    

8.5 Combined Example

Suppose one wanted to change all phone numbers in all HTML files in a directory to another format:

# find all .html files and search them for the phone# pattern 123-456-7890
> find . -name '*.html' -exec grep -E '([0-9]{3})-([0-9]{3})-([0-9]{4})' {} \;
...
        <strong>Phone:</strong> 603-668-4321<br>
        <strong>Phone:</strong> 603-352-1234<br>
        <strong>Phone:</strong> 603-668-4321<br>
...

# find all .html files, run sed on them
# sed will replace xxx-xxx-xxxx with (xxx) xxx-xxxx in the file
# sed will make a backup of the original with the extension '.bk' 

> find . -name '*.html' -exec sed -E -i.bk 's/([0-9]{3})-([0-9]{3})-([0-9]{4})/\(\1\) \2-\3/g' {} \;

# search for files with original pattern of phone number
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;

# no matches reported

# search for files with new pattern of phone number
> find . -name '*.html' -exec grep -E '\([0-9]{3}\) [0-9]{3}-[0-9]{4}' {} \;
...
        <strong>Phone:</strong> (603) 352-1234<br>
        <strong>Phone:</strong> (603) 668-4321<br>
        <strong>Phone:</strong> (603) 668-4321<br>
...

# shell script wizardry to move all backup files to their originals
> for f in $(find . -name '*.bk');do mv $f ${f/.bk/};done

8.6 More Sed

  • Sed has many other features, such as copying patterns into a 'hold space' and then pasting them back into output (a small taste appears below)
  • Doesn't have variables and loops in the forms that are standard in most languages
  • My experience has been mostly limited to 's/original/replacement/g' scripts; beyond those, tasks call for a more complete language like…
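
Before moving on to that more complete language, here is a small taste of the hold space (a sketch of a classic one-liner):

# emulate 'tac': print gettysburg.txt with its lines in reverse order
# G appends the hold space to each line, h stores the result, $p prints at the end
> sed -n '1!G;h;$p' gettysburg.txt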

9 awk: the awkward, lovable, text processor

9.1 Background

  • Named after original authors Aho, Weinberger, Kernighan, all at Bell labs, responsible for other inconsequential stuff like C, Fortran, Unix, etc.
  • Is a small dynamically typed language for text processing, Turing complete, C-like syntax but Python-like feel
  • Follows Sed's convention of Pattern Action but features variables, loops, functions, etc.
  • For some reason is not wildly popular but has persisted in Unix systems since 1977

9.2 Hello World

> cat hello.awk 
#!/bin/awk -f

BEGIN{
  print "Hello world!"
}
> awk -f hello.awk 
Hello world!
  • BEGIN is a pattern, matches 'beginning of run'
  • Initial line #!/bin/awk -f referred to as a "shebang", short for "shell bang"
    • Indicates that the rest of the file is a script
    • Shell should use the program /bin/awk to interpret the script
    • Shebangs are used for many other 'scripty' languages like Bash, Python, Perl, etc. to make the scripts directly executable
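
With the shebang in place, the script can be made executable and run directly (a sketch; chmod marks the file as executable):

> chmod +x hello.awk
> ./hello.awk
Hello world!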

9.3 Line Patterns and Built-in Vars in awk

More common awk script structures look like:

#!/bin/awk -f

/dead/ {                        #matches lines with 'dead'
  print "This line is morbid:",$0
}

$2 == "hallow" {                  #first 'field' is "dead"
  print "This line is holy:  ",$0
}

NR==3, NR==7 {                  #matches lines 3 to 7
  print "Thes lines are in the middle",NR,":",$0
}
  • /regex/ is a regular expression
  • $1, $2, $3 are the 'fields' of a line, default space separated
  • $0 is the entire current line
  • NR is the 'record number', usually line number for single files

Awk processes line by line, if a pattern applies to a given line, the action is performed. Running above script on a relevant text file:

> awk -f patterns.awk gettysburg.txt 
These lines are in the middle 3 : are created equal.
These lines are in the middle 4 : 
These lines are in the middle 5 : Now we are engaged in a great civil war, testing whether that nation, or any
These lines are in the middle 6 : nation so conceived and so dedicated, can long endure. We are met on a great
These lines are in the middle 7 : battle-field of that war. We have come to dedicate a portion of that field, as a
This line is morbid: not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is holy:   not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is morbid: that from these honored dead we take increased devotion to that cause for which
This line is morbid: these dead shall not have died in vain -- that this nation, under God, shall

9.4 Printing all Fields

  • Can introduce variables such as i without type declarations
  • Awk features for() loops like C
  • Awk also has special built-in variables for "Number of Fields" NF and "Number of Records" (NR, line number)
  • An example is printfields.awk
#!/bin/awk -f
{
  for(i=1; i<=NF; i++){
    print "Line",NR,"Field",i,":",$i;
  }
}

Demonstrated on the gettysburg.txt file

> awk -f printfields.awk gettysburg.txt 
Line 1 Field 1 : Four
Line 1 Field 2 : score
Line 1 Field 3 : and
Line 1 Field 4 : seven
Line 1 Field 5 : years
Line 1 Field 6 : ago
Line 1 Field 7 : our
Line 1 Field 8 : fathers
Line 1 Field 9 : brought
Line 1 Field 10 : forth
Line 1 Field 11 : on
Line 1 Field 12 : this
Line 1 Field 13 : continent,
Line 1 Field 14 : a
Line 2 Field 1 : new
Line 2 Field 2 : nation,
Line 2 Field 3 : conceived
Line 2 Field 4 : in
...

Can change the field separator from space to other things to apply the same awk script to differently formatted data

> awk -F , -f printfields.awk grades.csv |head
Line 1 Field 1 : Henrietta Gamez
Line 1 Field 2 : gamez@college.edu
Line 1 Field 3 : 9240
Line 1 Field 4 : 59.39
Line 2 Field 1 : Lang Singleton
Line 2 Field 2 : singleton@college.edu
Line 2 Field 3 : 3063
Line 2 Field 4 : 57.89
...
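
The separator can also be set inside the script itself via the built-in FS variable; a sketch equivalent to passing -F , on the command line:

#!/bin/awk -f
BEGIN{
  FS = ","          # set the input field separator before any input is read
}
{
  for(i=1; i<=NF; i++){
    print "Line",NR,"Field",i,":",$i;
  }
}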

9.5 Killer feature: Built-in associative arrays

  • Awk has 'associative' arrays
  • These behave like normal arrays with numeric indices
  • BUT they can work as hashes/dictionaries as well
  • Extremely useful when combined with awk's other built-ins (field separation, string handling, etc.)
  • Let's code a 'word frequency' program together to demo this feature
  • Result is as follows
#!/bin/awk -f

# frequency.awk: calculates the frequency of each word that appears in
# a text file and prints it out (in unsorted order). Leverages awk's
# built-in associative arrays along with several other features.

{                               # match every line
  gsub(/[^a-zA-Z ]/," ");       # eliminate non-word characters like ',' and '!'
  for(i=1; i<=NF; i++){         # iterate over each word
    counts[$i] = counts[$i]+1   # use ith word as a key, increment its frequency
  }                             # leverages built-in semantics: not present -> 0
}
END{                            # after processing all lines
  for(key in counts){           # iterate over all keys in the counts array
    print key,":",counts[key]   # print the key (word) and its count
  }
}
  • Running on gettysburg.txt:
# Run script on gettysburg.txt, shows word frequency in unsorted order

> awk -f frequency.awk gettysburg.txt 
God : 1
detract : 1
honored : 1
before : 1
their : 1
people : 3
...
Lincoln : 1
proper : 1
who : 3
which : 2

# pipe results to sort and ask to sort on 3rd field (count) in reverse
# numeric order to get top 10 most frequent words

> awk -f frequency.awk gettysburg.txt | sort -k 3rn |head
that : 13
the : 9
here : 8
to : 8
we : 8
a : 7
and : 6
can : 5
for : 5
have : 5
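
Associative arrays are just as handy for columnar data. A sketch, assuming the grades files from the earlier grep example (name,email,id,score columns), that totals each student's scores across all classes as a one-liner:

# sum field 4 per name in field 1, then print each name with its total
> awk -F , '{total[$1] += $4} END{for(name in total) print name,":",total[name]}' cs*/grades-CS*.csv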

Footnotes:

1

The info documentation system is a counterpart to man pages; info includes deeper discussion of tools, paging forward by pressing space, and hyperlinking of pages by positioning the cursor on a link and pressing Enter. It can also be browsed from within Emacs, which has direct access to the info system via C-h i. Press 'q' to quit info.


Author: Chris Kauffman (kauffman@umn.edu)
Date: 2020-03-12 Thu 11:41