Last Updated: 2020-03-12 Thu 11:41

Tool Time Session 3: Unix Text Tools

Table of Contents

1 Metadata

Session Synopsis: Looking for phone numbers in hundreds of HTML files? Need to rename a variable in an entire source tree? Unix is full of small sharp text processing programs for just such occasions, making them essential tools in any power user's utility belt.

2 What's About to Happen?

  • We'll talk about some timeless Unix tools
  • Focus on common tasks that they solve, see some spots where they can be combined, discuss where to find more information
  • Try to surmount the difficulty of getting acquainted with very old but still relevant pieces of software

Thank Yous

  • Joe Finnegan: for making the recordings possible and enshrining all my mistakes permanently in the clogged tubes of the Internet
  • Computer Science Dept: for supporting and advertising the series
  • Institute of Mathematics and its Applications: for lending us Keller 3-180 to do this session
  • Students Past and Present: for showing interest in these tools, pestering me to show them how they work, and showing up today

3 A Drunken Blog Rant

From " The Five Essential Phone-Screen Questions" by Steve Yegge

Let's say you're on my team, and we have to identify the pages having probable U.S. phone numbers in them. To simplify the problem slightly, assume we have 50,000 HTML files in a Unix directory tree, under a directory called "/website". We have 2 days to get a list of file paths to the editorial staff. You need to give me a list of the .html files in this directory tree that appear to contain phone numbers in the following two formats:

(xxx)-xxx-xxxx AND xxx-xxx-xxxx.

How would you solve this problem? Keep in mind our team is on a short (2-day) timeline.

– Steve Yegge

4 Solutions

Here are some facts for you to ponder:

Our Contact Reduction team really did have exactly this problem in 2003. This isn't a made-up example.

Someone on our team produced the list within an hour, and the list supported more than just the 2 formats above.

About 25% to 35% of all software development engineer candidates, independent of experience level, cannot solve this problem, even given the entire interview hour and lots of hints.

Here's one of many possible solutions to the problem:

grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt

If they say, after hearing the question,

"Um… grep?"

then they're probably OK… Heck, if they can tell me where they'd look to find the syntax [for the regular expression], I'm fine with it.

– Steve Yegge

5 Unix Bread and Butter: Text Tools

Unix abounds with text tools such as…

Tool    General Use
        FOCUS ON…
grep    Search files for patterns (regexs)
find    Find files with certain properties in directory trees
sed     Make small transforms to files
awk     Make small to medium transforms to files
        MORE SPECIALIZED BUT ALSO USEFUL…
cat     Show entire contents of files
head    Show first few lines of a file
tail    Show last few lines of a file
tr      Transform chars to other chars in files
cut     Extract columns from columnar files
paste   Combine files in a column-wise fashion
sort    Print files in sorted order
uniq    Show unique lines in sorted files
split   Break file into chunks
diff    Compare two files and show differences

In a terminal try info coreutils to see a giant list of standard Unix text tools 1
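
Many of the more specialized tools shine when chained together with pipes. A minimal sketch, assuming a comma-separated grades.csv with name,email,id,score columns like the one used later in this session:

# pull out the 4th (score) column, sort numerically descending, show the top 3 scores
> cut -d , -f 4 grades.csv | sort -rn | head -3

# count how many entries fall in each email domain (column 2, text after the '@')
> cut -d , -f 2 grades.csv | cut -d @ -f 2 | sort | uniq -c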

6 grep: print lines that match patterns

  • Classic search tool
  • Takes a regular expression
  • Searches file(s) for matches to it

6.1 Anatomy of a Regex

Phone number pattern: (xxx)-xxx-xxxx AND xxx-xxx-xxxx

  • Progressively build up a regex
  • Often done on a test file or two and then broadly applied

Regex 0

123-456-7890

  • Matches the exact characters indicated
  • No special regex chars used to broaden

Regex 1

[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

  • [0-9] means chars in range 0 to 9
  • Also [a-z] or [A-Z] or [abcd] or [aeiou]

Regex 2

[0-9]{3}-[0-9]{3}-[0-9]{4}

  • {3} means repeated 3 times
  • Also {1,3}, {0,10}, {5,}

Regex 3

[0-9]{3}-[0-9]{3}-[0-9]{4}|apple|banana
                          OR    OR

  • Matches 123-456-7890 OR 321-654-0987 OR apple OR banana
  • Pipe symbol as in 'this|that' means this OR that

Regex 4

[0-9]{3}-[0-9]{3}-[0-9]{4}|\([0-9]{3}\)-[0-9]{3}-[0-9]{4}
                          OR          

  • Matches 123-456-7890 OR (321)-654-0987
  • Escape '(' as '\(' since '(' is a special regex char like '['

Regex 5

\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}
^^^        ^^^

  • 'x?' means 0 or 1 'x'
  • Above will match 123-456-7890 OR (123)-456-7890 OR (123-456-7890 OR 123)-456-7890
  • But badly used parens likely don't matter here
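
A handy way to check each stage is to pipe a made-up sample through grep -E; grep echoes back a line that matches and stays silent otherwise (a sketch):

> echo '(123)-456-7890' | grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'
(123)-456-7890
> echo '123 456 7890' | grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'
>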

6.2 Sample greps on Phone Numbers

Basic grep invocation

  • Regex characters like '?' and '{' must be "escaped" via '\?' and '\{' to take on special meaning
  • Otherwise characters like '(' match exactly
> grep '(\?[0-9]\{3\})\?-[0-9]\{3\}-[0-9]\{4\}' phone-numbers.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...

grep With -E: Extended regexs

  • Regex characters like '?' and '{' interpreted specially
  • Escape characters like '(' via '\(' to interpret them literally
> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers.txt 
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
(218)-781-1788
...

grep prints whole lines when it finds a match

> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt 
(218)-589-6764 Landline, from Dalton, MN(state),USA (507)-209-5649
952-474-0698 Landline, from Minneapolis, MN(state),USA 612-266-0909
Landline, from Saint Paul, MN(state),USA (218)-781-1788 Landline, from
(507)-510-6175 Landline, from Sherburn, MN(state),USA 952-843-4789
Landline, from Minneapolis, MN(state),USA 320-254-3105 Landline, from
...

Option -o will print only the text that matches the regex

> grep -o -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...

When searching multiple files, use -l to show names of files that match rather than lines.

> grep -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers* gettysburg.txt 
phone-numbers-irregular.txt
phone-numbers.txt
> 

When searching whole directories, use recursive -r searches, often with -l in conjunction.

> grep -r -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' .
./phone-numbers.txt
./phone-numbers-irregular.txt
./search-dir/subdir/phone-numbers-irregular.txt
./search-dir/phone-numbers.txt
./search-dir/phone-numbers-irregular.txt

Grep has many more options that are useful in certain contexts such as:

  • -c: count how many matches
  • -n: show line number of matches
  • -v: invert matches (lines that don't match)
  • -L: show file names that don't have a matching line
  • -i: case insensitive search (capitalization doesn't matter)
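
A couple of these in action on the gettysburg.txt file used later (a sketch):

# count lines containing 'nation', ignoring capitalization
> grep -i -c 'nation' gettysburg.txt

# show matching lines with their line numbers
> grep -i -n 'nation' gettysburg.txt

# show lines that do NOT contain 'nation'
> grep -i -v 'nation' gettysburg.txt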

6.3 Example: Finding a Student From a Previous Class

  • I get asked for recommendation letters by students and need to find what classes they took and how they scored
  • Often I grep all class directories for student name/email to quickly figure this out
>  ls 
cs123  cs456  cs789  names-files

>  grep -r -i farrar cs*
cs456/grades-CS456.csv:Mi Farrar,farrar@college.edu,1269,28.64
cs789/grades-CS789.csv:Mi Farrar,farrar@college.edu,3708,68.23

>  grep -r -i blea cs*
cs789/grades-CS789.csv:Meri Blea,blea@college.edu,155,3.24

>  grep -r -i mcnelly cs*
cs123/grades-CS123.csv:Olympia Mcnelly,mcnelly@college.edu,1628,97.64

6.4 Regex Non-uniformity

Some people, when confronted with a problem, think

"I know, I'll use regular expressions."

Now they have two problems.

– Attributed to Jamie Zawinski

Yegge's Solution is:

grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt
             ^^^^^^^^^^^  boundary    whitespace  digit      boundary

  • Regexes are a family of mini-languages without much standardization
  • Each program tends to have its own subtle regex variants and tricks
  • Even between grep / sed / awk there are some subtle variations of what is accepted
  • Emacs, Vi, Java, etc. have their own versions
  • Perl Compatible Regular Expressions (PCRE), descended from the Perl language's regex implementation, offer a large amount of power and appear as a variant in some places, such as grep's --perl-regexp / -P option
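
A small illustration of the dialects (a sketch; -P requires a GNU grep built with PCRE support):

# the same idea in three dialects: basic (escaped braces), extended, Perl-compatible
> grep    '[0-9]\{3\}-[0-9]\{4\}' phone-numbers.txt
> grep -E '[0-9]{3}-[0-9]{4}'     phone-numbers.txt
> grep -P '\d{3}-\d{4}'           phone-numbers.txt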

7 find: Finding Files with Properties

7.1 find Basics

  • grep is good for searching for text patterns in files
  • May also want to search for files with other properties
  • Extension (type), Size, Modification Date, etc.
  • The find utility allows for this, doing recursive searches of a directory tree
  • Its simplest invocation reports all files recursively in a directory

      > cd grades
      > find .                    # show current dir recursively
      .
      ./cs789
      ./cs789/grades-CS789.csv
      ./cs456
      ./cs456/grades-CS456.csv
      ./cs123
      ./cs123/grades-CS123.csv
      ./names-files
      ./names-files/names2.txt
      ./names-files/names3.txt
      ./names-files/names-to-csv.awk
      ./names-files/names1.txt
    
  • Simple invocations can limit which file names/extensions are reported

      > find -name '*.csv'
      ./cs789/grades-CS789.csv
      ./cs456/grades-CS456.csv
      ./cs123/grades-CS123.csv
    
  • Use of *.csv is a Shell Glob, another pattern language separate from regexs, supported by many tools like shells and find
  • find has tons of options to filter results, as shown in the next few examples
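
Beyond -name, a few other commonly used tests (a sketch; -type, -size, and -mtime are standard find options):

# regular files only, skipping directories
> find . -type f

# files modified within the last 7 days
> find . -type f -mtime -7

# CSV files modified in the last 30 days (tests combine with an implicit AND)
> find . -name '*.csv' -mtime -30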

7.2 Examples on Klobuchar Web Site

Following are examples from a web scrape of Amy Klobuchar's web site https://amyklobuchar.com/ on Tue 3/3/2020 (Super Tuesday). It was retrieved using

wget https://amyklobuchar.com/ -r -k -p 

Filter Extensions then Grep

Find all files which end in the .html extension

> find . -name '*.html'
./amyklobuchar.com/feed/atom/index.html
./amyklobuchar.com/feed/index.html
./amyklobuchar.com/policies/amys-plan-for-economic-justice-and-opportunity-for-communities-of-color/index.html
./amyklobuchar.com/policies/senator-klobuchars-criminal-justice-reform-plan/index.html
./amyklobuchar.com/policies/senator-klobuchars-plan-for-comprehensive-immigration-reform/index.html
./amyklobuchar.com/policies/index.html
...

Find HTML files and run grep on them

> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;
800-452-7570
800-452-7570
800-452-7570
515-214-8933
202-662-7452
202-662-7452
202-662-7452
603-283-0797
603-352-1234
603-668-4321
603-668-4321
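
An efficiency aside (not part of the original run): terminating -exec with + instead of \; hands many file names to a single grep invocation instead of starting grep once per file; when given multiple files, grep also prefixes each match with the file it came from.

# same search, batched into as few grep runs as possible
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} +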

Filter on Large Files

Find files sized 1 megabyte or larger

klobuchar> find . -size +1M
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...

Find files over 1MB and show their sizes by exec'ing du -h on them.

klobuchar> find . -size +1M -exec du -h {} \;
1.8M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
2.0M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
1.5M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
2.2M	./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
2.0M	./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...

Deleting all Large Files

Good for when you suspect there are old useless files that are too large to be worth keeping (but be careful).

klobuchar> find . -size +1M -delete
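
A cautious habit (a suggestion, not part of the original cleanup): run the same tests with -print first, eyeball the list, and only then swap in -delete.

# dry run: list what would be deleted
klobuchar> find . -size +1M -print

# if the list looks right, delete for real
klobuchar> find . -size +1M -delete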

8 sed: the Stream Editor

8.1 Motivation

  • Grep and Find allow location and reporting
  • Sed introduces some limited 'editing' or alteration of files
  • Short for 'stream editor', will see it used to transform text
  • Works line-by-line and most operations work on single lines
  • Sed 'programs' are usually small and specify how to change the text
  • Advice: Don't write sed scripts that are too long; Awk (or Python) is probably better for complex tasks

8.2 Anatomy of a sed program

> sed 'pattern1 operation1; pattern2 operation2; ...' file1.txt file2.txt
  • If pattern is present, operation will be applied only to matching lines
  • Patterns are optional, if not specified, operation applied to all lines of input files

8.3 Gettysburg Examples

# default no transformations, original lines
> sed '' gettysburg.txt |head -3
Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches only line 1, replace the first 'F' with 'P'
> sed '1 s/F/P/' gettysburg.txt |head -3
Pour score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches only line 1, replace 'F' or 'f' with 'P' globally (all occurrences on that line)
> sed '1 s/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# matches all lines, replace 'F' or 'f' with 'P' globally (all occurrences)
> sed 's/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

# 2 actions, replace 'F'/'f' with 'P' globally, then 'P'/'p' with 'F' globally
> sed 's/[Ff]/P/g; s/[Pp]/F/g' gettysburg.txt |head -3
Four score and seven years ago our Fathers brought Forth on this continent, a
new nation, conceived in Liberty, and dedicated to the FroFosition that all men
are created equal.

# don't print by default -n, print lines 6-12
> sed -n '6,12p' gettysburg.txt 
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as a
final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,

# print only lines 13 and 19
> sed -n '13p; 19p' gettysburg.txt 
have consecrated it, far above our poor power to add or detract. The world will
they gave the last full measure of devotion -- that we here highly resolve that

# print only lines which match regex 'dead' : grep-like behavior
> sed -n '/dead/p' gettysburg.txt 
not hallow -- this ground. The brave men, living and dead, who struggled here,
that from these honored dead we take increased devotion to that cause for which
these dead shall not have died in vain -- that this nation, under God, shall

8.4 Regexs to Capture Results

  • Recall phone number example with phone patterns as

    (xxx)-xxx-xxxx AND xxx-xxx-xxxx.
    
    
  • Suppose want to convert all of 1st form to 2nd form as in

    (123)-456-7890 becomes 123-456-7890
    (321)-654-0987 becomes 321-654-0987
    
    
  • Possibly eliminate all '(' and ')' characters

      > sed -E '' phone-numbers.txt |head -3
      (218)-589-6764
      
      Landline, from Dalton, MN(state),USA
      > sed -E 's/\(|\)//g;' phone-numbers.txt |head -3
      218-589-6764
      
      Landline, from Dalton, MNstate,USA
                             ^^^^^^^
    
  • Unfortunately this changes other text in the file as well
  • Need to use part of the matched text in the output
  • Use the following regex

      s/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;
         ( Group1  )   ( Group2 ) ( Group3 ) G1-G2-G3
    
  • Special Parentheses chars '(stuff)' set up a Match Group in regexs
  • Can use the Match Group in the substitution text
  • Full example

      > sed -E 's/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;' phone-numbers.txt |head -3
      218-589-6764
      
      Landline, from Dalton, MN(state),USA
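
A related trick (a small sketch): in the replacement text, '&' stands for whatever the whole regex matched, which is handy for wrapping or annotating matches without setting up match groups.

# surround every phone number with << >> markers, leave other text alone
> sed -E 's/\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}/<<&>>/g' phone-numbers.txt | head -3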
    

8.5 Combined Example

Suppose one wanted to change all phone numbers in all HTML files in a directory to another format:

# find all .html files and search them for the phone# pattern 123-456-7890
> find . -name '*.html' -exec grep -E '([0-9]{3})-([0-9]{3})-([0-9]{4})' {} \;
...
        <strong>Phone:</strong> 603-668-4321<br>
        <strong>Phone:</strong> 603-352-1234<br>
        <strong>Phone:</strong> 603-668-4321<br>
...

# find all .html files, run sed on them
# sed will replace xxx-xxx-xxxx with (xxx) xxx-xxxx in the file
# sed will make a backup of the original with the extension '.bk' 

> find . -name '*.html' -exec sed -E -i.bk 's/([0-9]{3})-([0-9]{3})-([0-9]{4})/\(\1\) \2-\3/g' {} \;

# search for files with original pattern of phone number
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;

# no matches reported

# search for files with new pattern of phone number
> find . -name '*.html' -exec grep -E '\([0-9]{3}\) [0-9]{3}-[0-9]{4}' {} \;
...
        <strong>Phone:</strong> (603) 352-1234<br>
        <strong>Phone:</strong> (603) 668-4321<br>
        <strong>Phone:</strong> (603) 668-4321<br>
...

# shell script wizardry to move all backup files to their originals
> for f in $(find . -name '*.bk');do mv $f ${f/.bk/};done

8.6 More Sed

  • Sed has many other features, such as copying patterns into a 'hold space' and then pasting them back into output (a small taste appears below)
  • Doesn't have variables and loops in the forms that are standard in most languages
  • My experience has been mostly limited to 's/original/replacement/g' scripts; beyond those, tasks call for a more complete language like…
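
Before moving on to that more complete language, here is a small taste of the hold space (a sketch of a classic one-liner):

# emulate 'tac': print gettysburg.txt with its lines in reverse order
# G appends the hold space to each line, h stores the result, $p prints at the end
> sed -n '1!G;h;$p' gettysburg.txt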

9 awk: the awkward, lovable, text processor

9.1 Background

  • Named after original authors Aho, Weinberger, Kernighan, all at Bell labs, responsible for other inconsequential stuff like C, Fortran, Unix, etc.
  • Is a small dynamically typed language for text processing, Turing complete, C-like syntax but Python-like feel
  • Follows Sed's convention of Pattern Action but features variables, loops, functions, etc.
  • For some reason is not wildly popular but has persisted in Unix systems since 1977

9.2 Hello World

> cat hello.awk 
#!/bin/awk -f

BEGIN{
  print "Hello world!"
}
> awk -f hello.awk 
Hello world!
  • BEGIN is a pattern, matches 'beginning of run'
  • Initial line #!/bin/awk -f referred to as a "shebang", short for "shell bang"
    • Indicates that the rest of the file is a script
    • Shell should use the program /bin/awk to interpret the script
    • Shebangs are used for many other 'scripty' languages like Bash, Python, Perl, etc. to make the scripts directly executable
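
With the shebang in place, the script can be made executable and run directly (a sketch; chmod marks the file as executable):

> chmod +x hello.awk
> ./hello.awk
Hello world!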

9.3 Line Patterns and Built-in Vars in awk

More common awk script structures look like:

#!/bin/awk -f

/dead/ {                        #matches lines with 'dead'
  print "This line is morbid:",$0
}

$2 == "hallow" {                  #first 'field' is "dead"
  print "This line is holy:  ",$0
}

NR==3, NR==7 {                  #matches lines 3 to 7
  print "Thes lines are in the middle",NR,":",$0
}
  • /regex/ is a regular expression
  • $1, $2, $3 are the 'fields' of a line, default space separated
  • $0 is the entire current line
  • NR is the 'record number', usually line number for single files

Awk processes line by line, if a pattern applies to a given line, the action is performed. Running above script on a relevant text file:

> awk -f patterns.awk gettysburg.txt 
These lines are in the middle 3 : are created equal.
These lines are in the middle 4 : 
These lines are in the middle 5 : Now we are engaged in a great civil war, testing whether that nation, or any
These lines are in the middle 6 : nation so conceived and so dedicated, can long endure. We are met on a great
These lines are in the middle 7 : battle-field of that war. We have come to dedicate a portion of that field, as a
This line is morbid: not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is holy:   not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is morbid: that from these honored dead we take increased devotion to that cause for which
This line is morbid: these dead shall not have died in vain -- that this nation, under God, shall

9.4 Printing all Fields

  • Can introduce variables such as i without type declarations
  • Awk features for() loops like C
  • Awk also has special built-in variables for "Number of Fields" NF and "Number of Records" (NR, line number)
  • An example is printfields.awk
#!/bin/awk -f
{
  for(i=1; i<=NF; i++){
    print "Line",NR,"Field",i,":",$i;
  }
}

Demonstrated on the gettysburg.txt file

> awk -f printfields.awk gettysburg.txt 
Line 1 Field 1 : Four
Line 1 Field 2 : score
Line 1 Field 3 : and
Line 1 Field 4 : seven
Line 1 Field 5 : years
Line 1 Field 6 : ago
Line 1 Field 7 : our
Line 1 Field 8 : fathers
Line 1 Field 9 : brought
Line 1 Field 10 : forth
Line 1 Field 11 : on
Line 1 Field 12 : this
Line 1 Field 13 : continent,
Line 1 Field 14 : a
Line 2 Field 1 : new
Line 2 Field 2 : nation,
Line 2 Field 3 : conceived
Line 2 Field 4 : in
...

Can change the field separator from space to other things to apply the same awk script to differently formatted data

> awk -F , -f printfields.awk grades.csv |head
Line 1 Field 1 : Henrietta Gamez
Line 1 Field 2 : gamez@college.edu
Line 1 Field 3 : 9240
Line 1 Field 4 : 59.39
Line 2 Field 1 : Lang Singleton
Line 2 Field 2 : singleton@college.edu
Line 2 Field 3 : 3063
Line 2 Field 4 : 57.89
...
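
The separator can also be set inside the script itself via the built-in FS variable; a sketch equivalent to passing -F , on the command line:

#!/bin/awk -f
BEGIN{
  FS = ","          # set the input field separator before any input is read
}
{
  for(i=1; i<=NF; i++){
    print "Line",NR,"Field",i,":",$i;
  }
}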

9.5 Killer feature: Built-in associative arrays

  • Awk has 'associative' arrays
  • These behave like normal arrays with numeric indices
  • BUT they can work as hashes/dictionaries as well
  • Extremely useful when combined with awk's other built-ins (field separation, string handling, etc.)
  • Let's code a 'word frequency' program together to demo this feature
  • Result is as follows
#!/bin/awk -f

# frequency.awk: calculates the frequency of each word that appears in
# a text file and prints it out (in unsorted order). Leverages awk's
# built-in associative arrays along with several other features.

{                               # match every line
  gsub(/[^a-zA-Z ]/," ");       # eliminate non-word characters like ',' and '!'
  for(i=1; i<=NF; i++){         # iterate over each word
    counts[$i] = counts[$i]+1   # use ith word as a key, increment its frequency
  }                             # leverages built-in semantics: not present -> 0
}
END{                            # after processing all lines
  for(key in counts){           # iterate over all keys in the counts array
    print key,":",counts[key]   # print the key (word) and its count
  }
}
  • Running on gettysburg.txt:
# Run script on gettysburg.txt, shows word frequency in unsorted order

> awk -f frequency.awk gettysburg.txt 
God : 1
detract : 1
honored : 1
before : 1
their : 1
people : 3
...
Lincoln : 1
proper : 1
who : 3
which : 2

# pipe results to sort and ask to sort on 3rd field (count) in reverse
# numeric order to get top 10 most frequent words

> awk -f frequency.awk gettysburg.txt | sort -k 3rn |head
that : 13
the : 9
here : 8
to : 8
we : 8
a : 7
and : 6
can : 5
for : 5
have : 5
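
Associative arrays are just as handy for columnar data. A sketch, assuming the grades files from the earlier grep example (name,email,id,score columns), that totals each student's scores across all classes as a one-liner:

# sum field 4 per name in field 1, then print each name with its total
> awk -F , '{total[$1] += $4} END{for(name in total) print name,":",total[name]}' cs*/grades-CS*.csv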

Footnotes:

1

The info documentation system is a counterpart to man pages; info includes deeper discussion of tools, paging forward by pressing space, and hyperlinking of pages by positioning the cursor on a link and pressing Enter. It can also be browsed from within Emacs, which has direct access to the info system via C-h i. Press 'q' to quit info.


Author: Chris Kauffman (kauffman@umn.edu)
Date: 2020-03-12 Thu 11:41