Tool Time Session 3: Unix Text Tools
1 Metadata
Session Synopsis: Looking for phone numbers in hundreds of HTML files? Need to rename a variable across an entire source tree? Unix is full of small, sharp text processing programs for just such occasions, making them essential tools in any power user's utility belt.
- Tooltime Website: http://z.umn.edu/tooltime
- Video Recording of Session 3: https://www.youtube.com/watch?v=Konf2fNxcL0
- Code pack associated with the talk: 03-text-tools-code.zip
- Additional data (Klobuchar web site used in examples): klobuchar.zip (67 Mb)
- Org Files used to generate this page: 03-text-tools.org web-header.org
2 What's About to Happen?
- We'll talk about some timeless Unix tools
- Focus on common tasks that they solve, see some spots where they can be combined, discuss where to find more information
- Try to surmount the difficulty of getting acquainted with some very old but still relevant pieces of software
Thank Yous
- Joe Finnegan: for making the recordings possible and enshrining all my mistakes permanently in the clogged tubes of the Internet
- Computer Science Dept: for supporting and advertising the series
- Institute of Mathematics and its Applications: for lending us Keller 3-180 to do this session
- Students Past and Present: for showing interest in these tools, pestering me to show them how they work, and showing up today
3 A Drunken Blog Rant
From "The Five Essential Phone-Screen Questions" by Steve Yegge
Let's say you're on my team, and we have to identify the pages having probable U.S. phone numbers in them. To simplify the problem slightly, assume we have 50,000 HTML files in a Unix directory tree, under a directory called "/website". We have 2 days to get a list of file paths to the editorial staff. You need to give me a list of the .html files in this directory tree that appear to contain phone numbers in the following two formats:
(xxx)-xxx-xxxx AND xxx-xxx-xxxx.
How would you solve this problem? Keep in mind our team is on a short (2-day) timeline.
– Steve Yegge
4 Solutions
Here are some facts for you to ponder:
Our Contact Reduction team really did have exactly this problem in 2003. This isn't a made-up example.
Someone on our team produced the list within an hour, and the list supported more than just the 2 formats above.
About 25% to 35% of all software development engineer candidates, independent of experience level, cannot solve this problem, even given the entire interview hour and lots of hints.
Here's one of many possible solutions to the problem:
grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt
If they say, after hearing the question,
"Um… grep?"
then they're probably OK… Heck, if they can tell me where they'd look to find the syntax [for the regular expression], I'm fine with it.
– Steve Yegge
5 Unix Bread and Butter: Text Tools
Unix abounds with text tools such as…
Tool | General Use |
---|---|
FOCUS ON… | |
grep | Search files for patterns (regexs) |
find | Find files with certain properties in directory trees |
sed | Make small transforms to files |
awk | Make small to medium transforms to files |
MORE SPECIALIZED BUT ALSO USEFUL… | |
cat | Show entire contents of files |
head | Show first few lines of a file |
tail | Show last few lines of a file |
tr | Transform chars to other chars in files |
cut | Extract columns from columnar files |
paste | Combine files in a column-wise fashion |
sort | Print files in sorted order |
uniq | Show unique lines in sorted files |
split | Break file into chunks |
diff | Compare two files and show differences |
… | |
In a terminal try info coreutils to see a giant list of standard Unix text tools [1]
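The smaller tools in the table earn their keep when chained with pipes. Here is a minimal sketch on made-up comma-separated rows (the names, emails, and scores are invented for illustration); it tallies email domains with cut, sort, and uniq, then upper-cases the name column with tr.

```shell
# Made-up comma-separated rows: name,email,score
rows=$(printf '%s\n' 'Ann,ann@a.edu,90' 'Bo,bo@b.edu,70' 'Cy,cy@a.edu,80')

# cut keeps column 2, a second cut keeps what follows the @,
# sort groups equal domains, uniq -c tallies them, sort -rn ranks them
domains=$(printf '%s\n' "$rows" | cut -d , -f 2 | cut -d @ -f 2 | sort | uniq -c | sort -rn)
printf '%s\n' "$domains"

# tr upper-cases the names from column 1; head shows only the first two
printf '%s\n' "$rows" | cut -d , -f 1 | tr 'a-z' 'A-Z' | head -n 2
```

Each stage does one small job, which makes it easy to build the pipeline up one command at a time and inspect the intermediate output.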
6 grep: print lines that match patterns
- Classic search tool
- Takes a regular expression
- Searches file(s) for matches to it
6.1 Anatomy of a Regex
Phone number pattern: (xxx)-xxx-xxxx AND xxx-xxx-xxxx
- Progressively build up a regex
- Often done on a test file or two and then broadly applied
Regex 0
123-456-7890
- Matches the exact characters indicated
- No special regex chars used to broaden
Regex 1
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
- [0-9] means chars in range 0 to 9
- Also [a-z] or [A-Z] or [abcd] or [aeiou]
Regex 2
[0-9]{3}-[0-9]{3}-[0-9]{4}
- {3} means repeated 3 times
- Also {1,3}, {0,10}, {5,}
Regex 3
[0-9]{3}-[0-9]{3}-[0-9]{4}|apple|banana
                          OR    OR
- Matches 123-456-7890 OR 321-654-0987 OR apple OR banana
- Pipe symbol 'this|that' means this OR that
Regex 4
[0-9]{3}-[0-9]{3}-[0-9]{4}|\([0-9]{3}\)-[0-9]{3}-[0-9]{4}
                          OR
- Matches 123-456-7890 OR (321)-654-0987
- '(' is escaped as '\(' because, like '[', it is a special regex char
Regex 5
\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}
^^^        ^^^
- 'x?' means 0 or 1 'x'
- Above will match 123-456-7890 OR (123)-456-7890 OR (123-456-7890 OR 123)-456-7890
- But badly used parens likely don't matter here
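A quick way to sanity-check a regex as it grows is to feed it a few hand-written strings and see which ones survive. This is a minimal sketch, assuming a grep that accepts -E and -q; the helper name matches_phone and the test strings are made up for illustration.

```shell
# Hypothetical helper: succeeds if the argument contains a match for Regex 5.
matches_phone() {
  printf '%s\n' "$1" | grep -E -q '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'
}

# Try the regex on a few made-up strings.
for s in '123-456-7890' '(123)-456-7890' '(123-456-7890' '12-456-7890' 'apple'; do
  if matches_phone "$s"; then
    echo "match:    $s"
  else
    echo "no match: $s"
  fi
done
```

Keeping a handful of "should match" and "should not match" strings around makes it cheap to re-check the regex after every tweak.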
6.2 Sample greps on Phone Numbers
Basic grep invocation
- Regex characters like '?' and '{' must be "escaped" via '\?' and '\{' to take on special meaning
- Otherwise characters like '(' match exactly
> grep '(\?[0-9]\{3\})\?-[0-9]\{3\}-[0-9]\{4\}' phone-numbers.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...
grep With -E: Extended regexs
- Regex characters like '?' and '{' interpreted specially
- Escape characters like '(' via '\(' to interpret exactly
> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
(218)-781-1788
...
grep prints whole lines when it finds a match
> grep -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt
(218)-589-6764 Landline, from Dalton, MN(state),USA
(507)-209-5649
952-474-0698 Landline, from Minneapolis, MN(state),USA
612-266-0909 Landline, from Saint Paul, MN(state),USA
(218)-781-1788 Landline, from
(507)-510-6175 Landline, from Sherburn, MN(state),USA
952-843-4789 Landline, from Minneapolis, MN(state),USA
320-254-3105 Landline, from
...
Option -o will print only the text that matches the regex
> grep -o -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers-irregular.txt
(218)-589-6764
(507)-209-5649
952-474-0698
612-266-0909
...
When searching multiple files, use -l to show names of files that match rather than lines.
> grep -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' phone-numbers* gettysburg.txt
phone-numbers-irregular.txt
phone-numbers.txt
>
When searching whole directories, use recursive -r searches, often in conjunction with -l.
> grep -r -l -E '\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}' .
./phone-numbers.txt
./phone-numbers-irregular.txt
./search-dir/subdir/phone-numbers-irregular.txt
./search-dir/phone-numbers.txt
./search-dir/phone-numbers-irregular.txt
Grep has many more options that are useful in certain contexts such as:
- -c : count how many lines match
- -n : show line number of matches
- -v : invert matches (lines that don't match)
- -L : show file names that don't have a matching line
- -i : case insensitive search (capitalization doesn't matter)
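To see a few of these options side by side, here is a small sketch run against a throwaway file. The file contents are invented, and mktemp is assumed to be available for creating the temporary file.

```shell
# Made-up sample data written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'Alice 612-266-0909' 'bob has no phone' 'ALICE (218)-589-6764' > "$sample"

phone='\(?[0-9]{3}\)?-[0-9]{3}-[0-9]{4}'

grep -c -E "$phone" "$sample"     # number of lines that match
grep -n -E "$phone" "$sample"     # matching lines prefixed with line numbers
grep -v -E "$phone" "$sample"     # lines that do NOT match
grep -i -c 'alice' "$sample"      # case-insensitive count of matching lines
```

Note that -c counts matching lines rather than individual matches, so a line holding two phone numbers still counts once.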
6.3 Example: Finding a Student From a Previous Class
- I get asked for recommendation letters by students and need to find which classes they took and how they scored
- Often I grep all class directories for student name/email to quickly figure this out
> ls
cs123 cs456 cs789 names-files
> grep -r -i farrar cs*
cs456/grades-CS456.csv:Mi Farrar,farrar@college.edu,1269,28.64
cs789/grades-CS789.csv:Mi Farrar,farrar@college.edu,3708,68.23
> grep -r -i blea cs*
cs789/grades-CS789.csv:Meri Blea,blea@college.edu,155,3.24
> grep -r -i mcnelly cs*
cs123/grades-CS123.csv:Olympia Mcnelly,mcnelly@college.edu,1628,97.64
6.4 Regex Non-uniformity
Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.
– Attributed to Jamie Zawinski
Yegge's solution is:
grep -l -R --perl-regexp "\b(\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}\b" * > output.txt
In this Perl-style syntax, \b is a word boundary, \s is whitespace, and \d is a digit.
- Regexes are a family of mini-languages without much standardization
- Programs tend to each have their own subtle variants and tricks in regexs
- Even between grep / sed / awk there are some subtle variations of what is accepted
- Emacs, Vi, Java, etc. have their own versions
- Perl Compatible Regular Expressions (PCRE) offer a lot of power, descend from the Perl language's regex implementation, and appear as variants in some tools
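The differences are easiest to appreciate by writing one pattern several ways. The sketch below finds "three digits, dash, four digits" in a made-up string using grep's basic syntax, grep's extended syntax, and awk. The awk version spells out each digit class because not every awk implementation accepts {3}-style repetition; that is a portability assumption, not a rule.

```shell
# Same made-up input, three spellings of "three digits, dash, four digits".
input='call 589-6764 today'

# Basic regex: braces need backslashes to act as repetition.
bre=$(printf '%s\n' "$input" | grep -o '[0-9]\{3\}-[0-9]\{4\}')

# Extended regex: braces are special without backslashes.
ere=$(printf '%s\n' "$input" | grep -o -E '[0-9]{3}-[0-9]{4}')

# awk: match() sets RSTART and RLENGTH, which substr() uses to extract the hit.
awk_out=$(printf '%s\n' "$input" |
  awk 'match($0, /[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/) { print substr($0, RSTART, RLENGTH) }')

echo "BRE: $bre"
echo "ERE: $ere"
echo "awk: $awk_out"
```

When a regex that works in one tool fails in another, the escaping rules for '{', '(' and '?' are a good first place to look.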
7 find: Finding Files with Properties
7.1 find Basics
- grep is good for searching for text patterns in files
- May also want to search for files with other properties
- Extension (type), Size, Modification Date, etc.
- The find utility allows for this and does recursive searches of a directory
Its simplest invocation reports all files recursively in a directory
> cd grades
> find .        # show current dir recursively
.
./cs789
./cs789/grades-CS789.csv
./cs456
./cs456/grades-CS456.csv
./cs123
./cs123/grades-CS123.csv
./names-files
./names-files/names2.txt
./names-files/names3.txt
./names-files/names-to-csv.awk
./names-files/names1.txt
Simple invocations limit extensions reported
> find -name '*.csv'
./cs789/grades-CS789.csv
./cs456/grades-CS456.csv
./cs123/grades-CS123.csv
- Use of *.csv is a Shell Glob, another pattern language separate from regexs, supported by many tools like shells and find
- find has tons of options to filter as shown in the next few examples
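Two details are worth trying by hand: quoting the glob so that find (not the shell) expands it, and adding -type f to skip directories whose names happen to match. A minimal sketch on a throwaway tree follows; every file and directory name is made up.

```shell
# Build a small throwaway tree (all names are made up).
tree=$(mktemp -d)
mkdir -p "$tree/cs123" "$tree/old.csv"          # note: old.csv is a directory
touch "$tree/cs123/grades.csv" "$tree/cs123/readme.txt" "$tree/top.csv"

# The quotes keep the shell from expanding *.csv before find sees it.
find "$tree" -name '*.csv'

# -type f reports regular files only, so the old.csv directory drops out.
find "$tree" -type f -name '*.csv' | sort
```

Giving the starting directory explicitly, as done here, also works with find versions that do not default to the current directory.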
7.2 Examples on Klobuchar Web Site
Following are examples from a web scrape of Amy Klobuchar's web site https://amyklobuchar.com/ on Tue 3/3/2020 (super Tuesday). It was retrieved using
wget https://amyklobuchar.com/ -r -k -p
Filter Extensions then Grep
Find all files which end in the .html extension
> find . -name '*.html'
./amyklobuchar.com/feed/atom/index.html
./amyklobuchar.com/feed/index.html
./amyklobuchar.com/policies/amys-plan-for-economic-justice-and-opportunity-for-communities-of-color/index.html
./amyklobuchar.com/policies/senator-klobuchars-criminal-justice-reform-plan/index.html
./amyklobuchar.com/policies/senator-klobuchars-plan-for-comprehensive-immigration-reform/index.html
./amyklobuchar.com/policies/index.html
...
Find HTML files and run grep on them
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;
800-452-7570
800-452-7570
800-452-7570
515-214-8933
202-662-7452
202-662-7452
202-662-7452
603-283-0797
603-352-1234
603-668-4321
603-668-4321
Filter on Large Files
Find files larger than 1 megabyte
klobuchar> find . -size +1M
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...
Find files larger than 1 MB and show their sizes by exec'ing du -h on them.
klobuchar> find . -size +1M -exec du -h {} \;
1.8M ./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-regular-400.svg
2.0M ./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-light-300.svg
1.5M ./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-solid-900.svg
2.2M ./amyklobuchar.com/wp-content/themes/scotchpress/fonts/fa-duotone-900.svg
2.0M ./amyklobuchar.com/wp-content/uploads/2020/01/Screen-Shot-2020-01-18-at-4.04.02-PM-e1579386616249-2000x1069.png
...
Deleting all Large Files
Good for when you suspect there are old useless files that are too large to be worth keeping (but be careful).
klobuchar> find . -size +1M -delete
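Since -delete cannot be undone, one cautious habit is to run the same find without -delete first and read the list. Below is a minimal sketch in a throwaway directory, assuming a find that supports the M size suffix and -delete (GNU find does) and a head that supports -c; the file names are made up.

```shell
# Throwaway directory with one small and one large file (names are made up).
dir=$(mktemp -d)
printf 'tiny\n' > "$dir/small.txt"
head -c 2097152 /dev/zero > "$dir/big.bin"     # 2 MiB of zero bytes

# Step 1: preview. This only prints what would be affected.
find "$dir" -type f -size +1M

# Step 2: the same tests, now with -delete at the end.
find "$dir" -type f -size +1M -delete

ls "$dir"
```

Keeping -delete as the last argument matters: find evaluates its expression left to right, so the tests should come before the action.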
8 sed: the Stream Editor
8.1 Motivation
- Grep and Find allow location and reporting
- Sed introduces some limited 'editing' or alteration of files
- Short for 'stream editor', will see it used to transform text
- Works line-by-line and most operations work on single lines
- Sed 'programs' are usually small and specify how to change the text
- Advice: Don't write sed scripts that are too long as Awk (or Python) are probably better for complex tasks
8.2 Anatomy of a sed program
> sed 'pattern1 operation1; pattern2 operation2; ...' file1.txt file2.txt
- If pattern is present, operation will be applied only to matching lines
- Patterns are optional, if not specified, operation applied to all lines of input files
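The pattern/operation structure can be tried on a few lines of made-up input before pointing sed at real files. This sketch shows an operation with no pattern, one restricted by line number, and a two-command program mixing a regex pattern with a line number.

```shell
# Three made-up input lines.
lines=$(printf '%s\n' 'apple pie' 'banana split' 'apple tart')

# No pattern: the substitution runs on every line.
printf '%s\n' "$lines" | sed 's/a/A/'

# Line-number pattern: only line 2 is changed.
printf '%s\n' "$lines" | sed '2 s/a/A/'

# Regex pattern on the first command, line-number pattern on the second.
result=$(printf '%s\n' "$lines" | sed '/apple/ s/apple/cherry/; 2 s/banana/BANANA/')
printf '%s\n' "$result"
```

Without a trailing g, each s command changes only the first match on a line, which is why 'banana' becomes 'bAnana' rather than 'bAnAnA' in the second command.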
8.3 Gettysburg Examples
# default no transformations, original lines
> sed '' gettysburg.txt |head -3
Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.
# matches only line 1, substitute 'P' for the first 'F'
> sed '1 s/F/P/' gettysburg.txt |head -3
Pour score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.
# matches only line 1, substitute 'P' for every 'F' or 'f' (g: all occurrences on the line)
> sed '1 s/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.
# matches all lines, substitute 'P' for 'F' or 'f' globally (all occurrences)
> sed 's/[Ff]/P/g' gettysburg.txt |head -3
Pour score and seven years ago our Pathers brought Porth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.
# 2 actions: replace 'F'/'f' with 'P' globally, then replace 'P'/'p' with 'F' globally
> sed 's/[Ff]/P/g; s/[Pp]/F/g' gettysburg.txt |head -3
Four score and seven years ago our Fathers brought Forth on this continent, a
new nation, conceived in Liberty, and dedicated to the FroFosition that all men
are created equal.
# don't print by default -n, print lines 6-12
> sed -n '6,12p' gettysburg.txt
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as a
final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,
# print only lines 13 and 19
> sed -n '13p; 19p' gettysburg.txt
have consecrated it, far above our poor power to add or detract. The world will
they gave the last full measure of devotion -- that we here highly resolve that
# print only lines which match regex 'dead' : grep-like behavior
> sed -n '/dead/p' gettysburg.txt
not hallow -- this ground. The brave men, living and dead, who struggled here,
that from these honored dead we take increased devotion to that cause for which
these dead shall not have died in vain -- that this nation, under God, shall
8.4 Regexs to Capture Results
Recall phone number example with phone patterns as
(xxx)-xxx-xxxx AND xxx-xxx-xxxx.
Suppose we want to convert all of the 1st form to the 2nd form, as in
(123)-456-7890 becomes 123-456-7890
(321)-654-0987 becomes 321-654-0987
Possibly eliminate all '(' and ')' characters
> sed -E '' phone-numbers.txt |head -3
(218)-589-6764 Landline, from Dalton, MN(state),USA
> sed -E 's/\(|\)//g;' phone-numbers.txt |head -3
218-589-6764 Landline, from Dalton, MNstate,USA
                                    ^^^^^^^
- Unfortunately changes other text in the file as well
- Need to use part of matched text in output
Use following regex
s/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;
    ( Group1 )   ( Group2 ) ( Group3 ) G1-G2-G3
- Special Parentheses chars '(stuff)' set up a Match Group in regexs
- Can use the Match Group in the substitution text
Full example
> sed -E 's/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g;' phone-numbers.txt |head -3
218-589-6764 Landline, from Dalton, MN(state),USA
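To convince yourself that the match groups leave the rest of the line alone, it helps to run the substitution on lines that mix both phone formats with other parenthesized text. This is a minimal sketch; the helper name strip_area_parens and the third input line are made up for illustration.

```shell
# Hypothetical helper wrapping the substitution from this section.
strip_area_parens() {
  sed -E 's/\(([0-9]{3})\)-([0-9]{3})-([0-9]{4})/\1-\2-\3/g'
}

printf '%s\n' \
  '(218)-589-6764 Landline, from Dalton, MN(state),USA' \
  '952-474-0698 Landline, from Minneapolis, MN(state),USA' \
  'both (507)-209-5649 and (218)-781-1788' | strip_area_parens
```

Only parentheses that wrap a three-digit area code inside a full phone number are removed, so '(state)' survives, and the trailing g handles lines with more than one number.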
8.5 Combined Example
Suppose one wanted to change all phone numbers in all HTML files in a directory to another format:
# find all .html files and search them for the phone# pattern 123-456-7890
> find . -name '*.html' -exec grep -E '([0-9]{3})-([0-9]{3})-([0-9]{4})' {} \;
...
<strong>Phone:</strong> 603-668-4321<br>
<strong>Phone:</strong> 603-352-1234<br>
<strong>Phone:</strong> 603-668-4321<br>
...
# find all .html files, run sed on them
# sed will replace xxx-xxx-xxxx with (xxx) xxx-xxxx in the file
# sed will make a backup of the original with the extension '.bk'
> find . -name '*.html' -exec sed -E -i.bk 's/([0-9]{3})-([0-9]{3})-([0-9]{4})/\(\1\) \2-\3/g' {} \;
# search for files with original pattern of phone number
> find . -name '*.html' -exec grep -E -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;
# no matches reported
# search for files with new pattern of phone number
> find . -name '*.html' -exec grep -E '\([0-9]{3}\) [0-9]{3}-[0-9]{4}' {} \;
...
<strong>Phone:</strong> (603) 352-1234<br>
<strong>Phone:</strong> (603) 668-4321<br>
<strong>Phone:</strong> (603) 668-4321<br>
...
# shell script wizardry to restore each original from its .bk backup
# (${f%.bk} strips only a trailing .bk; assumes no whitespace in file names)
> for f in $(find . -name '*.bk'); do mv "$f" "${f%.bk}"; done
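The same edit-then-restore cycle can be rehearsed on a throwaway directory before touching a real site. The sketch below is one possible variation, not the session's exact commands: the page content and paths are made up, and the restore step hands each file name to a tiny sh script through find -exec, which copes with spaces in paths.

```shell
# Throwaway site with one made-up page.
site=$(mktemp -d)
mkdir -p "$site/contact"
printf '<strong>Phone:</strong> 603-668-4321<br>\n' > "$site/contact/index.html"

# Rewrite in place, keeping index.html.bk as a backup of the original.
find "$site" -name '*.html' -exec sed -E -i.bk \
  's/([0-9]{3})-([0-9]{3})-([0-9]{4})/(\1) \2-\3/g' {} \;
edited=$(cat "$site/contact/index.html")
echo "after edit:    $edited"

# Undo: move each backup over its edited file. The file name arrives as $1,
# and ${1%.bk} strips only the trailing .bk.
find "$site" -name '*.bk' -exec sh -c 'mv "$1" "${1%.bk}"' sh {} \;
echo "after restore: $(cat "$site/contact/index.html")"
```

Rehearsing on a copy first is cheap insurance, since an in-place edit across a whole tree is hard to eyeball after the fact.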
8.6 More Sed
- Sed has many other features such as copying patterns into a space and then 'pasting' them into output
- Doesn't have variables and loops in the forms that are standard in most languages
- My experience has been limited to mostly 's/original/substi/g' scripts after which tasks want for a more complete language like…
9 awk: the awkward, lovable, text processor
9.1 Background
- Named after original authors Aho, Weinberger, Kernighan, all at Bell Labs, responsible for other inconsequential stuff like C, Fortran, Unix, etc.
- Is a small dynamically typed language for text processing, Turing complete, C-like syntax but Python-like feel
- Follows Sed's convention of Pattern Action but features variables, loops, functions, etc.
- For some reason is not wildly popular but has persisted in Unix systems since 1977
9.2 Hello World
> cat hello.awk
#!/bin/awk -f
BEGIN{ print "Hello world!" }
> awk -f hello.awk
Hello world!
- BEGIN is a pattern, matches 'beginning of run'
- Initial line #!/bin/awk -f referred to as a "shebang", usually explained as short for "hash bang" or "sharp bang"
  - Indicates that the rest of the file is a script
  - Shell should use the program /bin/awk to interpret the script
- Shebangs are used for many other 'scripty' languages like Bash, Python, Perl, etc. to make the scripts directly executable
9.3 Line Patterns and Built-in Vars in awk
More common awk script structures look like:
#!/bin/awk -f
/dead/ {                 # matches lines with 'dead'
  print "This line is morbid:",$0
}
$2 == "hallow" {         # second 'field' is "hallow"
  print "This line is holy: ",$0
}
NR==3, NR==7 {           # matches lines 3 to 7
  print "Thes lines are in the middle",NR,":",$0
}
- /regex/ is a regular expression
- $1, $2, $3 are the 'fields' of a line, default space separated
- $0 is the entire current line
- NR is the 'record number', usually line number for single files
Awk processes input line by line; if a pattern applies to a given line, the action is performed. Running the above script on a relevant text file:
> awk -f patterns.awk gettysburg.txt
Thes lines are in the middle 3 : are created equal.
Thes lines are in the middle 4 :
Thes lines are in the middle 5 : Now we are engaged in a great civil war, testing whether that nation, or any
Thes lines are in the middle 6 : nation so conceived and so dedicated, can long endure. We are met on a great
Thes lines are in the middle 7 : battle-field of that war. We have come to dedicate a portion of that field, as a
This line is morbid: not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is holy: not hallow -- this ground. The brave men, living and dead, who struggled here,
This line is morbid: that from these honored dead we take increased devotion to that cause for which
This line is morbid: these dead shall not have died in vain -- that this nation, under God, shall
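The pattern types can also be mixed in a one-off program given directly on the command line. Here is a minimal sketch on made-up "name score" lines, adding the END pattern (the counterpart of BEGIN) and a numeric comparison on a field.

```shell
# Made-up input: a name and a score on each line.
data=$(printf '%s\n' 'ann 91' 'bob 58' 'cat 77' 'dan 42')

report=$(printf '%s\n' "$data" | awk '
  BEGIN        { print "start" }                # runs once, before any input
  /^b/         { print "b-name:", $1 }          # regex pattern on the whole line
  $2 < 60      { print "low score:", $1, $2 }   # numeric comparison on field 2
  NR==2, NR==3 { print "range:", NR, $0 }       # range of line numbers
  END          { print "lines read:", NR }      # runs once, after all input
')
printf '%s\n' "$report"
```

Every pattern is tested against every line, in the order written, so a single line such as 'bob 58' can trigger several actions.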
9.4 Printing all Fields
- Can introduce variables such as i without type declarations
- Awk features for() loops like C
- Awk also has special built-in variables for "Number of Fields" (NF) and "Number of Records" (NR, line number)
- An example is printfields.awk
#!/bin/awk -f
{
  for(i=1; i<=NF; i++){
    print "Line",NR,"Field",i,":",$i;
  }
}
Demonstrated on the gettysburg.txt file
> awk -f printfields.awk gettysburg.txt
Line 1 Field 1 : Four
Line 1 Field 2 : score
Line 1 Field 3 : and
Line 1 Field 4 : seven
Line 1 Field 5 : years
Line 1 Field 6 : ago
Line 1 Field 7 : our
Line 1 Field 8 : fathers
Line 1 Field 9 : brought
Line 1 Field 10 : forth
Line 1 Field 11 : on
Line 1 Field 12 : this
Line 1 Field 13 : continent,
Line 1 Field 14 : a
Line 2 Field 1 : new
Line 2 Field 2 : nation,
Line 2 Field 3 : conceived
Line 2 Field 4 : in
...
Can change the field separator from space to other things to apply same awk script to differently formatted data
> awk -F , -f printfields.awk grades.csv |head
Line 1 Field 1 : Henrietta Gamez
Line 1 Field 2 : gamez@college.edu
Line 1 Field 3 : 9240
Line 1 Field 4 : 59.39
Line 2 Field 1 : Lang Singleton
Line 2 Field 2 : singleton@college.edu
Line 2 Field 3 : 3063
Line 2 Field 4 : 57.89
...
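Once fields split on commas, small reports over CSV columns become one-liners. This is a minimal sketch on two made-up rows that follow the same name,email,id,score layout; it prints a field pair and then averages the score column.

```shell
# Made-up rows in the same name,email,id,score layout as the grades files.
csv=$(printf '%s\n' 'Ann Lee,lee@college.edu,1001,80.00' 'Bo Diaz,diaz@college.edu,1002,90.00')

# -F , splits on commas, so $1 keeps the space inside the name.
printf '%s\n' "$csv" | awk -F , '{ print $1 " scored " $4 }'

# Accumulate field 4 on every line and report the mean at the end.
mean=$(printf '%s\n' "$csv" | awk -F , '{ total += $4 } END { if (NR > 0) printf "%.2f\n", total / NR }')
echo "mean score: $mean"
```

The variable total needs no declaration or initialization; awk treats an unset variable as 0 in arithmetic. Note that a plain -F , does not understand quoted CSV fields that contain commas.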
9.5 Killer feature: Built-in associative arrays
- Awk has 'associative' arrays
- These behave like normal arrays with numeric indices
- BUT they can work as hashes/dictionaries as well
- Extremely useful when combined with awk's other built-ins (field separation, string allocation, etc.)
- Let's code a 'word frequency' program together to demo this feature
- Result is as follows
#!/bin/awk -f
# frequency.awk: calculates the frequency of each word that appears in
# a text file and prints it out (in unsorted order). Leverages awk's
# built-in associative arrays along with several other features.
{                                  # match every line
  gsub(/[^a-zA-Z ]/," ");          # eliminate non-word characters like ',' and '!'
  for(i=1; i<=NF; i++){            # iterate over each word
    counts[$i] = counts[$i]+1      # use ith word as a key, increment its frequency
  }                                # leverages built-in semantics: not present -> 0
}
END{                               # after processing all lines
  for(key in counts){              # iterate over all keys in the counts array
    print key,":",counts[key]      # print the key (word) and its count
  }
}
- Running on gettysburg.txt:
# Run script on gettysburg.txt, shows word frequency in unsorted order
> awk -f frequency.awk gettysburg.txt
God : 1
detract : 1
honored : 1
before : 1
their : 1
people : 3
...
Lincoln : 1
proper : 1
who : 3
which : 2
# pipe results to sort and ask to sort on 3rd field (count) in reverse
# numeric order to get top 10 most frequent words
> awk -f frequency.awk gettysburg.txt | sort -k 3rn |head
that : 13
the : 9
here : 8
to : 8
we : 8
a : 7
and : 6
can : 5
for : 5
have : 5
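For comparison, a similar word count can be put together without awk by chaining tr, sort, and uniq. This is a different technique from frequency.awk, sketched under the assumption that words are runs of ASCII letters; the helper name top_words and the sample sentence are made up.

```shell
# Hypothetical helper: top-N words on stdin using tr, sort, and uniq.
top_words() {
  tr -cs 'A-Za-z' '\n' |     # turn every run of non-letters into one newline
    sort |                   # group identical words together
    uniq -c |                # collapse each group to "count word"
    sort -k1,1nr -k2,2 |     # highest count first, ties in alphabetical order
    head -n "${1:-10}"       # keep the top N (default 10)
}

printf '%s\n' 'we can not dedicate -- we can not consecrate -- we can not hallow' | top_words 3
```

The pipeline is shorter, while the awk version is easier to extend, for example to lower-case words or skip short ones, since it is a real program with variables and loops.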
Footnotes:
1. The info documentation system is a counterpart to man pages; info includes deeper discussion of tools, paging forward by pressing space, and hyperlinking of pages by positioning the cursor on a link and pressing Enter. It usually runs through emacs, which has direct access to the info system via C-h i. Press 'q' to quit info.