Reading: The Unix Programming Environment, Chapter 4
Filters and the Unix philosophy
· Unix has a philosophy of using small programs that each have a specific purpose
· These programs are then combined to produce the result you want
· By giving you a set of "building blocks," Unix lets you handle just about any situation
· Many of these "building blocks" are "filters"
o They take some input, do something to it, and produce some output
· We'll cover a few of these in this section
· Generally speaking, grep searches for patterns in files
o Or in stdin, if no files are given
· The patterns are a class of patterns called regular expressions
o The name grep comes from the ed command g/re/p: "globally search for a regular expression and print"
· Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F
o egrep extends the regular expression syntax
o fgrep does a "fast" search using fixed strings
· Some of the most useful options:
o grep -v prints lines that do not match the pattern
o grep -i is case-insensitive
o grep -n prints the line number before each matching line (and the filename, if more than one file is searched)
o grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)
o grep -l prints only the names of the files that contain a match (very useful on command lines: sort `grep -l …` | …)
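To see the options above side by side, here is a small sketch you can paste into a shell. The file names (a.txt, b.txt, pats.txt) and their contents are made up for illustration; the comments show what each command should print on that data.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Hypothetical sample data.
printf 'Apple pie\nbanana bread\napple tart\n' > a.txt
printf 'cherry cake\n' > b.txt
printf 'tart\ncherry\n' > pats.txt

grep -v apple a.txt           # lines WITHOUT "apple": Apple pie, banana bread
grep -i apple a.txt           # ignoring case: Apple pie, apple tart
grep -n bread a.txt           # 2:banana bread
grep -l cherry a.txt b.txt    # b.txt
grep -f pats.txt a.txt b.txt  # a.txt:apple tart, b.txt:cherry cake
sort `grep -l cherry a.txt b.txt`   # hand the matching file names to another command
```

Note that grep -v apple still prints "Apple pie", because without -i the match is case-sensitive.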
· Regular expressions are basically mini-algorithms that specify how to match text
o Regular expressions look similar to shell patterns, but are quite a bit different
· The simplest regular expression is a single letter, which matches that letter
o a matches a, abcde, or supercalifragilisticexpialidocious
· A sequence of letters matches that sequence
o cat matches cat, caterpillar, or scatological
· The character . (a dot) matches any single character
· The character * indicates zero or more occurrences of the preceding character
o car* matches cat, carry, or carolina
o ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha
· ^ matches the beginning of a line
· $ matches the end of a line
o So ^$ matches an empty line
· [....] matches any one of the characters given, and ranges can be specified
o [0-9] matches any digit
o [0-9]* matches zero or more digits
· [^....] matches any one character other than those listed, and ranges can be specified
o [^0-9] matches any non-digit
· Note that * doesn't match anything itself. It just modifies the meaning of the previous character
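A quick way to convince yourself of the rules above is to run them against a small word list. This is only a sketch: the word list is invented, and it contains one empty line so that ^$ has something to match.

```shell
#!/bin/sh
# Hypothetical word list; note the empty line before x9.
words=$(mktemp)
trap 'rm -f "$words"' EXIT
printf 'cat\ncaterpillar\ncarry\nsaab\nsarah\nmartha\n\nx9\n' > "$words"

grep 'car*' "$words"     # "ca" plus zero or more r: cat, caterpillar, carry
grep 'ar*a' "$words"     # saab, sarah (martha has a "t" in the way)
grep '^c.t$' "$words"    # whole line is c, any one character, t: cat
grep '[0-9]' "$words"    # any line containing a digit: x9
grep -c '^$' "$words"    # number of empty lines: 1
```

Remember that an unanchored pattern matches anywhere in the line, which is why car* is happy with cat.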
· egrep (or grep -E) adds a few more
o The character + matches one or more of the previous character
§ car+ matches car, carr, or carrrrr, but not ca
o The character ? matches zero or one of the previous character
§ car?pet matches capet and carpet, but not ca or carrpet
· (expression1|expression2) matches either expression1 or expression2
· Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
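The extended operators can be tried the same way. In this sketch the word list is again made up, and grep -E is used in place of egrep.

```shell
#!/bin/sh
# Hypothetical word list for the extended operators.
w=$(mktemp)
trap 'rm -f "$w"' EXIT
printf 'ca\ncar\ncarrrrr\ncapet\ncarpet\ncarrpet\ndog\n' > "$w"

grep -E 'car+' "$w"          # at least one r: car, carrrrr, carpet, carrpet
grep -E 'car?pet' "$w"       # at most one r before "pet": capet, carpet
grep -E '(dog|capet)' "$w"   # either alternative: capet, dog
```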
· The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.
· ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
· ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
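Once you have worked out what the two expressions do, you can check your reading with something like the following. The six test words are my own choice, and grep -E is used because the second expression relies on ?.

```shell
#!/bin/sh
# Hypothetical test words, one per line.
w=$(mktemp)
trap 'rm -f "$w"' EXIT
printf 'abstemious\nfacetious\neducation\nalmost\nbiopsy\nzebra\n' > "$w"

vowels='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
alpha='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

grep -E "$vowels" "$w"   # each vowel exactly once, in order: abstemious, facetious
grep -E "$alpha" "$w"    # letters in alphabetical order: almost, biopsy
```

The second expression also matches an empty line, since every letter in it is optional.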
· The book offers a "thought exercise" in Exercise 4-2 on p. 105:
o How would things be different if grep could match newlines?
o (Perl makes this possible.)
· sort
o Can sort alphabetically
§ Case sensitive
§ Case insensitive
o Can sort numerically
o Can sort ascending or descending
o Can sort based on part of the line
· uniq
o Note the spelling!
o Discards adjacent duplicate lines (so the input is usually sorted first)
o Can include a count of the number of times each line appears
o Can print only the duplicated lines, or only the unique lines
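Here is a short sketch of the sort and uniq capabilities listed above. The data is invented, and LC_ALL=C is set where the exact alphabetical order matters, since sort order otherwise depends on your locale.

```shell
#!/bin/sh
# Hypothetical sample data.
f=$(mktemp)
trap 'rm -f "$f"' EXIT
printf 'pear\nApple\napple\npear\nbanana\npear\n' > "$f"

LC_ALL=C sort "$f"                  # case sensitive: capitals come first in byte order
sort -f "$f"                        # case insensitive (fold case)
printf '10\n9\n100\n' | sort -n     # numeric: 9, 10, 100
printf '10\n9\n100\n' | sort -rn    # numeric, descending: 100, 10, 9
printf 'bob 30\nal 4\ncy 12\n' | sort -k 2n   # by the second field, numerically

LC_ALL=C sort "$f" | uniq       # one copy of each line
LC_ALL=C sort "$f" | uniq -c    # each line prefixed with its count
LC_ALL=C sort "$f" | uniq -d    # only lines that were duplicated: pear
LC_ALL=C sort "$f" | uniq -u    # only lines that appeared once: Apple, apple, banana
```

Running uniq on the unsorted file would not merge the three pear lines, because they are not next to each other.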
· comm
o I've actually never used this one
o diff and cmp are more commonly used, and more useful, I think
· tr
o Translates one set of characters into another
o Can use ranges, just like character classes in regular expressions
o Examples
§ tr a-z A-Z
§ Converts everything to upper case
§ tr aeiot 43107
§ Make something 31337 ("eleet")
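The two tr examples above can be tried directly from the command line; the input strings here are just samples.

```shell
#!/bin/sh
echo 'hello world' | tr a-z A-Z        # HELLO WORLD
echo 'elite hacker' | tr aeiot 43107   # 3l173 h4ck3r
```

tr maps characters by position: the first character of the first set becomes the first character of the second set, and so on.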
· dd
o Copies bits from one place to another
o Can do various transformations on the data (ASCII ↔ EBCDIC)
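A rough sketch of dd in action follows. The file names are temporary files created for the demo, and 2>/dev/null only hides the record counts that dd prints on stderr. The conv= values used here (ucase, ebcdic, ascii) are standard dd conversions, but minimal dd implementations may not support them.

```shell
#!/bin/sh
src=$(mktemp)
dst=$(mktemp)
trap 'rm -f "$src" "$dst"' EXIT
printf 'hello world\n' > "$src"

# Copy only the first 5 bytes of src into dst.
dd if="$src" of="$dst" bs=1 count=5 2>/dev/null
cat "$dst"; echo        # hello

# Transform while copying: lower case to upper case.
dd if="$src" conv=ucase 2>/dev/null

# ASCII to EBCDIC and back again.
dd if="$src" conv=ebcdic 2>/dev/null | dd conv=ascii 2>/dev/null
```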
· Combining things
o cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
more
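To watch the pipeline above work, you can wrap it in a shell function and feed it a sentence. This is a sketch: the function name wordfreq is made up, "$@" stands in for $*, and the final more is left off so the output can be piped further.

```shell
#!/bin/sh
# wordfreq: hypothetical wrapper around the word-counting pipeline.
wordfreq() {
    cat "$@" |
    tr -sc A-Za-z '\012' |   # turn every run of non-letters into one newline
    sort |                   # bring identical words together
    uniq -c |                # count each distinct word
    sort -n                  # least frequent first, most frequent last
}

printf 'the cat and the dog and the bird\n' | wordfreq
```

The last line of output is the most frequent word, here "the" with a count of 3.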
· sed is a version of ed that's designed to be used as a filter
· While ed is no longer useful, sed is still quite useful
o sed does not alter any named files; the modified version is printed on stdout
§ So, how do you edit a file with sed?
§ Usually with something like:
§ sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
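The three-step recipe above looks like this on a real file. The file notes.txt and its contents are invented for the demo, and the && chaining is my own addition so that the original is only moved if sed succeeded.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1
printf 'foo one\nfoo two\n' > notes.txt   # hypothetical file to edit

sed 's/foo/bar/' notes.txt > notes.txt.new &&
mv notes.txt notes.txt.old &&
mv notes.txt.new notes.txt

cat notes.txt       # the edited version
cat notes.txt.old   # the untouched original, kept as a backup
```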
· Common usage
o By far, the most common usage of sed is to replace one thing with another
§ sed 's/from/to/g' replaces all occurrences of "from" with "to"
§ "from" is a regular expression
§ You can delete text that matches a regular expression by putting a null string for the replacement:
§ sed 's/foo//g'
o Note that I am coloring the above to make it more readable; you don't do that when using it
o See the text for other examples and note that grep turns out to be a special case of sed
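A few one-liners make the substitution rules above concrete, including the sense in which grep is a special case of sed. The input strings are only samples.

```shell
#!/bin/sh
echo 'a foo and a foo' | sed 's/foo/bar/g'   # every occurrence: a bar and a bar
echo 'a foo and a foo' | sed 's/foo/bar/'    # without g, only the first one on each line
echo 'a foo and a foo' | sed 's/foo//g'      # empty replacement deletes the matches

# -n turns off the default printing and p prints matching lines,
# so this behaves like: grep t
printf 'one\ntwo\nthree\n' | sed -n '/t/p'
```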
· The book makes a "newer" command with sed, which is of interest for how they do the quoting, but the find command does a much easier version of "newer" (thought question...)
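If you want a hint for the thought question, find has a -newer test that compares modification times. The sketch below uses made-up file names and sets their timestamps by hand with touch -t so the result is predictable.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Hypothetical files with modification times one year apart.
touch -t 202001010000 old.txt
touch -t 202101010000 ref.txt
touch -t 202201010000 new.txt

find . -type f -newer ref.txt   # files modified more recently than ref.txt: ./new.txt
```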
Q: With what we now know about sed, is it possible to do something like this: “Replace all occurrences of ‘P’ followed by any capital letter followed by any lower case letter with ‘M’ followed by that same capital letter, then that same lower case letter, then ‘Z’”? Why or why not? If not, what kind of primitive/capability do we need?
Basically, we need the ability to take some part of the text that matched from and use it in the to. This is done by surrounding the part of the pattern you want to remember with escaped parentheses. The text matched by the first such subpattern becomes \1, the second \2, etc.
· sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g'
See the file ~cs224/demo/2012/Oct29/fancy.sed and use it on fancy.in. You have to source the file with the ‘.’ operator, or type it in.
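If you don't have the demo files handy, the same command can be tried on a one-line sample. The input strings below are invented; the second command is an extra illustration of \1 and \2 and is not from the demo directory.

```shell
#!/bin/sh
# P + capital + lower case becomes M + the same two letters + Z.
echo 'PAb PXy Pab PQ' | sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g'   # MAbZ MXyZ Pab PQ

# The remembered pieces can also be reordered.
echo 'John Smith' | sed 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2, \1/'   # Smith, John
```

Pab and PQ are left alone because they don't fit the capital-then-lower-case shape.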
Note: egrep uses parentheses to group expressions that can be used in alternation, as in (expression1|expression2) above. This is not supported in sed's basic regular expressions, where unescaped parentheses and | are ordinary characters. To see this, try the following on your own with some simple inputs:
% sed 's/\(A\)/B/g'
% sed 's/(A)/B/g'
% sed 's/(A|B)/C/g'
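Here is one possible "simple input" for the three commands, with what I would expect each to print. The sample string is my own; try others to make sure you believe the pattern.

```shell
#!/bin/sh
sample='A (A) (A|B)'   # hypothetical test input

echo "$sample" | sed 's/\(A\)/B/g'   # escaped parens only group: B (B) (B|B)
echo "$sample" | sed 's/(A)/B/g'     # plain parens are literal: A B (A|B)
echo "$sample" | sed 's/(A|B)/C/g'   # so is the bar: A (A) C
```

In other words, the last command replaces the literal five-character text (A|B), not "A or B".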
Finally, you can put the sed pattern in a file, and in fact have multiple patterns in that file. If/when you get really advanced after this class, you may have complicated setups where you generate those pattern files from another script before running sed. But here is an example, in the ~cs224/demo/2012/Oct29 directory:
% sed -f pats.sed
Just so you see what is in it, here it is:
% cat pats.sed
s/AA/BB/g
s/DD/EE/g
s/YY/ZZ/g
Note that there are no single quote marks there: they are not necessary because the shell never interprets the patterns in the file the way it would a pattern on the command line.
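If you are not on the class machine, you can recreate the pattern file yourself and try it. This sketch writes pats.sed into a scratch directory; the input line is made up.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Recreate the pattern file shown above (no quotes needed inside the file).
cat > pats.sed <<'EOF'
s/AA/BB/g
s/DD/EE/g
s/YY/ZZ/g
EOF

echo 'AA DD YY AA' | sed -f pats.sed   # BB EE ZZ BB
```

Each line of input is run through all of the commands in the file, in order.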
Advanced grep options and patterns
Here are some more options for the grep family that you will be responsible for:
-C NUM, --context=NUM
Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.
-R, -r, --recursive
Read all files under each directory, recursively; this is equivalent to the -d recurse option.
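A small sketch of both options follows. The directory layout and file contents are invented, and the options are the ones documented above for GNU grep (BSD grep accepts them too); a strictly POSIX grep may not.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

# Hypothetical directory tree.
mkdir -p proj/sub
printf 'x1\nhit A\nx2\nx3\nx4\nhit B\nx5\n' > proj/log.txt
printf 'another hit\n' > proj/sub/deep.txt

# One line of context on each side of every match; the two groups
# are not adjacent, so a separator line is printed between them.
grep -C 1 hit proj/log.txt

# Search every file under proj, listing only the names of files that match.
grep -r -l hit proj
```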
And here are some more egrep pattern "primitives" you are responsible for learning (we have covered the first few already):
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
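The brace forms are the new ones here, so a quick sketch of those. The test lines are invented, and the patterns are anchored with ^ and $ so that the counts apply to the whole line.

```shell
#!/bin/sh
r=$(mktemp)
trap 'rm -f "$r"' EXIT
printf 'ab\naab\naaab\naaaab\n' > "$r"   # hypothetical test lines

grep -E '^a{2}b$' "$r"     # exactly two a's: aab
grep -E '^a{2,}b$' "$r"    # two or more: aab, aaab, aaaab
grep -E '^a{1,3}b$' "$r"   # between one and three: ab, aab, aaab
```

Without the anchors, a{2} would also match inside aaab, since that line contains two a's in a row.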
The egrep patterns are actually a good deal more sophisticated than we have covered here. As you get more into using them, you will want to dig in deeper.