Reading: The Unix Programming Environment, Chapter 4
Filters and the Unix philosophy
· Unix has a philosophy of using small programs that each have a specific purpose
· These programs are then combined to produce the result you want
· By giving you a set of "building blocks," Unix lets you handle just about any situation
· Many of these "building blocks" are "filters"
o
They take some input, do something to it, and produce some output
· We'll cover a few of these in this section
· Generally speaking, grep searches for patterns in files
o
Or in stdin, if no files are given
· The patterns are a class of patterns called
regular expressions
o
grep takes its name from the ed command g/re/p: "globally search for a regular expression and print"
· Variants of grep,
called egrep and fgrep,
are also usually available as grep -E and grep -F
o
egrep extends the regular expression syntax
o
fgrep does a "fast" search for fixed strings; no characters in the pattern are treated as regular expression operators
· Some of the most useful options:
o
grep -v prints lines that do not match the pattern
o
grep -i is case-insensitive
o
grep -n prints the line number before each matching line (and the file name if more than one file is searched)
o
grep -f filename reads the patterns from a file (on some systems only fgrep and egrep support this)
o
grep -l prints only the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
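Here is a minimal sketch of the options above, assuming a POSIX-like shell with mktemp available. The file names (fruits.txt, other.txt, pats.txt) and their contents are made up for illustration.

```shell
# Scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical sample data.
printf 'apple\nBanana\ncherry\nbanana split\n' > fruits.txt
printf 'grape\nmelon\n' > other.txt
printf 'apple\ncherry\n' > pats.txt    # one pattern per line

not_banana=$(grep -v banana fruits.txt)     # lines that do not match
any_case=$(grep -i banana fruits.txt)       # ignore case
numbered=$(grep -n cherry fruits.txt)       # line number before each match
from_file=$(grep -f pats.txt fruits.txt)    # patterns come from pats.txt
files=$(grep -l an fruits.txt other.txt)    # names of files with a match

printf '%s\n' "$not_banana" "$any_case" "$numbered" "$from_file" "$files"

cd / && rm -rf "$dir"
```

Note that grep -v banana still prints "Banana", because without -i the match is case-sensitive.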
· Regular expressions are basically
mini-algorithms that specify how to match text
o
Regular expressions
look similar to shell patterns, but are quite a bit different
· The simplest regular expression is a single letter, which matches that letter
o
a matches a, abcde, or supercalifragilisticexpialidocious
· A sequence of letters matches that sequence
o
cat matches cat, caterpillar, or scatological
· The character . (a dot) matches any
character
· The character * indicates zero or more occurrences of the preceding character
o
car* matches cat, carry, or carolina
o
ar*a matches sarah, saab,
or marrrrrrrrrrrrra, but not marrrrrrrrrtha
· ^ matches the
beginning of a line
· $ matches the end
of a line
o
So ^$ matches a blank
line
· [....] matches any of
the characters given, and ranges can be specified
o
[0-9] matches any digit
o
[0-9]* matches zero or more digits
· [^....] matches any
character other than those listed, and ranges can be specified
o
[^0-9] matches any non-digit
· Note that * doesn't match
anything itself. It just modifies the meaning of the previous character
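The basic patterns above can be tried out with a sketch like this one. The word list is invented for illustration; each grep reads it on stdin and prints the lines that contain a match.

```shell
# Hypothetical word list, one entry per line.
words='cat
carry
carolina
dog
sarah
saab
martha
42 apples'

car_star=$(printf '%s\n' "$words" | grep 'car*')   # "ca" then zero or more "r"
ara=$(printf '%s\n' "$words" | grep 'ar*a')        # "a", zero or more "r", "a"
digits=$(printf '%s\n' "$words" | grep '[0-9]')    # any line containing a digit
blank=$(printf 'one\n\ntwo\n' | grep -c '^$')      # count of blank lines

printf '%s\n' "$car_star" "$ara" "$digits" "$blank"
```

Remember that grep matches anywhere in the line unless you anchor the pattern with ^ or $, which is why "car*" matches "cat".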
· egrep (or grep -E) adds a few more operators
o
The character + matches one
or more of the previous character
§ car+ matches car, carr, or carrrrr,
but not ca
o
The character ? matches zero or
one of the previous character
§ car?pet matches capet and carpet, but not ca or carrpet
· (expression1|expression2) matches either expression1 or expression2
· Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
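A small sketch of the extended operators above, using grep -E on an invented word list:

```shell
# Hypothetical word list, one entry per line.
words='ca
car
carrrrr
capet
carpet
carrpet
dog'

plus=$(printf '%s\n' "$words" | grep -E 'car+')             # one or more "r"
optional=$(printf '%s\n' "$words" | grep -E 'car?pet')      # zero or one "r"
either=$(printf '%s\n' "$words" | grep -E '(dog|capet)')    # one alternative or the other

printf '%s\n' "$plus" "$optional" "$either"
```

Because the patterns are not anchored, car+ also matches "carpet" and "carrpet": both contain "car".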
· The book offers a couple of interesting
regular expressions. If you understand them, you could be considered to have a
good understanding of regular expressions.
· ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
· ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
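If you want to check your reading of these two patterns, here is a sketch that runs them over a short, made-up word list. The first pattern accepts lines where each vowel appears exactly once, in the order a, e, i, o, u. The second accepts lines whose lowercase letters appear in alphabetical order; it needs grep -E because of the ? operator.

```shell
# Hypothetical word list, one entry per line.
words='facetious
abstemious
education
almost
biopsy
banana'

vowels='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
alpha='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

in_order=$(printf '%s\n' "$words" | grep "$vowels")      # vowels in order
ascending=$(printf '%s\n' "$words" | grep -E "$alpha")   # letters in order

printf '%s\n' "$in_order" "$ascending"
```

"education" contains all five vowels but fails the first pattern, because they are not in order.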
· The book offers a "thought exercise"
in Exercise 4-2 on p. 105:
o
How would things be
different if grep could match newlines?
o
(Perl makes this
possible.)
GNU Grep 3.0 Goodies
· sort
o
Can sort
alphabetically
§ Case sensitive
§ Case insensitive
o
Can sort numerically
o
Can sort ascending or
descending
o
Can sort based on part
of the line
· uniq
o
Note the spelling!
o
Discards duplicate
lines
o
Can include a count of
the number of times each line appears
o
Can print only the
duplicated lines, or only the unique lines
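A sketch of the sort and uniq features listed above, on invented data. LC_ALL=C is set only to make the ordering predictable across systems. The sed at the end of the counted pipeline just strips the leading spaces that uniq -c prints.

```shell
# Hypothetical input, one item per line.
data='pear
apple
pear
fig
apple
pear'

sorted=$(printf '%s\n' "$data" | LC_ALL=C sort)                  # alphabetical
deduped=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq)          # duplicates discarded
counted=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -c | sed 's/^ *//')  # with counts
only_dups=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -d)     # only duplicated lines
only_unique=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -u)   # only unique lines

numeric=$(printf '10\n9\n100\n' | sort -n)              # numeric, ascending
descending=$(printf '10\n9\n100\n' | sort -rn)          # numeric, descending
by_field=$(printf 'x 3\ny 1\nz 2\n' | sort -k 2,2n)     # sort on the second field
folded=$(printf 'b\nA\na\nB\n' | sort -f)               # ignore case when comparing

printf '%s\n' "$counted" "$numeric" "$by_field"
```

uniq only compares adjacent lines, which is why it almost always follows a sort.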
· comm
o
I've actually never
used this one
o
diff and cmp are more commonly used, and more useful, I think
· tr
o
Translates one set of
characters into another
o
Can use ranges, just
like character classes in regular expressions
o
Examples
§ tr a-z A-Z
§ Converts everything to upper case
§ tr aeiot 43107
§ Make something 31337 ("eleet")
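The two tr examples above, run on made-up input so you can see the effect:

```shell
upper=$(printf 'hello world\n' | tr a-z A-Z)         # lower case to upper case
leet=$(printf 'elite hackers\n' | tr aeiot 43107)    # a->4 e->3 i->1 o->0 t->7

printf '%s\n' "$upper" "$leet"
```

tr works character by character: the first character of the first set maps to the first character of the second set, and so on.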
· dd
o
Copies bits from one
place to another
o
Can do various transformations on the data (ASCII ↔ EBCDIC)
· Combining things
o
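cat $* |                  # concatenate the files named as arguments
tr -sc A-Za-z '\012' |    # turn each run of non-letters into one newline
sort |                    # bring identical words together
uniq -c |                 # count each distinct word
sort -n |                 # order by count, smallest first
more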
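To see what the pipeline produces, here is a sketch that wraps the same stages in a shell function and feeds it a two-line sample. The function name wordfreq and the sample text are made up, and more is left off so the output can be captured.

```shell
# wordfreq (hypothetical name): read text on stdin, print word counts,
# least frequent first.
wordfreq() {
    tr -sc A-Za-z '\012' |   # one word per line
    LC_ALL=C sort |          # group identical words
    uniq -c |                # count them
    LC_ALL=C sort -n         # order by count
}

counts=$(printf 'the cat and the hat\nthe end\n' | wordfreq)
top=$(printf '%s\n' "$counts" | tail -n 1 | sed 's/^ *//')   # most frequent word

printf '%s\n' "$counts"
printf 'most frequent: %s\n' "$top"
```

Each stage does one small job, and the counting only works because sort puts identical words next to each other before uniq -c sees them.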
· sed is a version of ed that's designed to be
used as a filter
· While ed is no longer useful, sed is still quite useful
o
sed does not alter any named files; the modified
version is printed on stdout
§ So, how do you edit a file with sed?
§ Usually with something like:
§ sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
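The three-step recipe above can be sketched end to end as follows. The file name notes.txt and its contents are invented, and everything happens in a scratch directory.

```shell
dir=$(mktemp -d)
file="$dir/notes.txt"                       # hypothetical file to edit
printf 'old text\nmore old text\n' > "$file"

sed 's/old/new/g' "$file" > "$file.new"     # edited copy goes to a new file
mv "$file" "$file.old"                      # keep the original as a backup
mv "$file.new" "$file"                      # move the edited copy into place

edited=$(cat "$file")
backup=$(cat "$file.old")
printf '%s\n' "$edited"

rm -rf "$dir"
```

Never redirect sed's output straight onto the file it is reading (sed … filename >filename): the shell truncates the file before sed gets to read it.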
· Common usage
o
By far, the most common usage of sed is to replace one thing with another
§ sed 's/from/to/g' replaces all occurrences of "from" with "to"
§ "from" is a regular expression
§ You can delete regular expressions by putting
a null string for the replacement:
§ sed 's/foo//g'
o
Note that I am
coloring the above to make it more readable, um, you don’t do that when using
it…
o
See the text for other
examples and note that grep turns out to be a special case of sed
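A sketch of the common substitutions above on made-up input. The last line shows the sense in which grep is a special case of sed: with -n, sed prints nothing by default, and /pattern/p prints only the matching lines.

```shell
all=$(printf 'from here, from there\n' | sed 's/from/to/g')     # every occurrence
first=$(printf 'from here, from there\n' | sed 's/from/to/')    # no g: first occurrence on each line
deleted=$(printf 'xfooyfooz\n' | sed 's/foo//g')                # empty replacement deletes
numbers=$(printf 'a1b22c333\n' | sed 's/[0-9][0-9]*/#/g')       # "from" is a regular expression
like_grep=$(printf 'one\ntwo\nthree\n' | sed -n '/t/p')         # behaves like: grep t

printf '%s\n' "$all" "$first" "$deleted" "$numbers" "$like_grep"
```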
· The book makes a "newer" command
with sed, which is of interest for how they do the
quoting, but the find command does a much easier version of "newer"
(thought question...)
Q: With what we now know about sed, is it possible to do something like this:
“Replace all occurrences of ‘P’ followed by any capital letter followed by any lower case letter with ‘M’ followed by that same capital letter, then that same lower case letter, then ‘Z’”?
Why or why not? If not, what kind of primitive or capability do we need?
(pause
to think about question….)
Basically, we need the ability to capture some part of what "from" matched and reuse it in "to". This is done by surrounding the part of the pattern you want to capture with escaped parentheses. The first such subpattern becomes \1, the second \2, etc.
· sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g'
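Here is the command above applied to an invented input line, plus a second small example that swaps two words. LC_ALL=C is an added precaution so that [A-Z] and [a-z] mean exactly the ASCII ranges.

```shell
# "Pab" is left alone: the character after P must be a capital letter.
tagged=$(printf 'PAb PXy Pab\n' |
    LC_ALL=C sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g')

# \1 and \2 can be used in any order in the replacement.
swapped=$(printf 'hello world\n' |
    LC_ALL=C sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

printf '%s\n' "$tagged" "$swapped"
```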
See the file fancy.sed and use it on fancy.in. You have to
source the file with the ‘.’ operator, or type it in.
Note: egrep uses parentheses to group expressions that can be used in alternation, as in (expression1|expression2) above. This is not supported in sed's default (basic) regular expressions, where unescaped parentheses are ordinary characters. To see this, try the following on your own with some simple inputs:
% sed 's/\(A\)/B/g'
% sed 's/(A)/B/g'
% sed 's/(A|B)/C/g'
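If you would like to compare your results, here is a sketch that runs the three commands on one made-up input line. The fourth command is an addition that is not in the notes: it assumes your sed accepts -E for extended regular expressions, as GNU and BSD sed do.

```shell
input='A (A) (A|B)'   # hypothetical test line

escaped=$(printf '%s\n' "$input" | sed 's/\(A\)/B/g')     # \( \) group; every A is replaced
literal=$(printf '%s\n' "$input" | sed 's/(A)/B/g')       # ( ) are literal characters
bar=$(printf '%s\n' "$input" | sed 's/(A|B)/C/g')         # ( | ) are all literal
extended=$(printf '%s\n' "$input" | sed -E 's/(A|B)/C/g') # assumption: -E is available

printf '%s\n' "$escaped" "$literal" "$bar" "$extended"
```

Only the -E version treats (A|B) the way egrep does.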
Finally, you can put the sed pattern in a file, and in fact have multiple
patterns in that file. If/when you get real advanced after this class, you may
have complicated setups where you generate those pattern files from another
script before running sed. But here is an example, in the demo directory:
% sed -f pats.sed
Just so you see what is in it, here it is:
% cat pats.sed
s/AA/BB/g
s/DD/EE/g
s/YY/ZZ/g
Note that there are no single quote marks
there: they are not necessary because the shell will never interpret the
patterns in the file like it would a pattern on the command line.
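If you do not have the demo directory handy, this sketch recreates pats.sed in a scratch directory and runs it on a made-up input line:

```shell
dir=$(mktemp -d)

# Same three commands as in pats.sed above; no shell quoting is needed
# inside the file.
cat > "$dir/pats.sed" <<'EOF'
s/AA/BB/g
s/DD/EE/g
s/YY/ZZ/g
EOF

result=$(printf 'AA DD YY CC\n' | sed -f "$dir/pats.sed")
printf '%s\n' "$result"

rm -rf "$dir"
```

sed applies the commands in the file to each input line, in order, so one line can be changed by several of them.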
Advanced grep options and patterns
Here are some more options for the grep family that you will be responsible for:
-C NUM, --context=NUM
Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.
-R, -r, --recursive
Read all files under each directory, recursively; this
is equivalent to the -d recurse option.
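A sketch of both options on invented data, assuming a grep that supports -C and -r (GNU and BSD grep both do). The directory layout and file names are made up.

```shell
# -C 1: one line of context on each side of every match.
context=$(printf 'a\nHIT\nb\nc\nd\ne\nHIT\nf\n' | grep -C 1 HIT)
printf '%s\n' "$context"

# -r: search every file under a directory.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'one\nthree\n' > "$dir/top.txt"
printf 'alpha\nthree again\n' > "$dir/sub/deep.txt"
printf 'nothing here\n' > "$dir/sub/other.txt"

found=$(grep -r -l three "$dir" | wc -l | tr -d ' ')   # number of files with a match
printf 'files matching: %s\n' "$found"

rm -rf "$dir"
```

In the context output the two groups of lines are separated by a line containing only --, because lines c and d fall outside both context windows.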
And here are some more egrep pattern "primitives" you are responsible for learning (we have covered the first few already):
A regular expression may be followed by one of several repetition operators:
?      The preceding item is optional and matched at most once.
*      The preceding item will be matched zero or more times.
+      The preceding item will be matched one or more times.
{n}    The preceding item is matched exactly n times.
{n,}   The preceding item is matched n or more times.
{n,m}  The preceding item is matched at least n times, but not more than m times.
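A quick sketch of the counted repetitions, using grep -E on a made-up list. The patterns are anchored with ^ and $ so that the counts apply to the whole line.

```shell
# Hypothetical input: one to four a's per line.
words='a
aa
aaa
aaaa'

exactly_two=$(printf '%s\n' "$words" | grep -E '^a{2}$')     # exactly 2
two_or_more=$(printf '%s\n' "$words" | grep -E '^a{2,}$')    # 2 or more
one_or_two=$(printf '%s\n' "$words" | grep -E '^a{1,2}$')    # between 1 and 2

printf '%s\n' "$exactly_two" "$two_or_more" "$one_or_two"
```

Without the anchors, a{2} would match every line that contains at least two a's in a row, including "aaa" and "aaaa".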
The egrep patterns are actually a good deal more sophisticated than we have covered here. As you use them more, you will want to dig in deeper.