Filters
Filters and the Unix philosophy
grep
Regular expressions
egrep regular expressions
Fun? with regular expressions
Other filters
sed
Reading:
The Unix Programming Environment, Chapter 4
Filters and the Unix philosophy
- Unix has a philosophy of using small programs, each with a specific purpose
- These programs are then
combined to produce the result you want
- By giving you a set of
"building blocks," Unix lets you handle just about any situation
- Many of these "building
blocks" are "filters"
- They take some input,
do something to it, and produce some output
- We'll cover a few of these in
this section
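As a rough sketch of the building-block idea, the pipeline below chains four small filters so that each one does a single job. The sample log lines and the function name log_lines are made up for illustration.

```shell
# Hypothetical sample input: a few log lines, one per line.
log_lines() {
    printf '%s\n' 'ok: start' 'ERROR: disk full' 'ok: retry' \
                  'error: disk full' 'Error: timeout'
}

# Each stage reads stdin and writes stdout, so the stages snap together with pipes.
log_lines |
    grep -i '^error' |   # keep only the error lines, in any letter case
    tr 'A-Z' 'a-z' |     # normalize everything to lowercase
    sort |               # bring identical lines next to each other
    uniq                 # drop the repeats
```

None of the four programs knows about the others; the shell does the combining.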
grep
- Generally speaking, grep searches
for patterns in files
- Or in stdin, if no files are given
- The patterns belong to a class called regular expressions
- The name grep comes from the ed command g/re/p: globally search for a regular expression and print the matching lines
- Variants of grep,
called egrep
and fgrep,
are also usually available as grep -E and grep -F
- egrep extends the regular
expression syntax
- fgrep does a "fast"
search using fixed strings
- Some of the most useful
options:
- grep -v prints lines that do not match the pattern
- grep -i is case-insensitive
- grep -n prints out the line number before the line (and the file name, if more than one file is searched)
- grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)
- grep -l prints only the names of the files that contain a match (very useful on command lines: sort `grep -l …` | …)
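The options above can be tried on a couple of throwaway files. This is only a sketch: the file names and contents are invented, and the files live in a scratch directory created with mktemp.

```shell
# Made-up sample files in a scratch directory.
dir=$(mktemp -d)
printf 'Apple\nbanana\ncherry\n' > "$dir/fruit.txt"
printf 'carrot\npotato\n'        > "$dir/veg.txt"
printf 'an\npot\n'               > "$dir/pats"     # one pattern per line

grep -v an    "$dir/fruit.txt"      # lines without "an": Apple and cherry
grep -i apple "$dir/fruit.txt"      # case ignored: Apple
grep -n an    "$dir/fruit.txt"      # line number first: 2:banana
grep -f "$dir/pats" "$dir/veg.txt"  # patterns read from a file: potato
(cd "$dir" && grep -l an *.txt)     # only names of files with a match: fruit.txt
```

Remove the scratch directory with rm -r "$dir" when done.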
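Regular expressions
- Regular expressions are basically mini-algorithms that specify how to match text
- Regular expressions look similar to shell patterns, but behave quite differently: in the shell, * by itself matches any string, whereas in a regular expression it only repeats the preceding character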
- The simplest regular expression is a single letter, which matches that letter
- a matches a, abcde, or supercalifragilisticexpialidocious
- A sequence of letters matches
that sequence
- cat matches cat, caterpillar, or scatological
- The character . (a dot) matches any single character
- The character * indicates zero or more occurrences of the preceding character
- car* matches cat, carry, or carolina
- ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha
- ^ matches the beginning of a line
- $ matches the end of a line
- So ^$ matches an empty line (one with no characters at all)
- [....] matches any one of the characters given, and ranges can be specified
- [0-9] matches any digit
- [0-9]* matches zero or more digits
- [^....] matches any one character other than those listed, and ranges can be specified
- [^0-9] matches any non-digit
- Note that * doesn't match anything itself. It
just modifies the meaning of the previous character
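A quick way to check the rules above is to feed grep a handful of words and see which ones survive. The word list and the function name words are invented for this sketch; remember that grep prints a line if the pattern matches anywhere in it.

```shell
# Made-up sample input, one item per line; the last item is an empty line.
words() { printf '%s\n' cat carry saab martha 42 ''; }

words | grep 'car*'       # cat and carry: c, a, then zero or more r
words | grep 'ar*a'       # saab: a, zero or more r, then a
words | grep '^[0-9]*$'   # 42 and the empty line: nothing but digits, possibly none
words | grep -c '^$'      # 1: the count of empty lines
words | grep '[^a-z]'     # 42: contains a character that is not a lowercase letter
```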
egrep
regular expressions
- egrep (or grep -E) adds a few more operators
- The character + matches one or more of the
previous character
- car+ matches car, carr, or carrrrr, but not ca
- The character ? matches zero or one of the
previous character
- car?pet matches capet and carpet, but not ca or carrpet
- (expression1|expression2) matches either expression1
or expression2
- Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
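The extended operators can be tried the same way with grep -E. The sample words and the function name samples are made up; the ^ and $ anchors in the last two patterns force the whole line to match rather than just part of it.

```shell
# Made-up sample input, one word per line.
samples() { printf '%s\n' ca car carrrr capet carpet carrpet dog; }

samples | grep -E 'car+'         # car, carrrr, carpet, carrpet: at least one r after ca
samples | grep -E '^car?pet$'    # capet and carpet: the r is optional, but only one is allowed
samples | grep -E '^(ca|dog)$'   # ca and dog: either alternative
```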
Fun? with regular expressions
- The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.
- ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
- ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
- The book offers a
"thought exercise" in Exercise 4-2 on p. 105:
- How would things be
different if grep could match newlines?
- (Perl makes this
possible.)
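One way to work out what the two expressions do is to run them over a few words. The word list below is invented for this sketch (the book runs them against a dictionary). The first pattern uses only basic syntax; the second relies on ?, so it needs grep -E.

```shell
# The two patterns, copied from the bullets above.
vowels_in_order='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
alphabetical='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

# Made-up sample words, one per line.
wordlist() { printf '%s\n' facetious abstemious education almost begin zebra; }

# Words containing a, e, i, o, u exactly once each, in that order.
wordlist | grep "$vowels_in_order"      # facetious, abstemious

# Words whose letters appear in alphabetical order, with no letter repeated.
wordlist | grep -E "$alphabetical"      # almost, begin
```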
GNU Grep 3.0 Goodies
GNU Grep 3.0 adds character classes and other goodies that can save you
time if you do a lot of grepping. (For now, focus on the baseline grep material above; I will announce later whether this is testable.)
Other
filters
- sort
- Can sort case-sensitively or case-insensitively
- Can sort numerically
- Can sort ascending or descending
- Can sort based on part of the line
- uniq
- Note the spelling!
- Discards adjacent duplicate lines, so the input is usually sorted first
- Can include a count of the number of times each line appears
- Can print only the duplicated lines, or only the unique lines
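A minimal sketch of these behaviours, assuming the filters described here are the standard sort and uniq commands. The sample data and the function name votes are made up. Because uniq only compares neighbouring lines, the input is sorted first.

```shell
# Made-up sample input, one item per line.
votes() { printf '%s\n' pear apple pear fig apple pear; }

votes | sort | uniq                 # apple, fig, pear: duplicates discarded
votes | sort | uniq -d              # apple, pear: only lines that were repeated
votes | sort | uniq -u              # fig: only lines that appeared once
votes | sort | uniq -c | sort -rn   # counts, most frequent first (3 pear at the top)

# Sorting on part of the line: order by the second field, numerically.
printf 'b 10\na 9\n' | sort -k 2,2n   # a 9, then b 10
```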
- comm (compares two sorted files line by line)
- I've actually never used this one
- diff and cmp are more commonly used, and more useful, I think
- tr
- Translates one set of characters into another
- Can use ranges, just like character classes in regular expressions
- Examples
- Make something 31337 ("eleet")
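A small sketch of character translation, assuming the filter described here is the standard tr command. The particular letter-to-digit mapping is my own choice for illustration, not necessarily the one used in lecture.

```shell
# Each character in the first set is replaced by the character
# at the same position in the second set: a->4, e->3, i->1, o->0, t->7.
echo 'elite hackers' | tr 'aeiot' '43107'   # prints: 3l173 h4ck3rs

# Ranges work as they do in regular expression character classes.
echo 'hello' | tr 'a-z' 'A-Z'               # prints: HELLO
```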
- dd
- Copies bits from one place to another
- Can do various transformations on the data (ASCII ↔ EBCDIC)
- cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
more
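To see what the pipeline above produces, it can be wrapped in a function and fed a line of text. This is a sketch: the function name wordfreq and the sample sentence are made up, "$@" replaces $* so that file names containing spaces survive, and the final more is left off so the output can be inspected directly.

```shell
# Hypothetical wrapper around the word-frequency pipeline.
wordfreq() {
    cat "$@" |                  # the named files, or stdin if none are given
    tr -sc 'A-Za-z' '\012' |    # turn each run of non-letters into a single newline
    sort |                      # group identical words together
    uniq -c |                   # prefix each distinct word with its count
    sort -n                     # least frequent first, most frequent last
}

printf 'the cat and the hat and the bat\n' | wordfreq
```

The last line of output is the most frequent word; here that is "the", with a count of 3.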
sed
- sed
is a version of ed that's designed to be used as
a filter
- While ed
is no longer useful, sed is still quite useful
- sed
does not alter any named files; the modified version is printed on stdout
- So, how do you edit a
file with sed?
- Usually with
something like:
- sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
- By far, the most common
usage of sed is to replace one thing with
another
- sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
- "foo" is a regular expression
- You can delete
regular expressions by putting a null string for the replacement
- See the text for other
examples and note that grep turns out to be a
special case of sed
- The book makes a
"newer" command with sed, which is of
interest for how they do the quoting, but the find command does a much
easier version of "newer"
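The sed points above can be sketched in a few lines. The sample text and the file name notes.txt are invented, and the file is created in a scratch directory from mktemp so nothing real is touched.

```shell
# Substitution: every match of the regular expression foo becomes bar.
echo 'foo food fool' | sed 's/foo/bar/g'      # prints: bar bard barl

# Deletion: an empty replacement removes each match.
echo 'xfooyfooz' | sed 's/foo//g'             # prints: xyz

# grep as a special case: -n suppresses the default output, p prints matching lines.
printf 'one\ntwo\nthree\n' | sed -n '/t/p'    # prints two and three, like grep t

# "Editing" a file: write the result elsewhere, then move it into place.
dir=$(mktemp -d)
f="$dir/notes.txt"                            # made-up file name
printf 'foo one\nfoo two\n' > "$f"
sed 's/foo/bar/g' "$f" > "$f.new" &&
    mv "$f" "$f.old" &&
    mv "$f.new" "$f"
cat "$f"                                      # prints: bar one, then bar two
```

Redirecting sed's output straight back onto its input file would truncate the file before sed reads it, which is why the intermediate file is needed.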