Filters

Reading: The Unix Programming Environment, Chapter 4

Filters and the Unix philosophy

· Unix has a philosophy of using small programs that have a specific purpose

· These programs are then combined to produce the result you want

· By giving you a set of "building blocks," Unix lets you handle just about any situation

· Many of these "building blocks" are "filters"

o They take some input, do something to it, and produce some output

· We'll cover a few of these in this section

grep

· Generally speaking, grep searches for patterns in files

o Or in stdin, if no files are given

· The patterns are a class of patterns called regular expressions

o The name grep comes from the ed command g/re/p: "globally search for a regular expression and print"

· Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F

o egrep extends the regular expression syntax

o fgrep does a "fast" search using fixed strings

· Some of the most useful options:

o grep -v prints lines that do not match the pattern

o grep -i is case-insensitive

o grep -n prints out the line number before the line (and file if more than one file searched)

o grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)

o grep -l only prints out the filenames that have something that matches (very useful on command lines: sort `grep -l …` | …)
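The options above can be tried on a couple of throwaway files. This is only a sketch: the file names (fruit.txt, veg.txt) and their contents are made up for illustration and are not from the book.

```shell
# Build a small scratch directory so the example is self-contained.
# File names and contents are hypothetical.
dir=$(mktemp -d)
printf 'apple\nBanana\ncherry\n' > "$dir/fruit.txt"
printf 'carrot\npotato\n' > "$dir/veg.txt"

# -v: lines that do NOT contain "an"
no_an=$(grep -v 'an' "$dir/fruit.txt")

# -i: ignore case, so "banana" also finds "Banana"
any_case=$(grep -i 'banana' "$dir/fruit.txt")

# -n: prefix each matching line with its line number
numbered=$(grep -n 'cherry' "$dir/fruit.txt")

# -l: print only the names of the files that contain a match
files=$(cd "$dir" && grep -l 'rry' fruit.txt veg.txt)

printf '%s\n' "$no_an" "$any_case" "$numbered" "$files"
rm -r "$dir"
```

Because -l prints bare file names, its output is easy to hand to another command through backquotes.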

Regular expressions

· Regular expressions are basically mini-algorithms that specify how to match text

o Regular expressions look similar to shell patterns, but are quite a bit different

· The simplest regular expression is a single letter, which matches that letter

o a matches a, abcde, or supercalifragilisticexpialidocious

· A sequence of letters matches that sequence

o cat matches cat, caterpillar, or scatological

· The character . (a dot) matches any character

· The character * indicates zero or more occurrences of the preceding character

o car* matches cat, carry, or carolina

o ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha

· ^ matches the beginning of a line

· $ matches the end of a line

o So ^$ matches a blank line

· [....] matches any of the characters given, and ranges can be specified

o [0-9] matches any digit

o [0-9]* matches zero or more digits

· [^....] matches any character other than those listed, and ranges can be specified

o [^0-9] matches any non-digit

· Note that * doesn't match anything itself. It just modifies the meaning of the previous character
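To see these rules in action, here is a small sketch that feeds a made-up word list to grep. The helper function words and its contents are assumptions for the demonstration only.

```shell
# A hypothetical word list, one word per line, including one blank line.
words() { printf 'cat\ncarry\nsaab\nmartha\n\n42\nx9\n'; }

# car* is "ca" followed by zero or more "r", so both cat and carry match
star=$(words | grep 'car*')

# ar*a needs an "a", any number of "r", then another "a"
ara=$(words | grep 'ar*a')

# ^$ matches a line with nothing between its start and end; -c counts matches
blank=$(words | grep -c '^$')

# ^[0-9][0-9]*$ matches lines made up of one or more digits and nothing else
digits=$(words | grep '^[0-9][0-9]*$')

# [^0-9] needs at least one non-digit somewhere on the line
nondigit=$(words | grep -c '[^0-9]')

printf '%s\n' "$star" "$ara" "$blank" "$digits" "$nondigit"
```

Note that grep reports a line if the pattern matches anywhere in it, which is why the anchors ^ and $ are needed to test a whole line.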

egrep regular expressions

· egrep (or grep -E) adds a few more

o The character + matches one or more of the previous character

§ car+ matches car, carr, or carrrrr, but not ca

o The character ? matches zero or one of the previous character

§ car?pet matches capet and carpet, but not ca or carrpet

· (expression1|expression2) matches either expression1 or expression2

· Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
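The extended operators can be checked the same way with grep -E. The word list below is invented for the example.

```shell
# Hypothetical test words, one per line.
list() { printf 'ca\ncar\ncarrrrr\ncapet\ncarpet\ncarrpet\ndog\n'; }

# r+ requires at least one "r", so "ca" and "capet" are left out
plus=$(list | grep -E 'car+')

# r? allows zero or one "r" between "ca" and "pet"
opt=$(list | grep -E 'car?pet')

# (x|y) matches either alternative; the anchors make it a whole-line test
alt=$(list | grep -E '^(dog|ca)$')

printf '%s\n' "$plus" "$opt" "$alt"
```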

Fun? with regular expressions

· The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.

· ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

· ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

· The book offers a "thought exercise" in Exercise 4-2 on p. 105:

o How would things be different if grep could match newlines?

o (Perl makes this possible.)
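One way to work out what the two expressions from the book do is to run them against a few words. The word list here is made up; on a real system you might try a dictionary file instead, if one is installed.

```shell
# Hypothetical word list for experimenting.
words() { printf 'facetious\nabstemious\neducation\nalmost\nbiopsy\nbanana\n'; }

# First pattern: the vowels a, e, i, o, u each appear once, in that order,
# with only non-vowels before, between, and after them.
in_order=$(words | grep '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

# Second pattern: every letter is optional, but the letters that do appear
# must be in alphabetical order. The ? operator needs grep -E.
alphabetical=$(words | grep -E '^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$')

printf '%s\n' "$in_order" "$alphabetical"
```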

Other filters

· sort

o Can sort alphabetically

§ Case sensitive

§ Case insensitive

o Can sort numerically

o Can sort ascending or descending

o Can sort based on part of the line

· uniq

o Note the spelling!

o Discards duplicate lines

o Can include a count of the number of times each line appears

o Can print only the duplicated lines, or only the unique lines

· comm

o I've actually never used this one

o diff and cmp are more commonly used, and more useful, I think

· tr

o Translates one set of characters into another

o Can use ranges, just like character classes in regular expressions

o Examples

§ tr a-z A-Z

§ Converts every lowercase letter to uppercase

§ tr aeiot 43107

§ Make something 31337 ("eleet")

· dd

o Copies bits from one place to another

o Can do various transformations on the data (ASCII ↔ EBCDIC)

· Combining things
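As a sketch of combining these filters, the pipeline below counts how often each word occurs in some text. The sample text is made up, and the pipeline is just one reasonable arrangement, not the only one.

```shell
# Hypothetical input text.
text() { printf 'the cat saw the dog\nThe dog saw the cat run\n'; }

# 1. tr folds upper case to lower case
# 2. tr turns each space into a newline, giving one word per line
# 3. sort brings identical words together (uniq only compares adjacent lines)
# 4. uniq -c collapses duplicates and prefixes each line with its count
# 5. sort -rn orders by that count, largest first
counts=$(text | tr 'A-Z' 'a-z' | tr ' ' '\n' | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# tr on its own: one-for-one character translation
upper=$(printf 'hello\n' | tr a-z A-Z)
leet=$(printf 'elite\n' | tr aeiot 43107)
printf '%s\n' "$upper" "$leet"
```

The sort before uniq matters: without it, repeated words that are not on adjacent lines would be counted separately.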

sed

· sed is a version of ed that's designed to be used as a filter

· While ed is no longer useful, sed is still quite useful

o sed does not alter any named files; the modified version is printed on stdout

§ So, how do you edit a file with sed?

§ Usually with something like:

§ sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename

· Common usage

o By far, the most common usage of sed is to replace one thing with another

§ sed 's/from/to/g' replaces all occurrences of "from" with "to"

§ "from" is a regular expression

§ You can delete regular expressions by putting a null string for the replacement:

§ sed 's/foo//g'

o Note that any coloring of the commands above is only there to make them more readable; it is not part of what you type

o See the text for other examples and note that grep turns out to be a special case of sed

· The book makes a "newer" command with sed, which is of interest for how they do the quoting, but the find command does a much easier version of "newer" (thought question...)
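The common sed usages above can be tried on a few lines of made-up input. The helper function input and the temporary file are assumptions for the demonstration.

```shell
# Hypothetical input lines.
input() { printf 'from here\nup from from down\nnothing\n'; }

# With the trailing g, every occurrence on each line is replaced
replaced=$(input | sed 's/from/to/g')

# Without g, only the first occurrence on each line is replaced
first_only=$(input | sed 's/from/to/')

# An empty replacement deletes whatever matched
deleted=$(input | sed 's/from //g')

# -n turns off the default printing and /re/p prints matching lines,
# which is what grep does
like_grep=$(input | sed -n '/from/p')
real_grep=$(input | grep 'from')

# Editing a file "in place" by writing a new copy and swapping it in
f=$(mktemp)
input > "$f"
sed 's/from/to/g' "$f" > "$f.new"
mv "$f" "$f.old"
mv "$f.new" "$f"
edited=$(cat "$f")
rm -f "$f" "$f.old"

printf '%s\n' "$replaced" "$first_only" "$deleted" "$like_grep" "$edited"
```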

Q: With what we now know about sed, is it possible to do something like this:

“Replace all occurrences of ‘P’ followed by any capital letter followed by any lower case letter with ‘M’ followed by that same capital letter then that same lower case letter then ‘Z’”?

Why or why not? If not, what kind of primitive or capability do we need?

Basically, we need the ability to capture some part of what matched in "from" and reuse it in "to". This is done by surrounding the part of the pattern you want to capture with escaped parentheses, \( and \). The first such subpattern becomes \1, the second \2, etc.

· sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g'
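Here is a quick sketch of that substitution on a made-up input line. Setting LC_ALL=C is my own addition, so that [A-Z] and [a-z] mean plain ASCII ranges regardless of the locale.

```shell
# \([A-Z]\) is remembered as \1 and \([a-z]\) as \2.
# "Pab" is left alone because "a" is not a capital letter.
result=$(printf 'PAb and PXy but Pab\n' |
  LC_ALL=C sed 's/P\([A-Z]\)\([a-z]\)/M\1\2Z/g')

# The same mechanism can reorder text: swap the first two words
swapped=$(printf 'hello world\n' |
  LC_ALL=C sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

printf '%s\n' "$result" "$swapped"
```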

See the file ~cs224/demo/2012/Oct29/fancy.sed and use it on fancy.in. You have to source the file with the ‘.’ operator, or type it in.

Note: egrep uses parentheses to group expressions that can be used in alternation, as in (expression1|expression2) above. This is not supported in sed: to see this, try the following on your own with some simple inputs:

% sed 's/\(A\)/B/g'

% sed 's/(A)/B/g'

% sed 's/(A|B)/C/g'
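If you want to check your answers, the sketch below runs the three commands on made-up input. It assumes sed is using its default basic regular expressions, where unescaped parentheses and | are ordinary characters.

```shell
# Hypothetical sample line.
sample() { printf 'A (A) B\n'; }

# Escaped parentheses group, so the pattern is just "A"
grouped=$(sample | sed 's/\(A\)/B/g')

# Unescaped parentheses are literal, so only the text "(A)" is replaced
literal=$(sample | sed 's/(A)/B/g')

# No alternation here: the pattern is the literal five characters "(A|B)"
no_alt=$(printf 'A B (A|B)\n' | sed 's/(A|B)/C/g')

printf '%s\n' "$grouped" "$literal" "$no_alt"
```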

Finally, you can put the sed pattern in a file, and in fact have multiple patterns in that file. If/when you get real advanced after this class, you may have complicated setups where you generate those pattern files from another script before running sed. But here is an example, in the ~cs224/demo/2012/Oct29 directory:

% sed -f pats.sed

Just so you see what is in it, here it is:

% cat pats.sed

s/AA/BB/g

s/DD/EE/g

s/YY/ZZ/g

Note that there are no single quote marks there: they are not necessary because the shell will never interpret the patterns in the file like it would a pattern on the command line.
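If you are not on the course machine, a stand-in for pats.sed can be built in a scratch directory. The input text below is made up; the three commands are the ones shown above.

```shell
dir=$(mktemp -d)

# The quoted 'EOF' tells the shell to leave the script text alone
cat > "$dir/pats.sed" <<'EOF'
s/AA/BB/g
s/DD/EE/g
s/YY/ZZ/g
EOF

# Every command in the file is applied, in order, to each input line
out=$(printf 'AA CC DD\nXX YY AA\n' | sed -f "$dir/pats.sed")
printf '%s\n' "$out"

rm -r "$dir"
```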

Advanced grep options and patterns

Here are some more options for the grep family that you will be responsible for:

-C NUM, --context=NUM

Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.

-R, -r, --recursive

Read all files under each directory, recursively; this is equivalent to the -d recurse option.
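A small sketch of both options, using a scratch directory whose file names and contents are invented for the example:

```shell
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'one\ntwo\nthree\nfour\nfive\n' > "$dir/nums.txt"
printf 'three blind mice\n' > "$dir/sub/song.txt"

# -C 1: print one line of context before and after each match
context=$(grep -C 1 'three' "$dir/nums.txt")

# -r: search every file under the directory; -l lists only the file names.
# The final sort just makes the order predictable.
found=$(cd "$dir" && grep -rl 'three' . | sort)

printf '%s\n' "$context" "$found"
rm -r "$dir"
```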

And here are some more egrep pattern "primitives" you are responsible for learning (we have covered the first few already):

A regular expression may be followed by one of several repetition operators:

? The preceding item is optional and matched at most once.

* The preceding item will be matched zero or more times.

+ The preceding item will be matched one or more times.

{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.
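The counted repetition operators can be sketched with grep -E on a made-up list of numbers:

```shell
# Hypothetical input, one number per line.
nums() { printf '7\n42\n123\n2024\n55555\n'; }

# {3}: exactly three digits on the line
exact=$(nums | grep -E '^[0-9]{3}$')

# {4,}: four or more digits
at_least=$(nums | grep -E '^[0-9]{4,}$')

# {2,4}: at least two but not more than four digits
between=$(nums | grep -E '^[0-9]{2,4}$')

printf '%s\n' "$exact" "$at_least" "$between"
```

The ^ and $ anchors are what make these whole-line tests; without them, [0-9]{3} would also match inside 2024 and 55555.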

The egrep patterns are actually a good deal more sophisticated than we have covered here. As you get more into using them, you will want to dig in deeper.