Shell¶
Overview¶
This lesson will introduce you to using the shell, which is required for using FutureSystems resources.
Acknowledgments¶
Parts of this lesson were adapted from the Software Carpentry lesson on using the Shell, which is distributed under a Creative Commons Attribution license v4 and is copyright Software Carpentry.
Description¶
A “shell” is a program that facilitates interaction between human and computer. This level of abstraction makes certain tasks, which might otherwise be cumbersome or time-consuming, relatively simple to accomplish. There are numerous types of shells. If you have ever used a computer, you have used a shell. For instance, Windows and Mac OS X use a shell based on graphical representations, with a mouse and keyboard for interaction. Touch-screen phones and tablets use another type of shell whose mode of interaction is touch. In addition to these, which you might be familiar with, we will be using a command shell to interact with the computer primarily through the keyboard.
Tip
A command-line shell is often called a Command Line Interface, or CLI for short.
Note
Adventurous readers may be interested to know that both Windows and Mac OS X provide a command shell. On Windows you can run cmd.exe from Start –> Run. On OS X open Applications –> Utilities –> Terminal. Be aware that the Windows and OS X command shells may differ from the one on FutureSystems.
Tip
If you wish to follow along, please log in to your FutureSystems account (see Use of FutureSystems).
Introduction¶
Learning Objectives¶
- Explain how the shell relates to the keyboard, the screen, the operating system, and users’ programs.
- Explain when and why command-line interfaces should be used instead of graphical interfaces.
Nelle Pipeline¶
Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 300 samples in all, and now needs to:
- Run each sample through an assay machine that will measure the relative abundance of 300 different proteins. The machine’s output for a single sample is a file with one line for each protein.
- Calculate statistics for each of the proteins separately using a program her supervisor wrote called goostat.
- Compare the statistics for each protein with corresponding statistics for each other protein using a program one of the other graduate students wrote called goodiff.
- Write up. Her supervisor would really like her to do this by the end of the month so that her paper can appear in an upcoming special issue of Aquatic Goo Letters.
It takes about half an hour for the assay machine to process each sample. The good news is, it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will “only” take about two weeks.
The bad news is that if she has to run goostat
and goodiff
by
hand, she’ll have to enter filenames and click “OK” 45,150 times (300
runs of goostat
, plus 300x299/2 runs of goodiff
). At 30 seconds
each, that will take more than two weeks. Not only would she miss her
paper deadline, the chances of her typing all of those commands right
are practically zero.
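The figures above can be checked with a few lines of shell arithmetic. This is only a sketch: the `$(( ))` arithmetic syntax and the variable names are not part of this lesson and are used here purely for illustration.

```shell
# Check the run counts and the time estimate quoted above.
count=300
goostat_runs=$count                               # one goostat run each
goodiff_runs=$(( count * (count - 1) / 2 ))       # one goodiff run per distinct pair
total_runs=$(( goostat_runs + goodiff_runs ))

seconds=$(( total_runs * 30 ))                    # 30 seconds per run
hours=$(( seconds / 3600 ))                       # integer division rounds down
days=$(( hours / 24 ))

echo "goodiff runs: $goodiff_runs"
echo "total runs:   $total_runs"
echo "hours:        $hours"
echo "whole days:   $days"
```

The total comes to 45,150 runs and 376 hours of non-stop clicking, which is 15 full days and therefore more than two weeks.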
The next few lessons will explore what she should do instead. More specifically, they explain how she can use a command shell to automate the repetitive steps in her processing pipeline so that her computer can work 24 hours a day while she writes her paper. As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.
What and Why¶
At a high level, computers do four things:
- run programs
- store data
- communicate with each other
- interact with us
They can do the last of these in many different ways, including direct brain-computer links and speech interfaces. Since these are still in their infancy, most of us use windows, icons, mice, and pointers. These technologies didn’t become widespread until the 1980s, but their roots go back to Doug Engelbart’s work in the 1960s, which you can see in what has been called “The Mother of All Demos”.
Going back even further, the only way to interact with early computers was to rewire them. But in between, from the 1950s to the 1980s, most people used line printers. These devices only allowed input and output of the letters, numbers, and punctuation found on a standard keyboard, so programming languages and interfaces had to be designed around that constraint.
This kind of interface is called a command-line interface, or CLI, to distinguish it from the graphical user interface, or GUI, that most people now use. The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the enter (or return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.
This description makes it sound as though the user sends commands directly to the computer, and the computer sends output directly to the user. In fact, there is usually a program in between called a command shell. What the user types goes into the shell; it figures out what commands to run and orders the computer to execute them. This is why the shell is called a shell: it encloses the operating system in order to hide some of its complexity and make it simpler to interact with.
A shell is a program like any other. What’s special about it is that its job is to run other programs rather than to do calculations itself. The most popular Unix shell is Bash, the Bourne Again SHell (so-called because it’s derived from a shell written by Stephen Bourne — this is what passes for wit among programmers). Bash is the default shell on most modern implementations of Unix, and in most packages that provide Unix-like tools for Windows.
Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, the shell allows us to combine existing tools in powerful ways with only a few keystrokes and to set up pipelines to handle large volumes of data automatically. In addition, the command line is often the easiest way to interact with remote machines. As clusters and cloud computing become more popular for scientific data crunching, being able to drive them is becoming a necessary skill.
Prompts and Commands¶
Shell Concepts Introduced¶
whoami: display user id
Prompt¶
Once you log into the appropriate machine you will be presented with the prompt, typically represented as the following:
$
Command¶
At the prompt you enter a command to run a program.
For instance, the whoami
program indicates the username
you logged in under.
To see this type whoami
and press enter (the result will be different
but you should recognize your username):
$ whoami
nelle
Tip
On Windows you start a program by double-clicking an icon or by going to Start –> <Program> to launch it (the commands described here are Unix commands and are unlikely to work on Windows). On OS X you might go to the dock at the bottom of the screen. In a command-line shell you type the name of the program.
When you execute the whoami
command the shell:
- finds the program called
whoami
- runs that program
- displays the program’s output
- displays a new shell prompt (ready for more commands)
Files and Directories¶
Shell Concepts Introduced¶
pwd: print working directory
ls: list directory contents
cd: change directory
TAB: using tab-completion
Learning Objectives¶
- Explain the similarities and differences between a file and a directory.
- Translate an absolute path into a relative path and vice versa.
- Construct absolute and relative paths that identify specific files and directories.
- Explain the steps in the shell’s read-run-print cycle.
- Identify the actual command, flags, and filenames in a command-line call.
- Demonstrate the use of tab completion, and explain its advantages.
The Filesystem¶
The part of the operating system responsible for managing files and directories is called the file system. It organizes our data into files, which hold information, and directories (also called “folders”), which hold files or other directories.
Next, let’s find out where we are by running a command called pwd
(which stands for “print working directory”). At any moment, our
current working directory is our current default directory, i.e.,
the directory that the computer assumes we want to run commands in
unless we explicitly specify something else. Here, the computer’s
response is /users/nelle
, which is Nelle’s home directory:
$ pwd
/users/nelle
Tip
If the command to find out who we are is whoami
, the command
to find out where we are ought to be called whereami
, so why
is it pwd
instead? The usual answer is that in the early
1970s, when Unix was first being developed, every keystroke
counted: the devices of the day were slow, and backspacing on a
teletype was so painful that cutting the number of keystrokes in
order to cut the number of typing mistakes was actually a win for
usability. The reality is that commands were added to Unix one by
one, without any master plan, by people who were immersed in its
jargon. The result is as inconsistent as the roolz uv Inglish
speling, but we’re stuck with it now.
To understand what a “home directory” is, let’s have a look at how the
file system as a whole is organized. At the top is the root
directory that holds everything else. We refer to it using a slash
character /
on its own; this is the leading slash in
/users/nelle
.
Inside that directory are several other directories: bin
(which is
where some built-in programs are stored), data
(for miscellaneous
data files), users
(where users’ personal directories are
located), tmp
(for temporary files that don’t need to be stored
long-term), and so on.
We know that our current working directory /users/nelle
is stored
inside /users
because /users
is the first part of its name.
Similarly, we know that /users
is stored inside the root directory
/
because its name begins with /
.
Underneath /users
, we find one directory for each user with an
account on this machine. The Mummy’s files are stored in
/users/imhotep
, Wolfman’s in /users/larry
, and ours in
/users/nelle
, which is why nelle
is the last part of the
directory’s name.
Notice that there are two meanings for the /
character. When it appears at the front of a file or directory name, it refers to the root directory. When it appears inside a name, it’s just a separator.
Let’s see what’s in Nelle’s home directory by running ls
, which
stands for “listing”:
$ ls
creatures molecules pizza.cfg
data north-pacific-gyre solar.pdf
Desktop notes.txt writing
ls
prints the names of the files and directories in the current
directory in alphabetical order, arranged neatly into columns. We can
make its output more comprehensible by using the flag -F
,
which tells ls
to add a trailing /
to the names of
directories:
$ ls -F
creatures/ molecules/ pizza.cfg
data/ north-pacific-gyre/ solar.pdf
Desktop/ notes.txt writing/
Here, we can see that /users/nelle
contains six
sub-directories. The names that don’t have trailing slashes, like
notes.txt
, pizza.cfg
, and solar.pdf
, are plain old files.
And note that there is a space between ls
and -F
: without it,
the shell thinks we’re trying to run a command called ls-F
, which
doesn’t exist.
What’s In A Name?¶
You may have noticed that all of Nelle’s files’ names are “something
dot something”. This is just a convention: we can call a file
mythesis
or almost anything else we want. However, most people use
two-part names most of the time to help them (and their programs) tell
different kinds of files apart. The second part of such a name is
called the filename extension, and indicates what type of data the
file holds: .txt
signals a plain text file, .pdf
indicates a
PDF document, .cfg
is a configuration file full of parameters for
some program or other, and so on.
This is just a convention, albeit an important one. Files contain bytes: it’s up to us and our programs to interpret those bytes according to the rules for PDF documents, images, and so on.
Naming a PNG image of a whale as whale.mp3
doesn’t somehow
magically turn it into a recording of whale song, though it might
cause the operating system to try to open it with a music player when
someone double-clicks it.
Parameters and Arguments¶
Now let’s take a look at what’s in Nelle’s data
directory by
running ls -F data
, i.e., the command ls
with the
arguments -F
and data
. The second argument — the one
without a leading dash — tells ls
that we want a listing of
something other than our current working directory:
$ ls -F data
amino-acids.txt elements/ morse.txt
pdb/ planets.txt sunspot.txt
The output shows us that there are four text files and two sub-sub-directories. Organizing things hierarchically in this way helps us keep track of our work: it’s possible to put hundreds of files in our home directory, just as it’s possible to pile hundreds of printed papers on our desk, but it’s a self-defeating strategy.
Notice, by the way, that we spelled the directory name data
. It
doesn’t have a trailing slash: that’s added to directory names by
ls
when we use the -F
flag to help us tell things apart. And
it doesn’t begin with a slash because it’s a relative path, i.e.,
it tells ls
how to find something from where we are, rather than
from the root of the file system.
Tip
- Parameters vs. Arguments
- According to Wikipedia, the terms argument and parameter mean slightly different things. In practice, however, most people use them interchangeably or inconsistently, so we will too.
If we run ls -F /data
(with a leading slash) we get a different
answer, because /data
is an absolute path:
$ ls -F /data
access.log backup/ hardware.cfg
network.cfg
The leading /
tells the computer to follow the path from the root
of the filesystem, so it always refers to exactly one directory, no
matter where we are when we run the command.
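The difference between the two kinds of path can be tried out safely in a throwaway directory. The sketch below is an illustration under some assumptions: `mktemp -d` and `mkdir -p` are not covered in this lesson, and the `sandbox` directory and its contents are hypothetical stand-ins for Nelle's files.

```shell
# Build a small throwaway tree so that no real files are touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/nelle/data"
touch "$sandbox/nelle/data/planets.txt"

# Relative path: resolved starting from the current working directory.
cd "$sandbox/nelle"
relative_listing=$(ls data)

# Absolute path: resolved from the root, so it works from anywhere.
cd /
absolute_listing=$(ls "$sandbox/nelle/data")

echo "relative: $relative_listing"
echo "absolute: $absolute_listing"

# Clean up the throwaway tree.
rm -r "$sandbox"
```

Both listings show the same file, but the relative form only works while the working directory is the one that contains `data`.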
Moving around¶
What if we want to change our current working directory? Before we do
this, pwd
shows us that we’re in /users/nelle
, and ls
without any arguments shows us that directory’s contents:
$ pwd
/users/nelle
$ ls
creatures molecules pizza.cfg
data north-pacific-gyre solar.pdf
Desktop notes.txt writing
We can use cd
followed by a directory name to change our working
directory. cd
stands for “change directory”, which is a bit
misleading: the command doesn’t change the directory, it changes the
shell’s idea of what directory we are in:
$ cd data
cd
doesn’t print anything, but if we run pwd
after it, we can
see that we are now in /users/nelle/data
. If we run ls
without
arguments now, it lists the contents of /users/nelle/data
, because
that’s where we now are:
$ pwd
/users/nelle/data
$ ls -F
amino-acids.txt elements/ morse.txt
pdb/ planets.txt sunspot.txt
We now know how to go down the directory tree: how do we go up? We could use an absolute path:
$ cd /users/nelle
but it’s almost always simpler to use cd ..
to go up one level:
$ pwd
/users/nelle/data
$ cd ..
..
is a special directory name meaning “the directory containing
this one”, or more succinctly, the parent of the current directory.
Sure enough, if we run pwd
after running cd ..
, we’re back in
/users/nelle
:
$ pwd
/users/nelle
The special directory ..
doesn’t usually show up when we run ls
.
If we want to display it, we can give ls
the -a
flag:
$ ls -F -a
./ creatures/ notes.txt
../ data/ pizza.cfg
.bash_profile molecules/ solar.pdf
Desktop/ north-pacific-gyre/ writing/
-a stands for “show all”; it forces ls to show us file and directory names that begin with ., such as .. (which, if we’re in /users/nelle, refers to the /users directory). As you can see, it also displays another special directory that’s just called ., which means “the current working directory”. It may seem redundant to have a name for it, but we’ll see some uses for it soon.
Finally, we also see a file called .bash_profile. This file usually contains settings to customize the shell (terminal). There may also be similar files called .bashrc or .bash_login. For this lesson material it does not contain any settings.
Tip
- Orthogonality
- The special names . and .. don’t belong to ls; they are interpreted the same way by every program. For example, if we are in /users/nelle/data, the command ls .. will give us a listing of /users/nelle. When the meanings of the parts are the same no matter how they’re combined, programmers say they are orthogonal. Orthogonal systems tend to be easier for people to learn because there are fewer special cases and exceptions to keep track of.
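As a small illustration of this orthogonality, the sketch below uses `.` and `..` with both `ls` and `cd`. It assumes a throwaway directory created with `mktemp -d` (not covered in this lesson); the directory and file names are hypothetical.

```shell
# Throwaway tree: nelle/ contains data/ and notes.txt.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/nelle/data"
touch "$sandbox/nelle/notes.txt"

cd "$sandbox/nelle/data"
parent_listing=$(ls ..)     # ls understands .. as "the parent directory"

cd ..                       # cd understands .. in exactly the same way
parent_dir=$(pwd)

cd .                        # . is the current directory, so this goes nowhere
same_dir=$(pwd)

echo "$parent_listing"
echo "$parent_dir"
echo "$same_dir"

cd /
rm -r "$sandbox"
```

The listing shows `data` and `notes.txt`, and the last two lines are identical because `cd .` leaves us where we already were.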
Nelle’s Pipeline: Organizing Files¶
Knowing just this much about files and directories, Nelle is ready to
organize the files that the protein assay machine will create. First,
she creates a directory called north-pacific-gyre
(to remind
herself where the data came from). Inside that, she creates a
directory called 2012-07-03
, which is the date she started
processing the samples. She used to use names like
conference-paper
and revised-results
, but she found them hard
to understand after a couple of years. (The final straw was when she
found herself creating a directory called
revised-revised-results-3
.)
Nelle names her directories “year-month-day”, with leading zeroes for months and days, because the shell displays file and directory names in alphabetical order. If she used month names, December would come before July; if she didn’t use leading zeroes, November (‘11’) would come before July (‘7’).
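The effect of leading zeroes on ordering can be seen with a quick experiment. This is a sketch only: it uses `sort`, which is introduced later in this lesson, and sets `LC_ALL=C` (an assumption made here) so that the comparison is plain character-by-character order on every system.

```shell
# Without leading zeroes, '11' sorts before '7' because '1' comes before '7'.
unpadded=$(printf '%s\n' 2012-7-03 2012-11-03 | LC_ALL=C sort)

# With leading zeroes, alphabetical order matches chronological order.
padded=$(printf '%s\n' 2012-07-03 2012-11-03 | LC_ALL=C sort)

echo "unpadded names sort as:"
echo "$unpadded"
echo "zero-padded names sort as:"
echo "$padded"
```

In the unpadded case November is listed before July; in the zero-padded case July correctly comes first.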
Each of her physical samples is labeled according to her lab’s
convention with a unique ten-character ID, such as “NENE01729A”. This
is what she used in her collection log to record the location, time,
depth, and other characteristics of the sample, so she decides to use
it as part of each data file’s name. Since the assay machine’s output
is plain text, she will call her files NENE01729A.txt
,
NENE01812A.txt
, and so on. All 1520 files will go into the same
directory.
If she is in her home directory, Nelle can see what files she has using the command:
$ ls north-pacific-gyre/2012-07-03/
This is a lot to type, but she can let the shell do most of the work. If she types:
$ ls nor
and then presses tab, the shell automatically completes the directory name for her:
$ ls north-pacific-gyre/
If she presses tab again, Bash will add 2012-07-03/
to the
command, since it’s the only possible completion. Pressing tab again
does nothing, since there are 1520 possibilities; pressing tab twice
brings up a list of all the files, and so on. This is called tab
completion, and we will see it in many other tools as we go on.
Creating and Deleting¶
Shell Concepts Introduced¶
mkdir: make a directory
nano: a text editor
rm: remove directory entries
rmdir: remove directories
cp: copy files
mv: move files
Learning Objectives¶
- Create a directory hierarchy that matches a given diagram.
- Create files in that hierarchy using an editor or by copying and renaming existing files.
- Display the contents of a directory using the command line.
- Delete specified files and/or directories.
Creating Directories¶
We now know how to explore files and directories, but how do we create
them in the first place? Let’s go back to Nelle’s home directory,
/users/nelle
, and use ls -F
to see what it contains:
$ pwd
/users/nelle
$ ls -F
creatures/ molecules/ pizza.cfg
data/ north-pacific-gyre/ solar.pdf
Desktop/ notes.txt writing/
Let’s create a new directory called thesis
using the command
mkdir thesis
(which has no output):
$ mkdir thesis
As you might (or might not) guess from its name, mkdir
means “make
directory”. Since thesis
is a relative path (i.e., doesn’t have a
leading slash), the new directory is created in the current working
directory:
$ ls -F
creatures/ north-pacific-gyre/ thesis/
data/ notes.txt writing/
Desktop/ pizza.cfg
molecules/ solar.pdf
However, there’s nothing in it yet:
$ ls -F thesis
Creating Files¶
One of the simplest ways to create an empty file is via the touch
command. Change the working directory to thesis
using cd
, then
touch an empty file called draft.txt
:
$ cd thesis
$ touch draft.txt
If we check the directory contents now, we see the new file:
$ ls -F .
draft.txt
We are already in thesis, so we can now run a text editor called Nano
to add some text to draft.txt:
$ nano draft.txt
Tip
- Which Editor?
A text editor is a program that edits plain text. Nano is one example of such a program; there are many more.
Editors will be covered in a later lesson: Editing Files
Let’s type in a few lines of text, then use Control-O to write our data to disk.
Once our file is saved, we can use Control-X to quit the editor and
return to the shell. (Unix documentation often uses the shorthand
^A
to mean “control-A”.) nano
doesn’t leave any output on the
screen after it exits, but ls
now shows that we have created a
file called draft.txt
:
$ ls
draft.txt
Removing Files¶
Let’s tidy up by running rm draft.txt
:
$ rm draft.txt
This command removes files (“rm” is short for “remove”). If we run
ls
again, its output is empty once more, which tells us that our
file is gone:
$ ls
Caution
- Deleting Is Forever
- The Unix shell doesn’t have a trash bin that we can recover deleted files from (though most graphical interfaces to Unix do). Instead, when we delete files, they are unhooked from the file system so that their storage space on disk can be recycled. Tools for finding and recovering deleted files do exist, but there’s no guarantee they’ll work in any particular situation, since the computer may recycle the file’s disk space right away.
Removing Directories¶
Let’s re-create that file and then move up one directory to
/users/nelle
using cd ..
:
$ pwd
/users/nelle/thesis
$ nano draft.txt
$ ls
draft.txt
$ cd ..
If we try to remove the entire thesis
directory using rm
thesis
, we get an error message:
$ rm thesis
rm: cannot remove `thesis': Is a directory
This happens because rm
only works on files, not directories. The
right command is rmdir
, which is short for “remove directory”. It
doesn’t work yet either, though, because the directory we’re trying to
remove isn’t empty:
$ rmdir thesis
rmdir: failed to remove `thesis': Directory not empty
This little safety feature can save you a lot of grief, particularly
if you are a bad typist. To really get rid of thesis
we must first
delete the file draft.txt
:
$ rm thesis/draft.txt
The directory is now empty, so rmdir
can delete it:
$ rmdir thesis
Caution
- With Great Power Comes Great Responsibility
Removing the files in a directory just so that we can remove the directory quickly becomes tedious. Instead, we can use rm with the -r flag (which stands for “recursive”):
$ rm -r thesis
This removes everything in the directory, then the directory itself. If the directory contains sub-directories, rm -r does the same thing to them, and so on. It’s very handy, but can do a lot of damage if used without care.
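The contrast between rmdir and rm -r can be tried without risk in a throwaway directory. The sketch below assumes `mktemp -d` for the scratch space and uses an `if` test and `2>/dev/null` (which hides the error message); none of these are covered in this lesson, and the variable names are illustrative.

```shell
# Work in a throwaway directory so nothing important can be deleted.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir thesis
touch thesis/draft.txt

# rmdir refuses to remove a directory that still contains files.
if rmdir thesis 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi

# rm -r removes the directory together with everything inside it.
rm -r thesis
if [ -e thesis ]; then
    after_rm="still there"
else
    after_rm="gone"
fi

echo "rmdir: $rmdir_result"
echo "after rm -r: $after_rm"

cd /
rm -r "$sandbox"
```

Because rm -r does not ask for confirmation, it is worth double-checking the path before pressing enter.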
Moving Files and Directories¶
Let’s create that directory and file one more time. (Note that this
time we’re running nano
with the path thesis/draft.txt
, rather
than going into the thesis
directory and running nano
on
draft.txt
there.)
$ pwd
/users/nelle
$ mkdir thesis
$ nano thesis/draft.txt
$ ls thesis
draft.txt
draft.txt
isn’t a particularly informative name, so let’s change
the file’s name using mv
, which is short for “move”:
$ mv thesis/draft.txt thesis/quotes.txt
The first parameter tells mv
what we’re “moving”, while the second
is where it’s to go. In this case, we’re moving thesis/draft.txt
to thesis/quotes.txt
, which has the same effect as renaming the
file. Sure enough, ls
shows us that thesis
now contains one
file called quotes.txt
:
$ ls thesis
quotes.txt
Just for the sake of inconsistency, mv
also works on directories
— there is no separate mvdir
command.
Let’s move quotes.txt
into the current working directory. We use
mv
once again, but this time we’ll just use the name of a
directory as the second parameter to tell mv
that we want to keep
the filename, but put the file somewhere new. (This is why the command
is called “move”.) In this case, the directory name we use is the
special directory name .
that we mentioned earlier:
$ mv thesis/quotes.txt .
The effect is to move the file from the directory it was in to the
current working directory. ls
now shows us that thesis
is
empty:
$ ls thesis
Further, ls
with a filename or directory name as a parameter only
lists that file or directory. We can use this to see that
quotes.txt
is still in our current directory:
$ ls quotes.txt
quotes.txt
Copying Files¶
The cp
command works very much like mv
, except it copies a
file instead of moving it. We can check that it did the right thing
using ls
with two paths as parameters — like most Unix commands,
ls
can be given thousands of paths at once:
$ cp quotes.txt thesis/quotations.txt
$ ls quotes.txt thesis/quotations.txt
quotes.txt thesis/quotations.txt
To prove that we made a copy, let’s delete the quotes.txt
file in
the current directory and then run that same ls
again:
$ rm quotes.txt
$ ls quotes.txt thesis/quotations.txt
ls: cannot access quotes.txt: No such file or directory
thesis/quotations.txt
This time it tells us that it can’t find quotes.txt
in the current
directory, but it does find the copy in thesis
that we didn’t
delete.
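The difference between copying and moving can be summarized in one short experiment. This is a sketch in a throwaway directory (created with `mktemp -d`, which this lesson does not cover); the file contents and variable names are made up for illustration, and the `>` redirection used to create the file is explained later in the lesson.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir thesis
echo "sample text" > quotes.txt

# cp leaves the original in place and creates a second file.
cp quotes.txt thesis/quotations.txt
after_cp=$(ls quotes.txt thesis/quotations.txt)

# mv takes the original away from its old location.
mv quotes.txt thesis/quotes.txt
after_mv=$(ls thesis)
if [ -e quotes.txt ]; then
    original="still here"
else
    original="gone"
fi

echo "$after_cp"
echo "$after_mv"
echo "original after mv: $original"

cd /
rm -r "$sandbox"
```

After the copy both paths exist; after the move the file is only found inside `thesis`.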
Tip
- Another Useful Abbreviation
- The shell interprets the character ~ (tilde) at the start of a path to mean “the current user’s home directory”. For example, if Nelle’s home directory is /home/nelle, then ~/data is equivalent to /home/nelle/data. This only works if it is the first character in the path: here/there/~/elsewhere is not /home/nelle/elsewhere.
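A quick way to see this rule in action is to let `echo` print a path after the shell has processed it. This is a minimal sketch; the paths are only examples and do not need to exist, and the expanded result depends on your own home directory.

```shell
# Tilde at the start of the path: the shell replaces it with the home directory.
expanded=$(echo ~/data)

# Tilde in the middle of the path: the shell leaves it alone.
unexpanded=$(echo here/there/~/elsewhere)

echo "$expanded"
echo "$unexpanded"
```

The first line printed starts with your home directory, while the second still contains a literal `~`.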
Exercises¶
Renaming files¶
Suppose that you created a .txt
file in your current directory
to contain a list of the statistical tests you will need to do to
analyze your data, and named it: statstics.txt
After creating and saving this file, you realize you misspelled the filename! Which of the following commands could you use to correct the mistake?
cp statstics.txt statistics.txt
mv statstics.txt statistics.txt
mv statstics.txt .
cp statstics.txt .
Moving and Copying¶
What is the output of the closing ls
command in the sequence
shown below?
$ pwd
/home/jamie/data
$ ls
proteins.dat
$ mkdir recombine
$ mv proteins.dat recombine
$ cp recombine/proteins.dat ../proteins-saved.dat
$ ls
proteins-saved.dat recombine
recombine
proteins.dat recombine
proteins-saved.dat
Listing Directories and Files¶
Suppose that:
$ ls -F
analyzed/ fructose.dat raw/ sucrose.dat
What command(s) could you run so that the commands below will produce the output shown?
$ ls
analyzed/ raw/
$ ls analyzed
fructose.dat sucrose.dat
Copy with Multiple Filenames¶
What does cp
do when given several filenames and a directory
name, as in:
$ mkdir backup
$ cp thesis/citations.txt thesis/quotations.txt backup
What does cp
do when given three or more filenames, as in:
$ ls -F
intro.txt methods.txt survey.txt
$ cp intro.txt methods.txt survey.txt
Listing Recursively and By Time¶
The command ls -R
lists the contents of directories recursively,
i.e., lists their sub-directories, sub-sub-directories, and so on in
alphabetical order at each level. The command ls -t
lists things
by time of last change, with most recently changed files or
directories first. In what order does ls -R -t
display things?
Pipes and Filters¶
Shell Concepts Introduced¶
wc: word count
*: globbing
>: redirection to file
stdout: standard output stream
cat: concatenate
sort: sorting
|: pipe
head: get first few lines
uniq: remove duplicate adjacent lines
cut: cut selected portions of text
Learning Objectives¶
- Redirect a command’s output to a file.
- Process a file instead of keyboard input using redirection.
- Construct command pipelines with two or more stages.
- Explain what usually happens if a program or pipeline isn’t given any input to process.
- Explain Unix’s “small pieces, loosely joined” philosophy.
Globbing/Wildcards¶
Now that we know a few basic commands, we can finally look at the
shell’s most powerful feature: the ease with which it lets us combine
existing programs in new ways. We’ll start with a directory called
molecules
that contains six files describing some simple organic
molecules. The .pdb
extension indicates that these files are in
Protein Data Bank format, a simple text format that specifies the type
and position of each atom in the molecule:
$ ls molecules
cubane.pdb ethane.pdb methane.pdb
octane.pdb pentane.pdb propane.pdb
Let’s go into that directory with cd
and run the command wc
*.pdb
. wc
is the “word count” command: it counts the number of
lines, words, and characters in files. The *
in *.pdb
matches
zero or more characters, so the shell turns *.pdb
into a complete
list of .pdb
files:
$ cd molecules
$ wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
Tip
- Wildcards
* is a wildcard. It matches zero or more characters, so *.pdb matches ethane.pdb, propane.pdb, and so on. On the other hand, p*.pdb only matches pentane.pdb and propane.pdb, because the ‘p’ at the front only matches itself.
? is also a wildcard, but it only matches a single character. This means that p?.pdb matches pi.pdb or p5.pdb, but not propane.pdb. We can use any number of wildcards at a time: for example, p*.p?* matches anything that starts with a ‘p’ and ends with ‘.’, ‘p’, and at least one more character (since the ‘?’ has to match one character, and the final ‘*’ can match any number of characters). Thus, p*.p?* would match preferred.practice, and even p.pi (since the first ‘*’ can match no characters at all), but not quality.practice (doesn’t start with ‘p’) or preferred.p (there isn’t at least one character after the ‘.p’).
When the shell sees a wildcard, it expands the wildcard to create a list of matching filenames before running the command that was asked for. As an exception, if a wildcard expression does not match any file, Bash will pass the expression as a parameter to the command as it is. For example, typing ls *.pdf in the molecules directory (which contains only files with names ending with .pdb) results in an error message that there is no file called *.pdf. However, generally commands like wc and ls see the lists of file names matching these expressions, but not the wildcards themselves. It is the shell, not the other programs, that deals with expanding wildcards, and this is another example of orthogonal design.
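Because the shell expands wildcards before the command runs, `echo` is a convenient way to see exactly what a pattern turns into. The sketch below recreates empty files with the same names as the molecules directory inside a throwaway directory (made with `mktemp -d`, an assumption of this example) and assumes Bash's default settings for unmatched patterns.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

all_pdb=$(echo *.pdb)        # every name ending in .pdb
p_pdb=$(echo p*.pdb)         # only names that start with 'p'
one_char=$(echo ?thane.pdb)  # exactly one character before 'thane.pdb'
no_match=$(echo *.pdf)       # nothing matches, so the pattern is passed through as typed

echo "$all_pdb"
echo "$p_pdb"
echo "$one_char"
echo "$no_match"

cd /
rm -r "$sandbox"
```

Note that `?thane.pdb` matches ethane.pdb but not methane.pdb, since `?` stands for exactly one character.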
If we run wc -l
instead of just wc
, the output shows only the
number of lines per file:
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
We can also use -w
to get only the number of words, or -c
to
get only the number of characters.
Redirecting Output¶
Which of these files is shortest? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >
, tells the shell to redirect the
command’s output to a file instead of printing it to the screen. The
shell will create the file if it doesn’t exist, or overwrite the
contents of that file if it does. (This is why there is no screen
output: everything that wc
would have printed has gone into the
file lengths.txt
instead.) ls lengths.txt
confirms that the
file exists:
$ ls lengths.txt
lengths.txt
We can now send the content of lengths.txt
to the screen using
cat lengths.txt
. cat
stands for “concatenate”: it prints the
contents of files one after another. There’s only one file in this
case, so cat
just shows us what it contains:
$ cat lengths.txt
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
Now let’s use the sort
command to sort its contents. We will also
use the -n flag to specify that the sort is numerical instead of
alphabetical. This does not change the file; instead, it sends the
sorted result to the screen:
$ sort -n lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
We can put the sorted list of lines in another temporary file called
sorted-lengths.txt
by putting > sorted-lengths.txt
after the
command, just as we used > lengths.txt
to put the output of wc
into lengths.txt
. Once we’ve done that, we can run another command
called head
to get the first few lines in sorted-lengths.txt
:
$ sort -n lengths.txt > sorted-lengths.txt
$ head -1 sorted-lengths.txt
9 methane.pdb
Using the parameter -1
with head
tells it that we only want
the first line of the file; -20
would get the first 20, and so
on. Since sorted-lengths.txt
contains the lengths of our files
ordered from least to greatest, the output of head
must be the
file with the fewest lines.
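The whole sequence can be tried on a few small stand-in files. This is only a sketch under assumptions: the file names and contents below are invented for illustration, and head -n 1 is used as the standardized spelling of head -1.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Three stand-in files with 3, 1, and 2 lines (not the real molecule data).
printf 'a\nb\nc\n' > long.pdb
printf 'a\n'       > short.pdb
printf 'a\nb\n'    > medium.pdb

wc -l *.pdb > lengths.txt                  # nothing is printed: output goes to the file
sort -n lengths.txt > sorted-lengths.txt   # lengths.txt itself is left unchanged
shortest=$(head -n 1 sorted-lengths.txt)   # first line = file with the fewest lines

echo "$shortest"
cd / && rm -rf "$workdir"
```

Because lengths.txt does not end in .pdb, it is not picked up by the *.pdb wildcard on the first command.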
Redirecting Input¶
As well as using >
to redirect a program’s output, we can use
<
to redirect its input, i.e., to read from a file instead of from
standard input. For example, instead of writing wc ammonia.pdb
, we
could write wc < ammonia.pdb
. In the first case, wc
gets a
command line parameter telling it what file to open. In the second,
wc
doesn’t have any command line parameters, so it reads from
standard input, but we have told the shell to send the contents of
ammonia.pdb
to wc
‘s standard input.
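One visible consequence of this difference is that wc can only report a file name when it was given one. A small sketch, assuming a made-up three-line ammonia.pdb in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'ATOM 1\nATOM 2\nATOM 3\n' > ammonia.pdb   # stand-in contents

# wc opens the file itself, so it knows the name and reports it.
with_name=$(wc -l ammonia.pdb)
# The shell opens the file and connects it to standard input;
# wc never sees a name, so it prints only the count.
without_name=$(wc -l < ammonia.pdb)

echo "$with_name"
echo "$without_name"
cd / && rm -rf "$workdir"
```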
Pipes¶
If you think having to use so many intermediate files is confusing, you’re
in good company: even once you understand what wc
, sort
, and
head
do, all those intermediate files make it hard to follow
what’s going on. We can make it easier to understand by running
sort
and head
together:
$ sort -n lengths.txt | head -1
9 methane.pdb
The vertical bar between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right. The computer might create a temporary file if it needs to, or copy data from one program to the other in memory, or something else entirely; we don’t have to know or care.
We can use another pipe to send the output of wc
directly to
sort
, which then sends its output to head
:
$ wc -l *.pdb | sort -n | head -1
9 methane.pdb
This is exactly like a mathematician nesting functions like log(3x)
and saying “the log of three times x”. In our case, the calculation
is “head of sort of line count of *.pdb
”.
Here’s what actually happens behind the scenes when we create a pipe. When a computer runs a program, any program, it creates a process in memory to hold the program’s software and its current state. Every process has an input channel called standard input. (By this point, you may be surprised that the name is so memorable, but don’t worry: most Unix programmers call it “stdin”.) Every process also has a default output channel called standard output (or “stdout”).
The shell is actually just another program. Under normal circumstances, whatever we type on the keyboard is sent to the shell on its standard input, and whatever it produces on standard output is displayed on our screen. When we tell the shell to run a program, it creates a new process and temporarily sends whatever we type on our keyboard to that process’s standard input, and whatever the process sends to standard output to the screen.
Here’s what happens when we run wc -l *.pdb > lengths.txt
. The
shell starts by telling the computer to create a new process to run
the wc
program. Since we’ve provided some filenames as parameters,
wc
reads from them instead of from standard input. And since we’ve
used >
to redirect output to a file, the shell connects the
process’s standard output to that file.
If we run wc -l *.pdb | sort -n
instead, the shell creates two
processes (one for each process in the pipe) so that wc
and
sort
run simultaneously. The standard output of wc
is fed
directly to the standard input of sort
; since there’s no
redirection with >
, sort
‘s output goes to the screen. And if
we run wc -l *.pdb | sort -n | head -1
, we get three processes
with data flowing from the files, through wc
to sort
, and from
sort
through head
to the screen.
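To check that the pipeline really is equivalent to the version with intermediate files, the two can be run side by side. The sketch below assumes two invented stand-in files in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > long.pdb
printf 'a\n'       > short.pdb

# Version with intermediate files.
wc -l *.pdb > lengths.txt
sort -n lengths.txt > sorted-lengths.txt
via_files=$(head -n 1 sorted-lengths.txt)

# Same computation as one pipeline: three processes, no files left behind.
via_pipe=$(wc -l *.pdb | sort -n | head -n 1)

echo "$via_pipe"
cd / && rm -rf "$workdir"
```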
Filter¶
This simple idea is why Unix has been so successful. Instead of
creating enormous programs that try to do many different things, Unix
programmers focus on creating lots of simple tools that each do one
job well, and that work well with each other. This programming model
is called “pipes and filters”. We’ve already seen pipes; a filter
is a program like wc
or sort
that transforms a stream of input
into a stream of output. Almost all of the standard Unix tools can
work this way: unless told to do otherwise, they read from standard
input, do something with what they’ve read, and write to standard
output.
The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
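As an illustration of writing our own filter, a shell function that only reads standard input and writes standard output can be dropped into a pipeline like any other tool. In this sketch drop_total is a hypothetical helper name, and the files are invented for the example.

```shell
# drop_total: a hypothetical filter that reads wc-style lines on standard
# input and writes every line except the "total" summary line.
drop_total() {
    grep -v ' total$'
}

workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\n' > one.txt
printf 'a\n'    > two.txt

# Because drop_total only uses stdin and stdout, it slots into any pipeline.
longest=$(wc -l *.txt | drop_total | sort -n | tail -n 1)

echo "$longest"
cd / && rm -rf "$workdir"
```

Without the filter, the last line after sorting would be the total rather than the longest file.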
Nelle’s Pipeline: Checking Files¶
Nelle has run her samples through the assay machines and created 1520
files in the north-pacific-gyre/2012-07-03
directory described
earlier. As a quick sanity check, she types:
$ cd north-pacific-gyre/2012-07-03
$ wc -l *.txt
The output is 1520 lines that look like this:
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
300 NENE01751B.txt
300 NENE01812A.txt
... ...
Now she types this:
$ wc -l *.txt | sort -n | head -5
240 NENE02018B.txt
300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
Whoops: one of the files is 60 lines shorter than the others. When she goes back and checks it, she sees that she did that assay at 8:00 on a Monday morning — someone was probably in using the machine on the weekend, and she forgot to reset it. Before re-running that sample, she checks to see if any files have too much data:
$ wc -l *.txt | sort -n | tail -5
300 NENE02040A.txt
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
Those numbers look good — but what’s that ‘Z’ doing there in the third-to-last line? All of her samples should be marked ‘A’ or ‘B’; by convention, her lab uses ‘Z’ to indicate samples with missing information. To find others like it, she does this:
$ ls *Z.txt
NENE01971Z.txt NENE02040Z.txt
Sure enough, when she checks the log on her laptop, there’s no depth
recorded for either of those samples. Since it’s too late to get the
information any other way, she must exclude those two files from her
analysis. She could just delete them using rm
, but there are
actually some analyses she might do later where depth doesn’t matter,
so instead, she’ll just be careful later on to select files using the
wildcard expression *[AB].txt
. As always, the ‘*’ matches any
number of characters; the expression [AB]
matches either an ‘A’ or
a ‘B’, so this matches all the valid data files she has.
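A quick way to convince ourselves that the expression selects the right files is to try it on a few empty files with the same naming scheme. This is a sketch in a scratch directory; only three of Nelle's file names are recreated.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch NENE01729A.txt NENE01729B.txt NENE02040Z.txt

valid=$(echo *[AB].txt)     # names ending in A.txt or B.txt
flagged=$(echo *Z.txt)      # names ending in Z.txt

echo "valid:   $valid"
echo "flagged: $flagged"
cd / && rm -rf "$workdir"
```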
Exercises¶
What does sort -n
do?¶
If we run sort
on this file:
10
2
19
22
6
the output is:
10
19
2
22
6
If we run sort -n
on the same input, we get this instead:
2
6
10
19
22
Explain why -n
has this effect.
Piping commands together¶
In our current directory, we want to find the 3 files which have the fewest lines. Which command listed below would work?
wc -l * > sort -n > head -3
wc -l * | sort -n | head 1-3
wc -l * | head -3 | sort -n
wc -l * | sort -n | head -3
Why does uniq
only remove adjacent duplicates?¶
The command uniq
removes adjacent duplicated lines from its
input. For example, if a file salmon.txt
contains:
coho
coho
steelhead
coho
steelhead
steelhead
then uniq salmon.txt
produces:
coho
steelhead
coho
steelhead
Why do you think uniq
only removes adjacent duplicated
lines? (Hint: think about very large data sets.) What other
command could you combine with it in a pipe to remove all
duplicated lines?
Pipe reading comprehension¶
A file called animals.txt
contains the following data:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
What text passes through each of the pipes and the final redirect in the pipeline below?
cat animals.txt | head -5 | tail -3 | sort -r > final.txt
Pipe construction¶
The command:
$ cut -d , -f 2 animals.txt
produces the following output:
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
What other command(s) could be added to this in a pipeline to find out what animals the file contains (without any duplicates in their names)?
Loops¶
Shell Concepts Introduced¶
for
: starts a for loop
$name
: a variable called name
echo
: display text
Learning Objectives¶
- Write a loop that applies one or more commands separately to each file in a set of files.
- Trace the values taken on by a loop variable during execution of the loop.
- Explain the difference between a variable’s name and its value.
- Explain why spaces and some punctuation characters shouldn’t be used in files’ names.
- Demonstrate how to see what commands have recently been executed.
- Re-run recently executed commands without retyping them.
Loops¶
Wildcards and tab completion are two ways to reduce typing (and typing
mistakes). Another is to tell the shell to do something over and over
again. Suppose we have several hundred genome data files named
basilisk.dat
, unicorn.dat
, and so on. In this example, we’ll
use the creatures
directory which only has two example files, but
the principles can be applied to many many more files at once. We
would like to modify these files, but also save a version of the
original files and rename them as original-basilisk.dat
and
original-unicorn.dat
. We can’t use:
$ mv *.dat original-*.dat
because that would expand to:
$ mv basilisk.dat unicorn.dat original-*.dat
This wouldn’t back up our files; instead, we get an error:
mv: target `original-*.dat' is not a directory
This problem arises when mv
receives more than two inputs. When
this happens, it expects the last input to be a directory where it can
move all the files it was passed to. Since there is no directory named
original-*.dat
in the creatures
directory we get an error.
Instead, we can use a loop to do some operation once for each thing in a list. Here’s a simple example that displays the first three lines of each file in turn:
$ for filename in basilisk.dat unicorn.dat
> do
> head -3 $filename
> done
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
When the shell sees the keyword for
, it knows it is supposed to
repeat a command (or group of commands) once for each thing in a list.
In this case, the list is the two filenames. Each time through the
loop, the name of the thing currently being operated on is assigned to
the variable called filename
. Inside the loop, we get the
variable’s value by putting $
in front of it: $filename
is
basilisk.dat
the first time through the loop, unicorn.dat
the
second, and so on.
Variables¶
By using the dollar sign we are telling the shell interpreter to treat
filename
as a variable name and substitute its value in its place,
rather than treating it as literal text or an external command. When using variables it is
also possible to put the names into curly braces to clearly delimit
the variable name: $filename
is equivalent to ${filename}
, but
is different from ${file}name
. You may find this notation in other
people’s programs.
Finally, the command that’s actually being run is our old friend
head
, so this loop prints out the first three lines of each data
file in turn.
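The values taken by the loop variable, and the effect of the curly braces, can be traced with a short sketch. The variable names visited, file, braces_same, and braces_differ are made up for this illustration.

```shell
# Record each value the loop variable takes, in order.
visited=""
for filename in basilisk.dat unicorn.dat
do
    visited="$visited ${filename}"
done

file=unicorn
braces_same="${filename}"      # same as $filename (its last value, unicorn.dat)
braces_differ="${file}name"    # value of $file followed by the literal text "name"

echo "visited:$visited"
echo "$braces_same"
echo "$braces_differ"
```

Note that the loop variable keeps its last value after the loop has finished.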
Tip
- The Prompt Changes
- The shell prompt changes from $
to >
and back again as we were typing in our loop. The second prompt, >
, is different to remind us that we haven’t finished typing a complete command yet.
We have called the variable in this loop filename
in order to make
its purpose clearer to human readers. The shell itself doesn’t care
what the variable is called; if we wrote this loop as:
for x in basilisk.dat unicorn.dat
do
head -3 $x
done
or:
for temperature in basilisk.dat unicorn.dat
do
head -3 $temperature
done
it would work exactly the same way. Don’t do this. Programs are only
useful if people can understand them, so meaningless names (like
x
) or misleading names (like temperature
in this case)
increase the odds that the program won’t do what its readers think it
does.
Informative Loops¶
Here’s a slightly more complicated loop:
for filename in *.dat
do
echo $filename
head -100 $filename | tail -20
done
The shell starts by expanding *.dat
to create the list of files it
will process. The loop body then executes two commands for each of
those files. The first, echo
, just prints its command-line
parameters to standard output. For example:
$ echo hello there
prints:
hello there
In this case, since the shell expands $filename
to be the name of
a file, echo $filename
just prints the name of the file. Note that
we can’t write this as:
for filename in *.dat
do
$filename
head -100 $filename | tail -20
done
because then the first time through the loop, when $filename
expanded to basilisk.dat
, the shell would try to run
basilisk.dat
as a program. Finally, the head
and tail
combination selects lines 81-100 from whatever file is being
processed.
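To see that the combination picks out lines 81 to 100, we can feed it numbered lines. The sketch below generates the numbers 1 to 200 with awk as a stand-in for a data file.

```shell
# Generate the numbers 1..200, one per line, as a stand-in for a data file.
numbers=$(awk 'BEGIN { for (i = 1; i <= 200; i++) print i }')

# head keeps lines 1-100; tail then keeps the last 20 of those.
selected=$(printf '%s\n' "$numbers" | head -n 100 | tail -n 20)

first=$(printf '%s\n' "$selected" | head -n 1)
last=$(printf '%s\n' "$selected" | tail -n 1)
echo "lines $first to $last"
```

If a file has fewer than 100 lines, head passes all of them along and tail simply returns the last 20 of whatever it receives.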
Spaces in Names¶
Filename expansion in loops is another reason you should not use spaces in filenames. Suppose our data files are named:
basilisk.dat
red dragon.dat
unicorn.dat
If we try to process them using:
for filename in *.dat
do
head -100 $filename | tail -20
done
then the shell will expand *.dat
to create the list:
basilisk.dat
red dragon.dat
unicorn.dat
Each name is assigned to filename
in turn, spaces and all. The trouble starts inside the loop body: because $filename
is not quoted there, the shell splits its value at the space, so on the second pass head
receives two separate arguments:
red
dragon.dat
That’s a problem: head
can’t read files called red
and
dragon.dat
because they don’t exist, and won’t be asked to read
the file red dragon.dat
.
We can make our script a little bit more robust by quoting our use of the variable:
for filename in *.dat
do
head -100 "$filename" | tail -20
done
but it’s simpler just to avoid using spaces (or other special characters) in filenames.
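The effect of quoting can be measured by counting how many words the shell produces from a value that contains a space. This is a small Bash sketch; it uses set -- to load the words into the positional parameters and $# to count them.

```shell
name='red dragon.dat'

# Unquoted: the shell splits the value at the space into two words.
set -- $name
unquoted_words=$#

# Quoted: the value stays together as a single word.
set -- "$name"
quoted_words=$#

echo "unquoted: $unquoted_words word(s), quoted: $quoted_words word(s)"
```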
Going back to our original file renaming problem, we can solve it using this loop:
for filename in *.dat
do
mv $filename original-$filename
done
This loop runs the mv
command once for each filename. The first
time, when $filename
expands to basilisk.dat
, the shell
executes:
mv basilisk.dat original-basilisk.dat
The second time, the command is:
mv unicorn.dat original-unicorn.dat
Tip
- Measure Twice, Run Once
A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them. For example, we could write our file renaming loop like this:
for filename in *.dat
do
echo mv $filename original-$filename
done
Instead of running mv
, this loop runs echo
, which prints out:
mv basilisk.dat original-basilisk.dat
mv unicorn.dat original-unicorn.dat
without actually running those commands. We can then use up-arrow to redisplay the loop, back-arrow to get to the word echo
, delete it, and then press “enter” to run the loop with the actual mv
commands. This isn’t foolproof, but it’s a handy way to see what’s going to happen when you’re still learning how loops work.
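The dry run and the real run can be compared in a scratch directory. This sketch assumes two empty stand-in files and quotes the variable, as recommended above for names that might contain spaces.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch basilisk.dat unicorn.dat

# Dry run: echo prints each mv command instead of executing it.
preview=$(for filename in *.dat; do echo mv "$filename" "original-$filename"; done)
echo "$preview"

# Real run: the same loop with echo removed.
for filename in *.dat
do
    mv "$filename" "original-$filename"
done
renamed=$(echo *.dat)

echo "$renamed"
cd / && rm -rf "$workdir"
```

The wildcard is expanded once, before the loop starts, so the newly created original- files are not picked up and renamed a second time.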
Nelle’s Pipeline: Processing Files¶
Nelle is now ready to process her data files. Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’:
$ cd north-pacific-gyre/2012-07-03
$ for datafile in *[AB].txt
> do
> echo $datafile
> done
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
NENE02043A.txt
NENE02043B.txt
Her next step is to decide what to call the files that the
goostats
analysis program will create. Prefixing each input file’s
name with “stats” seems simple, so she modifies her loop to do that:
$ for datafile in *[AB].txt
> do
> echo $datafile stats-$datafile
> done
NENE01729A.txt stats-NENE01729A.txt
NENE01729B.txt stats-NENE01729B.txt
NENE01736A.txt stats-NENE01736A.txt
...
NENE02043A.txt stats-NENE02043A.txt
NENE02043B.txt stats-NENE02043B.txt
She hasn’t actually run goostats
yet, but now she’s sure she can
select the right files and generate the right output filenames.
Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses the up arrow. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):
$ for datafile in *[AB].txt; do echo $datafile stats-$datafile; done
Using the left arrow key, Nelle backs up and changes the command
echo
to goostats
:
$ for datafile in *[AB].txt; do bash goostats $datafile stats-$datafile; done
When she presses enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the job by typing Control-C, uses up-arrow to repeat the command, and edits it to read:
$ for datafile in *[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done
Tip
- Moving to the Beginning and End
- We can move to the beginning of a line in the shell by typing ^A
(which means Control-A) and to the end using ^E
.
When she runs her program now, it produces one line of output every five seconds or so:
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
1518 times 5 seconds, divided by 60, tells her that her script will
take about two hours to run. As a final check, she opens another
terminal window, goes into north-pacific-gyre/2012-07-03
, and uses
cat stats-NENE01729B.txt
to examine one of the output files. It
looks good, so she decides to get some coffee and catch up on her
reading.
History¶
(Those who know history can choose to repeat it)
Another way to repeat previous work is to use the history
command to get a list of the last few hundred commands that have
been executed, and then to use !123
(where “123” is replaced
by the command number) to repeat one of those commands. For
example, if Nelle types this:
$ history | tail -5
456 ls -l NENE0*.txt
457 rm stats-NENE01729B.txt.txt
458 bash goostats NENE01729B.txt stats-NENE01729B.txt
459 ls -l NENE0*.txt
460 history
then she can re-run goostats
on NENE01729B.txt
simply by
typing !458
.
Exercises¶
Variables in loops¶
Suppose that ls
initially displays:
fructose.dat glucose.dat sucrose.dat
What is the output of:
for datafile in *.dat
do
ls *.dat
done
Now, what is the output of:
for datafile in *.dat
do
ls $datafile
done
Why do these two loops give you different outputs?
Saving to a file in a loop - part one¶
In the same directory, what is the effect of this loop?
for sugar in *.dat
do
echo $sugar
cat $sugar > xylose.dat
done
- Prints fructose.dat, glucose.dat, and sucrose.dat, and copies sucrose.dat to create xylose.dat.
- Prints fructose.dat, glucose.dat, and sucrose.dat, and concatenates all three files to create xylose.dat.
- Prints fructose.dat, glucose.dat, sucrose.dat, and xylose.dat, and copies sucrose.dat to create xylose.dat.
- None of the above.
Saving to a file in a loop - part two¶
In another directory, where ls
returns:
fructose.dat glucose.dat sucrose.dat maltose.txt
What would be the output of the following loop?
for datafile in *.dat
do
cat $datafile >> sugar.dat
done
- All of the text from fructose.dat, glucose.dat and sucrose.dat would be concatenated and saved to a file called sugar.dat.
- The text from sucrose.dat will be saved to a file called sugar.dat.
- All of the text from fructose.dat, glucose.dat, sucrose.dat and maltose.txt would be concatenated and saved to a file called sugar.dat.
- All of the text from fructose.dat, glucose.dat and sucrose.dat would be printed to the screen and saved to a file called sugar.dat.
Doing a Dry Run¶
Suppose we want to preview the commands the following loop will execute without actually running those commands:
for file in *.dat
do
analyze $file > analyzed-$file
done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
for file in *.dat
do
echo analyze $file > analyzed-$file
done
# Version 2
for file in *.dat
do
echo "analyze $file > analyzed-$file"
done
Variable Commands¶
Describe in words what the following loop does.
for how in frog11 prcb redig
do
$how -limit 0.01 NENE01729B.txt
done
Finding Things¶
Shell Concepts Introduced¶
grep
: file pattern searcher
find
: walk a file hierarchy
man
: display manual pages
$()
: execute in a subshell
Learning Objectives¶
- Use grep to select lines from text files that match simple patterns.
- Use find to find files whose names match simple patterns.
- Use the output of one command as the command-line parameters to another command.
- Explain what is meant by “text” and “binary” files, and why many common tools don’t handle the latter well.
Searching File Contents¶
You can guess someone’s age by how they talk about search: young people use “Google” as a verb, while crusty old Unix programmers use “grep”. The word is a contraction of “global/regular expression/print”, a common sequence of operations in early Unix text editors. It is also the name of a very useful command-line program.
grep
finds and prints lines in files that match a pattern. For our
examples, we will use a file that contains three haiku taken from a
1998 competition in Salon magazine. For this set of examples we’re
going to be working in the writing subdirectory:
$ cd
$ cd writing
$ cat haiku.txt
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.
With searching comes loss
and the presence of absence:
"My Thesis" not found.
Yesterday it worked
Today it is not working
Software is like that.
Tip
- Forever, or Five Years
- We haven’t linked to the original haiku because they don’t appear to be on Salon‘s site any longer. As Jeff Rothenberg said, “Digital information lasts forever — or five years, whichever comes first.”
Let’s find lines that contain the word “not”:
$ grep not haiku.txt
Is not the true Tao, until
"My Thesis" not found.
Today it is not working
Here, not
is the pattern we’re searching for. It’s pretty simple:
every alphanumeric character matches against itself. After the pattern
comes the name or names of the files we’re searching in. The output is
the three lines in the file that contain the letters “not”.
Let’s try a different pattern: “day”:
$ grep day haiku.txt
Yesterday it worked
Today it is not working
This time, two lines that include the letters “day” are printed.
However, these letters are contained within larger words. To restrict
matches to lines containing the word “day” on its own, we can give
grep
the -w
flag. This will limit matches to word
boundaries:
$ grep -w day haiku.txt
In this case, there aren’t any, so grep
‘s output is empty.
Another useful option is -n
, which numbers the lines that match:
$ grep -n it haiku.txt
5:With searching comes loss
9:Yesterday it worked
10:Today it is not working
Here, we can see that lines 5, 9, and 10 contain the letters “it”.
We can combine flags as we do with other Unix commands. For example,
since -i
makes matching case-insensitive and -v
inverts the
match, using them both only prints lines that don’t match the
pattern in any mix of upper and lower case:
$ grep -i -v the haiku.txt
You bring fresh toner.
With searching comes loss
Yesterday it worked
Today it is not working
Software is like that.
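The flags above can be checked against a recreated copy of the file. This sketch assumes haiku.txt has a blank line between the poems (which is what the line numbers 5, 9, and 10 imply), and it also uses -c, a standard grep option not covered in this lesson, which prints the number of matching lines instead of the lines themselves.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Recreate haiku.txt, assuming a blank line between the poems.
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

not_count=$(grep -c not haiku.txt || true)     # number of lines containing "not"
whole_word=$(grep -w day haiku.txt || true)    # no match: grep exits non-zero
it_lines=$(grep -n it haiku.txt | cut -d : -f 1 | tr '\n' ' ')
no_the=$(grep -c -i -v the haiku.txt || true)  # lines without "the" in any case

echo "not: $not_count, it on lines: $it_lines, without 'the': $no_the"
cd / && rm -rf "$workdir"
```

The count for -i -v includes the two blank lines, since they do not contain the pattern either.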
grep
has lots of other options. To find out what they are, we can
type man grep
. man
is the Unix “manual” command: it prints a
description of a command and its options, and (if you’re lucky)
provides a few examples of how to use it.
To navigate through the man
pages, you may use the up and down
arrow keys to move line-by-line, or try the “b” and spacebar keys to
skip up and down by full page. Quit the man
pages by typing “q”:
$ man grep
GREP(1) GREP(1)
NAME
grep, egrep, fgrep - print lines matching a pattern
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
DESCRIPTION
grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-
minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep
prints the matching lines.
... ... ...
OPTIONS
Generic Program Information
--help Print a usage message briefly summarizing these command-line options and the bug-reporting
address, then exit.
-V, --version
Print the version number of grep to the standard output stream. This version number should be
included in all bug reports (see below).
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by
POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be
matched. (-F is specified by POSIX.)
... ... ...
Searching for Patterns in Files¶
grep
‘s real power doesn’t come from its options, though; it comes
from the fact that patterns can include wildcards. (The technical name
for these is regular expressions, which is what the “re” in “grep”
stands for.) Regular expressions are both complex and powerful; if you
want to do complex searches, please look at the lesson on our website. As a taster, we can find lines
that have an ‘o’ in the second position like this:
$ grep -E '^.o' haiku.txt
You bring fresh toner.
Today it is not working
Software is like that.
We use the -E
flag and put the pattern in quotes to prevent the
shell from trying to interpret it. (If the pattern contained a ‘*’,
for example, the shell would try to expand it before running
grep
.) The ‘^’ in the pattern anchors the match to the start of
the line. The ‘.’ matches a single character (just like ‘?’ in the
shell), while the ‘o’ matches an actual ‘o’.
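The anchor and the single-character wildcard can be tried on a few lines fed to grep on standard input. This is a minimal sketch using four lines from the haiku as sample input.

```shell
# Four sample lines; only three have an 'o' as their second character.
sample=$(printf '%s\n' \
    'You bring fresh toner.' \
    'Yesterday it worked' \
    'Today it is not working' \
    'Software is like that.')

# The quotes stop the shell from touching the pattern; grep reads from stdin.
matches=$(printf '%s\n' "$sample" | grep -E '^.o')
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
```

“Yesterday it worked” is left out because its second character is an ‘e’, even though the line contains an ‘o’ later on.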
Searching for Files¶
While grep
finds lines in files, the find
command finds files
themselves. Again, it has a lot of options; to show how the simplest
ones work, we’ll use the directory tree shown below.
Nelle’s writing
directory contains one file called haiku.txt
and four subdirectories: thesis
(which is sadly empty), data
(which contains two files one.txt
and two.txt
), a tools
directory that contains the programs format
and stats
, and an
empty subdirectory called old
.
For our first command, let’s run find . -type d
. As always, the
.
on its own means the current working directory, which is where
we want our search to start; -type d
means “things that are
directories”. Sure enough, find
‘s output is the names of the five
directories in our little tree (including .
):
$ find . -type d
.
./data
./thesis
./tools
./tools/old
If we change -type d
to -type f
, we get a listing of all the
files instead:
$ find . -type f
./haiku.txt
./tools/stats
./tools/old/oldtool
./tools/format
./thesis/empty-draft.md
./data/one.txt
./data/two.txt
find
automatically goes into subdirectories, their subdirectories,
and so on to find everything that matches the pattern we’ve given
it. If we don’t want it to, we can use -maxdepth
to restrict the
depth of search:
$ find . -maxdepth 1 -type f
./haiku.txt
The opposite of -maxdepth
is -mindepth
, which tells find
to
only report things that are at or below a certain depth. -mindepth 2
therefore finds all the files that are two or more levels below us:
$ find . -mindepth 2 -type f
./data/one.txt
./data/two.txt
./tools/format
./tools/stats
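These options can be explored safely on a rebuilt copy of the tree. The sketch below recreates the directory layout as described in the text (with thesis and old left empty) inside a scratch directory, using empty files.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Rebuild the small tree: four subdirectories and five empty files.
mkdir -p data thesis tools/old
touch haiku.txt data/one.txt data/two.txt tools/format tools/stats

dir_count=$(find . -type d | wc -l)               # includes . itself
top_files=$(find . -maxdepth 1 -type f)           # current directory only
deep_count=$(find . -mindepth 2 -type f | wc -l)  # two or more levels down

echo "directories: $dir_count"
echo "top-level files: $top_files"
echo "deeper files: $deep_count"
cd / && rm -rf "$workdir"
```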
Searching for Files Matching a Pattern¶
Now let’s try matching by name:
$ find . -name *.txt
./haiku.txt
We expected it to find all the text files, but it only prints out
./haiku.txt
. The problem is that the shell expands wildcard
characters like *
before commands run. Since *.txt
in the
current directory expands to haiku.txt
, the command we actually
ran was:
$ find . -name haiku.txt
find
did what we asked; we just asked for the wrong thing.
To get what we want, let’s do what we did with grep
: put *.txt
in single quotes to prevent the shell from expanding the *
wildcard. This way, find
actually gets the pattern *.txt
, not
the expanded filename haiku.txt
:
$ find . -name '*.txt'
./data/one.txt
./data/two.txt
./haiku.txt
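The difference the quotes make can be reproduced by counting the results of both forms. This sketch assumes a scratch directory holding one text file at the top level and two in a data subdirectory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
touch haiku.txt data/one.txt data/two.txt

# Unquoted: the shell turns *.txt into haiku.txt before find ever runs.
unquoted=$(find . -name *.txt | wc -l)
# Quoted: find receives the pattern itself and applies it at every level.
quoted=$(find . -name '*.txt' | wc -l)

echo "unquoted: $unquoted, quoted: $quoted"
cd / && rm -rf "$workdir"
```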
Tip
- Listing vs. Finding
ls
and find
can be made to do similar things given the right options, but under normal circumstances, ls
lists everything it can, while find
searches for things with certain properties and shows them.
As we said earlier, the command line’s power lies in combining tools.
We’ve seen how to do that with pipes; let’s look at another technique.
As we just saw, find . -name '*.txt'
gives us a list of all text
files in or below the current directory. How can we combine that with
wc -l
to count the lines in all those files?
Subshell¶
The simplest way is to put the find
command inside $()
:
$ wc -l $(find . -name '*.txt')
11 ./haiku.txt
300 ./data/two.txt
70 ./data/one.txt
381 total
When the shell executes this command, the first thing it does is run
whatever is inside the $()
. It then replaces the $()
expression with that command’s output. Since the output of find
is
the three filenames ./data/one.txt
, ./data/two.txt
, and
./haiku.txt
, the shell constructs the command:
$ wc -l ./data/one.txt ./data/two.txt ./haiku.txt
which is what we wanted. This expansion is exactly what the shell does
when it expands wildcards like *
and ?
, but lets us use any
command we want as our own “wildcard”.
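Here is a sketch of the substitution on invented files with known line counts. It also shows one alternative, find's -exec ... {} + form, which lets find run wc itself and so keeps working when file names contain spaces; that alternative is an addition for comparison, not something used elsewhere in this lesson.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf 'a\nb\n'    > haiku.txt       # 2 lines
printf 'a\n'       > data/one.txt    # 1 line
printf 'a\nb\nc\n' > data/two.txt    # 3 lines

# The shell runs find first, then pastes its output into the wc command line.
total=$(wc -l $(find . -name '*.txt') | tail -n 1)

# Alternative: find passes the file names to wc directly.
total_exec=$(find . -name '*.txt' -exec wc -l {} + | tail -n 1)

echo "$total"
cd / && rm -rf "$workdir"
```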
It’s very common to use find
and grep
together. The first
finds files that match a pattern; the second looks for lines inside
those files that match another pattern. Here, for example, we can find
PDB files that contain iron atoms by looking for the string “FE” in
all the .pdb
files in or below the parent directory:
$ grep FE $(find .. -name '*.pdb')
../data/pdb/heme.pdb:ATOM 25 FE 1 -0.924 0.535 -0.518
Tip
- Binary Files
We have focused exclusively on finding things in text files. What if your data is stored as images, in databases, or in some other format? One option would be to extend tools like
grep
to handle those formats. This hasn’t happened, and probably won’t, because there are too many formats to support.
The second option is to convert the data to text, or extract the text-ish bits from the data. This is probably the most common approach, since it only requires people to build one tool per data format (to extract information). On the one hand, it makes simple things easy to do. On the negative side, complex things are usually impossible. For example, it’s easy enough to write a program that will extract X and Y dimensions from image files for
grep
to play with, but how would you write something to find values in a spreadsheet whose cells contained formulas?
The third choice is to recognize that the shell and text processing have their limits, and to use a programming language such as Python instead. When the time comes to do this, don’t be too hard on the shell: many modern programming languages, Python included, have borrowed a lot of ideas from it, and imitation is also the sincerest form of praise.
Exercises¶
Using grep¶
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.
With searching comes loss
and the presence of absence:
"My Thesis" not found.
Yesterday it worked
Today it is not working
Software is like that.
From the above text, contained in the file haiku.txt
, which
command would result in the following output:
and the presence of absence:
grep of haiku.txt
grep -E of haiku.txt
grep -w of haiku.txt
grep -i of haiku.txt
find pipeline reading comprehension¶
Write a short explanatory comment for the following shell script:
find . -name '*.dat' | wc -l | sort -n
Matching ose.dat but not temp¶
The -v
flag to grep
inverts pattern matching, so that only
lines which do not match the pattern are printed. Given that,
which of the following commands will find all files in /data
whose names end in ose.dat
(e.g., sucrose.dat
or
maltose.dat
), but do not contain the word temp
?
find /data -name '*.dat' | grep ose | grep -v temp
find /data -name ose.dat | grep -v temp
grep -v temp $(find /data -name '*ose.dat')
- None of the above.
Little Women¶
You and your friend, having just finished reading Little Women by
Louisa May Alcott, are in an argument. Of the four sisters in the
book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most
mentioned. You, however, are certain it was Amy. Luckily, you have a
file LittleWomen.txt
containing the full text of the novel. Using a
for
loop, how would you tabulate the number of times each of
the four sisters is mentioned? Hint: one solution might employ the
commands grep
and wc
and a |
, while another might utilize
grep
options.
Shell Scripting¶
One of the most powerful uses of the shell is scripting. Instead of typing commands at an interactive prompt, you can write a series of commands to a file and then execute that file. This has the benefit that changes to scripts can be tracked, and scripts can be shared.
Let’s create a simple script:
$ nano script.sh
Add the following:
me=$(whoami)
where=$(pwd)
echo "My name is $me"
echo "I am running in $where"
echo "Let's create a directory"
mkdir -v hello-script
echo "Now delete the directory"
rmdir hello-script
You can now execute the script:
$ bash script.sh
My name is ada
I am running in /home/ada
Let's create a directory
mkdir: created directory 'hello-script'
Now delete the directory
What happens is that you run the bash
command, which is the name
of the shell, and pass it the script you wrote. Another way of
executing a script is by making it executable and adding a “shebang”
at the top of the file. Edit script.sh
and make this the very
first line:
#!/bin/bash
This is called a “shebang” because the first character is a hash mark and the second is an exclamation point, also known as a “bang”.
Now make it executable:
$ chmod +x script.sh
You can run the script directly now:
$ ./script.sh
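The whole sequence can be sketched in one go. The file name hello-shebang.sh is hypothetical, and the sketch assumes, like the shebang above, that bash is installed at /bin/bash:

```shell
# Write a two-line script whose first line is the shebang.
# Quoting 'EOF' stops the shell from expanding $0 while writing the file.
cat > hello-shebang.sh <<'EOF'
#!/bin/bash
echo "Hello from $0"
EOF

# Without the execute bit the file cannot be run directly.
chmod +x hello-shebang.sh

# Run it by path; the system reads the shebang and starts /bin/bash,
# which then interprets the rest of the file.
output=$(./hello-shebang.sh)
echo "$output"
```

Inside a script, $0 holds the name the script was invoked with, so the output shows that the file was run directly rather than through an explicit bash command.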
The shebang indicates the location of the executable that will
interpret the script. Check for yourself that /bin/bash
exists:
$ file /bin/bash
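Another way to check, sketched here, is to ask the shell where it would find bash and whether that path is executable. The variable name bash_path is only illustrative:

```shell
# Ask the shell which bash it would run; this prints a path
# such as /bin/bash or /usr/bin/bash, depending on the system.
bash_path=$(command -v bash)
echo "bash found at: $bash_path"

# -x tests that the path exists and is executable.
if [ -x "$bash_path" ]; then
    echo "bash is executable"
fi
```

On systems where bash is installed somewhere other than /bin/bash, a commonly used alternative first line is #!/usr/bin/env bash, which looks bash up on the PATH instead of hard-coding its location.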
Conclusion¶
The Unix shell is older than most of the people who use it. It has survived so long because it is one of the most productive programming environments ever created — maybe even the most productive. Its syntax may be cryptic, but people who have mastered it can experiment with different commands interactively, then use what they have learned to automate their work. Graphical user interfaces may be better at the first, but the shell is still unbeaten at the second. And as Alfred North Whitehead wrote in 1911, “Civilization advances by extending the number of important operations which we can perform without thinking about them.”
Further Reading¶
What is covered here is a small overview of using the command-line shell. For further reading, please consult the Bash Guide for Beginners. Additionally, there are numerous shell summaries a Google search away.
Lab - Shell Usage¶
Log into india
for this.
- Create a directory. Create a file in it and write “hello world” in it.
- Which commands are used to list the contents of a file?
- Why should you not use rm -r * or rm -r /?
- Alias the rm command to rm -i.
- Find a text editor you like. Common choices are emacs, vi, vim, pico, nano, but there are many more.
- Write a simple shell script and execute it. The script should
create a file called
hello.txt
and write the string “Hello World” to it.