Chapter 9. Text Utilities

Table of Contents
9.1. Introduction
9.2. The basics

9.1. Introduction

One of the central ideas of UNIX(-like) operating systems is that "everything is a file"; even devices can be treated as files. Basically, there are three types of files in UNIX.

Because text files play such an important role in UNIX-like operating systems such as GNU/Linux, this book dedicates a separate chapter to processing text files. The usefulness of these tools may not be obvious at first, but once you get used to them you will probably find yourself using some of them on a daily basis.

Note

This chapter relies heavily on the use of pipes and redirection, so it is a good idea to read Section 7.3.5 first if you have not done so already.

9.2. The basics

9.2.1. cat

The cat command is one of the simplest commands around. Its default behaviour is to copy anything it receives on the standard input to the standard output, until the end-of-file character (^D) is sent to the standard input. You can see this by executing cat and entering some text:


$ cat
Hello world!
Hello world!
Testing... 1 2 3
Testing... 1 2 3
      
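
Because cat copies its standard input to its standard output, it can also be combined with output redirection to create a small file from keyboard input; the input is again ended with ^D. A short sketch (the file name newfile is just an example):


$ cat > newfile
This line ends up in newfile.
$ cat newfile
This line ends up in newfile.
      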

You can also concatenate files with cat, by providing the files that you would like to concatenate as arguments. The concatenated files will be sent to the standard output:


$ cat test1
This is the content of test1.
$ cat test2
This is the content of test2.
$ cat test1 test2
This is the content of test1.
This is the content of test2.
      

As you can see, it is also possible to send a file to the standard output by specifying a single file as cat's argument. This is an alternative to redirection; for example, one could use either of these commands to page through test1:


$ less < test1
$ cat test1 | less
      
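
cat's concatenation behaviour is often combined with output redirection to merge several files into a new one. For example, the following commands (the file name combined is just an example) join test1 and test2 into a single file:


$ cat test1 test2 > combined
$ cat combined
This is the content of test1.
This is the content of test2.
      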

9.2.2. echo

The echo command is used to send text to the standard output, by specifying that text as an argument to echo. For example, you could use echo to send a simple message to the standard output:


$ echo "Hello world!"
Hello world!
      
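
Because echo writes to the standard output, it can be combined with redirection to quickly create or extend a small file. A short sketch (the file name greeting is just an example):


$ echo "Hello world!" > greeting
$ echo "Goodbye!" >> greeting
$ cat greeting
Hello world!
Goodbye!
      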

9.2.3. wc

One of the things people often want to do with a text is count the number of words or lines it contains. The wc command can be used for this purpose. The file to be counted can be specified as an argument to wc:


$ wc essay.txt
 174 1083 8088 essay.txt
      

As you can see, the default output consists of three numbers. These are (in order): the number of lines, the number of words, and the number of bytes (for plain ASCII text this equals the number of characters). It is also possible to print just one of these numbers, with -l, -w, and -c respectively (-m prints the number of characters). For example, if we only want to know the number of lines in the file, we could do the following:


$ wc -l essay.txt
174 essay.txt
      
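
Likewise, -w prints only the number of words:


$ wc -w essay.txt
1083 essay.txt
      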

In some situations you might want to use the output of wc in a pipeline, or as an argument to another command. The problem with specifying the file name as an argument is that wc will also show the name of the file (as you can see in the example above). You can work around this behavior by redirecting the file contents to wc. For example:


$  wc -l < essay.txt
174
      
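
Because the redirected form prints only the number, it is easy to use inside a larger command. As a sketch, the line count can be embedded in a message with command substitution (again assuming the essay.txt file used above):


$ echo "essay.txt contains $(wc -l < essay.txt) lines"
essay.txt contains 174 lines
      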

9.2.4. tr

The tr command is used to translate or delete characters. All uses of tr require one or two sets; a set is simply a string of characters. Most characters in a set stand for themselves, but some sequences have a special meaning; the tr(1) manual page provides a complete overview. Table 9-1 describes some of the frequently used special character sequences.

Table 9-1. Special tr character sequences

Sequence        Meaning
\\              Backslash (\)
\n              New line
char1-char2     All characters from char1 to char2 (e.g. "a-z")
[:alnum:]       All alphanumeric characters
[:alpha:]       All letters in the alphabet
[:punct:]       Punctuation characters

Characters can be deleted from a text with the -d parameter and a set that specifies the characters that should be deleted. Let's start with an easy example: suppose that you want to remove all newline characters from a text stored in the file text, and that you would like to redirect the output to a file named text-continuous. Obviously, we need a set with only one character, namely the newline character, which is specified as "\n". This can be accomplished with the following command:


$ cat text | tr -d "\n" > text-continuous
      
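
The effect of -d is easy to see by feeding tr some text with echo. Here the set "lo" removes every "l" and "o" from the input:


$ echo "Hello world!" | tr -d "lo"
He wrd!
      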

It happens quite often that you want to delete everything except the characters that are specified in a set. You can do this with the -c parameter, which automatically complements the specified set. For example, if you would like to remove every character from a text except alphabetical characters, newlines, and spaces, you can use -c with the following set: "[:alpha:]\n ". Combined in a command it would look like this:


$ cat text | tr -c -d "[:alpha:]\n "
      

Using tr to translate characters does not require any extra parameters, but it does require two sets. The first character of the first set is replaced with the first character of the second set, the second character with the second, and so on. Suppose that we use the sets "abc" and "def". With these sets the following translations occur: "a -> d", "b -> e", and "c -> f". If the first set is longer than the second set, the second set is expanded by repeating its last character. If the second set is longer than the first set, the extra characters in the second set are ignored.
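
You can see the translation in action with echo. Using the sets from the example above, every "a" becomes a "d", every "b" an "e", and every "c" an "f":


$ echo "abc cba" | tr "abc" "def"
def fed
      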

Translations can be useful for many purposes. For example, a translation can be used to make a wordlist from a text, by replacing every space with a newline character:


$ cat essay.txt | tr " " "\n" | less
      

As you can imagine, the output might still contain non-alphabetical characters. We can combine the command above with the delete functionality of tr to make a wordlist without unwanted characters:


$ cat essay.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" > wordlist
      

9.2.5. sort

The sort command is used to sort the lines in a file. To sort a text alphabetically, you can just pipe or redirect the data to sort. sort also accepts file names as parameters. When multiple files are specified, the files will be concatenated before the lines are sorted. Suppose that you have an unsorted wordlist stored in the file wordlist_unsorted. You can sort the contents of this file, and write the output to wordlist_sorted, with:


$ sort wordlist_unsorted > wordlist_sorted
      
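
Since sort concatenates multiple files before sorting, merging several unsorted lists into one sorted list takes a single command (the file names in this sketch are hypothetical):


$ sort wordlist_part1 wordlist_part2 > wordlist_sorted
      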

sort accepts many different parameters, which are all described in the sort(1) manual page. An often-used parameter that we will briefly discuss here is -u. With this parameter only unique lines are sent to the standard output (in other words: duplicate occurrences are ignored). By combining sort -u with the tr commands discussed in Section 9.2.4, you can make a sorted wordlist of a text:


$ cat essay.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort -u > wordlist_sorted
      

If the command listed in this example is combined with wc, one could count the size of the vocabulary used in the text. In the sorted wordlist each line represents a unique word from the original text. So, we can count the number of distinct words that were used by counting the total number of lines in the sorted list:


$ cat essay.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort -u | wc -l
      

9.2.6. uniq

The uniq command can be compared to the -u parameter of sort; it removes all but one entry of successive identical lines (in a sorted list). The main difference is that uniq provides some extra parameters that can be used to manipulate the output. The default behaviour works like sort -u, and removes duplicate entries:


$ sort wordlist_unsorted | uniq > wordlist_sorted
      

To find out how often each line occurs in a text, you can count how many successive identical lines the sorted list contains for every line. uniq can prepend this count to each line with the -c parameter. So, to make a list of how many times each word occurs in a text, you can combine tr, sort, and uniq:


$ cat essay.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq -c > wordlist_sorted
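      

The list produced by uniq -c is ordered alphabetically, not by frequency. If you want the most frequent words at the end of the list, you can append a numeric sort (sort -n) to the pipeline (the output file name is, again, just an example):


$ cat essay.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq -c | sort -n > wordlist_by_frequency
      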