A Beginner’s Guide to Grep | Fikra for Business Development


A Beginner’s Guide to Grep: Basics and Regular Expressions
Tuesday, January 21, 2020

Grep is one of the tools in the system administrator’s “Swiss Army knife” set, and is extremely useful for searching for strings and patterns in a group of files, or even in sub-folders.

This article introduces the basics of Grep, provides examples of advanced use and links you to further reading. Grep (an acronym for “Global Regular Expression Print”) is installed by default on almost every distribution of Linux, BSD and UNIX, and is even available for Windows. GNU and the Free Software Foundation distribute Grep as part of their suite of open source tools. This tutorial focuses primarily on this GNU version, as it is currently the most widely used.

Grep finds a string in a given file or input, quickly and efficiently. While most everyday uses of the command are simple, there are a variety of more advanced uses that most people don’t know about, including regular expressions, which can become quite complicated.

Early variants of Grep included egrep and fgrep. The former applies an extended regular expression syntax that was added to UNIX after Ken Thompson’s original regular expression implementation. The latter searches for any of a list of fixed strings, using the Aho-Corasick algorithm.

These variants are embodied in most modern Grep implementations as command-line switches (and standardised as -E and -F in POSIX.2). In such combined implementations, Grep may also behave differently depending on the name by which it is invoked, allowing fgrep, egrep, and grep to be links to the same program.

There are two ways to provide input to Grep, each with its own particular uses. First, Grep can search a given file or files on a system (including a recursive search through sub-folders). Second, Grep accepts input (usually via a pipe) from another command or series of commands.

Regular expressions

A regular expression, often shortened to “regex” or “regexp”, is a way of specifying a pattern (a particular set of characters or words) in text that can be applied to variable inputs to find all occurrences that match the pattern. Regexes enhance the ability to meaningfully process text content, especially when combined with other commands. Usually, regular expressions are included in the Grep command in the following format:

    grep [options] [regexp] [filename]

GNU Grep uses the GNU version of regular expressions, which is very similar (but not identical) to POSIX regular expressions. In fact, most varieties of regular expressions are quite similar, but have differences in escapes, meta-characters, or special operators.

GNU Grep has two regular expression feature sets: Basic and Extended. In basic regular expressions, the meta-characters ?, +, {, |, (, and ), whose uses are described later in this article, lose their special meaning.
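The difference between the two feature sets is easy to see with a small experiment. The sketch below is only an illustration: the two sample words and the variable names are made up for this purpose and do not come from the article's test file.

```shell
# Hypothetical sample input: two spellings of the same word.
sample=$(printf '%s\n' 'color' 'colour')

# Basic syntax: "?" is an ordinary character here, so no line matches.
# grep -c prints the number of matching lines; "|| true" hides the
# non-zero exit status that grep returns when nothing matches.
basic_count=$(printf '%s\n' "$sample" | grep -c 'colou?r' || true)

# Extended syntax (-E): "?" makes the preceding "u" optional.
extended_count=$(printf '%s\n' "$sample" | grep -c -E 'colou?r')

echo "basic: $basic_count, extended: $extended_count"
```

With the basic feature set the pattern is looking for a literal question mark, which is why only the -E run counts both spellings.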

As described later, to switch to using extended regular expressions, you need to add the option -E to the grep command.

It is customary to enclose the regular expression in single quotation marks, to prevent the shell (Bash or others) from trying to interpret and expand the expression before launching the grep process. For example, if a pair of back-ticks in the regexp is not quoted, the text between the back-ticks is executed as a Bash sub-process, and if this happens to be a valid command, the text returned by it takes the regular expression’s place in the command-line parameters given to Grep! Not at all what we want.

Again, due to shell behaviour, you can also enclose the regex in double quotes. In this case, you can use environment variables in the regex, and the shell will substitute them before calling Grep. This can be very useful, depending on what you’re trying to do, or it could turn out to be a nuisance. Remember the difference in behaviour.

Basic usage

Now let’s go on to some practical examples of using Grep. To better understand the results, I’ve created a simple text file on which we will run our Grep searches; the file contains the following lines:

    Hi
    this
    is test file
    to carry out few regular expressions
    practical with grep
    123 456
    Abcd
    ABCD

Case-insensitive search (grep -i):

    $ grep -i 'abcd' testfile
    Abcd
    ABCD

As you can see, the -i flag causes a search for “abcd” to return matches whose characters differ in case from those in the search string.

Whole-word search (grep -w):

    $ grep -w 'test' testfile
    is test file

This type of search only returns lines where the sought-for string is a whole word and not part of a larger word.
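To follow along, the test file can be recreated in a scratch directory. The sketch below assumes a temporary directory created with mktemp (the location is arbitrary, and the variable names are only for illustration) and then repeats the two searches above, plus a whole-word search for a fragment that should find nothing.

```shell
# Recreate the article's test file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/testfile" <<'EOF'
Hi
this
is test file
to carry out few regular expressions
practical with grep
123 456
Abcd
ABCD
EOF

# -i ignores case, so both "Abcd" and "ABCD" are returned.
insensitive=$(grep -i 'abcd' "$workdir/testfile")

# -w only accepts "test" as a whole word.
whole_word=$(grep -w 'test' "$workdir/testfile")

# "tes" only occurs inside "test", so a whole-word search counts no lines.
# "|| true" hides grep's non-zero exit status for "no match".
fragment_count=$(grep -c -w 'tes' "$workdir/testfile" || true)

printf '%s\n' "$insensitive" "$whole_word" "fragment matches: $fragment_count"
```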

Recursively search through sub-folders (grep -r):

    $ grep -r '456' /root/
    /root/testfile:123 456

Inverted search (grep -v):

    $ grep -v 'practical' testfile
    Hi
    this
    is test file
    to carry out few regular expressions
    123 456
    Abcd
    ABCD

This prints all the lines in the file, except the line that contains the word “practical”.

An interesting relative is the -L flag (you can also use --files-without-match), which outputs the names of files that do NOT contain matches for your search pattern. No lines from the files are printed, only the file names.

    $ grep -r -L "Network" /var/log/*
    /var/log/anaconda.log
    /var/log/anaconda.syslog
    /var/log/audit/audit.log
    /var/log/boot.log
    /var/log/boot.log.1
    ...

The “opposite” flag to -L is -l or --files-with-matches, which prints out (only) the names of files that do contain matches for your search pattern.

Print additional (trailing) context lines after match (grep -A):

    $ grep -A1 '123' testfile
    123 456
    Abcd

For each line that matches the search, Grep prints the matching line, as well as the one line after the match. Varying the number provided to -A changes the number of additional lines in the output.

Print additional (leading) context lines before match (grep -B):

    $ grep -B2 'Abcd' testfile
    practical with grep
    123 456
    Abcd

Print additional (leading and trailing) context lines before and after the match (grep -C):

    $ grep -C2 'carry' testfile
    this
    is test file
    to carry out few regular expressions
    practical with grep
    123 456

As you can see, this has printed out two lines before and after the single match found in the file; if there are multiple matches, Grep inserts a line containing -- between each group of lines (each match and its context lines).
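The -- separator and the -l/-L pair are easiest to understand on a file with more than one match. The following sketch uses two made-up files, sample.txt and other.txt, in a temporary directory; neither file is part of the original article.

```shell
# Two hypothetical files: one with two matches, one with none.
workdir=$(mktemp -d)
printf '%s\n' 'alpha' 'match one' 'beta' 'gamma' 'delta' 'match two' 'epsilon' \
    > "$workdir/sample.txt"
printf '%s\n' 'nothing to see here' > "$workdir/other.txt"

# Each match is printed with one trailing context line; the two groups
# are not adjacent in the file, so grep separates them with "--".
context=$(grep -A1 'match' "$workdir/sample.txt")
printf '%s\n' "$context"

# -l lists files that contain a match, -L lists files that do not.
# "|| true" guards against differing exit statuses across grep versions.
with_match=$(cd "$workdir" && grep -l 'match' *.txt || true)
without_match=$(cd "$workdir" && grep -L 'match' *.txt || true)
echo "with: $with_match, without: $without_match"
```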
Print the filename for each match (grep -H):

    $ grep -H 'a' testfile
    testfile:to carry out few regular expressions
    testfile:practical with grep

Now, let’s run the search a bit differently:

    $ cat testfile | grep -H 'a'
    (standard input):to carry out few regular expressions
    (standard input):practical with grep

When the stream that Grep is asked to search is passed to its standard input via a pipe from a previous command in the chain, grep -H displays (standard input) as the filename.

Run in “quiet” mode (grep -q):

When run with this flag, Grep does not write anything to standard output, but sets its return value (also known as exit status) to reflect whether a match was found or not. This option is mainly used in scripts that need to check if a given file contains a particular match. A return status of 0 (zero) indicates that a match was found; 1 indicates that no match was found.

    $ grep -q '2010' testfile
    $ echo $?
    1
    $ grep -q '456' testfile
    $ echo $?
    0

Using regular expressions

    $ grep 'c.r' testfile
    to carry out few regular expressions

In the search above, . is used to match any single character, which is why it matches “car” in “carry”. Grep has a powerful regular expression matching engine, which we can’t hope to cover in depth here, but we will include a few important points:

• Most characters, including all letters and digits, are actually regular expressions that match themselves.
• Any meta-character (with special meaning to Grep, like the . in the example above) may be quoted by preceding it with a backslash. This makes Grep treat it as an ordinary character.

    $ grep 'c\.r' testfile
    $

As you can see, preceding . with a backslash has removed its significance as a meta-character.
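Two of the points above lend themselves to a short script: using the exit status of grep -q in a condition, and the effect of escaping the period. This is a minimal sketch; the file sample.txt and its contents are invented for the example.

```shell
# Hypothetical sample file.
workdir=$(mktemp -d)
printf '%s\n' '123 456' 'version 1.5' 'version 125' > "$workdir/sample.txt"

# grep -q prints nothing; the if statement reacts to its exit status.
if grep -q '456' "$workdir/sample.txt"; then
    found_456=yes
else
    found_456=no
fi

if grep -q '2010' "$workdir/sample.txt"; then
    found_2010=yes
else
    found_2010=no
fi

# Unescaped, "." matches any character, so '1.5' also matches "125".
any_char=$(grep -c '1.5' "$workdir/sample.txt")

# Escaped, "\." only matches a literal period.
literal_dot=$(grep -c '1\.5' "$workdir/sample.txt")

echo "456: $found_456, 2010: $found_2010, '1.5': $any_char, '1\.5': $literal_dot"
```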

The period (.) matches any single character. An item in a regular expression may also be followed by one of several repetition operators:

• ? means that the preceding item is optional, and will be matched at most once.
• * means that the preceding item will be matched zero or more times.
• + means that the preceding item will be matched one or more times.
• {n} means that the preceding item is matched exactly n times, while {n,} means that the item is matched n or more times.
• {n,m} means that the preceding item is matched at least n times, but not more than m times.
• {,m} means that the preceding item is matched at most m times.

However, with the exception of *, these repetition operators are part of GNU Grep’s extended regular expression syntax, so to use them effectively, remember to add the -E option to your command. For more information on regular expression syntax, refer to the Regular Expressions chapter in the Grep manual. Meanwhile, we will present some examples of regular expressions and try to show how they work.

Character classes in regular expressions

The “character class” tool is one of the more flexible and often-used features of regular expressions. There are two basic ways to use character classes: to specify a list of characters (for example, [aeiou] is a list of vowel characters), or a range (like [m-t], which expands to [mnopqrst]). Ranges are a convenience that saves having to type an entire sequence of characters.
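The repetition operators and a character-class range can be compared side by side on a tiny input. In the sketch below, the sample strings and the helper function count_matches are made up for illustration; the helper simply counts how many sample lines a given extended pattern matches.

```shell
# Hypothetical sample: "a", then zero to three "b"s, then "c".
sample=$(printf '%s\n' 'ac' 'abc' 'abbc' 'abbbc')

# Illustrative helper: count the sample lines matching an extended regex.
# "|| true" hides grep's non-zero exit status when nothing matches.
count_matches() {
    printf '%s\n' "$sample" | grep -c -E "$1" || true
}

optional=$(count_matches 'ab?c')        # ac, abc
zero_or_more=$(count_matches 'ab*c')    # all four lines
one_or_more=$(count_matches 'ab+c')     # abc, abbc, abbbc
exactly_two=$(count_matches 'ab{2}c')   # abbc
two_or_more=$(count_matches 'ab{2,}c')  # abbc, abbbc
one_to_two=$(count_matches 'ab{1,2}c')  # abc, abbc

# A range inside a character class: "map" contains letters from m to t,
# while "cab" does not.
range_count=$(printf '%s\n' 'map' 'cab' | grep -c '[m-t]')

echo "$optional $zero_or_more $one_or_more $exactly_two $two_or_more $one_to_two $range_count"
```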

A character class can also include a list of special characters, but they can’t be used as a range. A single character class instance will match only one character; to match multiple occurrences of the class, you would need to add a repetition operator, like those mentioned above. For example, to find a run of eleven consecutive lower-case letters, the regex would be: [a-z]{11}. As mentioned earlier, to use the repetition operators, we need to add the option -E. Let’s run this on our test file:

    $ grep -E '[a-z]{11}' testfile
    to carry out few regular expressions

Here, “expressions” is the only all-lowercase 11-character string in the file, so this is the only line printed as the output.

There are quite a few character classes that are very commonly used in regular expressions, and these are provided as named classes. For example, the [a-z] class of lower-case letters that we used above has the named class [:lower:]. Naturally, [:upper:] is upper-case letters A to Z, and [:alpha:] is all alphabetic characters, equivalent to [:lower:] plus [:upper:]. [:digit:] is the digits 0 to 9, and [:alnum:] is alphanumeric characters, a combination of [:alpha:] and [:digit:]. Note that a named class is written inside a bracket expression, so the equivalent of [a-z] in a pattern is [[:lower:]]. The Grep manual lists more of these named classes.
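As a quick check of the named classes, the sketch below recreates the test file in a temporary directory (an arbitrary location, with illustrative variable names) and runs three searches. Note the double brackets: the named class sits inside a bracket expression.

```shell
# Recreate the article's test file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/testfile" <<'EOF'
Hi
this
is test file
to carry out few regular expressions
practical with grep
123 456
Abcd
ABCD
EOF

# Lines containing at least one digit.
digits=$(grep '[[:digit:]]' "$workdir/testfile")

# Lines made up entirely of upper-case letters.
all_upper=$(grep -E '^[[:upper:]]+$' "$workdir/testfile")

# Same search as '[a-z]{11}', written with the named class.
eleven_lower=$(grep -E '[[:lower:]]{11}' "$workdir/testfile")

printf '%s\n' "$digits" "$all_upper" "$eleven_lower"
```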

When a caret (^) is used as the first character in a character class, it is a negation of the class, effectively meaning, “none of these characters”.

Line and word anchors

The ^ anchor specifies that the pattern following it should be at the start of the line:

    $ grep '^th' testfile
    this

The $ anchor specifies that the pattern before it should be at the end of the line.

    $ grep 'i$' testfile
    Hi

The operator \< anchors the pattern to the start of a word, and the operator \> anchors the pattern to the end of a word.

    $ grep 'le\>' testfile
    is test file

The \b (word boundary) anchor can be used in place of \< and \> to signify the beginning or end of a word:

    $ grep -e '\breg' testfile
    to carry out few regular expressions

Finally, we look at the | (alternation) operator, which is part of the extended regex features. A pattern containing this operator separately matches the parts on either side of it; if either one is found, the line containing it is a match. The parts can themselves be complex regular expressions, so this means you can check each line in a file for multiple search patterns in one pass.

    $ grep -E 'hi|bc' testfile
    this
    Abcd

That was pretty simple, so let’s try a more complicated one. Can you reason out why the output lines for this regex are as shown below?

    $ grep -E '^[t-z]+|[^a-z]+$' testfile
    this
    to carry out few regular expressions
    123 456
    ABCD

Using shell expansions in the pattern input to Grep

As mentioned earlier, if you don’t single-quote the pattern passed to Grep, the shell could perform shell expansion on the pattern and actually feed a changed pattern to Grep. This can also be done intentionally, when you need it. Let’s look at a few examples.
    # grep "$HOME" /etc/passwd
    root:x:0:0:root:/root:/bin/bash
    operator:x:11:0:operator:/root:/sbin/nologin

Here, we intentionally use double quotes to make the Bash shell replace the environment variable $HOME with the actual value of the variable (in this case, /root). Thus, Grep searches the /etc/passwd file for the text /root, yielding the two lines that match.

    # grep `whoami` /etc/passwd
    root:x:0:0:root:/root:/bin/bash
    operator:x:11:0:operator:/root:/sbin/nologin

Here, back-tick expansion is done by the shell, replacing `whoami` with the user name (root) that is returned by the whoami command.

Well, we hope this has set you on your way to using this very efficient tool.

This article was originally published in the May 2010 issue.
