Regular Expressions

Published: 21 Oct 2011 | Last updated: 21 Oct 2011
 
Regular expressions are a way to describe patterns of text. They are useful for processing text documents, or anywhere you need to find a pattern and possibly replace it with something else.

Introduction

Imagine you have a rather long document with a single misspelling: a Mr. Verma is displeased that his surname has been misspelled as "Varma". This is simple enough to fix; even a basic text editor such as Notepad can perform a search-and-replace operation for something like this.

However, as soon as we are searching for a pattern rather than a fixed piece of text, such simple measures start to fail. Imagine that you have to replace every URL that follows one pattern with a URL that follows another. For example, suppose that after a restructuring of a website, URLs that used to have the pattern:

http://www.website.com/[day]/[month]/[year]/articlename.html

now become:

http://www.website.com/[year]/[month]/[day]/articlename.html

What can one do? A plain search and replace will not do the trick here, and unless you are dealing with very few URLs, this is too much to do manually. Or imagine that you are looking for a sequence of two words and would like to count how many times it occurs in some text, the catch being that the two words could be separated by any kind of whitespace: a space, a tab, a line break, etc.

These situations are easily handled by Regular Expressions.

What are Regular Expressions?

Regular expressions (Regex) are a way to define a pattern to be extracted / replaced / processed in a body of text. Many programming languages support regular expressions, either as part of that language or as part of a library.

The exact syntax and usage of Regex differ between programs; however, the basic principles remain the same. Many applications, such as Notepad++, TextPad, and even LibreOffice / OpenOffice, include support for regular expressions. Linux users might be aware of the grep command, which is a Regex search tool for your text files or input. It is powerful enough to let you search through the files on your system, looking for the pattern you provide.

 

The Basics

Since it is patterns that we need to match rather than fixed text sequences, regular expressions give special meanings to certain characters so that a single expression can match many different sequences.

For example, if we want to match any double-digit number, we do that using the regular expression:

\d\d

Here \d stands for any numeric digit.

What if we had to match any double-digit even number though? Here is what we could do:

\d[02468]

Here the contents inside the square brackets define a character set, which matches any one of the enclosed characters. This also means that we could have used [0123456789] instead of \d, but there is an even better way, using ranges, as follows:

[1-4][3-7]

The above will match any double-digit number that begins with 1, 2, 3 or 4, and has 3, 4, 5, 6, or 7 as the second digit. Ranges like these can be used with letters too; [a-z] matches any lowercase letter.

Similarly, it is possible to invert a character set; [^1-5] means any character other than the digits 1 to 5. Note that this includes letters and other characters as well, not just the remaining digits!
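The character sets above can be tried out with a short sketch. Python and its standard re module are our choice here, not something the article prescribes, and the helper name matches_whole is made up for illustration. Note that the digit class is written with a backslash, as \d.

```python
import re

# Patterns from the text; raw strings keep the backslash intact.
TWO_DIGITS = re.compile(r"\d\d")            # any two digits
TWO_DIGIT_EVEN = re.compile(r"\d[02468]")   # second digit must be even
RANGED = re.compile(r"[1-4][3-7]")          # first digit 1-4, second digit 3-7
NOT_1_TO_5 = re.compile(r"[^1-5]")          # any one character except 1..5


def matches_whole(pattern, text):
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


print(matches_whole(TWO_DIGITS, "42"))      # True
print(matches_whole(TWO_DIGIT_EVEN, "43"))  # False, 3 is odd
print(matches_whole(RANGED, "27"))          # True
print(matches_whole(RANGED, "58"))          # False, 5 is outside 1-4
print(matches_whole(NOT_1_TO_5, "x"))       # True, a letter is not in 1..5
print(matches_whole(NOT_1_TO_5, "3"))       # False
```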

The pipe character | can be used to provide alternatives. For example, if you are looking for occurrences of either "June" or "July", you could use:

Ju(ne|ly)
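As a rough illustration, here is how that pattern might be used in Python (our choice of language; the sample sentence is invented). We use finditer rather than findall because, in Python, findall returns only the bracketed part when a pattern contains a group.

```python
import re

MONTH = re.compile(r"Ju(ne|ly)")

# Invented sample text for illustration.
text = "Sales rose in June, dipped in July, and recovered by January."

# group(0) is the whole match, not just the part inside the brackets.
found = [m.group(0) for m in MONTH.finditer(text)]
print(found)  # ['June', 'July']
```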

For convenience, most regular expression implementations provide easy-to-use character classes, such as the previously mentioned "\d" for matching any digit. One of the most important ones is the period character ".", which matches any character other than a newline. A table is provided at the end for your reference.
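A small sketch of the period character in Python (our choice of language; the pattern h.t is just an illustrative example):

```python
import re

# "h", then any single character except a newline, then "t".
H_ANY_T = re.compile(r"h.t")

print(bool(H_ANY_T.fullmatch("hat")))   # True
print(bool(H_ANY_T.fullmatch("h t")))   # True, a space counts as a character
print(bool(H_ANY_T.fullmatch("h\nt")))  # False, the dot skips newlines by default
```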

 

Matching repetitions

So far we have been looking for double-digit numbers, but what about patterns of arbitrary length? Regex has a way to handle those too. If we need to match any number, i.e. any uninterrupted sequence of digits, here is what we can do:

\d+

The "\d" matches any digit, and the plus sign (+) signifies that the previous expression should occur one or more times. Decimal numbers are a more interesting example. Decimals can have any number of digits before the decimal point (including none, as in .23), and at least one digit after the decimal point. This could be matched as follows:

\d*\.\d+

This can be divided into three parts:

  • \d*
    Here the * matches zero or more occurrences of the previous expression, i.e. \d, or any numeric character.
  • \.
    We need to match a period / dot character; however, that already has a different purpose, which is to match any character. To match the period / dot character itself, we need to precede it with a backslash.
  • \d+
    As mentioned before, this matches one or more digit characters.
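The two repetition patterns, written out in full as \d+ and \d*\.\d+, can be checked with a quick Python sketch (Python and the sample sentence are our own choices for illustration):

```python
import re

INTEGER = re.compile(r"\d+")       # one or more digits
DECIMAL = re.compile(r"\d*\.\d+")  # optional digits, a literal dot, then digits

# Invented sample text for illustration.
text = "Pi is roughly 3.14, a half is .5, and 42 is a whole number."

print(INTEGER.findall(text))  # ['3', '14', '5', '42']
print(DECIMAL.findall(text))  # ['3.14', '.5']
```

Notice that \d+ on its own splits 3.14 into two separate runs of digits, while the decimal pattern ignores 42 and the full stop at the end of the sentence because neither has a dot followed by a digit.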

Let us say we have a text document and we want to match the phone numbers mentioned in it. Here we have a specific pattern with a specific number of digits: we wish to match Indian mobile numbers that follow the pattern "+91xxxxxxxxxx", where each x stands for a digit. Here is a pattern that matches them:

\+91\d{10}

Again, we had to use "\+" instead of "+" since the latter already has a special meaning in Regex. Curly braces specify how many times the previous expression should be repeated. Here we specify 10, but we are also free to specify a range such as {3,7}, which matches if the previous expression is repeated between 3 and 7 times. Writing {3,} matches three or more repetitions.
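Here is a minimal Python sketch of the counted repetitions, using the pattern \+91\d{10} with the plus sign escaped. The phone numbers are made up for illustration, and Python is our choice of language.

```python
import re

MOBILE = re.compile(r"\+91\d{10}")       # literal "+91" followed by exactly 10 digits
THREE_TO_SEVEN = re.compile(r"\d{3,7}")  # between 3 and 7 digits

# Made-up numbers for illustration only.
text = "Call +919876543210 or +911234567890; ignore +91 12345."
print(MOBILE.findall(text))  # ['+919876543210', '+911234567890']

print(bool(THREE_TO_SEVEN.fullmatch("12")))        # False, too short
print(bool(THREE_TO_SEVEN.fullmatch("12345")))     # True
print(bool(THREE_TO_SEVEN.fullmatch("12345678")))  # False, too long
```

One caveat: when searching inside a longer text, \+91\d{10} will also match the first ten digits of a longer run of digits, so a real application would probably want stricter checks around the number.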

Finally, we have the "?" character that can be used to specify that the preceding character may or may not be present. We could, for example, use this to match both the American and British spellings of the word "colour":

colou?r

This will match both "color" (American spelling), and "colour" (British spelling), since the "?" specifies that the "u" is optional.
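A quick way to see this for yourself, sketched in Python (our choice of language, with an invented sample sentence):

```python
import re

COLOUR = re.compile(r"colou?r")  # the "u" may appear zero or one time

# Invented sample text; "colouur" has one "u" too many.
text = "The color red and the colour blue, but not colouur."
print(COLOUR.findall(text))  # ['color', 'colour']
```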

 

Groups

To apply "*", "+", "?" etc. to a group of multiple characters, we can use round brackets (parentheses) to group them together. To match both "child" and "children" we could do:

child(ren)?
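The effect of grouping can be sketched in Python as follows (our choice of language; the pattern is written in lowercase to match the words in the text, and the helper name is_child_word is hypothetical):

```python
import re

# The "?" applies to the whole group "ren", not just to the final "n".
CHILD = re.compile(r"child(ren)?")


def is_child_word(word):
    """Hypothetical helper: True if `word` is exactly 'child' or 'children'."""
    return CHILD.fullmatch(word) is not None


print(is_child_word("child"))     # True
print(is_child_word("children"))  # True
print(is_child_word("childre"))   # False, the group must match entirely or not at all
```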

Brackets also let you capture parts of an expression. Let us go back to the example we began with, the URLs. We now know how to match the input URL:

http://www\.example\.com/\d{2}/\d{2}/\d{4}/articlename\.html

The dots are escaped so that they match literal periods. The day and month are two-digit numbers, while the year is a four-digit number. But now what? We have matched the URLs we wanted to replace, but that is not enough; we need the day, month, and year matched separately so that we can build the new URL. For this we can put each of the day, month, and year in its own pair of brackets, which instructs the Regex engine to record each sub-match separately.

http://www\.example\.com/(\d{2})/(\d{2})/(\d{4})/articlename\.html

Now when the Regex engine matches the whole text, it will also store the three sub-matches separately so that we can use them later on. Each language has its own way of dealing with this, and often there is a simple way to replace all matches of a regular expression with text containing captured bits from the original string.
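As one example of how a language exposes the captured pieces, here is a rough sketch in Python (our choice; other languages differ). The function name extract_date and the sample URL date are made up for illustration, and the dots in the pattern are escaped so they match literal periods.

```python
import re

URL = re.compile(
    r"http://www\.example\.com/(\d{2})/(\d{2})/(\d{4})/articlename\.html"
)


def extract_date(text):
    """Hypothetical helper: return (day, month, year) from the first matching URL."""
    m = URL.search(text)
    # groups() returns the captured sub-matches in order of their opening bracket.
    return m.groups() if m else None


print(extract_date("See http://www.example.com/21/10/2011/articlename.html"))
# ('21', '10', '2011')
```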

In JavaScript, for instance, we can perform the operation we want with the replace function, replacing occurrences of the above Regex pattern with:

http://www.example.com/$3/$2/$1/articlename.html

Here the Regex engine automatically substitutes the first captured group for $1, the second captured group for $2, and so on.

If you find yourself using Notepad++ to play around with Regex, the above example may not work as written, since Notepad++ uses a different syntax. In Notepad++ you will need to use "\1" for the first captured group, "\2" for the second and so on. Consult the documentation of whatever software you are using to find out the syntax it uses.
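The same day/year swap can be sketched in Python, which is our choice for illustration rather than the article's. Python's re module writes references to captured groups in the replacement text as \1, \2, \3 instead of $1, $2, $3. The function name rewrite_urls is hypothetical.

```python
import re

OLD_URL = re.compile(
    r"http://www\.example\.com/(\d{2})/(\d{2})/(\d{4})/articlename\.html"
)
# \3 is the year, \2 the month, \1 the day, in the order they were captured.
NEW_URL = r"http://www.example.com/\3/\2/\1/articlename.html"


def rewrite_urls(text):
    """Hypothetical helper: rewrite every old-style URL in `text` to the new layout."""
    return OLD_URL.sub(NEW_URL, text)


print(rewrite_urls("Moved: http://www.example.com/21/10/2011/articlename.html"))
# Moved: http://www.example.com/2011/10/21/articlename.html
```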

Conclusion

As you may have come to realize, Regular Expressions are a very powerful tool indeed, and the best thing is that they go beyond any one programming language or tool. You will find Regex-enabled applications everywhere: file renaming tools may use Regex to allow complex renaming operations, and most IDEs and search tools support it as well. Once you master regular expressions, you will find tons of applications.

What we have covered here are some of the basic principles of regular expressions; many advanced features still lie out there for you to learn. Regex is an intricate tool, and most of its intricacies only become clear as one begins using it.

 

Table: Regex reference

Character   Function
.           Match all characters except newline
[abc]       Match all characters inside brackets
[^abc]      Match all characters except those inside the brackets
[n-q]       Match all characters in range
\d          Match all digits
\D          Match all nondigits
\n          Match newline character
\s          Match all whitespace characters
\S          Match all non-whitespace characters
\t          Match tab character
\w          Match any word character [A-Za-z0-9_]
\W          Match any non-word character
\b          Match word boundary i.e. match between word ending and space
\B          Match non-word boundary i.e. match inside word
\           For use with special characters \, +, (, etc.
*           Match preceding expression zero or more times
+           Match preceding expression one or more times
?           Preceding expression is optional, i.e. match zero or one time
{n}         Match preceding expression n times
{n,}        Match preceding expression n or more times
{n,m}       Match preceding expression between n and m times
()          Substring or capturing group
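To tie a few table entries together, here is a sketch of the task mentioned in the introduction: counting a two-word sequence whose words may be separated by any kind of whitespace. It uses \s (whitespace) and \b (word boundary), written with their backslashes. Python, the function name count_phrase, the chosen phrase, and the sample text are all our own assumptions for illustration.

```python
import re

# \s+ allows one or more spaces, tabs, or line breaks between the two words;
# \b keeps the match from starting or ending in the middle of a longer word.
PHRASE = re.compile(r"\bregular\s+expressions\b", re.IGNORECASE)


def count_phrase(text):
    """Hypothetical helper: count occurrences of the two-word phrase in `text`."""
    return len(PHRASE.findall(text))


# Invented sample: the words are separated by a space, a tab, and a line break.
sample = (
    "Regular expressions are handy.\n"
    "Learn regular\texpressions, then more regular\n expressions."
)
print(count_phrase(sample))  # 3
```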