
Note: If the document URL does not begin with https://randu.org/tutorials/perl/ then you are viewing a copy. Please direct your browser to the correct location for the most recent version.

Regular Expressions and String Manipulation

We will now turn our focus to applying what we have learned so far in Perl to help us accomplish various tasks that involve string scalars.
Regular Expressions are useful and very powerful, and Perl helps you manipulate strings with relative ease.

Regular Expressions :: Definition and Examples

So what is a regular expression (or regex for short)? A regex is a pattern that can be matched against a string. It's basically just a template that is applied over a string that is to be scanned.
If you are familiar with the *NIX utilities grep, sed, awk, lex, or yacc, this section can be skimmed (but pay attention to the subtle nuances of Perl regexes).
A simple regex where we look for abc in a string:
```
  while (<>) {
    if (/abc/) {
      print $_;
    }
  }  
```
The /abc/ is the regular expression or pattern we wish to match. This code block loops through each line of input (the files named on the command line, or standard input if there are none) and prints every line that contains abc anywhere in it, such as: abc, abcabc, xabcy. A similar grep statement would be: grep abc file.txt.
Character Classes are very important in regexes. A character class is a set of characters, and it matches exactly one character in the string, which may be any one of the characters in the set. For example:
```
  /[aeiouAEIOU]/
```
Would match any string with a vowel in it (which I believe covers nearly all words in the English language, except for those that use only "y", such as "rhythm"). Notice character classes begin and end with brackets, [ ].
You can specify ranges within character classes:
```
  /[a-zA-Z0-9]/
```
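Would match any single alphanumeric character, so the pattern succeeds on any string that contains at least one letter or digit. If you wanted to match a literal - inside a character class, you must escape it by using the backslash, \, or place it first or last in the class.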
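To match any character besides a newline, you use a period:
```
  /./
```
Would match any single character except a newline. Note that inside a character class the period loses this special meaning: /[.\n]/ matches only a literal period or a newline. To match ALL characters, including the newline, add the s option (/./s) or use a class such as [\s\S]. You can also negate character classes by placing a circumflex at the start of the character class: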
```
  /[^0-9]/
```
Would match any single character that is not a numeric digit, so the pattern succeeds on any string that contains at least one non-digit.
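As a rough illustration of how the character classes above behave, the following sketch tests a few sample strings. The strings and the %result hash are made up for this example; they are not part of the tutorial's own code.

```perl
use strict;
use warnings;

# Sample strings, made up for this illustration.
my @samples = ("rhythm", "perl", "2024", "a-b");

my %result;
for my $s (@samples) {
    $result{$s} = {
        vowel     => ($s =~ /[aeiouAEIOU]/ ? 1 : 0),  # contains a vowel
        alnum     => ($s =~ /[a-zA-Z0-9]/  ? 1 : 0),  # contains a letter or digit
        non_digit => ($s =~ /[^0-9]/       ? 1 : 0),  # contains a non-digit
        hyphen    => ($s =~ /[\-]/         ? 1 : 0),  # contains a literal hyphen
    };
    printf "%-8s vowel=%d alnum=%d non_digit=%d hyphen=%d\n",
        $s, @{ $result{$s} }{qw(vowel alnum non_digit hyphen)};
}
```

Note that "2024" fails the /[^0-9]/ test only because every one of its characters is a digit; a single non-digit anywhere would be enough for a match.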

Perl provides shorthand escapes for common character classes (the equivalents shown assume ASCII text):
```
  \d      # equivalent to [0-9]
  \D      # equivalent to [^0-9]
  \w      # equivalent to [a-zA-Z0-9_]
  \W      # equivalent to [^a-zA-Z0-9_]
  \s      # equivalent to [ \r\t\n\f] (whitespace)
  \S      # equivalent to [^ \r\t\n\f]
```
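A small sketch of the shorthand classes in use, assuming plain ASCII input. The sample text and the %found hash are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

my $text = "dock_7 opens at 9 am";   # hypothetical sample text

my %found = (
    digit    => ($text =~ /\d/ ? 1 : 0),   # 7 and 9 are digits
    space    => ($text =~ /\s/ ? 1 : 0),   # the words are separated by spaces
    non_word => ($text =~ /\W/ ? 1 : 0),   # a space is not a word character
);

# The underscore counts as a word character, so "dock_7" has no \W match.
$found{non_word_in_id} = ("dock_7" =~ /\W/ ? 1 : 0);

print "$_ => $found{$_}\n" for sort keys %found;
```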

Multipliers allow you to repeat matches in any given regex:
```
  /abc*/
  /a+bc/
  /ab?c/
```
The first regex matches an a, then a b then zero or more c's. The second regex matches one or more a's followed by a b and a c. The last regex matches an a, then zero or one b's and then a c. You can even specify the number of matches:
```
  /a.{2,4}c/
  /xyz{3}/
  /j{2,}kl/
```
The braces on the first one specify that you must match a minimum of 2 and a maximum of 4 of any character in between an a and a c. The second one says that the string must contain an x, a y, and then 3 z's in a row (the multiplier applies only to the z). The last one specifies 2 or more j's followed by a k and an l. So here's an equivalence chart:
```
  .*    # same as .{0,}
  .+    # same as .{1,}
  .?    # same as .{0,1}
```
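To see the multipliers above in action, here is a minimal sketch that tries each pattern against a candidate string. The candidate strings and the @cases and @matched arrays are invented for illustration.

```perl
use strict;
use warnings;

# Each entry: [ pattern, candidate string ]. The strings are made up.
my @cases = (
    [ qr/abc*/,     "ab"    ],  # zero c's is fine
    [ qr/a+bc/,     "aaabc" ],  # several a's are fine
    [ qr/ab?c/,     "abbc"  ],  # two b's: no match
    [ qr/a.{2,4}c/, "axc"   ],  # only one character between a and c: no match
    [ qr/a.{2,4}c/, "axxxc" ],  # three characters between a and c
    [ qr/xyz{3}/,   "xyz"   ],  # {3} applies to z only, so xyzzz is needed
    [ qr/xyz{3}/,   "xyzzz" ],
    [ qr/j{2,}kl/,  "jjjkl" ],  # three j's satisfies "2 or more"
);

# 1 if the string matches its pattern, 0 otherwise.
my @matched = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

for my $i (0 .. $#cases) {
    printf "%-8s %s\n", $cases[$i][1],
        $matched[$i] ? "matches" : "does not match";
}
```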
In Perl, you can use parentheses as memory: the text matched by a parenthesized part of a regex is remembered and can be matched again later in the same regex:
```
  /a(.*)b\1c/
```
Matches an a followed by zero or more non-newline characters, then a b, and then the \1 matches the same text that (.*) matched (not merely the same pattern), followed by a c. So this would match, for example: azzzzbzzzzc, but not azzzzbyyyyc. You can have more than one parenthesized part of a regex; you can refer to the other ones by \2, \3 and so on, numbered from the left by opening parenthesis.
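The sketch below checks the two example strings and also shows that, after a successful match, Perl makes the remembered text available outside the regex as $1, $2, and so on. The %captured hash and the "abba" example are additions made for illustration.

```perl
use strict;
use warnings;

my %captured;
for my $s ("azzzzbzzzzc", "azzzzbyyyyc") {
    if ($s =~ /a(.*)b\1c/) {
        $captured{$s} = $1;   # text remembered by the first group
        print "$s matches, group 1 held '$1'\n";
    } else {
        print "$s does not match\n";
    }
}

# Two groups: \1 and \2 refer to them in order of their opening parenthesis.
my $groups = "abba" =~ /(\w)(\w)\2\1/ ? "$1$2" : "";
print "groups in abba: $groups\n";
```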
Alternation allows you to match any one of several regexes. You use the pipe character to separate patterns:
```
  /abc|jkl|xyz/
```
Would match "abc", "jkl", or "xyz".
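A short sketch of alternation, using invented test strings. The second pattern is an addition to the text above: it assumes you want the alternation to cover only part of the pattern, which parentheses make possible.

```perl
use strict;
use warnings;

# Keep only the strings that contain abc, jkl, or xyz somewhere.
my @hits = grep { /abc|jkl|xyz/ } ("xxjklxx", "abz", "wxyz");
print "alternation matched: @hits\n";

# Parentheses limit how far the alternation reaches.
my @greys = grep { /gr(a|e)y/ } ("gray", "grey", "groy");
print "grouped alternation matched: @greys\n";
```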
Anchoring allows you to make sure that the specified pattern matches up with specific parts of the string. For example:
```
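  /\bmo/     # Matches mo at the start of any word in the string
  /^mo/      # Matches mo only at the start of the string
  /mo\b/     # Matches mo at the end of any word in the string
  /mo$/      # Matches mo only at the end of the string (or just before a final newline)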
```
\B is the opposite of \b; it matches wherever there is no word boundary.
Other constructs such as (?=...), (?!...), (?<=...), and (?<!...) are lookahead and lookbehind assertions (collectively called lookarounds). See Wall pg. 203-204 for more information.
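The difference between the string anchors (^ and $) and the word-boundary anchors (\b and \B) is easiest to see side by side. This is only a sketch; the sample strings and the %matches hash are made up for the comparison.

```perl
use strict;
use warnings;

my @strings = ("mole", "a mole", "almost", "demo", "demo tape");

# For each anchor, collect the strings it matches.
my %matches = (
    '^mo'  => [ grep { /^mo/  } @strings ],  # mo at the start of the string
    '\bmo' => [ grep { /\bmo/ } @strings ],  # mo at the start of a word
    '\Bmo' => [ grep { /\Bmo/ } @strings ],  # mo not at the start of a word
    'mo\b' => [ grep { /mo\b/ } @strings ],  # mo at the end of a word
    'mo$'  => [ grep { /mo$/  } @strings ],  # mo at the end of the string
);

for my $pattern (sort keys %matches) {
    printf "%-5s %s\n", $pattern, join(", ", @{ $matches{$pattern} });
}
```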
If you wanted to select a different target than $_, you can change the target of the regex by using the =~ operator:
```
  $a = "hello world";
  $a =~ /^he/;             # true
```
You can also append options/cases to the closing slash of the matching operator, for instance to ignore case:
```
  if (<STDIN> =~ /^y/i) {
    # the line read from standard input begins with y or Y, let's do something with it
  }
```

You can also use other scalars as matching patterns:
```
  $match = "this";
  $target = "This sentence contains this word.";
  if ($target =~ /$match/) {
    # do stuff
  }
```
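One thing to keep in mind when a scalar is used as a pattern is that any regex metacharacters inside it keep their special meaning. The sketch below, which uses made-up values, shows Perl's \Q...\E escape for treating the scalar's contents literally, and combines an interpolated pattern with the i option.

```perl
use strict;
use warnings;

my $match = "a.c";   # hypothetical pattern text containing a metacharacter

my $loose   = "abc" =~ /$match/     ? 1 : 0;   # the . matches any character
my $literal = "abc" =~ /\Q$match\E/ ? 1 : 0;   # \Q...\E makes the . literal
my $exact   = "a.c" =~ /\Q$match\E/ ? 1 : 0;   # a literal period is present

# Options such as i work with interpolated patterns as usual.
my $word   = "this";
my $target = "This sentence contains this word.";
my $starts = $target =~ /^$word/i ? 1 : 0;     # case is ignored

print "loose=$loose literal=$literal exact=$exact starts=$starts\n";
```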

Regular Expressions :: String Manipulation

Simple text replacement uses the substitution operator: prepend "s" to the matching operator and follow the pattern with the replacement text:
```
  $_ = "foot fool buffoon";
  s/foo/bar/;
```
Now $_ contains "bart fool buffoon"; only the first match was replaced. What if we wanted to replace ALL instances of foo? Append "g" to the substitution operator:
```
  s/foo/bar/g;
```
would do the trick. $_ contains "bart barl bufbarn".
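The following sketch repeats both substitutions on copies of the string, so the original stays intact, and adds two things not covered above: the count that s///g returns, and the use of remembered text ($1, $2) in the replacement. The variable names and the "Wall, Larry" string are chosen only for illustration.

```perl
use strict;
use warnings;

my $text = "foot fool buffoon";

# Copy first, then substitute on the copy.
(my $first = $text) =~ s/foo/bar/;    # first match only
(my $all   = $text) =~ s/foo/bar/g;   # every match

# s///g also returns how many replacements were made.
my $copy  = $text;
my $count = ($copy =~ s/foo/bar/g);

# Remembered text ($1, $2) can be used in the replacement.
my $name = "Wall, Larry";
$name =~ s/(\w+), (\w+)/$2 $1/;

print "$first\n$all\n$count replacements\n$name\n";
```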
What if you wanted to split up a string that contained different fields for instance? Use the split function:
```
  $line = "user:600:100:/home/user:/usr/bin/perl";
  @fields = split(/:/, $line);
```
So now we have a list called @fields that contains each part of $line. The /:/ denotes that we will use : as the delimiter. Note that the /:/ is a regex!
What if we wanted to join things together? Simple! Use the join function:
```
  $newline = join(":", @fields);
```
Now $newline is a string with the fields list delimited by :.
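Putting split and join together, a minimal sketch might take the record apart, change one field, and reassemble it. The replacement shell path and the comma-separated example are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $line   = "user:600:100:/home/user:/usr/bin/perl";
my @fields = split(/:/, $line);

# The fields can also be unpacked into named scalars.
my ($login, $uid, $gid, $home, $shell) = @fields;

$fields[4] = "/bin/sh";   # hypothetical change of the last field
my $newline = join(":", @fields);

# Because the delimiter is a regex, it can absorb optional whitespace.
my @items = split(/\s*,\s*/, "a , b,c");

print scalar(@fields), " fields; shell was $shell\n";
print "$newline\n";
print "items: @items\n";
```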

String Manipulation :: Transliteration

Similar to the *NIX tr command, Perl provides a transliteration operator, tr, which maps characters one-for-one. It is useful for making changes that would cancel each other out if done with successive substitutions. For example:
```
  $_ = "fred and barney";
  tr/fb/bf/;
```
$_ now contains "bred and farney". You can even append "d" to the end of the operator to delete characters that appear in the search list but have no corresponding replacement character. (This could be useful to remove extraneous characters, for example the ^M carriage returns from DOS text files.)
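A few more uses of tr, sketched with made-up strings: counting characters (tr returns the number of characters it matched), mapping a range, and the d option applied to carriage returns, assuming the DOS line endings are \r\n.

```perl
use strict;
use warnings;

my $names = "fred and barney";

# With an empty replacement list and no d option, tr only counts.
my $vowels = ($names =~ tr/aeiou//);

# Ranges work as in character classes; here on a copy of the string.
(my $upper = $names) =~ tr/a-z/A-Z/;

# The d option deletes characters that have no replacement.
my $dos = "line one\r\nline two\r\n";
(my $unix = $dos) =~ tr/\r//d;

print "$vowels vowels\n$upper\n";
print "carriage returns removed: ", length($dos) - length($unix), "\n";
```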

Strings :: Other Functions

The index and rindex functions find the first and last occurrence, respectively, of a substring inside of a string. They are like strstr in C, but return the position index (or -1 if the substring is not found) instead of a pointer to the occurrence.
You can extract substrings out of strings by using the substr function.
There are many other functions dealing with string scalars; refer to the Larry Wall book for more information.
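As a brief sketch of index, rindex, and substr, consider the following. Positions are counted from 0, and the sample string and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $s = "hello world";

my $first_o = index($s, "o");     # position of the first o
my $last_o  = rindex($s, "o");    # position of the last o
my $missing = index($s, "xyz");   # -1 when the substring is absent

my $head  = substr($s, 0, 5);     # 5 characters starting at position 0
my $tail  = substr($s, 6);        # from position 6 to the end
my $last3 = substr($s, -3);       # a negative offset counts from the end

# With a fourth argument, substr replaces the selected part in place.
my $copy = $s;
substr($copy, 0, 5, "howdy");

print "$first_o $last_o $missing\n$head|$tail|$last3\n$copy\n";
```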

Tangent :: Sorting

Remember when we were doing data structures, how sort ordered the list by ASCII and not numerically? A numeric sort can easily be done by telling sort what the sort criterion is:
```
  sub by_number {
    if ($a < $b) {
      return -1;
    } elsif ($a == $b) {
      return 0;
    } elsif ($a > $b) {
      return 1;
    }
  }

  @sortedlist = sort by_number @list;
```
Seems pretty simple. But Perl provides for an even easier sorting method. Because this three-way comparison happens regularly with routines like sorting, Perl has a built-in operator for it, <=>, also known as the spaceship operator. So we can rewrite this as:
```
  @sortedlist = sort { $a <=> $b } @list;
```
Very simple! There is a comparable operator for string scalars called cmp (but sort already performs an ASCII-based string sort by default).
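To compare the sort orders side by side, here is a small sketch with made-up data. Swapping $a and $b reverses the order, and the case-insensitive comparison using lc is an extra illustration that is not part of the text above.

```perl
use strict;
use warnings;

my @list = (10, 9, 100, 1);

my @default    = sort @list;                 # string (ASCII) order
my @numeric    = sort { $a <=> $b } @list;   # ascending numeric order
my @descending = sort { $b <=> $a } @list;   # swap $a and $b to reverse

# cmp compares strings; lc folds case so "Alice" sorts before "bob".
my @names = sort { lc($a) cmp lc($b) } ("bob", "Alice", "carol");

print "default:    @default\n";
print "numeric:    @numeric\n";
print "descending: @descending\n";
print "names:      @names\n";
```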

Regular Expressions and String Manipulation Review

Regular Expressions are one of the powerhouses of Perl. With regexes you can match and manipulate scalars, especially strings, however you wish.
There are many useful string functions, similar to those in the C string library.
Numeric sorting can be easily accomplished via the spaceship operator.
