CSC/ECE 517 Fall 2009/wiki1b 5 13

Regular Expressions:

A pattern matching language used to match user defined patterns within text strings/files. Regular expressions gained popularity with the Unix ed and grep (grep derived from ed's search command g/re/p[1]). Since then regular expressions have been incorporated into text editors, command utilities and programming languages.

Regular Expression Engines:

There are two prominent types of regular expression engines; DFA (Deterministic Finite Automaton) and NFA (Nondeterministic Finite Automaton). NFA is found primarily in modern scripting languages, while DFA is found in older parsers.

DFA (Deterministic Finite Automaton) : State machine based algorithm that keeps track of all matches in progress. As soon as a match is made it returns "leftmost match wins"[2]; with the exception of alternation matches (<substring1>|<substring2>). If all alternation options match then the longest match will be selected - "longest leftmost match wins". Non-greedy searches are not supported since they are nondeterministic in nature. ie A.c =~ AbcvAbcAdd returns Abc DFA Languages : lex, awk, vi(m)

NFA (Nondeterministic Finite Automaton) : Greedy search routine that keeps track where the algorithm branches to determine if the most greedy match succeeds; will roll back to original position for subsequent less greedy matches. Supports non-greedy matches (ie match 0 or 1 times). ie A.c =~ AbcvAbcAdd matches both occurrences of Abc in the string. NFA Types:

Traditional NFA : gives preference to the earliest substring in alternation matches (dog|doggy). (dog|doggy) =~ doggy returns dog. (doggy|dog) =~ doggy returns doggy.
POSIX NFA : handles alternation matches like DFA engines. ie (dog|doggy) =~ doggy return doggy.

Traditional NFA languages : Perl, Python, Ruby, Java, PHP, JavaScript

POSIX NFA : mawk[3]

DFA vs NFA

Since the DFA engine is deterministic its performance can be bounded by its search set. NFA performance isn't as well defined since, the NFA can roll back on itself to find a less greedy match. Effectively redoing the regular expression search for the less greedy case. Those interesting in the theory and performance differences should review the following page: http://swtch.com/~rsc/regexp/regexp1.html

Regular Expression Support in Programming Languages:

All examples are the same that want to get the month, day and year from the following format (mm-dd-yyyy).

Perl

Perl has built in language support for regular expressions without requiring additional package loads or imports. Regular expression matches and returns are stored in special after match variables that are considered "reserved" variable names ($1, $2... or \1, \2....).

Example

   $date = '01-02-2001'
   $date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
   $1 => '01'
   $2 => '02'
   $3 => '2001'

Reg Ex Construct	Explanation
m/pattern1/pattern2/	Matches pattern1 against pattern2
s/search_pat/replace_pat/[g]	Substitution. Search pattern is replaced with replace pattern. If /g is specified then multiple replaces will be performed on the line.
split /separator/, string	Splits the given string based upon the supplied separator string.

Perl.com's detailed discussion of regular expressions and Perl : http://www.perl.com/doc/manual/html/pod/perlre.html

Python

Python has built in support via the import of the 're' package that comes standard in the Python installation libraries. Regular expression matches and returns are stored in regular expression objects that are accessed through '.' notation.

Example

   import re
   date = '01-02-2001'
   date_match = re.search(r'(\d\d)-(\d\d)-(\d\d\d\)', date)
   month   = date_match.group(1) => '01'
   day     = date_match.group(2) => '02'
   year    = date_match.group(3) => '2001'

Python allows for regular expressions to be "pre-compiled" at run time through the use of the re.compile() function. "Pre-compiled" regular expressions will exist while they are scoped which can result in a performance increase if the same regex is going to be used multiple times.

Reg Ex Construct	Explanation
match(string)	returns an object if the test string matches regex pattern (\Astring). Not widely used since the match has to match from beginning of the string.
search(string)	returns an regex object if the test string is in the string.
sub(replace_string, string)	returns the string with substituion
split(split_string[, max_split=x])	splits the string based upon the given split_string. Can be bounded by max_split.

A detailed discussion about Regular Expressions and Python can be found at: http://www.amk.ca/python/howto/regex/

Ruby

Ruby has built in language support for regular expressions without addition package loads or imports. Ruby being "Perl but even better"[4] supports Perl style =~ or =! regular expression matches as well as a more OO approach via <string>.match(). Regular expression returns are stored in after match variables as well as in return regular expression objects.

Option 1, use Ruby's version of Perl syntax:

   date = '01-02-2001'
   date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
   $1 => '01'
   $2 => '02'
   $3 => '2001'

Reg Ex Construct	Explanation
string1 =~ string2	match operator. Returns a match if string2's match is in string1; also returns initial match position.
string1 !~ string2	doesn't match operator. Returns true if string2 is not a substring of string1.

A general overview of Ruby's Regular Expression syntax that has some Perl-style regular expression examples: http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm

Option 2, use Ruby's OO version of regular expressions by using the STRING objects match method.

   date_match = date.match('(\d\d)-(\d\d)-(\d\d\d\d)')
   month   = date_match[1] => '01'
   day     = date_match[2] => '02'
   year    = date_match[3] => '2001'

Reg Ex Construct	Explanation
string.match(pattern)	Class STRING's match operator.
string.gsub(pattern, replacement)	Replaces the pattern with the replacement string
string.split(pattern[, limit])	Splits the string based upon the supplied pattern. If a limit is given then the string will be split upto the given limit.

Full definition of the STRING class's built-in methods can be found at: http://www.ruby-doc.org/core/classes/String.html

Option 3, use Ruby's OO version of regular expressions by using the Regexp object.

   date_match = Regexp.new('(\d\d)-(\d\d)-(\d\d\d\d)')
   date = date_match.match(date)
   month = date[1] => '01'
   day   = date[2] => '02'
   year  = date[3] => '2001'

Like the python re.compile, the use of Regexp.new/compile also scopes the regular expression object for repeated use in code blocks.

Reg Ex Construct	Explanation
regex_obj.compile(pattern)	Compile/New creates a base pattern to be used in successive Regular Expression opertions.
regex_obj.match(string)	returns a match array if the string has the regex pattern in it.

Full definition of the REGEXP can be found at : http://www.ruby-doc.org/core/classes/Regexp.html

JavaScript

JavaScript has support for pattern matching within the String objects or if more advanced features are desired the RegExp object can be utilized. RegEx matches are returned as an array or -1 (-1 indicates no match).

Example:

   var date  = "01-02-2001";
   var regex = new RegExp("(\d\d)-(\d\d)-(\d\d\d\d)");
   var match_array = RegExp.exec(date)
   match_array[1] => '01'
   match_array[2] => '02'
   match_array[3] => '2001'

Reg Ex Construct after new RegExp is created	Explanation
search(pattern)	searches for the match of the pattern in the string. Returns initial start position if a match is found else it returns a -1.
match(pattern)	searches for the match of the pattern in the string. Returns an array of matches if the pattern is in the string else it returns a -1.
replace(pattern, replacement)	Replaces the match pattern with the replace pattern if the match pattern is in the string.
split(pattern, limit)	Splits the line based upon the split pattern upto the limit.

A basic JavaScript tutorial can be found at: http://www.evolt.org/regexp_in_javascript

Java

Java doesn't have built in native support for regular expressions but Sun and others have developed Regular Expression packages that can be imported to allow for regular expression functionality. Since regular expressions are an add-on and Java is strictly typed there is more syntactic overhead to produce a Java based regular expression.

Example:

   Pattern date_pat = new Pattern("(\d\d)-(\d\d)-(\d\d\d\d)");
   Matcher date_mat = date_pat('01-02-2001');
   date_mat.Group(1)  => '01'
   date_mat.Group(2)  => '02'
   date_mat.Group(3)  => '2001'

Reg Ex Construct Matcher Class	Explanation
Group()	Matched substring patterns that are in the string.
replaceAll(string replacement)	Replaces all of the matched substring patterns with the replacement string.
replaceFirst(string replacement)	Replaces the first substring pattern with the replacement string.

Sun's Java and Regular Expressions article with example code and rules : http://java.sun.com/developer/technicalArticles/releases/1.4regex/

PHP

PHP has built in support via the preg() method routine. Regular expression matches are returned in user defined arrays.

Example:

   preg_match('(\d\d)-(\d\d)-(\d\d\d\d)', '01-02-2001', $date_mat)
   $date_mat[1] => '01'
   $date_mat[2] => '02'
   $date_mat[3] => '2001'

Reg Ex Construct	Explanation
preg_match(pattern, string[, array])	Matches the pattern within the string. The resulting groups are stored in the optional array.
preg_replace(pattern, replacement, string[, limit])	replaces the pattern with the replacement in the supplied string upto the limit if one is given.
preg_split(pattern, string[, limit])	splits the supplied string in to substrings based upon the supplied pattern upto the limit.

PHP regular expression tutorial; also provides sections for other languages : http://www.regular-expressions.info/php.html

Discussion

All of the covered languages have basic Regular Expression support with near identical support of the Regular Expression language constructs (see table below). Once basic support is achieved then the discussion should center more around syntax and use cases then does this language support this feature or that feature.

The "Perl" style syntax is syntactically light and easy to use but the results are stored in special metavariables that almost require the programmer to know how many returns should be expected. Python, PHP and javascript require slightly more initial code but since their results are stored in arrays or objects, the result can be easily iterated and processed. Ruby is a nice compromise that supports the 'Perl' style regular expression syntax or the more array/object based syntax of Python. Java requires more syntax but generally a programmer isn't going to write a Regular Expression heavy program in Java; they may need Regular Expressions sprinkled throughout the program to handle URLs or simple string/line modification.

Regular expressions Characters and support in different languages

Reg Ex Construct	Perl	PHP	Python	Java	Ruby	Java script
Word Character (a-zA-Z)	\w	\w	\w	\w	\w	\w
Non-Word Character	\W	\W	\W	\W	\W	\W
Digit character	\d	\d	\d	\d	\d	\d
Non- digit Chracter	\D	\D	\D	\D	\D	\D
White space character	\s	\s	\s	\s	\s	\s
Non-White space character	\S	\S	\S	\S	\S	\S
Start of string or after newline in multi -line mode	^	^	^	^	^	^
matching end of line	$	$	$	$	$	$
Start of search string in all modes	\A	\A	\A	\A	\A	\A
End of string	\z	\z	\z	\z	\z	\z
End of String before newline	\Z	\Z	\Z	\Z	\Z	\Z
Word boundary (distance between last word character and Non word character)	\b	\b	\b	\b	\b	\b
Non- Word boundary	\B	\B	\B	\B	\B	\B
Matching any single character in bracket	[...]	[...]	[...]	[...]	[...]	[...]
Matching any single character not inside bracket	[^...]	[^...]	[^...]	[^...]	[^...]	[^...]
New line character	\n	\n	\n	\n	\n	\n
Tab character	\t	\t	\t	\t	\t	\t
Form feed	\f	\f	\f	\f	\f	\f
esc character	\e	\e	\e	\e	\e	\e
Octal character	\octal	\octal	\octal	\Ooctal	\octal	\octal
Character specifying one or two digit hexa code	\xhex	\xhex	\xhh	\xhex	-	\xhh
Character specifying 4-digit hexa code	\x{hex}- any hex digit character	\x{hex}-any hexa code	\uhhhh	-	-	\uhhhh
Case insensitive mode	i	i	i	i	i	i
Case match mode	m	m	m	m	m	m
Ignore white space mode	x	x	x	x	x	x
Group sub pattern and capture sub pattern	(...)	(...)	(...)	(...)	(...)	(...)
Match zero or more times or few as possible	*	*	*	*	*	*
Match one or more times or few as possible	+	+	+	+	+	+
Match exactly n times	{n}	{n}	{n}	{n}	{n}	{n}
match n times or more	{n,}	{n,}	{n,}	{n,}	{n,}	{n,}
Positive lookahead	(?=...)	(?=...)	(?=...)	(?=...)	(?=...)	(?=...)
Negative lookahead	(?!...)	(?!...)	(?!...)	(?!...)	(?!...)	(?!...)
Positive look behind	(?<=...)	(?<=...)	(?<=...)	(?<=...)	(?<=...)	(?<=...)
Negative look behind	(?<!...)	(?<!...)	(?<!...)	(?<!...)	(?<!...)	(?<!...).