CSC/ECE 517 Fall 2009/wiki1b 5 13: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
No edit summary
 
Line 2: Line 2:
== Regular Expressions: ==
== Regular Expressions: ==


A pattern matching language used to match patterns within text strings/files.  Regular expressions gained popularity with the Unix ed and grep (grep derived from ed's search command g/re/p[1] - Wikipedia).  Since then regular expressions have been incorporated into text editors, command utilities and programming languages.  
A pattern matching language used to match user defined patterns within text strings/files.  Regular expressions gained popularity with the Unix ed and grep (grep derived from ed's search command g/re/p[http://en.wikipedia.org/wiki/Regular_expressions]).  Since then regular expressions have been incorporated into text editors, command utilities and programming languages.  


Regular Expression Engines:
=== Regular Expression Engines: ===
DFA (Deterministic Finite Automaton) - State machine based algorithm that keeps track of all matches in progress.  As soon as a match is made it returns "leftmost match wins" [2] (MRE - 148-149); with the exception of alternation matches (<substring1>|<substring2>).  If all alternation options match then the longest match will be selected - "longest leftmost match wins"[3](MRE - 155-156).  Non-greedy searches are not supported since they are nondeterministic in nature. ie A.c =~ AbcvAbcAdd returns Abc
There are two prominent types of regular expression engines; DFA (Deterministic Finite Automaton) and NFA (Nondeterministic Finite Automaton).  NFA is found primarily in modern scripting languages, while DFA is found in older parsers.
 
'''DFA (Deterministic Finite Automaton) :''' State machine based algorithm that keeps track of all matches in progress.  As soon as a match is made it returns "leftmost match wins"[http://books.google.com/books?id=GX3w_18-JegC&pg=PA148&lpg=PA148&dq=leftmost+match+wins&source=bl&ots=PHphPngpMZ&sig=PRca-pLRG3sIJS2BofHl9fvoiDE&hl=en&ei=W5W1SunRK6ig8Qb9kdQ3&sa=X&oi=book_result&ct=result&resnum=3#v=onepage&q=leftmost%20match%20wins&f=false]; with the exception of alternation matches (<substring1>|<substring2>).  If all alternation options match then the longest match will be selected - "longest leftmost match wins"[3](MRE - 155-156).  Non-greedy searches are not supported since they are nondeterministic in nature. ie A.c =~ AbcvAbcAdd returns Abc
DFA Languages : lex, awk, vi(m)
DFA Languages : lex, awk, vi(m)


NFA (Nondeterministic Finite Automaton) - Greedy search routine that keeps track where the algorithm branches to determine if the most greedy match succeeds.  Will roll back to original position for subsequent less greedy matches.  ie A.c =~ AbcvAbcAdd matches both occurrences of Abc in the string.
'''NFA (Nondeterministic Finite Automaton) :''' Greedy search routine that keeps track where the algorithm branches to determine if the most greedy match succeeds; will roll back to original position for subsequent less greedy matches.  Supports non-greedy matches (ie match 0 or 1 times).  ie A.c =~ AbcvAbcAdd matches both occurrences of Abc in the string.
    NFA Types:
NFA Types:
    * Traditional NFA : gives preference to the earliest substring in alternation matches (dog|doggy). (dog|doggy) =~ doggy returns dog. (doggy|dog) =~ doggy returns doggy.
* Traditional NFA : gives preference to the earliest substring in alternation matches (dog|doggy). (dog|doggy) =~ doggy returns dog. (doggy|dog) =~ doggy returns doggy.
    * POSIX NFA : handles alternation matches like DFA engines.  ie (dog|doggy) =~ doggy return doggy.
* POSIX NFA : handles alternation matches like DFA engines.  ie (dog|doggy) =~ doggy return doggy.


Traditional NFA languages : Perl, Python, Ruby, Java, PHP, JavaScript
Traditional NFA languages : Perl, Python, Ruby, Java, PHP, JavaScript
POSIX NFA : mawk[4]http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
POSIX NFA : mawk[http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html]
Regular Expression Support in Programming Languages (Traditional NFAs):
===Regular Expression Support in Programming Languages (Traditional NFAs):===
All examples are the same that want to get the month, day and year from the following format (mm-dd-yyyy).
All examples are the same that want to get the month, day and year from the following format (mm-dd-yyyy).


Perl :
=====DFA vs NFA=====
Since the DFA engine is deterministic its performance can be bounded by its search set.  NFA performance isn't as well defined since, the NFA can roll back on itself to find a less greedy match.  Effectively redoing the regular expression search for the less greedy case.  Those interesting in the theory and performance differences should review the following page: http://swtch.com/~rsc/regexp/regexp1.html
 
====Perl====
Perl has built in language support for regular expressions without requiring additional package loads or imports.  Regular expression matches and returns are stored in special after match variables that are considered "reserved" variable names ($1, $2... or \1, \2....).
Perl has built in language support for regular expressions without requiring additional package loads or imports.  Regular expression matches and returns are stored in special after match variables that are considered "reserved" variable names ($1, $2... or \1, \2....).
    Example:
 
Example
 
     $date = '01-02-2001'
     $date = '01-02-2001'
     $date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
     $date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
Line 26: Line 33:
     $2 => '02'
     $2 => '02'
     $3 => '2001'
     $3 => '2001'
Python :
 
====Python====
Python has built in support via the import of the 're' package that comes standard in the Python installation libraries.  Regular expression matches and returns are stored in regular expression objects that are accessed through '.' notation.
Python has built in support via the import of the 're' package that comes standard in the Python installation libraries.  Regular expression matches and returns are stored in regular expression objects that are accessed through '.' notation.
    Example:
 
Example
 
     import re
     import re
     date = '01-02-2001'
     date = '01-02-2001'
Line 37: Line 47:
     year    = date_match.group(3) => '2001'
     year    = date_match.group(3) => '2001'


Ruby :
====Ruby====
Ruby has built in language support for regular expressions without addition package loads or imports.  Ruby being "Perl but even better"[4] supports Perl style =~ or =! regular expression matches as well as a more OO approach via <string>.match().  Regular expression returns are stored in after match variables as well as in return regular expression objects.
Ruby has built in language support for regular expressions without addition package loads or imports.  Ruby being "Perl but even better"[4] supports Perl style =~ or =! regular expression matches as well as a more OO approach via <string>.match().  Regular expression returns are stored in after match variables as well as in return regular expression objects.
    Example:
 
Option 1, use Ruby's version of Perl syntax:
 
     date = '01-02-2001'
     date = '01-02-2001'
     date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
     date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
Line 45: Line 57:
     $2 => '02'
     $2 => '02'
     $3 => '2001'
     $3 => '2001'
Option 2, use Ruby's OO version of regular expressions by using the STRING objects match method.


     date_match = date.match('(\d\d)-(\d\d)-(\d\d\d\d)')
     date_match = date.match('(\d\d)-(\d\d)-(\d\d\d\d)')
Line 50: Line 64:
     day    = date_match[2] => '02'
     day    = date_match[2] => '02'
     year    = date_match[3] => '2001'
     year    = date_match[3] => '2001'
JavaScript :
====JavaScript====
JavaScript has support for pattern matching within the String objects or if more advanced features are desired the RegExp object can be utilized.  RegEx matches are returned as an array or -1 (-1 indicates no match).
JavaScript has support for pattern matching within the String objects or if more advanced features are desired the RegExp object can be utilized.  RegEx matches are returned as an array or -1 (-1 indicates no match).
    Example:
 
Example:
 
     var date = "01-02-2001"
     var date = "01-02-2001"
     var regex =
     var regex =
Java :
====Java====
Java doesn't have built in native support for regular expressions but Sun and others have developed Regular Expression packages that can be imported to allow for regular expression functionality.  Since regular expressions are an add-on and Java is strictly typed there is more syntactic overhead to produce a Java based regular expression.
Java doesn't have built in native support for regular expressions but Sun and others have developed Regular Expression packages that can be imported to allow for regular expression functionality.  Since regular expressions are an add-on and Java is strictly typed there is more syntactic overhead to produce a Java based regular expression.
    Example:
 
Example:
      
      
PHP :
====PHP====
PHP has built in support via the preg() method routine.  Regular expression matches are returned in user defined arrays.
PHP has built in support via the preg() method routine.  Regular expression matches are returned in user defined arrays.
    Example:
 
Example:

Revision as of 02:52, 20 September 2009

Regular Expressions:

A pattern matching language used to match user defined patterns within text strings/files. Regular expressions gained popularity with the Unix ed and grep (grep derived from ed's search command g/re/p[1]). Since then regular expressions have been incorporated into text editors, command utilities and programming languages.

Regular Expression Engines:

There are two prominent types of regular expression engines; DFA (Deterministic Finite Automaton) and NFA (Nondeterministic Finite Automaton). NFA is found primarily in modern scripting languages, while DFA is found in older parsers.

DFA (Deterministic Finite Automaton) : State machine based algorithm that keeps track of all matches in progress. As soon as a match is made it returns "leftmost match wins"[2]; with the exception of alternation matches (<substring1>|<substring2>). If all alternation options match then the longest match will be selected - "longest leftmost match wins"[3](MRE - 155-156). Non-greedy searches are not supported since they are nondeterministic in nature. ie A.c =~ AbcvAbcAdd returns Abc DFA Languages : lex, awk, vi(m)

NFA (Nondeterministic Finite Automaton) : Greedy search routine that keeps track where the algorithm branches to determine if the most greedy match succeeds; will roll back to original position for subsequent less greedy matches. Supports non-greedy matches (ie match 0 or 1 times). ie A.c =~ AbcvAbcAdd matches both occurrences of Abc in the string. NFA Types:

  • Traditional NFA : gives preference to the earliest substring in alternation matches (dog|doggy). (dog|doggy) =~ doggy returns dog. (doggy|dog) =~ doggy returns doggy.
  • POSIX NFA : handles alternation matches like DFA engines. ie (dog|doggy) =~ doggy return doggy.

Traditional NFA languages : Perl, Python, Ruby, Java, PHP, JavaScript POSIX NFA : mawk[3]

Regular Expression Support in Programming Languages (Traditional NFAs):

All examples are the same that want to get the month, day and year from the following format (mm-dd-yyyy).

DFA vs NFA

Since the DFA engine is deterministic its performance can be bounded by its search set. NFA performance isn't as well defined since, the NFA can roll back on itself to find a less greedy match. Effectively redoing the regular expression search for the less greedy case. Those interesting in the theory and performance differences should review the following page: http://swtch.com/~rsc/regexp/regexp1.html

Perl

Perl has built in language support for regular expressions without requiring additional package loads or imports. Regular expression matches and returns are stored in special after match variables that are considered "reserved" variable names ($1, $2... or \1, \2....).

Example

   $date = '01-02-2001'
   $date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
   $1 => '01'
   $2 => '02'
   $3 => '2001'

Python

Python has built in support via the import of the 're' package that comes standard in the Python installation libraries. Regular expression matches and returns are stored in regular expression objects that are accessed through '.' notation.

Example

   import re
   date = '01-02-2001'
   date_regex = re.compile(r'(\d\d)-(\d\d)-(\d\d\d\d)')
   date_match = date_regex.search(date)
   month = date_match.group(1) => '01'
   day     = date_match.group(2) => '02'
   year    = date_match.group(3) => '2001'

Ruby

Ruby has built in language support for regular expressions without addition package loads or imports. Ruby being "Perl but even better"[4] supports Perl style =~ or =! regular expression matches as well as a more OO approach via <string>.match(). Regular expression returns are stored in after match variables as well as in return regular expression objects.

Option 1, use Ruby's version of Perl syntax:

   date = '01-02-2001'
   date =~ '(\d\d)-(\d\d)-(\d\d\d\d)'
   $1 => '01'
   $2 => '02'
   $3 => '2001'

Option 2, use Ruby's OO version of regular expressions by using the STRING objects match method.

   date_match = date.match('(\d\d)-(\d\d)-(\d\d\d\d)')
   month = date_match[1] => '01'
   day     = date_match[2] => '02'
   year    = date_match[3] => '2001'

JavaScript

JavaScript has support for pattern matching within the String objects or if more advanced features are desired the RegExp object can be utilized. RegEx matches are returned as an array or -1 (-1 indicates no match).

Example:

   var date = "01-02-2001"
   var regex =

Java

Java doesn't have built in native support for regular expressions but Sun and others have developed Regular Expression packages that can be imported to allow for regular expression functionality. Since regular expressions are an add-on and Java is strictly typed there is more syntactic overhead to produce a Java based regular expression.

Example:

PHP

PHP has built in support via the preg() method routine. Regular expression matches are returned in user defined arrays.

Example: