CSC/ECE 517 Fall 2009/wiki1b 5 j8

From Expertiza_Wiki
Revision as of 01:39, 28 September 2009 by Lee (talk | contribs) (→‎References)

Regular Expressions

Regular expressions are a critical part of most modern programming languages, especially ones that deal with string processing as a core part of their functionality. They allow a developer to easily match or replace a string using patterns that range from very simple to very complex. Although the way regular expressions are used changes from language to language, the general principle is the same, and similar syntax can generally be used across the board.

This page is intended to compare the usage, syntax, and features of regular expressions in various programming languages. If you are unfamiliar with the concept of regular expressions, please visit some of the websites under Other Useful Links.

Usage

Perl

Perl has regular expressions built into the language itself via the '=~' operator. A simple match could be done like this:

$string = "lee";
if ($string =~ m/l.*/) {
   print "Matches";
}

This would print "Matches" since 'lee' starts with an 'l' and has zero or more characters after the 'l'. Replacements can be done simply by using 's' to indicate substitutions:

$string = "peewee";
$string =~ s/e+/aa/g;
print $string;

This would print "paawaa". The 'g' following the regular expression indicates a global replacement. Omit it to replace only the first match of 'e+', which would print "paawee" instead. [1]

Java

Unlike many languages, Java does not have built-in language support for regular expressions. It instead uses the Pattern and Matcher classes from the java.util.regex package to process regular expressions.

    Pattern patt = Pattern.compile("l.*");
    Matcher match = patt.matcher("lee");
    return match.matches();

This would return true. Since the regular expression is compiled when the Pattern object is created, the same Pattern can be reused with different inputs, which avoids recompiling the expression each time. [2]


    Pattern patt = Pattern.compile("l.*");
    Matcher match = patt.matcher("eel");
    return match.matches();

This would return false since 'eel' does not start with an 'l'. If a developer wants to use a regular expression only once and does not care to reuse the Pattern, he or she can use the static 'matches' method of Pattern:

   Pattern.matches("l.*", "lee");

or they can simply do operations on the String:

  String str = "lee";
  str.matches("l.*");

Replacements are done using:

 String str = "peewee";
 String result = str.replaceAll("e+", "aa");

This would return the string 'paawaa', replacing each run of one or more 'e' characters with two 'a's. Java strings are immutable, so the original 'peewee' is left unchanged and the result must be assigned. If you just wanted to replace the first instance you would use:

 String str = "peewee";
 String result = str.replaceFirst("e+", "aa");

which would return 'paawee'. [3]

Ruby

Ruby's support for regular expressions is very similar to Perl's, but with some differences. Matches are done in the exact same manner:

str = "lee"
if (str =~ /l.*/)
       print "Matches"
end

This would print "Matches".

Substitutions are one point where Ruby greatly differs from Perl. Instead of using the "s/regex/replace/" format, the methods sub, gsub, sub!, and gsub! can be called on any string. sub and gsub return a new string with the specified substitution, whereas sub! and gsub! do an in-place substitution. gsub differs from sub in that it does a global replacement instead of replacing only the first instance.

str = "peewee"
print str.gsub(/e+/, "aa")

would print 'paawaa'. [4]

Python

Python, like Java, does not have built-in language support for regular expressions. Instead, it provides them through the 're' module in its standard library. A simple match test can be done as follows:

import re
if re.match("l.*", "lee"):
   print("Match")

The above would print "Match". For substitution, python uses the "sub" function:

re.sub("e+", "aa", "peewee")

This would return "paawaa". To replace only the first match of 'e+', pass the optional count argument with a value of 1:

re.sub("e+", "aa", "peewee", 1)

This would return "paawee". The argument 1 tells sub to substitute only the first match. [5]
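
One detail worth noting is that re.match anchors the pattern only at the start of the string, whereas Java's matches requires the whole input to match. The following is a minimal sketch of that difference, assuming Python 3; it also uses re.compile as a rough counterpart to reusing a Java Pattern. The variable names are illustrative only.

```python
import re

# Compile once and reuse, roughly analogous to a Java Pattern object.
pattern = re.compile(r"l.*")

# match() anchors only at the start of the string.
starts_with_l = pattern.match("lee") is not None      # True
no_leading_l = pattern.match("eel") is not None       # False

# search() looks for the pattern anywhere in the string.
found_anywhere = pattern.search("eel") is not None    # True: the trailing "l"

# fullmatch() requires the entire string to match, like Java's matches().
whole_eel = re.fullmatch(r"e+", "eel") is not None    # False: "l" is left over
prefix_eel = re.match(r"e+", "eel") is not None       # True: "ee" at the start

# sub() with count=1 replaces only the first match.
first_only = re.sub(r"e+", "aa", "peewee", count=1)   # "paawee"
```

If whole-string behaviour is wanted, fullmatch is usually the closest equivalent to the Java examples above.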

PHP

PHP also does not have built-in language support for regular expressions. To test for a match, use the preg_match function:

if (preg_match("/l.*/", "lee")) echo "Match";

This would print "Match". The pattern syntax accepted by preg_match is equivalent to Perl's regular expression syntax, including the enclosing delimiters. Substitutions are done via the preg_replace function:

preg_replace( "/e+/" , "aa", "peewee")

This would return "paawaa". As with Python, an optional argument limits the number of replacements: passing 1 as the fourth argument (the limit) replaces only the first match of the pattern. [6]

Ease of Use

Although ease of use depends largely on the user, languages with built-in support for regular expressions are generally easier to use. By this metric it is no surprise that Perl is the easiest of the languages covered here. Ruby, which also has some built-in support, comes next, and the rest would probably be rated about the same.

Advanced Features

Unicode

All of the above languages support Unicode and internationalized strings. There are, however, several caveats:

  • Ruby did not have support until version 1.9.
  • Perl did not have support until version 5.6. [7]
  • PHP supports it, but requires the use of a /u flag. [8]
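
As an illustration of what Unicode support means in practice, here is a small sketch in Python. It assumes Python 3, where str patterns are Unicode-aware by default; the sample text and variable names are made up for this example.

```python
import re

# "naïve café", written with escapes so the source stays ASCII-only.
text = "na\u00efve caf\u00e9"

# With Unicode matching (the Python 3 default for str patterns),
# \w also matches accented letters.
unicode_words = re.findall(r"\w+", text)

# The re.ASCII flag restricts \w to [a-zA-Z0-9_], so the accented
# letters act as separators.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
```

Here unicode_words holds the two whole words, while ascii_words is split into the fragments "na", "ve", and "caf".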


POSIX Syntax Support

POSIX-style regular expressions are older and much more limited than Perl-style syntax, but are still in use today.

  • PHP supports them through the "ereg" family of functions (for example "eregi", the case-insensitive variant), which are called much like the preg functions. These functions were deprecated in PHP 5.3 and removed in PHP 7.0.
  • None of the other languages discussed in this document support POSIX style regular expressions. [9]

Language Specific Syntax

In most cases, regular expression syntax is the same across all languages. There are a few instances where this is not the case.

  • In most languages, an escape from \10 through \N is interpreted as a backreference. If there are fewer than N capturing groups, it is treated as an octal number instead. In Java, all octal escapes must begin with \0 (backslash zero). [10]
  • Java does not support the following [10]:
    • conditions within a regular expression, i.e. (?(condition)X) or (?(condition)X|Y)
    • embedded code, i.e. (?{code})
    • embedded comments i.e. (?#comment)
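
To show what two of these constructs look like in a language that accepts them, the sketch below uses Python's re module, which supports embedded comments and group-based conditionals (it does not support embedded code). The patterns and names are hypothetical examples, not taken from the references.

```python
import re

# (?#...) is an embedded comment; the engine ignores it.
commented = re.compile(r"l.*(?#an l followed by anything)")

# (?(1)>) is a conditional: if group 1 (the "<") matched, require ">".
bracketed = re.compile(r"(<)?\w+(?(1)>)")

comment_ok = commented.fullmatch("lee") is not None     # True
both = bracketed.fullmatch("<lee>") is not None         # True
neither = bracketed.fullmatch("lee") is not None        # True
unbalanced = bracketed.fullmatch("<lee") is not None    # False
```

In Java, the same effect would have to be written without these constructs, for example with an alternation such as <\w+>|\w+.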

References

[1] Perl Regular Expressions
[2] Java Pattern Documentation
[3] Java String Documentation
[4] Ruby Regular Expressions
[5] Python Regular Expressions
[6] Regular Expressions in PHP
[7] Ruby Unicode Regular Expressions
[8] Unicode Regular Expressions
[9] PHP Regular Expressions (Some posix information)
[10] Java Pattern Documentation

Definitions

  • Perl-style Regular Expressions - The most common regular expression syntax. Supported by most programming languages.
  • POSIX Regular Expressions - An older syntax for regular expressions that is much more limited in functionality and availability. Unix utilities such as 'ed', 'awk', and 'grep' use this syntax by default. See http://www.regular-expressions.info/posix.html for more information.
  • Backreference - A feature of regular expression engines that allows the user to reuse one or more of the sub-strings that were matched. For example, to match anything that has a 3 digit number repeated twice, we could use ([0-9][0-9][0-9])\1 . This would match 123123 but would not match 123124.
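
The backreference example above can be tried out directly. This is a minimal sketch in Python, assuming Python 3 and using fullmatch so that the whole input must match; the variable names are illustrative.

```python
import re

# Three digits are captured as group 1; \1 then requires the same
# three digits to appear again.
repeated = re.compile(r"([0-9][0-9][0-9])\1")

same = repeated.fullmatch("123123") is not None       # True
different = repeated.fullmatch("123124") is not None  # False

# The captured text is available from the match object.
captured = repeated.fullmatch("123123").group(1)      # "123"
```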

Other Useful Links