CSC/ECE 517 Fall 2009/wiki1b 5 kf)


Regular Expressions

Introduction

Regular expressions are a key to powerful, flexible and efficient text processing. They work like a mini programming language for describing and parsing text, and they are used through the support built into the tool or language at hand. A regular expression can be as simple as a text editor's search command or as powerful as a full text-processing language. Regular expressions are text patterns that can be used from many programming languages and modern applications, typically with methods that search, replace and extract information from strings.


Overview

A regular expression is a concise way to specify a pattern of characters to be matched in a string. Regular expressions have become a standard feature of many languages and popular tools, such as Perl, Ruby, Python, Java, PHP, VB.NET and MySQL. The .NET types that support regular expressions are based on Perl 5 regular expressions and support both search and search-and-replace functions. Regular expressions can be used for tasks such as validating text input, parsing textual data into better-structured forms, and replacing patterns of text in a document. There is no official standard that defines exactly which text patterns are regular expressions and which are not. In practice, each programming language has its own idea of a regular expression, so each can be seen as having its own regular expression flavor. These flavors do not correspond one-to-one with programming languages, however. Modern flavors in different languages have very similar, though not completely compatible, syntax.
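The three kinds of task mentioned above (validation, parsing and replacement) can be sketched in Python, one of the languages this page covers. The date pattern, the sample strings and the variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical pattern for a date written as YYYY-MM-DD (illustrative assumption).
date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Validation: does the whole input have the expected shape?
is_valid = date_re.fullmatch("2009-09-21") is not None

# Parsing: pull the text apart into structured pieces.
year, month, day = date_re.fullmatch("2009-09-21").groups()

# Replacement: rewrite every occurrence of the pattern in a document.
text = "Due 2009-09-21, graded 2009-10-05."
rewritten = date_re.sub(r"\3/\2/\1", text)

print(is_valid, (year, month, day), rewritten)
```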


Handling of Regular Expressions

Regular expressions are a complex system, so a programming language needs to package them in a way that lets the system understand the regex and how it is to be used. A programming language can take one of three approaches to regular expressions: integrated, procedural or object-oriented. The last two are described together because they are so similar.

  • Integrated - The regular expression operators are built into the language, as in Perl and Ruby. An integrated approach simplifies things for the programmer, since many of the mechanics of preparing the regular expression, setting up the match, applying the regular expression and deriving the results are hidden. This makes the normal case quite easy to work with.
  • Procedural and Object-Oriented - These two techniques are quite similar to each other. Regex functionality is provided by normal functions (procedural) or by constructors and methods (object-oriented), which take ordinary string arguments and interpret them as regular expressions. In these languages, regular expressions are not part of the low-level syntax. This approach is taken by object-oriented languages such as Java and C++.



Importing the Regular expression library

Most programming languages have built-in strings, integers, arrays and so on, but regular expressions are built into only some languages, mainly scripting languages such as Ruby, JavaScript and Perl. Nothing needs to be done to enable regular expression support in these languages. In other languages, such as C# and Java, a library must be imported by writing an import statement in the source code. Some languages have no regular expression support at all; for these, the programmer must compile and link in a regular expression library. Some libraries are available for multiple languages, and some languages offer a choice of several libraries. Built-in support for regular expressions makes pattern matching and substitution convenient and concise.

JavaScript

Regular expression support is built in.


Ruby

Regular Expression support is built in.


Perl

Regular Expression support is built in.


Python

import re

The re module must be imported into the script before Python's regular expression functions can be used.


PHP

The preg functions are built in and available in PHP 4.2.0 and later.


Java

import java.util.regex.*;


Syntax in Different languages

Regular expressions consist of normal characters, character classes, wildcard characters and quantifiers.

A normal character, also known as a literal, matches itself. For example, if a pattern consists of "ab", then only the input sequence "ab" matches it. Characters that cannot be typed directly, such as a newline, are specified with the standard escape sequences beginning with '\'. A character class is a set of characters, written by putting the characters between square brackets. For example, the class [abc] matches a, b or c. The wildcard character is the dot (.), which matches any character. A quantifier determines how many times an expression is matched; +, * and ? are quantifiers.
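The four building blocks described above can be tried out with a short Python sketch. The sample strings and variable names are assumptions made up for illustration.

```python
import re

# Literal: the pattern "ab" matches only the exact sequence "ab".
literal = re.search(r"ab", "cab")

# Character class: [abc] matches a single a, b or c.
in_class = re.findall(r"[abc]", "aXbYc")

# Wildcard: the dot matches any one character (except a newline, by default).
wildcard = re.findall(r"a.c", "abc a-c ac")

# Quantifiers: + is one or more, * is zero or more, ? is zero or one.
plus = re.fullmatch(r"ab+", "abbb") is not None
star = re.fullmatch(r"ab*", "a") is not None
optional = re.fullmatch(r"colou?r", "color") is not None

print(literal.group(), in_class, wildcard, plus, star, optional)
```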

Before the regular expression engine can match a regular expression against a string, the regular expression has to be compiled. This happens while the application is running: the regular expression constructor parses the string holding the regular expression and converts it into a tree structure or a state machine, which the function performing the actual pattern match then traverses. Languages that support literal regular expressions compile the regular expression when execution reaches the regular expression operator.
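As a rough illustration of the compile-then-match sequence, the Python sketch below compiles a pattern once and reuses the compiled object. The sample data and names are assumptions, and nothing here shows or assumes how the re module represents a compiled pattern internally.

```python
import re

# Compile once: the pattern string is parsed into a reusable pattern object.
word_re = re.compile(r"\w+")

# Reuse the compiled object for many matches without re-parsing the pattern.
lines = ["alpha beta", "gamma"]
counts = [len(word_re.findall(line)) for line in lines]

# A malformed pattern is rejected at compile time, before any matching happens.
try:
    re.compile(r"(unclosed")
    compiled_ok = True
except re.error:
    compiled_ok = False

print(counts, compiled_ok)
```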

JavaScript

JavaScript borrowed its regular expression feature from Perl. Regular expressions in JavaScript are difficult to read, in part because comments and whitespace are not allowed: all the parts are packed tightly together, which makes them hard to understand. This becomes a concern when regular expressions are used in security applications.

There are two ways to make a RegExp object. The usual way is to use a regular expression literal, which is enclosed in slashes. Three flags, g, i and m, can be appended directly to the end of a RegExp literal.

Example:

                        var my_regexp = /"(?:\\.|[^\\\"])*"/g;

The other way is to use the RegExp constructor.

Example:

                        var my_regexp = new RegExp("\"(?:\\\\.|[^\\\\\\\"])*\"", 'g');

RegExp objects made from the same regular expression literal share a single instance. To use the same object again, assign it to a variable. If the regular expression is stored in a string variable, use the RegExp() constructor to compile it. In a JavaScript program, a regular expression literal has to be on a single line, and whitespace within it is significant.


Ruby

Ruby supports Perl-compatible regular expressions at the syntax level, which keeps programs short, concise and therefore more readable. Regular expressions in Ruby are objects of type Regexp. They can be created by calling the constructor or by using the literal forms /pattern/ and %r{pattern}.

                         m = Regexp.new('n')   # returns /n/

After the object has been created, it can be matched against a string by using:

                         Regexp#match(string)

or by using the operators =~ (positive match) and !~ (negative match).

                         name = "Rains"
                         name =~ /n/           # returns 3

A pattern that matches a string containing the text Perl or the text Python can be written as:

                         /Perl|Python/

Repetition within patterns can also be specified. Another feature is matching one of a group of characters within a pattern. For example, the character class \s matches a whitespace character, and a dot matches (almost) any character. Ruby's regular expression literals are quite similar to JavaScript's. One visible difference is that the class is named Regexp, as one word, in Ruby and RegExp, with camel caps, in JavaScript.

                         myregexp = /regex pattern/;

A regular expression retrieved from user input, as a string stored in the variable userinput:

                         myregexp=Regexp.new(userinput);


Perl

Regular expressions are a fundamental feature of Perl. The regular expression handler has various features that can be accessed through terse sequences, which makes it powerful and compact. Enhancements introduced in Perl 5.8 and Perl 6 provide better facilities and some more verbose syntaxes. Data processing in Perl programs relies heavily on regular expressions, and Perl provides regular expression operators meshed with the other constructs and operators of the language. Literal regular expressions in Perl are used with the pattern-matching operator and the substitution operator. The pattern-matching operator starts with m, followed by the regex between two forward slashes; forward slashes within the regex must be escaped with a backslash. Other punctuation can be used as the delimiter instead. When opening and closing punctuation (parentheses, braces or brackets) is used as the delimiter, the opening and closing characters must be matched up.

Example:

                         m{regex}

Any other punctuation character used as the delimiter is written twice, once before and once after the regex.

The substitution operator starts with s. If brackets or similar punctuation are used as the delimiter, two pairs are needed:

                         s[regex][replace]

Any other punctuation character used as the delimiter is written three times:

                         s/regex/replace/

Perl differentiates between dollar signs used as anchors and dollar signs used for variable interpolation. The @ sign is also used for variable interpolation in Perl, so it should be escaped in literal regular expressions in Perl code. The variety of options offered by Perl's operators and functions is both its biggest strength and its greatest weakness.

To compile a regular expression and assign it to a variable, the "quote regex" operator can be used. It uses the same syntax as the match operator, except that it starts with qr instead of m.

                           $myregex = qr/regex pattern/

A regular expression retrieved from user input, as a string stored in the variable $userinput:

                           $myregex = qr/$userinput/


Python

Python has support for regular expressions through its re module.

The compile function builds a regular expression object from a pattern string and optional flags. In Python, literal regular expressions have to be passed as strings. Python provides several ways to quote strings, and choosing among them according to the characters in the regular expression can reduce the number of characters that need to be escaped with backslashes. In a raw string, backslashes do not need to be escaped: for example, r"\d+" can be written instead of "\\d+". A raw string delimited by one kind of quote cannot easily hold a regular expression that contains both single and double quote characters; in that case, the raw string can be triple-quoted.

                         reobj=re.compile("regex pattern")

A regular expression retrieved from user input, as a string stored in the variable userinput:

                          reobj = re.compile(userinput)
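When the pattern comes from user input, it may be malformed, or the user may intend it as plain text rather than as a regular expression. The sketch below shows one way to handle both cases; the helper name compile_user_pattern and its literal parameter are hypothetical and not part of the re module.

```python
import re

def compile_user_pattern(userinput, literal=False):
    """Compile a pattern supplied at runtime; return None if it is not valid."""
    if literal:
        # re.escape makes every special character match itself.
        userinput = re.escape(userinput)
    try:
        return re.compile(userinput)
    except re.error:
        # The input was not a well-formed regular expression.
        return None

as_regex = compile_user_pattern("a.c")               # dot acts as a wildcard
as_text = compile_user_pattern("a.c", literal=True)  # dot matches only "."
broken = compile_user_pattern("[abc")                # unterminated character class
```

Here as_regex finds a match in "abc", while as_text matches only the literal text "a.c", and broken is None because the pattern could not be compiled.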

Example:

                          'pre.*post' matches a string which contains a substring 'pre' followed by a substring 'post'
                          'pre.+post' matches only if 'pre' and 'post' are not adjacent
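The difference between the two patterns above can be checked with a short sketch; the two sample strings are assumptions chosen for illustration.

```python
import re

adjacent = "prepost"     # nothing between 'pre' and 'post'
separated = "pre-post"   # one character between them

# .* allows zero characters in between, so both strings match.
star_adjacent = re.search("pre.*post", adjacent) is not None
star_separated = re.search("pre.*post", separated) is not None

# .+ needs at least one character in between, so only the second string matches.
plus_adjacent = re.search("pre.+post", adjacent) is not None
plus_separated = re.search("pre.+post", separated) is not None

print(star_adjacent, star_separated, plus_adjacent, plus_separated)
```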

The pattern string of a regular expression in Python follows a specific syntax:

  • Alphabetic and numeric characters stand for themselves.
  • Some alphanumeric characters take on a special meaning when preceded by a backslash.
  • Punctuation characters match themselves when escaped, but many have a special meaning when unescaped.
  • A literal backslash character is matched by a doubled backslash.
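Each of the rules listed above can be illustrated with one call, as in the following sketch. The sample strings are assumptions, and raw strings are used so that the backslashes reach the re module unchanged.

```python
import re

# Letters and digits stand for themselves.
plain = re.findall(r"a1", "xa1y")

# A backslash gives some letters a special meaning: \d matches any digit.
digits = re.findall(r"\d", "a1b22")

# An unescaped dot is special (any character); an escaped dot matches only ".".
any_char = re.findall(r".", "a.b")
only_dot = re.findall(r"\.", "a.b")

# A literal backslash is matched by a doubled backslash in the pattern.
backslashes = re.findall(r"\\", "C:\\temp\\x")

print(plain, digits, any_char, only_dot, len(backslashes))
```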


PHP

PHP has three regex engines: "preg", "ereg" and "mb_ereg". Two of them implement POSIX ERE, while the third is based on PCRE. The differences are mainly in syntax, but there are some functional differences too.

  • POSIX-Extended - These regular expressions are taken from the regex pattern-matching machinery used in Unix command-line shells.

For example, the special character ^ matches only the beginning of a string and the special character $ matches only the end of a string. The POSIX-Extended syntax is supported by the POSIX C regular expression APIs, and variations of it are used by the utilities egrep and awk.

  • Perl-Compatible Regular Expressions (PCRE) - These regular expressions follow the same syntax as Perl regular expressions and are used with the preg functions in PHP. The syntax can be somewhat confusing at first, but it allows search criteria to be specified precisely. PCRE has a completely distinct set of functions and a slightly different set of rules for patterns. A pattern is always bookended by a delimiter character, which must be the same at the beginning and at the end, marking where the pattern starts and stops.

Unlike JavaScript and Perl, PHP does not have a native regular expression type. Regular expressions in PHP must be quoted as strings, and within the string the regular expression is written as a Perl-style literal regular expression.

Example:

                      /regex/ in Perl becomes the string '/regex/' in PHP

PHP supports both single-quoted and double-quoted strings. Any pair of punctuation characters can be used as delimiters. A backslash often has to be written as two backslashes in the pattern string, because PHP treats the first backslash as an escape character for the second. Regular expressions are compiled at runtime. PHP keeps a large cache of 4096 entries, so a given pattern string is compiled only the first time it is used. PHP does not provide a way to store a compiled regular expression in a variable, so the pattern always has to be passed as a string to one of the preg functions. The resulting regular expressions are easier to read and to maintain.

Java

Regular expression processing is supported by the java.util.regex package.

The native java.util.regex package provides powerful functionality with an uncluttered API. It has good Unicode support, clear documentation and fast execution. It is also flexible, as it can match against CharSequence objects. The mechanics of using regular expressions through this package are simple, since the functionality is provided by only two classes, an interface and an unchecked exception:

                           java.util.regex.Pattern
                           java.util.regex.Matcher
                           java.util.regex.MatchResult
                           java.util.regex.PatternSyntaxException

Two classes work together to support regular expression processing: Pattern and Matcher. Pattern defines the regular expression, and Matcher matches that pattern against a character sequence. A pattern is created by calling the compile() factory method.

                          static Pattern compile(String pattern)

Once the pattern object is created, it is used to create a matcher by calling the matcher() factory method.

                          Matcher matcher(CharSequence str)

str is the character sequence that the pattern will be matched against. If the pattern contains a syntax error, the Pattern.compile() factory throws a PatternSyntaxException.

Working with regular expressions in Java requires creating objects and sending messages to them. As a consequence, there are a large number of method calls, which can make a large program difficult to understand.

Literal regular expressions can be passed to the Pattern.compile() class factory and to various functions of the String class. A parameter that takes a regular expression is always declared as a string. String literals in Java are written with double quotes.


An Example

Consider the regular expression <[$"'\n\d/\\]>. It consists of a single character class that matches a dollar sign, a double quote, a single quote, a line feed, any digit from 0 to 9, a forward slash or a backslash. To hardcode it into source code as a string constant or a regular expression operator:

JavaScript : /[$"'\n\d\/\\]/


Ruby : /[$"'\n\d\/\\]/


Perl :

Pattern Matching Operator : /[\$"'\n\d\/\\]/ , m|[\$"'\n\d/\\]|


Substitution Operator : s|[\$"'\n\d/\\]||


Python : "[$\"'\n\\d/\\\\]" (normal string)


PHP : '%[$"\'\n\d/\\\\]%'


Java : "[$\"'\n\\d/\\\\]"
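As a quick check of the Python form listed above, the sketch below compares the normal string with a triple-quoted raw string spelling of the same character class. The raw form and the sample characters are assumptions added for illustration. The two strings are not identical character for character (the normal string contains a real line feed where the raw string contains a backslash followed by n), but the re module treats both as the same character class.

```python
import re

# The normal-string form from the list above, and a raw triple-quoted equivalent.
normal = "[$\"'\n\\d/\\\\]"
raw = r"""[$"'\n\d/\\]"""

# One sample for each character the class is meant to match.
samples = ["$", '"', "'", "\n", "7", "/", "\\"]
normal_hits = [re.fullmatch(normal, s) is not None for s in samples]
raw_hits = [re.fullmatch(raw, s) is not None for s in samples]

# Characters outside the class, including the plain letters n and d.
rejects = [re.fullmatch(raw, s) is not None for s in ["a", " ", "n", "d"]]

print(normal_hits, raw_hits, rejects)
```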


References

  • Mastering Regular Expressions, 3rd Edition by Jeffrey E. F. Friedl (ISBN-13: 978-0-596-52812-6)
  • Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (source of the example above) (ISBN-13: 978-0-596-52068-7)
  • JavaScript: The Good Parts, 1st Edition by Douglas Crockford