CSC/ECE 517 Fall 2009/wiki1b 5 kf)

Introduction

Regular expressions are a key to powerful, flexible and efficient text processing. They act as a mini programming language for describing and parsing text, and they are available through the support provided by the tools being used. A regular expression can be as simple as a text editor's search command or as powerful as a full text-processing language. Regular expressions are a specific kind of text pattern that can be used with many programming languages and modern applications. Typical uses include finding the text that matches a pattern within a larger body of text, replacing the text that matches a pattern with other text, and splitting a block of text into a list of subtexts.

Overview

A regular expression is a concise way to specify a pattern of characters that is to be matched in a string. Regular expressions have become a standard feature in a variety of languages and popular tools such as Perl, Ruby, Python, Java, PHP, VB.Net and MySQL. The .NET types that support regular expressions are based upon Perl 5 regular expressions and support both search and search/replace functions. Regular expressions can be used for tasks such as validating text input, parsing textual data into better structured forms, and replacing patterns of text in a document.
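The tasks listed above can be sketched in a few lines. The following is a minimal illustration in Python using its standard re module; the sample text and the date pattern are assumptions made up for this example and are not taken from any particular application.

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "Order 66 shipped on 2009-09-18; order 67 ships on 2009-10-02."
date_pattern = r"\d{4}-\d{2}-\d{2}"  # assumed date format: YYYY-MM-DD

# Validation: does the whole input consist of one date?
is_date = re.fullmatch(date_pattern, "2009-09-18") is not None

# Searching: find every date inside a larger body of text.
dates = re.findall(date_pattern, text)

# Replacement: swap every date for a placeholder.
redacted = re.sub(date_pattern, "<date>", text)

# Splitting: break a line into fields on commas or semicolons,
# ignoring any whitespace around the separators.
fields = re.split(r"\s*[,;]\s*", "perl, ruby;python ,java")

print(is_date)
print(dates)
print(redacted)
print(fields)
```

The same pattern string is reused for validation, searching and replacement, which is one reason regular expressions keep this kind of code short.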


Syntax

Regular expressions consist of normal characters, character classes, wildcard characters and quantifiers.
Basic syntax reference

A normal character, also known as a literal, matches itself. For example, if a pattern consists of "ab", then only the input sequence "ab" can match it. Characters that have a special meaning, or that cannot be typed directly, are specified by using the standard escape sequences beginning with a '\'. A character class is a group of characters, written by putting the characters in the class between square brackets. For example, the class [abc] matches a, b or c. The wildcard character is the dot (.), which can match any character. A quantifier determines the number of times an expression is matched: + means one or more times, * means zero or more times, and ? means zero or one time.
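Each of these building blocks can be tried out in isolation. The sketch below uses Python's re module; the input strings are made up for illustration, and other regular expression flavors may differ in details such as whether the dot matches a newline.

```python
import re

# Literal: the pattern "ab" matches only the sequence "ab".
literal_hit = re.fullmatch(r"ab", "ab") is not None
literal_miss = re.fullmatch(r"ab", "ba") is not None

# Character class: [abc] matches exactly one of a, b or c.
in_class = re.findall(r"[abc]", "cabbage")

# Wildcard: the dot matches any single character
# (in Python, any character except a newline by default).
wild = re.findall(r"b.g", "bag big b\ng bug")

# Quantifiers: + is one or more, * is zero or more, ? is zero or one.
plus = re.findall(r"ab+", "a ab abb")
star = re.findall(r"ab*", "a ab abb")
optional = re.findall(r"ab?", "a ab abb")

# Escape sequence: a backslash turns the wildcard dot into a literal dot.
literal_dots = re.findall(r"\.", "a.b.c")

print(literal_hit, literal_miss)
print(in_class, wild)
print(plus, star, optional)
print(literal_dots)
```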

Importing the Regular expression library

Most programming languages have strings, integers, arrays and so on as built-in types. Regular expressions, however, are built into the language itself mainly in scripting languages such as Ruby, JavaScript and Perl, and nothing needs to be done to enable regular expression support in these languages. In some languages, such as C# and Java, the regular expression library has to be imported by writing an import statement in the source code. There are also languages that do not have any support for regular expressions; for these languages, the programmer has to compile and link in a regular expression library. Some libraries are available for multiple languages, and some languages offer a choice of different libraries. Built-in support for regular expressions makes pattern matching and substitution convenient as well as concise.

JavaScript

Regular expression support is built in.

Ruby

Regular Expression support is built in.

Perl

Regular Expression support is built in.

Python

import re

The re module has to be imported into the script. Only then can Python's regular expression functions be used.

PHP

The preg functions are built in and available in PHP 4.2.0 and later.

C#

using System.Text.RegularExpressions;

VB.Net

Imports System.Text.RegularExpressions

Java

import java.util.regex.*;

Creating Regular Expression Objects

Before the regular expression engine can match a regular expression to a string, the regular expression has to be compiled. This happens while the application is running: the regular expression constructor parses the string holding the regular expression and converts it into a tree structure or a state machine. The function that performs the actual pattern matching then traverses this tree or state machine. Programming languages that support literal regular expressions compile the regular expression when execution reaches the regular expression operator.
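The compile-then-match sequence described above can be made explicit. The following is a minimal sketch in Python, where re.compile() plays the role of the regular expression constructor; the pattern and the sample lines are assumptions chosen for illustration.

```python
import re

# Compile once: the pattern string is parsed at run time and turned
# into a reusable pattern object.
word = re.compile(r"[A-Za-z]+")

# Hypothetical input lines, used only for this example.
lines = ["regular expressions", "are compiled", "at run time"]

# Reuse the same compiled object for every line instead of
# handing the pattern string to the engine each time.
counts = [len(word.findall(line)) for line in lines]

print(counts)
```

Keeping the compiled object in a variable makes the compile step visible in the code and avoids repeating the pattern string at every call site.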

JavaScript

To use the same regular expression object again, it can be assigned to a variable. If the regular expression is stored in a string variable, the RegExp() constructor can be used to compile it.

Ruby

In Ruby, regular expressions are objects of type Regexp. The objects can be created by calling the constructor or by using the literal forms /pattern/ and %r{pattern}:

                         m=Regexp.new('n') -> /n/

After the object has been created, it can be matched against a string by using:

                         Regexp#match(string) 

or by using the operators =~ (positive match) and !~ (negative match):

                         name="Rains"
                         name=~/n/ -> 3

A pattern that matches a string containing the text Perl or the text Python can be written as:

                         /Perl|Python/

Repetition within patterns can also be specified. Another feature is matching one of a group of characters within a pattern. For example, a character class such as \s matches a whitespace character, and a dot matches (almost) any character. Ruby is quite similar to JavaScript in this respect. The main difference is that the name of the class is Regexp, as one word, in Ruby and RegExp, with camel caps, in JavaScript.

                         myregexp = /regex pattern/;

A regular expression retrieved from user input, as a string stored in the variable userinput:

                         myregexp=Regexp.new(userinput);

Perl

Data processing in Perl programs relies heavily on regular expressions. Perl provides regular expression operators that are meshed with the constructs and operators that make up the Perl language. Literal regular expressions are used with the pattern-matching operator and the substitution operator. The pattern-matching operator starts with m, followed by the required regex between two forward slashes. Forward slashes inside the regex must then be escaped with a backslash. Other punctuation can be used as the delimiter instead. When any type of opening and closing punctuation (parentheses, braces or brackets) is used as the delimiter, the opening and closing characters must be matched up.

Example:

                         m{regex}

Any other punctuation character used as the delimiter has to be written twice, once before and once after the regex.

The substitution operator starts with s. If brackets or similar punctuation are used as the delimiter, two pairs are needed:

                         s[regex][replace]

Any other punctuation character has to be used three times:

                         s/regex/replace/

Perl clearly differentiates between dollars used as anchors and dollars used for variable interpolation. The @ sign is also used for variable interpolation in Perl, so it should be escaped in literal regular expressions in Perl code. The variety of options offered by Perl's operators and functions is its biggest strength as well as its greatest weakness.

To compile a regular expression, the "quote regex" operator can be used and its result assigned to a variable. It uses the same syntax as the match operator, except that it starts with qr instead of m.

                           $myregex = qr/regex pattern/

A regular expression retrieved from user input, as a string stored in the variable $userinput:

                           $myregex = qr/$userinput/


Python

Python supports regular expressions through its re module. In Python, literal regular expressions have to be passed as strings. Python provides various ways to quote strings, and choosing the right one for the characters in the regular expression may reduce the number of characters that need to be escaped with backslashes. In raw strings, backslashes do not need to be doubled, for example r"\d+" instead of "\\d+". A raw string delimited by a single kind of quote cannot be used directly when the regular expression contains both single and double quote characters. In such a case, the raw string can be triple-quoted.

                          reobj=re.compile("regex pattern")

A regular expression retrieved from user input, as a string stored in the variable userinput:

                          reobj = re.compile(userinput)
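Because user input is not guaranteed to be a valid regular expression, the compile step is a natural place to handle errors. The sketch below shows one way to do this; compile_user_pattern is a hypothetical helper name introduced for this example, and the sample inputs are assumptions.

```python
import re

# A raw string and a conventionally escaped string describe the same pattern.
same_pattern = (r"\d+" == "\\d+")

def compile_user_pattern(userinput):
    """Hypothetical helper: return a compiled pattern object,
    or None if userinput is not a valid regular expression."""
    try:
        return re.compile(userinput)
    except re.error:
        # re.compile() raises re.error when the pattern has a syntax error.
        return None

# A valid pattern compiles and can be reused.
reobj = compile_user_pattern(r"\d+")
numbers = reobj.findall("wiki1b 5 of 517")

# An invalid pattern (unbalanced parenthesis) is rejected.
broken = compile_user_pattern("(unclosed")

print(same_pattern, numbers, broken)
```

Returning None is only one possible design; depending on the application, reporting the error message back to the user may be more helpful.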

PHP

PHP has three regex engines: "preg", "ereg" and "mb_ereg". Two of them implement POSIX ERE, while the third, preg, is based on PCRE. Unlike JavaScript and Perl, PHP does not have a native regular expression type, so regular expressions have to be quoted as strings. For the preg functions, the regular expression inside the string is written as a Perl-style literal regular expression. For example, where Perl uses /regex/, the PHP string becomes '/regex/'. PHP supports both single-quoted and double-quoted strings.

Regular expressions are compiled at runtime. PHP keeps a large cache of 4096 compiled regular expressions, so a pattern string is generally compiled only the first time it occurs. However, PHP does not provide a way to store a compiled regular expression in a variable, so the regular expression always has to be passed as a string to one of the preg functions.

Java

Regular expression processing is supported by the java.util.regex package. Two classes work together to support regular expression processing: Pattern and Matcher. Pattern is used to define the regular expression, and Matcher is used to match that pattern against another character sequence. A pattern is created by calling the compile() factory method:

                          static Pattern compile(String pattern)

Once the Pattern object has been created, it is used to create a Matcher by calling the matcher() factory method:

                          Matcher matcher(CharSequence str)

Here, str is the character sequence that the pattern will be matched against. If the regular expression contains a syntax error, the Pattern.compile() factory method throws a PatternSyntaxException.



External Links

http://en.wikipedia.org/wiki/Regular_expression

http://www.regular-expressions.info/