CSC/ECE 517 Fall 2009/wiki1b 5 kf)

From Expertiza_Wiki
Revision as of 18:31, 21 September 2009 by Sandy (talk | contribs) (→‎'''''C#''''')

Introduction

Regular expressions are a key to powerful, flexible and efficient text processing. They act like a mini programming language for describing and parsing text, and they rely on the support provided by the tool or language being used. A regular expression can be as simple as a text editor's search command or as powerful as a full text-processing language. Regular expressions are text patterns that can be used in many programming languages and modern applications, typically through methods that search, replace and extract information from strings.

Overview

A regular expression is a concise way to specify a pattern of characters to be matched in a string. Regular expressions have become a standard feature in a variety of languages and popular tools such as Perl, Ruby, Python, Java, PHP, VB.Net and MySQL. The .NET types that support regular expressions are based upon Perl 5 regular expressions and support both search and search/replace functions. Regular expressions can be used for tasks such as validating text input, parsing textual data into better-structured forms, and replacing patterns of text in a document. There is no official standard that defines exactly which text patterns are regular expressions and which are not. In practice, each programming language has its own idea of a regular expression, so each can be regarded as having its own regular expression flavor. These flavors do not correspond one to one with programming languages, however. Modern regular expression flavors in different languages have syntaxes that are very similar and largely, though not completely, compatible with each other.

Importing the Regular expression library

Most programming languages provide strings, integers, arrays and so on as built-in types. Regular expressions, however, are built into only some languages, mostly scripting languages such as Ruby, JavaScript and Perl, where nothing needs to be done to enable them. Other languages, such as C# and Java, require a library to be imported with an import statement in the source code. Some languages have no regular expression support at all; for these, the programmer must compile and link in a regular expression library. Some libraries are available for multiple languages, and some languages offer a choice of different libraries. Built-in support for regular expressions makes pattern matching and substitution both convenient and concise.

JavaScript

Regular expression support is built in.

Ruby

Regular Expression support is built in.

Perl

Regular Expression support is built in.

Python

import re

The re module must be imported into the script before Python's regular expression functions can be used.

PHP

The preg functions are built in and available in PHP 4.2.0 and later.


VB.Net

Imports System.Text.RegularExpressions

Java

import java.util.regex.*;


Syntax in Different languages

Regular expressions consist of normal characters, character classes, wildcard characters and quantifiers.
Basic syntax reference

A normal character, also known as a literal, matches itself. For example, if a pattern consists of "ab", then only the input sequence "ab" matches it. Characters such as newline and tab are specified using the standard escape sequences, which begin with a backslash (\). A character class is a group of characters, written by placing the characters of the class between square brackets. For example, the class [abc] matches a, b or c. The wildcard character is the dot (.), which matches any character. A quantifier determines the number of times an expression is matched; +, * and ? are quantifiers.
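The four building blocks described above can be tried out with a minimal Python sketch. The sample strings and variable names are illustrative assumptions, not part of the original text, and other flavors may differ in detail (for instance, in whether the dot matches a newline).

```python
import re

# Literal: the pattern "ab" matches only the character sequence "ab".
literal_hit = re.search(r"ab", "cab") is not None
literal_miss = re.search(r"ab", "acb") is not None

# Character class: [abc] matches exactly one of a, b or c.
class_hits = re.findall(r"[abc]", "cabbage")

# Wildcard: by default the dot matches any single character except a newline.
wildcard_hits = re.findall(r"a.c", "abc a-c ac")

# Quantifiers: + is one or more, * is zero or more, ? is zero or one.
plus_hit = re.fullmatch(r"ab+", "abbb") is not None
star_hit = re.fullmatch(r"ab*", "a") is not None
optional_hits = re.findall(r"colou?r", "color colour")
```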

Before the regular expression engine can match a regular expression against a string, the regular expression has to be compiled. This happens while the application is running: the regular expression constructor parses the string that holds the regular expression and converts it into a tree structure or a state machine. The function that performs the actual pattern matching then traverses this tree or state machine. Programming languages that support literal regular expressions compile the regular expression when execution reaches the regular expression operator.
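As a rough illustration of the compile-then-match sequence, the following Python sketch compiles a pattern once and reuses the compiled object for several inputs. The pattern and the sample lines are assumptions chosen for the example.

```python
import re

# Compile once: the pattern string is parsed into an internal matcher object.
word_pattern = re.compile(r"\w+")

# Reuse the compiled object for every input instead of re-parsing the string.
lines = ["regular expressions", "are compiled", "before matching"]
word_counts = [len(word_pattern.findall(line)) for line in lines]
```

Keeping the compiled object in a variable makes the one-time compilation step explicit, which matters most when the same pattern is applied many times.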

JavaScript

JavaScript's regular expression feature was borrowed from Perl. Regular expressions in JavaScript are difficult to read, in part because comments and whitespace are not allowed. All the parts are packed tightly together, which makes them hard to understand. This becomes a concern when they are used in security applications.
Tutorial

There are two ways to make a RegExp object. The usual way is to use a regular expression literal, which is enclosed in slashes. Three flags, indicated by g, i and m, can be appended directly to the end of a RegExp literal.

Example:

                        var my_regexp = /"(?:\\.|[^\\\"])*"/g;

The other way is to use the RegExp constructor.

Example:

                        var my_regexp = new RegExp("\"(?:\\\\.|[^\\\\\\\"])*\"", 'g');

In older JavaScript engines (ECMAScript 3), RegExp objects made by the same regular expression literal share a single instance. To use the same object again, assign it to a variable. If the regular expression is held in a string variable, the RegExp() constructor can be used to compile it. In a JavaScript program, a regular expression literal has to be on a single line, and whitespace within it is significant.
JavaScript basic syntax



Ruby

Ruby supports Perl-compatible regular expressions at the syntax level, which keeps programs short, concise and therefore more readable. In Ruby, regular expressions are objects of type Regexp. They can be created by calling the constructor or by using the literal forms /pattern/ and %r{pattern}.

                         m = Regexp.new('n')   # returns /n/

After the object has been created, it can be matched against a string by using:

                         Regexp#match(string) 

or by using the operators =~ (positive match) and !~ (negative match).

                         name = "Rains"
                         name =~ /n/   # returns 3

A pattern that matches a string containing the text Perl or the text Python can be written as:

                         /Perl|Python/

Repetition within patterns can also be specified. Another feature is matching one of a group of characters within a pattern. For example, a character class such as \s matches a whitespace character, and a dot matches (almost) any character. Ruby's regular expression literals are quite similar to JavaScript's. One visible difference is that the class name is Regexp, written as one word, in Ruby, whereas it is RegExp, in camel case, in JavaScript.

                         myregexp = /regex pattern/;

A regular expression retrieved from user input, as a string stored in the variable userinput:

                         myregexp=Regexp.new(userinput);

Perl

Data processing in Perl programs relies heavily on regular expressions. Perl provides regular expression operators that are meshed with the constructs and operators of the language itself. Literal regular expressions are used with the pattern-matching operator and the substitution operator. The pattern-matching operator starts with m and, in its usual form, places the regex between two forward slashes; forward slashes inside the regex must then be escaped with a backslash. When any kind of opening and closing punctuation (parentheses, braces or brackets) is used as the delimiter, the opening and closing characters must be matched up.

Example:

                         m{regex}

Any other punctuation character used as the delimiter has to be written twice.

The substitution operator starts with s. If brackets or similar punctuation are used as the delimiter, two pairs are needed:

                         s[regex][replace]

Any other punctuation character has to be written three times:

                         s/regex/replace/

Perl clearly differentiates between dollar signs used as anchors and dollar signs used for variable interpolation. The @ sign is also used for variable interpolation in Perl, so it should be escaped in literal regular expressions in Perl code. The variety of options offered by Perl's operators and functions is its biggest strength as well as its greatest weakness.

To compile a regular expression, the "quote regex" operator can be used and its result assigned to a variable. It uses the same syntax as the match operator, except that it starts with qr instead of m.

                           $myregex = qr/regex pattern/

A regular expression retrieved from user input and stored as a string in the variable $userinput:

                           $myregex = qr/$userinput/


Python

Python has support for regular expressions through its re module.

An RE object is built from a pattern string and optional flags by the compile function. In Python, literal regular expressions have to be passed as strings. Python provides several ways to quote strings, and choosing among them according to the characters in the pattern can reduce the number of characters that must be escaped with backslashes. In raw strings, backslashes do not need to be escaped; for example, r"\d+" can be written instead of "\\d+". A raw string delimited by single or double quotes is awkward when the regular expression contains both single and double quote characters; in that case, the raw string can be triple-quoted.

                         reobj=re.compile("regex pattern")
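The quoting options can be compared in a short Python sketch. The quote-matching pattern and the sample sentence are assumptions made up for this illustration.

```python
import re

# The same regex written as a normal string and as a raw string.
normal = "\\d+"
raw = r"\d+"
same_pattern = normal == raw

# A triple-quoted raw string can hold both quote characters unescaped.
quotes = re.compile(r"""["']\w+["']""")
quoted_words = quotes.findall("""say "hello" or 'bye'""")
```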

A regular expression retrieved from user input, as a string stored in the variable userinput:

                          reobj = re.compile(userinput)
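User input is not guaranteed to be a valid regular expression, so compilation can fail. The following is a minimal sketch of one way to handle that; the helper name compile_user_pattern is hypothetical, while re.error and re.escape are part of Python's re module.

```python
import re

def compile_user_pattern(userinput):
    """Return a compiled pattern, or None if userinput is not a valid regex."""
    try:
        return re.compile(userinput)
    except re.error:
        return None

valid = compile_user_pattern(r"\d{3}")
invalid = compile_user_pattern("[unclosed")

# To treat user input as literal text instead, escape it before compiling.
literal = re.compile(re.escape("1+1"))
```

Whether to reject invalid input or to escape it and match it literally depends on what the application expects from its users.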

Example:

                          'pre.*post' matches a string which contains a substring 'pre' followed by a substring 'post'
                          'pre.+post' matches only if 'pre' and 'post' are not adjacent
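The difference between the two patterns can be checked with a small Python sketch; the two test strings are assumptions chosen to show the adjacent and non-adjacent cases.

```python
import re

any_gap = re.compile(r"pre.*post")   # zero or more characters in between
some_gap = re.compile(r"pre.+post")  # at least one character in between

adjacent = "prepost"
separated = "pre-post"

results = {
    "any_gap_adjacent": any_gap.search(adjacent) is not None,
    "any_gap_separated": any_gap.search(separated) is not None,
    "some_gap_adjacent": some_gap.search(adjacent) is not None,
    "some_gap_separated": some_gap.search(separated) is not None,
}
```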

The pattern string of a regular expression in Python follows a specific syntax:

a) Alphabetic and numeric characters stand for themselves.

b) Some alphanumeric characters get a special meaning when preceded by a backslash.

c) Punctuation works the other way around: it matches itself when escaped, and many punctuation characters have a special meaning when unescaped.

d) The backslash character itself is matched by a repeated backslash (\\).
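Each of the four rules can be illustrated with one line of Python. This is only a sketch; the sample strings are assumptions, and \d and the dot were picked as representative examples of rules b) and c).

```python
import re

# a) Letters and digits stand for themselves.
rule_a = re.search(r"x1", "ax1b") is not None

# b) A backslash gives some alphanumerics a special meaning: \d is any digit.
rule_b = re.findall(r"\d", "a1b22")

# c) An unescaped dot is special (any character); an escaped dot is literal.
rule_c_unescaped = re.findall(r"a.c", "abc a.c")
rule_c_escaped = re.findall(r"a\.c", "abc a.c")

# d) A literal backslash is matched by a doubled backslash in the pattern.
rule_d = re.findall(r"\\", "C:\\temp\\file")
```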


PHP

There are two different types of regular expressions that are supported by PHP:

a) POSIX extended

b) Perl-Compatible Regular Expressions (PCRE)

PHP has three regex engines: "preg", "ereg" and "mb_ereg". Two of them implement POSIX ERE, while the third, preg, is based on PCRE. Unlike JavaScript and Perl, PHP does not have a native regular expression type, so regular expressions in PHP have to be quoted as strings. Within the string, the regular expression is written as a Perl-style literal regular expression, delimiters included.

Example:

                      /regex/ in Perl becomes the string '/regex/' in PHP

PHP supports both single-quoted and double-quoted strings.

Regular expressions are compiled at runtime. PHP keeps a large cache of 4096 compiled regular expressions, so a given pattern string is compiled only the first time it is used. However, PHP does not provide a way to store a compiled regular expression in a variable, so the pattern always has to be passed as a string to one of the preg functions.


Java

Regular expression processing is supported by the java.util.regex package. Two classes work together to support it: Pattern and Matcher. A Pattern defines the regular expression, and a Matcher matches that pattern against a character sequence. A pattern is created by calling the compile() factory method.

                          static Pattern compile(String pattern)

Once the Pattern object has been created, it is used to create a Matcher by calling the matcher() factory method.

                          Matcher matcher(CharSequence str)

str is the character sequence that the pattern will be matched against. If the pattern contains a syntax error, the Pattern.compile() factory method throws a PatternSyntaxException.

Working with regular expressions in Java requires creating objects and sending messages to them. As a consequence, there is a large number of method calls, which can make a large program difficult to understand.


Example

The regular expression <[$"\d/\\]> has been given as the solution to a problem. It consists of a single character class that matches a dollar sign, a double quote, any digit between 0 and 9, a forward slash or a backslash. It has to be hardcoded into source code as a string constant or a regular expression operator.
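As a quick sanity check of what this character class matches, here is a hedged Python sketch. The sample string is an assumption, and the raw-string spelling is an equivalent alternative to the normal-string form rather than something taken from the original page.

```python
import re

# The character class written as a normal string and as a raw string.
normal = "[$\"\\d/\\\\]"
raw = r'[$"\d/\\]'
same = normal == raw

# Find every dollar sign, double quote, digit, slash and backslash.
pattern = re.compile(raw)
sample = 'cost: $5 "a/b" c\\d'
matched = pattern.findall(sample)
```

Note that the letter d in the sample is not matched: inside the class, \d stands for a digit, not for the letter.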

JavaScript

                         /[$"\d\/\\]/

Ruby

                         /[$"\d\/\\]/

Perl

                        Pattern matching operator:
                         /[\$"\d\/\\]/
                         m|[\$"\d/\\]|
                        Substitution Operator:
                         s|[\$"\d/\\]||

Python

                        Normal string:
                        "[$\"\\d/\\\\]"

PHP

                        '%[$"\d/\\\\]%'
                        

Java

                         "[$\"\\d/\\\\]"

References

External Links

http://en.wikipedia.org/wiki/Regular_expression

http://www.regular-expressions.info/