Python Tutorial

A regular expression, also known as a regex, is a sequence of characters that defines a search pattern. Regular expressions are used in search algorithms, in the search-and-replace dialogs of text editors, in lexical analysis, and for input validation. The technique originated in theoretical computer science and formal language theory.

Different syntaxes are used for writing regular expressions. One is the POSIX standard and another, widely used, is the Perl syntax.

Manipulating textual data plays an important role in data science projects that require large-scale text processing. Many programming languages, including Python, provide regex capabilities, either built in or through libraries. Python's standard library provides the 're' module for this purpose.

The most common applications of regular expressions are:

  • Searching for a pattern in a string
  • Finding all occurrences of a pattern in a string
  • Breaking a string into substrings
  • Replacing part of a string

Raw strings

Functions in the re module take the pattern as a string argument, and by convention the pattern is written as a raw string. A raw string is a normal string literal prefixed with 'r' or 'R'.

>>> normal="computer"
>>> print (normal)
computer
>>> raw=r"computer"
>>> print (raw)
computer

Both strings look the same. The difference becomes evident when the string literal contains escape sequences ('\n', '\t' etc.).

>>> normal="Hello\nWorld"
>>> print (normal)
Hello
World
>>> raw=r"Hello\nWorld"
>>> print (raw)
Hello\nWorld

In a normal string literal, Python interprets the escape sequence when it reads the literal, so '\n' becomes a newline character and print() displays the text on two lines. With the raw string prefix 'r', the escape sequence is not translated: the string keeps the backslash and the letter 'n' as two separate characters, and print() shows them exactly as written. This matters for regular expressions because many special sequences, such as '\d' or '\b', begin with a backslash.
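The difference between the two kinds of literal can be checked directly. The following is a minimal sketch; the variable names are illustrative, and the second half assumes the word-boundary sequence '\b' described later in this tutorial.

```python
import re

normal = "Hello\nWorld"   # \n is converted to a single newline character
raw = r"Hello\nWorld"     # the backslash and the 'n' stay as two characters

normal_length = len(normal)   # 5 + 1 + 5 characters
raw_length = len(raw)         # 5 + 2 + 5 characters

# In a normal literal "\b" is a backspace character. In a raw literal it
# reaches the re module as backslash + b, the word-boundary sequence.
text = "Simple is better than complex."
with_raw = re.findall(r"\bis\b", text)
with_normal = re.findall("\bis\b", text)

print(normal_length, raw_length)   # 11 12
print(with_raw, with_normal)       # ['is'] []
```

The pattern without the 'r' prefix finds nothing because it looks for literal backspace characters, which the text does not contain.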

Regular expressions use two types of characters in the pattern string: metacharacters, which have a special meaning (similar to * in a wildcard), and literals, which are ordinary characters such as letters and digits that match themselves.

The following characters are the metacharacters.

. ^ $ * + ? { } [ ] \ | ( )

The square brackets [ and ] specify a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a '-'.

[abc]   Matches any of the characters a, b, or c.
[a-c]   Uses a range to express the same set of characters.
[a-z]   Matches only lowercase letters.
[0-9]   Matches only digits.
'^'     Complements the character set when it is the first character inside []. [^5] will match any character except '5'.
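The character sets listed above can be tried out with re.findall(), which is described later in this tutorial. This is a small sketch on a made-up sample string.

```python
import re

text = "cab 2025 Back"   # sample text chosen only for illustration

any_abc = re.findall(r"[abc]", text)    # characters listed individually
range_ac = re.findall(r"[a-c]", text)   # the same set written as a range
digits = re.findall(r"[0-9]", text)     # digits only
not_five = re.findall(r"[^5]", "2025")  # every character except '5'

print(any_abc)    # ['c', 'a', 'b', 'a', 'c']
print(range_ac)   # ['c', 'a', 'b', 'a', 'c']
print(digits)     # ['2', '0', '2', '5']
print(not_five)   # ['2', '0', '2']
```

The uppercase 'B' is not matched by [abc] or [a-c], because character sets are case-sensitive by default.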

'\' is the escaping metacharacter. It is followed by various characters to signal special sequences. If you need to match a [ or \, you can precede it with a backslash to remove its special meaning: \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters.

\d      Matches any decimal digit; this is equivalent to the class [0-9].
\D      Matches any non-digit character; this is equivalent to the class [^0-9].
\s      Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S      Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w      Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
\W      Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].
.       Matches any single character except newline '\n'.
?       Matches 0 or 1 occurrence of the pattern to its left.
+       Matches 1 or more occurrences of the pattern to its left.
*       Matches 0 or more occurrences of the pattern to its left.
\b      Matches the boundary between a word and a non-word character; \B is the opposite of \b.
[..]    Matches any single character in the square brackets; [^..] matches any single character not in the square brackets.
\       Escapes characters that have a special meaning, for example \. to match a period or \+ to match a plus sign.
{n,m}   Matches at least n and at most m occurrences of the preceding pattern.
a|b     Matches either a or b.

The class equivalences above hold for ASCII text. With Python 3 string patterns, \d, \s and \w also match the corresponding Unicode characters unless the re.ASCII flag is used.
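A few of the sequences and quantifiers from the list above, combined into small patterns. This is only a sketch; the sample strings are invented for illustration, and re.findall() is described later in this tutorial.

```python
import re

# \d{3}-\d{4}: three digits, a hyphen, then four digits
phones = re.findall(r"\d{3}-\d{4}", "Call 555-0199 or 555-0123.")

# u? makes the 'u' optional, so both spellings match
spellings = re.findall(r"colou?r", "colour or color")

# {2,3} limits the run of digits; \b stops longer numbers matching in part
mid_numbers = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")

# | chooses between alternatives
pets = re.findall(r"cat|dog", "cat, dog, bird")

# \. matches a literal period rather than "any character"
decimals = re.findall(r"\d+\.\d+", "pi is 3.14, e is 2.718")

print(phones)        # ['555-0199', '555-0123']
print(spellings)     # ['colour', 'color']
print(mid_numbers)   # ['42', '365']
print(pets)          # ['cat', 'dog']
print(decimals)      # ['3.14', '2.718']
```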

The re module provides the following functions:

re.match():

This function finds a match for the pattern only if it occurs at the start of the string.

re.match(pattern, string)

This function returns None if no match can be found. If it is successful, a match object is returned, containing information about the match: where it starts and ends, the substring it matched, etc.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.match(r"Simple", string)
>>> obj
<re.Match object; span=(0, 6), match='Simple'>
>>> obj.start()
0
>>> obj.end()
6

The match object's start() method returns the starting position of the match in the string, and end() returns the position just after its last character.

If the pattern is not found, re.match() returns None instead of a match object.
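Because re.match() can return None, calling start() or end() on the result without checking it first raises an AttributeError. One way to guard against that is sketched below; the helper describe() is a hypothetical name used only for this illustration.

```python
import re

string = "Simple is better than complex."

hit = re.match(r"Simple", string)    # the pattern is at the start
miss = re.match(r"better", string)   # the pattern occurs later, so no match


def describe(match):
    """Return the matched text and its span, or a note when there is no match."""
    if match is None:
        return "no match"
    return f"{match.group()} at {match.span()}"


print(describe(hit))    # Simple at (0, 6)
print(describe(miss))   # no match
```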

re.search():

This function scans the string from left to right and returns a match object for the first occurrence of the pattern, wherever it occurs. Later occurrences are ignored.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.search(r"is", string)
>>> obj.start()
7
>>> obj.end()
9
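The sketch below shows that re.search() reports only the first occurrence. To see where the later occurrences are, it uses re.finditer(), a function of the re module that this tutorial does not otherwise cover; it yields one match object per occurrence.

```python
import re

string = "Simple is better than complex."

first = re.search(r"e", string)
first_position = first.start()   # index of the first 'e', in "Simple"

# Slicing with start() and end() recovers the matched text
matched_text = string[first.start():first.end()]

# finditer() yields a match object for every occurrence, not just the first
all_positions = [m.start() for m in re.finditer(r"e", string)]

print(first_position)   # 5
print(matched_text)     # e
print(all_positions)    # [5, 11, 14, 27]
```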

re.findall():

This function returns a list of all non-overlapping matches of the pattern in the string.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.findall(r"ple", string)
>>> obj
['ple', 'ple']

To obtain a list of all alphanumeric characters in the string:

>>> obj=re.findall(r"\w", string)
>>> obj
['S', 'i', 'm', 'p', 'l', 'e', 'i', 's', 'b', 'e', 't', 't', 'e', 'r', 't', 'h', 'a', 'n', 'c', 'o', 'm', 'p', 'l', 'e', 'x']

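To obtain a list of words (because \w* can match zero characters, the result also contains empty strings at positions where no word starts; the pattern \w+ avoids this):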

>>> obj=re.findall(r"\w*", string)
>>> obj
['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
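The empty strings in the output above come from the * quantifier, which accepts zero characters. A sketch of the same call with + instead, which requires at least one word character:

```python
import re

string = "Simple is better than complex."

# \w+ needs one or more word characters, so it cannot match the empty string
words = re.findall(r"\w+", string)

print(words)        # ['Simple', 'is', 'better', 'than', 'complex']
print(len(words))   # 5
```

The trailing period is not part of the last word because '.' is not a word character.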

re.split():

This function splits the string at each occurrence of the given pattern. It returns the list of resulting substrings.

>>> import re
>>> string="Simple is better than complex."
>>> obj=re.split(r' ',string)
>>> obj
['Simple', 'is', 'better', 'than', 'complex.']

The string is split at each occurrence of a space ' ', returning a list of slices, each corresponding to a word. Note that the output is the same as that of the split() method of the built-in str object.

>>> string.split(' ')
['Simple', 'is', 'better', 'than', 'complex.']
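re.split() becomes more useful than str.split() when the separator varies. The sketch below splits on any run of commas, semicolons or whitespace; the sample string is invented for illustration. It also shows the maxsplit argument of re.split(), which limits the number of splits.

```python
import re

messy = "Simple,is;  better   than,complex"

# [,;\s]+ matches one or more commas, semicolons or whitespace characters
parts = re.split(r"[,;\s]+", messy)

# maxsplit=2 performs at most two splits; the remainder stays in one piece
limited = re.split(r"\s+", "a b c d", maxsplit=2)

print(parts)     # ['Simple', 'is', 'better', 'than', 'complex']
print(limited)   # ['a', 'b', 'c d']
```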

re.sub():

This function returns a string in which each occurrence of the pattern is replaced by a substitute string. The usage of this function is:

re.sub(pattern, replacement, string)

In the example below, 'is' is replaced by 'was' everywhere in the target string. Note that the pattern matches the characters 'is' wherever they appear, including inside longer words.

>>> string="Simple is better than complex. Complex is better than complicated."
>>> obj=re.sub(r'is', r'was',string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
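The pattern 'is' also matches inside longer words, which the example sentence happens not to expose. The sketch below uses a made-up sentence to show the effect, then restricts the match to the whole word with \b, and finally limits the number of replacements with the count argument of re.sub().

```python
import re

text = "This is his list."   # sample sentence chosen to expose the problem

# The bare pattern replaces 'is' inside other words as well
naive = re.sub(r"is", "was", text)

# \b on both sides restricts the match to the standalone word
whole_word = re.sub(r"\bis\b", "was", text)

# count=1 replaces only the first occurrence
first_only = re.sub(r"is", "was", text, count=1)

print(naive)        # Thwas was hwas lwast.
print(whole_word)   # This was his list.
print(first_only)   # Thwas is his list.
```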

re.compile():

This function compiles a regular expression pattern into a regular expression object. This is useful when you need to use an expression several times.

>>> string
'Simple is better than complex. Complex is better than complicated.'
>>> pattern=re.compile(r'is')
>>> obj=pattern.match(string)
>>> obj=pattern.search(string)
>>> obj.start()
7
>>> obj.end()
9
>>> obj=pattern.findall(string)
>>> obj
['is', 'is']
>>> obj=pattern.sub(r'was', string)
>>> obj
'Simple was better than complex. Complex was better than complicated.'
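A typical reason to compile a pattern is to apply it to many strings in a loop. The following sketch reuses one compiled object over a short list of lines; the variable names are illustrative, and the pattern is the vowel-word pattern used elsewhere in this tutorial.

```python
import re

# Compile once, then reuse the pattern object for every line
vowel_word = re.compile(r"\b[aeiouAEIOU]\w+")

lines = [
    "Errors should never pass silently.",
    "Unless explicitly silenced.",
    "Now is better than never.",
]

results = [vowel_word.findall(line) for line in lines]

print(results)
# [['Errors'], ['Unless', 'explicitly'], ['is']]
```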

Some common use cases of the re module

Finding words that start with a vowel

>>> string='Errors should never pass silently. Unless explicitly silenced.'
>>> obj=re.findall(r'\b[aeiouAEIOU]\w+', string)
>>> obj
['Errors', 'Unless', 'explicitly']

Replace domain names of all email IDs in a list.

>>> emails=['aa@xyz.com', 'bb@abc.com', 'cc@mnop.com']
>>> gmails=[re.sub(r'@\w+\.\w+','@gmail.com', x) for x in emails]
>>> gmails
['aa@gmail.com', 'bb@gmail.com', 'cc@gmail.com']
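If only the provider name should change while the rest of the domain is kept, a capture group and a backreference can be used. This is a sketch under the assumption that each address has the simple form name@provider.suffix; the addresses are made up, and real email validation is considerably more involved.

```python
import re

emails = ['aa@xyz.com', 'bb@abc.org', 'cc@mnop.net']   # hypothetical addresses

# (\w+) captures the suffix after the escaped dot; \1 in the replacement
# inserts whatever that group matched
gmails = [re.sub(r'@\w+\.(\w+)', r'@gmail.\1', x) for x in emails]

print(gmails)   # ['aa@gmail.com', 'bb@gmail.org', 'cc@gmail.net']
```

The dot is escaped as \. so that it matches a literal period. An unescaped dot would match any character.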
