In this tutorial, I will show you how to use regular expressions for advanced text processing.
If you are new to the topic, please read the posts in this series in order, because regular expressions are easier to understand when you learn them step by step.
I am using Python 2.7 on Ubuntu.
part3 slides
In [107]:
#
# -- Useful Reference Syntax --------------
#
# Some special characters are:
#
# abc         Letters
# 123         Digits
# \d          Any digit
# \D          Any non-digit character
# .           Any character
# \.          Period
# [abc]       Only a, b, or c
# [^abc]      Not a, b, nor c
# [a-z]       Characters a to z
# [0-9]       Numbers 0 to 9
# \w          Any alphanumeric character
# \W          Any non-alphanumeric character
# {m}         m repetitions
# {m,n}       m to n repetitions
# *           Zero or more repetitions
# +           One or more repetitions
# ?           Optional character
# \s          Any whitespace
# \S          Any non-whitespace character
# ^...$       Starts and ends
# (...)       Capture group
# (a(bc))     Capture sub-group
# (.*)        Capture all
# (abc|def)   Matches abc or def
import re  # Python module for regular expressions

# The backslash right after the opening quotes keeps the string from starting with a newline.
example_string = """\
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875
This is an example text
"""

# Meta Characters
# You may be wondering what meta characters are used for.
# If you think back, you will find that we have already used many of the
# characters from the list above in previous examples.
# For more information, check my previous tutorials about regular expressions.
#
# So, let's start this tutorial.
#
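Before moving on, here is a small warm-up sketch that exercises a few entries from the reference list above. The sample sentence and the variable names are made up for illustration; `print(...)` with a single argument is used so the snippet should behave the same on Python 2.7 and Python 3.

```python
import re

# Hypothetical sample text, not part of the tutorial's example_string.
sample = "Order 66 shipped to agent_007 on 2017-03-09."

# \d+ : one or more digits
digits = re.findall(r"\d+", sample)

# [a-z]+_\d{3} : lowercase letters, an underscore, then exactly three digits
agent = re.findall(r"[a-z]+_\d{3}", sample)

# (\d{4})-(\d{2})-(\d{2}) : three capture groups, returned as one tuple per match
date_parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", sample)

print(digits)      # ['66', '007', '2017', '03', '09']
print(agent)       # ['agent_007']
print(date_parts)  # [('2017', '03', '09')]
```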
In [108]:
#
# In this example, we will use some more useful functions.
# So, keep reading.
#
# With this technique, we first compile the pattern: re.compile() turns the
# pattern string into a reusable pattern object, as shown in the example below.
pattern = re.compile('suraj')  # Compile the pattern into a pattern object
print pattern.match(example_string)  # match() only checks the start of the string
<_sre.SRE_Match object at 0x7f583c51b098>
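Since `match()` only looks at the beginning of the string, it can return `None` even when the text appears later on. The sketch below, using a shortened sample string of my own, contrasts it with `search()`, which scans the whole string.

```python
import re

# Shortened sample text (an assumption for this sketch).
text = "suraj singh bisht\nSURAJ SINGH BISHT\n"
pattern = re.compile(r"singh")

at_start = pattern.match(text)   # None: 'singh' is not at index 0
anywhere = pattern.search(text)  # match object: 'singh' is found at index 6

if at_start is None:
    print("match() found nothing at the start")
if anywhere is not None:
    print("search() found: " + anywhere.group(0))
```

Checking for `None` before calling methods such as `group()` avoids an `AttributeError` when nothing matches.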
In [17]:
# Here, our output is an <_sre.SRE_Match object> (the address after "at" changes from run to run).
# This means we found a match. If the output is None, no match was found.
#
# Example 1.
#
pattern = re.compile('suraj|SURAJ')  # Compiling
k = pattern.match(example_string)
# k.start() returns the starting index of the matched text and
# k.end() returns the index just past its end.
print k.start(), k.end()
# k.span() returns the start and end indexes together as a tuple.
print k.span()
0 5
(0, 5)
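The indexes returned by `span()` can be used directly to slice the original string. A minimal sketch, assuming a short sample string:

```python
import re

text = "suraj singh bisht"  # sample text assumed for this sketch
m = re.compile(r"suraj|SURAJ").match(text)

start, end = m.span()      # (0, 5)
matched = text[start:end]  # same text that m.group(0) returns

print(matched)                # suraj
print(matched == m.group(0))  # True
```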
In [150]:
#
# Example 2.
#
example_string = """surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875"""
print result.group(0)  # The whole match
print result.group(1)  # The first capture group
print result.group(2)  # The second capture group
com
0124-100-125
com
0124-100-125
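The pattern behind `result` is not shown in this example, so the sketch below uses a hypothetical pattern of my own with two capture groups. It is only one possible pattern that produces the same kind of output, and it illustrates how `group(0)`, `group(1)`, and `group(2)` relate to each other.

```python
import re

example_string = """surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875"""

# Hypothetical pattern (an assumption, not the original one):
# group 1 captures 'com', group 2 captures the digits on the next line.
pattern = re.compile(r"(com)\n(\d{4}-\d{3}-\d{3})")
result = pattern.search(example_string)

whole = result.group(0)   # the entire match, including the newline
first = result.group(1)   # first capture group
second = result.group(2)  # second capture group

print(whole)
print(first)
print(second)
```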
In [22]:
#
# Example 3.
#
pattern.findall(example_string)  # This function was already explained in the previous tutorial
Out[22]:
['suraj', 'SURAJ', 'suraj']
In [151]:
# Let's try some other examples.
#
# Example 4.
# Here, I am searching for email addresses.
example_string = """
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875
This is an example text
"""
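The search code for this example is not shown, so here is a minimal sketch of one way to collect every email address with `findall()`. The pattern is a simplified assumption that fits this sample text, not a general-purpose email validator.

```python
import re

example_string = """
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875
This is an example text
"""

# Simplified pattern: letters/digits, '@', a lowercase domain name,
# a literal period (\.), then a lowercase suffix.
email_pattern = re.compile(r"[a-zA-Z0-9]+@[a-z]+\.[a-z]+")
emails = email_pattern.findall(example_string)

print(emails)
# ['surajsinghbisht054@gmail.com', 'yashwantsinghbisht054@gmail.com']
```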
In [41]:
# We got our result.
# Now, let's retrieve the result by name.
# This means we will give a name to the pattern group.
#
# Example 5.
#
# (?P<pattern_name>here_pattern)
pattern = re.compile(r'(?P<email>[a-zA-Z0-9]+@[a-z]+\.[a-z]+)')  # Compiling; \. matches a literal period
result = pattern.search(example_string)
result.groupdict()
Out[41]:
{'email': 'surajsinghbisht054@gmail.com'}
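A named group can also be read individually with `group('name')`. The sketch below splits the address into two named groups; the group names `user` and `domain` and the sample text are my own choices for illustration.

```python
import re

text = "contact: surajsinghbisht054@gmail.com"  # sample text assumed for this sketch

# 'user' and 'domain' are hypothetical group names.
pattern = re.compile(r"(?P<user>[a-zA-Z0-9]+)@(?P<domain>[a-z]+\.[a-z]+)")
result = pattern.search(text)

user = result.group("user")      # access a single group by name
domain = result.group("domain")
parts = result.groupdict()       # all named groups as a dictionary

print(user)    # surajsinghbisht054
print(domain)  # gmail.com
```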
In [44]:
# Here, I am trying to search for a phone number.
# Example 6.
#
pattern = re.compile(r'(?P<phone>\d{3}-\d{3}-\d{4})')  # Compiling
result = pattern.search(example_string)
result.groupdict()
Out[44]:
{'phone': '100-125-2563'}
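Notice that the result above is only a part of the longer number 0124-100-125-2563. If that is not what you want, one possible refinement is to reject matches that touch another digit or hyphen. The lookaround version below is a sketch of that idea, not the only way to do it.

```python
import re

numbers = "0124-100-125-2563\n124-586-9875"

# Plain pattern: also matches inside the longer number.
loose = re.findall(r"\d{3}-\d{3}-\d{4}", numbers)

# Stricter pattern: (?<![\d-]) and (?![\d-]) require that no digit or
# hyphen sits directly before or after the match.
strict = re.findall(r"(?<![\d-])\d{3}-\d{3}-\d{4}(?![\d-])", numbers)

print(loose)   # ['100-125-2563', '124-586-9875']
print(strict)  # ['124-586-9875']
```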
In [91]:
# Now, we will try to search for an email address and a phone number together.
# Example 7.
#
pattern = re.compile(r'(?P<email>([a-zA-Z0-9]+@[a-z]+\.[a-z]+))(\W.*\W.*\W)(?P<phone>\d{3}-\d{3}-\d{4})')  # Compiling
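To see what this combined pattern captures, the sketch below runs it with `search()` and reads the named groups. It uses the pattern with the period escaped as `\.` and assumes the same sample text as the earlier examples.

```python
import re

example_string = """
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875
This is an example text
"""

pattern = re.compile(
    r"(?P<email>([a-zA-Z0-9]+@[a-z]+\.[a-z]+))"
    r"(\W.*\W.*\W)"
    r"(?P<phone>\d{3}-\d{3}-\d{4})"
)
result = pattern.search(example_string)
found = result.groupdict()  # only the named groups appear here

print(found["email"])  # yashwantsinghbisht054@gmail.com
print(found["phone"])  # 124-586-9875
```

The first email address is skipped because `.` does not cross newlines without `re.DOTALL`, so the middle group cannot stretch from that address to a phone number.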
In [152]:
# In these examples,
# we split a string into smaller parts,
# and the given pattern marks the points where the string is broken.
# Let's try this too.
# Example 9.
#
re.split('sur.{2}', example_string)  # Here, I want to break the string at the "suraj" keyword
Out[152]:
['\n', ' singh bisht\nSURAJ SINGH BISHT\n', 'singhbisht054@gmail.com\nwww.bitforestinfo.com\nyashwantsinghbisht054@gmail.com\n0124-100-125-2563\n124-586-9875\nThis is an example text\n']
In [154]:
# One more example.
#
# Example 10.
#
example_string = "The is My Cat and This is my dog but this is my horse"
#
# Let's try to break this line at the "and" and "but" keywords.
re.split('and|but', example_string)
Out[154]:
['The is My Cat ', ' This is my dog ', ' this is my horse']
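Two small variations on this split may be useful. Wrapping the pattern in a capture group keeps the separators in the result, and absorbing the surrounding whitespace into the separator gives cleaner pieces. A sketch using the same sentence:

```python
import re

example_string = "The is My Cat and This is my dog but this is my horse"

# A capture group makes re.split() return the separators as well.
with_separators = re.split(r"(and|but)", example_string)

# (?:...) groups without capturing; \s+ swallows the spaces around the keyword.
trimmed = re.split(r"\s+(?:and|but)\s+", example_string)

print(with_separators)
# ['The is My Cat ', 'and', ' This is my dog ', 'but', ' this is my horse']
print(trimmed)
# ['The is My Cat', 'This is my dog', 'this is my horse']
```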
In [185]:
# Now, let's move to the next level and discuss some special cases.
#
# Python provides some flags that modify how a pattern is matched.
#
# Flags
# re.S, re.DOTALL
# re.I, re.IGNORECASE
# re.L, re.LOCALE
# re.M, re.MULTILINE
# re.U, re.UNICODE
# re.X, re.VERBOSE
# For more information, visit: https://docs.python.org/2/library/re.html#module-contents
#
# Example 11.
#
example_string = """
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
www.bitforestinfo.com
yashwantsinghbisht054@gmail.com
0124-100-125-2563
124-586-9875
This is an example text
"""
print re.findall('SURAJ', example_string, re.IGNORECASE)
# As you can see in the output, my input pattern is in uppercase, but the output contains both lowercase and uppercase matches.
['suraj', 'SURAJ', 'suraj']
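Flags can be combined with `|`. As a sketch of another flag from the list, `re.MULTILINE` makes `^` match at the start of every line instead of only at the start of the whole string. The shortened sample text is an assumption for this example.

```python
import re

example_string = """
suraj singh bisht
SURAJ SINGH BISHT
surajsinghbisht054@gmail.com
"""

# Without re.MULTILINE, ^ only matches at index 0, where this string has a newline.
without_multiline = re.findall(r"^suraj", example_string, re.IGNORECASE)

# With re.MULTILINE, ^ also matches right after every newline.
with_multiline = re.findall(r"^suraj", example_string, re.IGNORECASE | re.MULTILINE)

print(without_multiline)  # []
print(with_multiline)     # ['suraj', 'SURAJ', 'suraj']
```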
In [195]:
#
# Example 12.
#
print re.findall('BITF(.*)com', example_string, re.IGNORECASE|re.DOTALL)
#
# Normally, the . character matches any character except a newline, but because of the re.DOTALL flag,
# in this example . also matches newlines.
['orestinfo.com\nyashwantsinghbisht054@gmail.']
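To make the effect of `re.DOTALL` easier to see, the sketch below runs the same pattern with and without the flag on a shortened two-line sample (an assumption for this example).

```python
import re

text = "www.bitforestinfo.com\nyashwantsinghbisht054@gmail.com\n"

# Without re.DOTALL, . stops at the newline, so the match stays on the first line.
one_line = re.findall(r"BITF(.*)com", text, re.IGNORECASE)

# With re.DOTALL, . also matches newlines, so the greedy .* runs to the last 'com'.
across_lines = re.findall(r"BITF(.*)com", text, re.IGNORECASE | re.DOTALL)

print(one_line)      # ['orestinfo.']
print(across_lines)  # ['orestinfo.com\nyashwantsinghbisht054@gmail.']
```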
In our next tutorial, we will discuss some other regular expression techniques.