Python Regular Expressions
- Extract phone numbers v1
- Extract phone numbers v2
- Extract phone numbers v3
- Extract phone numbers v4
- Extract phone numbers v5
- Extract emails
- Extract zip codes
- Log file analysis
- Credit cards extractor
- Word occurrences
- HTML tags
- Password verification
- Regular expressions cheatsheet
- Challenge – Extract data to JSON format
Regular expressions are an important tool for a data scientist or a machine learning engineer. They can be a dry topic to learn, but the problems they solve are real and interesting. So, we will learn regular expressions with a problem-solving approach. We will define a series of small problems, solve them step by step, and pick up some aspect of regular expressions with each problem.
If you are not comfortable with this kind of non-linear approach, this course might not be for you.
Github Repo
You can find the Jupyter Notebook for all the examples given on this page at the following repository.
Extract phone numbers v1
Problem – Extract all the phone numbers from this text.
numbers = '''There are 3 phone numbers that you need to call in case of medical emergency.
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 408-202-4444
'''
Let’s take a much simpler case: just a list of 3 phone numbers and no text. If the pattern is at the beginning of the string, you can use the match ( ) function. Note that the match ( ) function only returns the first occurrence.
import re #import for regular expressions
numbers = '''408-202-2222
408-203-2222
408-202-4444'''
match = re.match("408-\d\d\d-\d\d\d\d",numbers)
print ( match )
The match ( ) function returns a match object. It contains the span (the start and end positions of the match) and the matched text itself. Use the group ( ), span ( ), start ( ) and end ( ) methods to get the specifics of the match.
print ( "matching text = ", match.group())
print ( "start position = ", match.start())
print ( "end position = ",match.end())
matching text = 408-202-2222
start position = 0
end position = 12
Let’s try something slightly different with match ( ) function. Will it be able to pick the pattern from this text ?
numbers = '''
408-202-2222
408-203-2222
408-202-4444'''
match = re.match("408-\d\d\d-\d\d\d\d",numbers)
print ( match )
None
No. Why is that ? The match ( ) function only looks for the pattern at the very beginning of the string. In this case, the string starts with a blank line, so the match ( ) function fails. In cases like this, use the search ( ) function. In contrast to the match ( ) function, the search ( ) function can find the pattern anywhere in the text.
match = re.search("408-\d\d\d-\d\d\d\d",numbers)
print ( match )
But, we are still getting the first match only. We wanted all the matches, right ? Both the search ( ) and match ( ) functions return the first match only. To get all the matches, we will have to use other functions like findall ( ) or finditer ( ).
matches = re.findall("408-\d\d\d-\d\d\d\d",numbers)
print ( matches )
['408-202-2222', '408-203-2222', '408-202-4444']
That’s much better, right ? The findall ( ) function returns all the matches it finds in the text as a list of strings. The other function, finditer ( ), finds the same matches but returns them as an iterator of match objects.
matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)
for match in matches :
    print ( match )
<re.Match object; span=(14, 26), match='408-202-2222'>
<re.Match object; span=(40, 52), match='408-203-2222'>
<re.Match object; span=(66, 78), match='408-202-4444'>
If you wanted just the match, use the group () function to extract the matching text.
matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)
for match in matches :
    print ( match.group() )
408-202-2222
408-203-2222
408-202-4444
Now, we can solve the problem we started out with.
numbers = '''There are 3 phone numbers that you need to call in case of medical emergency.
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 408-202-4444
'''
matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)
for match in matches :
    print ( match.group() )
408-202-2222
408-203-2222
408-202-4444
In fact, even if the first three digits are not always the same (408 in this case), we should still be able to extract the matches by replacing the fixed 408 with three \d.
numbers = '''There are 3 phone numbers that you need to call in case of medical emergency.
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 800-202-4444
'''
matches = re.finditer("\d\d\d-\d\d\d-\d\d\d\d",numbers)
for match in matches :
    print ( match.group() )
408-202-2222
408-203-2222
800-202-4444
See, all the numbers have been extracted.
Points to Remember
- \d represents a single digit
- match ( ) function returns the first match only, and only if the pattern occurs at the very beginning of the string.
- search ( ) function returns the first match only.
- findall ( ) and finditer ( ) functions return all the matches.
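To see the four functions side by side, here is a small sketch run on the string that starts with a blank line. The pattern is written as a raw string (r"...") purely as a precaution so the backslashes reach the regex engine unchanged; the variable names are illustrative and not part of the original notebook.

```python
import re

# Same three numbers as above, with a leading blank line.
numbers = """
408-202-2222
408-203-2222
408-202-4444"""

pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

at_start = re.match(pattern, numbers)         # only looks at position 0
first_anywhere = re.search(pattern, numbers)  # first hit anywhere in the text
all_texts = re.findall(pattern, numbers)      # list of matched strings
all_spans = [m.span() for m in re.finditer(pattern, numbers)]  # from match objects

print(at_start)                # None, because the string starts with a newline
print(first_anywhere.group())  # the first phone number
print(all_texts)
print(all_spans)
```

A rough rule of thumb: reach for findall ( ) when only the text is needed, and finditer ( ) when the positions matter as well.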
Extract phone numbers v2
Problem – Extract all the phone numbers from this text message.
numbers = '''408-222-2222,
(408)-333-3333,
(800)-444-4444'''
Let’s try what we know so far.
match = re.findall("\d\d\d-\d\d\d-\d\d\d\d",numbers)
print ( match )
['408-222-2222']
But this only matches the phone numbers without parentheses. What about the ones with parentheses ? We can try something like this.
match = re.findall("(\d\d\d)-\d\d\d-\d\d\d\d",numbers)
print ( match )
['408']
oops.. it is not working. Why ? Because a parenthesis is a special character. It is used to make groups out of regular expressions (which we will see later), so findall ( ) returned only the contents of the group. To represent an actual parenthesis, escape it with a backslash.
match = re.findall("\(\d\d\d\)-\d\d\d-\d\d\d\d",numbers)
print ( match )
['(408)-333-3333', '(800)-444-4444']
OK. Now, we got the phone numbers with parentheses, but we missed the ones without parentheses. We want to capture either of these combinations. That’s when we use the OR operator. In regular expressions, we use the pipe operator (|) to represent either/or type of patterns.
match = re.findall("\(\d\d\d\)-\d\d\d-\d\d\d\d|\d\d\d-\d\d\d-\d\d\d\d",numbers)
print ( match )
['408-222-2222', '(408)-333-3333', '(800)-444-4444']
There we go – we were able to capture both the patterns. However, the \d in the pattern repeats a lot making the string too long. Instead, we can use quantifiers to specify how long a particular sub-pattern can be. For example, the following pattern is exactly equivalent to the pattern above.
match = re.findall("\(\d{3}\)-\d{3}-\d{4}|\d{3}-\d{3}-\d{4}",numbers)
print ( match )
['408-222-2222', '(408)-333-3333', '(800)-444-4444']
As you can see, quantifiers make the pattern much more compact in case there is a lot of repetition.
Points to Remember
- If a parenthesis, ( or ), needs to be used in the pattern, escape it with a backslash ( \ ). This is needed because parentheses are used to represent groups, which we will look into later.
- | or pipe character is used to represent a logical OR operator in regular expressions.
- { } Curly braces are used to quantify the number of occurrences of a particular part of a regular expression. For example, a{3} is used to indicate that exactly 3 a’s should be looked for.
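One way to convince yourself that the quantifier form is only a shorter spelling is to run both forms on the same text and compare the results. The sketch below does that; the names long_form and compact_form are made up for this illustration, and raw strings are used only as a precaution.

```python
import re

numbers = """408-222-2222,
(408)-333-3333,
(800)-444-4444"""

# Long form: every digit spelled out with \d.
long_form = r"\(\d\d\d\)-\d\d\d-\d\d\d\d|\d\d\d-\d\d\d-\d\d\d\d"
# Compact form: the same pattern written with {n} quantifiers.
compact_form = r"\(\d{3}\)-\d{3}-\d{4}|\d{3}-\d{3}-\d{4}"

long_matches = re.findall(long_form, numbers)
compact_matches = re.findall(compact_form, numbers)

print(compact_matches)
print(long_matches == compact_matches)  # True when the two forms agree
```

If the counts inside the braces are off by even one, the comparison fails, which makes this a cheap check when shortening a pattern.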
Extract phone numbers v3
Problem – Extract all the phone numbers from this text message.
numbers = '''408-222-2222,
408.333.3333,
800 444 4444'''
match = re.findall("\d{3}-\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4}|\d{3} \d{3} \d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333', '800 444 4444']
This works. But, can we make it any more concise ? There seems to be a lot of repetition. This is where character sets come in. In this case, the separator between the parts of a phone number is either a dash, a dot or a blank space. Can we somehow list all of these characters as possible separators, as opposed to spelling out each pattern separately ?
match = re.findall("\d{3}[-.]\d{3}[-.]\d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333']
But what about phone numbers with spaces ? How do we represent a space in regular expressions ? We use the special character \s, which matches any whitespace character.
match = re.findall("\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333', '800 444 4444']
There we go – we are able to capture all of the phone numbers.
Points to Remember
- Characters enclosed in [] (square brackets) are called character sets. A character set matches any single character listed inside it.
- \s is used to represent a whitespace character (a space, a tab or a line break).
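A character set is also stricter than a bare dot. As a rough illustration, the sketch below adds one made-up entry (408x555y5555, not from the original text) and compares a pattern that uses an unescaped dot as the separator with one that uses the character set.

```python
import re

# The last entry is a hypothetical non-phone string added for this comparison.
samples = "408-222-2222, 408.333.3333, 800 444 4444, 408x555y5555"

# Unescaped dot: ANY character is accepted as a separator.
loose = re.findall(r"\d{3}.\d{3}.\d{4}", samples)
# Character set: only a dash, a dot or whitespace is accepted.
strict = re.findall(r"\d{3}[-.\s]\d{3}[-.\s]\d{4}", samples)

print(loose)   # includes 408x555y5555
print(strict)  # leaves it out
```

Inside the square brackets the dot is taken literally, so it does not need a backslash there.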
Extract phone numbers v4
Problem – Extract all the phone numbers from this text.
numbers = ''' 408-222-2222,
1 408.333.3333,
1 408-444-4444,
1 (800) 444 4444'''
match = re.findall("\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333', '408-444-4444']
But, how about the 1 before the numbers ? How do we capture it ? Some phone numbers have it and some don’t. That’s where the ? quantifier comes in. If a pattern may occur zero or one time, use the ? quantifier.
match = re.findall("1?\s\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
[' 408-222-2222', '1 408.333.3333', '1 408-444-4444']
Much better. Now, what about the 800 number with parentheses ? How do we look for parentheses ? We have seen previously that a parenthesis is a special character and that we need to escape it to match it. Let’s try that.
match = re.findall("1?\s\(?\d{3}\)?[\s]\d{3}[\s]\d{4}",numbers)
print ( match )
['1 (800) 444 4444']
Alright, we got that as well. Now, to combine all of these, we can use the OR operator.
match = re.findall("1?\s\(?\d{3}\)?[\s]\d{3}[\s]\d{4}|1?\s\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
[' 408-222-2222', '1 408.333.3333', '1 408-444-4444', '1 (800) 444 4444']
Or, we can combine them like this.
match = re.findall("1?\s\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
[' 408-222-2222', '1 408.333.3333', '1 408-444-4444', '1 (800) 444 4444']
Learning
- ? is used to represent a pattern that repeats zero or one time. It is a type of quantifier like {n}
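The matches above carry a leading space whenever the 1 is absent, because \s is required even when 1? matches nothing. One possible variation, shown here only as a sketch, makes the 1 and its trailing space optional together. It relies on the non-capturing group syntax (?:…), which is introduced later in the credit card section.

```python
import re

numbers = """ 408-222-2222,
1 408.333.3333,
1 408-444-4444,
1 (800) 444 4444"""

# (?:1\s)? : an optional "1" followed by one whitespace character, taken as a unit
# \(? \)?  : optional literal parentheses around the area code
pattern = r"(?:1\s)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"

matches = re.findall(pattern, numbers)
print(matches)
```

The trade-off is that the pattern no longer insists on whitespace before the area code, so it is a little more permissive than the version above.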
Extract phone numbers v5
Problem – Extract all the phone numbers from this text.
numbers = '''+1 408-222-2222,
+91 98989-99898,
+86 10-1234-5678,
+263 10-234-5678'''
The first one is a US phone number, the second one is an Indian number, the third one is a Chinese number and the fourth one is a Zimbabwean number. How do we extract these ? Let’s start with the plus (+) at the beginning of the string. How do we extract that ?
match = re.findall("+",numbers)
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-164-2fe5ce5c5168> in <module>
----> 1 match = re.findall("+",numbers)
.....
error: nothing to repeat at position 0
oops.. doesn’t work. That is because + is a special character. It is a quantifier: + means that a pattern repeats one or more times. So, to find + as a pattern, you would have to escape it.
match = re.findall("\+", numbers)
print ( match )
OK. Now, we are able to get the + in the string. Let’s extract the country code next. It is the set of digits right next to the + symbol. It could be 1 digit (like the US), 2 digits (like India or China), or maybe 3 digits (Zimbabwe is +263). We can use curly braces to specify a pattern length of 1 to 3, like so:
{1,3}
match = re.findall("\+\d{1,3}", numbers)
print ( match )
Next, we have groups of digits separated by dashes. However, the number of digits between the dashes is arbitrary. So, we need a quantifier again to match a run of digits whose length is between 1 and n. We could just assume a high enough number, say 6, for n and proceed like so.
match = re.findall("\+\d{1,3}\s{1,3}\d{1,6}-\d{1,6}-\d{1,6}", numbers)
print ( match )
Instead of using the {m,n} quantifier to identify digits that repeat at least once, you can use the quantifier +.
match = re.findall("\+\d+\s+\d+-\d+-\d+", numbers)
print ( match )
We are still missing another number , +91 98989-99898. This is because, the number is divided into 2 parts (and not 3 parts separated by dashes). So, a simple solution would be to create another pattern and do an OR operation. That should capture all of the possible phone numbers in this case.
match = re.findall("\+\d+\s+\d+-\d+-\d+|\+\d+\s+\d+-\d+", numbers)
print ( match )
Learning
- {m,n} is used to represent a pattern that repeats m to n number of times. It is a type of quantifier.
- Since + is a special character (used to identify patterns that repeat 1 or more times), to identify + itself, escape it with a backslash (\)
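The two alternatives in the last pattern differ only in whether a third dash-separated part is present. As a sketch of a more compact spelling, the optional part can be wrapped in a non-capturing group (?:…), covered later in the credit card section, and followed by ?. The variable names are illustrative.

```python
import re

numbers = """+1 408-222-2222,
+91 98989-99898,
+86 10-1234-5678,
+263 10-234-5678"""

# Pattern from the text: three-part numbers OR two-part numbers.
with_or = r"\+\d+\s+\d+-\d+-\d+|\+\d+\s+\d+-\d+"
# Same idea, with the third part made optional instead of repeated.
with_optional = r"\+\d{1,3}\s+\d+-\d+(?:-\d+)?"

or_matches = re.findall(with_or, numbers)
optional_matches = re.findall(with_optional, numbers)

print(optional_matches)
print(or_matches == optional_matches)
```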
Extract emails
Problem – Extract all the emails from this text.
text = ''' accounts@boa.com,
sales@boa.com,
cancellations@tesla.com,
accounts@delloitte.com,
cancellations@farmers.com,
accounts@dell@com'''
For text-based patterns, one of the fundamental character classes is \w. It represents any character that can be found in a word: a letter, a digit or an underscore. These are the only 3 types of characters that \w matches. For example, a single \w on this text captures all the word characters (a to z characters, 0-9 digits and underscore) one at a time.
matches = re.findall("\w",text)
print ( matches )
We need to step up from letters, to identify words. A word is just a repetition of a set of letters, numbers and underscores. So, we use a quantifier + to identify a word.
matches = re.findall("\w+",text)
print ( matches )
Now that we have all the words, all we have to do is to put together the pattern that includes the @ symbol and dot.
matches = re.findall("\w+@\w+.\w+",text)
print ( matches )
We are almost there, except for the last email, accounts@dell@com. This is not a valid email. So, why is our pattern capturing it ? When we use a dot (.) in our pattern (\w+@\w+.\w+), it matches any character, including the second @. So, in order to match an actual dot, all we have to do is escape it: prepend it with a backslash (\).
matches = re.findall("\w+@\w+\.\w+",text)
print ( matches )
There you go, we have successfully found all the valid emails in the text.
Learning
- \w is used to represent a character in a word. It could be a letter (a-z), a digit (0-9) or an underscore.
- \w+ is used to identify words. All you have to do is append a plus (+) to \w.
- . (dot) is used to identify ANY character. It is a special character. To actually identify a . (dot) itself, just escape it with a backslash (\).
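Once the emails are matched, a natural next step is to pull out just one part of each. The sketch below wraps the domain in parentheses (a capturing group, explained in the log file section) so that findall ( ) returns only the domains, and then counts them with collections.Counter. This is an illustrative extension, not part of the original exercise.

```python
import re
from collections import Counter

text = """ accounts@boa.com,
sales@boa.com,
cancellations@tesla.com,
accounts@delloitte.com,
cancellations@farmers.com,
accounts@dell@com"""

# The parentheses capture only the part after the @ sign.
domains = re.findall(r"\w+@(\w+\.\w+)", text)
domain_counts = Counter(domains)

print(domains)
print(domain_counts.most_common(1))
```

The invalid accounts@dell@com entry contributes nothing, since the escaped dot still has to match.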
Extract zip codes
Problem : Say we have a text with US zip codes. The valid formats for US zip codes are
- 99999
- 99999-9999
where 9 represents any digit. Write a regular expression to extract all zip codes from the text.
text = '''08820, 08820-1245, zip code, 98487, 98487-0000, ABCD '''
matches = re.findall ( "\d{5}-\d{4}|\d{5}", text)
print ( matches )
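The order of the alternatives matters here. The regex engine tries them left to right and stops at the first one that succeeds, so the longer form has to come first. The sketch below compares both orders, plus a variant that makes the four-digit extension optional with a non-capturing group (introduced later in the credit card section). Variable names are illustrative.

```python
import re

text = "08820, 08820-1245, zip code, 98487, 98487-0000, ABCD "

long_first = re.findall(r"\d{5}-\d{4}|\d{5}", text)   # order used above
short_first = re.findall(r"\d{5}|\d{5}-\d{4}", text)  # extension never reached
optional = re.findall(r"\d{5}(?:-\d{4})?", text)      # extension made optional

print(long_first)
print(short_first)
print(optional)
```

With the short alternative first, the five-digit part always wins and the -1245 and -0000 extensions are dropped.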
html = '''<font size=2>
<font size=2>
<font size = 2>
< font size=2 >
<font size = 2 >'''
Quiz – Which of the following regular expressions captures all of the above combinations ? Observe the spaces precisely.
- “<\s+font”
Exercise : Say we have a text with Canadian zip codes. The format for Canadian zip codes is
- A1A 1A1
where A represents an upper case letter and 1 represents any digit. There is a space at the 4th character.
text = '''M1R 0E9
M3C 0C1
M3C 0C2
M3C 0C3
M3C 0E3
M3C 0E4
M3C 0H9
M3C 0J1
1M1 A1A
11M 1A1
M11 A1A
M3C0J1
M3C JJ1'''
# Test - The last five elements should NOT match
Solution
matches = re.findall ("[A-Z]\d[A-Z] \d[A-Z]\d", text)
print ( matches )
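When each code sits on its own line, another option is to validate the lines one at a time with re.fullmatch ( ), which succeeds only if the whole string matches the pattern. The sketch below splits the same text into lines and sorts them into accepted and rejected lists; the list names are made up for this example.

```python
import re

text = """M1R 0E9
M3C 0C1
M3C 0C2
M3C 0C3
M3C 0E3
M3C 0E4
M3C 0H9
M3C 0J1
1M1 A1A
11M 1A1
M11 A1A
M3C0J1
M3C JJ1"""

pattern = re.compile(r"[A-Z]\d[A-Z] \d[A-Z]\d")

accepted = []
rejected = []
for line in text.splitlines():
    # fullmatch requires the entire line to fit the pattern
    if pattern.fullmatch(line):
        accepted.append(line)
    else:
        rejected.append(line)

print(accepted)
print(rejected)
```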
Log file analysis
Problem – Say there is a web server log file. Find out how many times the login page was successfully hit and how many times it failed. For now, we will work with a sample snippet from the file. We will work with the real file in the next challenge.
log = '''
10.128.2.1 [29/Nov/2017:06:58:55 GET /login.php HTTP/1.1 Status Code - 302
10.128.2.1 [29/Nov/2017:06:59:02 POST /process.php HTTP/1.1 Status Code - 302
10.128.2.1 [29/Nov/2017:06:59:03 GET /home.php HTTP/1.1 Status Code - 200
10.131.2.1 [29/Nov/2017:06:59:04 GET /js/vendor/moment.min.js HTTP/1.1 Status Code - 200
10.130.2.1 [29/Nov/2017:06:59:06 GET /bootstrap-3.3.7/js/bootstrap.js HTTP/1.1 Status Code - 200
10.130.2.1 [29/Nov/2017:06:59:19 GET /profile.php?user=bala HTTP/1.1 Status Code - 200
10.128.2.1 [29/Nov/2017:06:59:19 GET /js/jquery.min.js HTTP/1.1 Status Code - 200
10.131.2.1 [29/Nov/2017:06:59:19 GET /js/chart.min.js HTTP/1.1 Status Code - 200
10.131.2.1 [29/Nov/2017:06:59:30 GET /edit.php?name=bala HTTP/1.1 Status Code - 200
10.131.2.1 [29/Nov/2017:06:59:37 GET /logout.php HTTP/1.1 Status Code - 302
10.131.2.1 [29/Nov/2017:06:59:37 GET /login.php HTTP/1.1 Status Code - 200
10.130.2.1 [29/Nov/2017:07:00:19 GET /login.php HTTP/1.1 Status Code - 200
10.130.2.1 [29/Nov/2017:07:00:21 GET /login.php HTTP/1.1 Status Code - 200
10.130.2.1 [29/Nov/2017:13:31:27 GET / HTTP/1.1 Status Code - 302
10.130.2.1 [29/Nov/2017:13:31:28 GET /login.php HTTP/1.1 Status Code - 200
10.129.2.1 [29/Nov/2017:13:38:03 POST /process.php HTTP/1.1 Status Code - 302
10.131.0.1 [29/Nov/2017:13:38:04 GET /home.php HTTP/1.1 Status Code - 200'''
Solution
pattern = "(\d+\.\d+\.\d+\.\d+).*(login\.php)\s(HTTP).*-\s(\d{3})"
matches = re.findall (pattern, log)
print (matches)
count_200 = 0
count_not_200 = 0
for match in matches :
    # match[3] is the fourth group, the HTTP status code
    if match[3] == "200" :
        count_200 += 1
    else :
        count_not_200 += 1
success_perc = ( count_200 / (count_200 + count_not_200) ) * 100
print ( "Login page was successfully hit ", success_perc , "% of the time")
Learning
- (…) is used to represent groups in a regular expression
- There can be multiple groups in a single regular expression
- Each of the groups can be extracted out per each match of the regular expression
- . (dot) represents ANY character
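Indexing groups by position (match[3]) gets hard to read as patterns grow. Python’s re module also supports named groups with the (?P<name>…) syntax. The sketch below applies it to three lines taken from the sample log; the group names ip and status are chosen for this illustration.

```python
import re

log = """10.128.2.1 [29/Nov/2017:06:58:55 GET /login.php HTTP/1.1 Status Code - 302
10.131.2.1 [29/Nov/2017:06:59:37 GET /login.php HTTP/1.1 Status Code - 200
10.131.0.1 [29/Nov/2017:13:38:04 GET /home.php HTTP/1.1 Status Code - 200"""

# (?P<ip>...) and (?P<status>...) are named capturing groups.
pattern = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+).*/login\.php\sHTTP.*-\s(?P<status>\d{3})"
)

rows = [(m.group("ip"), m.group("status")) for m in pattern.finditer(log)]
successes = sum(1 for _, status in rows if status == "200")

print(rows)
print(successes, "of", len(rows), "login hits returned 200")
```

Since the dot does not match a newline by default, each match stays within a single log line.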
Challenge
Say there is a web server log file. Find out how many times the login page was successfully hit and how many times it failed. The file is available in the data directory. If the HTTP code ( at the end of each line in the log file ) is 200, the page was successfully rendered. Otherwise, it is a failure.
Solution
# read file
data = [] # will contain the log data as a list
with open ( "./data/log_file.txt", "r") as f :
    for line in f :
        data.append(line)
# print the read data
for line in data [0:5]:
    print ( line, end="")
# parse the data using regular expression and find matches for login.php
import re
login_data = []
pattern = "(\d+\.\d+\.\d+\.\d+).*(login\.php)\s(HTTP).*-\s(\d{3})"
for line in data :
    matches = re.findall (pattern, line)
    if len(matches) > 0 :
        # matches[0] is a tuple of the 4 groups : ip, page, protocol, status code
        login = []
        login.append(matches[0][0])
        login.append(matches[0][1])
        login.append(matches[0][2])
        login.append(matches[0][3])
        login_data.append(login)
# print a sample
for line in login_data[0:5]:
    print ( line)
# calculate the success ratio
count_200 = 0 # successful
count_not_200 = 0 # unsuccessful
for element in login_data :
    if element[3] == "200":
        count_200 += 1
    else :
        count_not_200 += 1
percentage_success = ( count_200 / (count_200 + count_not_200) ) * 100
print ( "Login page was successfully hit ", percentage_success, "% of the time")
text = '''Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining and big data.
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.'''
pattern = re.findall(".{10}\\s{2,}.{10}",text)
print ( pattern )
Quiz – The pattern above can be used to find out
- words that have 2 or more spaces in between them
- sentences that have 2 or more spaces in between them
Credit cards extractor
Problem – Find credit card numbers identified by their category (Visa, Master Card ). These credit card numbers follow a certain pattern. Use the following pattern to identify the category.
Patterns –
- All Visa card numbers start with a 4 and are either 13 or 16 digits long
- All Master card numbers start with 51 through 55 or 2221 through 2720 and have exactly 16 digits
- All Amex cards start with 34 or 37 and have exactly 15 digits
text = '''4018-2345-0303-0339,
5335-6092-0182-9739,
4076-2929-0000-2222,
3777-5074-1547-439,
5451-3970-1507-5040,
3425-2515-2514-202,
3752-2681-2429-924,
4004-4759-3761-924,
2228-2545-5555-2542,
2296-2542-2587-2555,
2321-2547-5145-2222,
2650-2545-2222-5555,
2706-2546-2589-2515,
2713-9874-5263-6253,
2720-2541-3256-6985
'''
Solution
Let’s see if the following solution works for Visa cards.
visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9][0-9]{3}?", text)
print ( visa_matches )
['4018-2345-0303-0339', '4076-2929-0000-2222']
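The last Visa card in the text (4004-4759-3761-924) is not being picked up. That is because a ? placed right after another quantifier such as {3} does not make those digits optional; it only makes that quantifier non-greedy. To make the last 3 digits optional, let’s wrap them in a group and apply the ? to the group.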
visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9]([0-9]{3})?", text)
print ( visa_matches )
['339', '222', '']
ooh.. this time it only picks up the group. But we wanted the entire number, right ? There are a couple of options.
- We can either put all of the remaining pattern also into groups.. like so
visa_matches = re.findall("(4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9])([0-9]{3})?", text)
print ( visa_matches )
- or, we can make the last element in the pattern a non-capturing group, meaning it will still be a group from a syntax perspective, but will not be captured as a group. To do that, we use ?:.
visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9](?:[0-9]{3})?", text)
print ( visa_matches )
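To compare the three variants on a small input, the sketch below uses one 16-digit number from the text and one hypothetical 13-digit number (4004-4759-3761-9, made up for this illustration). It shows what findall ( ) returns for the lazy quantifier, the capturing group and the non-capturing group.

```python
import re

# The second number is a hypothetical 13-digit card used only for this demo.
cards = "4018-2345-0303-0339, 4004-4759-3761-9"

# {3}? is a lazy "exactly 3", so the last 3 digits are still required.
lazy = re.findall(r"4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9][0-9]{3}?", cards)
# A capturing group makes findall return only the group's text.
capturing = re.findall(r"4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9]([0-9]{3})?", cards)
# A non-capturing group keeps the grouping but returns the whole match.
non_capturing = re.findall(r"4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9](?:[0-9]{3})?", cards)

print(lazy)
print(capturing)
print(non_capturing)
```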
Whenever we use ?: at the beginning of a group, the group is still used to match the pattern, but its contents are not captured separately. Now, let’s work on Master card.
Master card represents a different pattern. It has a pretty broad range of numbers – The beginning numbers start with
- 51 through 55 OR
- 2221 through 2720
The first one is easy enough. Let’s work on that first.
mc_matches = re.findall("5[1-5][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{4}", text)
print ( mc_matches )
The range 2221-2720 cannot be specified that easily. We need a different strategy for that. We can split this range as follows.
- 2221-2229
- 223x-229x ( 2230 to 2299 )
- 23xx-26xx ( 2300 to 2699 )
- 270x ( 2700 to 2709 )
- 271x ( 2710 to 2719 )
- 2720
We need to code all these patterns using an OR operator.
mc_matches = re.findall("222[1-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2228-2545-5555-2542']
mc_matches = re.findall("22[3-9][0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2296-2542-2587-2555']
mc_matches = re.findall("2[3-6][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2321-2547-5145-2222', '2650-2545-2222-5555']
mc_matches = re.findall("270[0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2706-2546-2589-2515']
mc_matches = re.findall("271[0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2713-9874-5263-6253']
mc_matches = re.findall("2720-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
['2720-2541-3256-6985']
In all these examples, only the first four digits differ. The pattern for the remaining 12 digits stays the same. So, let’s compress all of these into an OR-based pattern for the first 4 digits and keep the pattern for the remaining 12 digits constant.
mc_matches = re.findall ("(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|270[0-9]|271[0-9]|2720)-[0-9]{4}-[0-9]{4}-[0-9]{4}", text)
print ( mc_matches)
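['5335-6092-0182-9739', '5451-3970-1507-5040', '2228-2545-5555-2542', '2296-2542-2587-2555', '2321-2547-5145-2222', '2650-2545-2222-5555', '2706-2546-2589-2515', '2713-9874-5263-6253', '2720-2541-3256-6985']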
There we go – that should cover all the possible combinations of Master cards. Now, let’s move on to Amex. Amex has a really simple pattern –
- All Amex cards start with 34 or 37 and have exactly 15 digits
amex_matches = re.findall("3[47][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{3}", text)
print ( amex_matches )
['3777-5074-1547-439', '3425-2515-2514-202', '3752-2681-2429-924']
Learning
- (?:…) is used to represent non-capturing groups, meaning they will be used to match patterns, but the text matched within those parentheses will not be captured as a group. We have seen how this is useful in the case of the Visa pattern.
- Cycling through a range of numbers. We have seen how to cycle through a large range of numbers when we discussed the pattern for Master card.
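Splitting a numeric range into regex alternatives is easy to get wrong. A different technique, sketched here as an alternative and not as the approach used above, is to let the regex capture the first four digits and do the range check in plain Python. The helper is_mastercard_prefix and the extra number starting with 2721 are made up for this illustration.

```python
import re

# 2721-... is a hypothetical number just outside the range, added to probe the boundary.
text = """5335-6092-0182-9739,
4076-2929-0000-2222,
2228-2545-5555-2542,
2720-2541-3256-6985,
2721-2541-3256-6985,
3777-5074-1547-439"""

def is_mastercard_prefix(prefix):
    """Check the first four digits against 51xx-55xx or 2221-2720."""
    value = int(prefix)
    return 5100 <= value <= 5599 or 2221 <= value <= 2720

# Group 1 captures the first four digits of every 16-digit number.
sixteen_digit = re.finditer(r"(\d{4})-\d{4}-\d{4}-\d{4}", text)
by_range_check = [m.group(0) for m in sixteen_digit if is_mastercard_prefix(m.group(1))]

# The regex-only pattern from the text, for comparison.
regex_only = re.findall(
    r"(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|270[0-9]|271[0-9]|2720)"
    r"-[0-9]{4}-[0-9]{4}-[0-9]{4}",
    text,
)

print(by_range_check)
print(by_range_check == regex_only)
```

Comparing the two results on boundary values such as 2720 and 2721 is a cheap way to check that the split ranges were typed correctly.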
Word occurrences
Problem – Find all the occurrences of a word in a text and segregate them into 2 categories
- 1 – standalone occurrence of the word
- 2 – The word is part of another word.
For example, the word “bat” can occur in isolation, like in the sentence (“His cricket bat is awesome”), or as part of a different word, like in (“Aerial combat vs land-based combat”).
Solution – Finding the pattern is quite easy. However, the trick is to find out if the word occurred individually or is part of another word. word boundaries can help in this case. \b is used to specify a word boundary.
text = '''Python is a general purpose programming language. Python's design philosopy ...
Let's pythonify some of the code...
while python is a high level language...'''
matches = re.findall("\bPython\b", text)
print ( matches )
[]
But it is not working. Why ? That is because, in an ordinary Python string, \b is the escape sequence for backspace. When you write it in the pattern string, Python converts it to a backspace character before the regular expression engine ever sees it. To avoid this confusion, always use raw strings to define patterns. Raw strings are specified in Python by prefixing the string with an r. Let’s try this again.
matches = re.findall(r"\bPython\b", text)
print ( matches )
['Python', 'Python']
That’s better. However, there are 3 standalone occurrences of “Python”. Why is python in the last line not being picked up ? That is because regular expressions are case sensitive. The “p” in the word “python” in the last line is lower case. If you want to do a case-insensitive search, use flags. These can be specified as a third parameter in the findall ( ) function.
matches = re.findall(r"\bPython\b", text, re.IGNORECASE) # you can also use re.I as a shortcut
print ( matches )
['Python', 'Python', 'python']
Learning
- \b is used to specify a word boundary.
- Always prepend the pattern string with r to make it a raw string. This way, the characters in the pattern are taken literally.
- global flags can be used to alter the way regular expressions work. One of the global flags we have seen is re.IGNORECASE. It can be used to do a case insensitive search.
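The problem statement asked for two categories, and the solution above covers only the standalone one. One possible way to get the second category, sketched below under the assumption that “part of another word” means any longer word containing the target, is to match every word that contains it and filter out the exact matches. The sample sentence is stitched together from the examples in the problem statement plus a made-up ending.

```python
import re

# Sample text assembled for this illustration.
text = "His cricket bat is awesome. Aerial combat vs land-based combat. A bat flew over the battery."

# Category 1: the word on its own, delimited by word boundaries.
standalone = re.findall(r"\bbat\b", text, re.IGNORECASE)

# Every word that contains "bat" anywhere inside it.
containing = re.findall(r"\b\w*bat\w*\b", text, re.IGNORECASE)
# Category 2: keep only the words that are longer than "bat" itself.
embedded = [word for word in containing if word.lower() != "bat"]

print(standalone)
print(embedded)
```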
HTML tags
Problem – Find all the tags in an HTML or XML document.
For example, here is a small snippet of HTML. There are many tags like <html>, <head>, etc. We have to identify all the tags used in the following HTML.
import re
text = '''<html>
<head>
<title> What's in a title</title
</head>
<body>
<tr>
<td>text one </td>
<td>text two </td>
</tr>
</body>
</html>
'''
Solution
matches = re.findall(r"<([^>]*)>", text)
print ( matches )
['html', 'head', 'title', '/title\n </head', 'body', 'tr', 'td', '/td', 'td', '/td', '/tr', '/body', '/html']
This gives the start and end tags. Say, we don’t want the end tags.
matches = re.findall(r"<([^>/]*)>", text)
print ( matches )
['html', 'head', 'title', 'body', 'tr', 'td', 'td']
Another way to do it is to use non-greedy quantification. When you start a pattern with < and consume any character with .*, it consumes as much as it can, all the way to the last > on the line. That is why * and + are called greedy quantifiers. To reverse that behaviour, put a ? right after the quantifier. That way the * matches the least amount of text needed to satisfy the regular expression.
import re
matches = re.findall("<.*?>", text)
print ( matches )
['<html>', '<head>', '<title>', '</head>', '<body>', '<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</body>', '</html>']
Learning
- * and + are greedy quantifiers. They consume as much text as possible while still letting the pattern match. Adding a ? after them ( *? or +? ) makes them non-greedy.
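The difference is easiest to see on a single line that holds two tags. The sketch below runs the greedy form, the non-greedy form and the negated character set from the first solution on one row of the HTML snippet. The variable names are illustrative.

```python
import re

html = "<td>text one </td>"

greedy = re.findall(r"<.*>", html)      # runs to the LAST > on the line
lazy = re.findall(r"<.*?>", html)       # stops at the FIRST > it can
negated = re.findall(r"<[^>]*>", html)  # cannot cross a > by construction

print(greedy)
print(lazy)
print(negated)
```

Both the non-greedy and the negated-set forms isolate the individual tags. The negated set can also cross line breaks, which is why the first solution picked up the broken </title tag together with </head.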
Password verification
Problem – Verify that a password
- Has at least one upper case character
- Has at least one digit.
- Has at least one special character ( let’s limit special characters to @ , # , $ , % )
text = '''Aw@som$passw0rd
Awesomepassw0rd
Awesomepassword
Aw!som!passw0rd
aw!som!passw0rd'''
# All combinations except for the first one are invalid
Solution
This can be solved easily using plain Python string operations. However, we want a more concise solution using regular expressions. In these kinds of situations, we are looking for some kind of validation. Regular expression lookarounds are very useful in these cases. The syntax for a look ahead is (?=…) where … represents any regular expression. Let’s start with the first condition, at least one upper case character.
text = "aWesome"
matches = re.findall("[^A-Z]*[A-Z].*", text)
print(matches)
['aWesome']
text = "Awesome1"
matches = re.findall("[^A-Z]*[A-Z]\D*\d.*", text)
print(matches)
['Awesome1']
This works too. Now, let’s try a different combination – put the digit before the letters.
text = "1Awesome"
matches = re.findall("[^A-Z]*[A-Z]\D*\d.*", text)
print(matches)
[]
That failed. Why ? Because regular expressions consume text and move forward. So, the expression [^A-Z]*[A-Z] consumed all the text up to the capital letter, including the 1 at the beginning. And it is now looking for a digit at \d, which it cannot find after the capital letter. This is where lookarounds help.
text = "1Awesome"
matches = re.findall("(?=[^A-Z]*[A-Z])(?=\D*\d).*", text)
print(matches)
['1Awesome']
This time it works. The reason is that we have converted the upper case check and the digit search into look aheads, (?=[^A-Z]*[A-Z]) and (?=\D*\d). Both are tested from the same starting position, so the order of the digit and the capital letter no longer matters.
An important aspect of lookarounds ( look ahead or look behind ) is that they do not consume any characters. For example, look at the example below. We want to find all the words that are followed by a comma, but we don’t want to show the comma.
text = "Hi there, how are you doing ?"
# \b for word boundary
# \w+ for a word
#(?=,) will ensure that the word is followed by a comma
matches = re.findall(r"\b\w+(?=,)", text)
print ( matches )
['there']
See, the comma is not shown in the output. Granted, it is not a big deal. We can do that using groups. However, there are many situations (like the password example above) that cannot be handled using groups. That’s where lookarounds come in. Let’s continue the same example and find all the words preceded by a comma, this time using a look behind.
# (?<=,\s) => verify (assert) that before the word, there is a comma followed by a space
# \w+ is a word
matches = re.findall(r"(?<=,\s)\w+", text)
print ( matches )
['how']
Learning
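- There are 2 types of lookarounds – look ahead and look behind.
- (?=…) is used to do a look ahead search and (?<=…) is used to do a look behind search.
- Lookarounds are also called assertions. They check a condition without consuming any characters.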
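Putting the pieces together, the three password rules can each be written as a look ahead and chained at the start of one pattern. The sketch below is one possible way to finish the problem. It checks each password separately with fullmatch ( ), and the names rule and valid are made up for this illustration.

```python
import re

passwords = [
    "Aw@som$passw0rd",
    "Awesomepassw0rd",
    "Awesomepassword",
    "Aw!som!passw0rd",
    "aw!som!passw0rd",
]

rule = re.compile(
    r"(?=.*[A-Z])"   # at least one upper case character somewhere ahead
    r"(?=.*\d)"      # at least one digit somewhere ahead
    r"(?=.*[@#$%])"  # at least one of the allowed special characters
    r".+"            # then consume the whole password
)

valid = [p for p in passwords if rule.fullmatch(p)]
print(valid)
```

Each look ahead starts from the same position and consumes nothing, so the three requirements can appear in the password in any order.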
Regular expressions cheatsheet
Special Character | Description |
---|---|
. | Matches any character – except new line |
[XYZ] | Character set |
[^XYZ] | Negation of the Character set |
[A-Z] | Matches any single character in the range A to Z |
pipe | Logical OR |
\w | Matches any word character. Equivalent to [A-Za-z0-9_] |
\W | Negation of any word character. Equivalent to [^A-Za-z0-9_] |
\d | Matches any digit. Equivalent to [0-9] |
\D | Matches any non-digit. Equivalent to [^0-9] |
\s | Matches any whitespace character ( spaces, tabs or linebreaks ) |
\S | Matches any non-whitespace character |
^ | Matches beginning of the string ( beginning of each line with re.MULTILINE ) |
$ | Matches end of the string ( end of each line with re.MULTILINE ) |
\b | Word boundary |
\B | not a word boundary |
* | Zero or more |
+ | One or more |
? | Zero or one |
(XYZ) | Capturing group |
(?:XYZ) | non-capturing group |
(?=XYZ) | Positive lookahead |
(?!XYZ) | Negative lookahead |
(?<=XYZ) | Positive lookbehind |
(?<!XYZ) | Negative lookbehind |
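The ^ and $ anchors are the only entries in the table that have not been used in the problems above. A minimal sketch of how they behave in Python, with and without the re.MULTILINE flag, on a made-up list of numbers:

```python
import re

# Hypothetical three-line input for this illustration.
numbers = "408-202-2222\n800-203-2222\n877-202-4444"

# Without a flag, ^ and $ anchor to the whole string.
first_area_code = re.findall(r"^\d{3}", numbers)
last_four = re.findall(r"\d{4}$", numbers)

# With re.MULTILINE, they anchor to every line.
all_area_codes = re.findall(r"^\d{3}", numbers, re.MULTILINE)
all_last_fours = re.findall(r"\d{4}$", numbers, re.MULTILINE)

print(first_area_code, last_four)
print(all_area_codes, all_last_fours)
```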
Challenge
Extract data to JSON
Say, we have a bunch of cities along with their nick names in the following format in a text file. Extract each city and its nick names into JSON with the structure shown below.
cities = '''
1. Paris – The City of Love, The City of Light, La Ville-Lumiere
2. Prague – The City of Hundred Spires, The Golden City, The Mother of Cities
3. New York – The Big Apple
4. Las Vegas – Sin City'''
# required output format
{
"city_1" : ["nick name 1", "nick name 2"],
"city_2" : ["nick name 1", "nick name 2.."]
}
# import the file
with open("./data/cities.txt","r") as f :
    data = f.read()
import re
matches = re.findall("\d+\.\s+(\w+\s?\w+)\s+–\s+(.*)", data)
print ( matches[0:5])
import json
city_dict = {}
for city in matches :
    # split the nick names on commas and trim the surrounding spaces
    city_dict[city[0]] = [name.strip() for name in city[1].split(",")]
city_json = json.dumps(city_dict)
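For readers without the data file, here is a self-contained sketch that runs the same idea on the cities string shown above. It uses a slightly different pattern (a non-greedy .+? for the city name, which is an assumption of this sketch and not the pattern from the solution) and a dictionary comprehension to build the result.

```python
import json
import re

cities = """
1. Paris – The City of Love, The City of Light, La Ville-Lumiere
2. Prague – The City of Hundred Spires, The Golden City, The Mother of Cities
3. New York – The Big Apple
4. Las Vegas – Sin City"""

# Group 1: city name (as little text as possible before the dash separator).
# Group 2: the rest of the line, holding the comma-separated nick names.
pattern = r"\d+\.\s+(.+?)\s+–\s+(.*)"

city_dict = {
    city: [name.strip() for name in nick_names.split(",")]
    for city, nick_names in re.findall(pattern, cities)
}

# ensure_ascii=False keeps any non-ASCII characters readable in the output.
city_json = json.dumps(city_dict, indent=2, ensure_ascii=False)
print(city_json)
```

The separator in the sample is an en dash, not a plain hyphen, so the pattern has to use the same character.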