Explaining advanced concepts of regular expressions through practical recipes.
In this recipe we are going to cover:
- Negated character class
- Non-capturing group
- Non-greedy quantifier
- Python's `findall` and JavaScript's `exec`
Problem Description
We need to extract the HTML links, i.e. the anchor tags, from an HTML document. For each link we want the URL and the text description of that link.
The input is an HTML document, and the output should have the following format:

```
url,Text description
```

For example, the following input:
1<div class="portal" role="navigation" id='p-navigation'>
2<h3>Navigation</h3>
3<div class="body">
4<ul>
5 <li id="n-mainpage-description"><a href="/wiki/Main_Page"
6 title="Visit the main page [z]" accesskey="z">Main page</a></li>
7 <li id="n-contents"><a href="/wiki/Portal:Contents"
8 title="Guides to browsing Wikipedia">Contents</a></li>
9 <li id="n-featuredcontent"><a href="/wiki/Portal:Featured_content"
10 title="Featured content the best of Wikipedia">Featured content</a></li>
11<li id="n-currentevents"><a href="/wiki/Portal:Current_events"
12 title="Find background information on current events">Current events</a></li>
13<li id="n-randompage"><a href="/wiki/Special:Random"
14 title="Load a random article [x]" accesskey="x">Random article</a></li>
15<li id="n-sitesupport"><a href="//donate.wikimedia.org/wiki/Special:
16FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=
17C13_en.wikipedia.org&uselang=en" title="Support us">Donate to Wikipedia</a></li>
18</ul>
19</div>
20</div>
will have the following output:

```
/wiki/Main_Page,Main page
/wiki/Portal:Contents,Contents
/wiki/Portal:Featured_content,Featured content
/wiki/Portal:Current_events,Current events
/wiki/Special:Random,Random article
//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en,Donate to Wikipedia
```
Code
The final code in Python:
```python
import re

def extract_links(lines):
    # Returns one 'url,text description' string per link found in the input.
    pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
    r = re.compile(pat, re.M)
    return [','.join(j) for j in r.findall(lines)]
```
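Here is a small usage sketch that runs `extract_links` on a trimmed version of the example input above; the expected output is taken from the example output:

```python
# Usage sketch: extract_links on two <li> entries from the example input.
html = '''
<li id="n-mainpage-description"><a href="/wiki/Main_Page"
  title="Visit the main page [z]" accesskey="z">Main page</a></li>
<li id="n-contents"><a href="/wiki/Portal:Contents"
  title="Guides to browsing Wikipedia">Contents</a></li>
'''

for line in extract_links(html):
    print(line)
# Expected output:
# /wiki/Main_Page,Main page
# /wiki/Portal:Contents,Contents
```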
Code Description
Let us explain the code step by step, for Python:
- A link in HTML is written with the anchor tag `<a>`, and it ends with `</a>`, so the pattern will start with something like `r'<a - - -'`.
- Between the `a` tag name and the next HTML attribute there is at least one space: `r'<a\s+'`.
- The link tag contains many attributes, and we only care about the `href` attribute, so we need to select `href` as follows: `r'<a\s+href="'`.
- Note that `href` is not necessarily the first attribute; it can appear like this:

  ```html
  <a firstAttribute="somedata"
  href="/somelink/somefile.html"
  secondAttribute="someattr">Some Text here</a>
  ```

  For now the pattern is still `r'<a\s+href="'`; the surrounding attributes are handled below.
- We need to capture the data in that attribute, so we open a capturing group. The attribute value ends with a double quote `"`, so we can capture any character except `>` and `"`. We achieve that with a negated character class, written `[^...]`. The following pattern captures all the text inside the `href` attribute; notice as well how we end the attribute with the closing double quote: `r'<a\s+href="([^>"]*)"'` (see the first sketch after this list).
- The link tag contains other attributes that we are not interested in, so we skip over them using a non-capturing group. They can show up before or after the `href` attribute. The attributes are separated by spaces, so a blueprint of the pattern is `(?:[pattern for the other attributes]\s*)?`.
- The pattern to match the other attributes is any character that is not the end of the HTML tag `>`, so we use a negated character class again: `[^>]`. Used greedily it would swallow everything in the anchor tag, so we make it non-greedy: `r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"'` (the second sketch after this list shows the difference).
- The previous pattern matches the attributes before `href`. To also allow attributes after `href`, we add another `[^>]*`: `r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*'`.
- The opening link tag ends with `>`: `r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>'`.
- The next part captures the text description of the link. That text can include anything except the start of the closing tag `</a>`, i.e. anything except `<`: `r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>([^<]*)'`.
- And we finish with the closing tag `</a>`; matching up to `</` is enough: `r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>([^<]*)</'`.
- BUT WAIT. We are not done yet; there is one more small issue. Sometimes the link tag contains nested tags, as follows:

  ```html
  <a href="/somelink/somefile.html" ...>
    <img>...</img> <data></data>Some Text here</a>
  ```

  To handle a nested tag sitting before the link text, we add an optional non-capturing group: `(?:<[^>]*>)?` (see the last sketch after this list).

So at the end the final pattern will be as follows:

```python
pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
```
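Before moving on, here are a few small sketches that isolate the concepts used above. First, the negated character class from the `href` capturing step (the sample tag is made up for illustration):

```python
import re

# [^>"] accepts any character that is not '>' or '"', so the capturing
# group stops exactly at the closing double quote of the href value.
tag = '<a href="/wiki/Main_Page" title="Visit the main page">'
m = re.search(r'<a\s+href="([^>"]*)"', tag)
print(m.group(1))  # /wiki/Main_Page
```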
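Second, the non-greedy quantifier. The difference between the greedy `[^>]*` and the lazy `[^>]*?` only shows up when the tag contains something that looks like a second `href`; the `data-href` attribute below is made up purely for illustration:

```python
import re

tag = '<a href="/wiki/Main_Page" data-href="/decoy">'

# Lazy: the optional attribute group consumes as little as possible,
# so href=" is matched at its first occurrence.
lazy = re.search(r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"', tag)
print(lazy.group(1))    # /wiki/Main_Page

# Greedy: [^>]* first swallows everything up to '>', then backtracks,
# so href=" ends up matching inside data-href instead.
greedy = re.search(r'<a\s+(?:[^>]*\s*)?href="([^>"]*)"', tag)
print(greedy.group(1))  # /decoy
```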
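And finally, the nested-tag case: the extra `(?:<[^>]*>)?` group lets the final pattern skip a single nested tag that sits directly before the link text (the `<img>` markup below is made up for illustration):

```python
import re

pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'

tag = '<a href="/wiki/Main_Page" title="Main"> <img src="icon.png">Main page</a>'
print(re.findall(pat, tag))  # [('/wiki/Main_Page', 'Main page')]
```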
Python implementation
We use the `findall` method with the multiline flag (`re.M`). Since the pattern has more than one capturing group, `findall` returns a list in which each item is a tuple of the captured groups for one match.
```python
import re

r = re.compile(pat, re.M)
htmlInput = '<html> ... the html string </html>'
for j in r.findall(htmlInput):
    # j[0] is the url match, j[1] is the text description
    print(j[0], j[1])
```
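For a concrete (made-up) mini-input, reusing the compiled pattern `r` from above:

```python
sample = '<a href="/wiki/Main_Page" title="Visit the main page">Main page</a>'
print(r.findall(sample))
# [('/wiki/Main_Page', 'Main page')]
```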
For more about `findall`, check [my previous recipe]({% post_url 2020-05-08-Regular-expression-extract-domain-names %}), where I explained it in more detail.
List of posts
We are going to explain advanced concepts of regular expressions through different examples, across a series of posts. To see all the articles in this series, check here: