Regular expression: Extract HTML Links

Posted by Ghassan Karwchan on Fri, Sep 11, 2020

This series explains advanced Regular Expression concepts through practical recipes.
In this recipe we are going to cover:

  • Negated Character class
  • Non-capturing group
  • Non-greedy quantifier
  • Python’s findall, and JavaScript’s exec

Problem Description

We need to extract the HTML links, that is, the anchor tags, from an HTML document. For each link we want the URL and its text description.

The input will be an HTML document. The output we need has the following format:

url,Text description

For example, the following input:

<div class="portal" role="navigation" id='p-navigation'>
<h3>Navigation</h3>
<div class="body">
<ul>
 <li id="n-mainpage-description"><a href="/wiki/Main_Page"
  title="Visit the main page [z]" accesskey="z">Main page</a></li>
 <li id="n-contents"><a href="/wiki/Portal:Contents"
 title="Guides to browsing Wikipedia">Contents</a></li>
 <li id="n-featuredcontent"><a href="/wiki/Portal:Featured_content"
 title="Featured content  the best of Wikipedia">Featured content</a></li>
<li id="n-currentevents"><a href="/wiki/Portal:Current_events"
 title="Find background information on current events">Current events</a></li>
<li id="n-randompage"><a href="/wiki/Special:Random"
 title="Load a random article [x]" accesskey="x">Random article</a></li>
<li id="n-sitesupport"><a href="//donate.wikimedia.org/wiki/Special:
FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=
C13_en.wikipedia.org&uselang=en" title="Support us">Donate to Wikipedia</a></li>
</ul>
</div>
</div>

This will produce the following output:

/wiki/Main_Page,Main page
/wiki/Portal:Contents,Contents
/wiki/Portal:Featured_content,Featured content
/wiki/Portal:Current_events,Current events
/wiki/Special:Random,Random article
//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=\
donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en,Donate to Wikipedia

Code

The final code in Python:

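import re

def extract_links(lines):
    # Group 1 captures the href value, group 2 captures the link text.
    pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
    r = re.compile(pat, re.M)
    # findall returns one (url, text) tuple per match; join each as "url,text".
    return [','.join(j) for j in r.findall(lines)]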
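As a quick check, here is a minimal sketch that runs the function on a trimmed copy of the navigation HTML from the problem description. The `sample` string and the `PAT` constant name are assumptions made for this illustration; the pattern is the final one from this recipe.

```python
import re

# Final pattern from this recipe: group 1 is the href value, group 2 the link text.
PAT = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'

def extract_links(html):
    """Return one 'url,text' string per anchor tag found in html."""
    return [','.join(groups) for groups in re.findall(PAT, html)]

# Hypothetical sample: the first two links of the navigation block.
sample = '''<ul>
 <li id="n-mainpage-description"><a href="/wiki/Main_Page"
  title="Visit the main page [z]" accesskey="z">Main page</a></li>
 <li id="n-contents"><a href="/wiki/Portal:Contents"
 title="Guides to browsing Wikipedia">Contents</a></li>
</ul>'''

for line in extract_links(sample):
    print(line)
```

This should print `/wiki/Main_Page,Main page` followed by `/wiki/Portal:Contents,Contents`. Note that the attributes spread over two lines are still matched, because a negated character class such as `[^>]` also matches newline characters.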


Code Description

Let us explain the code.

For Python:

  1. An HTML link is an anchor tag: it starts with <a and ends with </a>, so the pattern will begin like this:

    r'<a - - -'
    
  2. Between the tag name a and the first attribute there is at least one whitespace character:

    r'<a\s+'
    
  3. The link tag can contain many attributes, but we only care about href, so we match that attribute as follows:

    r'<a\s+href="'
    
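  4. In general the tag has the following format, where href may be surrounded by other attributes:

    <a firstAttribute="somedata"
    href="/somelink/somefile.html"
    secondAttribute="someattr">Some Text here</a>
    

    For now we only match the start of the href attribute; the surrounding attributes are handled in the steps below:

    r'<a\s+href="'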
    
  5. We need to capture the value of that attribute, so we open a capturing group that ends where the attribute ends. As the attribute value is terminated by a double quote ", we capture any character except > and ". We can achieve that with a Negated Character Class, written [^...]. The following pattern captures all the text inside the href attribute. Notice as well how we close the attribute with the closing double quote.

    r'<a\s+href="([^>"]*)"'
    
  6. The link tag contains other attributes that we are not interested in, so we match them with a non-capturing group, which keeps them out of the results. They can show up before or after the href attribute.
    The attributes are separated by whitespace, so we can write a blueprint of the pattern as follows:

    (?:[here we put the pattern to match]\s*)?
    
  7. The pattern to match the other attributes is any character that is not the end of the HTML tag >, and we use a negated character class again: [^>]. A greedy [^>]* would first consume everything up to the end of the tag and only then backtrack to find href, so we use the non-greedy quantifier *?, which expands only as far as needed:

    r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"'
    
  8. The previous pattern matches the attributes before href. To match the attributes after href as well, we add:

    r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*'
    
  9. The opening link tag ends with >:

    r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>'
    
  10. The next part is to capture the text description of the link. We capture the text up to the start of the closing tag </a>, that is, any character except <.

    r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>([^<]*)'
    
  11. And we finish with the start of the closing tag </a>.

    r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>([^<]*)</'
    
  12. BUT WAIT. We are not done yet. There is one more small issue.
    Sometimes the link tag contains nested tags as follows:

    <a href="/somelink/somefile.html" ...>
      <img>...</img> <data></data>Some Text here</a>
    

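    To handle this we add an optional non-capturing group that skips a tag placed right after the opening tag. Note that it skips only a single tag; several nested tags, as in the example above, need a more elaborate pattern.

    (?:<[^>]*>)?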
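The steps above can be replayed one pattern at a time. The sketch below runs four of the intermediate patterns against one anchor tag to show what each addition contributes; the `tag` string and the step labels are made up for this illustration.

```python
import re

# Hypothetical anchor tag with one attribute before href and one after it.
tag = '<a class="nav" href="/wiki/Main_Page" title="Home">Main page</a>'

# Intermediate patterns, in the order the steps build them up.
steps = [
    ('href first only',        r'<a\s+href="([^>"]*)"'),
    ('attributes before href', r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"'),
    ('through the closing >',  r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>'),
    ('with the link text',     r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>([^<]*)</'),
]

for label, pattern in steps:
    m = re.search(pattern, tag)
    # groups() holds only the capturing groups; (?:...) groups are left out.
    print(label, '->', m.groups() if m else None)
```

The first pattern finds nothing here because `class` comes before `href`; the later ones capture the URL, and the last one also captures the link text.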
    

So the final pattern is as follows:

pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
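To see why the helper groups are non-capturing, the following sketch compares the final pattern with a variant where `(?:` is replaced by `(`. The variant and the `html` sample (a link with a single nested `<img>` tag) are assumptions made for this comparison only.

```python
import re

# Final pattern from the recipe: only the URL and the text are captured.
non_capturing = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'

# Hypothetical variant: same pattern, but the helper groups capture too.
capturing = r'<a\s+([^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(<[^>]*>)?([^<]*)</'

# Hypothetical sample: an attribute before href and one nested tag.
html = '<a id="logo" href="/home"><img src="logo.png">Home</a>'

with_non_capturing = re.findall(non_capturing, html)
with_capturing = re.findall(capturing, html)

print(with_non_capturing)  # two-item tuples: (url, text)
print(with_capturing)      # four-item tuples: helper groups are included
```

With the non-capturing groups each result is a clean (url, text) pair, so `','.join(...)` gives the required output directly. With capturing groups the extra attribute text and the nested tag would end up in every tuple and would have to be filtered out.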

Python implementation

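We use the findall method on the compiled pattern. The pattern is compiled with the MultiLine flag (re.M), although that flag only changes how ^ and $ behave, and this pattern uses neither. findall returns a list with one item per match, where each item is a tuple of the matched capture groups.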

import re

pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
r = re.compile(pat, re.M)
htmlInput = '<html> ... the html string </html>'
for j in r.findall(htmlInput):
    # j[0] is the url match
    # j[1] is the text description
    print(j[0], j[1])
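If you prefer to process matches one at a time, which is closer in spirit to a JavaScript exec loop, Python's `finditer` is an alternative to findall. A minimal sketch, assuming a small made-up `html_input` string:

```python
import re

pat = r'<a\s+(?:[^>]*?\s*)?href="([^>"]*)"\s*[^>]*>\s*(?:<[^>]*>)?([^<]*)</'
r = re.compile(pat)

# Hypothetical input with two links.
html_input = '<p><a href="/a">First</a> and <a class="x" href="/b">Second</a></p>'

# finditer yields match objects lazily instead of building a list of tuples.
for m in r.finditer(html_input):
    url, text = m.group(1), m.group(2)
    print(f'{url},{text}')
```

Each match object exposes the same two groups that findall returns as a tuple, plus extra details such as `m.start()` if the position of the link in the document is needed.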

For more about findall, check my previous recipe on extracting domain names (posted 2020-05-08), where I explained it in more detail.



List of posts

We are going to explain advanced Regular Expression concepts through different examples, across a series of posts. To see all articles in this series, check the list of posts.