Regular expression: Extract Domain Names

Posted by Ghassan Karwchan on Fri, May 8, 2020

This series explains advanced concepts of regular expressions through practical recipes.
In this recipe we are going to cover:

  • Anchors
  • Non-capturing group
  • Python’s findall, and JavaScript’s exec

Problem Description

HTML scraping, or web scraping, is widely used. We need to build a scraper that extracts the URLs in a web page and then extracts the domain names from those URLs.

An example of the data input

<div class="reflist" style="list-style-type: decimal;">
<ol class="references">
<li id="cite_note-1"><span class="mw-cite-backlink"><b> 
["Train (noun)"](http://www.askoxford.com/concise_oed/train?view=uk). 
<i>(definition – Compact OED)</i>. Oxford University Press
<span class="reference-accessdate">. 
.....
</ol>
</div>

The output we need is

askoxford.com;bnsf.com;hydrogencarsnow.com;mrvc.indianrail.gov.in;web.archive.org

The URLs come in various formats, and so do the domain names. Examples of URLs in the text are as follows:

http://www.domain.com
https://ww2.anotherdomain.com
https://mydomain.com

Code

The final code in JavaScript:

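function domainExtract(inputLines) {
  // The slashes in "://" must be escaped inside a regex literal.
  const exp = /\bhttps?:\/\/(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b/g
  const entries = inputLines.map(x => {
    let rslt
    const d = []
    // With the g flag, each exec call resumes where the previous match ended.
    while ((rslt = exp.exec(x)) !== null)
      d.push(rslt[1]) // rslt[1] is the captured domain
    return d
  }).reduce((a, b) => a.concat(b), [])
  // Deduplicate, sort, and join with semicolons.
  return Array.from(new Set(entries)).sort().join(';')
}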

And Python

import re

def extract_domains(lines):
    exp = r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'
    r = re.compile(exp)
    # findall returns the captured domain for every match in each line
    domains = ';'.join(sorted({f for s in lines for f in r.findall(s)}))
    return domains
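As a quick check, the function can be run against a few lines. The sketch below repeats the pattern so that it runs on its own; the sample lines are invented for illustration and are not the actual input of this recipe.

```python
import re

# Same pattern as above, repeated so the snippet is self-contained.
DOMAIN_RE = re.compile(r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b')

def extract_domains(lines):
    # findall returns only the text of the single capturing group (the domain).
    found = {domain for line in lines for domain in DOMAIN_RE.findall(line)}
    return ';'.join(sorted(found))

# Hypothetical input lines, made up for this example.
sample_lines = [
    '<a href="http://www.askoxford.com/concise_oed/train?view=uk">Train</a>',
    'See https://web.archive.org/web/2012/ and https://ww2.bnsf.com/about',
    'Duplicate: http://askoxford.com/other',
]

result = extract_domains(sample_lines)
print(result)  # askoxford.com;bnsf.com;web.archive.org
```

The duplicate askoxford.com entry collapses because the domains are collected in a set before sorting.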

Code Description

Let us explain the code.

For Python:

  1. We start by writing the pattern as a string with the r prefix, which makes it a raw string: Python keeps every backslash \ as a literal character instead of treating it as the start of an escape sequence, so the backslashes reach the regex engine intact.
    r'the pattern string'
    
  2. The URL can appear anywhere in the string, so instead of anchoring the pattern to the start or end of the line we use the word boundary anchor \b, which matches the position between a word character and a non-word character.
    r'\b pattern to match \b'
    
  3. Then we specify the URL scheme (http:// or https://), where the s is optional.
    r'\bhttps?://\b'
    
  4. Then we need to skip the www. or ww2. prefix without capturing it, so we wrap it in a non-capturing group, (?:...), and make the whole group optional with ?.
    r'\bhttps?://(?:www\.|ww2\.)?\b'
    
  5. Then we need to capture the rest of the host, because it contains the domain name, so we add a capturing group.
    r'\bhttps?://(?:www\.|ww2\.)?( pattern for domain )\b'
    
  6. A domain is made of several words separated by dots (.). Each word contains alphanumeric characters and can also contain dashes (-).
    // format of domain
    word.second-word.third-word.com
    
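  7. We use the shorthand character class \w, which matches alphanumeric characters plus the underscore, and we add the dash by putting both in a character class: [\w-]. A word followed by its dot, [\w-]+\., repeats one or more times ({1,}), and the final \w+ matches the last word, such as com.
    r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'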
    
  8. Notice that we had to put the word and its dot in a group in order to repeat them, and because we don’t need to capture that nested group, we used a non-capturing group.
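Two of the points above can be checked directly in the interpreter. The following is a small sketch, using a made-up URL, of what the r prefix changes and of how capturing and non-capturing groups differ in what they return.

```python
import re

# Without the r prefix Python turns \b into a single backspace character,
# so the regex engine would never see a word boundary anchor.
assert len('\b') == 1
assert len(r'\b') == 2

url = 'see https://www.my-site.example.org/page'  # hypothetical URL

# Non-capturing groups: only the domain is captured.
non_capturing = re.search(
    r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b', url)
print(non_capturing.groups())   # ('my-site.example.org',)

# The same pattern with plain parentheses also captures the prefix and the
# last repetition of "word.", which we would then have to skip over.
capturing = re.search(
    r'\bhttps?://(www\.|ww2\.)?(([\w-]+\.){1,}\w+)\b', url)
print(capturing.groups())       # ('www.', 'my-site.example.org', 'example.')
```

With the non-capturing version the domain is always group 1, which is what lets findall return plain strings in the final code.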

A word about Python and JavaScript implementation

We will cover more of the Python and JavaScript implementations in later posts, but for now we are going to talk about Python’s findall and JavaScript’s exec.

Python’s findall

Python has several ways to search for a match, including the methods search and match, but both return only one match at a time.
findall returns a list of all non-overlapping occurrences of a pattern.
For example:

import re

pattern = re.compile(r'\w+')
pattern.findall('Hello World')
  # output: ['Hello', 'World']
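To make the difference between the three methods concrete, here is a minimal sketch on an invented sample string: search stops at the first match anywhere in the string, match only tries the very beginning, and findall collects everything.

```python
import re

pattern = re.compile(r'W\w+')
text = 'Hello World, Wide Web'  # made-up sample text

first = pattern.search(text)        # first match anywhere in the string
at_start = pattern.match(text)      # only matches at position 0
everything = pattern.findall(text)  # all non-overlapping matches

print(first.group())  # World
print(at_start)       # None
print(everything)     # ['World', 'Wide', 'Web']
```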

If the pattern has more than one capturing group, findall returns a list of tuples.

pattern = re.compile(r'(\w+) (\w+)')
pattern.findall('Hello World!, Hello Tom!')
  # output: [('Hello', 'World'), ('Hello', 'Tom')]
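The shape of the findall result depends on how many capturing groups the pattern has, which is why the domain pattern in this recipe keeps exactly one. A short sketch of the three cases, using an invented sample string:

```python
import re

text = 'id=42, size=7'  # made-up sample text

# No capturing group: each item is the whole match.
no_group = re.findall(r'\w+=\d+', text)

# Exactly one capturing group: each item is that group's text only.
one_group = re.findall(r'\w+=(\d+)', text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(no_group)    # ['id=42', 'size=7']
print(one_group)   # ['42', '7']
print(two_groups)  # [('id', '42'), ('size', '7')]
```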

Another alternative is finditer, which returns an iterator in which each element is a match object that gives more information about each match.

pattern = re.compile(r'(\w+) (\w+)')
it = pattern.finditer('Hello World!, Hello Tom!')
match = next(it)  # Python 3: use next(it) instead of it.next()
match.groups()
  # output: ('Hello', 'World')
match.span()
  # output: (0, 11)
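In practice, finditer is usually consumed with a for loop or a comprehension rather than by calling next by hand. A minimal sketch that applies it to the domain pattern from this recipe; the sample text is hypothetical:

```python
import re

pattern = re.compile(r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b')
text = 'a http://www.domain.com b https://mydomain.com'  # invented sample

# Each match object exposes the captured domain and the position of the match.
hits = [(m.group(1), m.span()) for m in pattern.finditer(text)]

for domain, (start, end) in hits:
    print(domain, start, end)
# domain.com 2 23
# mydomain.com 26 46
```

Unlike findall, this keeps the position of every match, which helps when the surrounding text matters.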

JavaScript

JavaScript is a little bit tricky, and arguably it might have the worst implementation among many languages.
JavaScript didn’t have an equivalent of Python’s findall until very recently. That equivalent is the method String.prototype.matchAll, which is supported in Node 12 and in the very latest browsers.
If you need to work in Node before 12, or in slightly older browsers, then you have only one option for iterating through multiple matches.
JavaScript is funky because it offers several ways to get the first match, but to get all matches it forces you into one awkward way (before String.prototype.matchAll).
To get all matches with their capturing groups, you have to call the exec method of the regular expression object in a loop.
An example looks like this:

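// The g flag is required: without it, exec returns the first match every time and the loop never ends.
const exp = /\b(\w)\w+ ?/g
let rslt
while ((rslt = exp.exec(inputString)) !== null) {
  // the capturing group above is available in rslt[1]
  do_something_with_capturing_group_value(rslt[1])
}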

List of posts

We are going to explain advanced regular expression terms through different examples, across a series of posts. To see all articles in this series, check here:

Check all articles in this list.