Explain advanced concepts of Regular Expressions through practical recipes:
In this recipe we are going to cover:
- Anchors
- Non-capturing group
- Python’s findall, and JavaScript’s exec
Problem Description
HTML scraping, or web scraping, is widely used. We need to build a scraper that extracts the URLs in a web page, and then extracts the domain names from those URLs.
An example of the data input
<div class="reflist" style="list-style-type: decimal;">
<ol class="references">
<li id="cite_note-1"><span class="mw-cite-backlink"><b>
["Train (noun)"](http://www.askoxford.com/concise_oed/train?view=uk).
<i>(definition – Compact OED)</i>. Oxford University Press
<span class="reference-accessdate">.
.....
</ol>
</div>
The output we need is
askoxford.com;bnsf.com;hydrogencarsnow.com;mrvc.indianrail.gov.in;web.archive.org
The URLs come in several formats, and so do the domain names. Examples of URLs in the text:
http://www.domain.com
https://ww2.anotherdomain.com
https://mydomain.com
Code
The final code in JavaScript:
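function domainExtract(inputLines) {
  // Forward slashes must be escaped inside a regex literal.
  const exp = /\bhttps?:\/\/(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b/g
  const entries = inputLines.map(x => {
    let rslt
    const d = []
    // exec() returns the next match on each call, and null when there are no more,
    // which also resets exp.lastIndex for the next line.
    while ((rslt = exp.exec(x)) !== null)
      d.push(rslt[1])
    return d
  }).reduce((a, b) => a.concat(b), [])
  // Remove duplicates, sort, and join with semicolons.
  return Array.from(new Set(entries)).sort().join(';')
}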
And in Python:
import re

def extract_domains(lines):
    exp = r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'
    r = re.compile(exp, re.M)
    # findall returns only the captured group (the domain) for each match
    domains = ';'.join(sorted(set([f for s in lines for f in r.findall(s)])))
    return domains
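As a quick check, the function can be run on a few lines of input. This is only a sketch: the sample lines are made up for illustration (they reuse the URL formats listed earlier), and the function is repeated so that the snippet runs on its own.

```python
import re

def extract_domains(lines):
    # Same pattern as above: optional www./ww2. prefix, then the captured domain.
    exp = r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'
    r = re.compile(exp)
    return ';'.join(sorted({d for s in lines for d in r.findall(s)}))

# Hypothetical input lines, not taken from a real page.
sample_lines = [
    '<a href="http://www.domain.com/page">one</a>',
    '<a href="https://ww2.anotherdomain.com">two</a> <a href="https://mydomain.com/a/b">three</a>',
    '<a href="http://www.domain.com/other">same domain again</a>',
]

result = extract_domains(sample_lines)
print(result)  # anotherdomain.com;domain.com;mydomain.com
```

The repeated domain appears only once in the output because the matches are collected into a set before sorting.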
Code Description
Let us explain the code.
For Python:
- We start by writing the pattern in a string with the r prefix, which makes it a raw string: Python does not process the backslash \ as an escape character and treats it as a normal character, so it reaches the regex engine unchanged.
r'the pattern string'
- The URL can appear anywhere in the string, so the pattern must be able to match at any position. To do that we use special characters called anchors, and specifically the word boundary anchor: \b.
r'\b pattern to match \b'
- Then we specify the URL scheme part (http:// or https://), where the s is optional.
r'\bhttps?://\b'
- Then we need to ignore the www. or ww2. part, so we use a non-capturing group, written as (?:), and make it optional with ?.
r'\bhttps?://(?:www\.|ww2\.)?\b'
- Then we need to capture the rest of the text, because it contains the domain name, so we add a capturing group.
r'\bhttps?://(?:www\.|ww2\.)?( pattern for domain )\b'
- The domain consists of several words made of alphanumeric characters, which can also contain dashes (-), and the words are separated by dots (.).
// format of domain
word.second-word.third-word.com
- We use the shorthand character class \w, which matches alphanumeric characters plus the underscore, and we add the dash by putting both in a character class: [\w-].
r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'
- Notice that we had to put the word and its dot in a group so that it can be repeated, and because we don’t need to capture that nested group, we used a non-capturing group.
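The steps above can be tried one at a time. The following is a minimal sketch, assuming a made-up sample string (the hosts in it are placeholders, not from the original data); it shows how each addition to the pattern changes what findall returns.

```python
import re

# Hypothetical sample text for illustration only.
text = 'See http://www.my-site.example.com/path and https://example.org today.'

# Step 1: the scheme only; the "s" is optional.
step1 = re.findall(r'\bhttps?://', text)
print(step1)  # ['http://', 'https://']

# Step 2: with a capturing group, findall returns the group's text
# instead of the whole match (an empty string when the group did not take part).
step2 = re.findall(r'\bhttps?://(www\.|ww2\.)?', text)
print(step2)  # ['www.', '']

# Step 3: a non-capturing group matches the same text but is not reported
# separately, so findall returns the whole match again.
step3 = re.findall(r'\bhttps?://(?:www\.|ww2\.)?', text)
print(step3)  # ['http://www.', 'https://']

# Step 4: the full pattern captures only the domain.
full = r'\bhttps?://(?:www\.|ww2\.)?((?:[\w-]+\.){1,}\w+)\b'
step4 = re.findall(full, text)
print(step4)  # ['my-site.example.com', 'example.org']

# The leading \b requires the URL to start at a word boundary,
# so a scheme glued to preceding letters is not matched.
glued = re.findall(full, 'nothttp://hidden.com')
print(glued)  # []
```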
A word about Python and JavaScript implementation
We are going to cover more on Python and JavaScript implementation, but for now we are going to talk about Python’s findall, and JavaScript’s exec.
Python’s findall
Python has several ways to search for a match, including the search and match methods, but both return only one match at a time.
findall will return a list with all non-overlapping occurrences of a pattern.
The following example:
pattern = re.compile(r'\w+')
pattern.findall('Hello World')
# output: ['Hello', 'World']
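To see what "non-overlapping" means in practice, here is a small sketch with a made-up input string: the search resumes after the end of each match, so candidates that overlap an earlier match are skipped.

```python
import re

# 'aaaa' contains 'aa' at offsets 0, 1 and 2, but findall continues
# after the end of each match, so only offsets 0 and 2 are reported.
matches = re.findall(r'aa', 'aaaa')
print(matches)  # ['aa', 'aa']
```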
If you have more than one capturing group in the pattern, then it will return a list of tuples.
pattern = re.compile(r'(\w+) (\w+)')
pattern.findall('Hello World!, Hello Tom!')
# output: [('Hello', 'World'), ('Hello', 'Tom')]
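The case used in this recipe sits between the two examples: a pattern with exactly one capturing group. A short sketch, reusing the same sample string, shows that findall then returns a list of strings holding only that group, and that a non-capturing group does not count as a group.

```python
import re

text = 'Hello World!, Hello Tom!'

# No groups: each item is the whole match.
no_group = re.findall(r'\w+ \w+', text)
print(no_group)  # ['Hello World', 'Hello Tom']

# One capturing group: each item is only the group's text.
one_group = re.findall(r'\w+ (\w+)', text)
print(one_group)  # ['World', 'Tom']

# A non-capturing group is ignored when building the result,
# so this still behaves like the one-group case.
non_capturing = re.findall(r'(?:\w+) (\w+)', text)
print(non_capturing)  # ['World', 'Tom']
```

This is why the domain pattern wraps the www. prefix and the repeated word in (?:...): only the domain group ends up in the list.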
An alternative is finditer, which returns an iterator in which each element is a Match object that gives more information about each match.
pattern = re.compile(r'(\w+) (\w+)')
it = pattern.finditer('Hello World!, Hello Tom!')
match = next(it)  # in Python 3, use next(it) rather than it.next()
match.groups()
# output: ('Hello', 'World')
match.span()
# output: (0, 11)
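In practice the iterator is usually consumed with a for loop rather than by calling next by hand. A minimal sketch on the same sample string, collecting the groups and the position of every match:

```python
import re

pattern = re.compile(r'(\w+) (\w+)')

results = []
for m in pattern.finditer('Hello World!, Hello Tom!'):
    # m.groups() holds the captured texts, m.span() the (start, end) offsets.
    results.append((m.groups(), m.span()))

print(results)
# [(('Hello', 'World'), (0, 11)), (('Hello', 'Tom'), (14, 23))]
```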
JavaScript
JavaScript is a little bit tricky, and arguably it might be the worst implementation among many languages.
JavaScript didn’t have an equivalent of Python’s findall until very recently. That equivalent is the method String.prototype.matchAll, which is supported in Node 12 and in the latest browsers.
If you need to work with Node versions before 12, or with slightly older browsers, then you have only one option for iterating through multiple matches.
JavaScript is funky because it offers several ways to get the first match, but to get all matches it forces you into one awkward way (before String.prototype.matchAll).
To get all matches with their capturing groups, you have to call the exec method of the regular expression object repeatedly, with the g flag set on the expression.
An example looks like this:
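// the g flag is required: without it exec() always restarts from the beginning and the loop never ends
const exp = /\b(\w)\w+ ?/g
let rslt
while ((rslt = exp.exec(inputString)) !== null)
  // the capturing group above is available in rslt[1]
  do_something_with_capturing_group_value(rslt[1])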
List of posts
We are going to explain advanced Regular Expression terms through different examples, across a series of posts. To see all articles in this series check here: