Regular expression: Extract Comments From Code

Posted by Ghassan Karwchan on Thu, May 7, 2020

This series explains advanced concepts of regular expressions through practical recipes.
In this recipe we are going to cover:

  • Capturing Group
  • Negated Character Class
  • Greedy and non-greedy quantifiers

Problem Description

We need to parse a source file and extract the comments it contains.
A comment can be a single-line comment:

// this is a single line comment
x = 1; // a single line comment after code

Or a multi-line comment:

/* This is one way of writing comments */
/* This is a multiline
   comment. These can often
   be useful*/


The final code in JavaScript:

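function processData(inputText) {
    // Match either a // line comment or a /* ... */ block comment.
    var t = String.raw`(//[^\n]*|/\*[\s\S]*?\*/)`
    var ex = new RegExp(t, 'g')
    // match() returns null when nothing matches, so fall back to an empty array.
    var ar = (inputText.match(ex) || []).map(x =>
        // Strip leading whitespace from every line of a multi-line comment.
        x.split('\n').map(y => y.trimStart()).join('\n')
    )
    return ar
}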

And in Python:

import re

def extract_comments(txt):
    # Find every // line comment or /* ... */ block comment, then
    # strip leading whitespace from each line of each comment.
    comments = [ j.lstrip() for i in re.findall(r'(//[^\n]*|/\*.*?\*/)',
        txt, re.MULTILINE | re.DOTALL) for j in i.split('\n')]
    return '\n'.join(comments)
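
As a quick usage sketch, the function can be run on a small input. The `sample` string below is made up for this illustration, and the function is repeated so the snippet runs on its own:

```python
import re

def extract_comments(txt):
    # Same pattern as in the post: a // line comment or a /* ... */ block comment.
    comments = [j.lstrip() for i in re.findall(r'(//[^\n]*|/\*.*?\*/)',
        txt, re.MULTILINE | re.DOTALL) for j in i.split('\n')]
    return '\n'.join(comments)

# Hypothetical input, invented for this example.
sample = (
    "x = 1; // a single line comment after code\n"
    "/* This is a multiline\n"
    "   comment */\n"
    "y = 2;\n"
)

result = extract_comments(sample)
print(result)
# // a single line comment after code
# /* This is a multiline
# comment */
```

Note that the pattern knows nothing about string literals, so a `//` inside a quoted string (a URL, for instance) would also be reported as a comment.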

Code Description

Let us explain the code.

For Python:

  1. We start by writing the pattern as a string with the r prefix, which makes it a raw string: Python leaves the backslash \ as a literal character instead of treating it as the start of an escape sequence, so it reaches the regex engine unchanged.
    r'the pattern string'
  2. Because we want to extract the comments from the code, we create a capturing group. A capturing group stores the text it matches in a numbered group. We create a capturing group with ().
    r'(the pattern of comments to match)'
  3. We have two styles of comments to capture, single-line and multi-line, so we separate the two alternatives with |. Our regular expression therefore has this shape:
    r'(single line format | multi-line format)'
  4. The first comment style is the single-line comment, which starts with //.
  5. Then we match the rest of the characters up to the end of the line. To do that we use a negated character class: [^\n]*. A negated character class matches any single character that is not listed in it, so [^\n]* matches any run of characters other than the newline.
    r'(//[^\n]*| multi-line format)'
  6. That completes the first style of comment. Let us move to the second style, the multi-line comment.
  7. The multi-line comment starts with /*. We represent that with /\*. Notice the escape character \ before the star, so that the star is treated as a literal star and not as a quantifier.
  8. Next we need to match any character, including the newline, and for that we use the special character . (dot). By default the dot matches any character except the newline, but Python has an option that makes the dot match the newline as well: the re.DOTALL flag, passed to the matching function. (re.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern, which uses neither.)
    re.findall(r'(pattern with Dot)', txt, re.MULTILINE | re.DOTALL)
  9. To close the multi-line comment we add \*/ to the pattern, which makes the whole pattern:
    re.findall(r'(//[^\n]*|/\*.*\*/)', txt, re.MULTILINE | re.DOTALL)
  10. The previous pattern has a problem caused by the greedy quantifier. We discuss it in detail later; for now, the fix is to add ? after the star *:
    re.findall(r'(//[^\n]*|/\*.*?\*/)', txt, re.MULTILINE | re.DOTALL)
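
The steps above can be tried one piece at a time. The following is a minimal sketch; the `text` input and the variable names are invented for this illustration:

```python
import re

# Hypothetical input, invented for this example.
text = "x = 1; // set x\n/* first\n   block */\ny = 2;\n"

# Raw string: the backslash is kept as-is and handed to the regex engine.
raw = r'/\*'
print(raw == '/\\*')   # True: both are the three characters / \ *

# Negated character class: [^\n]* stops at the end of the line.
single = re.findall(r'(//[^\n]*)', text)

# Without re.DOTALL the dot cannot cross the newline inside the block comment.
no_dotall = re.findall(r'(/\*.*?\*/)', text)

# With re.DOTALL the dot also matches '\n'.
with_dotall = re.findall(r'(/\*.*?\*/)', text, re.DOTALL)

# Both alternatives inside one capturing group.
both = re.findall(r'(//[^\n]*|/\*.*?\*/)', text, re.DOTALL)

print(single)       # ['// set x']
print(no_dotall)    # []
print(with_dotall)  # ['/* first\n   block */']
print(both)         # ['// set x', '/* first\n   block */']
```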

Before we move on to the greedy quantifier, let us discuss the JavaScript code.

JavaScript Code

The JavaScript code is very similar to the Python code, with one important difference. JavaScript historically had no option to make the special character . (dot) match the newline the way re.DOTALL does in Python (the s flag was only added in ES2018), and the portable alternative is a character class that matches every character:

[\s\S]*?

The g flag makes match() return every match in the input rather than only the first one, which is what re.findall does in Python. It is not a counterpart of re.MULTILINE; that would be the m flag.

var t = String.raw`(//[^\n]*|/\*[\s\S]*?\*/)`
var ex = new RegExp(t, 'g')
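
The same [\s\S] trick also works in Python, which gives an easy way to compare it with re.DOTALL and to see what re.MULTILINE actually changes. This is a small sketch with made-up inputs:

```python
import re

# Hypothetical input, invented for this example.
text = "/* first\n   block */\nrest\n"

# [\s\S] matches any character, newline included, with no flags at all.
portable = re.findall(r'/\*[\s\S]*?\*/', text)

# The dot gives the same result once re.DOTALL is set.
dotall = re.findall(r'/\*.*?\*/', text, re.DOTALL)

print(portable == dotall)   # True

# re.MULTILINE only affects the anchors ^ and $:
# without it, ^ matches only at the very start of the string.
starts_default = re.findall(r'^\w+', "one\ntwo")
starts_multiline = re.findall(r'^\w+', "one\ntwo", re.MULTILINE)

print(starts_default)    # ['one']
print(starts_multiline)  # ['one', 'two']
```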

Greedy and non-greedy quantifier:

Here is an example where the greedy pattern /\*.*\*/ does not extract the comments correctly.

For the input text:

/* Iterate through the list till we encounter the last node.*/
    while(pointer->next!=NULL)
    {
            pointer = pointer -> next;
    }
    /* Allocate memory for the new node and put data in it.*/

The greedy pattern returns a single match that runs from the first /* to the last */, including all the code in between:

/* Iterate through the list till we encounter the last node.*/
    while(pointer->next!=NULL)
    {
            pointer = pointer -> next;
    }
    /* Allocate memory for the new node and put data in it.*/

To fix this problem, we add ? after the * to make the quantifier non-greedy (*?):

re.findall(r'(//[^\n]*|/\*.*?\*/)', txt, re.MULTILINE | re.DOTALL)
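
The difference can be checked directly on the input above. This is a minimal sketch; only the multi-line alternative of the pattern is used, and the variable names are chosen for this illustration:

```python
import re

code = (
    "/* Iterate through the list till we encounter the last node.*/\n"
    "    while(pointer->next!=NULL)\n"
    "    {\n"
    "            pointer = pointer -> next;\n"
    "    }\n"
    "    /* Allocate memory for the new node and put data in it.*/\n"
)

# Greedy: .* runs to the last */ in the input, so the code is swallowed too.
greedy = re.findall(r'(/\*.*\*/)', code, re.DOTALL)

# Non-greedy: .*? stops at the first */ it can reach.
lazy = re.findall(r'(/\*.*?\*/)', code, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2
for comment in lazy:
    print(comment)
```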

To explain more:
By default, a quantifier tells the engine to match as many instances of its quantified token or sub-pattern as possible. This behavior is called greedy.
As an example, consider:

var rg = /.*apple/g
var input = 'a tasty apple'

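Here .* is greedy, so it first swallows every character up to the end of the input, leaving nothing for apple. The engine then backtracks, giving characters back one at a time until apple can match, so the pattern does match, and the match is the whole string 'a tasty apple'. In general, a greedy quantifier yields the longest match that still lets the rest of the pattern succeed, which is why /\*.*\*/ runs all the way to the last */ in the input. The non-greedy form *? does the opposite: it takes as few characters as possible and extends only when the rest of the pattern fails.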
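
A small Python sketch of greedy versus non-greedy matching on similar input. The second string is invented for this illustration so that the two forms give different results:

```python
import re

# The input from the example: the greedy .* takes everything, then gives
# characters back until "apple" fits, so the match is the whole string.
whole = re.search(r'.*apple', 'a tasty apple').group()
print(whole)   # a tasty apple

# Hypothetical input with two occurrences of "apple".
text = 'a tasty apple, then another apple'

# Greedy: the longest match, ending at the last "apple".
greedy = re.search(r'.*apple', text).group()

# Non-greedy: the shortest match, ending at the first "apple".
lazy = re.search(r'.*?apple', text).group()

print(greedy)  # a tasty apple, then another apple
print(lazy)    # a tasty apple
```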

List of posts

This post is part of a series that explains advanced regular-expression concepts through practical examples. To see all the articles in the series, check the list of posts.