Regular Expressions revisited.

Well, it’s time to visit Regex land again. Thanks to Mustafa for giving me the push I needed to continue our journey into Regular Expressions. The following regular expression is one that I find handy from time to time for identifying pieces of text that occur near each other. For instance, suppose I want to find all of the instances of time in this paragraph where there is another mention of time within 10 words. This expression lets me do that easily, and would be called as MatchCollection coll = FindNear(<<paragraph text>>, "Time", "Time", 1, 10).

One little feature of this is that you can also set the minimum number of words that must separate the two pieces of text.

Anyway – without further ado, here’s the FindNear function.

/// <summary>
/// Finds instances of a particular piece of text near other text.
/// For instance, you can find Dr near Who when the words occur close to each other.
/// This is achieved by constraining the number of words allowed between the two
/// instances.
/// </summary>
/// <param name="text">The text to search.</param>
/// <param name="findText">The text to find.</param>
/// <param name="nearText">The text to find the text near.</param>
/// <param name="minWords">The minimum number of words the two words can be apart.</param>
/// <param name="maxWords">The maximum number of words the two words can be apart.</param>
/// <returns>A match collection containing the find results.</returns>
public static MatchCollection FindNear(string text, 
    string findText, 
    string nearText, 
    int minWords, 
    int maxWords)
{
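    if (string.IsNullOrEmpty(text))
        throw new ArgumentNullException("text");
    if (string.IsNullOrEmpty(findText))
        throw new ArgumentNullException("findText");
    if (string.IsNullOrEmpty(nearText))
        throw new ArgumentNullException("nearText");
    // A negative minimum would produce an invalid {min,max} quantifier.
    if (minWords < 0 || minWords > maxWords)
        throw new ArgumentOutOfRangeException("minWords");
    if (maxWords == 0)
        throw new ArgumentOutOfRangeException("maxWords");
    // Escape the search terms so characters such as '.' or '+' are matched literally.
    // The lazy {min,max}? group skips between minWords and maxWords intervening words.
    string reg = @"\b" + Regex.Escape(findText) +
                    @"\W+(?:\w+\W+){" +
                    minWords +
                    "," +
                    maxWords + "}?" +
                    Regex.Escape(nearText) + @"\b";
    // Multiline is not needed (the pattern has no ^ or $ anchors), and
    // IgnorePatternWhitespace would strip unescaped spaces from the pattern,
    // so only IgnoreCase is used.
    Regex regex = new Regex(reg, RegexOptions.IgnoreCase);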
    return regex.Matches(text);
}
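
One thing to watch when search terms are concatenated into a pattern like this: characters such as . or + in the terms carry regex meaning unless they are escaped. The following is a minimal sketch of what Regex.Escape does about that. The EscapeDemo class, its MatchesLiteral helper, and the sample strings are illustrative assumptions, not part of the function above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true when 'term' occurs in 'input' as literal text.
    public static bool MatchesLiteral(string input, string term)
    {
        return Regex.IsMatch(input, Regex.Escape(term));
    }

    public static void Main()
    {
        // Unescaped, the '.' in "Dr." matches any character, so "Dry" is a hit.
        Console.WriteLine(Regex.IsMatch("Dry run", "Dr."));   // True
        // Escaped, only a literal "Dr." matches.
        Console.WriteLine(MatchesLiteral("Dry run", "Dr."));  // False
        Console.WriteLine(MatchesLiteral("Dr. Who", "Dr."));  // True
        // Quantifier characters are escaped as well.
        Console.WriteLine(Regex.Escape("C++"));               // C\+\+
    }
}
```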

So, how does it work? Well, for the example call above, FindNear builds up the following regular expression:

\bTime\W+(?:\w+\W+){1,10}?Time\b

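The “magic” part is the {1,10}? quantifier on the group (?:\w+\W+), which tells the expression how many words can sit between the two words you are searching for. In this case it’s from 1 to 10 words. The trailing ? makes the quantifier lazy, so the match stops at the nearest following occurrence rather than the furthest one within range.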
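
To see the effect of the bounds, here is a small sketch that runs the pattern directly against a sample sentence. The NearDemo class, its CountNear helper, and the sentence are made up for illustration. Only the pattern itself comes from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NearDemo
{
    // Hypothetical helper: counts "time ... time" matches with between
    // min and max words separating the two occurrences.
    public static int CountNear(string text, int min, int max)
    {
        string pattern = @"\bTime\W+(?:\w+\W+){" + min + "," + max + @"}?Time\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        const string text = "Time after time, the time it takes is time well spent.";

        // With {1,10}? the lazy quantifier stops at the nearest following "time".
        foreach (Match m in Regex.Matches(text,
            @"\bTime\W+(?:\w+\W+){1,10}?Time\b", RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Value);
        }
        // Prints:
        //   Time after time
        //   time it takes is time

        // Raising the minimum to 2 rules out "Time after time" (one word between).
        Console.WriteLine(CountNear(text, 2, 10)); // 1
    }
}
```

Matches do not overlap, so once a "time" has been consumed as the end of one match it cannot start the next one.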

I haven’t forgotten the first part of the series on regular expressions – we’ll come back to that one in the next installment when I cover a different way to handle dates. In the meantime, have fun playing around with this regular expression.

2 thoughts on “Regular Expressions revisited.”

  1. Pete, this stuff is treasure, particularly for those who work with text all the time, such as yours truly. I’d love to see you expand more on the subject, developing it into a “proper” article and publishing that either here or on CP.

    I could also read “pushy bugger” between the lines 😀

  2. peteohanlon

    Mustafa, thanks for that. The ultimate aim for this was to post an article of “handy” regexes that I’ve used or would like to use. Unfortunately, time has prevented this so far, but I’m not ruling out turning this into an article at some point in the future.
