Categories
Tips

Hacking lookahead to mimic intersection, subtraction and negation

Note: To understand the following, I expect you to know how regex lookahead works. If you don’t, read about it first and return here after you’re done.

I was quite excited to discover this, but to my dismay, Steven Levithan assured me it’s actually well known. However, I felt it’s so useful and underdocumented (the only references to the technique I could find was several StackOverflow replies) that I decided to blog about it anyway.

If you’ve been using regular expressions for a while, you surely have stumbled on a variation of the following problems:

  • Intersection: “I want to match something that matches pattern A AND pattern B”
    Example: A password of at least 6 characters that contains at least one digit, at least one letter and at least one symbol
  • Subtraction: “I want to match something that matches pattern A but NOT pattern B”
    Example: Match any integer that is not divisible by 50
  • Negation: “I want to match anything that does NOT match pattern A”
    Example: Match anything that does NOT contain the string “foo”

Even though in ECMAScript we can use the caret (^) to negate a character class, we cannot negate anything else. Furthermore, even though we have the pipe character to mean OR, we have nothing that means AND. And of course, we have nothing that means “except” (subtraction). All these are fairly easy to do for single characters, through character classes, but not for entire sequences.

However, we can mimic all three operations by taking advantage of the fact that lookahead is zero length and therefore does not advance the matching position. We can just keep on matching to our heart’s content after it, and it will be matched against the same substring, since the lookahead itself consumed no characters. For a simple example, the regex /^(?!cat)\w{3}$/ will match any 3 letter word that is NOT “cat”. This was a very simple example of subtraction. Similarly, the solution to the subtraction problem above would look like /^(?!\d+[50]0)\d+$/.

For intersection (AND), we can just chain multiple positive lookaheads, and put the last pattern as the one that actually captures (if everything is within a lookahead, you’ll still get the same boolean result, but not the right matches). For example, the solution to the password problem above would look like /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i. Note that if you want to support IE8, you have to take this bug into account and modify the pattern accordingly.

Negation is the simplest: We just want a negative lookahead and a .+ to match anything (as long as it passes the lookahead test). For example, the solution to the negation problem above would look like /^(?!.*foo).+$/. Admittedly, negation is also the least useful on its own.

There are caveats to this technique, usually related to what actually matches in the end (make sure your actual capturing pattern, outside the lookaheads, captures the entire thing you’re interested in!), but it’s fairly simple for just testing whether something matches.

Steven Levithan took lookahead hacking to the next level, by using something similar to mimic conditionals and atomic groups. Mind = blown.

Categories
Original Tips

Create complex RegExps more easily

When I was writing my linear-gradient() to -webkit-gradient() converter, I knew in advance that I would have to use a quite large regular expression to validate and parse the input. Such a regex would be incredibly hard to read and fix potential issues, so I tried to find a way to cut the process down in reusable parts.

Turns out JavaScript regular expression objects have a .source property that can be used in the RegExp constructor to create a new RegExp out of another one. So I wrote a new function that takes a string with identifiers for regexp replacements in {{ and }} and replaces them with the corresponding sub-regexps, taken from an object literal as a second argument:

/**
 * Create complex regexps in an easy to read way
 * @param str {String} Final regex with {{id}} for replacements
 * @param replacements {Object} Object with the replacements
 * @param flags {String} Just like the flags argument in the RegExp constructor
 */
RegExp.create = function(str, replacements, flags) {
	for(var id in replacements) {
		var replacement = replacements[id],
			idRegExp = RegExp('{{' + id + '}}', 'gi');

		if(replacement.source) {
			replacement = replacement.source.replace(/^\^|\$$/g, '');
		}

		// Don't add extra parentheses if they already exist
		str = str.replace(RegExp('\\(' + idRegExp.source + '\\)', 'gi'), '(' + replacement + ')');

		str = str.replace(idRegExp, '(?:' + replacement + ')');
	}

	return RegExp(str, flags);
};

If you don’t like adding a function to the RegExp object, you can name it however you want. Here’s how I used it for my linear-gradient() parser:

self.regex = {};

self.regex.number = /^-?[0-9]*\.?[0-9]+$/;
self.regex.keyword = /^(?:top\s+|bottom\s+)?(?:right|left)|(?:right\s+|left\s+)?(?:top|bottom)$/;

self.regex.direction = RegExp.create('^(?:{{keyword}}|{{number}}deg|0)$', {
	keyword: self.regex.keyword,
	number: self.regex.number 
});

self.regex.color = RegExp.create('(?:{{keyword}}|{{func}}|{{hex}})', {
	keyword: /^(?:red|tan|grey|gray|lime|navy|blue|teal|aqua|cyan|gold|peru|pink|plum|snow|[a-z]{5,20})$/,
	func: RegExp.create('^(?:rgb|hsl)a?\\((?:\\s*{{number}}%?\\s*,?\\s*){3,4}\\)$', {
		number: self.regex.number
	}),
	hex: /^#(?:[0-9a-f]{1,2}){3}$/
});

self.regex.percentage = RegExp.create('^(?:{{number}}%|0)$', {
	number: self.regex.number
});

self.regex.length = RegExp.create('{{number}}{{unit}}|0', {
	number: self.regex.number,
	unit: /%|px|mm|cm|in|em|rem|en|ex|ch|vm|vw|vh/
});

self.regex.colorStop = RegExp.create('{{color}}\\s*{{length}}?', {
	color: self.regex.color,
	length: self.regex.length
}, 'g');

self.regex.linearGradient = RegExp.create('^linear-gradient\\(\\s*(?:({{direction}})\\s*,)?\\s*({{colorstop}}\\s*(?:,\\s*{{colorstop}}\\s*)+)\\)$', {
	direction: self.regex.direction,
	colorStop: self.regex.colorStop
}, 'i');

(self in this case was a local variable, not the window object)