Hacking lookahead to mimic intersection, subtraction and negation

Note: To understand the following, I expect you to know how regex lookahead works. If you don’t, read about it first and return here after you’re done.

I was quite excited to discover this, but to my dismay, Steven Levithan assured me it’s actually well known. However, I felt it’s so useful and underdocumented (the only references to the technique I could find was several StackOverflow replies) that I decided to blog about it anyway.

If you’ve been using regular expressions for a while, you surely have stumbled on a variation of the following problems:

  • Intersection: “I want to match something that matches pattern A AND pattern B”
    Example: A password of at least 6 characters that contains at least one digit, at least one letter and at least one symbol
  • Subtraction: “I want to match something that matches pattern A but NOT pattern B”
    Example: Match any integer that is not divisible by 50
  • Negation: “I want to match anything that does NOT match pattern A”
    Example: Match anything that does NOT contain the string “foo”

Even though in ECMAScript we can use the caret (^) to negate a character class, we cannot negate anything else. Furthermore, even though we have the pipe character to mean OR, we have nothing that means AND. And of course, we have nothing that means “except” (subtraction). All these are fairly easy to do for single characters, through character classes, but not for entire sequences.

However, we can mimic all three operations by taking advantage of the fact that lookahead is zero length and therefore does not advance the matching position. We can just keep on matching to our heart’s content after it, and it will be matched against the same substring, since the lookahead itself consumed no characters. For a simple example, the regex /^(?!cat)\w{3}$/ will match any 3 letter word that is NOT “cat”. This was a very simple example of subtraction. Similarly, the solution to the subtraction problem above would look like /^(?!\d+[50]0)\d+$/.

For intersection (AND), we can just chain multiple positive lookaheads, and put the last pattern as the one that actually captures (if everything is within a lookahead, you’ll still get the same boolean result, but not the right matches). For example, the solution to the password problem above would look like /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i. Note that if you want to support IE8, you have to take this bug into account and modify the pattern accordingly.

Negation is the simplest: We just want a negative lookahead and a .+ to match anything (as long as it passes the lookahead test). For example, the solution to the negation problem above would look like /^(?!.*foo).+$/. Admittedly, negation is also the least useful on its own.

There are caveats to this technique, usually related to what actually matches in the end (make sure your actual capturing pattern, outside the lookaheads, captures the entire thing you’re interested in!), but it’s fairly simple for just testing whether something matches.

Steven Levithan took lookahead hacking to the next level, by using something similar to mimic conditionals and atomic groups. Mind = blown.

  • http://twitter.com/padolsey James Padolsey

    That intersection technique is cool but I’m not sure I’d feel comfortable using it. One, because of the IE bug, and two, because I think it lacks readability for anyone who’s not familiar with that exact technique. I really like it though, and it does seem to perform better in most browsers (http://jsperf.com/regex-lookahead-vs-multiple-regex).

    Regex rocks.

  • http://twitter.com/_davidhiggins_ ✱ David ✱
  • http://twitter.com/skylerbrungardt Skyler Brungardt

    Wicked post, and good explanation. Thanks very much for this! :-D

  • Gaelan

    Any way to do an or?

  • http://twitter.com/mdeltito mdeltito

    Minor correction – the subtraction solution (match any integer not divisible by 50) is incorrect. Perhaps the {3} was left in by mistake, but the example only works for 3 digit integers.

    • http://leaverou.me Lea Verou

      Thanks, fixed! It originally was for 3 digit integers, and then I forgot to change it.

      • http://rob.lekensteyn.nl/ Rob W

        50 and 0 are both integers divisable by 50, yet they still are still matched by the current pattern. To exclude 50, d+ needs to be changed to d*. And to exclude zero, another negated look-ahead needs to be added, so the full pattern becomes /^(?!0$)(?!d*[50]0$)d+$/.
        Instead of making the pattern more complex, an additional constraint “at least 100″ can be used to keep the example simple.

  • Corrine

    Just attended your talk at Fluent. Thanks for creating the playground!!! I only recently discovered regular expressions when a co-worker did some magic with what looks like hieroglyphics to me and have a minimal understanding. Wonder if you could suggest learning site/books/resources

    • http://leaverou.me Lea Verou

      Hi Corrine,

      I’d suggest the books “Mastering regular expressions” and “Regular expressions cookbook”, both by O’Reilly.

      Also http://www.regular-expressions.info/ is a great online resource.

      • Corrine

        Hi Lea – sorry for the delay in responding. Thanks for the resources – whoo hoo!

  • Gaelan

    Why no posts recently?

  • Anonymous

    Nice post! I love playing with regexes :)

    Never used the “negation” pattern before, but I think it should end with a “.*”, as in /^(?!.*foo).*$/ so that it matches the empty string