Search secrets: searching like a pro with regular expressions

Well known to computer programmers, regular expressions (“regex” or “regexp” to insiders) are also a secret search weapon of librarians around the globe. A regular expression is simply a text pattern that can be used to find matching text strings. Regular expressions use wildcards and special shorthand notations to describe these patterns. Regular expressions are not available in most search engines, but they are part of Exalead’s Advanced Search options (which is one reason hard-core info-geeks are so fond of Exalead!).

What does a regular expression look like? Let’s look at an example using a period (“.”), the regular expression wildcard that stands for any single character (a letter, in the examples here). If you wanted to use this wildcard within a regular expression in the Exalead engine, you would first frame your query with forward-slash marks “/” to indicate it’s a regular expression, then place the period wherever you wanted a single character to vary. Thus, the regular expression “/c.p/” would return matches where the “.” is replaced by exactly one character, as in “cop,” “cup” and “cap”.
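To see the single-character wildcard in action outside a search engine, here is a minimal sketch using Python's re module as a stand-in. The word list is made up for illustration, and matching each word in full is an assumption about how a word-oriented engine applies the pattern; Exalead's own engine may behave differently, for instance in which characters the period is allowed to match.

```python
import re

# The slashes that frame an Exalead regex query are not part of the pattern,
# so only the text between them is given to Python.
pattern = re.compile(r"c.p")

# Hypothetical word list, chosen only for illustration.
words = ["cop", "cup", "cap", "c3p", "c-p", "cp", "coop"]

# fullmatch requires the whole word to fit the pattern (an assumption that
# mimics word-by-word matching in a search index).
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['cop', 'cup', 'cap', 'c3p', 'c-p']
```

In Python the period also accepts the digit and the hyphen, while “cp” and “coop” are rejected because the period stands for exactly one character.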

Now one would be hard pressed to imagine a practical reason for running a search that would return both “cop” and “cup,” but using regular expressions to search for potentially misspelled proper names, product codes or technical terms can be very handy.

Imagine for instance you’re doing some research on Exalead. To make sure you haven’t missed an important document in which Exalead has been misspelled, you might try something like “/ex.lead/” to catch variants such as “exelead” or “exilead”.

You could also try “/exa*lead/”, with the asterisk (“*”) being a regex quantifier that indicates the preceding character can appear zero or more times. A search on “/exa*lead/” would therefore return variants like “exlead” (no “a” at all), “exalead”, “exaalead” and “exaaalead”.
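The difference between the two patterns is easy to check with a short sketch. Python's re module is used here only as a stand-in for the search engine, and the list of candidate spellings is invented for the example.

```python
import re

# Hypothetical spellings, chosen only for illustration.
candidates = ["exlead", "exalead", "exaalead", "exaaalead",
              "exelead", "exilead", "exxalead"]

dot = re.compile(r"ex.lead")    # exactly one character between "ex" and "lead"
star = re.compile(r"exa*lead")  # zero or more "a" between "ex" and "lead"

# fullmatch compares the whole word against the pattern.
dot_hits = [w for w in candidates if dot.fullmatch(w)]
star_hits = [w for w in candidates if star.fullmatch(w)]

print(dot_hits)   # ['exalead', 'exelead', 'exilead']
print(star_hits)  # ['exlead', 'exalead', 'exaalead', 'exaaalead']
```

The period catches a wrong letter in one position, while the asterisk catches a missing or repeated “a”; neither pattern catches “exxalead”.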

If you wanted to exclude documents in which Exalead was correctly spelled, you could simply add “-exalead” to your query, i.e. “/exa*lead/ -exalead”, returning only matches like “exlead”, “exaalead” and “exaaalead”. (The minus sign is an Exalead Advanced Search option that lets you exclude words from the results for any query. Looking for company names containing “Einstein” but have no time to wade through a zillion articles on Albert Einstein? Try “einstein -albert”!).
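A rough sketch of how such a combined query filters documents is shown below. It assumes, as the paragraph above describes, that the minus sign drops every document containing the excluded word. The documents, the matches_query helper and the way text is split into words are all hypothetical, and Python's re module stands in for the search engine.

```python
import re

pattern = re.compile(r"exa*lead")  # the regex part of the query
excluded = "exalead"               # the word after the minus sign

# Hypothetical documents, chosen only for illustration.
docs = {
    "doc1": "a review of the exalead search engine",
    "doc2": "has anyone tried exaalead yet",
    "doc3": "exalead, sometimes misspelled exlead",
    "doc4": "nothing relevant here",
}

def matches_query(text):
    """Return True if some word fits the regex and the excluded word is absent."""
    words = re.findall(r"\w+", text.lower())
    has_regex_hit = any(pattern.fullmatch(w) for w in words)
    has_excluded = excluded in words
    return has_regex_hit and not has_excluded

hits = [name for name, text in docs.items() if matches_query(text)]
print(hits)  # ['doc2']
```

Under this assumption, a document such as doc3 that contains both the correct spelling and a misspelling is dropped as well, which is worth keeping in mind when combining the two options.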

Sometimes, you may not be using regular expressions to hunt for misspellings but rather to include legitimate spelling variations, like “color” (American English) and “colour” (British English), or “gray” and “grey”. Here, you could use a vertical bar (“|”), which is regex shorthand for “or”, between alternative characters or words, with parentheses marking where the alternatives begin and end. For example, entering “/gr(a|e)y/ whale” would tell ExaBot to find all matches for either “gray whale” or “grey whale.”
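Alternation can be tried out the same way. The sketch below uses Python's re module as a stand-in for the search engine; the word list is invented, and the second pattern (alternation between whole words) is an illustration of the general regex feature rather than a query tested on Exalead.

```python
import re

# Alternation between single characters inside a word.
gray_or_grey = re.compile(r"gr(a|e)y")
# Alternation between whole words.
color_or_colour = re.compile(r"(color|colour)")

# Hypothetical word list, chosen only for illustration.
words = ["gray", "grey", "groy", "gry", "color", "colour", "colr"]

gray_hits = [w for w in words if gray_or_grey.fullmatch(w)]
color_hits = [w for w in words if color_or_colour.fullmatch(w)]

print(gray_hits)   # ['gray', 'grey']
print(color_hits)  # ['color', 'colour']
```

Unlike the period, alternation accepts only the listed choices, so “groy” is rejected.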

To learn more about regular expressions, take a look at the regex Wikipedia article. Be sure to also look over all of Exalead’s Advanced Search options. Used alone or in combination (as with the “/exa*lead/ -exalead” example), they offer an easy way to inject some high-octane fuel into your next query.

  • Jason

    Excellent work, thank you! I’ve been looking for a regex-enabled search engine for a while, and I’m glad that you’ve made it. Please keep up the great work — I see that this feature of Exalead has been around for at least 3 years.

  • Great extension for professional programmers. Unfortunately, regular expressions are not easy enough for regular users.

  • First, congratulations on integrating this very powerful feature, which in my opinion would be worth teaching much more widely.
    Regexps are too often closely associated with the word ‘geek’…
    A few remarks however:
    About the example “/exa*lead/ -exalead”:
    1. a more straightforward regexp is simply /exaa+lead/
    2. mixing a regexp with minus-sign word exclusion gives me the uncomfortable feeling that it introduces a kind of new dialect of regexps 🙁
    About the example “color” (American English) and “colour”:
    This pair is usually used to introduce the ? quantifier, as in /colou?r/
    About dialects: as a Mac user, I very much appreciate the complete and sophisticated regexp treatment offered by TextWrangler, which comes with a nice user manual. Might it be worth a look when defining specs?
    TextWrangler is available as freeware at:
    http://www.barebones.com/