Applying Regular Expressions to Sort Items in Google Sheets

Lists are more useful when grouped.

When you need to analyze a large list of items, you need to create areas of commonality to enable grouping and sorting that list. A simple example of grouping might be to think about a grocery list, where certain items (bananas, oranges) are fruits and other items (paper plates, napkins) are non-perishable items that need to be managed differently. Short lists of a few items are easy to categorize; longer items take an effort to sort and manage the data; and large lists (>200 items) might require a programmatic approach.

Fortunately, spreadsheets are excellent for a large, boring sort that might drive you a bit nuts if you tried to do it by hand. Spreadsheets require a logical framework to decide exactly how to organize your items into categories, especially if you have multiple criteria. To organize your items, you might look for an obvious sort (alphabetical) or count individual things, but you are more likely trying to create a grouping that’s not obvious.

RegEx can help (really)

Regular expressions are a computer-geeky way of writing a pattern for text that you want to match, and provide a TRUE or FALSE result to your expression. This code is amazing and flexible, and also a bit hard to master (check out https://regexr.com/ if you want a handy playground to try them out). You use Regular Expressions when you want to apply the same test to multiple items, e.g. finding the domain name in an email, or retrieving a portion of text from a URL.

Using a logical test and validating it in Regular Expressions allows you to take an abstract problem (find the senior leaders in a list of titles) that’s easily understood by humans but needs instructions for computers to execute. This all sounds logical, but searching for Google Sheets RegEx formulas might leave you feeling like adding more than one condition at a time is a bit … challenging. So I decided to simplify that process.

Here’s a concrete example of applying multiple true/false conditions to a single problem. If you’ve built prospect lists for outbound campaigns, there’s one thing that almost always happens: you have to “bucket” the titles by groups to decide what to do next. You might manually decide that titles starting with CEO or “Chief” are C-level, and that ones starting with VP are VP-level, but it’s kind of a drag to do this manually. Wouldn’t it be easier to define the filters, then apply the true/false results of each filter to all of your titles at once?

A quick tool to help sort lists

I decided to build a small tool in Google Sheets to model using multiple filters to test the ability to arrange job titles into families. To build this utility, I used Google Sheets to list the testName I wanted to establish in each row, followed by an example title, a Regular Expression condition I wanted to match, and the result of TRUE or FALSE for that test. I also created a column to help you understand which part of the string/title matched your condition.

Several rows of tests, displaying the regular expression used to evaluate the test

The initial list has four tests, where the simplest is:

(?i)(mgr|manager)

Or – in English – “match everything that has either the string “mgr” or the string “manager”, in the source string, and don’t worry about whether it’s upper or lower case.”

These tests each evaluate a job title, determining what seniority it lives in based on a keyword or words in the title. I intended this rule stack evaluate from the top to the bottom, so you would evaluate the first rule, then stop if you get TRUE, then move on to the next rule. If you prefer, you can keep going even when you have multiple TRUE conditions as it will help you understand the distribution of your list between buckets.

A few titles, displaying the results of the testing run and the item that matched

Here’s how this works:

  1. Taking the title text from the “Example Title” column,
  2. We use the header column to look up the row of the test we want to execute. For isManager level, we would use a formula to look at the column labeled Test Name in the other tab, do a vLookup to find the row matching isManager, then read the expression in the Condition/Regex column, and return a TRUE/FALSE result using the REGEXMATCH formula, e.g. =REGEXMATCH(“Sr Manager of Spreadsheets”,”(?i)(mgr|manager)”)
  3. use conditional formatting to alert a TRUE result;
  4. and reverse the formula using a REGEXEXTRACT to identify the portion of the string that matched first.

Ok, so what?

Yes, you could apply these formulas one at a time or use complicated sorting methods to find boolean conditions that are true or false in your data set. This method is helpful because it allows you to flexibly add conditions (add a row in the RegEx Tester sheet; and a column in the ProspectTitles sheet labeled with that row’s test name). It also allows you to change your RegEx conditions to match your data set by updating a single field for each change you want to make in your expression across as many items as you want.

This method works well and quickly for evaluating several fields against a set of 10k records or fewer. As you think about longer lists that involve updating more than 50k cells at a time, you might combine your tests or use small portions of lists at a time to increase the speed (that’s a lot of VLOOKUPs 😉

Here’s a link to the example – make a copy and give it a try.

Make a Boolean Expression From A List of Rows in Google Sheets

google-list

We all use search forms every day. There’s usually an advanced search form of some sort that allows you to enter a boolean expression. You know Boolean expressions as “This OR That” – an inclusive search that increases results – or “This AND That” – an exclusive search that limits results.

When you need to filter your results, a powerful method is to make a list of companies, individuals, skills – whatever you are using to get more precise answers – and create a Boolean expression to get a better initial set of results from a search engine. Yet when someone gives you a list of 10, 20, 50, or 100 items the idea of making a delimited list of “Item1” OR “Item2” OR “Item3” is more difficult than you’d like to be.

Never fear: Google Sheets to the rescue! This is a perfect example of a task for a Lazy Programmer™ to solve, as it’s much easier to have a spreadsheet count values and rearrange text for you than it is for you to do the work by hand. Let’s take a look.

As an example, I created a utility spreadsheet that takes a range of values, counts those values, and outputs a result string that delivers a boolean expression to paste into your favorite advanced search page. Whether you’d like to create a list of criteria to find your next job, search Twitter more effectively, or simply create a customized search for Google, this expression builder can help.

This spreadsheet has two formulas that do the work for you. The first one joins a range of values into a concatenated string (turns list of Orange and Apple into ‘Orange’ OR ‘Apple’ OR ”):

join

There are five functions in this expression:

  1. JOIN combines an expression of values into a string connecting each item with the string ‘ OR ‘;
  2. INDIRECT tells the spreadsheet to wait until the range expression in the formula from A2 to the last unblank value in column A is completed and we know the number of the last row to select;
  3. TEXT converts a complex expression into a value – in this case a calculation of the number of rows in the column of this spreadsheet minus the number of blank rows in the column of this spreadsheet – so that INDIRECT will take result of this expression and substitute “5”, the last row above that has a value. This results in a range from A2:A5 to feed back to JOIN;
  4. ROWS counts the number of rows in a range – the standard number in a Google Spreadsheet is 1000 – and returns an integer to feed back to the expression inside of TEXT;
  5. COUNTBLANK returns the number of blank rows in a range, so that our calculation of rows minus the number of blank rows leaves us with the number of rows containing values (4) or the total number of rows including the header (5).

Now that we have our formula, it creates the string:

'Orange' OR 'Apple' OR 'Banana' OR 'Grapefruit' OR ''

From the rows of Orange, Apple, Banana, Grapefruit above. And there is one problem here: that we need to remove the trailing “OR ”” that resulted from the expression.

Our second formula, REGEXREPLACE, removes the last “OR ”” from the string. RegEx is sort of like Black Magic, so you will probably need to use a helper site like Rubular. This string looks for a specific string right before the end of the line (represented by $)

OR ''

and replaces it with nothing.

regex

This leaves you with the strong you want to paste into a search form:

'Orange' OR 'Apple' OR 'Banana' OR 'Grapefruit'

You’re all set! Now, use this utility with any number of rows in your list of items to create a Boolean expression made for you automagically.

Create a free website or blog at WordPress.com.

Up ↑

%d bloggers like this: