Applying Regular Expressions to Sort Items in Google Sheets

Lists are more useful when grouped.

When you need to analyze a large list of items, you need to create areas of commonality to enable grouping and sorting that list. A simple example of grouping might be to think about a grocery list, where certain items (bananas, oranges) are fruits and other items (paper plates, napkins) are non-perishable items that need to be managed differently. Short lists of a few items are easy to categorize; longer items take an effort to sort and manage the data; and large lists (>200 items) might require a programmatic approach.

Fortunately, spreadsheets are excellent for a large, boring sort that might drive you a bit nuts if you tried to do it by hand. Spreadsheets require a logical framework to decide exactly how to organize your items into categories, especially if you have multiple criteria. To organize your items, you might look for an obvious sort (alphabetical) or count individual things, but you are more likely trying to create a grouping that’s not obvious.

RegEx can help (really)

Regular expressions are a computer-geeky way of writing a pattern for text that you want to match, and provide a TRUE or FALSE result to your expression. This code is amazing and flexible, and also a bit hard to master (check out https://regexr.com/ if you want a handy playground to try them out). You use Regular Expressions when you want to apply the same test to multiple items, e.g. finding the domain name in an email, or retrieving a portion of text from a URL.

Using a logical test and validating it in Regular Expressions allows you to take an abstract problem (find the senior leaders in a list of titles) that’s easily understood by humans but needs instructions for computers to execute. This all sounds logical, but searching for Google Sheets RegEx formulas might leave you feeling like adding more than one condition at a time is a bit … challenging. So I decided to simplify that process.

Here’s a concrete example of applying multiple true/false conditions to a single problem. If you’ve built prospect lists for outbound campaigns, there’s one thing that almost always happens: you have to “bucket” the titles by groups to decide what to do next. You might manually decide that titles starting with CEO or “Chief” are C-level, and that ones starting with VP are VP-level, but it’s kind of a drag to do this manually. Wouldn’t it be easier to define the filters, then apply the true/false results of each filter to all of your titles at once?

A quick tool to help sort lists

I decided to build a small tool in Google Sheets to model using multiple filters to test the ability to arrange job titles into families. To build this utility, I used Google Sheets to list the testName I wanted to establish in each row, followed by an example title, a Regular Expression condition I wanted to match, and the result of TRUE or FALSE for that test. I also created a column to help you understand which part of the string/title matched your condition.

Several rows of tests, displaying the regular expression used to evaluate the test

The initial list has four tests, where the simplest is:

(?i)(mgr|manager)

Or – in English – “match everything that has either the string “mgr” or the string “manager”, in the source string, and don’t worry about whether it’s upper or lower case.”

These tests each evaluate a job title, determining what seniority it lives in based on a keyword or words in the title. I intended this rule stack evaluate from the top to the bottom, so you would evaluate the first rule, then stop if you get TRUE, then move on to the next rule. If you prefer, you can keep going even when you have multiple TRUE conditions as it will help you understand the distribution of your list between buckets.

A few titles, displaying the results of the testing run and the item that matched

Here’s how this works:

  1. Taking the title text from the “Example Title” column,
  2. We use the header column to look up the row of the test we want to execute. For isManager level, we would use a formula to look at the column labeled Test Name in the other tab, do a vLookup to find the row matching isManager, then read the expression in the Condition/Regex column, and return a TRUE/FALSE result using the REGEXMATCH formula, e.g. =REGEXMATCH(“Sr Manager of Spreadsheets”,”(?i)(mgr|manager)”)
  3. use conditional formatting to alert a TRUE result;
  4. and reverse the formula using a REGEXEXTRACT to identify the portion of the string that matched first.

Ok, so what?

Yes, you could apply these formulas one at a time or use complicated sorting methods to find boolean conditions that are true or false in your data set. This method is helpful because it allows you to flexibly add conditions (add a row in the RegEx Tester sheet; and a column in the ProspectTitles sheet labeled with that row’s test name). It also allows you to change your RegEx conditions to match your data set by updating a single field for each change you want to make in your expression across as many items as you want.

This method works well and quickly for evaluating several fields against a set of 10k records or fewer. As you think about longer lists that involve updating more than 50k cells at a time, you might combine your tests or use small portions of lists at a time to increase the speed (that’s a lot of VLOOKUPs 😉

Here’s a link to the example – make a copy and give it a try.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s