Alteryx Tools in Focus: Fuzzy Match, Make Group and Unique
Alteryx has a vast number of tools, and it’s easy to miss some functionality that might be useful, so for this new series of blog posts we’re going to take readers through three tools per blog post, detailing functionality as well as hints and tips for each tool.
The unique tool is a Data Preparation tool, and as you can imagine from the name it’s function is to return a unique dataset for the fields provided, though I think a clearer name for the tool would be “First Distinct”, because that’s essentially what the tool does. For the field(s) selected the tool groups the rows based on those fields (sorting on them in the process), then takes the first row of each group into the U (Unique) stream, the remaining rows go into the D (Duplicate) stream. This means that sorting is incredibly important if we want to control the output. e.g. If we have first sorted a recordset based on a Profit field descending, then did a unique on the Product Category, we would get the highest record with the highest Profit for each Product Category.
Remember that if we simply want to know the maximum Profit per Product Category then we could use a Summarize Tool, grouping by Product Category and selecting the Maximum Profit will produce that result, however the Unique tool has the difference that it keeps all the fields, the Summarize Tool will only keep the aggregations you specify.
This is probably the tool we get most support questions on, it’s a complex tool and it isn’t immediately obvious how to use it. As such it requires more explanation than probably any other tool in the Data Cleansing / Preparation categories.
We’re likely to reach for Fuzzy Matching when we have a dataset which needs to be string matched against some kind of reliable “master” dataset, but the strings don’t provide an exact match – we call this a “Merge” match in Alteryx (here we don’t care about looking for duplicates within the Master file). Alternatively you might have one dataset and want to find all the duplicates across it, to keep one master record’s details – this is a “Purge” match. The data might be Product Names, Company Names, Name and Address, the approach is the same, but crucially we can tweak the matching algorithm so that we get the best match depending on our use case, for example, if I’m matching Customer Names to a mailing list to send an offer or promotion, then I might accept a looser match than if I’m, say, approving them for a loan.
The first thing you’ll notice when you drag the tool onto the canvas is that it only has one input, which immediately seems counter intuitive. How can I match my data if I only have one input? Well in requiring a single input stream Alteryx have actually given us limitless options in how we do our matching, and it’s this flexibility that makes the tool appear daunting to new users. However by approaching the problem step by step we can remove the complexity.
Before the Fuzzy Match Tool
Step 1: Get rid of any identical matches using the join tool (or the duplicate tool if you only have one data set), this will aid the matching process by reducing the number of possible matches.
Step 2: Build a single dataset – add a column to each dataset to give each source a unique identifier if you are doing a merge match – then use the Union Tool to combine your source datasets with your master data.
Step 3: Give each row a unique ID field using the Record ID. Top Tip: Given that the tool returns these IDs I find if you’re just using one field it helps if you create an ID incorporating the field you want to match, that way it’s easy to see the results.
Step 4: Connect the Fuzzy match tool and specify the RecordID field. Also specify whether you are doing a merge or a purge, as defined above. In a merge you will need to specify the source id field.
Finding the right match algorithm is an iterative process, likely to be dependent on the data you are feeding through the tool. Once you’ve connected to the fuzzy match tool your first step is to decide which fields to create your match on, e.g. if it’s just Company Name then you’re likely to only have one match field, but if you have Name, Address and Postcode you might have sever: First Name, Surname, Address1, Address2, Town, County, Postcode. Once you’ve decided then follow this process for each field:
Step 5: Add the match field to the “Match Fields” selection at the bottom of the tool.
Step 6: Choose a match style, there are some redefined ones, so the most relevant one to your match, if you’d like to understand the matching algorithm or tweak it then you can hit “custom” (and we would recommend you do to get the most benefit out of the tool, see the Appendix of this post for details of what you will find and how the matching is performed).
After the Fuzzy Match Tool
Step 7: Out of the tool will come two Record IDs to show which records match each other; a match will only appear once, you won’t get the reverse match shown. Unless you made the RecordID includes the match field (as suggested in Step 3 you will then need a couple of joins to rejoin the original data back on and work with the results.
There’s a lot, relatively speaking, compared to other Alteryx Tools, to getting the Fuzzy Match process working, it’s a very powerful tool though. Check out the Appendix for ways to tweak the results and build a very powerful custom algorithm.
The Make Group Tool identifies groups within you dataset, and works very well in combination with the Fuzzy match tool. You won’t use it much, and interestingly enough this isn’t a tool you’ll find anywhere in the v8.6 Samples, but it’s worth having it in your back pocket.
The tool looks across pairs of matches, and forms them into groups e.g. if A links to B and B links to C then A, B and C will all be in the same group. This can be useful in fuzzy matching because you are going to get paired results, and you’ll want to form them into groups to identify where to purge the results, e.g. if Alan Smith matches to Alan Smythe who matches to A Smythe, but Alan Smith didn’t match to A Smythe it’s likely they are all duplicates.
Let’s look at these tools in action in a 20 minute example video.
You can find the package for this module here.
There are several elements to the matching process, the most important are the Generate Keys and Match Style.
This is the first “rough” match that is then further refined later. Each record is assigned a key and if two records have the same key they will be matched. The options are
None: This field will not be taken into used in deciding which records match – use this if you are using multiple match fields but want to ignore a field in the key generation match phase
Digits Only: Use only the digits to form the key, e.g. useful for phone numbers which may contain punctuation or spaces.
Digits Only Reverse: This is useful where phone numbers may contain a country code, combined with the maximum key length you can ensure you only match on the last x digits of the phone number.
Double Metaphone / Double Metaphone with Digits : A useful algorithm that can match words based on what they sound like, so can match even where there are spelling errors. It works by phonetically by reducing them to 12 consonant sounds.
Soundex / Soundex with Digits : useful for European names this is another sounds-like algorithm
Whole Field: Only records where the entire field matches will be matched. Case must also match.
Whole Field (Case Insensitive): Only records where the entire field matches will be matched. Case is ignored.
Alphanumeric Only: Looks only at alphanumeric characters to make a match. Case must also match.
Alphanumeric Only (Case Insensitive): Looks only at alphanumeric characters to make a match. Case is ignored.
Generate Keys for each word: Ignore the order of words, e.g. University Birmingham and Birmingham University will match.
Maximum Key Length: Specify the maximum length of the key to consider for the match
The Match function is a actual matching process that does the beef of the work, after the brutish nature of the Key Generation. Records are only compared where keys match.
None – Key Match Only: Looks only at the Key Generation specifications.
Levenshtein Distance / Jaro Distance : Two measures of similarity between two strings. The Jaro Distance is more forgiving than the Levenshtein Distance.
Using character matches the whole string will be matched, use a Word based match to compare individual words (i.e. the order within the string won’t matter)
Use this setting to set the tolerance of the match, we tend to recommend returning the score using the options on the Fuzzy Match tool then sorting by the score and viewing the data to determine the best threshold to use. If using the match threshold, ensure the overall threshold on the main settings for the tool is also set to a similar value, these settings work independently and so both will limit the matches returned.