A typical task in data science is to combine different pieces of information, for example by matching entries from multiple lists.
On the first peek, matching data is a piece of cake. However, it is really simple to create so much data confusion that fixing contradictions and errors takes up more time than the matching itself. Accordingly, a defensive way of thinking might help:
Whatever can go wrong, will go wrong.
Murphy's Law
Situation
This example is about three lists containing information about US presidents. At this point, we assume that there is further information behind each list element that is supposed to be analyzed later. For simplicity, the focus is one entry from each list. Obviously, "John Adams", "J. Adams" and "Adams, John" are the same person and, thus, supposed to be matched.
...
...
John Adams
...
...
...
...
...
Adams, John
...
Approach 1: Creating lists of pairwise matchings
A good starting point to tackle a matching task is to enumerate all matchings as pairs of list elements. This is simple, handy and also feels natural when working with a spreadsheet tool. Besides, when automation tools (e.g. Alteryx, SSIS) are used, their output is a list of matched pairs as well. Accordingly, for the three lists here, there are three pairwise combinations to consider. In general, when n is the number of lists, the number of pairwise combinations will be n(n-1)/2.
So, these are the pairwise matchings:
List 1
|
List 2
|
... |
... |
John Adams |
J. Adams |
... |
... |
List 2
|
List 3
|
... |
... |
J. Adams |
Adams, John |
... |
... |
List 1
|
List 3
|
... |
... |
John Adams |
... |
... |
... |
Whoops, what happened to the matchings between lists 1 and 3? In fact, two things happened here:
-
There is a matching missing (i.e. false negative) as the connection between "John Adams" from list 1 and and "Adams, John" from list 3 is missing.
-
Instead an incorrect matching was made (i.e. false positive). "John Adams" was matched to some other "..." element. This is not correct.
Unfortunately, issues like these happen with real life data very quickly. Ideally, after matching entries between lists 1 and 2 and lists 2 and 3, the matching between lists 1 and 3 would be created automatically. This is called
transitivity. Unfortunately, this is nothing what a spreadsheet provides out of the box.
An important thing to note: Lacking transitivity can even happen when data is matched within one list only (which is duplicate detection basically). So, the issue shown here is not a mere consequence of the experimental setup involving more than one list.
Approach 2: Storing matching elements in one place as so-called "cliques"
To ensure transitivity another logical thinking might help: All list entries which are matched together, will be kept together in a collection, called "clique".
List 2: J. Adams
List 1: John Adams
List 3: Adams, John
The benefit of this is easy to see: Transitivity is preserved at all times since there are no matching pairs which might be forgotten. It is an all-or-nothing matching concept for each element. Unfortunately, handling the matchings this way gets quickly quite confusing: As each of the clique members needs to be checked against the remaining list elements, this approach gets confusing quickly. And on top of this, all cliques need to be checked against each other as well. If there is a match between cliques, merging gets necessary.
Spreadsheets are not good at dealing with data this way. Spreadsheets are made for lists and tables with only one element - and not a clique - in every cell. Automation tools work better here. Even though they store the matchings as pairs in the beginning, they can consolidate cliques automatically. For Alteryx it is the Grouping operation, for example. Unfortunately, automation tools can match data using text-based similarity at most. Context-based similarity, as recognized by humans, remains infeasible for these tools.
Matchmerize ensures matching transitivity through cliques and manages the data handling complexity for the user
As explained above: For ensuring matching consistency (so-called "transitivity") a clique-based handling is really helpful. As cliques are hard to handle with a spreadsheet alone, using something more sophisticated makes sense.
Matchmerize liberates the user from the need of taking care of the cliques manually. This is achieved through a specially designed application which enables the user to concentrate on finding the right matching quickly.
Brain teaser: Are cliques multi-user-proof?
One of the strengths of Matchmerize is to consolidate the matchings of multiple users into one combined matching output list. However, what if the users matchings are in conflict with each other. Imagine, Alice sees an additional matching between "John Adams" and "John Q Adams" while Bob does not. What happens in such a case?
Learn how Matchmerize consolidates different opinions