Entity Resolution
Intro
The data Sayari collects includes references to entities (e.g., persons, companies, etc.) from around the globe. These entity references may refer to the same real-world entity across different documents and/or jurisdictions. Entity resolution aggregates entity references that refer to the same real-world entity.
Entity references are extracted with different information depending on the richness of the collected data; different jurisdictions collect and publish different attributes (e.g., corporate identifiers, contact information, etc.). To account for differing levels of detail, a variety of entity resolution rules, developed jointly by Sayari Data Engineers and Analysts, determine how to aggregate entity references. Moreover, to capture and convey different levels of confidence, Sayari implements two stages of entity resolution: identity resolution and possibly same as (PSA) resolution.
The goal of our entity resolution pipeline is to take these input entity references and use their attributes and relationships to determine which of them refer to the same real-world entity. The output of our entity resolution is a mapping that assigns every entity reference in our data to an entity that will be represented in the platforms (Graph & Map). Identity resolution aggregates entity references that confidently refer to the same real-world entity into entities, while PSA resolution collects those entities into groups that likely refer to the same real-world entity.
Entity Resolution Fundamentals
Blocking and Matching
Our fundamental entity resolution process is based on “blocking” followed by “matching.”
The goal of “blocking” is to reduce the number of pairwise comparisons required in the matching step. To accomplish this, references are assigned to groups called “blocks” on the basis of shared “blocking keys” found in their attributes; for instance, references might be placed into a block together because they share an address or a name attribute. Generally speaking, being placed in the same block is not sufficient for references to resolve together; blocking simply generates “candidate pairs” of references that may refer to the same entity. The candidate pairs are then examined more closely to determine which ones are true matches that will be linked together.
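The blocking step described above can be sketched as follows. This is a minimal illustration, not Sayari's implementation: the reference records, the field names, and the choice of name and address as blocking keys are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical references; the field names are illustrative only.
references = [
    {"id": "A", "name": "ACME LTD", "address": "1 MAIN ST"},
    {"id": "B", "name": "ACME LIMITED", "address": "1 MAIN ST"},
    {"id": "C", "name": "ACME LTD", "address": "9 OAK AVE"},
    {"id": "D", "name": "ZENITH LLC", "address": "5 ELM RD"},
]


def blocking_keys(ref):
    """Yield one blocking key per attribute used for blocking (assumed here)."""
    yield ("name", ref["name"])
    yield ("address", ref["address"])


def candidate_pairs(refs):
    """Group references by blocking key, then pair up members of each block."""
    blocks = defaultdict(set)
    for ref in refs:
        for key in blocking_keys(ref):
            blocks[key].add(ref["id"])

    pairs = set()
    for members in blocks.values():
        # Only references that share a block are ever compared.
        pairs.update(combinations(sorted(members), 2))
    return pairs


pairs = candidate_pairs(references)
print(sorted(pairs))  # [('A', 'B'), ('A', 'C')]
```

With four references there are six possible pairs; blocking narrows these to two candidate pairs, and reference D is never compared with anything.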
After blocking groups are created, we examine every possible pair within each blocking group and apply our matching business logic to determine whether these references refer to the same entity or not. If a pair meets our standards for being considered a match, then we link these references together.
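As a rough sketch of the matching step, the snippet below compares every pair inside a single blocking group and keeps the pairs that pass a rule. The rule itself (same name plus at least one shared identifier) is a placeholder invented for illustration; the actual matching business logic is not described here.

```python
from itertools import combinations


def is_match(a, b):
    """Placeholder rule (an assumption): same name and a shared identifier."""
    return a["name"] == b["name"] and bool(a["identifiers"] & b["identifiers"])


def links_in_block(block):
    """Compare every pair in one blocking group and keep the matching pairs."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(block, 2)
        if is_match(a, b)
    ]


# Hypothetical block of references that share the name blocking key.
block = [
    {"id": "A", "name": "ACME LTD", "identifiers": {"REG-001"}},
    {"id": "B", "name": "ACME LTD", "identifiers": {"REG-001", "TAX-77"}},
    {"id": "C", "name": "ACME LTD", "identifiers": {"REG-999"}},
]

links = links_in_block(block)
print(links)  # [('A', 'B')]
```

Reference C sits in the same block as A and B but is not linked to either, which illustrates why sharing a block is not sufficient for references to resolve together.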
After the individual links are generated, we group the linked references into connected components; if references A and B have been linked and references B and C have also been linked, then we assume that A, B, and C all refer to the same entity. These groups are the ultimate output of our entity resolution.
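One common way to turn pairwise links into connected components is a union-find (disjoint-set) structure. The sketch below uses that technique as an assumption; the text above does not say which algorithm is used in production.

```python
def connected_components(ids, links):
    """Group ids into connected components using union-find."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return sorted(sorted(group) for group in groups.values())


# A-B and B-C are linked, so A, B, and C end up in one component.
components = connected_components(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(components)  # [['A', 'B', 'C'], ['D']]
```

Note that A and C were never directly linked; they land in the same component only through B.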
Splitting
Sometimes, our entity resolution process incorrectly links entity references that do not actually refer to the same real-life entity. Entity references may cluster together due to common attributes but also have conflicting attributes that show that they cannot be the same. For example, the following references may reasonably cluster together:
But a person cannot have multiple dates of birth, so the resulting entity does not make much sense. Instead, this example might suggest a parent and child at the same address. Accordingly, we split the input entity into entities that don’t have this conflict.
Currently, our splitting algorithm triggers a split on the basis of conflicts in 1) strong identifier attributes and 2) birth years. Once the references have been split up, we re-sort them into new non-conflicting buckets, using shared identifiers and names to decide which references to bucket together.
One might ask why we need a special step for this as opposed to simply including “no conflicts” as a matching rule. The reason for this is that our entity resolution works transitively; in the above example, we might generate pairs A-B and B-C, where no pair has an explicit conflict, but a conflict reveals itself after we collect all three references into one entity.
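The transitivity problem can be illustrated with a small sketch. The three references below are hypothetical data made up for this illustration, and the split logic is deliberately simplified: it handles only birth-year conflicts and re-buckets undated references by shared identifier, whereas the algorithm described above also considers strong identifier conflicts and names.

```python
def has_conflict(refs):
    """True if the references carry more than one distinct birth year."""
    years = {r["birth_year"] for r in refs if r["birth_year"] is not None}
    return len(years) > 1


def split_by_birth_year(refs):
    """Simplified split: one bucket per birth year; undated references join
    the first bucket that shares an identifier, else get their own bucket."""
    buckets = {}
    undated = []
    for r in refs:
        if r["birth_year"] is None:
            undated.append(r)
        else:
            buckets.setdefault(r["birth_year"], []).append(r)

    result = list(buckets.values())
    for r in undated:
        for bucket in result:
            if any(r["identifiers"] & other["identifiers"] for other in bucket):
                bucket.append(r)
                break
        else:
            result.append([r])
    return [sorted(x["id"] for x in bucket) for bucket in result]


a = {"id": "A", "birth_year": 1960, "identifiers": {"PASSPORT-1"}}
b = {"id": "B", "birth_year": None, "identifiers": {"PASSPORT-1"}}
c = {"id": "C", "birth_year": 1990, "identifiers": set()}

# Neither linked pair conflicts on its own...
pairwise_conflicts = [has_conflict([a, b]), has_conflict([b, c])]
# ...but the component formed from A-B and B-C does.
component_conflict = has_conflict([a, b, c])
split = split_by_birth_year([a, b, c])

print(pairwise_conflicts, component_conflict, split)
# [False, False] True [['A', 'B'], ['C']]
```

A pairwise "no conflicts" matching rule would accept both A-B and B-C, which is why the conflict check has to run on the assembled component.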
Our Implementation
Sayari implements two levels of entity resolution:
Identity Resolution
We have implemented our identity resolution in an iterative fashion, applying aspects of the process described above (blocking → matching → component generation [clustering] → splitting) multiple times in succession. Each iteration uses its own set of key generators and matching rules. Currently, we perform identity resolution over three iterations, or stages.
Between cycles, we “roll up” the attributes of the newly resolved references into new intermediary “entities” that contain the aggregated, deduplicated attributes of all of the resolved references. These serve as inputs into the following iteration.
Notably, in the first two passes, we do no matching at all; we simply generate blocking keys that are strong enough that we consider them sufficient to resolve entities with no further examination. At the time of writing, the more traditional blocking + matching strategy is only implemented in full in the third pass. Note that our splitting algorithm (see above) is implemented in all three stages.
The blocking/matching rules implemented in each stage are described in the following sections. Note that we use the term “references” throughout the identity resolution sections as a term for “things that we are attempting to resolve together.” This is not strictly accurate in the second and third stages, where the inputs are not single references but rather intermediary entities that may be composed of multiple references. However, for simplicity, we use the term “references” for both cases.
At the moment, we deduplicate attributes in entity resolution roll-ups only on the basis of exact matches across all fields. Note that we also do a separate roll-up with attribute deduplication to produce the attributes that are displayed in the Sayari Graph application; this process has its own separate attribute deduplication logic.
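A roll-up with exact-match deduplication might look like the sketch below. The record layout (a list of attribute dictionaries per reference) and the field names are assumptions for illustration only.

```python
def roll_up(references):
    """Merge attribute records from resolved references, dropping exact duplicates."""
    seen = set()
    merged = []
    for ref in references:
        for attr in ref["attributes"]:
            # Exact match across all fields: the key is every (field, value) pair.
            key = tuple(sorted(attr.items()))
            if key not in seen:
                seen.add(key)
                merged.append(attr)
    return merged


# Hypothetical references that have been resolved into one entity.
resolved = [
    {"id": "A", "attributes": [
        {"type": "address", "value": "1 MAIN ST", "country": "US"},
    ]},
    {"id": "B", "attributes": [
        {"type": "address", "value": "1 MAIN ST", "country": "US"},
        {"type": "address", "value": "1 Main Street", "country": "US"},
    ]},
]

entity_attributes = roll_up(resolved)
print(len(entity_attributes))  # 2
```

Because only exact matches are collapsed, "1 MAIN ST" and "1 Main Street" both survive in the rolled-up entity even though they likely describe the same address.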
As a general rule across all stages, we do not resolve references together unless they share the same entity type (classification as company, person, etc.).
PSA Resolution
Our PSA resolution is similar in concept to identity resolution, but rather than aggregating entity references into entities, it creates PSA edges between entities to create PSA components. The goal here is to mark groups of entities that likely refer to the same entity, but with a lesser degree of confidence.
Our PSA process consists of three iterative stages, each made up of block generation, connected component generation, and splitting. We start with more permissive blocking rules and iterate to stricter blocking rules until we obtain sufficiently small PSA components (10,000 entities at most). Unlike identity resolution, we do not do any matching in this stage.
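One possible reading of this permissive-to-strict iteration is sketched below: components within the size cap are accepted, and oversized components are re-blocked under the next, stricter rule. This is an interpretation rather than a description of the production pipeline. The blocking rules, the entity fields, and the shrunken size cap are all hypothetical, the splitting step is omitted, and the sketch uses two rules instead of three for brevity.

```python
from collections import defaultdict

MAX_SIZE = 3  # stand-in for the 10,000-entity cap, shrunk for the demo


def components_from_blocks(entities, key_fn):
    """Treat every entity sharing a blocking key as linked (no matching step)."""
    parent = {e["id"]: e["id"] for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for e in entities:
        for key in key_fn(e):
            if key in first_seen:
                parent[find(e["id"])] = find(first_seen[key])
            else:
                first_seen[key] = e["id"]

    groups = defaultdict(list)
    for e in entities:
        groups[find(e["id"])].append(e)
    return list(groups.values())


def psa_components(entities, rules, max_size=MAX_SIZE):
    """Apply rules from most to least permissive, re-blocking oversized groups."""
    pending, accepted = [entities], []
    for i, key_fn in enumerate(rules):
        last = i == len(rules) - 1
        next_pending = []
        for group in pending:
            for comp in components_from_blocks(group, key_fn):
                # Assumption: whatever remains after the last rule is kept as-is.
                if len(comp) <= max_size or last:
                    accepted.append(comp)
                else:
                    next_pending.append(comp)
        pending = next_pending
    return sorted(sorted(e["id"] for e in comp) for comp in accepted)


# Hypothetical rules: surname only (permissive), then full name (stricter).
rules = [
    lambda e: [e["name"].split()[-1]],
    lambda e: [e["name"]],
]
entities = [
    {"id": "E1", "name": "JOHN SMITH"},
    {"id": "E2", "name": "JOHN SMITH"},
    {"id": "E3", "name": "JANE SMITH"},
    {"id": "E4", "name": "JANE SMITH"},
    {"id": "E5", "name": "ANA LOPEZ"},
]

psa = psa_components(entities, rules)
print(psa)  # [['E1', 'E2'], ['E3', 'E4'], ['E5']]
```

Under the surname rule the four SMITH entities form one component of size 4, which exceeds the demo cap, so that component alone is re-blocked by full name and breaks into two smaller components.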