How a machine learns to extract text from human-given text highlights
Suppose a financial analyst wishes to analyze how a company's revenues and incomes have changed over the years. She has a dataset of the company's financial press releases, and she has to extract all the revenue and income figures from it. She wants to produce a table similar to this for further analysis.
Additionally, she may also want to automate this process such that if her dataset of press releases grows, then she can easily extract the new revenue and income numbers.
In short, our financial analyst has to perform an information extraction task. Information extraction, or IE, is the automatic extraction of structured data from large collections of documents.
There are other real-life scenarios of IE, often involving people without a coding background who cannot simply write a quick script to perform the task:
- Social Media Analytics: Extract user ratings from Yelp reviews.
- Financial Analytics: Extract key people and their relationships with major US companies from SEC data.
- Healthcare Analytics: Extract data from health forms to understand the side effects of certain drugs.
In this article, we will talk in-depth about an interactive, human-in-the-loop tool called SEER. SEER helps users who work with such text datasets extract relevant data from them.
A SEER user highlights examples of text: positive examples are texts they wish to extract, while negative examples are texts they do not wish to extract.
After SEER learns a model, it presents a set of prompts along with some extraction rules, which are programs that will perform the rest of the extraction task on behalf of the user.
The user answers “yes” or “no” to the prompts, which filters the extraction rules according to the user’s feedback.
The user selects an extraction rule in order to preview the extractions. A selected rule will then highlight the rest of the extractions in the documents and tabulate the extractions through all the documents in the dataset. The user can then use either the highlighted documents or the table to decide whether the selected extraction rule performs the desired extraction task correctly.
Otherwise, further fine-tuning can be accomplished by highlighting more positive and negative examples.
Extraction Rules in SEER
While designing SEER, we considered a number of machine learning techniques. However, it was hard to find a method that could accurately learn syntactic text patterns from just a few highlighted examples without requiring the user to fine-tune the model or write code to post-process errors in the output. Our target users, e.g. journalists, investigators, and financial analysts, were not expected to have any coding background or machine learning expertise.
We decided to use a rule-based model: essentially, programs that perform extractions, which in SEER we call extraction rules. SEER automatically generates the initial extraction rules, and if the user needs to fine-tune them, they can either highlight additional examples for SEER to learn from or export the rules and edit them directly.
Language of Extraction Rules
SEER learns visual extraction rules from user-highlighted text examples. An extraction rule is a sequence of primitives, where each primitive extracts one or more tokens. Tokens are words delimited by whitespace and punctuation. For example, 5.4 percent is tokenized into the individual tokens 5, ., 4, and percent. There are 5 types of primitives that can form a rule:
The 5 types of primitives along with sample tokens of texts the primitives can extract. Image by Author.
- Pre-builts extract entities, such as organizations, integers, percentages, phone numbers, and much more.
- Literals extract the exact string. So literal percent extracts only the instances of the word percent in the dataset.
- Dictionary primitives extract texts that match any of the dictionary’s entries.
- Regular expressions or regex for short are supported in SEER through a predefined library of regular expressions. The yellow box containing [A-Za-z]+ is a regex capturing tokens containing lowercase or uppercase letters.
- Token gaps can extract any token. A token gap of “0 to 1” skips over zero to one tokens until the next primitive after the gap matches. Token gaps are only meaningful within a sequence of primitives.
Here is a sample extraction rule that SEER can learn:
This sequence of primitives can capture percentages. It begins with a pre-built integer, then skips over zero to two tokens until the next primitive matches, in this case the literal percent. This rule can capture the text “5.4 percent”.
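To make the matching semantics concrete, here is a minimal sketch of how such a rule could be applied to text. The tokenizer, the predicate-based primitives, and the `match_rule` function are all illustrative assumptions, not SEER's actual implementation:

```python
import re

def tokenize(text):
    # Split on whitespace, then split punctuation into its own tokens,
    # so "5.4 percent" becomes ["5", ".", "4", "percent"].
    return [t for chunk in text.split() for t in re.findall(r"\w+|[^\w\s]", chunk)]

# Each primitive is modeled as a predicate over a single token.
def is_integer(tok):          # stands in for the pre-built Integer
    return tok.isdigit()

def literal(value):           # the Literal primitive
    return lambda tok: tok == value

def match_rule(tokens, primitives, max_gap=2):
    """Match a sequence of primitives, allowing up to `max_gap` skipped
    tokens between consecutive primitives (the token gap)."""
    for start in range(len(tokens)):
        if not primitives[0](tokens[start]):
            continue
        i, ok = start + 1, True
        for prim in primitives[1:]:
            for gap in range(max_gap + 1):
                if i + gap < len(tokens) and prim(tokens[i + gap]):
                    i += gap + 1
                    break
            else:
                ok = False
                break
        if ok:
            return tokens[start:i]
    return None

# Integer, then a gap of up to two tokens, then the literal "percent":
rule = [is_integer, literal("percent")]
print(match_rule(tokenize("revenue grew 5.4 percent this year"), rule))
```

Running this returns the token span `["5", ".", "4", "percent"]`, i.e. the text “5.4 percent”: the integer primitive matches 5, the gap skips over . and 4, and the literal matches percent.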
Why Learning Extraction Rules is Challenging
While the language of extraction rules is fairly simple, there are a number of challenges related to automatically generating extraction rules. The details can be found here. But the following is a summary of why it is challenging:
- Users may provide a limited number of examples & feedback: the rule learning algorithm must learn from very few examples, from a minimum of 2 up to an average of about 6.
- The limited number of user inputs results in a large search space of valid rules: the algorithm must traverse an exponentially large space of candidate rules. We therefore store rules in a tree structure and search the space effectively by learning and ranking rules according to how human developers would handcraft them. We also leverage user feedback through the user interface to further prune the search space.
- Users may provide examples that follow slightly different patterns: SEER must learn rules that capture multiple variants, which it does by learning multiple sets of extraction rules.
Given these challenges, how does SEER learn rules? At a high-level, the algorithm is as follows:
- Generating rules: SEER’s learning algorithm generates the valid candidate rules that capture the positive examples. Rules that capture any negative example are dropped.
- Generalizing rules: SEER then generalizes the rules. Generalizing means intersecting rules that are specific to one example to create a rule that captures as many of the positive examples as possible.
- Rank rules: SEER assigns scores to the primitives of the rules, reflecting how a human developer selects and prefers certain primitives over others when hand-crafting extraction rules.
- Select rules: SEER groups similar rules together and selects a rule from each group; these rules are then displayed to the user. SEER also ensures the rules are diverse, meaning the majority of the rules don’t contain only one type of primitive, which helps capture a variety of user intent.
- Present the rules and get user feedback.
The details of the algorithm can be found in a longer paper.
1. Generating Rules
The first step involves generating the candidate rules for each example. In this scenario, there are two positive examples: “revenue of $3 million” and “income of $4 billion”. There are definitely more rules beyond what’s shown in the image above.
The number of possible rules is exponential: each token in a positive example can be captured by up to 5 primitives. The candidate rules are stored in a tree structure to save space. A path from the root node to a leaf node represents a candidate rule, so rules with the same prefix share the same initial path.
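A minimal sketch of such a tree is a prefix trie over primitives. The primitive labels here are plain strings and the class is purely illustrative, not SEER's actual data structure:

```python
class RuleTrie:
    """Prefix tree for candidate rules: each node maps a primitive label
    to a child node, so rules sharing a prefix share a path."""

    def __init__(self):
        self.children = {}
        self.is_rule_end = False

    def insert(self, rule):
        node = self
        for primitive in rule:
            node = node.children.setdefault(primitive, RuleTrie())
        node.is_rule_end = True

    def rules(self, prefix=()):
        """Enumerate every stored rule as a root-to-leaf path."""
        if self.is_rule_end:
            yield list(prefix)
        for primitive, child in self.children.items():
            yield from child.rules(prefix + (primitive,))

trie = RuleTrie()
# Two hypothetical candidate rules for "revenue of $3 million", sharing
# the prefix [Literal("revenue"), Literal("of")]:
trie.insert(['Literal("revenue")', 'Literal("of")', "Prebuilt(Money)"])
trie.insert(['Literal("revenue")', 'Literal("of")', "Prebuilt(Integer)", 'Literal("million")'])
for rule in trie.rules():
    print(rule)
```

Both rules are stored, but the shared two-primitive prefix appears only once in the tree, which is where the space saving comes from.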
2. Generalizing rules
Then, SEER generalizes rules to capture all the positive examples. Generalizing is done by an intersection operation. In the above image, we present a subset of rules resulting from the intersection of the rules generated from “revenue of $3 million” and “income of $4 billion”.
A rule from the first positive example intersects a rule from the second positive example only if their sequence of primitives intersect:
- Two pre-built primitives intersect if they are the same pre-built.
- A literal primitive intersects with another literal if their values are equal.
- Regular expressions intersect if their expressions are the same.
- Dictionaries intersect with other dictionaries.
- Token gaps can be inserted anywhere except at the beginning or the end of a rule.
Additionally, literal and dictionary primitives can be merged with each other:
- Two different literals intersect to form a dictionary.
- A dictionary and a literal can be merged into a combined dictionary.
- Two dictionaries can be merged into a bigger dictionary.
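The intersection and merging conditions above can be sketched as follows. Primitives are modeled as simple `(kind, value)` tuples; the representation and function names are hypothetical, not SEER's API:

```python
def intersect_primitives(p, q):
    """Return the intersected primitive, or None if p and q don't intersect."""
    (kind_p, val_p), (kind_q, val_q) = p, q
    if kind_p == kind_q == "prebuilt" and val_p == val_q:
        return p                                  # same pre-built
    if kind_p == kind_q == "literal":
        # Equal literals stay literal; different literals merge into a dictionary.
        return p if val_p == val_q else ("dict", frozenset({val_p, val_q}))
    if kind_p == kind_q == "regex" and val_p == val_q:
        return p                                  # identical expressions
    if {kind_p, kind_q} == {"literal", "dict"}:
        words = val_p if kind_p == "dict" else val_q
        word = val_p if kind_p == "literal" else val_q
        return ("dict", frozenset(words) | {word})  # literal folded into dictionary
    if kind_p == kind_q == "dict":
        return ("dict", frozenset(val_p) | frozenset(val_q))  # bigger dictionary
    return None

def intersect_rules(r1, r2):
    """Intersect two rules of equal length primitive-by-primitive."""
    if len(r1) != len(r2):
        return None
    out = [intersect_primitives(p, q) for p, q in zip(r1, r2)]
    return None if None in out else out

rule_a = [("literal", "revenue"), ("literal", "of"), ("prebuilt", "Money")]
rule_b = [("literal", "income"), ("literal", "of"), ("prebuilt", "Money")]
print(intersect_rules(rule_a, rule_b))
```

Intersecting the two rules yields a generalized rule whose first primitive is a dictionary {revenue, income}, followed by the shared literal of and the shared pre-built Money.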
Suppose the user also highlighted “$5 million in revenue” and “$6 billion in income”. The rules learned from these examples will not intersect with the rules generated from “revenue of $3 million” and “income of $4 billion”. In this case, SEER maintains several sets of rules, one for each intersectable set.
This allows SEER to capture slight variations in the examples. Here, one variation has the currency amount, e.g. “$3 million”, at the beginning of the positive example, while the second variation has it at the end.
3. Rank rules
At this point of the algorithm, there are still many candidate rules. Since we cannot display all of them to the user, the algorithm assigns scores to rank rules.
The rules are ranked according to how human developers prefer certain types of primitives over other types given a certain kind of token. There are two types of tokens:
- Semantic tokens: tokens with natural meaning. These are usually entities such as school names, organizations, currency amounts, etc. SEER identifies a semantic token if it is captured by a pre-built primitive.
- Syntactic tokens: tokens that represent syntax, for example, a dash, a colon, and filler articles like ‘a’ or ‘the’.
Depending on the nature or kind of token, we apply one of two scoring functions:
- The first scoring function applies when the token is semantic: we assign scores such that pre-builts are preferred to literals and dictionaries, which are in turn preferred to token gaps and regexes. For instance, for the token “Dubai”, which refers to a city, the pre-built City will be ranked higher than the literal “Dubai” or a regex.
- The second scoring function applies when the token is syntactic: we assign scores such that token gaps and regexes are preferred to literals and dictionaries.
These scoring functions are derived from our observations of how human developers choose primitives when creating rules, and we map these preferences to numerical primitive scores. We conducted a study showing strong agreement between our scoring heuristics and the preferences of human developers for different primitives. You may read more about the study here.
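The two scoring functions can be sketched as preference tables. The specific numerical values below are made up for illustration; only the orderings reflect the preferences described above:

```python
# Preference orderings mapped to illustrative numbers (not SEER's values).
SEMANTIC_SCORES = {"prebuilt": 4, "literal": 3, "dict": 3, "gap": 1, "regex": 1}
SYNTACTIC_SCORES = {"gap": 4, "regex": 4, "literal": 2, "dict": 2, "prebuilt": 1}

def primitive_score(kind, token_is_semantic):
    table = SEMANTIC_SCORES if token_is_semantic else SYNTACTIC_SCORES
    return table[kind]

def rule_score(rule):
    """A rule's score is the average of its primitives' scores.
    `rule` is a list of (primitive_kind, token_is_semantic) pairs."""
    return sum(primitive_score(k, s) for k, s in rule) / len(rule)

# "Dubai" is a semantic token, so a rule using the pre-built City
# outranks one using a regex over the same tokens:
specific = [("prebuilt", True), ("literal", False)]
generic = [("regex", True), ("literal", False)]
print(rule_score(specific), rule_score(generic))
```

With these tables, the specific rule scores 3.0 and the generic one 1.5, so the pre-built-based rule ranks higher, matching the first scoring function's preference.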
4. Select rules
Once the scores are assigned to primitives and the scores of the rules are calculated (as averages of the primitive scores), a handful of rules are selected to be presented to the user. While it may seem preferable to select only the top-k scoring rules, we also need the rules to be diverse: the top-k scoring rules might consist mostly of pre-builts or mostly of regexes, which is undesirable. The user might want a specific rule, e.g. one combining literals and pre-builts, or a generic rule, e.g. one containing mostly regexes. Selecting only the top-k scoring rules may bias the set one way and work against the user’s intent.
It is hard to know exactly what the user intended, e.g. a specific rule or a generic one, because the user only gives a handful of text examples (an average of 6 positive examples). To capture the variety of user intent, SEER generates a diverse set of rules as follows:
- Group similar rules together. Two rules are similar if they are composed of the same primitive types, regardless of their order in the rule.
- Select the top-scoring rule from each group.
In the generated set of rules shown above, each rule has a unique set of primitive types, so the set is considered diverse. This ruleset is presented to the user in order of rule score.
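The two selection steps can be sketched as grouping by the multiset of primitive types and keeping the top scorer per group. The rule representation and scores below are made up for illustration:

```python
from collections import defaultdict

def select_diverse(scored_rules):
    """Group rules by their primitive types (order ignored), keep the
    top-scoring rule per group, and return them in descending score order.
    `scored_rules` is a list of (score, [primitive_type, ...]) pairs."""
    groups = defaultdict(list)
    for score, rule in scored_rules:
        groups[tuple(sorted(rule))].append((score, rule))
    best = [max(group) for group in groups.values()]
    return sorted(best, reverse=True)

candidates = [
    (3.0, ["prebuilt", "literal"]),
    (2.5, ["literal", "prebuilt"]),  # same primitive types -> same group
    (1.5, ["regex", "literal"]),
    (1.0, ["gap", "regex"]),
]
for score, rule in select_diverse(candidates):
    print(score, rule)
```

The two rules built from a pre-built and a literal fall into one group, so only the higher-scoring one survives, and the final set contains one representative per combination of primitive types.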
5. Present the rules and get user feedback.
Finally, the user can filter out the rules by accepting or rejecting extractions:
In this article, we describe a human-in-the-loop text extraction system called SEER. SEER learns a rule-based model and interacts with the user to help them quickly and accurately accomplish their extraction tasks with little to no understanding of code. If you would like to read more about the user study we conducted with SEER, please read the longer paper here.