Machine learning is hard: models need to be scoped; data needs to be collected, labelled, and debugged; and expensive hardware needs to be configured. Then the whole process needs to be repeated at a regular cadence once everything is up and running. That is to say, successfully executing a single iteration of the machine learning lifecycle can entail a lot of technical overhead and complexity.
Often, this complexity crowds out the essential purpose of applied machine learning: to actually “move the needle” and do something useful for an organisation. The simple truth is that most clients just want a model that works, with minimal cost and complexity.
So where does that leave us? Well, I believe that, where possible, initial ML solutions should be simple and should appropriately “ramp” in their complexity as needed, if at all. Hell, maybe you just need a database instead of a BERT? Safe to say it’s also extremely awkward to have something like this pointed out to you at the end of a project in a room full of stakeholders, so why not put the cart back behind the horse and try something cheaper first?
Tying this back to the realm of NLP, and specifically document classification, one of the most effective tools I’ve come to use to vet whether a model has legs in the first place and/or to establish a supervised baseline is the dictionary-based classification model. Put simply, dictionary-based classifiers offer:
- Flexibility. Quickly add labels to (and remove them from) the label space in successive iterations, without the overhead of propagating such changes across a labelled dataset. This frees up time and resources to focus on the domain relevance of the model you’re building, instead of on debugging complexity.
- Speed and simplicity. As we’ll see in the implementation section, a dictionary-based classifier can be built using (almost) pure python, whilst the main operation is a set membership check. No CUDA tears.
- Transparency and programmability. Subject matter expert heuristics and industry knowledge can be explicitly programmed into the classifier. Term lists appear frequently in the work of analysts and often exist before specific machine learning projects. Where they do exist, dictionary classifiers allow for efficient resource reuse.
- Data agnostic. Probably the biggest reason to use dictionary classifiers is that they don’t require labelled data to get started and can function decently with just a little imagination. Indeed, I’ve found dictionary classifiers to be a great starting point for building labelled datasets via pre-annotation, which eventually feed more complex models.
So that’s nice, but there are also some drawbacks, the obvious ones being:
- Generalisability. Since dictionary classifiers are explicitly programmed, there is a risk that inputs that lie outside of these explicit boundaries are discarded and/or incorrectly classified.
- Lack of orthodoxy. AFAIK, and broadly speaking within open-source NLP libraries, heuristic-driven model development exists as a component concept within broader, weakly supervised frameworks like snorkel and skweak that often recombine labelling functions within a downstream modelling process.
Before building a new library, it’s probably wise to ensure that such a thing doesn’t already exist in the world. Sklearn is an obvious starting point. I found an explicit reference to dictionary learning (no relation) as well as the aforementioned DummyClassifier (too rigid). The closest thing I could find was sklearn’s text-based feature extractors, which encapsulate the word:count vocabulary data structure we ideally want. Below is a traditional application (fit/transform) of the CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

# retrieve BOW vocab using the usual sklearn fit method
texts = ["hello", "world", "this", "is", "me"]
cv = CountVectorizer()
cv.fit(texts)
cv.vocabulary_
# {'hello': 0, 'world': 4, 'this': 3, 'is': 1, 'me': 2}
Simple and straightforward, but one problem is that we’d prefer to “inject” our domain knowledge at the point of instantiation, in a class/label-specific way, instead of AFTER, during a subsequent call to fit (which also requires a labelled dataset). Perhaps something like this:
from sklearn.feature_extraction.text import CountVectorizer

# explicitly define vocabulary at point of instantiation
texts = ["hello", "world", "this", "is", "me"]
cv = CountVectorizer(vocabulary=texts) # err
So sklearn comes close to what we want but doesn’t allow us to specify the detail of our classes. Cool, that’s enough due diligence for me, onto clear BOW.
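For the record, passing a flat vocabulary to CountVectorizer does work mechanically; the sticking point is that the resulting counts carry no notion of which class each term belongs to. A quick sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# a pre-defined, flat vocabulary is perfectly legal...
terms = ["hello", "world", "this", "is", "me"]
cv = CountVectorizer(vocabulary=terms)

# ...but transform only yields per-term counts; there's nowhere to say
# "these terms belong to label A, those terms to label B"
counts = cv.transform(["hello world"]).toarray()
# columns follow the supplied vocabulary order: hello, world, this, is, me
```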
Naming things is hard, and this project probably deserves a better name, but the gist I was getting at is that you can clearly define the Bag of Words that exists within a classifier, instead of generating the bag in the usual fashion. So how does it work? The basic IO is that you instantiate a DictionaryClassifier instance and supply a str:list dictionary containing the classes of interest and some associated terms:
# some class definitions for a superannuation-themed topic classifier
super_dict = {
"regulation": ["asic", "government", "federal", "tax"],
"contribution": ["contribution", "concession", "personal", "after tax", "10%", "10.5%"],
"covid": ["covid", "lockdown", "downturn", "effect"],
"retirement": ["retire", "house", "annuity", "age"],
"fund": ["unisuper", "aus super", "australian super", "sun super", "qsuper", "rest", "cbus"],
}
dc = DictionaryClassifier(label_dictionary=super_dict)
The outputs, meanwhile, closely mimic those of a traditional classifier. In the example below, the values of a multi-class prediction sum to 1.0, simulating the effect of applying a softmax function to model logits:
dc.predict_single("A 10% contribution is not enough for a well balanced super fund!")
# {'regulation': 0.0878,
# 'contribution': 0.6488,
# 'covid': 0.0878,
# 'retirement': 0.0878,
# 'fund': 0.0878}
Cool cool; so its output looks similar enough and doesn’t do anything too freaky, but how does it work? Well, taking the above superannuation example and some example docs, there are two main internal functions worth unpacking within the DictionaryClassifier class:
def _get_label_word_count(self, text):
    tally = {}
    for k, v in self.label_dictionary.items():
        tally_temp = sum(e in text.lower() for e in v)
        tally[k] = tally_temp
    return tally

def _transform_predict_dict(self, pred_dict):
    # if all word counts are 0
    if all(x == 0 for x in pred_dict.values()):
        prob_dict = {k: 0.0 for k in pred_dict.keys()}
        prob_dict["no_label"] = 1.0
        return prob_dict
    elif self.classifier_type == "multi_class":
        return dict(zip(pred_dict.keys(), self._softmax_array(list(pred_dict.values()))))
    elif self.classifier_type == "multi_label":
        return dict(zip(pred_dict.keys(), self._sigmoid_array(list(pred_dict.values()))))
When a call to DictionaryClassifier.predict is made, word counts for all of the associated terms within each class are calculated and tallied via _get_label_word_count. Some basic lowercasing is also performed here as a lazy attempt to generalise matches across as many inputs as possible. Of course, if the particular domain features lots of abbreviation-heavy language, or important context where casing matters, this can be a problem. Perhaps a use_lowercase param should be hoisted as a configurable part of the class at some point in the future?
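One more subtlety worth flagging: e in text.lower() is a substring check, not a token match, so short terms can fire inside longer words. Below is a standalone re-implementation of the tally logic (hypothetical label set, not the library’s exact code) showing both the lowercasing win and the substring gotcha:

```python
label_dictionary = {
    "fund": ["rest", "cbus"],
    "retirement": ["retire", "age"],
}

def get_label_word_count(text):
    # mirrors the classifier's tally: lowercased substring membership per term
    return {
        label: sum(term in text.lower() for term in terms)
        for label, terms in label_dictionary.items()
    }

# lowercasing generalises across casings: "REST" and "Cbus" both match
upside = get_label_word_count("REST and Cbus")
# ...but "rest" also fires inside "interest", and "age" inside "averages"
gotcha = get_label_word_count("compound interest averages out")
```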
At this point, we have word frequency tallies for each class of interest. Using the above example, the word counts look like this:
# ding ding, the answer is probably going to be "contribution"
{'regulation': 0, 'contribution': 2, 'covid': 0, 'retirement': 0, 'fund': 0}

# also, .values() kind of looks like a vector..
[0, 2, 0, 0, 0]
We then invoke _transform_predict_dict, which applies either a softmax (multi-class) or a sigmoid (multi-label) function across the values of the word count dictionary and, bada bing bada boom, we have model predictions that look and smell like typical model predictions.
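As a sanity check on the prediction shown earlier, the multi-class path really is just a softmax over the count vector (quick back-of-envelope with numpy):

```python
import numpy as np

# word count tallies in label order: regulation, contribution, covid, retirement, fund
counts = np.array([0, 2, 0, 0, 0], dtype=float)

# softmax: e^2 dominates the four e^0 terms, so "contribution" wins
probs = np.exp(counts) / np.exp(counts).sum()
```

Rounded to four decimal places, this reproduces the 0.6488/0.0878 split from the predict_single example above, and the values sum to 1.0 as advertised.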
Some additional nice-to-haves are the serialisation methods, which I’ve modelled using a spacy-like design:
# I always thought the to/from_disk pattern for models/components was neat
dc.to_disk('/Users/samhardy/Desktop/dict_classifier')

# reload, use as before etc.
dc = DictionaryClassifier('/Users/samhardy/Desktop/dict_classifier')
This allows DictionaryClassifiers to slot into a common ML-ops pattern of loading a model from a bucket URI, just like other models that require weights/binaries to be loaded. Or, you can just store the dictionary in JSON form somewhere else and re-define the classifier on the fly!
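That last option is a couple of lines in each direction, since the whole “model” is just the label dictionary (sketch below; the file path is arbitrary):

```python
import json
import os
import tempfile

super_dict = {
    "regulation": ["asic", "government", "federal", "tax"],
    "fund": ["unisuper", "aus super", "australian super"],
}

# persist: the entire model state is one small JSON object
path = os.path.join(tempfile.gettempdir(), "super_dict.json")
with open(path, "w") as f:
    json.dump(super_dict, f)

# ...and re-define on the fly: load, then hand straight to the classifier
with open(path) as f:
    loaded = json.load(f)
# dc = DictionaryClassifier(label_dictionary=loaded)
```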
Repo here, and/or let rip via pip install clear_bow