Expert models are one of the most useful inventions in Machine Learning, yet they rarely receive the attention they deserve. Expert modeling not only allows us to train neural networks that are “outrageously large” (more on that later); it also allows us to build models that learn more like the human brain does, with different regions specializing in different types of input.
In this article, we’ll take a tour of the key innovations in expert modeling that ultimately led to recent breakthroughs such as the Switch Transformer and the Expert Choice Routing algorithm. But first, let’s go back to the paper that started it all: “Mixtures of Experts”.
Mixtures of Experts (1991)
The idea of mixtures of experts (MoE) traces back more than three decades, to a 1991 paper co-authored by none other than the godfather of AI, Geoffrey Hinton. The key idea in MoE is to model an output “y” by combining a number of “experts” E, the weight of each being controlled by a “gating network” G:

y = Σᵢ Gᵢ(x) Eᵢ(x)
An expert in this context can be any kind of model, but is usually chosen to be a multi-layer neural network, while the gating network is typically a softmax over a linear transformation of the input:

G(x) = softmax(x · W)
where W is a learnable matrix that assigns training examples to experts. When training MoE models, the learning objective is therefore two-fold:
- the experts will learn to process the inputs they’re given into the best possible outputs (i.e., predictions), and
- the gating network will learn to “route” the right training examples to the right experts, by jointly learning the routing matrix W.
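The two learnable pieces described above, the experts and the routing matrix W, can be sketched in a few lines of NumPy. This is a toy illustration, not the paper’s implementation: the experts are reduced to single linear maps for brevity, and all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # numerically stable softmax along the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_in, d_out, n_experts = 8, 3, 4  # toy dimensions, not from the paper

# Each "expert" is a single linear map here; in practice it would be
# a multi-layer neural network.
experts = [rng.standard_normal((d_in, d_out)) for _ in range(n_experts)]

# The gating network: the learnable routing matrix W, followed by a softmax.
W = rng.standard_normal((d_in, n_experts))

def moe_forward(x):
    gates = softmax(x @ W)                             # (batch, n_experts)
    outs = np.stack([x @ E for E in experts], axis=1)  # (batch, n_experts, d_out)
    return (gates[..., None] * outs).sum(axis=1)       # gate-weighted sum

x = rng.standard_normal((5, d_in))
y = moe_forward(x)
print(y.shape)  # (5, 3)
```

Because the gate weights W and the expert parameters all sit in one differentiable graph, gradients flow through both during training, which is exactly how the routing and the expert specialization end up being learned jointly.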
Why should one do this? And why does it work? At a high level, there are three main motivations for using such an approach:
First, MoE allows scaling neural networks to very large sizes due to the sparsity of the resulting model, that is, even though the overall model is large, only a small…