The “Attention is All You Need” transformer revolution has had a profound effect on the design of deep learning model architectures. Not long after BERT, there was RoBERTa, ALBERT, DistilBERT, SpanBERT, DeBERTa, and many more. Then there’s ERNIE (which still goes strong with “Ernie 4.0”), GPT series, BART, T5, and so it goes. A museum of transformer architectures has formed on the HuggingFace side panel, and the pace of new models has only quickened. Pythia, Zephyr, Llama, Mistral, MPT, and many more, each making a mark on accuracy, speed, training efficiency, and other metrics.
By model architecture, I’m referring to the computational graph underlying the execution of the model. For example, below is a snippet from Netron showing part of the computational graph of T5. Each node is either an operation or a variable (input or output of an operation), forming a node graph architecture.
Even though there are so many architectures, we can be pretty sure that the future holds even more modifications and new breakthroughs. But each time, it’s human researchers who have to understand the models, make hypotheses, troubleshoot, and test. While there’s boundless human ingenuity, the task of understanding architectures gets tougher as the models get larger and more complex. With AI guidance, perhaps humans can discover model architectures that would take many more years or decades to discover without AI assistance.
Intelligent Model Architecture Design (MAD) is the idea that generative AI can guide scientists and AI researchers to better, more effective model architectures faster and easier. We already see large language models (LLMs) providing immense value and creativity for everything from summarizing, analyzing, image generation, writing assistance, code generation, and much more. The question is, can we harness the same intelligent assistance and creativity for designing model architecture as well? Researchers could be guided by intuition and prompt the AI system with their ideas, like “self attention that scales hierarchically”, or even for more specific actions like “add LoRA to my model at the last layer”. By associating text-based descriptions of model architectures, e.g., using Papers with Code, we could learn what kinds of techniques and names are associated with specific model architectures.
First, I’ll start with why model architecture is important. After that, I’ll cover some of the trajectories towards intelligent MAD in neural architecture search, code assistance, and graph learning. Finally, I’ll put together some project steps and discuss some of the implications of AI designing and self-improving via autonomous MAD.
Andrew Ng’s push for and coining of “data-centric” AI was very important for the AI field. With deep learning, the ROI for having clean, high-quality data is immense, and this is realized in every phase of training. For context, the era right before BERT in the text classification world was one where you wanted an abundance of data, even at the expense of quality. It was more important to have representation via examples than for the examples to be perfect. This is because many AI systems did not use pre-trained embeddings (or they weren’t any good, anyway) that a model could leverage for practical generalizability. In 2018, BERT was a breakthrough for downstream text tasks, but it took even more time for leaders and practitioners to reach consensus, and the idea of “data-centric” AI helped change the way we feed data into AI models.
Today, there are many who see the current architectures as “good enough” and believe it’s much more important to focus on improving data quality than on editing the model. There is now a huge community push for high-quality training datasets, like Red Pajama Data for example. Indeed, we see that many of the great improvements between LLMs lie not in model architecture but in data quality and preparation methodology.
At the same time, every week there is a new method involving some kind of model surgery that shows great impact on training efficiency, inference speed, or overall accuracy. When a paper claims to be “the new transformer,” as RetNet did, it has everyone talking. Because as good as the existing architectures are, another breakthrough like self-attention would have a profound impact on the field and on what AI can be productionized to accomplish. And even for small breakthroughs, training is expensive, so you want to minimize the number of times you train. Thus, if you have a specific goal in mind, MAD is also important for getting the best return on your investment.
Transformer architectures are huge and complex and that makes it more difficult to focus on model-centric AI. We’re at a time when generative AI methods are becoming more advanced and intelligent MAD is in sight.
The premise and goal of Neural Architecture Search (NAS) is aligned with intelligent MAD: to alleviate the burden on researchers of designing and discovering the best architectures. Generally, this has been realized as a kind of AutoML where the hyper-parameters include architecture designs, and I’ve seen it incorporated into many hyper-parameter search configurations.
A NAS dataset (i.e., a NAS benchmark) is a machine learning dataset that maps architectures in a fixed search space to their evaluated performance, so that search strategies can be compared without training every candidate from scratch.
At the end of the day, there are only about 250 primitive-level tensor operations in a deep learning tensor library like TensorFlow or PyTorch. If the search space were built from first principles and included all possible models, it would contain the next SOTA architecture variation. But in practice, this is not how NAS is set up. Techniques can take the equivalent of 10 years of GPU compute, and that’s when the search space is relaxed and limited in various ways. Thus, NAS has mostly focused on recombining existing components of model architectures. For example, NAS-BERT used a masked modeling objective to train over smaller variations of BERT architectures that perform well on the GLUE downstream tasks, thus functioning to distill or compress itself into fewer parameters. The Autoformer did something similar with a different search space.
Efficient NAS (ENAS) overcomes the problem of needing to exhaustively train and evaluate every model in the search space. This is done by first training a super-network containing many candidate models as sub-graphs that share the same weights. In general, parameter sharing between candidate models makes NAS more practical and allows the search to focus on the architecture variations that best use the existing weights.
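To make the weight-sharing idea concrete, here is a toy illustration (not the actual ENAS algorithm, and all operation names are hypothetical): every candidate “architecture” is a path through a shared pool of weighted operations, so scoring a candidate reuses the same weights instead of training it from scratch.

```python
# Toy ENAS-style weight sharing: candidates are paths through shared ops.
shared_weights = {"conv": 0.5, "skip": 1.0, "pool": 0.25}  # hypothetical ops

def evaluate(path, x=2.0):
    # Stand-in forward pass: apply each chosen op as a multiply by its
    # shared weight, then measure distance from a target value of 1.0.
    for op in path:
        x *= shared_weights[op]
    return abs(x - 1.0)  # toy "loss"; lower is better

# The search scores every 2-op candidate path against the same shared weights,
# without any per-candidate training.
ops = list(shared_weights)
candidates = [(a, b) for a in ops for b in ops]
best = min(candidates, key=evaluate)
print(best, evaluate(best))
```

In a real super-network the shared weights are trained jointly and candidates are sampled, but the key property is the same: evaluation is cheap because no candidate is trained in isolation.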
From the generative AI perspective, there is an opportunity to pre-train on model architectures and use this foundation model for generating architectures as a language. This could be used for NAS as well as a general guidance system for researchers such as using prompts and for context-based suggestions.
The main question from this point of view is whether to represent architectures as text or directly as graphs. We have seen the recent rise of AI code-generation assistance, and some of that code is the PyTorch, TensorFlow, Flax, etc., related to deep learning model architecture. However, code generation has numerous limitations for this use case, mostly because much of code generation is about the surface form, i.e., the text representation.
On the other hand, Graph Neural Networks (GNNs) like graph transformers are very effective because graph structure is everywhere, including MAD. The benefit of working on graphs directly is that the model is learning on an underlying representation of the data, closer to the ground truth than the surface-level text representation. Even with some recent work to make LLMs better at generating graphs, like InstructGLM, there is promise for graph transformers in the limit, and especially in conjunction with LLMs.
Whether you use GNNs or LLMs, graphs are better representations than text for model architecture because what’s important is the underlying computation. The API for TensorFlow and PyTorch is constantly changing, and the lines of code deal with more than just model architecture, such as general software engineering principles and resource infrastructure.
There are several ways to represent model architectures as graphs, and here I review just a few categories. First, there are machine learning compilers like GLOW, MLIR, and Apache TVM, which can compile code such as PyTorch into intermediate representations that take the form of a graph. TensorFlow already has an intermediate graph representation you can visualize with TensorBoard. There’s also the ONNX format, which can be compiled from an existing saved model, e.g., using HuggingFace, as easily as:
```shell
optimum-cli export onnx --model google/flan-t5-small flan-t5-small-onnx/
```
Serialized, this ONNX graph is essentially a long list of operation nodes together with the named tensors that connect them.
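To give a feel for that structure, here is a stdlib-only sketch that mirrors the general shape of an ONNX GraphProto (nodes with op types, named input/output tensors, stored initializers); it is illustrative, not the exact ONNX schema:

```python
import json

# Minimal sketch of a serialized two-operation graph: y = x @ W + b.
# Field names echo ONNX's GraphProto, but this is not the real schema.
graph = {
    "name": "toy_graph",
    "node": [
        {"op_type": "MatMul", "input": ["x", "W"], "output": ["xW"]},
        {"op_type": "Add", "input": ["xW", "b"], "output": ["y"]},
    ],
    "input": ["x"],
    "output": ["y"],
    "initializer": ["W", "b"],  # weights stored inside the graph
}
print(json.dumps(graph, indent=2))
```

Note how edges are implicit: two nodes are connected whenever one node’s output name appears in another node’s input list.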
One problem with these compiled intermediate representations is that they are difficult to understand at a high level. An ONNX graph of T5 in Netron is immense. A more human-readable option for model architecture graphs is Graphbook. The free-to-use Graphbook platform can show you the values produced by each operation in the graph, can show you where tensor shapes and types don’t match, and supports editing. In addition, AI-generated model architectures may not be perfect, so having an easy way to go inside the graph, edit it, and troubleshoot why it doesn’t work is very useful.
While Graphbook models are just JSON, they are hierarchical and therefore allow better levels of abstraction for model layers. See below, a side view of the hierarchical structure of GPT2’s architecture.
This is an outline of a proposal for generative MAD. I wanted to include these sub-steps to be more concrete about how one would approach the task.
- Code-to-Graph. Create a MAD dataset from code, such as HuggingFace model cards, converting each model to a graph format such as ONNX or Graphbook.
- Create datasets. These would be datasets for tasks like classifying the operation type of an operation in the graph, classifying the graph itself, masking/recovering operations and links, detecting when a graph is incomplete, converting an incomplete graph into a complete one, etc. These can be self-supervised.
- Graph-Tokenizer. Tokenize the graph. For example, let each variable in the graph be a unique vocabulary ID and generate the adjacency matrix that can feed into GNN layers.
- GNN design. Develop a GNN that uses the graph tokenizer output to feed through transformer layers.
- Train and Test. Test the GNN on the datasets, and iterate.
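The graph-tokenizer step above can be sketched concretely. This toy example (all node names are illustrative) assigns each node a vocabulary ID and builds the adjacency matrix a GNN layer could consume:

```python
# Hypothetical graph tokenizer: edges of a tiny computational graph.
edges = [("input", "MatMul"), ("MatMul", "Add"), ("Add", "output")]

# Build the vocabulary from the nodes seen in the graph, in order.
nodes = []
for src, dst in edges:
    for n in (src, dst):
        if n not in nodes:
            nodes.append(n)
vocab = {name: i for i, name in enumerate(nodes)}

token_ids = [vocab[n] for n in nodes]        # one token per node
adj = [[0] * len(nodes) for _ in nodes]      # dense adjacency matrix
for src, dst in edges:
    adj[vocab[src]][vocab[dst]] = 1

print(token_ids)
print(adj)
```

A real tokenizer would also need to handle attributes (tensor shapes, dtypes) and much larger vocabularies, but the token-IDs-plus-adjacency output is the interface the GNN design step (step 4) would consume.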
Once these steps are fleshed out, they could be used as part of a NAS approach to help guide the design of the GNN (step 4).
I want to give some notes on the implications of autonomous MAD. The implication of AI being able to design model architectures is that it can improve the structure of its own brain. With some kind of chain/graph-of-thought process, the model could iteratively generate successor states for its own architecture and test them out.
- Initially, the AI has a given model architecture, is trained on specific data for generating model architectures, and can be prompted to generate architectures. It has access to the training source, which includes its own architecture design along with a variety of tests around architecture tasks like graph classification, operation/node classification, link completion, etc. These follow the general tasks you find in the Open Graph Benchmark.
- Initially, at the application level there is some kind of agent that can train and test model architectures and add the results to the AI’s training source, and perhaps it can prompt the AI with some kind of instructions about what works and what doesn’t.
- Iteratively, the AI generates a set of new model architectures; an agent (let’s call it the MAD-agent) trains and tests them, gives them a score, adds the results to the training source, directs the model to retrain itself, and so on.
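The loop in the steps above can be sketched as follows. Everything here is a toy stand-in: `train_and_score` and `ToyGenerator` are hypothetical placeholders for a real train-and-evaluate harness and a generative architecture model.

```python
import random

random.seed(0)

def train_and_score(arch):
    # Stand-in for training and evaluating an architecture: score its
    # encoding with a cheap proxy metric.
    return sum(arch)

class ToyGenerator:
    """Hypothetical generative MAD model."""

    def generate_candidates(self, n):
        # Propose n random architecture encodings (lists of op choices).
        return [[random.randint(0, 9) for _ in range(3)] for _ in range(n)]

    def retrain(self, training_source):
        # Stand-in: a real system would fine-tune on the scored architectures.
        pass

def mad_agent_loop(model, rounds=3, n_candidates=4):
    training_source = []
    for _ in range(rounds):
        candidates = model.generate_candidates(n_candidates)
        scored = [(arch, train_and_score(arch)) for arch in candidates]
        training_source.extend(scored)   # feed results back as training data
        model.retrain(training_source)   # the model retrains on its own trials
    return max(training_source, key=lambda pair: pair[1])

best_arch, best_score = mad_agent_loop(ToyGenerator())
print(best_arch, best_score)
```

The essential feedback structure is that the generator’s own outputs, once scored, become its next round of training data.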
In essence, instead of only using AutoML/NAS to search the space of model architectures, represent model architectures as graphs and use graph transformers to learn from and generate them. The graph-based dataset is then itself made of model architectures represented as graphs: the model architectures that learn from graph datasets and the space of possible model architectures being learned become one and the same.
What’s the implication? Whenever a system has the capability of improving itself, there is some potential risk of a runaway effect. If one designed the above AND it was designed in the context of a more complex agent, one where the agent could indefinitely pick data sources and tasks and coordinate itself into becoming a multi-trillion-parameter end-to-end deep learning system, then perhaps there is some risk. But the unstated difficult part is the design of that more complex agent, the resource allocation, and the many challenges of supporting the wider set of capabilities.
AI techniques in autonomous AI model architecture design (MAD) may help AI researchers in the future discover new breakthrough techniques. Historically, MAD has been approached through neural architecture search (NAS). In conjunction with generative AI and transformers, there could be new opportunities to aid researchers and make discoveries.