The rise and rise of AI text-to-image generators
I was young when I first listened to the song Video Killed the Radio Star by the Buggles. I thought it was a fun and catchy song, almost like a jingle, even though I had no idea what the song was all about. But that was ok, I was just a young boy listening to the radio and watching videos on VCR machines. I didn’t really pay much attention to the title or the lyrics.
It was much later I realised what the lyrics actually meant.
Video killed the radio star
Video killed the radio star
In my mind and in my car
We can’t rewind, we’ve gone too far
Pictures came and broke your heart
Put the blame on VCR
Video Killed the Radio Star (Downes/Woolley/Horn)
The song was part of the Age of Plastic album, which had themes of nostalgia and anxiety about the effects of modern technology. The album was released more than 40 years ago, in 1980, yet these themes still ring true and clear.
You know a technology has made it to the big leagues when TikTok starts rolling it out in their app. Just this week alone I have read several top 10 lists of AI text-to-image generators. The technology is blistering hot, and models and startups are mushrooming overnight. Entire ways of thinking are being overthrown every other week or month; yesterday’s mind-blowing technology is today’s old news.
The most well-known text-to-image AI model is, of course, OpenAI’s Dall-E, released in January 2021. OpenAI released another model, GLIDE (it’s ok, I’ll talk about acronyms later), in December 2021 and not long after, Dall-E 2 was announced in April 2022. The pace of release is break-neck, to say the least.

Google’s highly-anticipated Imagen was announced in June 2022 and even though it’s not released yet, legions of media reports started extolling how it completely outperformed the king-of-the-hill Dall-E 2. Meta (Facebook) announced their text-to-image tool, Make-A-Scene, in July 2022. Make-A-Scene is a bit different from the others in that it allows users to supplement text prompts with a simple sketch if they choose.
Even Microsoft has jumped on the bandwagon and announced NUWA-Infinity, an AI model that can generate high-quality images and videos from any given text, image or video input.
Yes, video.
The big companies don’t have a monopoly over the text-to-image gold rush though. There are plenty of small startups in the same space. Midjourney is a small startup/research lab (fewer than 10 people) with a popular following of users who use its tools to generate art pieces. It went into open beta in July 2022.
NightCafe is another popular tool, based out of Australia. The company started in November 2019 as a neural style-transfer application, but when the newer AI models came on the scene in mid-2021 it quickly switched over to them, and today it boasts of having over 5 million artworks created on its platform.
The new darling of small AI text-to-image startups must be Stability AI. It went into open beta in August 2022 and is closely tied to the new and powerful stable diffusion text-to-image AI model that it sponsored, developed at the Ludwig Maximilian University of Munich. Stability AI went a step further than the other small startups: not only does it have its own powerful AI model, it also took the bold step of releasing the model as open source.
The new stable diffusion model is not the only open source model though.
On June 6, Hugging Face, a company that hosts open source AI projects and applications, suddenly saw traffic to an AI text-to-image application it hosted, called Dall-E mini, explode. Dall-E mini, as the name suggests, is a take on the Dall-E model.
The application was a simple one: it just creates 9 images out of a text prompt. What made it wildly popular was that it was available and it was free, unlike its original namesake Dall-E, which required joining a waitlist for a beta. It very quickly hit 50,000 images a day, eventually forcing the creators to move the application out of Hugging Face altogether. It even had to change its name (to Craiyon) when OpenAI took notice and was uncomfortable with the use of the Dall-E name.
Another popular open source text-to-image model is VQGAN+CLIP, named as such because it combines 2 different AI models (VQGAN and CLIP) to create a powerful text-to-image model. Around April 2021, Ryan Murdock combined CLIP with an image-generating GAN named BigGAN. Later Katherine Crowson and some others replaced BigGAN with VQGAN and put it on a Google Colab notebook. And the rest is open source history.
As of today, Replicate, an AI model hosting company (with great APIs), lists 32 open source text-to-image models, including stable diffusion, various GAN+CLIP models and of course, Dall-E mini.
The AI industry is filled with amazing hard-to-decipher acronyms and jargon, and the text-to-image sub-industry is just as obscure, if not more so. If you wade through the technology you will at various times hear about GANs, CLIP, transformers and diffusion models. The technology terms can be mind-boggling (naturally, they are the results of significant research effort after all) but I’ll try to explain them as straightforwardly as possible. Let’s start with the technical components and we’ll move on from there. I won’t attempt to explain neural networks because that’s a basic given. If you’re not sure what they are and would like a quick primer, I wrote an article about it some time back.
Transformer
A transformer is a type of neural network. Stanford researchers called transformers foundation models in an August 2021 paper because they see them as driving a paradigm shift in AI.
Neural networks have been used to analyse complex data types like images, video, audio and text and different types of neural networks are designed for different types of data. For example, CNNs (convolutional neural networks) are often used for image data.
Unfortunately there hasn’t been a really good neural network for language and text. RNNs (recurrent neural networks) were often used prior to transformers, but they are hard to train and struggle with long passages of text.
The transformer neural network was developed in 2017 by researchers at Google and the University of Toronto, initially designed for language tasks. Transformers are easier to train because they can be efficiently parallelized, and they effectively replaced RNNs as the main neural network to use for, eventually, a lot of things.
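To make the “parallelized” part concrete, here is a toy sketch of the scaled dot-product attention at the heart of a transformer: every token in a sequence is processed in one set of matrix multiplications, instead of one step at a time like an RNN. All the numbers and sizes here are purely illustrative.

```python
# A toy sketch of scaled dot-product attention, the core of a transformer.
# Every position attends to every other position in one batch of matrix
# multiplications, which is why transformers parallelise so well.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 8, 16                     # 8 tokens, 16-dimensional embeddings
tokens = np.random.randn(seq_len, d_model)   # stand-in for embedded input tokens

# Learned projections (random here) turn each token into a query, key and value
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Each token scores every other token, then mixes their values accordingly
scores = softmax(Q @ K.T / np.sqrt(d_model))
output = scores @ V                          # same shape as the input: (8, 16)
print(output.shape)
```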

One of the most famous transformers today is OpenAI’s GPT-3 and it’s a massive model that was trained on 45 TB of text data, including almost the entire public web.
While it was originally designed for natural language processing (i.e. working with language and text), it has been used for many other things, including text-to-image generators. As you can see, transformers are everywhere. In fact, the rest of the technologies here are either transformers outright or have transformers as part of the model.
CLIP
CLIP (Contrastive Language-Image Pre-training) is a neural network that identifies objects in pictures and provides text snippets to describe them. This sounds like your good old image classifier but it’s more powerful than that. The image classifiers you’re used to can identify objects in a picture only if they are trained on labeled data, and if an object doesn’t fit into any of the categories it won’t be identified.
CLIP on the other hand is trained not on labeled image datasets but on 400 million images and their text descriptions, taken from the Internet. As a result it doesn’t identify objects by fixed categories but instead picks, from a list of candidate text snippets, the one that best describes the image.
CLIP also creates a sort of image-text dictionary that allows translation between image and text, and this is very helpful for text-to-image AI models. CLIP itself uses 2 different transformers as encoders: the image encoder is a vision transformer while the text encoder is based on GPT-2.
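To make that concrete, here is a minimal sketch of CLIP picking the best-matching caption for an image, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint. The image file and the candidate captions are made-up examples of mine.

```python
# A minimal sketch of zero-shot matching with CLIP, assuming the Hugging Face
# "transformers" library and the openai/clip-vit-base-patch32 checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")  # hypothetical image file
captions = ["a photo of a beach", "a photo of a cat", "an oil painting of a city"]

# Encode the image and the candidate captions, then compare them
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = better match between the image and that caption
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(captions, probs[0]):
    print(f"{caption}: {prob.item():.2%}")
```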
VQGAN
VQGAN (Vector Quantized Generative Adversarial Network) is another type of neural network, first developed by researchers from Heidelberg University in 2020. It consists of a GAN (generative adversarial network) that takes a set of images to learn features and a transformer that takes a sequence to learn the long-range interactions.
GANs are an interesting machine learning technique that pits 2 neural networks against each other in a never-ending fight for supremacy. One of the neural networks is called the generator and the other is the discriminator. The generator generates data that mimics real data while the discriminator tries to tell the real from the generated. Essentially GANs create their own training data, and the feedback loop between the generator and discriminator produces better and better results.
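To get a feel for that contest, here is a toy training loop in PyTorch. The tiny fully-connected networks and the random “real” data are purely illustrative stand-ins, not any actual image GAN.

```python
# A toy GAN training loop: the generator learns to fool the discriminator,
# the discriminator learns to tell real data from generated data.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, data_dim)       # stand-in for a batch of real images
    noise = torch.randn(32, latent_dim)
    fake = generator(noise)

    # Discriminator: learn to label real data 1 and generated data 0
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: learn to make the discriminator say "real" for generated data
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```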
BART
BART (Bidirectional Auto-Regressive Transformers) is a neural network created by Facebook AI researchers that combines Google’s BERT and OpenAI’s GPT techniques.
BERT (Bidirectional Encoder Representations from Transformers) has a bidirectional approach that is good for downstream tasks like classification that need information about the whole sequence, but not so good for generation tasks where the generated data only depends on previously generated data. GPT’s unidirectional, auto-regressive approach is good for text generation but not so good for tasks that require information about the whole sequence.
BART combines the best of both worlds by putting BERT’s encoder together with GPT’s decoder. It’s particularly good for text generation but also for text comprehension.
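As a rough illustration of that encoder-decoder split, here is a minimal sketch assuming the Hugging Face transformers library and its facebook/bart-large-cnn summarisation checkpoint; the input text is just an example of mine.

```python
# A minimal sketch of BART's encoder-decoder setup, assuming the Hugging Face
# "transformers" library and the facebook/bart-large-cnn checkpoint.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "Video killed the radio star. Pictures came and broke your heart."

# The encoder reads the whole input at once (BERT-style), then the decoder
# generates the output one token at a time (GPT-style).
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```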
Diffusion
Diffusion is a generative machine learning technique, meaning diffusion models create data that looks like the data they are trained on. Stable diffusion, for example, uses 3 neural networks: an autoencoder, a U-Net and a CLIP text encoder.
The main idea is that the model takes an image and randomly scrambles it, step by step, until it becomes pure noise. It then trains a neural network to reverse the process step by step, back to something that resembles the original image. This effectively generates images from randomness, and when given new random samples it can generate entirely new images!
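To make the scramble-then-reverse idea concrete, here is a toy sketch of a diffusion training loop. Real models like stable diffusion do this on image latents with a U-Net; the tiny network, the noise schedule values and the random stand-in data here are purely illustrative.

```python
# A toy diffusion sketch: add noise step by step (forward process), then train
# a small network to predict the noise so it can be removed (reverse process).
import torch
import torch.nn as nn

steps = 100
betas = torch.linspace(1e-4, 0.02, steps)            # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for _ in range(1000):
    x0 = torch.randn(32, 64)                         # stand-in for real training data
    t = torch.randint(0, steps, (32,))
    noise = torch.randn_like(x0)

    # Forward process: scramble x0 towards pure noise at timestep t
    a = alphas_cumprod[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise

    # Reverse process: the network learns to predict the noise that was added
    pred = denoiser(torch.cat([xt, t.unsqueeze(1).float() / steps], dim=1))
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```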
That was a lot of acronyms and technologies crammed into a few paragraphs! With a hopefully clearer idea of what the technologies do, let’s take a look at how the models are put together using these technologies. Let’s start with the models from OpenAI first.
Dall-E is basically a 12-billion-parameter version of GPT-3, a transformer model, trained on a dataset of text-image pairs.
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a CLIP-guided diffusion model with 3.5 billion parameters.
Dall-E 2 is quite different from Dall-E, and more similar to GLIDE. Like GLIDE, it is smaller than Dall-E with only 3.5 billion parameters. It also uses a diffusion model, guided by a CLIP model, which is then optimised and refined and trained with 650 million text-image pairs. It puts together concepts learnt from Dall-E, CLIP and GLIDE all at once.
VQGANs are good at generating images that look similar to each other while CLIP determines how well a prompt matches an image. The idea of putting them together first appeared in a Google Colab notebook, when Katherine Crowson, an artist and mathematician, inspired by Ryan Murdock’s work, combined the 2 models. Essentially, VQGAN generates candidate images and CLIP scores how well they match the prompt, steering the generation until an acceptable candidate is produced. This became wildly popular and there are a few models that use this approach, with their own tweaks in place. NightCafe for example, uses VQGAN+CLIP as the basis of its AI text-to-image generator.
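Here is a sketch of that feedback loop. The encode_text, decode_latent and embed_image helpers are toy stand-ins I have made up for the real CLIP and VQGAN calls; actual implementations typically nudge the VQGAN latent using gradients from the CLIP similarity score, which is the variant sketched below.

```python
# A toy sketch of the VQGAN+CLIP loop: decode a latent into an image, score it
# against the prompt with CLIP, and nudge the latent towards a better score.
import torch

# Toy stand-ins so the loop runs; in a real implementation these would be the
# CLIP text encoder, the VQGAN decoder and the CLIP image encoder.
def encode_text(prompt):
    return torch.ones(1, 512)

def decode_latent(latent):
    return latent.mean(dim=(2, 3))          # pretend "image" features

def embed_image(image):
    return torch.tanh(image @ torch.ones(image.shape[-1], 512))

prompt = "a lighthouse in a storm, oil painting"   # hypothetical prompt
text_embedding = encode_text(prompt)

latent = torch.randn(1, 256, 16, 16, requires_grad=True)   # VQGAN latent to optimise
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(300):
    image = decode_latent(latent)            # VQGAN turns the latent into an image
    image_embedding = embed_image(image)     # CLIP embeds the image

    # CLIP scores how well the image matches the prompt; repeat until acceptable
    loss = -torch.cosine_similarity(image_embedding, text_embedding, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```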

Another model that uses VQGAN+CLIP is the Dall-E mini I mentioned earlier. It is basically VQGAN+CLIP but uses BART to encode the text prompt for the VQGAN.
Stable diffusion, as the name suggests, uses the diffusion approach. In fact diffusion models are increasingly used by many companies as the basis of their AI models; Midjourney, for example, uses a diffusion model as well. Stable diffusion is an open source implementation of the latent diffusion model (LDM), pre-trained on 256×256 images and then fine-tuned on 512×512 images, all from a subset of the LAION-5B database.
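If you want to try it yourself, here is roughly what generating an image looks like, assuming the Hugging Face diffusers library, the CompVis/stable-diffusion-v1-4 checkpoint and a GPU; the prompt is just an example of mine.

```python
# A minimal sketch of generating an image with stable diffusion, assuming the
# Hugging Face "diffusers" library and the CompVis/stable-diffusion-v1-4 weights.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a Matisse-style painting of dancers by the sea"  # hypothetical prompt
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("dancers.png")
```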
All these technologies are fascinating but they also have real world consequences.
In August 2022, Jason Allen won the Colorado State Fair’s digital art competition with a piece of work called Théâtre D’opéra Spatial. Jason Allen however was no artist. The image was created using Midjourney: it took a text prompt from Allen and created an amazing piece of work that won him first prize.
Unsurprisingly it set off a backlash, with accusations of cheating, questions about the ethics of AI-generated art, and claims that this is essentially a high-tech form of plagiarism.
The last point is understandable. I used Midjourney myself to create this piece of work in just minutes. If you’re familiar with Henri Matisse’s work, this should definitely remind you of it.

On the other hand, new art-making technologies have always been controversial. Many painters and artists of the day were outraged by the invention of cameras. Charles Baudelaire, a French poet and art critic, said that photography, “by invading the territories of art, has become art’s most mortal enemy”. In a similar vein, David Byrne, a British photographer who won the title of Landscape Photographer of the Year 2012, was stripped of his title and winnings after judges ruled that he did too much image manipulation.
Just a couple of months before, someone wrote an article in MakeUseOf about AI text-to-image generators and ended it on this reassuring note.
Just like AI writing tools, while the end product seems “real” enough like it was made by a human, it still misses some things. Artists can add creativity, emotion, and a self-defined style that makes an artwork personal and original. Maybe years later, AI could evolve to that too, but artist jobs are safe as of now.
https://www.makeuseof.com/ai-text-to-art-generators/ (published June 11, 2022)
It now seems a bit dated. After all, if an AI model can come up with those pictures above, how safe are artist and illustrator jobs? As Allen told The New York Times when interviewed: “Art is dead, dude. It’s over. A.I. won. Humans lost.”
Very early on in my career, I listened to a talk by Guy Kawasaki, the legendary Apple Fellow and evangelist. He told the story of the ice harvesting business (what Kristoff did at the beginning of Frozen, the Disney movie) and talked about how technology changes are often not evolutionary but revolutionary.
A large ice harvesting industry existed in Scandinavian countries and North America during the 19th and early 20th century. Large blocks of ice were cut from lakes and rivers, stored in ice houses and then delivered to homes and businesses in warmer countries or during summer. At its peak in the 1890s, the US ice trade employed an estimated 90,000 people and exported ice as far as Hong Kong, Southeast Asia, the Philippines and the Persian Gulf.

However things changed rapidly during the early years of the 20th century. Ice factories producing artificial ice started replacing the natural ice trade, and a combination of scarce natural ice during warm winters, fears of contaminated natural ice and lower-cost competition from ice factories drove it to its eventual demise. In the years after World War I the entire ice trade collapsed, and along with it went the industry and its ice harvesting jobs.
The second change came even as the ice factories started taking over from the natural ice trade. In 1913, the first electric refrigerator for home and domestic use, the DOMELRE (Domestic Electric Refrigerator), was invented and sold. It was revolutionary and a success, and included innovations like automatic temperature control and a freezing tray for ice cubes. By 1944, 85% of American homes had refrigerators. Just as the ice factories replaced the ice trade, home refrigerators replaced the need for homes to buy ice from ice factories.

Interestingly, none of the ice trade companies made the jump from harvesting and selling ice to running ice factories, and similarly none of the ice factory companies made the jump to making refrigerators. Ice harvesters didn’t become ice factory workers and ice factory workers didn’t become refrigerator factory workers.
Just as photography changed the painting industry, and photo editing changed the art and photography industry, generative art in the form of AI text-to-image generators will change the art and illustration industry as well. How well the current artists and illustrators jump the curve no one really knows.
The transformer model was created to solve a natural language processing problem in 2017. In 2020 OpenAI released the highly popular GPT-3 model, using the transformer model. That branched into Codex in 2021, which is used for software code generation (now used in Github Copilot), and also into the Dall-E (2021) and Dall-E 2 (2022) models used for text-to-image generation.
It seems highly unlikely that text-to-image is the last frontier. There’s video, audio and music (text to singing, anyone?) and plenty more ahead. We are in for exciting times, but also some scary times.
Video killed the radio star, and maybe AI might be killing artists and illustrators. So who’s next?