We are likely just at the start of a new generative AI revolution.
Generative artificial intelligence refers to algorithms and models that are capable of creating content. The output of such models includes text, audio, and images that can often be mistaken for real human output.
Applications such as ChatGPT have shown that generative AI is no mere novelty. AI is now capable of following detailed instructions and seems to have a deep understanding of how the world works.
But how did we get to this point? In this guide, we will go through some of the key breakthroughs in AI research that have paved the way for this new and exciting generative AI revolution.
The Rise of Neural Networks
You can trace the origins of modern AI to the research on deep learning and neural networks in 2012.
In that year, Alex Krizhevsky and his team from the University of Toronto developed a highly accurate algorithm for classifying objects in images.
The state-of-the-art neural network, now known as AlexNet, was able to classify objects in the ImageNet visual database with a much lower error rate than the runner-up.
Neural networks are algorithms that use a network of mathematical functions to learn a particular behavior based on some training data. For example, you can feed a neural network medical data in order to train the model to diagnose a disease like cancer.
The hope is that the neural network slowly finds patterns in the data and becomes more accurate when given novel data.
AlexNet was a breakthrough application of a convolutional neural network, or CNN. The “convolutional” keyword refers to the addition of convolutional layers, which place more emphasis on data points that are close together.
While CNNs had already been proposed in the 1980s, they only started to gain popularity in the early 2010s, when advances in GPU technology made it practical to train them at scale.
The success of CNNs in the field of computer vision led to more interest in the research of neural networks.
Tech giants like Google and Facebook released their own AI frameworks (TensorFlow and PyTorch, respectively) to the public, and high-level APIs such as Keras gave users a friendly interface for experimenting with deep neural networks.
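To give a flavor of what that looks like in practice, here is a minimal sketch of a small convolutional network in Keras. The layer sizes, input shape, and ten-class output are illustrative placeholders rather than AlexNet’s actual architecture.

```python
# A minimal sketch of a small convolutional network in Keras.
# The layer sizes and the 10-class output are illustrative, not AlexNet's.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Convolutional layers emphasize local patterns: each filter looks at
    # a small patch of neighboring pixels at a time.
    layers.Conv2D(32, kernel_size=3, activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    # Flatten the feature maps and classify into one of 10 object categories.
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # given labeled training data
```

Given labeled training images, the `fit` call at the end is where the network slowly finds the patterns described above.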
CNNs were great at image recognition and video analysis but struggled with language-based problems. This limitation in natural language processing exists partly because images and text are fundamentally different kinds of data.
For example, if you have a model that classifies whether an image contains a traffic light, the traffic light in question can appear anywhere in the image. However, this sort of leniency does not work well with language. The sentences “Bob ate fish” and “Fish ate Bob” have vastly different meanings despite using the same words.
It had become clear that researchers needed to find a new approach to solve problems involving human language.
Transformers Change Everything
In 2017, a research paper titled “Attention Is All You Need” proposed a new type of network: the Transformer.
While CNNs work by repeatedly filtering small portions of an image, transformers connect every element in the data with every other element. Researchers call this process “self-attention”.
When trying to parse sentences, CNNs and transformers work very differently. While a CNN will focus on forming connections with words that are near each other, a transformer will create connections between each and every word in a sentence.
The self-attention process is an integral part of understanding human language. By zooming out and looking at how the entire sentence fits together, machines can have a clearer understanding of the sentence’s structure.
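To make the idea concrete, here is a stripped-down sketch of the scaled dot-product self-attention computation in NumPy. The token embeddings are random placeholders, and a real transformer would also apply learned query, key, and value projections before this step.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of token vectors X.

    Every token attends to every other token; the attention weights say how
    much each word should "look at" each other word in the sentence.
    """
    d = X.shape[-1]
    # For simplicity, queries, keys, and values are the embeddings themselves
    # (a real transformer uses separate learned projection matrices).
    scores = X @ X.T / np.sqrt(d)                   # (tokens, tokens) similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # weighted mix of all tokens

# Three made-up 4-dimensional embeddings for the words "Bob", "ate", "fish".
tokens = np.random.rand(3, 4)
print(self_attention(tokens).shape)  # (3, 4): one updated vector per word
```

Because the similarity matrix covers every pair of words, word order and long-range relationships can influence every output vector, which is exactly the “zooming out” described above.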
Once the first transformer models were released, researchers quickly used the new architecture to take advantage of the incredible amount of text data found on the internet.
GPT-3 and the Internet
In 2020, OpenAI’s GPT-3 model showed just how effective transformers can be. GPT-3 was able to output text that is almost indistinguishable from human writing. Part of what made GPT-3 so powerful was the sheer amount of training data used. Most of the model’s pre-training data comes from Common Crawl, a web-scraped corpus that contributed over 400 billion tokens.
While GPT-3’s ability to generate realistic human text was groundbreaking on its own, researchers discovered that the same model could solve many other tasks.
For example, the same GPT-3 model that you can use to generate a tweet can also summarize text, rewrite a paragraph, or finish a story. Language models have become so capable that they are now essentially general-purpose tools that can follow a wide range of instructions.
GPT-3’s general-purpose nature has enabled applications such as GitHub Copilot, which lets programmers generate working code from plain English.
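As an illustration, here is a rough sketch of how a summarization request to a GPT-3-family model looked with OpenAI’s Python client at the time. The API key, model name, and prompt are placeholders, and the client library’s interface has since changed.

```python
# A rough sketch of a GPT-3-era completion request with OpenAI's Python client.
# The API key, model name, and prompt are placeholders; newer versions of the
# library expose a different interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

paragraph = "Generative AI refers to models that can create text, audio, and images."

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3-family completion model
    prompt="Summarize the following paragraph in one sentence:\n\n" + paragraph,
    max_tokens=60,
)
print(response.choices[0].text.strip())
```

Swapping the prompt is all it takes to turn the same model into a tweet generator, a paraphraser, or a story writer, which is what makes it feel general-purpose.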
Diffusion Models: From Text to Images
The progress made with transformers and NLP has also paved the way for generative AI in other fields.
In the realm of computer vision, we’ve already covered how deep learning allowed machines to understand images. However, we still needed to find a way for AI to generate images themselves rather than just classify them.
Generative image models like DALL-E 2, Stable Diffusion, and Midjourney have become popular because of how they’re able to convert text input into images.
These image models rely on two key aspects: a model that understands the relationship between images and text and a model that can actually create a high-definition image that matches the input.
OpenAI’s CLIP (Contrastive Language–Image Pre-training) is an open-source model that aims to solve the first aspect. Given an image, the CLIP model can predict the most relevant text description for that particular image.
The CLIP model works by learning to map images and their text descriptions into a shared embedding space, so that an image and a caption that belong together end up with similar representations.
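Here is a small sketch of scoring candidate captions against an image with a released CLIP checkpoint via the Hugging Face transformers library. The image path and candidate captions are placeholders.

```python
# A small sketch of ranking candidate captions for an image with CLIP,
# using the Hugging Face transformers library. Image path and captions
# are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image
captions = ["a photo of a traffic light", "a photo of a cat", "a bowl of fish"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds an image-text similarity score for each caption;
# softmax turns the scores into probabilities over the candidates.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

The caption with the highest probability is CLIP’s best guess at the most relevant description of the image.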
When a user provides a text prompt to DALL-E 2, the prompt is first converted into a CLIP text embedding, which a separate “prior” model then translates into a corresponding image embedding. The goal now is to find a way to generate an image that matches that image embedding.
The latest generative image AIs use a diffusion model to tackle the task of actually creating an image. Diffusion models rely on neural networks that are trained to remove noise that has been added to images.
Through this training process, the neural network eventually learns how to turn pure random noise into a coherent, high-resolution image. Since CLIP already provides a mapping between text and images, we can condition a diffusion model on CLIP image embeddings, giving us a pipeline that can generate an image for nearly any text prompt.
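As a concrete example of a diffusion model in action, here is a minimal sketch of generating an image from a text prompt with Stable Diffusion via the Hugging Face diffusers library. The model ID, prompt, and output filename are placeholders, and a GPU is assumed.

```python
# A minimal sketch of text-to-image generation with a diffusion model,
# using the Hugging Face diffusers library and a Stable Diffusion checkpoint.
# Model ID, prompt, and filename are placeholders; a GPU is assumed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "an astronaut riding a horse, digital art"
# Under the hood, the pipeline starts from random noise and repeatedly
# denoises it, guided by the text prompt, until an image emerges.
image = pipe(prompt).images[0]
image.save("astronaut.png")
```

Each call hides dozens of denoising steps; the text conditioning is what steers that denoising toward an image that matches the prompt.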
Generative AI Revolution: What comes next?
We are now at a point where breakthroughs in generative AI are happening every couple of days. With it becoming easier and easier to generate different types of media using AI, should we be worried about how this could affect our society?
Worries about machines replacing workers have been part of the conversation since the invention of the steam engine, but it seems a bit different this time around.
Generative AI is becoming a multipurpose tool that may disrupt industries that were deemed safe from an AI takeover.
Will we need programmers if AI can start writing flawless code from a few basic instructions? Will people hire creatives if they can just use a generative model to produce the output they want for cheaper?
It is difficult to predict the future of the generative AI revolution. But now that the figurative Pandora’s box has been opened, I hope that the technology will allow for more exciting innovations that can leave a positive impact on the world.