A classic problem in artificial intelligence is the pursuit of a machine that can understand human language.
For example, when searching for “nearby Italian restaurants” on your favorite search engine, an algorithm has to analyze each word in your query and output the relevant results. A decent translation app will have to understand the context of a particular word in English and somehow account for the differences in grammar between languages.
All these tasks and many more fall under the subfield of computer science known as Natural Language Processing, or NLP. Advances in NLP have led to a wide array of practical applications, from virtual assistants like Amazon’s Alexa to spam filters that detect malicious email.
The most recent breakthrough in NLP is the idea of a large language model or LLM. LLMs such as GPT-3 have become so powerful that they seem to succeed in almost any NLP task or use case.
In this article, we will look into what exactly LLMs are, how these models are trained, and the current limitations they have.
What is a large language model?
At its core, a language model is simply an algorithm that estimates how likely a given sequence of words is to form a valid sentence.
A very simple language model trained on a few hundred books should be able to tell that “He went home” is more valid than “Home went he”.
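The idea can be sketched with a toy bigram model, which scores a sentence by how often each pair of adjacent words appeared in its training text. The corpus and the smoothing constant below are hypothetical stand-ins for "a few hundred books":

```python
from collections import Counter

# A tiny corpus standing in for "a few hundred books" (hypothetical data).
corpus = "he went home . she went home . he went away .".split()

# Count word pairs and single words to estimate P(next word | current word).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_prob(words):
    """Score a word sequence under the bigram model. A small floor keeps
    unseen bigrams from zeroing out the whole product."""
    prob = 1.0
    for a, b in zip(words, words[1:]):
        prob *= bigrams.get((a, b), 0.01) / unigrams[a]
    return prob

likely = sentence_prob(["he", "went", "home"])
unlikely = sentence_prob(["home", "went", "he"])
```

Because "he went" and "went home" occur in the corpus while "home went" never does, the model assigns the natural word order a much higher score.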
If we replace the relatively small dataset with a massive dataset scraped from the internet, we start to approach the idea of a large language model.
Using neural networks, researchers train LLMs on vast amounts of text data. Having seen so much text, the model becomes very good at predicting the next word in a sequence.
This ability turns out to be powerful enough to support a wide range of NLP tasks, including summarizing text, creating novel content, and even simulating human-like conversation.
For example, the highly popular GPT-3 language model has over 175 billion parameters and is considered one of the most advanced language models released so far.
It’s able to generate working code, write entire articles, and take a shot at answering questions on almost any topic.
How Are LLMs Trained?
We’ve briefly touched on the fact that LLMs owe a lot of their power to the size of their training data. There is a reason why we call them “large” language models after all.
Pre-training with a Transformer Architecture
During the pre-training stage, LLMs are introduced to existing text data to learn the general structure and rules of a language.
In the past few years, LLMs have been pre-trained on datasets that cover a significant portion of the public internet. For example, GPT-3 was trained largely on the Common Crawl dataset, a corpus of web pages, posts, and digitized books scraped from over 50 million domains.
The massive dataset is then fed into a model known as a transformer. Transformers are a type of deep neural network that works best for sequential data.
Transformers use an encoder-decoder architecture for handling input and output. Essentially, the transformer contains two neural networks: an encoder and a decoder. The encoder can extract the meaning of the input text and store it as a vector. The decoder then receives the vector and produces its interpretation of the text.
However, the key concept that allows the transformer architecture to work so well is its self-attention mechanism. Self-attention lets the model focus on the most important words in a given sentence, and it can relate words to one another even when they are far apart in the sequence.
Another benefit of self-attention is that the process can be parallelized. Instead of processing sequential data in order, transformer models can process all inputs at once. This enables transformers to train on huge amounts of data relatively quickly compared to other methods.
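The core computation can be sketched in a few lines of NumPy. This is a minimal single-head version with no learned projections, not a full transformer layer; the embeddings are random placeholder data:

```python
import numpy as np

def self_attention(X):
    """Minimal scaled dot-product self-attention (single head, no learned
    weight matrices -- a sketch of the mechanism, not a full layer).
    X: (seq_len, d_model) matrix of token embeddings."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)            # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                        # weighted mix of all tokens

# Three "tokens" with 4-dimensional embeddings, processed all at once.
X = np.random.default_rng(0).normal(size=(3, 4))
out = self_attention(X)  # one context-aware vector per token
```

Note that every pairwise score is computed in a single matrix multiplication, which is exactly what makes the mechanism easy to parallelize on modern hardware.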
After the pre-training stage, you can choose to introduce new text for the base LLM to train on. This process, called fine-tuning, is often used to further improve the output of the LLM on a specific task.
For example, you may want to use an LLM to generate content for your Twitter account. We can provide the model with several examples of your previous tweets to give it an idea of the desired output.
There are also lighter-weight alternatives to fine-tuning that require no retraining at all.
Few-shot learning refers to the process of placing a small number of worked examples directly in the prompt, with the expectation that the language model will figure out how to produce similar output; the model's weights are never updated. One-shot learning is the same process except only a single example is provided.
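A few-shot prompt is just text assembled from an instruction, some worked examples, and the new input. The headline-to-tweet task and all of the strings below are hypothetical illustrations:

```python
# Hypothetical few-shot prompt: the examples live in the prompt itself,
# so the model's weights are never updated.
instruction = "Rewrite each headline as a casual tweet.\n"

examples = [
    ("Local bakery wins national award",
     "our neighborhood bakery just won a NATIONAL award. so proud"),
    ("City opens new bike lanes downtown",
     "new bike lanes downtown!! commuting just got way better"),
]

def build_prompt(examples, query):
    """Assemble a prompt: instruction, k worked examples, then the query."""
    parts = [instruction]
    for headline, tweet in examples:
        parts.append(f"Headline: {headline}\nTweet: {tweet}\n")
    parts.append(f"Headline: {query}\nTweet:")
    return "\n".join(parts)

few_shot = build_prompt(examples, "Library extends weekend hours")      # k = 2
one_shot = build_prompt(examples[:1], "Library extends weekend hours")  # k = 1
```

The model is expected to continue the text after the final "Tweet:", imitating the pattern set by the examples.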
Limitations of Large Language Models
LLMs such as GPT-3 are capable of performing a large number of use cases even without fine-tuning. However, these models still come with their own set of limitations.
Lack of a Semantic Understanding of the World
On the surface, LLMs appear to display intelligence. However, these models do not operate the way the human brain does. An LLM relies solely on statistical computation to generate output; it has no capacity to reason about ideas and concepts on its own.
Because of this, an LLM can output nonsensical answers simply because the words seem “right” or “statistically likely” when placed in that particular order.
Models like GPT-3 also produce inaccurate responses. LLMs can suffer from a phenomenon known as hallucination, where the model outputs a factually incorrect response without any awareness that the response has no basis in reality.
For example, a user may ask the model to explain Steve Jobs’ thoughts on the latest iPhone. The model may generate a quote from thin air based on its training data.
Biases and Limited Knowledge
Like many other algorithms, large language models are prone to inherit the biases present in the training data. As we start relying more on LLMs to retrieve information, the developers of these models should find ways to mitigate the potentially harmful effects of biased responses.
In a similar vein, the blind spots of the model’s training data will also hinder the model itself. Large language models currently take months to train, and they rely on datasets that are frozen at a point in time. This is why ChatGPT has only limited knowledge of events that occurred after 2021.
Large language models have the potential to truly change how we interact with technology and our world in general.
The vast amount of data available on the internet has given researchers a way to model the complexities of language. Along the way, these language models have come to give a convincing imitation of a human-like understanding of the world.
As the public begins to trust these language models to provide accurate output, researchers and developers are already finding ways to add guardrails so that the technology remains ethical.
What do you think is the future of LLMs?