Decoding Generative AI Basics : Context
Part I: A Simple Guide to How Transformer Models Build Context
AI: A Long History, but a Recent Revolution
AI has been part of human life for decades, with its earliest applications emerging in the 1950s. Back then, the term "AI" was mostly reserved for researchers and felt like a distant dream. However, in recent years, AI has become a buzzword across industries, even for people outside the tech world. Why is that?
The launch of ChatGPT, for example, captivated many by offering a radically new way to obtain information and answer questions. But what made it so different?
The Power of Habit and the Appeal of Simplicity
Humans are creatures of habit. We often stick to what we know, but we also seek easier ways to perform familiar tasks. If a new innovation is not easy to use, it fails to catch on. On the other hand, if it simplifies an existing process, it is often seen as revolutionary. For instance, in the past, if people needed specific information, they had to rely on books. With the rise of the internet, they could use search engines to find relevant links and sift through information. Today, in the AI era, users no longer need to search for information. Instead, they simply ask AI—like ChatGPT—and get the answers they need in the desired format. This shift from searching to asking and receiving is what makes AI radically different, offering a faster and more efficient way to find the information you need.
Why Generative AI is a Massive Unlock for Innovation
What makes this technology so transformative? At its core, AI involves machines performing tasks typically requiring human-like intelligence, such as learning, reasoning, perception, problem-solving, and natural language understanding. For example, Siri, launched in 2011 with iOS 5, uses natural language processing (NLP) to understand spoken commands and perform tasks like web searches. IBM Watson, developed the same year, showcased its reasoning ability by winning the game show Jeopardy! and earning $1 million. So, what has changed recently?
The key innovation is the rise of Generative AI, a subset of AI that can create entirely new content, not just respond with existing data. Generative AI has the ability to learn, reason, and apply knowledge to generate new information—whether in the form of text, images, videos, or audio. This capability is powered by deep learning, which enables AI to understand and generate complex patterns and outputs. But why is this ability to create so groundbreaking?
This is because Generative AI has unlocked the potential of artificial cognitive generation. To write code, for example, humans don't just memorize syntax: they must understand patterns, reason through problems, and apply learned skills to create new solutions. A human who can't do this would be unable to write code or build applications, and earlier AI systems had the same limitation. Generative AI has bridged this gap by extending AI's capabilities to create something new in a directed way, whether it's writing a story, generating code, or producing an image. This opens up new possibilities: businesses can develop software faster, people can brainstorm ideas more effectively, and industries can innovate at a rapid pace.
Now, the boom of Generative AI wouldn't have been possible without the Transformer model. With that context established, let's dive into it.
The Role of Attention
So the question remains: what has caused this shift in AI capabilities, specifically in generative AI? Several key concepts contribute to this innovation. Let's explore them one by one.
One such key concept is “Attention.”
Have you ever listened to an hour-long keynote and been able to convey the key takeaway points? Of course, the key takeaways may differ from person to person, especially if their tastes or perspectives vary. But the point is, you’re able to do it because your brain can identify and summarize the key insights from the speech. It knows what aspects are most significant to highlight during the summarization process. Now, imagine you didn’t have that capability. What would your key takeaways be? How would you determine the most important points in a speech if you didn’t know what to focus on?
This was the challenge for models before the introduction of the "Attention" mechanism. Without the ability to know what to focus on, both machines and humans would struggle to derive meaningful takeaways. Let’s consider a simple example to illustrate this.
Imagine the sentences: “I hit a six with the red bat,” “I saw a blue bat fly,” “I threw my red bat away,” and “I watched a red bat fly.” Without knowing which words to focus on, understanding the key meaning of these sentences would be challenging. For instance, does “a blue bat” refer to a blue-colored cricket bat, or to a bat (the mammal) that happens to be blue and is flying? If a machine encounters this confusion, how can it generate meaningful output?
This is where the attention mechanism becomes crucial. It enables machines to “attend to” specific details, much like how humans focus on key points during comprehension. For example, consider the phrase “bat fly.” A human interprets it as the bat (the mammal) flying without confusion, because the word “fly” is strongly associated with the animal, and a cricket bat cannot fly. Through the attention mechanism, a machine can likewise learn to associate “bat fly” with the mammal rather than the cricket bat, based on the context and the training it has received. Similarly, it can associate the word “red” with “bat” and interpret it as a red-colored bat. Without this attention capability, understanding at this level of nuance would be impossible, or easily misinterpreted.
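To make this concrete, here is a minimal sketch of the scaled dot-product attention computation at the heart of transformers. The token vectors are random toy values, not real embeddings, and real models use many more dimensions and learned query/key/value projections; this only shows the mechanics of how each token weighs every other token.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of every token pair
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V, weights                      # context-mixed vectors + weights

# Toy 4-dimensional vectors standing in for the tokens "blue", "bat", "fly"
np.random.seed(0)
X = np.random.randn(3, 4)
out, w = attention(X, X, X)  # self-attention: queries, keys, values all come from X
print(w.round(2))            # row i: how much token i attends to each token
```

Each output vector is a weighted blend of all input vectors, which is exactly how “bat” can absorb information from “blue” or “fly.”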
The Importance of Positional Encoding in AI
Take the sentence “I hit a six with the red bat.” Let’s imagine that your brain doesn’t have the ability to process this sentence sequentially. Without that capability, how would you understand who is performing the action and what or whom the action is directed at? For instance, did the red bat hit the six, or did I hit the six? This is where another key concept comes into play: Positional Encoding.
As the name suggests, positional encoding is a mechanism that encodes the position of words in a sentence. With this data, a machine can gain a more contextual understanding of who is performing the action and who is receiving it, based on the position of the words in the sentence. The position of each word is encoded in a vector, providing the machine with the necessary context to correctly interpret the relationships between words in a sequence.
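As an illustration, here is a sketch of the sinusoidal positional encoding used in the original Transformer design: each position gets a unique pattern of sine and cosine values that is added to the token's embedding vector. The sequence length and dimension below are arbitrary toy choices.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]       # position of each token: 0, 1, 2, ...
    i = np.arange(d_model)[None, :]         # index of each embedding dimension
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])    # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])    # odd dimensions
    return pe

pe = positional_encoding(8, 16)  # 8 tokens, 16-dimensional embeddings
# pe[0] encodes "first word", pe[7] encodes "eighth word"; no two rows are alike
```

Because every position produces a distinct vector, the model can tell “I hit the bat” apart from “the bat hit I” even though the tokens are identical.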
Understanding Embeddings in Generative AI
The next key concept is “Embedding.” While not unique to generative AI, embedding is a core concept in deep learning, which generative AI leverages heavily.
Before understanding embeddings, let’s clarify what tokens are, as this will make embeddings clearer. As the name suggests, “tokens” are pieces of data, such as words or parts of words in text. For instance, in the sentence “I hit a six with the red bat,” each word could be a token, or, with a subword tokenizer, even parts of each word could be treated as tokens. In practice, subword tokenization balances efficiency and vocabulary coverage by splitting text into manageable subwords. If the input were an image, tokens might represent sections such as a 16x16 pixel block within a 1024x1024 image. In general, tokens are the chunks that models process as input.
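A quick sketch of the difference between word-level and subword-level tokenization. The subword split shown here is hardcoded for illustration; real tokenizers (such as BPE-based ones) learn their splits from data.

```python
sentence = "I hit a six with the red bat"

# Word-level tokenization: simply split on whitespace
word_tokens = sentence.split()
print(word_tokens)  # ['I', 'hit', 'a', 'six', 'with', 'the', 'red', 'bat']

# Subword tokenization (illustrative only): a rare word is broken into
# pieces the model's fixed vocabulary already knows
subword_tokens = ["un", "break", "able"]  # a plausible split of "unbreakable"
```

Splitting rare words into familiar subwords is what lets a model with a fixed vocabulary still handle words it has never seen whole.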
What is an Embedding? Our computers still operate as binary processing engines, so they can’t derive meaning directly from words. To process text or any data meaningfully, it must be converted into numbers. Embedding is the process of representing these tokens numerically. These representations are lists of numbers (or vectors) that convey the meaning or value of each token. In a high-dimensional space, these vectors occupy specific positions and hold a specific meaning.
To illustrate, imagine a coordinate system with four quadrants. A point in the first quadrant has positive x and y values, and nearby points in that quadrant are related to one another. Similarly, the vector representing a word like “bat” exists in a high-dimensional space, far beyond human visualization capability (perhaps thousands of dimensions), and its position holds a specific contextual meaning: in this case, the meaning of the token “bat.” The vectors in its closest proximity will represent words with similar semantic meaning, such as “baseball bat” or “wood” (the material bats are made of). Now, let us put all these concepts together.
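The standard way to measure this “closeness” between embedding vectors is cosine similarity. The 4-dimensional vectors below are made-up toy embeddings (real ones are learned and have hundreds or thousands of dimensions), chosen only so the two senses of “bat” point in different directions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings for illustration only
bat_cricket  = np.array([0.9, 0.1, 0.8, 0.2])   # "bat" as sports equipment
bat_animal   = np.array([0.1, 0.9, 0.2, 0.8])   # "bat" as the mammal
baseball_bat = np.array([0.85, 0.15, 0.75, 0.25])

print(cosine_similarity(bat_cricket, baseball_bat))  # high: similar meanings
print(cosine_similarity(bat_cricket, bat_animal))    # much lower: different senses
```

In a real model, nearest-neighbor lookups over these similarities are how semantically related words cluster together in the embedding space.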
How Attention and Positional Encoding Enhance Embedding
While embedding isn’t unique to transformer architectures, attention and positional encoding are what distinguish this approach. For example, after embedding, the vector for “bat” in a sentence like “I hit a six with the red bat” is refined further by applying positional encoding (which preserves word order) and attention (which determines contextual meaning relative to the rest of the sentence). This combination adjusts the vector to capture the full meaning of the entire sentence, making it contextually rich.
For instance, attention associates the word red with bat and modifies the vector to represent a new position in a high-dimensional space that encodes the meaning of “red bat.” Similarly, positional encoding identifies the subject and verb based on the tokens' positions and readjusts the vector to capture the full context-aware meaning of the sentence. This wasn’t possible at this level of detail and accuracy before.
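Putting the three pieces together, here is a toy end-to-end sketch: embed the tokens, add positional information, then let self-attention mix context into each vector. The embedding table is random and the dimensions are tiny; this only shows the order of operations, not a trained model.

```python
import numpy as np

np.random.seed(1)
tokens = ["I", "hit", "a", "six", "with", "the", "red", "bat"]
d = 8  # toy embedding dimension

# Step 1 — Embedding: map each token to a vector (random here; learned in practice)
vocab = {tok: idx for idx, tok in enumerate(tokens)}
embedding_table = np.random.randn(len(vocab), d)
X = embedding_table[[vocab[t] for t in tokens]]

# Step 2 — Positional encoding: add a unique position pattern to each embedding
pos = np.arange(len(tokens))[:, None]
i = np.arange(d)[None, :]
angle = pos / np.power(10000, (2 * (i // 2)) / d)
X = X + np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Step 3 — Self-attention: update every vector using context from the whole sentence
scores = X @ X.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
context_aware = weights @ X  # "bat" now carries information from "red", "hit", ...
```

After step 3, no vector represents its word in isolation anymore; each one encodes the word as used in this particular sentence.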
The Power of Parallel Processing in Transformers
Another key advantage of transformers is their ability to process information in parallel. Because each token's position and attention relationships are encoded explicitly, transformers can process an entire sequence simultaneously rather than one token at a time, dramatically speeding up training. This scalability allows models to achieve substantial performance gains as more computing resources are added.
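A small sketch of why this matters: a sequential, one-token-at-a-time loop and a single whole-sequence matrix multiply compute the same result, but only the latter maps onto hardware that excels at large matrix operations. The matrices here are random stand-ins for a layer's learned projection.

```python
import numpy as np

np.random.seed(2)
X = np.random.randn(6, 4)  # 6 token vectors, 4 dimensions each
W = np.random.randn(4, 4)  # a projection matrix, like those inside attention layers

# Sequential (RNN-style): process one token at a time, in order
seq_result = np.stack([x @ W for x in X])

# Parallel (transformer-style): the whole sequence in one matrix multiply
par_result = X @ W

assert np.allclose(seq_result, par_result)  # identical outputs, very different speed
```

On GPUs and other accelerators, the single large multiply runs far faster than the loop, which is a big part of why transformers scale so well.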
Final Thoughts
So far, we have seen how context is built and some of the key concepts that help build it. Embeddings, attention, and positional encoding collectively develop the full contextual understanding of a sentence in a way that wasn’t possible earlier. Now, you might wonder, “How does this architecture allow a model to respond or even generate new content?” That is an excellent question, and it reveals how transformative this architecture truly is. In the next chapter, we’ll explore how transformers use this underlying structure to generate new content and respond innovatively to inputs. Stay tuned!
Note: This article doesn’t try to explain the architecture of the Transformer model in detail or with technical depth. It focuses on simplifying some core concepts of what it entails and why it has been truly transformative.