Speaking with AI
Anyone who has used ChatGPT, or any LLM in general, will have noticed that these models have no perception of the world around them: they don't know what day it is, what the weather is like, or even which question you asked them last.
This is a consequence of how these models work. An LLM predicts the next token based on an input context (also known as the attention window). Given nothing more than the question you've just posed, the model will likely make mistakes in its response, or simply say it can't answer because it lacks the necessary information. To get the most out of an LLM, then, providing this context is crucial.
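To make the mechanism concrete, here is a minimal sketch of next-token prediction using the Hugging Face transformers library and the small gpt2 model (chosen purely for illustration; it isn't one of the models discussed in this article):

```python
# Minimal next-token prediction sketch, assuming the Hugging Face
# "transformers" library and the small "gpt2" model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The context: everything the model "sees" before predicting.
context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary token

# The model's "answer" is simply the most likely next token.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))  # likely " Paris"
```

Everything the model produces comes from repeating this single step, which is why the quality of the context matters so much.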
A Note on Tokens
Tokens are the smallest units of text a language model works with. Originally they corresponded directly to words, but modern LLMs like GPT-4o or Llama use BPE (Byte Pair Encoding), which breaks words down into fragments smaller than a complete word.
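You can see BPE at work with OpenAI's tiktoken library; the snippet below uses the o200k_base encoding (the one used by GPT-4o) as an example:

```python
# A quick look at BPE in practice, using OpenAI's "tiktoken" library
# and the "o200k_base" encoding used by GPT-4o.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Tokenization splits words into subword fragments")
print(tokens)                             # the token IDs
print([enc.decode([t]) for t in tokens])  # the text fragments themselves
```

Notice that common words survive as single tokens, while rarer ones are split into several pieces.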
Forms of Context
Most prompt engineering techniques (that is, techniques for writing prompts) consist of providing the LLM with the appropriate context so that its token-by-token prediction yields a useful response. The main sources of context are listed below; a sketch after the list shows how they might be combined.
- RAG (Retrieval-Augmented Generation): supplies snippets of documents semantically similar to the user's question, giving the model context on the subject.
- Examples: fundamental. A single example of a given task significantly improves the model's understanding; between one and five examples (few-shot prompting) is usually optimal for getting an appropriate response.
- Previous messages: provided automatically by AI applications like ChatGPT. Remembering previous messages aids generation, since these models are specifically trained to operate this way. Conversely, mixing unrelated topics in one conversation can degrade the model's accuracy.
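As a rough illustration of how these three sources come together, here is a sketch of prompt assembly; `retrieve_snippets` and `conversation_history` are hypothetical stand-ins for a real vector search and a real chat store:

```python
# A sketch of how an application might assemble context for an LLM.
# `retrieve_snippets` and `conversation_history` are hypothetical
# stand-ins for a real vector search and a real chat store.
def build_prompt(question, retrieve_snippets, conversation_history):
    parts = []

    # RAG: documents semantically similar to the question.
    parts.append("Relevant documents:")
    parts.extend(f"- {s}" for s in retrieve_snippets(question))

    # Examples: one to five demonstrations of the task (few-shot).
    parts.append("\nExample:\nQ: What is BPE?\nA: A subword tokenization scheme.")

    # Previous messages: the conversation so far.
    parts.append("\nConversation:")
    parts.extend(conversation_history)

    parts.append(f"\nQ: {question}\nA:")
    return "\n".join(parts)

prompt = build_prompt(
    "How big is GPT-4.5's context window?",
    retrieve_snippets=lambda q: ["GPT-4.5 has a 128,000-token context window."],
    conversation_history=["Q: What is a token?", "A: The smallest unit of text."],
)
print(prompt)
```

Real applications do this with chat-style message lists rather than one flat string, but the principle is the same: everything the model should "know" has to be placed inside its attention window.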
Infinite Context?
Since the initial launch of ChatGPT, context capacity has grown enormously. Initially, ChatGPT had only about 8,000 tokens of context.
That budget covered both the input (the prompt) and the output (the response), which limited the length of the texts it could generate.
Currently, GPT-4.5, OpenAI's flagship model, has a total context capacity of 128,000 tokens. Other cutting-edge models go much further: Llama 4 Scout, released at the beginning of this month, has an attention window of up to 10 million tokens, enough to process entire books or movie scripts in a single pass. Models like Google's Gemini 2.5 Flash also offer a very large context.
Context and Reasoning
Reasoning models, such as OpenAI's o1 or o3-mini, use part of their context window for reasoning: the model "thinks out loud" before generating the final response for the user. This also translates into much higher token costs when interacting with the model through the API.
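With the official openai Python client you can see this cost directly: the usage report breaks out the reasoning tokens, which are billed as output even though they never reach the user. A minimal sketch, assuming an API key in the environment and o3-mini as the model:

```python
# Inspecting reasoning-token costs, assuming the official "openai"
# Python client and an o-series model such as "o3-mini".
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

usage = response.usage
# Reasoning tokens count as output tokens, even though they are hidden.
print("output tokens:", usage.completion_tokens)
print("of which reasoning:", usage.completion_tokens_details.reasoning_tokens)
```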
Limits of Context
Even though models advertise such expansive attention windows, the studies accompanying their launches usually report a significant drop in precision beyond a certain number of tokens. This is especially true of small language models (SLMs) like Phi-4: although it has a context capacity of 128,000 tokens, it is not recommended to exceed 16,000, and these models are not aimed at direct chat use in the first place.
Conclusion
In this article, we've begun exploring how language models work and why context matters so much when interacting with them: an LLM knows nothing beyond its training data, so any additional information must be supplied in the prompt for it to generate a truly useful response.