The "Strawberry" Problem LLMs Can't Solve + OpenAI o1: The Next Leap in AGI
OpenAI's o1 model addresses fundamental tokenization issues while advancing reasoning and problem-solving capabilities beyond previous language models

Have you ever asked a large language model (LLM) like ChatGPT to count the number of 'R's in the word "strawberry"?

ChatGPT-4o failing to determine the number of 'R's in the word 'strawberry' (hint: the answer is three)
Well, you might be surprised by just how often it messes up. While such a task may seem like a breeze for humans—even toddlers—LLMs tend to struggle.
LLMs make these counting mistakes because of how their tokenization process works. Tokenization is how LLMs break text down into smaller units, or tokens, so they can process language more efficiently. However, these tokens don't always correspond to individual letters.
For example, rather than dividing a word like "strawberry" into individual letters, the LLM breaks it into smaller chunks, or tokens, each with a unique token ID.
Here is an example of how an LLM like ChatGPT breaks down a random text sample.

The tokenization breakdown of a random phrase
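You can reproduce this kind of breakdown yourself with OpenAI's open-source tiktoken library. Here is a minimal sketch; the sample phrase is arbitrary, and the exact splits you see will depend on which tokenizer version you load:

```python
# pip install tiktoken
import tiktoken

# Load the tokenizer used by GPT-4o-family models ("o200k_base").
enc = tiktoken.get_encoding("o200k_base")

phrase = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(phrase)

# Print each token ID next to the text fragment it represents.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```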
What ChatGPT Sees When It Looks at “Strawberry” (And Why It Stumbles)
Here is how ChatGPT deciphers the word “strawberry”:

The tokenization of the word "strawberry"
You can see that ChatGPT’s tokenizer splits “strawberry” into three distinct parts:
Part 1: “str”
Part 2: “aw”
Part 3: “berry”
Here are the corresponding token IDs for each part:

The token IDs associated with the word "strawberry"
From an LLM’s point of view, “strawberry” isn’t a series of easily countable letters. Instead, “strawberry” is processed as a sequence of token IDs. This is why ChatGPT encounters difficulties when asked to count specific letters.
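The same gap is easy to see in code. The model's view of "strawberry" is just a short list of integers, while the character-level count that trips it up is a one-liner in Python. (Exact IDs and splits depend on the tokenizer version, so your output may differ slightly from the figures above.)

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)  # what the model actually "sees": a sequence of IDs
print(pieces)     # the text chunks those IDs map to

# Character-level counting is trivial in code, but the model never
# operates on individual characters -- only on token IDs.
print(word.count("r"))  # 3
```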
While tokenization is efficient and works well for the vast majority of tasks, it complicates anything that requires character-level precision, such as letter-by-letter analysis.
Enter OpenAI’s New o1 Model

Long teased under the codename Project Strawberry, OpenAI's o1 line of models has now officially launched, including o1-preview and o1-mini (currently available to ChatGPT Plus users), though the full o1 model remains unavailable for now. These new LLMs were trained using reinforcement learning, and OpenAI claims they possess PhD-level intelligence. Specifically, the o1 models are designed to handle complex challenges in STEM fields such as science, mathematics, and computer science.
OpenAI boasts that the new o1 models offer more complex reasoning and problem-solving abilities than the previous ChatGPT-4 models. One of the standout features of the o1 line is that it’s trained to emulate human-like problem-solving and reasoning processes. Users will notice that the model "thinks" through problems more thoroughly. OpenAI claims this deliberate reasoning directly contributes to its superior performance in complex tasks.
So How Is It Better?

An evaluation from testing conducted by OpenAI

Another evaluation from testing conducted by OpenAI
Across a broad range of human-exam and machine-learning benchmarks, o1 consistently outperforms ChatGPT-4o on reasoning-heavy tasks.
According to OpenAI:
"The OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA)."
Does o1 Really... Think?
Well… no, at least not in the traditional “human” sense.
The o1 model employs chain-of-thought processing to tackle challenging problems. Rather than thinking the way a human does, it works through a problem step by step: breaking difficult challenges into simpler ones, adjusting its strategy when an approach isn't working, and adopting new approaches when its current methods fall short. Reinforcement learning during training rewarded productive reasoning chains, so the model learned from its mistakes and refined this process over time. That deliberate, stepwise refinement is what makes o1 more adept at complex problem-solving than previous models.
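For developers, that hidden deliberation shows up in the API as billed "reasoning tokens." Below is a minimal sketch using OpenAI's Python SDK; the usage field names reflect what was documented at the o1 launch and may evolve, and note that o1 models initially rejected system prompts and sampling parameters like temperature:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# At launch, o1 models accept only user messages -- no system prompt,
# temperature, or other sampling parameters.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "How many 'r's are in the word strawberry?"}
    ],
)

# The final answer is returned as usual...
print(response.choices[0].message.content)

# ...but the chain of thought itself stays hidden; its cost is reported
# as "reasoning tokens" in the usage details (field name as of launch).
print(response.usage.completion_tokens_details.reasoning_tokens)
```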
Is o1 the Next Step Toward AGI?
Well, that depends on who you ask. Under Google DeepMind's definition of AGI, o1 could be classified as Level 2 AGI, or "Competent AGI," meaning it performs at or above the 50th percentile of skilled adults. That is still, however, a very low rung on the AGI ladder.

An evaluation from testing conducted by ARC-AGI on the o1-preview and o1-mini models
When stacked up against other LLMs on the ARC-AGI benchmark, o1-preview scores only 21%, the same as Anthropic's Claude 3.5 Sonnet, while o1-mini scores even lower at 13%. Both fall far short of the 85% threshold that is considered a benchmark for approaching AGI.
It is important to note that ARC-AGI tested only the o1-preview and o1-mini models, not the full o1 model, so scores may well improve once the full o1 model (with its vision capabilities) is evaluated.
Despite the advances the Strawberry models bring, o1's performance on the ARC-AGI benchmark reinforces a critical point: LLMs are not truly intelligent. They excel at memorization and pattern-matching, not reasoning.
Links + Sources