After years of dominance by the form of AI known as the transformer, the hunt is on for new architectures. Transformers underpin OpenAI’s video-generating model Sora, and they’re at the heart of text-generating models like Anthropic’s Claude, Google’s Gemini, and GPT-4. However, they’re beginning to run up against technical roadblocks — particularly computation-related challenges.
Efficiency Challenges with Transformers
Transformers aren’t especially efficient at processing and analyzing vast amounts of data, at least not on off-the-shelf hardware. This inefficiency is leading to steep and perhaps unsustainable increases in power demand as companies build and expand infrastructure to accommodate transformers’ requirements.
Introduction of Test-Time Training (TTT)
A promising architecture proposed this month is test-time training (TTT), developed over a year and a half by researchers at Stanford, UC San Diego, UC Berkeley, and Meta. The research team claims that TTT models can process far more data than transformers while consuming significantly less computing power.
The Hidden State in Transformers
A fundamental component of transformers is the “hidden state,” essentially a long list of data. As a transformer processes information, it adds entries to the hidden state to “remember” what it just processed. For instance, if the model is working through a book, the hidden state values will represent words or parts of words.
“If you think of a transformer as an intelligent entity, then the lookup table — its hidden state — is the transformer’s brain,” Yu Sun, a post-doc at Stanford and a co-contributor on the TTT research, told TOPCLAPS. “This specialized brain enables the well-known capabilities of transformers such as in-context learning.”
Limitations of the Hidden State
While the hidden state is part of what makes transformers so powerful, it also hobbles them. To “say” even a single word about a book a transformer just read, the model would have to scan through its entire lookup table — a task as computationally demanding as rereading the whole book.
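This scaling problem can be illustrated with a toy sketch (the function names and the "longest word" query are illustrative stand-ins, not the actual transformer math): the stored state grows with every token read, and producing even one output means touching every stored entry.

```python
# Toy illustration: a transformer-style "hidden state" grows with the
# input, so answering anything after reading N tokens means scanning
# all N entries again.

def transformer_style_read(tokens):
    hidden_state = []             # grows linearly with the input
    for tok in tokens:
        hidden_state.append(tok)  # "remember" each piece as-is
    return hidden_state

def answer_one_word(hidden_state):
    # Producing even a single output requires touching every stored
    # entry - roughly as expensive as rereading the whole input.
    return max(hidden_state, key=len)  # e.g. recall the longest word

book = "the quick brown fox jumps over the lazy dog".split()
state = transformer_style_read(book)
print(len(state))            # one stored entry per token read
print(answer_one_word(state))
```

The point of the sketch is the memory curve: the state is as long as the input, and every query walks all of it.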
TTT’s Innovative Approach
Sun and his team proposed replacing the hidden state with a machine learning model — like nested dolls of AI, a model within a model. The TTT model’s internal machine learning model, unlike a transformer’s lookup table, doesn’t grow as it processes additional data. Instead, it encodes the data into representative variables called weights, making TTT models highly performant. No matter how much data a TTT model processes, the size of its internal model won’t change.
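A rough sketch of that idea, under loose assumptions (the embedding and the moving-average update rule below are my own simplifications, not the paper's actual learner): each new token is compressed into a fixed-size weight vector by a small learning step, so memory never grows with input length.

```python
# Hypothetical sketch of the TTT idea: instead of appending tokens to a
# growing list, fold each new token into a FIXED-size weight vector.
import random

DIM = 8  # size of the inner model's weights - constant, by design

def embed(token, dim=DIM):
    # Deterministic toy embedding (stand-in for a real encoder).
    rng = random.Random(sum(ord(c) for c in token))
    return [rng.uniform(-1, 1) for _ in range(dim)]

def ttt_style_read(tokens, lr=0.1):
    weights = [0.0] * DIM  # the "model within a model"
    for tok in tokens:
        x = embed(tok)
        # One tiny gradient-like step nudges the weights toward the new
        # token's representation; memory use stays constant.
        weights = [w + lr * (xi - w) for w, xi in zip(weights, x)]
    return weights

short = ttt_style_read("a few words".split())
long = ttt_style_read(("a few words " * 1000).split())
print(len(short), len(long))  # -> 8 8: same size for any input length
```

However long the input, the state stays `DIM` numbers wide; the trade-off is that the weights are a lossy summary rather than a verbatim record.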
The Potential of TTT Models
Sun believes that future TTT models could efficiently process billions of pieces of data, from words to images to audio recordings to videos, far beyond the capabilities of today’s models. “Our system can say X words about a book without the computational complexity of rereading the book X times,” Sun said. “Large video models based on transformers, such as Sora, can only process 10 seconds of video because they only have a lookup table ‘brain.’ Our eventual goal is to develop a system that can process a long video resembling the visual experience of a human life.”
Skepticism Around TTT Models
Will TTT models eventually supersede transformers? They could, but it’s too early to say for certain. TTT models aren’t a drop-in replacement for transformers. The researchers only developed two small models for study, making TTT difficult to compare to larger transformer implementations.
“I think it’s a perfectly interesting innovation, and if the data backs up the claims that it provides efficiency gains then that’s great news, but I couldn’t tell you if it’s better than existing architectures or not,” said Mike Cook, a senior lecturer in King’s College London’s department of informatics who wasn’t involved with the TTT research. “An old professor of mine used to tell a joke when I was an undergrad: How do you solve any problem in computer science? Add another layer of abstraction. Adding a neural network inside a neural network reminds me of that.”
Growing Recognition of the Need for Breakthroughs
Regardless, the accelerating pace of research into transformer alternatives points to a growing recognition of the need for a breakthrough.
Other Alternatives: State Space Models (SSMs)
This week, AI startup Mistral released a model, Codestral Mamba, based on another alternative to the transformer called state space models (SSMs). SSMs, like TTT models, appear to be more computationally efficient than transformers and can scale up to larger amounts of data. AI21 Labs is also exploring SSMs. So is Cartesia, which pioneered some of the first SSMs and Codestral Mamba’s namesakes, Mamba and Mamba-2.