New AI HYENA Destroys Old AI Models and Breaks Speed and Memory Records

[Music] Something big just dropped and it's not another transformer upgrade. It's something completely different. Liquid AI, a Boston startup spun out of MIT, just revealed Hyena Edge on April 25th, right before ICLR 2025 kicks off in Singapore.

And yeah, hyena, like the animal that laughs at lions, except this thing is laughing at latency graphs on your phone. It's built to run powerful AI right on your device, faster and lighter than anything we're used to. And it might just be the first real sign that the Transformer era is starting to crack.

So, let's rewind just a bit. For years, we've been in this romance with the Transformer architecture because of that parallelizable attention mechanism Vaswani and friends introduced back in 2017. It's gotten us some wild breakthroughs, sure, but there's a big catch.

Squeezing these chunky Transformer models onto a smartphone without frying your battery or devouring your RAM has been painful. Most edge-optimized models, like SmolLM2, like Phi, even Meta's Llama 3.2 1B, still lug around standard attention blocks and rely on kernels that are great on a data-center-grade GPU, but not so hot on the Snapdragon inside your pocket.

Liquid AI is basically saying, "Why not ditch most of that attention baggage and do something leaner?" Enter Hyena Edge, a convolution-based multi-hybrid model. Convolutions in a language model?

Yep. Convnets aren't new. They rule in vision, but here they're part of this broader family of operators called Hyena, which Michael Poli's group kicked off a couple years ago.

The Edge variant takes it further by replacing roughly two-thirds of the grouped-query attention operators inside a top-shelf Transformer++ backbone with these gated convolutions from the Hyena subfamily. That swap alone cuts a heap of memory overhead and avoids the quadratic-time blowup that attention brings. Now, Liquid AI didn't just eyeball some abstract benchmarks on a desktop GPU and declare victory.

They ran the whole thing on an actual Samsung Galaxy S24 Ultra, yes, the phone you might have in your pocket right now, and compared it to a parameter-matched GQA-Transformer++ model.

Prefill latency? Hyena Edge was faster across the board, and at longer contexts we're talking up to 30% quicker.

Decode latency, same story. Once you hit sequences over 256 tokens, that convolution magic really kicks in. And memory usage was lower at every sequence length they measured, which is huge when your app's data has to fit between Spotify, TikTok, and the photo album of your cat.

What about accuracy, though? Because, let's be honest, nobody cares if a model is lightning fast if it can't finish your sentence. Liquid AI trained both models on the exact same 100 billion tokens and then unleashed them on a battery of standard language model benchmarks.

On WikiText, Hyena Edge's perplexity dropped to 16.2 compared to 17.3 from the Transformer baseline. LAMBADA went from 10.8 down to 9.4. On PIQA, the accuracy nudged up from 71.1 to 72.3. HellaSwag saw a jump from 49.3 to 52.8. WinoGrande climbed from 51.4 to 54.8. ARC-Easy crept up from 63.2 to 64.4, and ARC-Challenge pushed from 53.34 to 55.2.

One funny footnote: both models tied on a PIQA variant at 31.7, so the Hyena didn't win absolutely everything, but it never fell behind either. Net result: speed and memory savings come with equal or better predictive oomph, which is the holy grail for on-device AI. Okay, but how did they actually arrive at that architecture?

This is where it gets super nerdy, but also kind of sci-fi cool. Back in December 2024, Liquid AI unveiled something called STAR, the synthesis of tailored architectures framework. Picture an evolutionary algorithm wearing a lab coat.

You feed it a bunch of primitive operators, some constraints about latency and memory, sprinkle in linear systems theory, and then let it evolve architectures generation after generation. For Hyena Edge, they kicked off with a population of 16 candidate models.

Over 24 generations, STAR juggled 18 different convolution options, Hyena-Full, Hyena-X, and Hyena-Y with filter lengths ranging from 3 to 128, plus several flavors of grouped-query attention and SwiGLU feed-forward layers. Every candidate got its memory and latency profiled on the actual S24 Ultra, not a random desktop card. And they even trained each mini model for 5 billion tokens to keep score on perplexity in real time.
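To make that search loop concrete, here is a toy sketch of an evolutionary architecture search in Python. It is not STAR itself: the per-operator latency, memory, and quality numbers are invented, the single weighted score is a simplification of what is really a multi-objective search, and names like `OPERATOR_STATS` and `evolve` are hypothetical. The only details borrowed from the video are the population of 16, the 24 generations, the 32-layer depth of the final model, and the idea (mentioned later on) of estimating a whole model's cost by adding up pre-measured operator stats.

```python
import random

# Hypothetical per-operator stats: (latency in ms, memory in MB, quality penalty).
# These numbers are made up; they stand in for on-device measurements and for
# the perplexity obtained by briefly training each candidate.
OPERATOR_STATS = {
    "gqa":        (1.00, 4.0, 1.00),
    "hyena_full": (0.80, 3.0, 1.05),
    "hyena_x":    (0.70, 2.5, 1.10),
    "hyena_y":    (0.55, 2.0, 1.02),
}
OPERATORS = list(OPERATOR_STATS)
NUM_LAYERS, POPULATION, GENERATIONS = 32, 16, 24


def score(genome):
    """Lower is better. Whole-model cost is the sum of per-operator stats."""
    latency = sum(OPERATOR_STATS[op][0] for op in genome)
    memory = sum(OPERATOR_STATS[op][1] for op in genome)
    quality = sum(OPERATOR_STATS[op][2] for op in genome) / len(genome)
    # Arbitrary weights that fold three objectives into one number.
    return 100.0 * quality + latency + 0.25 * memory


def crossover(parent_a, parent_b, rng):
    """Splice the front of one parent onto the back of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(genome, rng, rate=0.1):
    """Randomly swap some layers for a different operator."""
    return [rng.choice(OPERATORS) if rng.random() < rate else op for op in genome]


def evolve(seed=0):
    rng = random.Random(seed)
    population = [
        [rng.choice(OPERATORS) for _ in range(NUM_LAYERS)]
        for _ in range(POPULATION)
    ]
    for _ in range(GENERATIONS):
        population.sort(key=score)
        parents = population[: POPULATION // 2]  # keep the better half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(POPULATION - len(parents))
        ]
        population = parents + children
    return min(population, key=score)


best = evolve()
print({op: best.count(op) for op in OPERATORS}, round(score(best), 2))
```

Because the scoring here is invented, the operator mix this toy settles on says nothing about the real Hyena Edge layout; the point is only the shape of the loop: score, select, recombine, mutate, repeat.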

As the evolutionary cycles rolled on, the Hyena-Y operator kept muscling its way to the front of the pack. It turns out this variant strikes that sweet balance: plenty of expressive power without the inner convolution overhead you see in Hyena-Full, and a lighter gating setup than Hyena-X. STAR could literally visualize how many self-attention, Hyena, and SwiGLU blocks were inside each generation. Watch the walkthrough video Liquid AI posted and you'll see those histograms shifting over time: self-attention bars shrinking, Hyena-Y bars swelling, latency curves dipping. It's kind of like watching natural selection, but with code blocks instead of genes.

By the final generation, STAR spat out the design that became Hyena Edge: 32 layers deep, width 2,048, attention head size 64, and two-thirds of what used to be GQA replaced by Hyena gated convolutions.

No hand-tuning, no "trust me, bro," just hardcore automated search. And they stress-tested the outcome directly on the phone again to be sure those earlier approximations held true at full scale, which they did. Otherwise, we wouldn't be talking about this.

One angle I love is how they benchmark responsiveness on short prompts because that's where hybrids usually fall flat. Edge apps like voice assistants often fire off queries under 20 tokens. So, shaving milliseconds there is everything.

Liquid AI reports their prefill latency advantage is visible right from the shortest sequences. That's basically the model's first impression for the user, and it only widens as you feed in longer context. Even if you're just sending a single sentence to your on-device language model, Hyena Edge can answer faster than its Transformer twin.

Now, a quick side tour. Grouped-query attention, GQA, was already an optimization to make Transformers more manageable by letting multiple query heads share key-value heads. It's lighter than full attention, but still attention at heart.
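A rough way to see what sharing key-value heads buys you is to count the bytes in the KV cache. The sketch below is plain arithmetic, not Liquid AI's code: the 32 query heads, 8 key-value heads, and 16-bit values are assumptions picked for illustration, while the 32 layers and head size of 64 echo the figures quoted in this video.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Full multi-head attention: one key-value head per query head (32, assumed).
full = kv_cache_bytes(seq_len=1024, num_layers=32, num_kv_heads=32, head_dim=64)
# Grouped-query attention: 32 query heads share 8 key-value heads (assumed).
grouped = kv_cache_bytes(seq_len=1024, num_layers=32, num_kv_heads=8, head_dim=64)

print(full // 2**20, "MiB vs", grouped // 2**20, "MiB")  # 256 MiB vs 64 MiB
```

Under those assumptions the cache shrinks fourfold, but it still grows with every token, and the attention scores themselves are still computed against the whole context.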

Hyena Edge swaps most of those heads out entirely, and that's what slices the quadratic term down. Convolution operations scale linearly with sequence length. So for 512 or 1,024 tokens, you're looking at serious compute savings.
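The quadratic-versus-linear point is easy to check with a back-of-the-envelope count. This sketch only tallies multiply-adds for the attention score matrix and for a direct convolution with a fixed filter; the function names, the 64 channels, and the choice of a 128-tap filter (the top of the 3 to 128 range mentioned earlier) are assumptions, and real on-device latency depends on kernels and hardware that this ignores.

```python
def attention_score_ops(seq_len, head_dim=64):
    """Multiply-adds to compare every token with every other token in one head."""
    return seq_len * seq_len * head_dim


def direct_conv_ops(seq_len, channels=64, filter_len=128):
    """Multiply-adds for a per-channel convolution with a fixed-length filter."""
    return seq_len * filter_len * channels


for n in (512, 1024):
    print(n, attention_score_ops(n), direct_conv_ops(n))
```

Doubling the context from 512 to 1,024 tokens quadruples the attention count but only doubles the convolution count, which is the gap the video is pointing at.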

The kicker is that Liquid AI engineered their gated convolutions so they still capture long-range dependencies, something older convnets struggled with. That's how they kept or improved the perplexity numbers without falling back to heavy attention. All that said, Liquid AI isn't keeping this in a vault.

They've already stated loudly that they plan to open source Hyena Edge and a series of liquid foundation models in the coming months. If you're like me, that sentence feels like Christmas morning. It means developers will get a turnkey model that can run natively on stuff like the S24 Ultra, maybe the iPhone 16 Pro, maybe even a Raspberry Pi if you dare, without requiring a cloud subscription or a 5 watt charger.

And because it's open, we'll see a thousand forks. Someone will quantize it to 4-bit. Someone else will fine-tune for coding assistance.

Another team will port it to watchOS. Mark my words. More broadly, this is part of a bigger trend.

We're stepping into a post-Transformer world, or at least a poly-architecture ecosystem. Transformers are still unbeatable for heavy GPU jobs, though. But when it comes to edge devices where every bit of energy matters, hybrids built on convolutions, recurrent models, and even state space models are finally getting their moment.

And with smart tools like Star doing the architecture search, we're moving faster than manual tweaking ever allowed. But the best part, everything's tested directly on real devices like smartphones, not just on GPUs in a lab. So what works in theory also works in your hand.

Zooming out a little, phones now have powerful NPUs, laptops are shipping with crazy AI accelerators, and there's pressure to keep AI local for privacy. Models that can crush benchmarks like LAMBADA at a perplexity of 9.4, use less RAM, and respond 30% faster could be exactly what tips the scale.

Running everything on devices just feels better. No lag, no cloud dependency, no leaking personal data when you're offline. Credit where it's due.

The team behind this, listed under Liquid Science, includes Armin Thomas, Stefano Massaroli, Michael Poli, and the rest of the Liquid Edge team. They built on the Hyena-X and Hyena-Y work, plus tweaks like SwiGLU, and of course, the original Transformer ideas. It's less about one-off brilliance and more about letting algorithms evolve designs smarter than we could by hand.

One quick note: early versions of Hyena Edge were tested at a width of 512 before scaling up to 2,048 for the final model, keeping attention head size at 64. During STAR's evolution runs, they estimated whole-model latency and memory by adding up pre-measured operator stats, which let them move fast without wasting full training cycles. They even visualized it.

Watching those models slide toward the bottom left corner where low latency meets low perplexity was like watching a stock ticker only cooler. So where are we headed? If the last few years belong to transformers, the next could be ruled by automated architecture search, hybrid models, and real practical edge AI.

Hyena Edge proves you can rip out most of the attention, still match or beat quality, and get way faster on real-world devices. And since Liquid AI plans to open source it, they're inviting everyone to take it even further. Is this the future?

Powerful AI running straight from your pocket with no cloud in sight, or are we just dreaming too big, too soon? Thanks for watching, and I'll catch you in the next one.