Nvidia Stock to See New Growth Catalyst; 35X Faster AI with Groq 3 LPX
March 20, 2026
Beth Kindig
Lead Tech Analyst
At GTC this week, Jensen Huang stated the revenue opportunity for Nvidia’s artificial intelligence chips may reach at least $1 trillion through 2027, up from a previous target of $500 billion. While that grabbed most of the headlines, there was another jaw-dropping statistic that will set the stage for the coming years: the ability to drive up to 35X higher throughput per megawatt with the new Groq 3 LPX racks.
The 256-chip LPX rack introduces Groq’s unique SRAM‑based architecture that allows Nvidia to offload decode‑phase workloads and massively increase token throughput. This primarily targets trillion‑parameter LLMs, million-token context, and multi‑agent systems, which are bottlenecked less by compute and more by how efficiently a system can move data and generate tokens. Paired with the new Vera Rubin GPUs, Nvidia claims this architecture can deliver up to 35X higher throughput per megawatt, with seamless integration into Vera Rubin deployments.
In some ways, this acquisition draws parallels to Nvidia’s $6.9 billion acquisition of Mellanox, which my firm covered for premium research members in 2020. Mellanox was a strategic purchase to clear the bottleneck on GPU performance at the time, which was scale-out networking. By combining Nvidia’s GPUs with the strength of Mellanox’s InfiniBand, smart NICs and switching, Nvidia removed that limiter and turned individual accelerators into clusters.
The Groq acquisition aims to solve a different limiter: inference throughput per watt, where memory bandwidth can become the gating factor on token output and cost. Nvidia is preparing to position its GPUs to be among the best inference options available, utilizing Groq’s unique SRAM-based architecture to significantly turbocharge token throughput and accelerate inference performance.
Nvidia expects Groq will help drive up to a 15X increase in tokens per second, directly translating into higher tokens per megawatt, which is already scaling by a factor of 10X between Blackwell and Rubin. If these claims hold true, then cheaper inference will unlock more usage, and more usage should lead to higher revenue and higher profits as the AI monetization wave plays out.
Below, we cover how Nvidia, the de facto leader in training, is now shifting its focus to inference architecture as the next catalyst.
Why Nvidia is Rethinking AI Inference Architecture
Year after year (and generation after generation), Nvidia has proven that it can consistently deliver massive efficiency gains on inference throughput and token processing speed. For example, Nvidia’s GB200 NVL72 boosts per-GPU throughput by up to 30X versus the HGX H100 platform, while the GB300 NVL72 boasts up to a 50X increase in AI factory output via a 10X increase in tokens per second per user and a 5X increase in throughput per MW.
Versus the Blackwell NVL72 systems, Nvidia says Vera Rubin can deliver up to 10X more throughput per megawatt, rapidly compounding performance gains from its Hopper generation in just three years.
Source: Nvidia
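As a rough illustration of how these vendor-stated multipliers compose, the short Python sketch below reproduces the arithmetic behind the claims above; the factors are Nvidia’s “up to” figures, not independently benchmarked results.

```python
# Rough composition of the vendor-claimed "up to" multipliers cited above.
# These are Nvidia-stated figures, not independent benchmarks.

tps_per_user_gain = 10       # GB300 NVL72: tokens per second per user
throughput_per_mw_gain = 5   # GB300 NVL72: throughput per megawatt
factory_output_gain = tps_per_user_gain * throughput_per_mw_gain
print(f"Implied AI factory output gain: {factory_output_gain}X")               # 50X

rubin_vs_blackwell_per_mw = 10  # Vera Rubin vs. Blackwell NVL72, throughput per MW
print(f"Rubin vs. Blackwell throughput per MW: {rubin_vs_blackwell_per_mw}X")  # 10X
```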
However, the more important piece of the puzzle is not just the rapid pace of these throughput gains to date, but how Nvidia can continue to deliver exponentially more throughput and accelerate inference workloads even further. The answer is Groq, and ‘inference disaggregation’ at the rack level.
Inference disaggregation refers to splitting up the two-step process of token generation, prefill and decode, instead of running both steps on the same hardware. The prefill phase processes the entire input token sequence in parallel and stores the results in the KV cache, relying heavily on GPU compute rather than on memory. The decode phase generates output tokens one by one in a sequential manner, relying on the KV cache and previously generated tokens, making it extremely reliant on memory bandwidth and capacity to rapidly access cached data. When AI workloads are described as memory constrained, that constraint comes from the decode phase.
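For readers less familiar with the two phases, below is a minimal, framework-agnostic sketch of autoregressive inference. The `model` object and its methods are purely hypothetical placeholders used to show where the compute-heavy and memory-heavy work happens; this is not Nvidia’s or Groq’s actual software interface.

```python
# Minimal, illustrative sketch of the two inference phases described above.
# The `model` object and its methods are hypothetical placeholders, not a real API.

def prefill(prompt_tokens, model):
    """Compute-heavy: process the whole prompt in parallel and build the KV cache."""
    kv_cache = model.attend(prompt_tokens)        # one large, parallel pass over the prompt
    first_token = model.sample_next(kv_cache)
    return kv_cache, first_token

def decode(kv_cache, first_token, model, max_new_tokens):
    """Memory-heavy: generate tokens one at a time, re-reading the growing KV cache each step."""
    output, token = [first_token], first_token
    for _ in range(max_new_tokens - 1):
        kv_cache = model.append(kv_cache, token)  # cache grows with every generated token
        token = model.sample_next(kv_cache)       # each step streams weights + cache from memory
        output.append(token)
    return output
```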
When prefill and decode share the same hardware (the GPUs), the two interfere with each other and cause delays: a new prefill request either forces the system to pause decodes and prioritize the prefill, or runs both at the same time, elongating response times.
With inference disaggregation, prefill and decode can be scaled and scheduled on different optimized hardware via Nvidia’s Dynamo; in this case the Rubin GPUs handle prefill and Groq LPUs handle decode. With disaggregation and the LPU’s massive memory bandwidth, Nvidia CEO Jensen Huang says the two combined can deliver up to 35X higher throughput per MW on trillion-parameter LLMs:
“What if we disaggregated inference altogether with a piece of software called Dynamo? What if we rearchitected the way that inference is done in the pipeline, so that we could put the work that makes perfect sense on Vera Rubin and then offload the decode generation, the low latency, the bandwidth limited challenged part of the workload for Groq. And so we united, unified processors of extreme differences, one for high throughput, one for low latency.
It still doesn't change the fact that we need a lot of memory. And so Groq, we're just going to add a whole bunch of Groq chips, which expands the amount of memory it has. And so if you could just imagine, out of 1 trillion parameter model, we have to store all of that in Groq chips. However, it sits next to NVIDIA Vera Rubin, where we could hold the massive amounts of KV cache that's necessary in processing all of these agentic AI systems. It's based upon this idea of disaggregated inference. We do the prefill, that's the easy part, but we also tightly integrate the decode.
So the attention part of decode is done on NVIDIA's Vera Rubin, which needs a lot of math and the feed forward network part of it, the decode part is done -- the token generation part is done -- on the Groq chip. The 2 of them working tightly coupled together over today, Ethernet with a special mode to reduce its latency by about half.
And so that capability allows us to integrate these 2 systems. We run Dynamo, this incredible operating system for AI factories on top of it, and you get 35x increase, not to mention additional new tiers of inference performance for token generation the world has never seen.”
Inference disaggregation is not an entirely new concept, but rather it is the way Nvidia is approaching disaggregation that makes this move noteworthy. Instead of seeing disaggregation as a separate, service-layer optimization, such as what AWS is eyeing with its recent partnership with Cerebras, Nvidia is pushing to directly embed disaggregation into the rack to maximize throughput.
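Conceptually, rack-level disaggregation amounts to a scheduler that routes the two phases to different pools of specialized hardware. The sketch below illustrates that idea only; the class names, worker pools, and methods are hypothetical and do not represent Dynamo’s real API.

```python
# Illustrative-only sketch of disaggregated inference scheduling: prefill requests are
# routed to a compute-optimized pool (GPUs), decode to a bandwidth-optimized pool
# (LPU-style racks). Names and methods are hypothetical, not Dynamo's real interface.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int

class DisaggregatedScheduler:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool = prefill_pool   # compute-optimized workers (e.g., Rubin GPUs)
        self.decode_pool = decode_pool     # bandwidth-optimized workers (e.g., Groq LPUs)

    def serve(self, request: Request):
        # 1) Prefill runs once, in parallel, on the compute-heavy pool.
        kv_cache = self.prefill_pool.run_prefill(request.prompt_tokens)
        # 2) The KV cache is handed off over the rack interconnect to the decode pool,
        #    which generates tokens sequentially where memory bandwidth dominates.
        return self.decode_pool.run_decode(kv_cache, request.max_new_tokens)
```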
Inside Groq’s SRAM Architecture and Its Massive Bandwidth Advantage
Groq’s chips feature a completely different memory architecture than Nvidia’s GPUs, utilizing SRAM instead of HBM. This unique architecture gives Groq’s language-processing units (LPUs) a significant advantage in the decode phase and in low-latency, high-query inference workloads, thanks to dramatically higher bandwidth.
SRAM offers a major trade-off versus DRAM and HBM when it comes to memory capacity within AI accelerators. Unlike typical DRAM, SRAM does not require capacitors and stores data without periodic refreshing, as long as power is available. Because of this design, SRAM offers the fastest memory access speeds with minimal latency, though at the cost of a mere fraction of the capacity of HBM chips – the LPUs have just 500MB of capacity versus 288GB of HBM in Nvidia’s Rubin GPUs.
Despite having just 500MB of capacity, each LPU delivers 150 TB/s of SRAM bandwidth -- this is nearly 7X the 22 TB/s HBM bandwidth per Rubin GPU. In the rack-scale configuration, the Groq 3 LPX delivers an astounding ~2.5X increase in total scale-up bandwidth and a 25X increase in SRAM bandwidth versus HBM bandwidth.
The Groq 3 LPX combines 256 individual LPUs for a total of 128GB of SRAM capacity, yet it offers 40 PB/s of SRAM bandwidth versus 1.6 PB/s of HBM bandwidth in the Vera Rubin NVL72. Total scale-up bandwidth reaches 640 TB/s versus 260 TB/s in the NVL72. This also dwarfs the upcoming NVL576 rack which offers just 4.6 PB/s of HBM bandwidth.
This 25X increase in bandwidth is precisely the reason why Nvidia is aiming to offload decode and low-latency workloads to the LPX racks, as more bandwidth means more weight data can be processed per second, which, at its core, means more tokens can be generated per second.
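These figures can be sanity-checked with simple arithmetic. The sketch below uses only the vendor-stated numbers cited in this article, so the outputs are illustrative rather than measured results.

```python
# Back-of-the-envelope check of the bandwidth figures cited above
# (vendor-stated numbers, not independent measurements).

lpu_sram_bw_tbs = 150            # per-LPU SRAM bandwidth, TB/s
lpus_per_rack = 256
rubin_hbm_bw_per_gpu_tbs = 22    # per-GPU HBM bandwidth on Rubin, TB/s

print(f"Per-chip bandwidth advantage: {lpu_sram_bw_tbs / rubin_hbm_bw_per_gpu_tbs:.1f}X")  # ~6.8X

rack_sram_bw_pbs = lpu_sram_bw_tbs * lpus_per_rack / 1000   # ~38.4 PB/s, quoted as ~40 PB/s
nvl72_hbm_bw_pbs = 1.6                                      # Vera Rubin NVL72 HBM bandwidth, PB/s
print(f"Rack-level SRAM vs. HBM bandwidth: ~{rack_sram_bw_pbs / nvl72_hbm_bw_pbs:.0f}X")   # ~24X (quoted as 25X)

scale_up_lpx_tbs, scale_up_nvl72_tbs = 640, 260              # total scale-up bandwidth, TB/s
print(f"Scale-up bandwidth advantage: ~{scale_up_lpx_tbs / scale_up_nvl72_tbs:.1f}X")      # ~2.5X
```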
Nvidia Positioning Groq 3 LPX as a ‘Token Accelerator’
Nvidia is positioning its new Groq 3 LPX racks as a ‘token accelerator’ functioning in tandem with Vera Rubin GPUs to significantly boost token throughput and address the upcoming multi-agent future. The Groq LPUs are not meant to replace GPUs in inference workloads, but rather complement them by optimizing for memory-intensive decode.
Off the bat, Nvidia expects that combining Rubin GPUs and Groq racks will drive a substantial increase in token throughput, with Nvidia VP Ian Buck claiming the combination “moves us from a world where 100 tokens per second is a reasonable throughput to one of 1500 TPS or more for AI agent intercommunication.”
To visualize this, anything over 100 TPS already feels near-instantaneous for chatbot users; 1,500 TPS, meanwhile, would represent roughly 1,500 words per second, or ~275X the average human reading speed. This shift from 100 TPS to 1,500+ TPS is more important than it might appear, as 100 TPS is optimized for human consumption, such as chatbot outputs, while 1,500 TPS is optimal for machine consumption, such as multi-agent communication, autonomous long-form reasoning and real-time AI systems that all require continuous, low-latency token streams.
The introduction of the Groq LPUs as the seventh chip in Rubin’s co-design also represents a natural shift in Nvidia’s rack-scale strategy that may help deepen its moat, as Nvidia disaggregates compute and bandwidth across different specialized architectures to optimize inference at the rack and system level rather than the chip level. Nvidia is moving quickly with the new combined infrastructure, with Groq chips in volume production at Samsung and CEO Jensen Huang saying they will begin shipping around the Q3 timeframe.
Nvidia foresees a rather large opportunity from this new integration, with CEO Jensen Huang explaining at GTC that he believes the Groq racks could account for up to 25% of a data center footprint to extend the performance and value of Vera Rubin, as well as future chips. Overall, Huang added that combining Vera Rubin with the Groq LPX racks could unlock a $300 billion annual revenue opportunity for customers.
While some analysts had cautioned that reaching the upper end of this range would depend on buyer appetite and ‘ultra-premium’ tiers such as up to $150 per million tokens (roughly 10X GPT 5.4’s cost), the scale of the opportunity reflects Nvidia's belief that inference-optimized rack-level systems will become a key part of future AI infrastructure buildouts.
AI Monetization is Arriving, and Tokens are the Currency
As we had covered in our Bloom Energy analysis, My Top 2026 Stock Pick for the AI Boom, the real risk to the AI economy lies in the physical constraints of scaling these AI ambitions — not in compute availability from companies like Nvidia or Broadcom, and certainly not in Big Tech’s software capabilities, but in power availability, thermal management, and infrastructure that were never designed for this magnitude of demand.
This is the core challenge the AI industry now faces, and this means the most important equation for the upcoming inference-driven monetization wave is how many tokens can be generated, served, and monetized within a fixed power envelope. With Vera Rubin and the new Groq racks, Nvidia is increasingly orienting its GPU roadmap around that point, aiming to exponentially increase tokens per second per watt. It is about creating a platform that is not just faster, but able to deliver more of that monetizable output (tokens) per watt.
Nvidia CEO Jensen Huang made this point extremely clear at GTC, explaining that “everybody is looking for land, power and shell. Once you build it, you are power limited. Within that power limited infrastructure, you better make for darn sure that your inference -- because you know inference is your workload, and tokens is your new commodity, that compute is your revenues -- that you want to make sure that the architecture is as optimized as you can.” With Vera Rubin, Huang emphasized that Nvidia is “going to take our token generation speed, token generation rate from 2 million to 700 million, a 350x increase” for GW-scale AI data centers. To roughly estimate what this could look like using a cost of $1 per million tokens, this would represent a step function from $2 in revenue to $700, before applying that to scale.
While achieving the 350X increase in token generation may be reserved for hyperscalers operating at maximal scale and efficiency, the principle translates across the industry to emerging neoclouds and data center operators alike. Think of it this way -- for a data center with a fixed 100MW power envelope, the number of users and tokens that can be served with Vera Rubin and Groq racks is multiples higher than with Blackwell and earlier generations.
This means that driving TPS per MW higher is essentially a multiplier on revenue and margins, as more tokens under the same power footprint translates directly to higher revenue and lower costs per token processed. As Nvidia puts it, up to 10X more tokens per MW and up to 10X lower cost per million tokens with Rubin versus Blackwell – put differently, if it cost a cloud provider $10 to serve 1 million tokens that generated $15 in revenue, it would net $5 in profit. With Rubin, if it can generate 10 million at that same $10 cost, profit could reach as much as $140.
The above scenario assumes high revenue for AI inference, which may compress as the AI inference market is built out. Yet, even with a ~67% compression in token costs (revenue) from $15 to $5, there will still be $40 in profit at the 10 million tokens, or an 8X increase.
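To make the unit economics above concrete, here is a short sketch of the same scenario; the prices, costs, and multipliers are the hypothetical figures used in this article, not market data.

```python
# Illustrative unit economics from the scenario above (hypothetical prices, not market data).

def inference_profit(tokens_millions, price_per_million, serving_cost):
    """Profit from serving a given number of tokens at a fixed serving cost."""
    return tokens_millions * price_per_million - serving_cost

print(inference_profit(1, 15, 10))    # Baseline: 1M tokens at $15/M, $10 cost -> $5 profit
print(inference_profit(10, 15, 10))   # Rubin-class: ~10X the tokens at the same cost -> $140
print(inference_profit(10, 5, 10))    # With ~67% price compression ($15 -> $5/M) -> still $40
```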
This does not mean that Nvidia is immune to rising competition from custom silicon, as hyperscalers continue to turn towards custom chips to optimize for specific inference workloads and dramatically lower serving costs. For example, Alphabet lowered Gemini’s inference serving costs by 78% through 2025 via model optimizations, utilization and efficiency improvements, and its newest TPU generation is likely to drive further cost reductions through 2026. Meta also recently expanded its custom silicon roadmap with four new chips, focusing on ranking and recommendation model performance, genAI models, and inference, while increasing HBM bandwidth and capacity each generation.
Among the hyperscalers and startups with the deepest pockets, custom silicon will likely remain a key choice in AI deployments for its ability to offer much lower costs and high performance for optimized workloads. However, for neoclouds and companies without capital to build and deploy ASICs at scale, Nvidia is creating an extremely compelling value proposition by offering a platform optimized for token throughput at scale.
Conclusion
Nvidia is leveraging Groq’s SRAM-based LPUs and extreme memory bandwidth to significantly accelerate inference and token throughput by offloading the decode phase to the new chips. When paired with Vera Rubin, Nvidia claims this architecture can deliver up to 35X higher throughput per megawatt for trillion parameter LLMs. As the AI industry now faces power and infrastructure constraints rather than compute, the key differentiator in the upcoming AI inference monetization wave will be how to extract the highest number of tokens per megawatt to maximize revenue.
For years, Nvidia’s dominance has been synonymous with training. Now, the company is making it clear it wants to own inference, which is the part of AI that actually scales into everyday usage and recurring revenue. The market latched onto Jensen Huang’s $1 trillion AI chip visibility through 2027, but the bigger tell may be what Nvidia is optimizing for next: tokens per megawatt. If Groq 3 LPX helps Rubin deliver anything close to the claimed throughput gains, Nvidia’s next growth leg won’t be about building bigger models—it will be about making inference cheap enough that demand explodes.
The I/O Fund predicted Nvidia would become the world’s most valuable company in 2019 – years before Street consensus. Today, our team runs a high-performing tech portfolio with cumulative returns of 326%, which would place us as #1 if we were a hedge fund and #3 if we were a tech ETF or mutual fund. To get a 60-page analysis on our Top 15 AI Stocks, sign up now.
Damien Robbins, Equity Analyst at I/O Fund, contributed to this analysis.
Please note: The I/O Fund conducts research and draws conclusions for the company’s portfolio. We then share that information with our readers and offer real-time trade notifications. This is not a guarantee of a stock’s performance and it is not financial advice. Please consult your personal financial advisor before buying any stock in the companies mentioned in this analysis. Beth Kindig and the I/O Fund own shares in NVDA at the time of writing and may own stocks pictured in the charts.