Inspiring 2027 Target
Here's what I propose as Intel's moonshot system targets:
- 1 ExaFlop of raw FP8/INT8 compute performance
- 5 PB/s of "HBM" bandwidth and 138 TB of capacity
- 2.5 PB/s of GPU-GPU bandwidth
- All while maintaining a 132 kW power envelope
- At a $3M price
Let's grasp the audacity of these targets:
- 3X leap in compute performance
- 10X revolution in memory bandwidth and capacity
- 20X breakthrough in interconnect bandwidth
- All while maintaining the same power envelope and cost
Intel possesses all the technological ingredients to achieve these spectacular specifications. With complete organizational alignment and focus, they can get there. And we should expect Nvidia to set their sights on similar, or even more ambitious, parameters. You need to go beyond your best to compete against Nvidia, and you also need to show up again and again in 2028, 2029..
It's also important to mention that the specs above will translate to very compelling systems in the 1 PetaFlop (132 W) and 100 TeraFlop (13 W) ranges as well, giving Intel an excellent leadership stack from mobile, mini-PC, and desktop to data centers. Intel would have the ability to offer a single stack from device to data center to deploy excellent open models like DeepSeek efficiently to consumers and enterprises.
A single system that can productively host the whole 670B-parameter DeepSeek model for under $10K is very much in Intel's realm. (At 8-bit precision the weights alone are ~0.7 TB, so the required memory capacity is modest.)
There's a "DeepSeek moment" in cost within the next 3-5 year horizon. What gives me this optimism? Go down to first principles and look at the following factors:
- How many logic and memory wafers do we need for the specs above
- The price of these wafers
- PFLOPs-per-mm²
- GBytes-per-mm²
- Wafer yields
- Assembly and rest-of-system overheads
- Margin
With ~0.01 PFLOP/mm² of logic and ~0.03 GB/mm² of memory, you can construct a simple first-principles cost range for these products. You will be amazed to see that there is a 5-10X opportunity on dollars. Exploiting this opportunity is impractical if you don't own most of the components and, more importantly, the final assembly (3D, 2.5D, 2D...).
(I have a simple GPUFirstPrincipleCost web app in the works where you can feed in your inputs and assumptions and the app calculates the cost. Will share when done.)
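Until then, here is a minimal sketch of that arithmetic in Python. The two density figures come from this article; every other input (wafer prices, usable wafer area, yield, assembly overhead, margin) is an illustrative assumption to replace with your own:

```python
# First-principles GPU system cost sketch. The two density figures are
# from the article; all other inputs are illustrative assumptions.
def system_cost(target_pflops, target_mem_gb,
                pflop_per_mm2=0.01,      # logic density (from the article)
                gb_per_mm2=0.03,         # memory density (from the article)
                logic_wafer_usd=20_000,  # assumed leading-edge logic wafer
                dram_wafer_usd=10_000,   # assumed DRAM wafer
                wafer_area_mm2=70_000,   # usable area of a 300 mm wafer
                yield_factor=0.7,        # assumed composite yield
                assembly_mult=1.5,       # packaging + rest-of-system
                margin_mult=1.6):        # gross margin
    good_mm2_per_wafer = wafer_area_mm2 * yield_factor
    logic_wafers = target_pflops / pflop_per_mm2 / good_mm2_per_wafer
    dram_wafers = target_mem_gb / gb_per_mm2 / good_mm2_per_wafer
    silicon_usd = logic_wafers * logic_wafer_usd + dram_wafers * dram_wafer_usd
    return silicon_usd * assembly_mult * margin_mult

# The moonshot: 1 ExaFLOP (1000 PFLOPs of FP8) with 138 TB of memory.
print(f"${system_cost(1000, 138_000):,.0f}")  # ~$2.4M with these assumptions
```

Even with these placeholder inputs, the bill-of-materials math lands under the $3M target, and comparing it against today's street prices is where the 5-10X opportunity shows up.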
Let us now look at pJ/FLOP derived from the ambitious targets above:

Intel2027 pJ/FLOP = (132,000 W × 10¹² pJ/J) / (1000 × 10¹⁵ FLOP/s) ≈ 0.13, i.e., ~0.1
Achieving this target requires roughly a 4X reduction in pJ/FLOP from today's data-center GPUs (an H100 at ~700 W and ~2 PFLOPs of dense FP8 works out to ~0.35 pJ/FLOP), a daunting challenge in the post-Moore's-law era. However, Intel's Lunar Lake silicon already demonstrates promising efficiency, delivering ~100 INT8 TOPS (GPU+NPU) at ~20 W, or ~0.2 pJ/op. This baseline proves Intel possesses IP capable of competitive efficiency.
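As a quick check of the arithmetic behind both numbers:

```python
# Energy-efficiency arithmetic for the targets above.
target_power_w = 132_000                  # 132 kW system envelope
target_flops_per_s = 1000e15              # 1 ExaFLOP of FP8
print(target_power_w * 1e12 / target_flops_per_s)  # ~0.13 pJ/FLOP

# Lunar Lake baseline: ~100 INT8 TOPS (GPU+NPU) at ~20 W.
lnl_power_w, lnl_ops_per_s = 20, 100e12
print(lnl_power_w * 1e12 / lnl_ops_per_s)          # 0.2 pJ/op
```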
4 key connected challenges to overcome:
1. Find another 2X efficiency to get to 0.1 pJ/FLOP
2. While scaling compute 10,000X from that ~100-TOPS baseline to an ExaFlop (including the cost of interconnect)
3. While delivering 10X near-memory bandwidth
4. While staying compatible with the existing Python/C/C++ GPU software (i.e., no esoteric diversions like the quantum, neuromorphic, logarithmic, and other ideas being pursued by a few startups)
3 key contributors to on-chip power (in femtojoules, fJ):
- Math ops: ~8 fJ/bit
- Memory: ~50 fJ/bit
- Communication: ~100 fJ/bit/mm
All state-of-the-art designs will be in the same ballpark on math-op power; they will be close to whatever the leading process node of the time (TSMC N2, Intel 14A, etc.) entitles them to. Most of the interesting differentiation for Intel needs to come from the memory and communication aspects.
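To make that concrete, here is a toy model combining the three coefficients above. The wire distance and operand-reuse values are purely illustrative assumptions; the point is the shape of the curve, not the absolute numbers:

```python
# Per-bit energy coefficients from the list above (femtojoules).
MATH_FJ_PER_BIT = 8
MEM_FJ_PER_BIT = 50
COMM_FJ_PER_BIT_MM = 100

def fj_per_op(reuse, dist_mm=2.0, bits=8):
    """Energy per 8-bit op when an operand fetched from on-chip memory
    and moved dist_mm across the die is reused `reuse` times."""
    fetch = MEM_FJ_PER_BIT * bits
    move = COMM_FJ_PER_BIT_MM * bits * dist_mm
    math = MATH_FJ_PER_BIT * bits
    return math + (fetch + move) / reuse   # movement amortized over reuse

for reuse in (1, 8, 64, 512):
    print(f"reuse={reuse:4d}: {fj_per_op(reuse) / 1000:.3f} pJ/op")
```

With no reuse, moving the operand costs ~30X the math itself; at a reuse of 64 (routine for large GEMM tiles) the total approaches the 0.1 pJ/FLOP target. The multiplier is table stakes; memory and wires decide the outcome.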
At IEDM recently, Nvidia published a picture (not reproduced here) of memory technology that can usher in the era of near-memory computing. Some hints linked below:
https://www.tomshardware.com/news/intel-patent-reveals-meteor-lake-adamantine-l4-cache
Whether it's their homegrown technology or a "tight" partnership with the DRAM industry, there are 10X bandwidth-increase opportunities in the 3-4 year horizon.
Whoever takes the first risks and executes can be far ahead of the rest. Interestingly, the technologies that help deliver the 10X memory bandwidth also help with the 20X communication bandwidth target. The key is to free up more of the chip perimeter for chip-to-chip communication.
Intel also has excellent Silicon Photonics technology, which won't amount to anything if it isn't integrated into products to start the learning loop. All technologies and IP are perishable goods with an expiry date. Consume them before it's too late.
Let's talk about Intel's scalability and software now. Recently I got access to an Intel PVC 8-GPU system on their Tiber cloud. I also had access to 8-GPU setups from AMD and Nvidia. All three systems are floating-point beasts. Here are their FP16/BF16 specifications:
- Nvidia 8xH100 - 8 PF
- AMD 8xMI300 - 10.4 PF
- Intel 8xPVC - 6.7 PF
I wrote a custom benchmark tool (called torchure) to understand the performance of these systems across various sizes and shapes of matrices. The motivation came from observing the traces of various AI models.
I noticed that the majority of the runtime is dominated by sequences of matrix multiplies, generally on large matrices (4K and above). I also wanted to exercise these systems with PyTorch - standard PyTorch, no fancy libraries or middleware. My thesis is that the quality, coverage, and performance of standard PyTorch is a good benchmark for AI software developer productivity on different GPUs.
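torchure itself is a larger harness, but its core loop looks something like this minimal sketch (shapes, iteration counts, and the bfloat16 choice here are illustrative):

```python
import time
import torch

def bench_gemm(n, device="cuda", iters=20, dtype=torch.bfloat16):
    """Time an n x n x n matmul and return achieved TFLOPs."""
    sync = torch.cuda.synchronize if device == "cuda" else torch.xpu.synchronize
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(3):                     # warm-up
        torch.matmul(a, b)
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    sync()
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12            # a square matmul is 2*N^3 FLOPs

for n in (1024, 2048, 4096, 8192):
    print(f"{n:5d}: {bench_gemm(n):7.1f} TFLOPs")
```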
Software observations.
Installing and getting things to "work" the first time took more steps with both AMD and Intel; it involved interactions with engineers at both companies before I got going. Nvidia was straightforward.
But I have to acknowledge that both AMD and Intel have made a ton of progress in making PyTorch easy to use compared to where we were 2 years ago. Intel's driver install and PyTorch setup had a bit less friction than AMD's.
AMD supports the torch.cuda device directly, while with Intel you need to map to the torch.xpu device. So there were a few code adjustments I needed to make for Intel, but it was not too painful (see the sketch below). Intel "sunset" the PVC GPU last year, and from what I have heard the AI software team was busy with Gaudi for the past few years. My expectations of compatibility and performance on Intel were very low.
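For reference, the adjustment boils down to resolving the device once and using it everywhere; a sketch, assuming a PyTorch build with xpu support:

```python
import torch

# "cuda" covers both Nvidia and AMD (ROCm builds reuse the cuda
# namespace); Intel needs the separate "xpu" backend.
def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

x = torch.randn(4096, 4096, device=pick_device(), dtype=torch.bfloat16)
```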
I was pleasantly surprised that I was able to run my tests to completion - not only on 1 GPU, but all the way up to 8 GPUs. Below are the results for 8X GPUs.
Across the sweep of different matrix shapes and sizes:
- Nvidia 8xH100 - 5.3 PF (67% of peak)
- AMD 8xMI300 - 3.1 PF (30% of peak)
- Intel 8xPVC - 2.7 PF (40% of peak)
Some observations:
- Easy to see why Nvidia is still everyone's darling. This is H100; Blackwell will move the bar up even more.
- From the SemiAnalysis article, I understand AMD has new drivers coming that seem to improve GEMM numbers substantially, which is good news for AMD. But this article is not about AMD or Nvidia.
- The surprise here is that the abandoned PVC is even this close to the top GPUs. PVC is a generation behind MI300X in terms of process technology; the majority of PVC silicon is on Intel 10nm, which is ~1.5 nodes behind TSMC N4. The GPU-to-GPU bandwidth through XeLink seems to be performing better than AMD's xGMI solution.
- There are definitely software optimizations left on the table; Intel should be able to get to 60% of peak. You can see the impact of software overhead on Intel with smaller matrix dimensions.
- Intel cancelled the follow-on to PVC, called Rialto Bridge, in March 2023 (https://www.crn.com/news/components...ter-gpu-road-map-delays-falcon-shores-to-2025)
- This chip was ready for tape-out in Q4'22, would have been in volume production in 2024, and was specced to deliver more than the H100.
- AMD began their iteration loop with advanced packaging and HBM with Fiji in 2015, followed by Vega in 2017. MI25, MI50, MI100, MI200, and MI250 followed, and eventually MI300, AMD's first GPU to cross $1B in revenue. You only learn by shipping.

Getting back to the main thread: the data points above show that Intel has the foundations to compete with the best. They need to be actively playing the game, not thrashing the roadmap, and stop snatching defeat from the jaws of victory.
None of this is going to be easy. All layers of Intel have to go through painful transformations. Executive-leadership musical chairs alone are insufficient.
"Let chaos reign and then rein in chaos."
This is a famous quote from Andy Grove (probably the last CEO of Intel who understood every layer of the company's stack intimately; I often wonder what Andy would do now).
Let's dissect this a bit. Why would you let any chaos reign? Isn't all chaos bad? The answer is no. There is good chaos and bad chaos. Good chaos forces you to invent and change. Major tech and industry transitions are good chaos: the Internet, Wi-Fi, cloud, smartphones, and AI are some examples of transitions that can lead to good chaos. Intel benefitted from some of these transitions when it was able to "rein in". Good chaos generally comes from external events. Bad chaos comes from internal issues. I like to call bad chaos "organizational entropy". This is the higher-order bit that decays the efficiency of companies.
https://pdfs.semanticscholar.org/8655/f1d23285639d5833ff4fa0ea4632856011cf.pdf
When entropy crosses a certain threshold, the leadership loses control of the company. No amount of executive thrash can fix the situation until you reduce the entropy.
My humble suggestions for whoever takes the leadership mantle at Intel:
- Increase the coder-to-coordinator ratio by 10X. This is likely the most painful thing to do, as it could mean a massive reduction in headcount first and some rehiring. Give folks stuck in coordination tasks a re-learning opportunity to get back to coding or exit the company. AI tools are great enablers for seniors to get back into hands-on work.
- Organize the company around product-leadership architecture. Intel can build the whole stack of products from 10 W to 150 KW with <6 modular building blocks (chiplets) that are shared across the whole stack. Splitting the company along go-to-market boundaries is preventing them from leveraging their leadership IP up and down the stack (e.g., Lunar Lake's SOC energy efficiency on Xeon would be awesome, but Xeon energy efficiency is far from leadership today). With leadership IP leveraged across the whole stack, Intel can field top-performing products across client, edge, and data center and get a healthy share of the >$500B TAM accessible to them.
- Cancel the cancel culture. The legacy of Intel is built on relentless iteration: cycles to 90% yields on new process technologies every 18 months, the tick-tock model of execution. Cancelling achieves nothing.
- Bet on generality and focus on performance fundamentals: Ops/clock, Bytes/clock, pJ/op, pJ/bit, etc. The boundaries are not CPU, GPU, and AI accelerators. The workloads are an evolving mix of scalar, vector, and matrix computations demanding increasing bandwidth and memory capacity. You have the unique ability to deliver these elements in ratios that can delight your customers and destroy your competitors.
- Make a ton of Battlemage and PVC GPUs available to open-source developers worldwide, friction-free. Selling a ton of Battlemage GPUs is a good step toward this; don't worry about the margins on them. This is the most efficient way to get into the hearts and minds of AI developers while delighting millions of gamers worldwide. Battlemage is a great example of the benefit of iteration: very measurable gains in software robustness and performance since Alchemist in 2022. They will be on a path to leadership if they iterate again and launch Celestial in the next 12 months. Make all inventory of PVC (including the units in the Argonne exascale installation) available to GitHub developers with no "cloud friction": it should be a single click to connect to cloud GPUs from any PC or Mac in the world. Intel GPUs are the most compatible (among Intel's choices) with the PyTorch/Triton AI developer ecosystem. This effort will help immensely with the leadership 2027 system launch, where more software will be functional on Intel day one.