NVIDIA's flops/W and flops/mm² aren't impressive compared to ATI either, but obviously that doesn't tell the whole story. When discussing things that are fundamentally about the *efficiency* of code running on a specific architecture, you need to compare the ideal algorithms for solving a *problem* as written for each architecture, not the same code on both. Already you have to write different code for ATI and NVIDIA DX11 parts if you want to get near the best performance on either. And before you ask what you could do with 3% more transistors, ask yourselves why LRB's bogo-flops/mm² and bogo-flops/W are not very impressive wrt its competitors of today.
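To make the "different code" point concrete, here is a minimal sketch in OpenCL C; the kernel names and the float4-vs-scalar split are my own illustration of circa-2010 tuning folklore, not code from any particular driver or app:

```c
// AMD's VLIW parts of this era generally want explicitly packed work per
// work-item, while NVIDIA's scalar SIMT parts are fine with one element per
// work-item. Same math, two kernels.

// Variant that tends to suit ATI/AMD VLIW hardware: float4 per work-item.
__kernel void saxpy_vec4(__global float4 *y,
                         __global const float4 *x,
                         const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

// Variant that tends to suit NVIDIA's scalar SIMT hardware: one float per work-item.
__kernel void saxpy_scalar(__global float *y,
                           __global const float *x,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```

And it goes beyond the kernel bodies: launch parameters tend to diverge too, since work-group sizes usually end up tuned around AMD's 64-wide wavefronts versus NVIDIA's 32-wide warps.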
I don't think anyone is disagreeing with you on this. That doesn't mean the stupid stuff in OCL/DX/CUDA has to stay that way though, and I think it's fair to say that a lot of the memory and execution models of these languages are borderline broken.

In 2000, intrinsics made sense.
In 2010, OCL/DXCS/CUDA make sense.
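Roughly what I mean, as a sketch (the function names and the scale-and-add example are mine): in 2000 the sane answer was hand-written SSE intrinsics with the vector width baked into the source.

```c
#include <xmmintrin.h>

/* 2000-era approach: hand-written SSE intrinsics, the 4-wide vector width is
   hard-coded into the source. Assumes n is a multiple of 4. */
void scale_add_sse(float *dst, const float *a, const float *b, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
}
```

In 2010 the sane answer is a kernel plus a runtime that picks the width and does the scheduling, warts and all:

```c
/* 2010-era approach (OpenCL C): one work-item per element; the compiler and
   runtime choose the vector width and spread the work across the machine. */
__kernel void scale_add_cl(__global float *dst,
                           __global const float *a,
                           __global const float *b,
                           const float s)
{
    size_t i = get_global_id(0);
    dst[i] = a[i] * s + b[i];
}
```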
I am not hearing numbers.
WHY?
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) you will need quite a lot of x86 decoders to serve it: 16 × 35-40M is already 560-640M. So that is several hundred million transistors wasted on a Larrabee.
I think that is more transistors than any x86 core has EVER shipped with. In fact, I'm pretty sure that, excluding L1 caches, no one has eclipsed the 10M mark yet for a single core.
What?! Haha wow, I didn't know that. So what exactly is filling up the ~500M-transistor difference in a modern quad-core x86? Is that all just cache and other such supporting circuitry?
What do you think about Charlie's new piece?
A clean-up of the x86 ISA? Something new? Or is he wrong altogether?
I say... welcome, PS4's GPU.
The article didn't have much meat to it, given the number of words expended.
It's rather vague on the things I would have liked to know, such as what issues there were and what is to be changed, or what stepping Intel used in its live demo (given the numbers, it seems it was about half of what Intel hoped).
The "converged pipeline" scheme seems nebulous to me. Is this converged in the sense of a convergence with mainline x86 that makes it incompatible with Larrabee I, or a converged pipeline in the sense that there is no scalar and vector demarcation, and that this would be incompatible with both Larrabee I and mainline x86.
The latter interpretation would make the "it benefits from being x86" argument even less relevant than it was for the P54C chimera that was Larrabee I. The "Intel controls the compilers, so breaking the ISA is cool" rationalization is just another sign of how irrelevant the x86 ISA is for anyone but Intel.
If it turns a Larrabee core into basically a core that runs an AVX/LRBni hybrid, its execution units could someday be transplanted into a vector block, something akin to what AMD might be trying with its shared and separately scheduled FPU blocks in Bulldozer.
A core or computing cluster whose base granularity is that of a vector unit or units would be oddly familiar for the GPU folks.
If SPUs were 16 wide.