22 nm Larrabee

Well, not disputing so much as wondering why Nick keeps repeating the same words over and over again irrespective of what anybody says.
 
Sure, but what's the point of repeating the same words over and over again?

Anyway, this train of discussion is rapidly derailing. Let's try to get back to 22nm Larrabee.
 
rpg.314, may I dare a question: am I right to assume that your POV is that CPUs (even throughput-oriented ones) will never be competitive versus a SIMD array under the control of command processor(s) (so keeping dedicated tex units)? Or is your POV more that it can be OK, but it needs more fixed hardware (say a rasterizer or a tiny GPU)?

Nick, while augmenting the throughput of the CPU core (simple or complex), don't you think it could be interesting to keep a "tiny/lesser GPU"?
I mean, you pointed out earlier in the thread that some time ago the vertices were still being handled by the CPU on Intel platforms. How about moving the pixel shading too (like DICE is doing on the PS3)?
Basically you put down a tiny GPU with, by today's standards, a "fucked up" ALU/Tex ratio (i.e. plenty of texturing power versus compute power). Assuming most 3D engines are moving to more and more deferred techniques, the GPU (a modern one) would act as a "deferred renderer accelerator"/"render target filler". Could that make sense?
The reason I think about it is that in Larrabee as it was, there were multiple tex units distributed along an already busy bus. The way SnB works looks (for some reason) "better" to me.
I got your point about texturing, but I guess I'll have to side with the others, as I believe low-quality texturing is a plague. But I agree with you on the idea of Intel tackling the market from below.
So back to a hypothetical Larrabee @ 22nm, and assuming they are no longer going for the high end.
For example you could have:
2 CPU cores, 8 Larrabee cores, a GPU and an L3.
The Larrabee cores act mostly as they were supposed to, but at some point (close to rasterization) they write the tile/bin they were working on into a queue in the L3 (or in RAM). The GPU reads it, does its stuff and writes a tile back into the L3; the tiles, as part of a task queue, are then processed by "free" Larrabee cores.
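To make the hand-off I have in mind a bit more concrete, here is a rough C++ sketch of the tile queues. It's only a toy model (the Tile struct, BinQueue and the stage names are all made up), not a claim about how Intel would actually build it:

```cpp
// Toy model of the proposed pipeline: binning cores -> fixed-function block ->
// shading cores, with L3-resident task queues in between. Purely illustrative.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Tile { int x, y; };  // plus bin data in reality: triangle refs, state, ...

// A blocking queue standing in for a task queue living in the shared L3.
class BinQueue {
public:
    void push(Tile t) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(t);
        cv_.notify_one();
    }
    bool pop(Tile& t) {  // blocks until work arrives or the producer shuts down
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        t = q_.front();
        q_.pop();
        return true;
    }
    void shutdown() {
        std::lock_guard<std::mutex> lk(m_);
        done_ = true;
        cv_.notify_all();
    }
private:
    std::queue<Tile> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    BinQueue to_ff, to_shade;  // the two L3-resident queues in this model

    // "Larrabee" cores: front end plus binning, then hand tiles to the FF block.
    std::thread binner([&] {
        for (int i = 0; i < 16; ++i) to_ff.push({i % 4, i / 4});
        to_ff.shutdown();
    });

    // Fixed-function block: rasterizes/fills each tile and pushes it back.
    std::thread ff_block([&] {
        Tile t;
        while (to_ff.pop(t)) to_shade.push(t);
        to_shade.shutdown();
    });

    // Any "free" core picks up finished tiles and shades them.
    std::thread shader([&] {
        Tile t;
        while (to_shade.pop(t)) std::printf("shading tile (%d,%d)\n", t.x, t.y);
    });

    binner.join();
    ff_block.join();
    shader.join();
    return 0;
}
```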
My idea is that this could be a win, as you isolate the performance-critical part of 3D rendering into a pretty limited part of the chip. It may evolve, but it is likely to remain a pretty tiny part of your silicon budget. Assuming shrinks and improvements to its design, you would pay the price once (I mean, it takes 15% of your die now; on the next process it will be less), so the price could actually keep decreasing.
Looking forward, if communication overhead between too many cores becomes a problem, then instead of using 2 CPU cores and 8 Larrabee ones it may be better to use 10 standard CPU cores, for example, but some parts of the graphics pipeline as it is now would not be compromised (/low performance), and the price would have been paid a long time ago. On the other hand you maximize "utility", as you said.
 
rpg.314, may I dare a question: am I right to assume that your POV is that CPUs (even throughput-oriented ones) will never be competitive versus a SIMD array under the control of command processor(s) (so keeping dedicated tex units)? Or is your POV more that it can be OK, but it needs more fixed hardware (say a rasterizer or a tiny GPU)?

As 3D and others pointed out, trying to add wide scatter/gather hurts serial performance, so they are unlikely to be added. Besides, in the last 10 years we have seen progress towards wider SIMD but no addition of anything beyond wide aligned loads/stores; I guess the IHVs have a good reason for that.
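Just to illustrate what the lack of gather means in practice, here is a quick sketch in plain C++ (the lane count is an arbitrary assumption): without a hardware gather, every lane turns into its own scalar load, while a contiguous access maps to the single wide load that SSE/AVX-class ISAs actually provide.

```cpp
// Emulated gather vs. the wide contiguous loads that current SIMD ISAs offer.
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // e.g. an AVX-wide float vector (assumption)
using Vec = std::array<float, kLanes>;

// Without hardware gather: one independent scalar load per lane, each a
// potential cache miss, and nothing the compiler can fuse into one wide access.
Vec emulated_gather(const float* base, const std::array<int, kLanes>& idx) {
    Vec out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane)
        out[lane] = base[idx[lane]];  // scalar load per lane
    return out;
}

// The contiguous case: trivially vectorizable as a single wide (un)aligned load.
Vec wide_load(const float* base) {
    Vec out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane)
        out[lane] = base[lane];
    return out;
}
```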

Tex units are non-negotiable, and in all likelihood, to make a competitive GPU you'll need more FF hardware as well.
 
I was tempted to write an in-depth reply, but then I realised this entire discussion is just a short-term anomaly. All CPUs and GPUs are designed around one key limitation: massive external memory latency. It is not a coincidence that the most interesting innovations in the last 10+ years have happened in DSP architectures with very small external bandwidth requirements (e.g. wireless basebands: Picochip, Icera, Coresonic, ...)

Is this a fundamental limitation of physics? No, it's only a fundamental limitation of electricity. While data movement will always be a potential bottleneck in a fixed-dimensional world (as opposed to a pointer-based world as possibly implied by quantum teleportation), it seems unlikely that the cost of data movement per computation must remain as high as it is today.

Imagine what would happen if external memory latency was so low that you had the equivalent of a 1GB L1 cache in terms of latency and bandwidth (with very low power consumption). Every single design consideration of modern architectures would fly right out of the window. Even fine-grained parallelism would be made orders of magnitude simpler. In the long-term, Silicon Photonics is one viable contender - it might not get to that level of performance overnight, but it's far from impossible.

---

Nick, as to your points, even if you were right about area (which I'm very skeptical about), I think you're massively underestimating the data movement overhead and in general the power consumption penalty of all this. And chip designers are more and more willing to sacrifice a LOT of area to save power consumption, both directly at the architectural level and indirectly by reducing voltages.

So yeah, errr... we were talking about 22nm Larrabee I think? It is noteworthy that even Larrabee is severely limited by the cost of on-chip data movement (hi R5xx/R6xx ring bus). Adding more accelerators would have a lot of hidden costs. I think it's really important that both the software and hardware architectures are made with data movement in mind, and this adds yet another layer of complexity that you wouldn't expect from a classical software renderer.
 
Are you saying that Silicon photonics can deliver ~1ns read/write latency to DRAM? If so, would you please write an in-depth article about it instead, discussing the roadblocks, the upsides and the state of the art? ;)
 
Hi Nick,

Did you try writing a DX10 or even DX11 level software renderer? What do you think are the biggest differences?

I think the pipeline structure gets much more complex, and it is much harder to load-balance the stages the way a DX9 pipeline can be balanced without using off-chip memory stream-out. Not to mention the new features like GS, append/consume buffers and the tessellation units. They might require even more bandwidth compared to GPUs because of the lack of huge FIFOs.
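For what it's worth, here is roughly how I imagine a software renderer would emulate a DX11-style append/consume buffer: a flat array plus an atomic counter (which the GPU hides in dedicated counter hardware). Just a sketch with made-up names, and it assumes append and consume phases don't overlap, as is usually the case between draw calls:

```cpp
// Sketch of an append/consume buffer for a software pipeline: an array plus an
// atomic counter. Overflow would have to spill to memory in a real renderer.
#include <atomic>
#include <cstdint>
#include <vector>

template <typename T>
class AppendConsumeBuffer {
public:
    explicit AppendConsumeBuffer(std::size_t capacity) : data_(capacity), count_(0) {}

    // Append: reserve a slot with an atomic increment, then write the element.
    bool append(const T& v) {
        std::int64_t slot = count_.fetch_add(1, std::memory_order_relaxed);
        if (slot < 0 || slot >= static_cast<std::int64_t>(data_.size()))
            return false;  // full: this is where a huge FIFO or spill is needed
        data_[static_cast<std::size_t>(slot)] = v;
        return true;
    }

    // Consume: reserve the last written slot with an atomic decrement.
    bool consume(T& out) {
        std::int64_t slot = count_.fetch_sub(1, std::memory_order_relaxed) - 1;
        if (slot < 0) {
            count_.fetch_add(1, std::memory_order_relaxed);  // undo the underflow
            return false;  // drained
        }
        out = data_[static_cast<std::size_t>(slot)];
        return true;
    }

private:
    std::vector<T> data_;
    std::atomic<std::int64_t> count_;
};
```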

I suspect the reason Larrabee got canceled was that it was designed for DX9; the software rasterizer simply can't keep up when tessellation is used. They would need to re-write the whole rasterization step for pixel-sized triangles. Furthermore, I suspect the quad structure is no longer efficient and they might need to use some quad-merging algorithm.

PS: I ran SwiftShader with my raytracing demo, and the rendered result looks incorrect. I understand the need to optimize the texture unit, but what change affects precision?
 
Are you saying that Silicon photonics can deliver ~1ns read/write latency to DRAM? If so, would you please write an in-depth article about it instead, discussing the roadblocks, the upsides and the state of the art? ;)
Silicon photonics can't do 1ns today, but I'm pretty sure it's not a fundamental limitation - there's just no reason to focus too much effort on that aspect today when it wouldn't change anything for the short/mid-term applications. Also, the latency of the DRAM itself is potentially a big problem. Even if you divided it into smaller blocks, it would still be the bottleneck.

I think the first commercial implementation of something like this could probably reach latency comparable to very fast CPU L3 (~10-15ns) along with much higher bandwidth than today's DRAM interfaces. And that'd still be a very big deal and fairly disruptive architecturally, even if not as much as L1-level latency. We won't have that anytime soon either, but it's some food for thought...
 
And what is the current state of the art for silicon photonics?
In terms of latency, I'm honestly not sure. But in general the Intel Research stuff is the leading edge, along with companies like Luxtera to a smaller extent, and some academic work that's hard to judge. I'm sorry, I'm not an all-knowing expert here :) Don't confuse my long-term wishful thinking, based on what's agreed to be theoretically feasible, with cold hard predictions ;)
 
Well, from what I understand from your statements, the state of the art in silicon photonics is too backward/costly/infeasible to call massive memory latency a short-term problem. :)
 
I suspect the reason Larrabee got canceled was that it was designed for DX9; the software rasterizer simply can't keep up when tessellation is used. They would need to re-write the whole rasterization step for pixel-sized triangles. Furthermore, I suspect the quad structure is no longer efficient and they might need to use some quad-merging algorithm.
There are quite a few papers out there on software micropoly rendering, and Intel's parallel setup was pretty fast, so I don't think it's the software rasterization that's the problem.

Instead, I think it's the way that tile based rendering needs either gobs of bandwidth with tessellation (binning all the polys and the dynamically generated vertex data), or lots of geometry workload duplication (tessellating each patch for the initial pass and again for every tile it gets binned into).
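A quick back-of-the-envelope calculation shows why. All the numbers below are purely illustrative assumptions, not measurements:

```cpp
// Rough cost of binning tessellated geometry: stream it out once, read it back
// at least once per frame. Every constant here is an assumption.
#include <cstdio>

int main() {
    const double verts_per_frame   = 10e6;  // post-tessellation vertices (assumed)
    const double bytes_per_vertex  = 32.0;  // position + a few attributes (assumed)
    const double frames_per_second = 60.0;

    const double stream_out_bytes = verts_per_frame * bytes_per_vertex;          // written once
    const double traffic_per_sec  = 2.0 * stream_out_bytes * frames_per_second;  // + read back

    std::printf("per-frame stream-out: %.0f MB\n", stream_out_bytes / 1e6);   // ~320 MB
    std::printf("binning traffic:      %.1f GB/s\n", traffic_per_sec / 1e9);  // ~38 GB/s
    return 0;
}
```

Tens of GB/s just to move the amplified geometry around, before a single texel or framebuffer byte is touched; the alternative is re-tessellating every patch for every tile it lands in.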

Tessellation is a great way to amplify data yet avoid bandwidth consumption (assuming it's done right - Cayman's spilling into memory isn't needed with a proper architecture). However, that's only the case if you immediately render the triangles rather than deferring them.
 
Actually, we are expecting the transition to a new type of memory (M-RAM, P-RAM, T-RAM, Z-RAM, you name it :D ) before the industry is likely to move to optics, right?

At least T-RAM and Z-RAM (which is no more, AFAIK) hold good promise
 
Well, from what I understand from your statements, the state of the art in silicon photonics is too backward/costly/infeasible to call massive memory latency a short-term problem. :)
Right, which is why I said this in my original post: "In the long-term, Silicon Photonics is one viable contender" :)

entity279 said:
Actually, we are expecting the transition to a new type of memory (M-RAM, P-RAM, T-RAM, Z-RAM, you name it ) before the industry is likely to move to optics, right?

At least T-RAM and Z-RAM (which is no more, AFAIK) hold good promise
Heh, you'd think so, but all of them are either DOA or will probably never be fast enough to replace anything more than NAND and maybe NOR. The only exception is T-RAM, but so far it's positioned exclusively as an eDRAM and SRAM replacement, and not something for standalone DDR chips. It will be interesting to see if AMD does use it in real products on 32nm or 22nm.

The only really exciting thing on the horizon is the memristor and it remains to be seen whether that will become mainstream before or after photonics.

---

As for Larrabee, I agree that tessellation probably wasn't a showstopper, even if it certainly must have complicated things. It was probably more a combination of factors (late software due to unexpected complexity, slower-than-expected hardware, stronger competition, etc.)
 
There are quite a few papers out there on software micropoly rendering, and Intel's parallel setup was pretty fast, so I don't think it's the software rasterization that's the problem.

True, there are a few micro-polygon setup papers out there, but they're not for Larrabee, and IIRC none of them mention how to use the method efficiently for a tile-based renderer. The closest thing I could find is something like Reyes, which sorts patches into tiles before tessellation. But how to compute a tight yet conservative bound for a patch in a graphics API, without a hint from the user, is an open question. DX11 is just designed for immediate-mode GPUs.
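For a plain (undisplaced) patch the convex-hull property does give you such a bound, and the sketch below (types made up) is basically what I mean; it's exactly the displacement applied in the domain shader that breaks it unless the user supplies a hint:

```cpp
// Conservative screen-space bound for a bicubic Bezier patch via its control
// points: the surface lies inside the convex hull of the control points, so
// their projected AABB is a valid bound for binning before tessellation.
// This no longer holds once the domain shader displaces the surface.
#include <algorithm>
#include <array>

struct Vec2 { float x, y; };
struct ScreenBounds { float min_x, min_y, max_x, max_y; };

// 'projected' holds the 16 control points of one patch, already projected to
// screen space (clipping and guard bands ignored for brevity).
ScreenBounds conservative_patch_bounds(const std::array<Vec2, 16>& projected) {
    ScreenBounds b{projected[0].x, projected[0].y, projected[0].x, projected[0].y};
    for (const Vec2& p : projected) {
        b.min_x = std::min(b.min_x, p.x);
        b.min_y = std::min(b.min_y, p.y);
        b.max_x = std::max(b.max_x, p.x);
        b.max_y = std::max(b.max_y, p.y);
    }
    return b;  // valid only while the tessellated surface cannot leave the hull
}
```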

Instead, I think it's the way that tile based rendering needs either gobs of bandwidth with tessellation (binning all the polys and the dynamically generated vertex data), or lots of geometry workload duplication (tessellating each patch for the initial pass and again for every tile it gets binned into).

So that's the problem: tile-based rendering does not scale with tessellation. In fact, a tile-based renderer uses MORE bandwidth if it has to stream out the tessellated vertices, or wastes a lot of compute power re-tessellating a mesh when it's already vertex bound. So Larrabee's idea is facing a serious dilemma, especially when the tessellation idea is well received by customers.

PS: Despite the Xbox 360's plentiful vertex processing power, more and more AAA developers are dropping the idea of tiling. And that's only 2-3 tiles, compared to hundreds if not thousands on Larrabee!

Tessellation is a great way to amplify data yet avoid bandwidth consumption (assuming it's done right - Cayman's spilling into memory isn't needed with a proper architecture). However, that's only the case if you immediately render the triangles rather than deferring them.

That's the point ;) And it's also why I think a huge FIFO between all functional units (for tessellated geometry, for surviving pixels, ...) and load balancing (between a lot of different processing types) might be the problem for a software renderer. But there is one thing a software renderer wins at: the processor has a lot of cache, and very likely the framebuffer accesses can be cached. Then we need complicated cache management that explicitly controls which memory we don't want polluted by texture fetching and vertex streaming. Take the Xbox 360's L2 locking, for example.
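On a PC core without explicit cache locking, the closest tool I know of is the non-temporal hints. A minimal sketch (names and sizes made up; dst assumed 16-byte aligned, count a multiple of 4 floats):

```cpp
// Keep streaming traffic (vertex/texture data touched once) from evicting the
// data you actually want cached, using SSE non-temporal hints.
#include <xmmintrin.h>  // _mm_prefetch, _mm_loadu_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

void stream_out_vertices(const float* src, float* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; i += 4) {
        // Hint that this data is non-temporal (keep it out of the caches as far
        // as the hardware allows); bounds handling omitted for brevity.
        _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_NTA);
        __m128 v = _mm_loadu_ps(src + i);  // streamed input, touched once
        _mm_stream_ps(dst + i, v);         // write-combining store, bypasses the caches
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```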
 