Larrabee at Siggraph

A pre-z pass doesn't double the bin data. The z-pass only needs a single vertex component (position), whereas the other passes will have many channels (at a minimum position, UV, normal, and tangent, sometimes additional UV channels and color channels). Eliminating those extra channels can also result in fewer total vertices, because it eliminates seams in the normal and UV channels (on the order of 25% fewer for models in our game engine).
The pre-z pass may require more than position if the shader is using texkill.
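To put rough numbers on that bin-data claim, here is an illustrative sketch (the exact channel layout and byte sizes are my assumptions, not anything stated above) of a full shading vertex versus the position-only vertex a depth-only pass can get away with, texkill aside:

```cpp
// Illustrative only: a plausible full shading vertex vs. a depth-only vertex.
#include <cstdio>

struct FullVertex {       // channels the shading passes have to bin
    float position[3];    // 12 bytes
    float uv[2];          //  8 bytes
    float normal[3];      // 12 bytes
    float tangent[4];     // 16 bytes  -> 48 bytes per vertex
};

struct DepthOnlyVertex {  // all a pre-z pass needs (barring texkill)
    float position[3];    // 12 bytes per vertex
};

static_assert(sizeof(DepthOnlyVertex) * 4 == sizeof(FullVertex),
              "with this layout the depth-only stream is a quarter the size");

int main() {
    std::printf("full: %zu bytes, depth-only: %zu bytes per vertex\n",
                sizeof(FullVertex), sizeof(DepthOnlyVertex));
}
```

On top of the per-vertex size, vertices that were duplicated only to break normal or UV seams collapse back into one in the position-only stream, which is where the roughly 25% vertex reduction mentioned above comes from.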
One possibility would be to do a pre-Z pass solely for the purpose of generating a hierarchical occlusion map. In such a case, you could store the resulting Z-buffer at a lower resolution than the actual frame buffer (no MSAA, plus half or quarter resolution). It would be good enough for early rejection and would reduce the bandwidth requirements dramatically. An application hint could indicate how the pre-Z pass is to be used.
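A minimal sketch of what such a reduced-resolution occlusion buffer could look like (my own illustration, assuming the usual "smaller depth = closer" convention; keeping the maximum of each 2x2 block keeps the rejection test conservative):

```cpp
// Build a quarter-area occlusion buffer from a pre-Z depth buffer by keeping
// the farthest depth in each 2x2 block, then test bounding rectangles against it.
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<float> downsample_max(const std::vector<float>& depth, int w, int h) {
    std::vector<float> out((w / 2) * (h / 2));
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x) {
            float a = depth[(2 * y) * w + 2 * x];
            float b = depth[(2 * y) * w + 2 * x + 1];
            float c = depth[(2 * y + 1) * w + 2 * x];
            float d = depth[(2 * y + 1) * w + 2 * x + 1];
            out[y * (w / 2) + x] = std::max(std::max(a, b), std::max(c, d));
        }
    return out;
}

// An object whose nearest depth is farther than every stored value it covers
// cannot contribute a visible pixel, so it can be rejected early.
bool occluded(const std::vector<float>& coarse, int cw,
              int x0, int y0, int x1, int y1, float nearest_depth) {
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearest_depth < coarse[y * cw + x])
                return false;   // might be visible in this tile
    return true;
}

int main() {
    std::vector<float> z(4 * 4, 0.3f);          // tiny 4x4 pre-Z buffer, all near
    std::vector<float> coarse = downsample_max(z, 4, 4);
    bool hidden = occluded(coarse, 2, 0, 0, 1, 1, /*nearest_depth=*/0.9f);
    std::printf("object hidden: %d\n", hidden); // 1: it sits behind everything drawn
}
```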
A very risky proposal as stated.
 
One advantage of doing a pre-z pass over everything is that you could potentially use the z-pass to do occlusion culling for the later passes.
That has always been possible, just not terribly practical. Larrabee changes nothing about that practicality as long as you are using it as an ordinary GPU ... if you move the entire rendering pipeline to Larrabee it becomes practical, but at that point you can also do deferred shading and save yourself the trouble.

PS: all intelligent methods to make sure only visible surfaces get shaded are buggered up by pixel shaders using z-kill ... but you know beforehand it will happen and can treat those as transparent.
 
I think you're focusing on something other than what I was trying to get across, though. I was just conveying that for Intel, the best-case scenario would be getting Larrabee into the Xbox, as right there it would ensure a certain level of captive industry support... to say nothing of the bevy of development tools Microsoft would start producing around it. And for MS' own purposes, those tools would automatically be supportive of the PC ecosystem as well. So I wouldn't view it as the death of PC gaming at all, simply the 'arrival' of Larrabee in unquestioned terms.

I know that was your point, but as a zealous, pitchfork-wielding PC gamer that was the first thought that crossed my mind. ;)

Given Microsoft's current treatment of the hand that feeds their console effort (their PC branch customers), I would expect nothing more than a console focus and some crumbs for us PC gamers if Larrabee were to fall into their slimy hands.
 
Do you have any evidence to back up your assertion that the microcode remained identical between P6 and Conroe?

I don't have a good on-line source to cite. My comment is based upon my recollections of conversations with Intel designers at research conferences and such. Which, of course, means my recollection could be faulty.

The existence of micro-op fusion in the Pentium M and Core processors is perhaps circumstantial evidence that could support my assertion that P6 and Conroe's microcode are similar.

All modern dynamically scheduled x86 cores first split x86 operations into smaller, bite-sized micro-ops. But the Pentium M, Core, and Core 2 then take the extra step of fusing some of the micro-ops from the same x86 instruction back together. Why would they bother to split them and then fuse them again? There might be lots of reasons (such as to simplify register rename), but I think one of the reasons is that it was easier to implement micro-op fusion than to modify the decoder/cracker used in prior P6-like chips. That is, they decided it was easier to fuse them together rather than change the x86 decode to generate different micro-ops.
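As a toy illustration of the split-then-fuse idea (entirely my own sketch, not Intel's actual micro-op encoding): a load-op instruction such as "add eax, [rbx]" cracks into a load micro-op plus an ALU micro-op, and fusion then tracks that pair as a single bookkeeping entry through most of the pipeline, splitting again only at the execution units.

```cpp
// Toy model of cracking an x86 load-op instruction into micro-ops and then
// fusing them back into one tracked entry (illustrative, not Intel's format).
#include <cstdio>
#include <string>
#include <vector>

struct MicroOp {
    std::string what;
    bool fused_pair = false;   // does this entry carry two operations?
};

// Crack "add eax, [rbx]" P6-style: a load micro-op plus an ALU micro-op.
std::vector<MicroOp> crack() {
    return { {"load  tmp <- [rbx]"}, {"add   eax <- eax + tmp"} };
}

// Pentium M / Core style micro-op fusion: keep the pair as one bookkeeping
// entry, even though it still issues to two different execution ports.
std::vector<MicroOp> fuse(const std::vector<MicroOp>& uops) {
    if (uops.size() == 2)
        return { {uops[0].what + "   +   " + uops[1].what, true} };
    return uops;
}

int main() {
    for (const MicroOp& u : fuse(crack()))
        std::printf("%s (fused=%d)\n", u.what.c_str(), u.fused_pair);
}
```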
 
I don't have a good on-line source to cite. My comment is based upon my recollections of conversations with Intel designers at research conferences and such. Which, of course, means my recollection could be faulty.

The existence of micro-op fusion in the Pentium M and Core processors is perhaps circumstantial evidence that could support my assertion that P6 and Conroe's microcode are similar.

All modern dynamically scheduled x86 cores first split x86 operations into smaller, bite-sized micro-ops. But the Pentium M, Core, and Core 2 then take the extra step of fusing some of the micro-ops from the same x86 instruction back together. Why would they bother to split them and then fuse them again? There might be lots of reasons (such as to simplify register rename), but I think one of the reasons is that it was easier to implement micro-op fusion than to modify the decoder/cracker used in prior P6-like chips. That is, they decided it was easier to fuse them together rather than change the x86 decode to generate different micro-ops.

Thanks for the insight. I wonder if the split/fuse technique is to get around an architectural limitation, or if there is some performance benefit.
 
I bet whoever came up with that name was thinking of the company 'Advanced Render Technology' at the time? ART is a cool acronym for that kind of stuff. ;-)
(Not to get off-topic, but) now that you mention it, I'm not actually 100% sure whether the T in most of the team names stands for "technology" or "team" (I'd still guess team for now)... in any case I still don't know why it's "advanced" (there's no "basic rendering team" to my knowledge ;)), but I guess that makes the acronym cooler!

Whatever, we do fun stuff :)
 
Based on what's been revealed so far I tend to agree. Larrabee seems to be a many-core x86 CPU + texture units + ring bus.

I think that's pretty obvious, too. I think that a lot of people are having a heck of a time letting go of their previous notions, though...;) Heck, if it only took them, say, six months to hand-code 25 select frames in a game so that they could match current rasterization in terms of output (25 frames looking at a wall, I wonder, or watching a tree limb wave...?), then--shoot!--I bet they could probably code an entire game in a few decades or less, maybe...;)

It's for sure that this kind of effort is far, far beyond most software houses, and when I'm reading this report (which I think is a lot more interesting for what it doesn't say) and I'm thinking about the TOOLS that such development houses are going to require--well, all I can think of is that for compilers and the like we may be watching a disaster of EPIC proportions brewing--er, if you catch my drift.
 
I fully understand what the term "derived" (and its derivatives ;) :p) means.

I'm saying they are wrong. Raise your hand if you buy the statement that Larrabee is a P54c derivative. x86-64, SMT, vastly improved branch prediction, etc. are not things you just slap onto an ANCIENT x86 core. This is why I insist Larrabee has nothing to do with P54c. This *analogy* is just a way to relate the fact that it is a dual-issue, short-pipeline architecture.


At least on the MT front I think you are grossly overestimating the effort it would take to implement in such a short in-order core; it borders on almost trivial.
 
I never said adding SMT to a short in-order core was difficult; I said adding it onto P54c (or a derivative) would be difficult.
 
Hardly. If you architect a core from the ground up with these features in mind, it is simple, as they are complementary. Slapping them onto a P54c is an entirely different matter.

You do realize that almost ALL MT architectures were slapped on, don't you? It doesn't require much: for 4 threads, 2 extra internal bits of register ID, a couple of extra bits around the TLBs and memory queues, and you are done.

Most MT implementations aren't that complex really.
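To show how little that is in concrete terms, here is a toy sketch (my own illustration, not Larrabee's actual design) of four threads sharing one register file simply by concatenating a 2-bit thread ID with the architectural register number:

```cpp
// Four hardware threads sharing one register file: the thread ID is just two
// extra index bits wherever registers (or TLB/memory-queue entries) are looked up.
#include <cstdint>
#include <cstdio>

constexpr unsigned kThreads  = 4;    // 4 threads -> 2 extra ID bits
constexpr unsigned kArchRegs = 16;   // x86-64 integer architectural registers

// Physical index = {thread_id (2 bits), arch_reg (4 bits)}.
constexpr unsigned phys_index(unsigned thread_id, unsigned arch_reg) {
    return thread_id * kArchRegs + arch_reg;
}

int main() {
    // 4x the entries of a single-threaded core; the rest of the pipeline only
    // needs the thread ID carried along with each instruction.
    uint64_t regfile[kThreads * kArchRegs] = {};
    regfile[phys_index(2, 5)] = 42;   // thread 2 writes one of its registers
    std::printf("%u entries, value = %llu\n",
                kThreads * kArchRegs,
                (unsigned long long)regfile[phys_index(2, 5)]);
}
```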
 
...I'm thinking about the TOOLS that such development houses are going to require--well, all I can think of is that for compilers and the like we may be watching a disaster of EPIC proportions brewing--er, if you catch my drift.

In EPIC/Itanium, Intel moved to an architecture that was more dependent on good compilers and tools than the mainstream architecture (x86). In Larrabee, Intel has moved to an architecture that is conceptually simpler to program for than current GPU architectures. They added cache-coherent shared memory and expressive vector operations to make it easier on the software tool chain (and to reuse existing x86 tools).

Under the big assumption that Larrabee has good performance (or performance per watt or mm^2 or whatever), the flexibility of Larrabee really puts it in a different class than the EPIC fiasco.
 
Just curious, but what features of the P54 are you thinking of as being incompatible?

I just realized this isn't as far-fetched as I thought.

Apparently, my drunken mind had confused OoO execution with superscalar execution, so I got P54c confused with P6! :LOL:

I'm still a bit hesitant to accept that something so old could just have all these modern pieces bolted onto it and not be a huge cluster-****, but what do I know? (apparently, not much) :p
 
You do realize that almost ALL MT architectures were slapped on, don't you? It doesn't require much: for 4 threads, 2 extra internal bits of register ID, a couple of extra bits around the TLBs and memory queues, and you are done.

Just for completeness, Aaron left out the 4x larger register file, which usually isn't a huge deal but could require an extra pipeline stage for the register file read/write. Ironically, x86's small number of architectural registers (16) means making the register file 4x larger is less of a concern than for RISC ISAs that have 32 registers.
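(For rough scale, counting integer registers only: 4 threads x 16 = 64 entries, versus 4 x 32 = 128 entries for a 32-register RISC ISA.)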
 
I'm still a bit hesitant to accept that something so old could just have all these modern pieces bolted onto it and not be a huge cluster-****, but what do I know? (apparently, not much) :p

Oh, it may have been a huge mess, but much less of a mess than starting from scratch. Just building even a simple x86 core that is fully functional is a huge undertaking. Having something to start with---anything---is likely a big help. Once they decided Larrabee was going to be in-order, it seems natural that they would gravitate toward the most recent proven in-order x86 design: the Pentium.
 
In EPIC/Itanium, Intel moved to an architecture that was more dependent on good compilers and tools than the mainstream architecture (x86). In Larrabee, Intel has moved to an architecture that is conceptually simpler to program for than current GPU architectures. They added cache-coherent shared memory and expressive vector operations to make it easier on the software tool chain (and to reuse existing x86 tools).
Coherent caches I can see simplifying matters where GPUs require explicit memory control operations, which keeps Larrabee within the bounds of current x86 tools. (Though the binning strategy and the limited read-after-write traffic for standard rendering seem to mean it isn't heavily stressed for a lot of current cases.)

The expressive vector operations, though?

The hints about Larrabee's vector functionality and its scatter/gather memory access capability seem to indicate that it exceeds the limitations of current x86 SSE, so why expect current x86 tools to do it justice?
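For reference, this is the access pattern in question, written out as the scalar loops that a wide (say, 16-lane) gather/scatter would replace (illustrative only; Larrabee's actual vector instructions haven't been fully disclosed, and current SSE has no direct equivalent):

```cpp
// What a 16-lane gather/scatter conceptually does: one independent memory
// access per vector lane, driven by a per-lane index.
#include <cstdio>

void gather16(const float* base, const int idx[16], float out[16]) {
    for (int lane = 0; lane < 16; ++lane)
        out[lane] = base[idx[lane]];   // each lane reads from its own address
}

void scatter16(float* base, const int idx[16], const float in[16]) {
    for (int lane = 0; lane < 16; ++lane)
        base[idx[lane]] = in[lane];    // each lane writes to its own address
}

int main() {
    float table[32];
    for (int i = 0; i < 32; ++i) table[i] = float(i);
    const int idx[16] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3};
    float lanes[16];
    gather16(table, idx, lanes);             // lanes[i] == table[idx[i]]
    std::printf("lane 5 = %g\n", lanes[5]);  // prints 9
}
```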

Under the big assumption that Larrabee has good performance (or performance per watt or mm^2 or whatever), the flexibility of Larrabee really puts it in a different class than the EPIC fiasco.
That is the question.
Comparative architectural analysis would be interesting. The trouble is that Larrabee's hardware details are not yet fully disclosed, and comparisons with GPU hardware run into problems finding equivalent measures (when the data is available, and not all of it is).
 
Oh, it may have been a huge mess, but much less of a mess than starting from scratch. Just building even a simple x86 core that is fully functional is a huge undertaking. Having something to start with---anything---is likely a big help. Once they decided Larrabee was going to be in-order, it seems natural that they would gravitate toward the most recent proven in-order x86 design: the Pentium.

What you say is all true. I just think a design from the early '90s would be more at home in a museum than in tomorrow's multi-teraflop C-GPU :p
 