Larrabee at Siggraph

MfA · Aug 6, 2008

nAo said:
To re-state my argument one last time: IMHO it doesn't make sense for LRB's sw rasterizer to try to determine the closest opaque fragment before shading it as most modern games already do it anyway

In an inefficient manner.

Deferred shading is deferred shading, you maintain a list of shading parameters and defer the shading.

nAo · Aug 6, 2008

MfA said:
In an inefficient manner.

inefficient from what point of view? from an IMR standpoint is not efficient, hence the argument about console development.

MfA · Aug 6, 2008

nAo said:
inefficient from what point of view?

Inefficient from the point of view of deferred shading.

Hell, even if you wanted to render in two passes ... the bloody bin has all the data it needs to perform the two passes without the rendering engine throwing the scene through the vertex shaders twice.

Andrew Lauritzen · Aug 6, 2008

MfA said:
In an inefficient manner.

I'd certainly argue that deferred shading makes even *more* sense on a tile-based renderer (and I even like it on typical GPUs as well for several reasons)... almost to the point that it's a no-brainer really IMHO, but that's an argument for another thread/day I suppose.

That said, I think the trade-off of having people write their engine/renderer in an explicitly deferred fashion makes sense, rather than trying to infer and reorganize things like pre-z passes automatically internally. Your point about them doing an explicit pre-z pass effectively uselessly doubling the amount of bin data is certainly true, but I don't know how much pipeline "magic" one would really want to do to applications that are already written that way.

Anyways lots of interesting things to come. I think the most exciting thing to keep in mind is that any of these questions can be answered/implemented/changed on a per-application, per-render-target or similar basis to get the most optimal performance. Being able to dynamically reorganize the graphics pipeline is a very powerful tool.

nAo · Aug 6, 2008

MfA said:
Inefficient from the point of view of deferred shading.

Hell, even if you wanted to render in two passes ... the bloody bin has all the data it needs to perform the two passes without the rendering engine throwing the scene through the vertex shaders twice.

While it's more inefficient this is how games are done on PC and since you can't automagically eliminate a pass from any game you'd better optimize/support your most common scenario (at first at least!)

bowman · Aug 6, 2008

http://www.gamespot.com/hardware/bl...5655/26524729/tim-sweeney-likes-larrabee.html

No surprises there, I guess.

And I see he's not giving up his new overlords that easily.. If I was religious I'd be praying it wouldn't make it into a single console just because of this guy's arrogance.

ArchitectureProfessor · Aug 6, 2008

ShaidarHaran said:
Seriously, Larrabee has as much relation to P54c as Conroe does to Katmai.

Perhaps you're underestimating the similarities between Conroe (Core 2) and Katmai (Pentium III).

I would say that the Pentium M was derived from the Pentium III, and the Core 2 was derived from the Pentium M. Sure, lots of things changed, but the core renaming style, microcode, and such stayed the same.

In contrast the Pentium IV was certainly not a derivative chip. It looks very different from the other P6-style chips.

If I have a hammer and replace the handle three times and the head two times, is it still the same hammer?

Barbarian · Aug 6, 2008

ShaidarHaran said:
I'm saying they are wrong. Raise your hand if you buy the statement that Larrabee is a P54c derivative. x86-64, SMT, vastly improved branch predication, etc are not things you just slap onto an ANCIENT x86 core. This is why I insist Larrabee has nothing to do with P54c. This *analogy* is just a way to relate the fact that it is a dual-issue, short pipeline architecture.

Trust me, they do mean derived! I know you find it ridiculous, but I don't. It makes a lot of sense. They didn't rip out or replace as many things as you think.

ArchitectureProfessor · Aug 6, 2008

Carl B said:
On the console front, I heard a couple of months ago that Intel had approached MS with regard to Larrabee usage in the next XBox

I think Microsoft has lots of reasons to be interested in Larrabee, even if they don't use it in the XBox. Some of the researchers at Microsoft Research have known about Larrabee for some time, so Larrabee certainly won't be a surprise to Microsoft. The PC is still Microsoft's cash cow, and Larrabee could certainly impact that space.

MfA · Aug 6, 2008

nAo said:
While it's more inefficient this is how games are done on PC and since you can't automagically eliminate a pass from any game you'd better optimize/support your most common scenario (at first at least!)

As I said before, they can always bribe devs.

ShaidarHaran · Aug 6, 2008

ArchitectureProfessor said:
Perhaps you're underestimating the similarities between Conroe (Core 2) and Katmai (Pentium III).

Silliness. I understand Conroe shares some design philosophy with P6, but it also shares a lot of design philosophy with Netburst. It is a hybrid of the two.

ArchitectureProfessor said:
I would say that the Pentium M was derived from the Pentium III, and the Core 2 was derived from the Pentium M. Sure, lots of things changed, but the core renaming style, microcode, and such stayed the same.

Do you have any evidence to back up your assertion that the microcode remained identical between P6 and Conroe? I find this hard to believe, given Conroe's hybrid nature.

ArchitectureProfessor said:
In contrast the Pentium IV was certainly not a derivative chip. It looks very different from the other P6-style chips.

True.

ArchitectureProfessor said:
If I have a hammer and replace the handle three times and the head two times, is it still the same hammer?

More silliness. CPUs are not hammers (no matter what AMD says). Again, Conroe has as much in common with Netburst as it does P6.

Andrew Lauritzen · Aug 6, 2008

ShaidarHaran said:
More silliness. CPUs are not hammers (no matter what AMD says).

I think that part *was* an analogy

Thorburn · Aug 6, 2008

ShaidarHaran said:
Silliness. I understand Conroe shares some design philosophy with P6, but it also shares a lot of design philosophy with Netburst. It is a hybrid of the two.

Do you have any evidence to back up your assertion that the microcode remained identical between P6 and Conroe? I find this hard to believe, given Conroe's hybrid nature.

I'd suggest that Conroe was more P6 with some Netburst features mixed in (EM64T, Branch Predictors) than a hybrid.

If I remember correctly the front-end in Netburst worked very differently to P6 or Netburst (stored chunks of commonly used instructions pre-decoded to allow it to skip several pipeline stages?) which Core microarchitecture parts definately don't do - doesn't seem too unlikely that the microcode engine wasn't carried over and expanded on from Yonah.

aaronspink · Aug 6, 2008

ShaidarHaran said:
Silliness. I understand Conroe shares some design philosophy with P6, but it also shares a lot of design philosophy with Netburst. It is a hybrid of the two.

More silliness. CPUs are not hammers (no matter what AMD says). Again, Conroe has as much in common with Netburst as it does P6.

What part of P4 does Core2 have in common?

bowman · Aug 6, 2008

aaronspink said:
What part of P4 does Core2 have in common?

Practically nothing, it's mostly a wider Pentium M with nice features like macro-fusion, micro-fusion and loop caching, none of which came from NetBurst. The long pipeline, trace cache, dual-pumped ALUs, basically the entire front end and the execution engine was ditched. The only notable thing that carried over were the ISA extensions that hadn't dripped down to Pentium M/Core Duo yet.

The Core 2 is much much more similar to the P6 than it is to NetBurst.

Thorburn · Aug 6, 2008

Intel marketing states that Core is a blend of P-M techniques and NetBurst architecture. However, Core is clearly a descendant of the Pentium Pro, or the P6 architecture. It is very hard to find anything "Pentium 4" or "NetBurst" in the Core architecture. While talking to Jack Doweck, it became clear that only the prefetching was inspired by experiences with the Pentium 4. Everything else is an evolution of "Yonah" (Core Duo), which was itself an improvement of Dothan and Banias. Those CPUs inherited the bus of the Pentium 4, but are still clearly children of the hugely successful P6 architecture.

From Anandtechs Core microarchitecture preview.

MfA · Aug 6, 2008

Andrew Lauritzen said:
Your point about them doing an explicit pre-z pass effectively uselessly doubling the amount of bin data is certainly true, but I don't know how much pipeline "magic" one would really want to do to applications that are already written that way.

Faster rendering of legacy software isn't really needed ...

A couple of extra FPS on the popular benchmark games at the time they release Larrabee is certainly worth greasing a few palms though, it's not a big change for the devs.

NetResearchMan · Aug 6, 2008

Andrew Lauritzen said:
That said, I think the trade-off of having people write their engine/renderer in an explicitly deferred fashion makes sense, rather than trying to infer and reorganize things like pre-z passes automatically internally. Your point about them doing an explicit pre-z pass effectively uselessly doubling the amount of bin data is certainly true, but I don't know how much pipeline "magic" one would really want to do to applications that are already written that way.

A pre-z pass doesn't double the bin data. The z-pass only needs a single vertex component (position), whereas the other passes will have many channels (at a minimum position, UV, normal, and tangent, sometimes additional UV channels and color channels). Eliminating those extra channels also can result in fewer total vertices because it eliminates seams in the normal and UV channels (on the order of 25% less for models in our game engine).

It's advantageous for a pre-Z pass to sort it front to back (so you get better hierarchical Z rejection), whereas the shaded passes are most efficient in shader or texture order. One advantage of doing a pre-z pass over everything is that you could potentially use the z-pass to do occlusion culling for the later passes (through either predication or a hierarchical occlusion map). That way, you would completely eliminate the later passes, saving the bandwidth of copying and transforming the extra vertex channel data for primitives that will ultimately be completely occluded.

I'm not sure how practical it would be to implement that on Larrabee, since the idea requires separate binning for the pre-Z and shaded passes. An application could add hints as to where the Z pass starts and ends to help out. Whether it would be efficient may depend on the specific scene (how much is occluded, overall triangle count). You would have to store the entire Z-buffer out to memory, and bring it back in for the shaded pass (rather than keeping it in cache), but it could eliminate many primitives from ever being drawn.

One possibility would be to do a pre-Z pass solely for the purpose of generating a hierarchical occlusion map. In such a case, you could store the resulting Z-buffer at a lower resolution than the actual frame buffer (no MSAA, plus half or quarter resolution). It would be good enough for early-rejection, and reduce the bandwidth requirements dramatically. An application hint could indicate how the pre-Z pass is to be used.

I'm just throwing some ideas out there. I think the possibilities of software rendering are awesome! Proper alpha sorting, irregular shadow mapping, high quality anti-aliasing, adaptive tesselation, the list goes on and on. I don't care if it's somewhat slower than a fixed function GPU, it will provide huge render quality advantages that can never be accomplished on current GPUs. At least not unless GPUs move towards the Larrabee approach (which they should, IMO).

Carl B · Aug 7, 2008

bowman said:
So they wouldn't just have a good GPU, they'd have incredible influence in this new, exciting CGPU programming model.. That would put the definite nail in PC gaming (and Linux gaming obviously). Intel is really good at Linux and open source driver support, they're good at platforms, they could really revitalize PC gaming with this thing. If they pawn it off to MS on a silver platter I will be very sad.

I think you're focusing on something other than I was trying to get across, though. I was just conveying that for Intel, the best-case scenario would be getting Larrabee into the XBox as right there it would ensure a certain level of captive industry support... to say nothing of the bevy of development tools Microsoft would start producing around it. And for MS' own purposes, those tools would automatically be supportive of the PC ecosystem as well. So I wouldn't view it as the death of PC gaming at all, simply the 'arrival' of Larrabee in unquestioned terms.

TimothyFarrar said:
On second thought, if the PS4 guess of Cell 32iv is 1 Tflop machine (4 PPE + 32 eSPE) paired with a somewhat next gen NVidia GPU (1-2 Tflop), a Larrabee alone powered XBox will need something on the order of 2-3 GHz clock and 32-48 cores to match the flops. POWER7 road map is a 4GHz pair of 8 core CPUs, which seems underpowered flops wise compared to the Cell 32iv and Larrabee. Toss in the crazy idea of a pair of Larrabees or Larrabee paired with an ATI GPU and then you have something really interesting (but perhaps too expensive).

I view the next-gen console efforts (assuming contemporaneous launches) as a matter of mainly die area - whatever the architectures, there will be a fixed area budget of sorts, and whoever makes it count within said budget will have an advantage. It's almost easier to hypothesize around MS and Larrabee if only because it's a (quasi)known quantity; if MS is going the POWER route again - which I'd go with as the default - then it's a real guess as to what sort of chip will emerge. Obviously there are a lot of design directions to draw from/work with.

Unrelated to the above, is there a number/sense as to Larrabee's DP performance capacity?

Andrew Lauritzen · Aug 7, 2008

NetResearchMan said:
A pre-z pass doesn't double the bin data. The z-pass only needs a single vertex component (position), whereas the other passes will have many channels (at a minimum position, UV, normal, and tangent, sometimes additional UV channels and color channels). Eliminating those extra channels also can result in fewer total vertices because it eliminates seams in the normal and UV channels (on the order of 25% less for models in our game engine).

Well as has been noted earlier in this thread, you don't necessarily need to store anything other than positions in the bins anyways (even for normal rendering). In any case, the gain from just doing an in-core/tile z-pass before shading would be much greater than any cleverness with reducing your pre-z data set before sending it to the API.

NetResearchMan said:
That way, you would completely eliminate the later passes, saving the bandwidth of copying and transforming the extra vertex channel data for primitives that will ultimately be completely occluded.

Any sort of occlusion queries that alter your draw calls are going to be by necessity at least a frame behind the GPU command buffer consumption. Thus you're not gaining anything/very much by doing a pre-z pass, since you might as well just use the "final" z buffer from the previous frame. Certainly hierarchical occlusion culling stuff is very useful, but it's just as good without a pre-z pass as far as API-level draw calls are concerned.

NetResearchMan said:
I'm not sure how practical it would be to implement that on Larrabee, since the idea requires separate binning for the pre-Z and shaded passes. An application could add hints as to where the Z pass starts and ends to help out.

I'm not entirely sure what you're getting at here, but certainly if z information was known "up front", a binning render could avoid even dumping things into bins that were guaranteed to be entirely occluded. That said, you're only going to get a (likely small) constant factor over doing some simple occlusion culling at the scene graph level yourself. Even if you want to do something fancier, there's nothing preventing you from just rendering a low-resolution image to get conservative visibility data and using that to drive rendering... if you rendered at a resolution approximately equal to the number of tiles on the screen you'd be doing about the same thing.

NetResearchMan said:
One possibility would be to do a pre-Z pass solely for the purpose of generating a hierarchical occlusion map. In such a case, you could store the resulting Z-buffer at a lower resolution than the actual frame buffer (no MSAA, plus half or quarter resolution). It would be good enough for early-rejection, and reduce the bandwidth requirements dramatically.

Indeed I was suggesting something similar above, except that I suspect you can get the majority of the benefit by just doing it in "software" and culling more coarsely (objects vs triangles for instance).

Also note that the Larrabee paper shows that vertex processing is practically free and rasterization isn't really that expensive either. Certainly these trade-offs change from game to game, but part of the reason why pre-z is such a win on current GPUs is because pixel shading costs so much more than entirely re-transforming and re-rasterizing the scene... to that end, just looping over your bin queue - or even better IMHO, just using deferred shading! - is going to be more than enough.

NetResearchMan said:
I'm just throwing some ideas out there. I think the possibilities of software rendering are awesome!

Yeah, all good ideas and I'm looking forward to seeing what people come up with when they get their hands dirty!

PS: I would be remiss if I didn't note that if you're into this sort of stuff, our group at Intel (Advanced Rendering Team) is hiring! Please feel free to fire me a PM if any of you or your smart friends are interested

. Also if anyone is going to SIGGRAPH next week and wants to chat, please also fire me a PM.

Larrabee at Siggraph

MfA

nAo

Nutella Nutellae

MfA

Andrew Lauritzen

Moderator

nAo

Nutella Nutellae

bowman

ArchitectureProfessor

Barbarian

ArchitectureProfessor

MfA

ShaidarHaran

hardware monkey

Andrew Lauritzen

Moderator

Thorburn

Moderator

aaronspink

bowman

Thorburn

Moderator

MfA

NetResearchMan

Carl B

Friends call me xbd

Andrew Lauritzen

Moderator

Similar threads