Larrabee at Siggraph

ShaidarHaran · Jul 26, 2008

nutball said:
Well fixed-function is what I meant by "with extensions", I wasn't really meaning SSEx in that sense.

They're not the same though. Accepted usage of these terms is as I was using them.

nutball said:
OK, but traditional GPUs are progressively moving away from fixed-function for most or all of the higher-order functionality it seems to me. Does DX11 bring stuff to the table which requires/benefits from additional fixed-function hardware not already available? The statements I've read from MS suggest not.

That's what confused me about MD1988's question, it seemed somewhat anachronistic in terms of what's fixed-function, programmable and what's hardware, software.

Here's the thing WRT the transition to ALU-only accelerated operations on GPUs (aka getting rid of the traditional FF hardware such as ROPs and TMUs) - native support for these operations will be rolled into future ALUs. It's not as though they will be translated into more generalized existing operations.

ShootMyMonkey · Jul 26, 2008

ShaidarHaran said:
Damnit Jon Stokes, why did you start this rumor?

Larrabee's individual cores have no relation to P54c, or any other (relatively) early pipelined x86 core.

Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes as people within Intel have used similar words. However, if you actually listen to the way they use it, they don't mean it literally at all, but as an analogy. It refers to the fact that they drop the self-scheduling engine that goes all the way back to the P6 architecture, so the idea of mentioning the old P54 series cores was simply to say to software developers "you can think of it that way." Part of it was also the way certain presentations went into the history and thought process overview saying things like "if we took something like an old-fashioned 486 core and built it on today's process tech, we could get xxxxx cores and with blah blah blah... etc."

I remember similar talk from AMD back when they were pondering on a many-core vision of the future server space in the wake of Niagara. Still isn't quite the same thing as saying they're putting a P54C core on a GPU.

Andrew Lauritzen · Jul 26, 2008

ShaidarHaran said:
Here's the thing WRT the transition to ALU-only accelerated operations on GPUs (aka getting rid of the traditional FF hardware such as ROPs and TMUs) - native support for these operations will be rolled into future ALUs. It's not as though they will be translated into more generalized existing operations.

Are you suggesting that texture sampling will somehow be rolled in as a "feature" to the ALUs? I don't even know what you mean by this - could you be more specific?

ShaidarHaran · Jul 26, 2008

ShootMyMonkey said:
Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes as people within Intel have used similar words. However, if you actually listen to the way they use it, they don't mean it literally at all, but as an analogy. It refers to the fact that they drop the self-scheduling engine that goes all the way back to the P6 architecture, so the idea of mentioning the old P54 series cores was simply to say to software developers "you can think of it that way." Part of it was also the way certain presentations went into the history and thought process overview saying things like "if we took something like an old-fashioned 486 core and built it on today's process tech, we could get xxxxx cores and with blah blah blah... etc."

I remember similar talk from AMD back when they were pondering on a many-core vision of the future server space in the wake of Niagara. Still isn't quite the same thing as saying they're putting a P54C core on a GPU.

Thanks for the additional info and clarification. It all makes sense now. I still dislike the analogy though. Reminds me of the old "Core 2 is nothing more than P6 on steroids" argument (not saying anyone here is saying this).

ShaidarHaran · Jul 26, 2008

Andrew Lauritzen said:
Are you suggesting that texture sampling will somehow be rolled in as a "feature" to the ALUs? I don't even know what you mean by this - could you be more specific?

I'm no E.E., so if I've misinterpreted the predicted flow of events then feel free to correct me, as I have no inside information WRT this matter.

MfA · Jul 26, 2008

Andrew Lauritzen said:
Well it's not a guarantee, but you can arrange and run your rays in such a way that you expect them to be relatively cache-coherent

You only know your ray is going to hit an object when the intersection occurs, so I don't see it. You could do raytracing down to a given level and then, if you end up with a branch with potentially visible voxels, give up on image-order rendering for a moment, instead switching to object-order rendering and splatting the voxels beneath that level on the screen before continuing with raytracing (with the splatting providing Z-buffer data which will allow trivial intersection testing for all subsequent rays hitting that cube). That's admitting defeat though, even if a lot less explicitly.

Neeyik · Jul 27, 2008

ShootMyMonkey said:
Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes as people within Intel have used similar words.

This rings remarkably similar to how the inner workings of Larrabee were explained to me via somebody who has regular meetings with Intel, but isn't an architectural engineer. I won't repeat the exact words said to me but I think my response was along the lines of "you must be f&%*ing joking?!" - fortunately things seem to have improved somewhat since then.

Andrew Lauritzen · Jul 27, 2008

ShaidarHaran said:
I'm no E.E., so if I've misinterpreted the predicted flow of events then feel free to correct me, as I have no inside information WRT this matter.

Well I'm no E.E. either but I do agree that the end of hardware TMUs will probably happen eventually... at least for the more complicated parts. 8-bit/component bilinear filtering is one case that I've been told is a hell of a lot more efficient to do in hardware, but maybe trilinear, aniso or >8bit/component filtering would make sense to move into software.

What I was asking about though is when you said "rolled into ALUs", what extras in particular do you expect to be added to the ALUs? I mean, you can already do texture filtering entirely in software if you want, and once you start to throw in features to accelerate it (which tend to be at a minimum bilinear filtering math/hardware nearer to the memory), you're back to a TMU again...

MfA said:
You only know your ray is going to hit an object when the intersection occurs, so I don't see it.

Ah, but if you trace nearby rays they will tend to follow the same path down the spatial acceleration structure, and thus will tend to be cache coherent. There are a number of improvements that you can make to the storage layout of the accelerators as well to help even further. Take a look at some of the Cell ray tracing literature for more info. The cache/local working set is explicitly managed there which makes it more obvious what's going on, but the concept is the same on a hardware-managed cache, except it's implicit.

MfA said:
You could do raytracing down to a given level and then, if you end up with a branch with potentially visible voxels, give up on image-order rendering for a moment, instead switching to object-order rendering and splatting the voxels beneath that level on the screen before continuing with raytracing (with the splatting providing Z-buffer data which will allow trivial intersection testing for all subsequent rays hitting that cube).

Right, well that's basically what an optimized rasterization engine does, except with fairly large "leaf nodes" since that keeps the number of batches down.

However, even if you ray trace down to the very finest level, coherent rays that all hit the same triangle will all run in lock step with no incoherent branching and perfectly coherent caching too. This is basically the same case as a hierarchical rasterizer, which is why I take every opportunity to point out the insane amount of similarity to people who dogmatically approach the rasterization vs ray tracing argument.
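To make the coherence argument concrete, here is a minimal sketch that records which nodes of a small bounding-volume hierarchy each ray visits. Everything in it is an assumption for illustration: a 2D scene of eight unit boxes, a hand-rolled `build`/`visited` pair, and parallel rays shot straight down. It is not taken from any real ray tracer.

```python
def hit_box(origin, direction, box):
    """Slab test for a 2D ray against an axis-aligned box (xmin, ymin, xmax, ymax)."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(2):
        lo, hi = box[axis], box[axis + 2]
        o, d = origin[axis], direction[axis]
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def build(leaves, counter):
    """Build a binary hierarchy by halving the leaf list (hypothetical, not SAH)."""
    node_id = counter[0]
    counter[0] += 1
    box = (min(b[0] for b in leaves), min(b[1] for b in leaves),
           max(b[2] for b in leaves), max(b[3] for b in leaves))
    if len(leaves) == 1:
        return {"id": node_id, "box": box, "children": []}
    mid = len(leaves) // 2
    children = [build(leaves[:mid], counter), build(leaves[mid:], counter)]
    return {"id": node_id, "box": box, "children": children}


def visited(node, origin, direction, out=None):
    """Return the set of node ids whose bounds the ray touches."""
    out = set() if out is None else out
    if hit_box(origin, direction, node["box"]):
        out.add(node["id"])
        for child in node["children"]:
            visited(child, origin, direction, out)
    return out


# Eight unit boxes side by side along x; rays point straight down from y = 5.
scene = build([(float(i), 0.0, float(i + 1), 1.0) for i in range(8)], [0])
down = (0.0, -1.0)
ray_a = visited(scene, (2.25, 5.0), down)
ray_b = visited(scene, (2.5, 5.0), down)   # neighbour of ray_a
ray_c = visited(scene, (6.5, 5.0), down)   # far away from ray_a
```

The two neighbouring rays touch exactly the same nodes, so the second one finds everything it needs already in cache, while the distant ray shares only the root with them.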

MfA · Jul 27, 2008

Andrew Lauritzen said:
Ah, but if you trace nearby rays they will tend to follow the same path down the spatial acceleration structure

It will tend to get it wrong at every depth boundary too.

ShootMyMonkey · Jul 27, 2008

Neeyik said:
This rings remarkably similar to how the inner workings of Larrabee were explained to me via somebody who has regular meetings with Intel, but isn't an architectural engineer. I won't repeat the exact words said to me but I think my response was along the lines of "you must be f&%*ing joking?!" - fortunately things seem to have improved somewhat since then.

The funny thing is that I'd heard it even from engineers, and not just the software engineers working on the libraries, either. And this was in talks specifically meant not to echo the whole "raytracing is the future" marketing spiel but actually to try and get some game developers on board for Larrabee development.

I also vaguely recall something at one point about sustained per-core IPC being in that general range as well, but even that was something like "if you adjust it on the basis of a hypothetical Pentium with this short a pipeline" and so on. Even so, the general message was to think of the cores as being "Pentium-like."

Andrew Lauritzen · Jul 27, 2008

MfA said:
It will tend to get it wrong at every depth boundary too.

Yes but that's precisely the situation that causes less efficiency in rasterization too

ShaidarHaran · Jul 27, 2008

Andrew Lauritzen said:
Well I'm no E.E. either but I do agree that the end of hardware TMUs will probably happen eventually... at least for the more complicated parts. 8-bit/component bilinear filtering is one case that I've been told is a hell of a lot more efficient to do in hardware, but maybe trilinear, aniso or >8bit/component filtering would make sense to move into software.

What I was asking about though is when you said "rolled into ALUs", what extras in particular do you expect to be added to the ALUs? I mean, you can already do texture filtering entirely in software if you want, and once you start to throw in features to accelerate it (which tend to be at a minimum bilinear filtering math/hardware nearer to the memory), you're back to a TMU again...

I assumed the functions necessary to perform texture filtering/sampling are not already present in ALUs, but I appear to have assumed incorrectly.

Andrew Lauritzen · Jul 27, 2008

ShaidarHaran said:
I assumed the functions necessary to perform texture filtering/sampling are not already present in ALUs, but I appear to have assumed incorrectly.

Nah, you can definitely implement Sample* in terms of Load... it's just a lot slower
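As a rough illustration of that point, here is a minimal sketch of a bilinear sample built from nothing but point loads. The function names `load` and `sample_bilinear`, the clamp-to-edge addressing, and the single-channel list-of-rows texture layout are all assumptions made for the example, not any particular API.

```python
import math


def load(texture, x, y):
    """Fetch one texel by integer coordinates, clamping to the texture edge."""
    height = len(texture)
    width = len(texture[0])
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return texture[y][x]


def sample_bilinear(texture, u, v):
    """Bilinear filter at normalized (u, v): four loads plus three lerps."""
    height = len(texture)
    width = len(texture[0])
    # Move to texel space, treating texel centers as sitting at +0.5.
    fx = u * width - 0.5
    fy = v * height - 0.5
    x0 = math.floor(fx)
    y0 = math.floor(fy)
    tx = fx - x0
    ty = fy - y0
    t00 = load(texture, x0, y0)
    t10 = load(texture, x0 + 1, y0)
    t01 = load(texture, x0, y0 + 1)
    t11 = load(texture, x0 + 1, y0 + 1)
    top = t00 + (t10 - t00) * tx
    bottom = t01 + (t11 - t01) * tx
    return top + (bottom - top) * ty


texture = [[0.0, 1.0],
           [2.0, 3.0]]
center = sample_bilinear(texture, 0.5, 0.5)
```

Every filtered sample costs four memory fetches and the address and weight arithmetic around them, which is the overhead a hardware texture unit hides.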

Geo · Aug 3, 2008

There's an embargo ending at 9pm Pacific tonight, btw.

And at 7am (also PT) or later tomorrow (August 4th), a paper is being released detailing a good bit more.

PT is London time -8 hrs for our Euro friends, so we should know more soon.

Rys · Aug 4, 2008

Geo said:
There's an embargo ending at 9pm Pacific tonight, btw.

The paper is what folks want to concentrate on. There's not enough in the data presented for that embargo to say enough about the architecture at this point, sadly.

kyetech · Aug 4, 2008

Rys said:
The paper is what folks want to concentrate on. There's not enough in the data presented for that embargo to say enough about the architecture at this point, sadly.

And so what will the data actually shed light on?

Geo · Aug 4, 2008

Rys said:
The paper is what folks want to concentrate on. There's not enough in the data presented for that embargo to say enough about the architecture at this point, sadly.

I suppose.

But, looking at the clock, at least I can say they are binning/tiling, rasterizing in software, that they put in fixed function texturing because if they hadn't it would have taken 12-40x (!) longer to do that function, that yes they really do say the core is built on... uhh, no, that's a different document, so I can't say that just yet... that they expect pretty much purely linear scaling on number of cores for modern games like FEAR, HL2, and GoW, and their unit hierarchy goes Core --> Thread --> Fiber --> Strand.

They've pretty much exploded the idea of a "pipeline". This should, in theory, help out from time to time. Modern GPUs, when faced with an "atypical workload", could sometimes stall because not all of the chip is really parallel. There are still a lot of potential chokepoints that are avoided on a practical basis by telling developers, "Well, Don't Do That!". Part of Intel's point in these documents is that in reality *all* workloads are "atypical" and thus nearly perfect flexibility should pay off with smoother performance profiles. Or maybe I'm inferring that rather than them stating it explicitly. It will be very interesting to see whether they can make that pay off. Maybe we can get RoOoBo to comment on that.

psurge · Aug 4, 2008

http://news.cnet.com/8301-13924_3-10005391-64.html

nAo · Aug 4, 2008

A TBDR-like software rasterizer makes a lot of sense on a multicore CPU, as it makes it easier to distribute the workload over multiple cores while keeping the working data set in L2.
Just think how inefficient software ROPs would be on an immediate-mode renderer...
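A minimal sketch of the binning step such a tiled renderer relies on, assuming conservative bounding-box binning, square tiles, and triangles given as three (x, y) screen-space points. The function name `bin_triangles` and the tile size are made up for the example and say nothing about how Larrabee's renderer actually does it.

```python
import math


def bin_triangles(triangles, width, height, tile_size):
    """Append each triangle's index to every tile its bounding box overlaps."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        x0 = max(math.floor(min(xs)), 0) // tile_size
        x1 = min(math.floor(max(xs)), width - 1) // tile_size
        y0 = max(math.floor(min(ys)), 0) // tile_size
        y1 = min(math.floor(max(ys)), height - 1) // tile_size
        # Fully off-screen triangles produce empty ranges and land in no bin.
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(10, 10), (20, 10), (10, 20)],       # fits inside the top-left tile
    [(50, 50), (100, 50), (50, 100)],     # straddles all four tiles
    [(-30, -30), (-10, -30), (-30, -10)], # entirely off-screen
]
bins = bin_triangles(triangles, width=128, height=128, tile_size=64)
```

Each bin can then be handed to a core as an independent job, and that core only touches the colour and depth data for its own tile, which is what lets the working set stay in L2.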

MfA · Aug 4, 2008

You don't really want to do direct shading with 16 pixels at a time either.
