Larrabee at Siggraph

Well, fixed-function is what I meant by "with extensions"; I wasn't really referring to SSEx in that sense.

They're not the same, though. The accepted usage of these terms is the way I was using them.

OK, but it seems to me that traditional GPUs are progressively moving away from fixed-function hardware for most or all of the higher-order functionality. Does DX11 bring anything to the table that requires or benefits from additional fixed-function hardware not already available? The statements I've read from MS suggest not.

That's what confused me about MD1988's question; it seemed somewhat anachronistic in terms of what's fixed-function vs. programmable and what's hardware vs. software.

Here's the thing WRT the transition to ALU-only accelerated operations on GPUs (aka getting rid of the traditional FF hardware such as ROPs and TMUs) - native support for these operations will be rolled into future ALUs. It's not as though they will be translated into more generalized existing operations.
 
Dammit, Jon Stokes, why did you start this rumor?

Larrabee's individual cores have no relation to P54c, or any other (relatively) early pipelined x86 core.
Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes, as people within Intel have used similar words. However, if you actually listen to the way they use it, they don't mean it literally at all, but as an analogy. It refers to the fact that they drop the self-scheduling engine that goes all the way back to the P6 architecture, so the idea of mentioning the old P54 series cores was simply to tell software developers "you can think of it that way." Part of it was also the way certain presentations went into the history and thought-process overview, saying things like "if we took something like an old-fashioned 486 core and built it on today's process tech, we could get xxxxx cores and with blah blah blah... etc."

I remember similar talk from AMD back when they were pondering a many-core vision of the future server space in the wake of Niagara. It still isn't quite the same thing as saying they're putting a P54C core on a GPU.
 
Here's the thing WRT the transition to ALU-only accelerated operations on GPUs (aka getting rid of the traditional FF hardware such as ROPs and TMUs) - native support for these operations will be rolled into future ALUs. It's not as though they will be translated into more generalized existing operations.
Are you suggesting that texture sampling will somehow be rolled in as a "feature" to the ALUs? I don't even know what you mean by this - could you be more specific?
 
Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes, as people within Intel have used similar words. However, if you actually listen to the way they use it, they don't mean it literally at all, but as an analogy. It refers to the fact that they drop the self-scheduling engine that goes all the way back to the P6 architecture, so the idea of mentioning the old P54 series cores was simply to tell software developers "you can think of it that way." Part of it was also the way certain presentations went into the history and thought-process overview, saying things like "if we took something like an old-fashioned 486 core and built it on today's process tech, we could get xxxxx cores and with blah blah blah... etc."

I remember similar talk from AMD back when they were pondering a many-core vision of the future server space in the wake of Niagara. It still isn't quite the same thing as saying they're putting a P54C core on a GPU.

Thanks for the additional info and clarification. It all makes sense now. I still dislike the analogy though. Reminds me of the old "Core 2 is nothing more than P6 on steroids" argument (not saying anyone here is saying this).
 
Are you suggesting that texture sampling will somehow be rolled in as a "feature" to the ALUs? I don't even know what you mean by this - could you be more specific?

I'm no E.E., so if I've misinterpreted the predicted flow of events then feel free to correct me, as I have no inside information WRT this matter.
 
Well, it's not a guarantee, but you can arrange and run your rays in such a way that you expect them to behave in a relatively cache-coherent manner
You only know your ray is going to hit an object when the intersection occurs, so I don't see it. You could do raytracing down to a given level and then, if you end up with a branch containing potentially visible voxels, give up on image-order rendering for a moment: switch to object-order rendering and splat the voxels beneath that level onto the screen before continuing with raytracing (with the splatting providing Z-buffer data that gives all subsequent rays hitting that cube trivial intersection testing). That's admitting defeat though, even if a lot less explicitly.
 
Based on some talks I'd been to, I'm not so sure it is entirely Jon Stokes as people within Intel have used similar words.
This rings remarkably similar to how the inner workings of Larrabee were explained to me by somebody who has regular meetings with Intel but isn't an architectural engineer. I won't repeat the exact words said to me, but I think my response was along the lines of "you must be f&%*ing joking?!" - fortunately, things seem to have improved somewhat since then.
 
I'm no E.E., so if I've misinterpreted the predicted flow of events then feel free to correct me, as I have no inside information WRT this matter.
Well I'm no E.E. either but I do agree that the end of hardware TMUs will probably happen eventually... at least for the more complicated parts. 8-bit/component bilinear filtering is one case that I've been told is a hell of a lot more efficient to do in hardware, but maybe trilinear, aniso or >8bit/component filtering would make sense to move into software.

What I was asking about though is when you said "rolled into ALUs", what extras in particular do you expect to be added to the ALUs? I mean, you can already do texture filtering entirely in software if you want, and once you start to throw in features to accelerate it (which tend to be at a minimum bilinear filtering math/hardware nearer to the memory), you're back to a TMU again...
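Just to make "texture filtering entirely in software" concrete, here's roughly what a single bilinear lookup looks like as plain ALU code. This is only a C++ sketch with made-up names and layout, and it ignores wrap modes, mips, sRGB and the like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical texture: RGBA8, row-major.
struct Texture {
    int w = 0, h = 0;
    std::vector<uint8_t> rgba;   // 4 bytes per texel
};

// Fetch one channel of one texel with clamp-to-edge addressing.
static uint8_t texel(const Texture& t, int x, int y, int c) {
    x = std::clamp(x, 0, t.w - 1);
    y = std::clamp(y, 0, t.h - 1);
    return t.rgba[(size_t(y) * t.w + x) * 4 + c];
}

// One software bilinear sample at normalized (u, v): four texel fetches plus a
// handful of multiply-adds per channel -- nothing a generic ALU can't do, it's
// just a lot of instructions compared to a single TMU request.
void bilinearSample(const Texture& t, float u, float v, uint8_t out[4]) {
    float x = u * t.w - 0.5f, y = v * t.h - 0.5f;
    int   x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;
    for (int c = 0; c < 4; ++c) {
        float top = texel(t, x0, y0,     c) * (1 - fx) + texel(t, x0 + 1, y0,     c) * fx;
        float bot = texel(t, x0, y0 + 1, c) * (1 - fx) + texel(t, x0 + 1, y0 + 1, c) * fx;
        out[c] = (uint8_t)(top * (1 - fy) + bot * fy + 0.5f);
    }
}
```

Per pixel that's four dependent texel fetches and a pile of multiply-adds per channel, which is why the cheap 8-bit case is exactly the one people expect to stay in hardware the longest.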

You only know your ray is going to hit an object when the intersection occurs, so I don't see it.
Ah, but if you trace nearby rays they will tend to follow the same path down the spatial acceleration structure, and thus will tend to be cache coherent. There are a number of improvements that you can make to the storage layout of the acceleration structures as well to help even further. Take a look at some of the Cell ray tracing literature for more info. The cache/local working set is explicitly managed there, which makes it more obvious what's going on, but the concept is the same on a hardware-managed cache, except it's implicit.

You could do raytracing down to a given level and then, if you end up with a branch containing potentially visible voxels, give up on image-order rendering for a moment: switch to object-order rendering and splat the voxels beneath that level onto the screen before continuing with raytracing (with the splatting providing Z-buffer data that gives all subsequent rays hitting that cube trivial intersection testing).
Right, well that's basically what an optimized rasterization engine does, except with fairly large "leaf nodes" since that keeps the number of batches down.

However even if you ray trace down to the very finest level, coherent rays that all hit the same triangle will all run in lock step with no incoherent branching and perfectly coherent caching too. This is basically the same case as a hierarchical rasterizer, which is why I take every opportunity to point out the insane amount of similarity to people who dogmatically approach the rasterization vs ray tracing argument ;)
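For the curious, here's a rough C++ sketch of that packet idea. It's not Larrabee's (or anyone's) actual tracer -- the BVH layout and names are invented -- but it shows why coherence pays: nearby rays share one traversal, so they fetch the same nodes (cache hits) and take the same branches (no divergence).

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Ray    { float ox, oy, oz, dx, dy, dz, tmax; int hit = -1; };
struct Sphere { float x, y, z, r; };

struct Node {                    // one BVH node over a set of spheres
    float bmin[3], bmax[3];
    int left = -1, right = -1;   // internal node: child indices
    int first = 0, count = 0;    // leaf: range of sphere indices
};

// Slab test: does the ray touch this node's bounding box?
static bool hitBox(const Ray& r, const Node& n) {
    float t0 = 0.0f, t1 = r.tmax;
    const float o[3] = { r.ox, r.oy, r.oz }, d[3] = { r.dx, r.dy, r.dz };
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / d[a];
        float tn = (n.bmin[a] - o[a]) * inv, tf = (n.bmax[a] - o[a]) * inv;
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
        if (t0 > t1) return false;
    }
    return true;
}

// Nearest-hit test against one sphere (assumes a normalized ray direction).
static void hitSphere(Ray& r, const Sphere& s, int idx) {
    float ox = r.ox - s.x, oy = r.oy - s.y, oz = r.oz - s.z;
    float b = ox * r.dx + oy * r.dy + oz * r.dz;
    float c = ox * ox + oy * oy + oz * oz - s.r * s.r;
    float disc = b * b - c;
    if (disc < 0.0f) return;
    float t = -b - std::sqrt(disc);
    if (t > 0.0f && t < r.tmax) { r.tmax = t; r.hit = idx; }
}

// Trace a whole packet with one shared traversal: the packet descends into a
// node if *any* ray touches it, so coherent rays fetch each node once (it
// stays hot in cache) and all follow the same branch decisions.
void tracePacket(std::vector<Ray>& packet,
                 const std::vector<Node>& nodes,
                 const std::vector<Sphere>& spheres) {
    std::vector<int> stack = { 0 };          // start at the root, node 0
    while (!stack.empty()) {
        const Node& n = nodes[stack.back()];
        stack.pop_back();
        bool anyHit = false;
        for (const Ray& r : packet) anyHit = anyHit || hitBox(r, n);
        if (!anyHit) continue;               // whole packet misses: skip the subtree
        if (n.count > 0) {                   // leaf: test its primitives
            for (Ray& r : packet)
                for (int i = 0; i < n.count; ++i)
                    hitSphere(r, spheres[n.first + i], n.first + i);
        } else {                             // internal: visit both children
            stack.push_back(n.left);
            stack.push_back(n.right);
        }
    }
}
```

On real SIMD hardware the per-ray loops become vector lanes, and incoherent packets are exactly where the scheme starts to degrade.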
 
This rings remarkably similar to how the inner workings of Larrabee were explained to me by somebody who has regular meetings with Intel but isn't an architectural engineer. I won't repeat the exact words said to me, but I think my response was along the lines of "you must be f&%*ing joking?!" - fortunately, things seem to have improved somewhat since then.
The funny thing is that I'd heard it even from engineers, and not just the software engineers working on the libraries, either. And this was in talks specifically meant not to echo the whole "raytracing is the future" marketing spiel, but actually to try to get some game developers on board for Larrabee development.

I also vaguely recall something at one point about sustained per-core IPC being in that general range as well, but even that was something like "if you adjust it on the basis of a hypothetical Pentium with this short a pipeline" and so on. Even so, the general message was to think of the cores as being "Pentium-like."
 
Well I'm no E.E. either but I do agree that the end of hardware TMUs will probably happen eventually... at least for the more complicated parts. 8-bit/component bilinear filtering is one case that I've been told is a hell of a lot more efficient to do in hardware, but maybe trilinear, aniso or >8bit/component filtering would make sense to move into software.

What I was asking about though is when you said "rolled into ALUs", what extras in particular do you expect to be added to the ALUs? I mean, you can already do texture filtering entirely in software if you want, and once you start to throw in features to accelerate it (which tend to be at a minimum bilinear filtering math/hardware nearer to the memory), you're back to a TMU again...

I assumed the functions necessary to perform texture filtering/sampling are not already present in ALUs, but I appear to have assumed incorrectly.
 
There's an embargo ending at 9pm Pacific tonight, btw.

And at 7am (also PT) or later tomorrow (August 4th), a paper is being released detailing a good bit more.

PT is London time -8 hrs for our Euro friends, so we should know more soon.
 
There's an embargo ending at 9pm Pacific tonight, btw.
The paper is what folks will want to concentrate on. There isn't enough in the data covered by that embargo to say much about the architecture at this point, sadly.
 
The paper is what folks will want to concentrate on. There isn't enough in the data covered by that embargo to say much about the architecture at this point, sadly.

And so what will the data actually shed light on?
 
The paper is what folks will want to concentrate on. There isn't enough in the data covered by that embargo to say much about the architecture at this point, sadly.

I suppose.

But, looking at the clock, at least I can say that they are binning/tiling and rasterizing in software; that they put in fixed-function texturing because, if they hadn't, it would have taken 12-40x (!) longer to do that function; that yes, they really do say the core is built on. . .uhh, no, that's a different document, so I can't say that just yet. . . :p ; that they expect pretty much purely linear scaling with the number of cores for modern games like FEAR, HL2, and GoW; and that their unit hierarchy goes Core --> Thread --> Fiber --> Strand.
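If it helps to picture that hierarchy, here's one way to sketch it as plain data structures. This is purely my own illustration with made-up sizes: strands as SIMD lanes, fibers as software-switched bundles of strands used to cover latency, and threads as the hardware contexts a core switches between.

```cpp
#include <vector>

constexpr int STRANDS_PER_FIBER = 16;   // e.g. one 16-wide vector's worth of lanes

struct Strand { float reg; };           // per-lane state (stand-in for real registers)

struct Fiber {                          // a software-scheduled context: the runtime
    Strand lanes[STRANDS_PER_FIBER];    // switches fibers to cover long-latency ops
    int    pc = 0;                      // where this fiber resumes
};

struct HwThread {                       // a hardware context the core can switch to on a stall
    std::vector<Fiber> fibers;
    int current = 0;
};

struct Core {                           // several hardware threads share one core's ALUs
    std::vector<HwThread> threads;
};
```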

They've pretty much exploded the idea of a "pipeline". This should, in theory, help out from time to time. Modern GPUs, when faced with an "atypical workload", can sometimes stall because not all of the chip is really parallel; there are still a lot of potential chokepoints that are avoided in practice by telling developers, "Well, Don't Do That!" Part of Intel's point in these documents is that in reality *all* workloads are "atypical", and thus nearly perfect flexibility should pay off with smoother performance profiles. Or maybe I'm inferring that rather than them stating it explicitly. It will be very interesting to see whether they can make that pay off. Maybe we can get RoOoBo to comment on that. ;)
 
A TBDR-like software rasterizer makes a lot of sense on a multicore CPU, as it makes it easier to distribute the workload over multiple cores while keeping the working data set in L2.
Just think how inefficient software ROPs would be on an immediate-mode renderer...
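To make the tiling point concrete, here's a bare-bones C++ sketch of the idea, with invented names and no clipping, depth or shading. Triangles get binned to screen tiles up front, then each tile is rasterized against its own small pixel block, so a core's working set is one tile rather than the whole framebuffer, and tiles can be handed out to cores in parallel.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int TILE = 64;                             // 64x64 pixels per tile (made-up size)

struct Tri  { float x[3], y[3]; uint32_t color; };   // screen-space triangle, flat color
struct Tile { std::vector<int> tris; uint32_t color[TILE * TILE]; };

// Phase 1 (binning): drop each triangle's index into every tile its screen
// bounding box overlaps.
void binTriangles(const std::vector<Tri>& tris, std::vector<Tile>& tiles,
                  int tilesX, int tilesY) {
    for (int i = 0; i < (int)tris.size(); ++i) {
        const Tri& t = tris[i];
        float minx = std::min({ t.x[0], t.x[1], t.x[2] });
        float maxx = std::max({ t.x[0], t.x[1], t.x[2] });
        float miny = std::min({ t.y[0], t.y[1], t.y[2] });
        float maxy = std::max({ t.y[0], t.y[1], t.y[2] });
        int tx0 = std::max(0, (int)(minx / TILE)), tx1 = std::min(tilesX - 1, (int)(maxx / TILE));
        int ty0 = std::max(0, (int)(miny / TILE)), ty1 = std::min(tilesY - 1, (int)(maxy / TILE));
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                tiles[ty * tilesX + tx].tris.push_back(i);
    }
}

// Signed area of (a, b, p): which side of edge ab the point p falls on.
static float edgeFn(float ax, float ay, float bx, float by, float px, float py) {
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

// Phase 2 (per-tile rasterization): each tile touches only its own bin and its
// own pixel block, so different cores can process different tiles with no
// shared state.
void rasterizeTile(Tile& tile, int tileX, int tileY, const std::vector<Tri>& tris) {
    std::fill(tile.color, tile.color + TILE * TILE, 0u);
    float ox = float(tileX * TILE), oy = float(tileY * TILE);
    for (int idx : tile.tris) {
        const Tri& t = tris[idx];
        for (int y = 0; y < TILE; ++y)
            for (int x = 0; x < TILE; ++x) {
                float px = ox + x + 0.5f, py = oy + y + 0.5f;
                float e0 = edgeFn(t.x[0], t.y[0], t.x[1], t.y[1], px, py);
                float e1 = edgeFn(t.x[1], t.y[1], t.x[2], t.y[2], px, py);
                float e2 = edgeFn(t.x[2], t.y[2], t.x[0], t.y[0], px, py);
                bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                              (e0 <= 0 && e1 <= 0 && e2 <= 0);   // accept either winding
                if (inside) tile.color[y * TILE + x] = t.color;
            }
    }
}
```

Without the binning phase any triangle could touch any pixel, which is exactly why software ROPs on an immediate-mode renderer would thrash the cache.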
 