Nvidia GT300 core: Speculation

On a more basic note, why should the order of submission of geometric primitives (at least in the fixed-function pipeline) matter?

Even in the programmable pipeline, why should the order in which primitives are submitted matter? After all, the shaders are explicitly guaranteed to execute in an orderless fashion. For stuff like OIT, I suspect the order here refers to the order of VBOs, and not to the order of primitives within a VBO.
 
Even in the programmable pipeline, why should the order in which primitives are submitted matter? After all, the shaders are explicitly guaranteed to execute in an orderless fashion. For stuff like OIT, I suspect the order here refers to the order of VBOs, and not to the order of primitives within a VBO.

Well, the order of triangles can make a difference when you are alpha blending, for example.
Granted, you can create special cases (e.g. like how PowerVR hardware was treated on the Dreamcast). E.g. for fully opaque, z-buffered triangles, the order shouldn't matter, at least not for the resulting image. Rendering the triangles front-to-back may be more efficient, so in that sense order could still be important.
Also, as long as you only have convex meshes in a single VBO, the order of triangles doesn't matter either (backfaces will always be behind frontfaces, so they are sorted implicitly).

The order of the pixels within a triangle shouldn't matter, but the order of the triangles themselves is a different matter.
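To make the blending point concrete, here's a tiny standalone sketch (plain C, made-up colour and alpha values) showing that the usual src-over-dst blend is not commutative, so drawing the same two translucent triangles in a different order produces a different pixel:

#include <stdio.h>

/* src-over-dst blending: out = src * a + dst * (1 - a) */
static float blend(float src, float src_alpha, float dst)
{
    return src * src_alpha + dst * (1.0f - src_alpha);
}

int main(void)
{
    float background = 0.0f;            /* black framebuffer       */
    float red   = 1.0f, red_a   = 0.5f; /* 50% translucent "red"   */
    float green = 0.5f, green_a = 0.5f; /* 50% translucent "green" */

    /* red triangle first, then green on top of it */
    float order1 = blend(green, green_a, blend(red, red_a, background));
    /* green triangle first, then red on top of it */
    float order2 = blend(red, red_a, blend(green, green_a, background));

    printf("%f vs %f\n", order1, order2); /* 0.500000 vs 0.625000 */
    return 0;
}

Fully opaque, z-buffered triangles dodge this because the depth test picks the same winner regardless of submission order.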
 
After looking at the R7xx assembly generated by the tools, I am coming around to the view that it may be best for the compiler to think of AMD GPUs as having float registers instead of float4 registers. I mean that instead of thinking of the registers as r0, r1, r2 etc. (like in NV GPUs), think of them as r0.x, r0.y, r0.z, r0.w, r1.x, r1.y, r1.z, r1.w etc. The huge ILP afforded by the VLIW design should mean long, deeply unrolled kernels would do very well.
It seems to me it is solely up to the hardware compiler to pack registers - and irrelevant whether Brook+ sourced IL packs or doesn't.

I am curious here: what specific takeaways would you have from Volkov's work? The one which I saw was treating the registers as a large block of memory, as they are 4x the size of shared memory.
Arithmetic intensity is the prime motivator: use as many registers as you can get away with; use shared memory as sparingly as possible due to latency/bandwidth constraints, but prefer shared memory over re-fetching from memory to the greatest extent possible.
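For what it's worth, here's a minimal CUDA sketch in that spirit (a made-up kernel, not taken from Volkov's papers): stage the reused data through shared memory exactly once, then do the arithmetic out of registers, with several independent accumulators to expose ILP instead of re-fetching from memory inside the loop.

// out[col] = sum over 64 rows of in[row][col] * row_scale[row].
// row_scale is staged through shared memory once per block; each thread
// accumulates into four independent registers (unrolled by 4) so the
// inner loop never goes back to global memory for the scale factors.
__global__ void weighted_column_sum(const float *in, const float *row_scale,
                                    float *out, int width)
{
    __shared__ float s_scale[64];
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        s_scale[i] = row_scale[i];
    __syncthreads();

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width)
        return;

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int r = 0; r < 64; r += 4) {
        acc0 += in[(r + 0) * width + col] * s_scale[r + 0];
        acc1 += in[(r + 1) * width + col] * s_scale[r + 1];
        acc2 += in[(r + 2) * width + col] * s_scale[r + 2];
        acc3 += in[(r + 3) * width + col] * s_scale[r + 3];
    }
    out[col] = (acc0 + acc1) + (acc2 + acc3);
}

The register cost per thread goes up, but occupancy can drop quite a way before the extra ILP stops paying for itself - which is pretty much Volkov's point.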

Let's hope the lack of improvements on the Brook+ side means a good OpenCL toolset delivered ASAP.
I imagine work on OpenCL will have some impact on Brook+. The latest changes to Brook+ (what I understand to be the C++ templated approach as opposed to the old C approach) are effectively a major version increment - I suspect OpenCL has something to do with that. Still, I don't know who'll be using Brook+ seriously - apart from anything else the language isn't even finished (e.g. no arrays at kernel scope - which makes "unrolling" really bloody awkward).

Jawed
 
The order of the pixels within a triangle shouldn't matter, but the order of the triangles themselves is a different matter.
NVidia has a scoreboard for pixel-triangle-fragment ordering, so it is able to run stuff out of order.

Jawed
 
Well, the order of triangles can make a difference when you are alpha blending, for example...
The order of the pixels within a triangle shouldn't matter, but the order of the triangles themselves is a different matter.
It does matter in some cases; for example with intersecting or coplanar polygons. You don't want an otherwise stationary image to flash from frame to frame.
 
OK, so primitive assembly then would be hard to parallelize. nAo, do you have any thoughts on this, btw - on how this might be done? Perhaps two different VBOs could be set up in parallel?
 
It does matter in some cases; for example with intersecting or coplanar polygons. You don't want an otherwise stationary image to flash from frame to frame.

Well, within a single VBO you shouldn't have intersecting or coplanar polygons.
They will cause problems anyway (e.g. z-fighting), and should be avoided by the artist.

But assuming you avoid intersecting and coplanar polygons within VBOs, there are cases where the triangle order within a VBO doesn't matter, only the order in which the VBOs are rendered. So with a few extra rules, you can create a simpler, more parallel environment for rendering. I'm not sure how much the overhead/complexity of the scoreboarding would explode if you were going to render things in a 'completely random' order. Perhaps it's not even an issue in the first place.

Anyway, the case you describe requires at least 2 triangles, so it's still not directly dependent on the pixel order within the triangles themselves, but rather on which triangle gets rendered first.
 
Theo in the LRB article:
Theo said:
Recently, we learned the targeted die-size for GT300 and saw that nVidia isn't changing itself, and you can expect that the next-gen DX11 part will fit somewhere between Larrabee and GTX295. Yes, with just one die. Performance wise, look for strong accent on GPGPU applications and not-so-much accent on games, with LRB performing in the range of mainstream cards…
 
What's the major reason why setup units are not parallelized? Is it because of die area, or because of (I suspect) concurrency issues?
Probably both reasons. Post-transform caches need extra read ports to feed the setup units to begin with, then concurrency issues kick in, as we still need to maintain the original rasterization order. Also, we might need multiple rasterizers as well.
 
It seems to me it is solely up to the hardware compiler to pack registers - and irrelevant whether Brook+ sourced IL packs or doesn't.

It's not true in every case. Certain compiler optimizations get in the way. As I pointed out a few years ago in the NV30 era, common subexpression elimination in high-level compilers consumes extra registers in cases where using fewer would allow a more efficient pack. This was incredibly important for the NV3x, as using more than 4 registers incurred a horrendous penalty, and MS's HLSL compiler was adept at CSE. It may not be true as much for ATI.
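A toy illustration of that register-pressure effect (plain C, hypothetical expressions rather than real shader output):

/* Before CSE: a * b is computed twice, but neither temporary has to stay
   alive across the intervening work, so its register can be reused. */
float before_cse(float a, float b, float d, float f, float g)
{
    float c = a * b + d;
    float h = c * g;          /* unrelated work in between */
    return h + (a * b + f);
}

/* After CSE: the compiler hoists t = a * b and keeps it live across the
   intervening work, so one more register stays occupied for the whole
   span - painful on NV3x once you crossed the 4-register threshold. */
float after_cse(float a, float b, float d, float f, float g)
{
    float t = a * b;
    float c = t + d;
    float h = c * g;
    return h + (t + f);
}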

Also, less work at the compiler level puts more stress on the driver compiler, where you may be under time and memory constraints on how much optimization you can do (linear scan vs. graph coloring for register allocation). I haven't checked the APIs in recent years, but did they ever add functions to allow one to ask the driver to precompile a shader and allow you to cache it, thus potentially reducing startup time?

Optimal packing and scheduling might be cost-prohibitive to do in the driver, so there still does seem to be a role for better-tuned HLSL/GLSL/OpenCL "build time" compilers that work with increased semantic information.

The issue isn't just that the shader output code may or may not be the best intermediate representation for optimizations; the issue is how much time you have to spend doing JIT optimizations, as well as higher-level compilation potentially doing transformations that interfere with what your driver would like to do, making the job harder.
 
It's not true in every case. Certain compiler optimizations get in the way. As I pointed out a few years ago in the NV30 era, common subexpression elimination in high-level compilers consumes extra registers in cases where using fewer would allow a more efficient pack. This was incredibly important for the NV3x, as using more than 4 registers incurred a horrendous penalty, and MS's HLSL compiler was adept at CSE. It may not be true as much for ATI.
As far as I can tell the ATI hardware compiler is doing CSE reasonably freely. It's helped by two hardware features: clause-temporary registers that expire at the end of an ALU clause, and the in-pipe registers.

Also, less work at the compiler level puts more stress on the driver compiler, where you may be under time and memory constraints on how much optimization you can do (linear scan vs. graph coloring for register allocation). I haven't checked the APIs in recent years, but did they ever add functions to allow one to ask the driver to precompile a shader and allow you to cache it, thus potentially reducing startup time?
A feature of OpenCL is to pre-compile if targeting a specific piece of hardware and/or to cache what's been compiled. I think it's possible to pre-compile for multiple targets. There's always the option, it seems, to JIT even if a pre-compiled object already exists.
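As a rough sketch of how that looks (standard OpenCL 1.0 host calls, single-device context assumed, error handling omitted; the two wrapper functions are just for illustration): build from source once, pull the device binary back out, cache it, and recreate the program from the binary on later runs.

#include <CL/cl.h>
#include <stdlib.h>

/* Build from source once and hand back the device binary so the caller can
   cache it (e.g. on disk, keyed by device and driver version). */
unsigned char *build_and_cache(cl_context ctx, cl_device_id dev,
                               const char *src, size_t *bin_size)
{
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);

    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size_t), bin_size, NULL);
    unsigned char *bin = (unsigned char *)malloc(*bin_size);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES,
                     sizeof(unsigned char *), &bin, NULL);
    clReleaseProgram(prog);
    return bin;
}

/* On a later run, recreate the program from the cached binary; the driver
   is still free to JIT again if the cached blob no longer matches. */
cl_program load_cached(cl_context ctx, cl_device_id dev,
                       const unsigned char *bin, size_t bin_size)
{
    return clCreateProgramWithBinary(ctx, 1, &dev, &bin_size, &bin,
                                     NULL, NULL);
}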

I think the IHVs' drivers cache GPU code during run time under D3D and OpenGL.

I've experienced seriously slow compilation times. E.g. a double-precision, heavily unrolled Brook+ kernel of ~400 lines takes about 2s to compile into IL (over 4000 lines) and about 25s to compile into assembly (about 5800 instruction cycles - another bug: I can't see beyond instruction 1635). I've seen the compiler chew through hundreds of MB of system memory, too.

Optimal packing and scheduling might be cost-prohibitive to do in the driver, so there still does seem to be a role for better-tuned HLSL/GLSL/OpenCL "build time" compilers that work with increased semantic information.

The issue isn't just that the shader output code may or may not be the best intermediate representation for optimizations; the issue is how much time you have to spend doing JIT optimizations, as well as higher-level compilation potentially doing transformations that interfere with what your driver would like to do, making the job harder.
It seems one of the things that gives AMD's driver team "grief" is that D3D's compiler does a round of optimisations and they then have to deal with that result in order to perform their own optimisations. They'd prefer not to have that intermediate optimisation.

The specific issue I'm referring to here is a table of variables that runs r0.x, r0.y, r0.z, r0.w, r1.x, r1.w ... but the code only ever refers to r0.x, r1.x ... rN.x. The hardware compiler can clearly see that r0.y, r0.z etc. are entirely unused - and isn't re-allocating. Not always. But I haven't found a pattern to this so far.

Jawed
 

If NV packs, say, at least twice the number of transistors of GT200 into a 40nm die, I don't see why the result would be bigger than the theoretical LRB die size he's mentioning. Unless of course he meant something else, it sounds like utter bullshit to me.
 
We cannot estimate the current cost of the Larrabee-based graphics card, since LRB chip features 1024-bit internal and 512-bit external crossbar bus with 1Tbps internal bandwidth.
and
The usage of a 512-bit interface will result in slightly more complex PCB [much less than GDDR3-powered GTX200 series], so that will not radically increase the complexity and the price.
:?:

Theo, Master of Non-Sequiturs (and other nonsense).
 
If you're not impressed at the specs of this chip, with its 12 P55C-based cores + Vector units, think again.
That's the part I found a bit disturbing. Only 12 cores? Or more likely Theo messed up.
 
The following is predicated on there being accurate numbers in the article:

12 cores can happen if Intel reserves four cores for redundancy out of 16.

16 cores in 600 mm2 at 45nm kind of makes me wonder about core size.

Mentally, I had been thinking Larrabee cores might be something in the realm of 18-25 mm2 (a guesstimate based on an Atom core: remove half the L2, then bulk back up with a big vector unit stated to be about 1/3 of the core, with the rest of the core area bounded by Atom's size). The error bars were pretty big because of the unknown density Intel could achieve, since Atom's density is pretty poor.

The 1.7 billion transistors in 600 mm2 would put Larrabee's overall density, at 2.8 million per mm2, between that of Atom (1.88) and Penryn (3.8).

Fun with numbers:
If 1/3 of Larrabee is not part of the cores: 16 cores in 400 mm2 is 25 mm2 per core.
If 1/2 is not part of the cores, it's 18.75 mm2 per core.

Theo's 12-core data point appears difficult to reconcile with the over-1-TFLOP claim Intel has made.
To do so with 12 cores at 32 floating-point operations per core per clock would require clocks around 2.8 GHz.
That would be a pretty impressive push of such a short pipeline.

Similarly, the 1 Tb/s ring bandwidth figure seems off to me.
The ring is 1024 bits wide.
At core clock, it would be nearly 3 Tb/s.
At 1/2 clock, which might have been hinted at by some of the Larrabee slides a while back, it's still higher than what Theo stated.
Theo's figure would depend on a 1/3 divisor.

The FLOP/mm2 measure of this chip is not record shattering, compared to mainstream GPUs.
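For reference, a quick sanity check of those figures under the assumptions above (12 usable cores, e.g. a 16-wide vector unit issuing a multiply-add, i.e. 32 FLOPs per core per clock):

12 cores x 32 FLOPs/clock x 2.8 GHz ~ 1.08 TFLOPS, so ~2.8 GHz is about what it takes to clear the "over 1 TFLOP" bar comfortably (break-even is ~2.6 GHz).
Ring: 1024 bits x 2.8 GHz ~ 2.9 Tb/s at core clock, ~1.4 Tb/s at half clock, and ~0.96 Tb/s with a 1/3 divisor - only the last lands near Theo's 1 Tb/s figure.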
 
The 1.7 billion transistors in 600 mm2 would put Larrabee's overall density, at 2.8 million per mm2, between that of Atom (1.88) and Penryn (3.8).

I think the density is to a fair extent influenced ('inflated') by the large caches on Penryn. Do these figures make sense when looking at Larrabee's caches? I think they might.

Theo's 12-core data point appears difficult to reconcile with the over-1-TFLOP claim Intel has made.

In earlier news the number of cores was generally estimated at 16-32 (possibly with different numbers of cores for low/mid/high-end parts), as far as I can recall.
But I've never seen it as low as 12. I wonder where that figure comes from.
 
I think the density is to a fair extent influenced ('inflated') by the large caches on Penryn. Do these figures make sense when looking at Larrabee's caches? I think they might.
The caches do inflate density numbers, but I focused on overall density per chip because there were so many unknown confounding factors.
 