DX9 API specs

bdmosky said:
A little off topic from the current trend of this thread, but definitely along the lines of the title... Am I reading this right that DX9 will have an official "fix" for the refresh-rate "feature?" If so, would this imply that merely upgrading to DX9 will remove the refresh-rate lock in Win2k and XP?

Yes, I'm running on a DX9 beta and now I get 85Hz in every game that doesn't specify otherwise. In OpenGL, though, you still need a refresh-rate forcing utility for all those old and/or sloppily implemented games out there.
 
Basic,

One more thing - even with a 2x2 pipe arrangement, how do you compute DDX/DDY when a tri renders as a single pixel on one pipe? The other 3 pipes don't have any valid values in their temp registers, so you're in the same situation as when each pipe is rendering arbitrary pixels...
 
The whole way of exposing it is a bit weird IMO. The only differential values you are interested in are those of the interpolated vertex parameters (or, even more specifically, the texture coordinates). Just like the interpolants themselves, you don't really have to calculate them iteratively anyway ... with near-pixel-size polygons it wouldn't make sense to even try.
 
I realized that a rendezvous at ENDIF isn't enough. "Recursive" IFs would obviously be a problem. Luckily there's a cure for it. You need four "rendezvous registers" per 2x2 block, one for each pair of neighbouring pipes. Whenever two neighbouring pipes run in sync, encounter an IF, and go different ways, you keep track of the "IF recursion depth". I can see different ways of doing that: IF/ENDIF counting, or storing the ENDIF address in a register. The only important thing is that when you reach the corresponding ENDIF, you wait for a rendezvous.

I can also see a completely different way that only needs a rendezvous at a DDX/DDY (and if it just depends on an iterator, you won't need a rendezvous at all). It's a really simple idea. If a pipe reaches a DD*, it waits for the neighbouring pipes to get a PC that is >= its own. If all pipes run in sync, there's no stall. If you don't have any DD*, there's no stall. The only unnecessary stalls are if there's a DD* in an ELSE clause and a neighbouring pipe runs in the corresponding IF clause.
There's of course a problem with subroutines, since they would screw up the othersPC >= myPC condition. But it's easily solved. Consider the whole call stack as one large number. The return address of the outermost function is stored in the MSB of this large number, deeper calls are stored in less significant bytes, and it finishes with the PC. Using this as a "macro PC" will take care of subroutines. But I still need to think of a good way to handle DD* inside a loop.
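Here is a minimal sketch of that rendezvous rule, assuming the macro-PC encoding above (the widths, names, and fixed maximum call depth are my own illustration, not a real design, and it glosses over the same loop problem the post does):

Code:
#include <cstdint>

// Pack the call stack and current PC into one comparable number:
// the outermost return address in the most significant byte, deeper
// calls below it, unused levels padded with zero, and the current PC
// in the least significant byte.
const int MAX_DEPTH = 6;  // illustration only

uint64_t macroPC(const uint8_t returnAddrs[], int depth, uint8_t pc) {
    uint64_t m = 0;
    for (int level = 0; level < MAX_DEPTH; ++level)
        m = (m << 8) | (level < depth ? returnAddrs[level] : 0);
    return (m << 8) | pc;
}

// A pipe that reaches a DD* stalls until every neighbour in its 2x2
// block has a macro PC >= its own, i.e. has caught up or moved past.
bool canExecuteDD(uint64_t myMacroPC, const uint64_t neighbourMacroPC[3]) {
    for (int i = 0; i < 3; ++i)
        if (neighbourMacroPC[i] < myMacroPC)
            return false;  // a neighbour is still behind: stall this cycle
    return true;
}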

Anyway, even if it isn't optimal to get stalls sometimes, you'd only get them when different pipes in the same 2x2 block take different paths. And even when you get the stalls, it's still faster than the "calculate everything and select the right one" way. (With a cost in more complex hardware though.)

I agree that the PS definitions and (what we think) we know about the coming cards don't match up in a nice way.


The single-pixel problem is the same as what you get at edges. There's no really good solution, but you could just extend the surface to fill all four pipes. With the locked pipe pattern (I believe) we have now, you wouldn't lose any performance anyway. The extended pixels will of course be thrown away at the end, but will generate the needed values on the way.
It won't be exactly right, but in most cases it will be rather good. Most triangles are part of a mesh, and the texture just outside the triangle is usually mapped onto a neighbouring triangle, so the continuity will be reasonably good.
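A toy illustration of that extend-and-throw-away idea (the naming is mine): pixels the triangle doesn't cover still run the shader so their registers can feed the quad's differences, but their writes are masked at the end:

Code:
// Quad of 2x2 pixels; "helper" pixels outside the triangle are shaded
// only so that DD* has valid operands, then discarded.
struct QuadPixel {
    bool covered;   // actually inside the triangle?
    float shaded;   // shader result (valid even for helpers)
};

void writeQuad(const QuadPixel q[4], float framebuffer[4]) {
    for (int i = 0; i < 4; ++i)
        if (q[i].covered)
            framebuffer[i] = q[i].shaded;  // helpers are simply thrown away
}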


MfA:
Whatever you need differentiated interpolators for could be made dependent on a texture read instead. And then you need to differentiate temporary registers. And even if you just want to differentiate a function of an iterator, it could be far easier if you can differentiate a temp reg.
 
Only as long as there is something worth iterating.

Using the difference value just feels so icky. To use your filtering of perturbed environment maps as an example: instead of using ddx/ddy on the perturbed texture coordinates, the developer could also precalculate the gradient of the perturbing texture, store it alongside it, and combine that with the ddx/ddy of the texture coordinates to find the footprint in the final texture. You can always replace ddx/ddy by doing more or less redundant work in the pixel shader - in the case of single-pixel polygons, no redundant work at all. What practical uses does ddx/ddy have that it should be allowed to restrict the implementation of pixel pipes so much?
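To make that concrete, here is a sketch (with hypothetical names) of the combination MfA describes for a perturbed lookup q = uv + P(uv): the gradients dP/du and dP/dv are precalculated and stored with the perturbing texture, and the footprint follows from the chain rule rather than from a cross-pixel ddx:

Code:
struct float2 { float x, y; };

// dq/dx = duv/dx + J_P(uv) * duv/dx, where J_P is the precalculated
// Jacobian of the perturbation texture (dPdu, dPdv sampled alongside P).
// duv/dx comes straight from the interpolator's plane equation, so no
// communication with neighbouring pixels is needed.
float2 perturbedFootprintX(float2 duv_dx, float2 dPdu, float2 dPdv) {
    float2 r;
    r.x = duv_dx.x + dPdu.x * duv_dx.x + dPdv.x * duv_dx.y;
    r.y = duv_dx.y + dPdu.y * duv_dx.x + dPdv.y * duv_dx.y;
    return r;   // the ddy footprint is the same thing with duv/dy
}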
 
I wouldn't call it icky.
If the pipes are already running in sync on neighbouring pixels (which seems to be the case for the coming generation, possibly except the P10), then I'd call it rather elegant. It's easy to use and almost free. If they don't run in sync (due to prior dynamic branches that went in different directions), then it's probably still more efficient to wait for the trailing pipes than to do some chain-rule calcs.

I did describe one way of doing the rendezvous that wouldn't cause any stalls unless you really want to do DD* on a temp register, and it's also rather cheap in HW. The only "big" drawback is that you need to run the pipes in a 2x2 pattern, and that's a restriction I've got the feeling 3D-chip designers put upon themselves anyway, since it simplifies the design.

I would actually not be surprised if pipes stayed that way even when the pixel count of the average triangle starts to approach the number of pixel pipes - with some "mesh mending hardware" that runs in advance of the rasterization and mends strips into meshes with 2D connectivity, so that the rasterizer knows which polygon is on the other side of each edge and can fill all pipes with the correct pixels. Of course only when they have the same shader and textures, and actually share the edge, but that will probably be the most common case with small triangles.
Then we're back to efficient DD* even for single-pixel triangles.

(Such "mending hardware" could also be useful to find object silhouettes for shadow volumes.)

Btw, I am not by any means a Renderman expert. (In fact I hardly know anything.) But I have seen Renderman shaders that used Perlin noise to make a displacement map, and then called a function to get the normal of the surface. My first impression was "What? That's not possible. They can't take derivatives when they've only calculated one point." Then I realized that they could do it rather easily. I wouldn't want to explicitly calculate the derivatives, though.
 
Basic,
Just had another (I think pretty good) idea:

Each pipe processes a 2x2 block of pixels (or subpixels for 4x SSAA).

Now, each pixel pipe can issue 4 32-bit FP ops per cycle, and executes 4 pixels p1, p2, p3, p4 concurrently as follows (instructions i1, i2, ... in):

Code:
cycle    instruction
1          i1 (operating on p1.x, p2.x, p3.x, p4.x)
2          i2 (operating on p1.y, p2.y, p3.y, p4.y)
3          i3 (operating on p1.z, p2.z, p3.z, p4.z)
4          i4 (operating on p1.w, p2.w, p3.w, p4.w)

5          i5 (operating on p1.x, p2.x, p3.x, p4.x)
6          i6 (operating on p1.y, p2.y, p3.y, p4.y)
7          i7 (operating on p1.z, p2.z, p3.z, p4.z)
8          i8 (operating on p1.w, p2.w, p3.w, p4.w)

...

This means

- Since we are operating on individual components of the 4 pixels at a time, the actual underlying ISA can be scalar (not SIMD). Read/write masks and arbitrary swizzles are trivial to implement, since all that is required is compile-time register renaming. Furthermore, a SIMD op operating on 1, 2, or 3 SIMD register components will have a throughput of 4 pixels in 1, 2, or 3 cycles (in the absence of stalls).

- less i-cache bandwidth needed (smaller instructions)

- arithmetic instructions with latency of 1 to 4 cycles have their latency
hidden.

- 4 copies of the register file are required, but you only need a quarter
of the read and write b/w to each.

- DDX and DDY implementation can use the arithmetic units of a single pipe, and need only interact with the register files of a single pipe. In fact, DD[X|Y] is a simple subtraction followed by a multiply with a constant scaling factor - an fmac unit can handle this (see the sketch after this list).

- texture loads : produce 4 filtered texels at a time, can share
LOD/anisotropy calculations across the 4 pixels.

- Better texture cache efficiency (the same TMU and cache handle texture loads on addresses very close to each other); possibly some computational optimizations are possible from the 2x2 sample block organization?
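As a sketch of that DD[X|Y] point (the quad layout and scale factor are my assumptions), each result is one subtract feeding one multiply - exactly the (a - b) * s pattern an fmac unit provides:

Code:
// For a 2x2 block laid out as  p0 p1
//                              p2 p3
// DDX is the horizontal difference scaled by a constant 1/dx;
// DDY is the same thing taken vertically.
void ddxQuad(const float v[4], float invDx, float out[4]) {
    out[0] = out[1] = (v[1] - v[0]) * invDx;  // top row
    out[2] = out[3] = (v[3] - v[2]) * invDx;  // bottom row
}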


Now, each of the pipes can be made independent (processing different pixels from different triangles and/or with different shader programs; they need not operate on, or output, data in lockstep).

I need to think about how branching would work with this scheme...

Regards, Serge
 
1 more thing:
Small tris/edge pixels could automatically use SSAA - i.e. something like: edge pixels automatically get 4x SSAA. It seems like it should be possible to combine this with Z3 (storage for 4 samples).
 
Unfortunately it doesn't matter how you arrange your pipes or what clever synchronisation schemes you come up with: as soon as you have dynamic flow control, DSX/DSY can run into problems, as no pixel is guaranteed to take the same execution path as its neighbours. The only way to make them work is for the application to always make sure that, irrespective of execution path, the source argument contains meaningful values at all times across all (executed) pixels.
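One way an application can satisfy that constraint (a sketch of the coding discipline, simulating the quad; not a real shader API): compute the DSX/DSY source before any divergent branch, so every pixel holds a meaningful value whichever path it later takes:

Code:
// Simulated 2x2 quad, one array slot per pixel. The derivative source
// is computed by all four pixels *before* the divergent branch, so the
// cross-pixel differences are meaningful on every path.
void shadeQuad(const float uv[4], const bool takeBranch[4], float out[4]) {
    float src[4], grad[4];
    for (int i = 0; i < 4; ++i)
        src[i] = uv[i] * uv[i];            // every pixel executes this
    grad[0] = grad[1] = src[1] - src[0];   // DSX for the top row
    grad[2] = grad[3] = src[3] - src[2];   // DSX for the bottom row
    for (int i = 0; i < 4; ++i)            // divergence only after the DSX
        out[i] = takeBranch[i] ? grad[i] * 2.0f : grad[i];
}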

John.
 
JohnH,

True. DSX, DSY are effectively fences for the threads corresponding to different pixels. But... the logic to handle this would be internal to a single "pipe", leading to (AFAICS) an independent pipeline unit which doesn't need to talk to any other unit to do its work. This would make the unit easier to replicate across the chip.

Furthermore, for the LOD calcs, texture lookups, DSX, and DSY you are still taking advantage of the fact that multiple pixels in a well-defined
spatial arrangement are being processed.

Plus you could get fancy with the instruction issue (if one thread/pixel stalls on a texture load, or reaches a DSX instruction, only issue instructions from unstalled threads/pixels). This seems feasible with 4 threads; EV8 was to be 4-way SMT, the P4 is 2-way SMT, and those chips handle threads in a much more general context than a pixel pipe. This would further reduce stalls inside the pixel pipe.
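A toy version of that issue rule (purely illustrative): each cycle, pick the next pixel thread that isn't stalled, round-robin:

Code:
// Round-robin issue among the quad's 4 pixel threads, skipping any that
// are stalled (waiting on a texture load or parked at a DSX fence).
// Returns -1 when all four are stalled and the pipe idles this cycle.
int selectIssueThread(const bool stalled[4], int lastIssued) {
    for (int step = 1; step <= 4; ++step) {
        int t = (lastIssued + step) % 4;
        if (!stalled[t])
            return t;
    }
    return -1;
}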

As described, the approach should scale much better than simply increasing the number of interlocked pixel pipes - it has to deal with 4 diverging threads maximum, versus N. You have the potential for a unit to accomplish useful work while one pixel's thread is stalled (via SMT), and execution units aren't wasted as much when issuing scalar ops, or SIMD ops with read/write masks.

Beyond handling DSX/DSY in a nice way (at least to me ;), it localizes the problem of diverging execution paths for different threads/pixels.

--

somewhat OT, 2 more items on my shader wish list - some addressable scratch pad memory for each unit, plus support for multiple input/output streams per unit and the ability to link the output of one unit to the input of another.
 
psurge said:
JohnH,

somewhat OT, 2 more items on my shader wish list - some addressable scratch pad memory for each unit, plus support for multiple input/output streams per unit and the ability to link the output of one unit to the input of another.

That creates nice dependencies between vertices, which is something most hardware graphics engineers really 'like'. In any case it seems it is the way GPUs are going: more CPU-like, and with that come the same problems and limitations that a CPU has (for example, I wonder how they are handling the 'delay slot' in branches, or whatever equivalent technique is used).
 
Psurge,

the issue is that it doesn't matter if you process the 4 pixels required for the 2x2 kernel sequentially on one pipe or in parallel across 4 pipes, as any of those pixels can end up going down very different execution paths. This results in unpredictable input (unless specifically coded against) to any instruction that requires knowledge of adjacent pixels (e.g. think about calculating the LOD for a dependent read). Or have I misunderstood your suggestion?

Ultimately this stuff means that rates of change, LODs etc. must in some cases be calculated by the shader code, per pixel, based on the underlying plane equations of the source components or other methods...

Hey, this thread has actually turned into something interesting - a refreshing change from other recent topics...

John.
 
JohnH,

AFAICS we are in agreement. My suggestion does not change the fact that pixels inside a 2x2 block can have wildly varying execution paths.

Again, as I was saying, the implementation of DSX (DSY) would have to stall until the adjacent pixels also reach a DSX (DSY). The coder would have to ensure that if one pixel hits a DSX instruction, all adjacent pixels also eventually hit a consistent DSX instruction operating on the same register. Basic's posts above have some good ideas on how to enforce this.

What I am getting at:
Take the approach of 4 or 8 SIMD pipes, each executing the same instruction on different pixels each cycle. DSX/DSY now requires communication between pixel pipes (not desirable). You can implement branching by stalling pipes which don't take a branch and issuing instructions from the taken branch to all pipelines - this also applies to DSX/DSY. However, AFAICS the efficiency of such an approach will be terrible, which is what my suggestion is trying to address.

I hope that's clear... if not I'll try again :).


RoOoBo,

The scratch pad RAM would be used for a single stream record (vertex/pixel) at a time. I.e. it would store local parameters for a program, not global/shared values.

As far as the input/output stream buffers go, yes it creates dependencies between streams, but there already are such dependencies - vertex shader output is tied to rasterization, rasterization is tied to the pixel shader, etc...

What I am suggesting is making these dependencies programmable instead of hardwiring them. At this point you have something general enough for much more than a flexible OpenGL pipeline, without sacrificing optimizations/features specific to it (triangle rasterization, z/color buffer compression, occlusion tests, etc...)

For instance, look at this paper for a Reyes implementation on a stream processor: Reyes and OpenGL
 
I am fairly new to programming and processor architectures in general (now learning C), and would like to know why it is such a big deal to have the pixel pipelines running in sync. Why no branches? It seems to be alright to branch in the vertex pipelines.
 
psurge said:
JohnH,
RoOoBo,

The scratch pad RAM would be used for a single stream record (vertex/pixel) at a time. I.e. it would store local parameters for a program, not global/shared values.

As far as the input/output stream buffers go, yes it creates dependencies between streams, but there already are such dependencies - vertex shader output is tied to rasterization, rasterization is tied to the pixel shader, etc...

What I am suggesting is making these dependencies programmable instead of hardwiring them. At this point you have something general enough for much more than a flexible OpenGL pipeline, without sacrificing optimizations/features specific to it (triangle rasterization, z/color buffer compression, occlusion tests, etc...)

For instance, look at this paper for a Reyes implementation on a stream processor: Reyes and OpenGL

So you are talking about having something like the Stanford stream processor (Imagine). You would have a number of those processors in the GPU and each one could be connected to any of the others. It is a bit like the PS2 EE, but with more units and more configurable. The EE has the MIPS main CPU and two vector/SIMD-plus-VLIW units (a bit messy, eh? ;). VU0's input stream is always connected to the main CPU, but its output stream can go either to the GS (Graphics Synthesizer, the PS2 rasterizer) or to VU1. VU1 can execute streams from VU0 or from the CPU, and I think its output is usually directed to the GS. Each unit would execute a different task (rasterization, fragments or vertices) and shaders.

In any case, I was talking about dependences between vertices (or inputs in the same stream) if you have a writeable RAM. That would be a problem, for example, if you had two stream processors executing the same vertex shader and the original vertex stream (what was being fed from outside the GPU) was divided into two substreams, one for each of the stream processors. If the shader were writing to the RAM, there could be dependencies between the two vertex substreams. Same for pixel streams.
 
It's not so much a dependency; it just means your thread context grows larger. If a developer chose to use absolute addressing for writes by the vertex shader into shared RAM, it would be a very conscious choice (and a severe hack). The norm would be for each thread to have its own bit of memory from the heap.
 
Psurge, all clear now... The downside of sequentially processing the pixels in the 2x2 kernel in a single pipeline is that you need to duplicate all the gradient/LOD logic on each pipe. So you're really trading underlying logic area for layout complexity (aniso calcs are very expensive, so the duplication can be costly). That said, for anisotropic lighting effects you really need to be able to supply per-pixel gradients/LODs, so the duplication is going to have to happen at some point.

John.
 
RoOoBo,

Sort of, but note that Imagine issues identical instructions each cycle to all 8 of its functional unit clusters and does not allow branching inside stream kernels.

As for the scratch pad memory, it would be pretty useless as shared RAM unless you explicitly specify the records (and their order) of a stream being handled by a specific pipe. That basically makes load-balancing between pipes really hard...

Serge
 
JohnH, true, but perhaps you wouldn't necessarily have to pay in logic as much as you might think:

1. LOD/gradient calculations would be shared across all 4 pixels in the 2x2 block. This means you can make the LOD/gradient unit unpipelined, with a 4-cycle latency, and still effectively have a throughput of 1 LOD/gradient calculation per cycle.


2. Now, assume an LOD/gradient calculation G involving steps s1, s2, s3, s4 (each taking a single cycle).

In the traditional locked pixel-pipe approach, you probably want a throughput of 1 calculation per cycle - so I'm guessing the calculation would be pipelined: s1->s2->s3->s4 (throughput 1 calc per cycle, latency 4 cycles). The output would then be shared across (say) 4 pixel pipes, just as my suggestion shares it across the 4 pixels in a single pipe.

Now in the case of my suggestion, you would only need a throughput of 1 calculation every 4 cycles. If each step of the calculation is identical, or at least similar enough for the calculation to be performed as s->s->s->s,
then you need HW for only one s unit. The pipeline on the other hand would require HW for s1, s2, s3, and s4, even if the steps were very similar.

So basically, since you need less throughput per LOD/gradient unit, it might allow for a much smaller implementation of such a unit. That's just a maybe; as I said, I don't know all the calculations involved. This is definitely armchair design on my part.
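A toy model of that trade-off (step() is just a stand-in for whatever one stage of the real calculation does): the iterative unit instantiates one stage and loops it over 4 cycles, where the pipelined version would instantiate all four stages:

Code:
// One shared stage reused for 4 cycles: throughput of 1 result per
// 4 cycles, but only a single copy of the stage logic in hardware.
float step(float x) { return x * 0.5f + 1.0f; }  // placeholder stage

float iterativeUnit(float in) {
    float x = in;
    for (int cycle = 0; cycle < 4; ++cycle)
        x = step(x);  // the same hardware stage, each cycle
    return x;
}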

Regards,
Serge
 
Going from a throughput of 1 per clock to 1 every 4 clocks would make it smaller, but probably not 1/4 the size - maybe 1/2... To be honest, the only way to find out is to redesign the logic to work over 4 clocks...
Note: to get a throughput of 1 per clock per pipe, as required for full-speed per-pixel aniso effects, you'd still pretty much have to have 4x the logic.

John.
 