DX9 API specs

On the other hand, I thought I read that CineFX had 2 address registers in the VS, and 64 temporaries in the PS...

Kristof, are these VS/PS3.0 specs final do you think? I was thinking about it, and doesn't having dynamic flow control in the pixel shader units pretty much require an instruction cache per pipeline (since with K pipes you potentially need to fetch K instructions from different addresses in the i-cache)? With a max instruction count of 1024 this sounds really expensive, assuming Basic's estimated 64-bit+ instruction encoding is what's used.

Apart from that the VS/PS3.0 units look very similar - both have "texture samplers". Also since pixel programs which start at the same time will not necessarily end at the same time, does this mean that a set of K=NxM pipelines will stall until the "longest" pixel program finishes before starting on new pixels? Perhaps PS/VS 3.0 units can be made identical in hardware implementation, allowing for load-balancing of these units...

Anyway, very interesting stuff.
 
psurge said:
Kristof, are these VS/PS3.0 specs final do you think? I was thinking about it, and doesn't having dynamic flow control in the pixel shader units pretty much require an instruction cache per pipeline (since with K pipes you potentially need to fetch K instructions from different addresses in the i-cache)? With a max instruction count of 1024 this sounds really expensive, assuming Basic's estimated 64-bit+ instruction encoding is what's used.

Final... hmm, I would say nothing is final until it's completely released; things often change at the last minute.

You ask some interesting questions... structures that have inherently been designed as SIMD now suddenly have to support branching and conditionals, things that SIMD was not designed for - heck, it completely breaks the SIMD concept :)

Decoupling pipelines is also not trivial since then different pipes might work on completely different areas of the polygon, maybe even different polygons - currently all hardware has its pipelines locked to working on the same "2x2 block" of pixels and whatever inefficiency (e.g. triangle edges) this gives is accepted.
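A toy model may help picture how SIMD pipes "support" branching without really branching. This is an illustrative sketch only (the instruction names, quad layout and mask mechanism are assumptions, not any real chip's design): all four pipes of a 2x2 block step through every instruction in lockstep, and an IF merely masks off the pipes whose predicate is false.

```python
# Toy model of a SIMD quad: four pipes in lockstep over a 2x2 pixel block.
# Branching is emulated with a per-pipe execution mask (predication):
# every pipe steps through every instruction, but masked-off pipes
# simply discard their results.

def run_quad(program, quad_inputs):
    """Execute `program` in lockstep over 4 pixels, masking on IF/ENDIF."""
    regs = [dict(inp) for inp in quad_inputs]   # per-pipe register files
    mask = [True] * 4                           # per-pipe execute flag
    saved = []                                  # mask stack for nesting
    for op, *args in program:
        if op == "IF":                          # args[0]: boolean register
            saved.append(mask[:])
            mask = [m and regs[i][args[0]] for i, m in enumerate(mask)]
        elif op == "ENDIF":
            mask = saved.pop()                  # all pipes reconverge here
        elif op == "MOV":                       # args: dst register, value
            for i in range(4):
                if mask[i]:                     # masked pipes sit idle
                    regs[i][args[0]] = args[1]
    return regs

# Two pixels take the branch, two don't - but all four pipes step
# through the IF body together; the SIMD block never truly diverges.
pixels = [{"p": True}, {"p": False}, {"p": True}, {"p": False}]
out = run_quad([("IF", "p"), ("MOV", "r0", 1.0), ("ENDIF",)], pixels)
```

Note that the masked-off pipes burn the cycles anyway, which is exactly the "breaks the SIMD concept" cost being discussed.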

K-
 
The DSX/DSY instructions won't be fun to implement unless you run 2x2 pixel pipelines in SIMD.

[Edit]
And a lot of other things too of course. But those are special since they likely share data between the pipes.
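To make the cross-pipe data sharing concrete: here is a minimal sketch of why DSX/DSY want a 2x2 SIMD quad. The derivative of a register is approximated as a finite difference between neighbouring pixels, which only exists if those neighbours run in lockstep. The quad index layout (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right) is an assumption for illustration, not a documented hardware layout.

```python
# Hedged sketch: DSX/DSY as cross-pipe finite differences in a 2x2 quad.
# Assumed layout: 0=top-left, 1=top-right, 2=bottom-left, 3=bottom-right.

def dsx(quad_vals):
    """Horizontal finite difference, broadcast to the whole quad."""
    d = quad_vals[1] - quad_vals[0]   # right minus left (top row)
    return [d, d, d, d]               # same delta sent to every pipe

def dsy(quad_vals):
    """Vertical finite difference, broadcast to the whole quad."""
    d = quad_vals[2] - quad_vals[0]   # bottom minus top (left column)
    return [d, d, d, d]

# e.g. a register holding u-texcoords that step by 0.05 per pixel in x:
delta = dsx([0.10, 0.15, 0.10, 0.15])   # every pipe sees du/dx of ~0.05
```

Broadcasting one delta to all four pipes is what makes this cheap in a SIMD block, and impossible if the four pipes are working on unrelated pixels.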
 
Basic,

I'm unclear on what the rate-of-change instructions apply to - presumably color, z, alpha, texcoords (any of the pixel shader inputs)?

If so, then they get their data from the interpolators in the rasterization stage, right?

OK, now imagine the situation where the pipes are completely decoupled: they can operate on arbitrary pixels of arbitrary triangles with only one caveat - they must be running the same pixel program. So each pipe has its own i-cache, and no data is shared between processing elements (apart from a read-only scratch memory with space for however many constants).

Now, I don't know this for sure, but I would guess that most of the optimizations (besides sharing texture data) made possible by rendering NxM blocks come from the fact that you need to generate values for lots of parameters which vary linearly across a tri, and this can be done efficiently (transistor-count / speed wise) in parallel for an NxM block.


Slightly edited quote from a previous post of mine:

The rasterizer still outputs NxM blocks of pixels - for each pixel covered by the triangle, do culling, Z testing and stencil testing. If the pixel passes, store the pixel and its associated shader inputs in a FIFO buffer. Beef up the rasterization stage so that it can output more than one NxM pixel block per cycle. So long as different tris share the same pixel program, AFAICS it doesn't matter which tri these blocks come from - just insert the pixel program inputs into the buffer. Every time a pipeline finishes a pixel, it grabs a new one from the buffer.

So - put the DSX, DSY results for all the required inputs into this buffer.
You can still use the standard optimizations for the computation of all the usual things (z, color, texcoords, texture LOD) since you are generating them NxM at a time.
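The scheduling idea above can be sketched as a toy simulation (purely illustrative - the work-item fields, costs and pipe count are made up): the rasterizer pushes per-pixel work items into a FIFO, and each pipe pulls a new pixel the moment it finishes its current one, regardless of which triangle or block that pixel came from, so no pipe waits for the "longest" program in its block.

```python
# Toy model of decoupled pixel pipes fed from a rasterizer FIFO.
from collections import deque

fifo = deque()
for tri in range(3):                       # rasterizer side: 3 tiny triangles
    for px in range(2):                    # each covering 2 pixels
        # `cost` = made-up per-pixel shader length in cycles
        fifo.append({"tri": tri, "px": px, "cost": 1 + (tri + px) % 3})

pipes = [0, 0]                             # cycles-remaining per pipe
work_done = [[], []]
while fifo or any(pipes):
    for i in range(len(pipes)):
        if pipes[i] == 0 and fifo:         # idle pipe grabs the next pixel
            item = fifo.popleft()
            pipes[i] = item["cost"]
            work_done[i].append((item["tri"], item["px"]))
        elif pipes[i] > 0:
            pipes[i] -= 1                  # keep executing current pixel
```

Every pixel gets processed exactly once, and pipes stay busy across triangle boundaries - which is the whole point of the decoupling, at the price of the FIFO bandwidth mentioned below.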

It sounds good to me (maybe because I'm not a hardware engineer), since it doesn't require extra logic, just more cache (the duplication of i-cache, and the pixel input FIFO).

What does sound problematic is that this FIFO would need a fairly insane store/fetch bandwidth.

Anyway, these are just armchair ideas...
Serge

[edit] This approach also accommodates super-fast z/stencil-only fill: since the PS units are decoupled from the rasterizer, those types of fill operations never go to the PS units at all. They are handled by fixed-function hardware right after rasterization, in NxM blocks (where NxM is presumably already larger than the number of pixel pipes due to multi-sample AA requirements).
 
I got the impression that DDX/DDY can be applied to any register. If it's an iterator, then you could get the delta value from the iterator. But if it's a temporary register, then you're out of luck. So you need to differentiate against neighbouring pixels, which would work fine with a 2x2 configuration. The same calculated difference value is then sent to all the pipes. This is kind of (or maybe exactly) like the mipmap selection in GF, which is done per 2x2 block.

This will of course give strange results if there are discontinuities in the register values, but that's the programmer's problem. If he decides to calc the derivative of a discontinuous function, then he deserves it.
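The per-2x2-block mipmap selection analogy can be sketched roughly like this (a simplified model, not GF's actual circuit - real hardware also considers the other diagonal derivatives and anisotropy). Texcoord deltas across the quad give the texel footprint, and log2 of the footprint picks one mip level shared by all four pipes; the quad layout (0 = top-left, 1 = top-right, 2 = bottom-left) is again an assumption.

```python
import math

# Hedged sketch of per-quad mip selection from finite-difference
# texcoord derivatives. One LOD is computed for the whole 2x2 block,
# so discontinuous texcoords give "strange" but well-defined results.

def mip_level(u_quad, v_quad, tex_size):
    """LOD for a 2x2 quad; quad layout assumed 0=TL, 1=TR, 2=BL, 3=BR."""
    dudx = (u_quad[1] - u_quad[0]) * tex_size   # horizontal texel step
    dvdy = (v_quad[2] - v_quad[0]) * tex_size   # vertical texel step
    rho = max(abs(dudx), abs(dvdy), 1e-12)      # footprint in texels
    return max(0.0, math.log2(rho))             # shared by all 4 pipes

# A quad stepping 2 texels per pixel in x selects roughly mip level 1:
lod = mip_level([0.0, 2/256, 0.0, 2/256], [0.0, 0.0, 1/256, 1/256], 256)
```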

But it wouldn't work with your trick. If your trick did general placement of blocks that are 2x2 each, then it would be possible. But you'd still have the problems with irregular memory access.

I agree that at some point you'd have to do something about the problem of more parallel pixel pipes vs smaller triangles. But right now I'm too tired to say if I think your idea is feasible. :) And even if I weren't, I still wouldn't know how much optimization is done right now, so it's hard to say how much of it you're losing.
 
Hi there,
Kristof said:
What's the difference between Dynamic Flow Control and Dynamic Branching ?
LOL

that was evil. ;)

Regarding those leaked files: they appear to come from the DX9 Beta 2 release.

ta,
-Sascha.rb
 
Oh come on guys! Give me a break! :D

They're the same thing, no difference, a man can make a mistake now, can he? :D
 
alexsok said:
Oh come on guys! Give me a break! :D

They're the same thing, no difference, a man can make a mistake now, can he? :D

Just making sure ;) You could have meant a direct "case" structure rather than massively nested IF/ELSEIF structures.

K-
 
Just making sure ;) You could have meant a direct "case" structure rather than massively nested IF/ELSEIF structures.

K-

Nah, I meant it in general terms, where you guys were right that there is no difference.

To tell you the truth, I'm not that smart when it comes to such stuff, so mistakes are a common thing for me in this regard! :D
 
Basic,

(for the sake of argument) With data-dependent flow control in the PS how can you be sure that a register of a neighbouring pipe has valid data stored in it? Also, even if you are doing 2x2 blocks, how do you get ddx's for the 2 right-most pixels, and ddy's for the two bottom pixels in the block?

[edit] never-mind, re-read your post and noticed you are sending the same ddx ddy values to each pipe.

Regarding irregular memory access (to the frame/z buffer): I don't see that this approach would have that much of an effect, since you are processing multiple NxM blocks from the rasterization stage simultaneously, and these will still have good spatial locality. The biggest benefit in my view is for long pixel programs (which have comparatively infrequent output to main memory) - for a PRMan scene most tris are going to be 1 pixel or less, giving 25% utilization of the PS units in the blocked case. Or maybe I'm wrong, and geometry will be decimated to ~1 pixel resolution and then rendered with (at least) 2x2 supersampling?

As for the texture access... I just don't know enough to even guess at how it would be affected.

Regards,
Serge
 
Even if you have dynamic flow, it's still quite possible that all pipes execute the same instruction. When different paths are taken in different pipes, you get stalls for the pipes that don't do a certain instruction. But if the flow converges (after an IF...ELSE...ENDIF) then the pipes could be in sync again. So a DDX/DDY would be OK once the flow has converged. If someone tries to do a DDX/DDY on a temp register inside an IF, then it's his own fault, because it's like differentiating over a discontinuity. Doing a DDX/DDY on an iterator should work anywhere, since all pipes should have the correct value even if some pipes aren't running the particular branch you're in.

So either don't do DDX/DDY in those cases, or make sure that the code handles the discontinuities.
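A rough cost model makes the stall argument concrete (purely illustrative accounting, not real hardware): with SIMD pipes, an IF/ELSE costs the union of the paths that any pipe takes, since pipes not on the current path just stall, whereas independent pipes would only pay their own path.

```python
# Toy cost model: SIMD block cost vs. ideal per-pipe cost for an IF/ELSE.

def simd_cost(then_len, else_len, pipes_taking_then, n_pipes=4):
    """(SIMD cycles, ideal average cycles) for a divergent IF/ELSE."""
    took_then = pipes_taking_then > 0
    took_else = pipes_taking_then < n_pipes
    # SIMD: execute every path that at least one pipe needs
    simd = (then_len if took_then else 0) + (else_len if took_else else 0)
    # Ideal: each pipe pays only for its own path
    ideal = (pipes_taking_then * then_len
             + (n_pipes - pipes_taking_then) * else_len) / n_pipes
    return simd, ideal

print(simd_cost(10, 6, 4))   # all pipes agree: no penalty -> (10, 10.0)
print(simd_cost(10, 6, 2))   # pipes split 2/2: both paths -> (16, 8.0)
```

After the ENDIF both models agree again - which is why DDX/DDY is safe once the flow has reconverged.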

[Edit]
You'll get irregular memory access to textures as well. That might be worse than the frame-/z-buffer. But it's a big unknown factor, since it's no longer obvious how to do the caching. With fixed-function pipes the hardware could know what to precache, but with general dependent reads it's a lot harder. Then again, maybe you already need a cache system generic enough that it can handle pipes executing pixels far apart.


Humus:
Yes, and I hate it. Do I have to learn DX now? :( ;)
 
Personally, after looking at the CineFX info out there, I'd be rather surprised if actual dynamic flow control was ever implemented in the pixel shader. It just seems like it's much less taxing on performance for the hardware to execute all possible branches, and choose the correct result at the end (This is what the NV30 does...).

Given infinite shader length, any possible branch could be done with this sort of architecture. The only real limitation would be loops. That is, all possible situations will be computed, so you cannot realistically have loops whose termination condition is dynamic (i.e. based on data calculated within the shader program).

An example of something you couldn't do:

Execute the same program on a set of data over and over again until the data changed by some set tiny amount (i.e. iterating until the series converges).

Instead, here's what you'd need to do:

Do some experimentation/calculation to figure out how many iterations it takes to reach some maximum error value. Always iterate that number of times.

This may make things very hard if the series has no guaranteed maximum error after a set number of iterations, and is entirely data-dependent. Such a case would basically require the programmer to make a compromise.

Anyway, given the potential speed hit from allowing dynamic branching, I think the loss of dynamic-length loops is a small price to pay.
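The trade-off above can be sketched in a few lines (an illustrative model only - the shader would express this with conditional-assignment instructions, not Python): "execute all branches and select at the end" replaces the jump with a select, and a data-dependent convergence loop is replaced by an iteration count fixed offline.

```python
# Hedged sketch: branch-free select, and a fixed-trip-count loop in
# place of an "iterate until converged" loop.

def select(cond, a, b):
    """Branch-free conditional: both a and b were already computed."""
    return a if cond else b   # in hardware: a compare/select, not a jump

def fixed_sqrt(x, iterations=8):
    """Newton's method for sqrt with a precomputed iteration count,
    instead of looping until the estimate stops changing."""
    guess = x if x > 0 else 1.0
    for _ in range(iterations):          # always the same trip count
        guess = 0.5 * (guess + x / guess)
    return guess

# The 8 iterations stand in for a count chosen offline so the worst-case
# error is acceptable; a data-dependent "until converged" loop would
# need real dynamic branching.
```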
 
Basic,

If you enforce flow convergence by stalling the pipes not executing a branch target, isn't this equivalent to predicated execution of the branch-target instructions? It seems complicated to do this with looping and subroutine calls, as does subtracting two values coming from different register files... On the other hand, I can't figure out a way to compute DDX/DDY of a temp register without doing what you suggest (any ideas?).

Regards,
Serge
 
A little off topic of the current trend of this thread, but definitely going along the lines of the title... Am I reading this right that DX 9 will have an official "fix" for the refresh rate "feature?" If so, would this imply that just merely upgrading to DX 9 will remove the refresh rate lock in Win2k and XP?
 
Serge:

Yes it is, and that's the reason I thought of it that way. The only chip I've seen DDX/DDY directly connected to is NV30, and they seem to do "dynamic" flow through conditional assignment.
Btw, did you notice that you can't use the predicates inside an IF? Wonder why? :) (I hope I remembered that right, I don't have the docs here.)

But even if you have a more "real" dynamic flow, you could make sure that DDX/DDY instructions are executed in parallel. Let the pipes run independently inside IFs/LOOPs, but do a rendezvous after ENDIF/ENDLOOP (or just before the DDX/DDY, but then you must add some logic to only do the rendezvous for the pipes that eventually will get to that point, still doable though).

I don't see the problem with subroutines, what exactly did you mean?
 
Basic,

Not a problem... all I was saying is that a per-pipe predicate isn't enough to handle it any more (at least I can't see how to do it). I think the instruction dispatch "unit" would have to keep track of which pipes are stalled (send them NOPs, send the running pipes the current instruction), and when to unstall them...

What you're saying is to force a rendezvous at every endif, endloop and call instruction? That would work. Still, I don't like it :) - there is enough execution-resource waste already. For conditional execution in the SIMD model, something like the conditional-stream approach used in Imagine seems much nicer, but that makes things more complicated too...

Serge

[edit] I found the part of the docs you mentioned:

Looks like they are using the predicate register to keep the pipes in sync.

The more I think about it, the more I think you're right (I bet they keep the vertex pipes in sync just as you describe). If they do it for vertex pipes... why not pixel pipes? The whole thing makes me wonder if NV30 or some chip from another manufacturer is actually VS/PS3.0...

I think it's just a tad strange that MS would add a PS specification which, it seems, no one will implement any time soon in DX9.

Serge
 