When to expect ps 3.0 hardware?

Joe DeFuria said:
Rumors at one point said that the NV40 had taped out...but then the rumors were "corrected", and that it was NV36/38 that had taped out instead.

Unless you have inside info that says otherwise, the current belief is that NV40 had not yet taped out.

I was referring to the internal timetable, rather than to the NV36, which we mistakenly believed was the NV40. That was ahead of schedule (if it was the NV40, which it isn't), whereas we're now running late.

I expect both the R420 and NV40 to tape out in a similar time frame...within a month or so of each other. I also don't expect any volume shipment of either part to occur until late 1Q '04.

Given that the ATi part is commonly believed to be shipping at least 2 or 3 months after the NV40, it seems sensible that the tapeout dates would be similar distances apart.
 
PaulS said:
That was ahead of schedule (if it was the NV40, which it isn't), whereas we're now running late.

I'm not sure what you're trying to say here.

Both ATI and nVidia had been previously rumored to want their "next-gen" parts out for Fall '03. (NV40 and R400).

For whatever reason, R400 got canned, and "Loki" is now dubbed the next-gen part. The product to replace R400 for the fall launch is the R360.

For whatever reason, NV40 is not going to make a fall release either, and is being "replaced" by a NV38 at launch. Who knows why they couldn't get NV40 up and running at this time, or how long the delay is figured.

As far as I'm concerned, both NV40 and Loki are late...past the original plan of a fall '03 launch. The question is which one is "more" delayed.

Given that the ATi part is commonly believed to be shipping at least 2 or 3 months after the NV40...

Is that really common belief? (Honest question). I'm not even sure if common rumor/belief has the NV40 at TSMC or at IBM at this point.
 
Joe DeFuria said:
PaulS said:
That was ahead of schedule (if it was the NV40, which it isn't), whereas we're now running late.

I'm not sure what you're trying to say here.

Both ATI and nVidia had been previously rumored to want their "next-gen" parts out for Fall '03. (NV40 and R400).

The original "NV40" tapeout (which later turned out to be the NV36) was late June/early July iirc. That it 1 month ahead of when the NV40 was supposed to tapeout (end of July). We're now in August, thus beyond the expected tapeout date. Sorry for the confusion :)

For whatever reason, NV40 is not going to make a fall release either, and is being "replaced" by a NV38 at launch. Who knows why they couldn't get NV40 up and running at this time, or how long the delay is figured.

I agree that availability this year is looking less and less likely (particularly with the NV38 turning up), but I certainly think we'll see something of the NV40 this year - if they stick to the timetable, it'll be November for an announcement (not shipping). As I said, however, the NV38 has thrown a bit of a spanner in the works there.
 
PaulS said:
The original "NV40" tapeout (which later turned out to be the NV36) was late June/early July iirc. That it 1 month ahead of when the NV40 was supposed to tapeout (end of July). We're now in August, thus beyond the expected tapeout date. Sorry for the confusion :)

No prob. ;)

I just wasn't aware that the NV40 was supposed to tape out at the end of July. (How was that tape-out date arrived at? Was it a specific piece of inside info, or just a guess based on the assumption that NV40 is to be "shown in some form" at Comdex?)

I agree that availability this year is looking less and less likely (particularly with the NV38 turning up), but I certainly think we'll see something of the NV40 this year - if they stick to the timetable, it'll be November for an announcement (not shipping). As I said, however, the NV38 has thrown a bit of a spanner in the works there.

Yeah, despite any hype we may hear from nVidia, ATI or their fans, when I see a company launch a new high-end product in a given month (as both ATI and nVidia are expected to in August / September), that is the best indication to me that the "next" high-end product won't ship for another 6.

Interestingly, ATI has taken the position of "no paper launches": they strive to only launch a product within 30 days of it shipping. If they hold themselves to this, we might see a considerable time difference between product "launches", though a similar time frame for shipping.
 
Who said that the NV40 was supposed to tape out in July?

If those that spread rumours cannot validate any possible insider tidbits they get as they should, then it's entirely their fault.

Apart from that, I'm a layman but it is my understanding that it is more important when an IHV has a SUCCESSFUL tape out.

The "if all goes well" scenarios apply for any IHV. If you get a dead chip back don't tell me that you send it into production hm? :oops:
 
I agree that availability this year is looking less and less likely (particularly with the NV38 turning up), but I certainly think we'll see something of the NV40 this year - if they stick to the timetable, it'll be November for an announcement (not shipping). As I said, however, the NV38 has thrown a bit of a spanner in the works there.

Personal speculation:

A. Two paper launches out of three, one IHV relaxing and announcing shortly before mass availability (more likely)

B. Three paper launches more or less at the same timeframe (less likely).

You may ask why I consider (A) more likely: publicly announced deadlines for debuts ;)
 
H1 2004 for the R420 was said by a few sources AFAIK, and also confirmed by MuFu.

What I can confirm for the NV40 right now is that nVidia is preparing developer documentation (mostly done), just like they did for CineFX. I guess it'll be public info, but then again, they might reserve it for privileged developers - that would surprise me, but I just don't want people to say my info wasn't accurate.

Also, I've got no idea whether the NV40 has taped out... And don't expect me to reveal when it will have taped out, either.
Why? Because my source(s) is/are obviously under NDAs, and while they *will* give me info, if some info is too trackable, too risky, or damaging to the company, they will NOT give it to me - or if they do, they won't let me post it (that case is rare, though).
And I know for a fact that tape-outs of high-end parts seem to be the type of thing my source(s) do not like to leak.

In relation to the NV38: I doubt the NV38 is much of an obstacle for the NV40. It's really a respin with slightly higher clocks, and maybe slightly lower (or at least equal) per-chip costs.

As for the NV40 tape-out date of July: could be we misinterpreted that too - maybe it was a conservative guess for the NV36, which taped out in May. It IS true that the NV36 has been going very smoothly, even smoother than expected, I believe. It's nVidia's record at their so-called "from-working-silicon-to-fragging-in-Quake" 'test' ;)


Uttar
 
Pavlos said:
A straightforward (and I think the optimal) solution is to shade a grid of pixels at once. The implementation of partial derivatives (dsx/dsy) becomes trivial but a bit-mask of "active pixels" must be maintained for branching. The opcodes operate or update only the active pixels. Here is a simple and unoptimized implementation of dsx using C: ...
That's more or less the approach I had in mind. Executing the shader and stopping at every dsx/dsy instruction, for a 2x2 pixel block. But I think it can have a significant performance impact because then I'm rendering more pixels. Especially for small polygons there are many pixels in the 2x2 blocks that fall outside the polygon. In the worst case I'm doing four times more work. Even if that is acceptable performance-wise, I just don't think it's an elegant solution.
And here is a reference implementation of the opcode add ...
Oh, but that's not the way my shader works. I effectively compile the ps 3.0 instructions into MMX/SSE instructions, and with an automatic register allocator I make optimal use of the available registers. With your method you would also need gigantic memory bandwidth, while I make optimal use of the cache (spatial and temporal coherency).

And I can't use it just for the dsx/dsy instructions either. Per pixel I would have to store the contents of all registers, which is more than 128 bytes. So for a 1024x768 resolution that's at least 100 MB. That's unacceptably memory-intensive.

So, I hope you see the problem is not as trivial as it first seems. Thanks anyway!
 
Nick,
I was not clear in my first post so you have misunderstood my suggestion. Of course you are right about the bad performance when you are rendering small polygons. This is also a problem for hardware architectures.

In my previous post NUM_PIXELS_X and NUM_PIXELS_Y are not the dimensions of the final image but the dimensions of the pixel block you are working on (2x2). So I think the memory and the bandwidth requirements are not an issue since you need only 4 copies of the shader state.

Let me explain a bit more. You want to shade a block of pixels.

One approach is to shade every pixel of this tile separately, one after the other. Every opcode/instruction in this approach operates on a single pixel. Probably you are thinking something similar.

The other approach (the one I was referring to) is to shade all the pixels of this 2x2 block at the same time. Every instruction of the shader then must operate on a 2x2 grid of pixels, not a single pixel. So the instructions must take as input a 2x2 matrix and output a 2x2 matrix of values (scalars, vectors, matrices and whatever d3d defines). Much like the examples I gave in my previous post.
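A minimal plain-C sketch of that idea may help (illustrative only: the `Reg` layout, `is_active` mask and the lane numbering are assumptions of this sketch, not d3d definitions). Every opcode reads and writes all four pixels of the quad, and masked-off pixels are never written:

```c
#include <assert.h>

#define QUAD 4                      /* 2x2 block = 4 pixel lanes */

/* One shader register in SoA form: each component plane holds one
   value per pixel of the quad (lanes 0,1 = top row, 2,3 = bottom). */
typedef struct {
    float x[QUAD], y[QUAD], z[QUAD], w[QUAD];
} Reg;

static int is_active[QUAD];         /* 0 = pixel outside the triangle */

/* The "add" opcode, operating on the whole quad at once. */
static void op_add(Reg *dst, const Reg *a, const Reg *b)
{
    for (int i = 0; i < QUAD; i++) {
        if (!is_active[i]) continue;
        dst->x[i] = a->x[i] + b->x[i];
        dst->y[i] = a->y[i] + b->y[i];
        dst->z[i] = a->z[i] + b->z[i];
        dst->w[i] = a->w[i] + b->w[i];
    }
}

/* With the whole quad in flight, dsx is just the difference between
   horizontally adjacent lanes (dsy would use lanes i and i+2).
   Inactive lanes still carry values, which is exactly why they get
   shaded at all. */
static void op_dsx(Reg *dst, const Reg *a)
{
    for (int i = 0; i < QUAD; i++) {
        int left = i & ~1, right = left + 1;
        dst->x[i] = a->x[right] - a->x[left];
        dst->y[i] = a->y[right] - a->y[left];
        dst->z[i] = a->z[right] - a->z[left];
        dst->w[i] = a->w[right] - a->w[left];
    }
}
```

Each C loop here stands in for one 4-wide SSE instruction per component plane; the per-lane `is_active` test is the overhead the rest of the thread argues about.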

As for the implementation, you can very easily convert the examples I gave (and the rest of the instructions) to a sequence of SSE instructions and then use Softwire to compile them at load time (are there any problems with that?).

Also note that the only way to take full advantage of the SSE instructions is to shade 4 or more pixels at once (not sequentially), since shaders usually use scalars and three component vectors* and it's sub-optimal to use SSE for operations between two operands.

As for the pixels in the group outside of the triangle, you can set the corresponding is_active[][] flag (see my previous post) to zero and the shader will never touch them.

As far as I can see, the only shortcoming of this approach (also used by Pixar and many others) is the small overhead it introduces on every instruction, to check whether each pixel is active.

Of course, if you find something else I'm very interested to hear it. I'm facing the same problems with my RenderMan renderer and the corresponding instructions, and I want the shaders to execute as fast as possible.

* I'm not sure if d3d exposes only four-component vectors, but if that's the case then it will probably change in future versions. RenderMan doesn't even define a four-component vector, and GLslang also defines vec2 and vec3 datatypes. Four-component vectors aren't very useful for shading.

The SRT rendering Toolkit
 
Pavlos said:
Also note that the only way to take full advantage of the SSE instructions is to shade 4 or more pixels at once (not sequentially), since shaders usually use scalars and three component vectors* and it's sub-optimal to use SSE for operations between two operands.
See! I'm not the only one that believes this! :)

I am pretty sure it will be at least twice as fast to use SoA.
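Some back-of-the-envelope lane accounting makes the argument concrete (a plain-C sketch; the op counts are the point, each function just models how many 4-wide SSE operations a batch of pixels needs under each layout):

```c
#include <assert.h>

/* AoS packs one pixel's components into one vector; SoA packs one
   component of four pixels into one vector. Counts below are for a
   stream of `pixels` pixels (assumed a multiple of 4 for SoA). */

/* A purely scalar operation (e.g. multiplying by a float): */
static int aos_scalar_ops(int pixels) { return pixels; }          /* 3 of 4 lanes idle */
static int soa_scalar_ops(int pixels) { return pixels / 4; }      /* all lanes full    */

/* An operation on three-component vectors (the common case): */
static int aos_vec3_ops(int pixels) { return pixels; }            /* 1 of 4 lanes idle */
static int soa_vec3_ops(int pixels) { return 3 * (pixels / 4); }  /* 3 full planes     */
```

Depending on the scalar/vec3 mix, SoA needs between a quarter and three quarters of the vector operations, which is in the same ballpark as the "at least twice as fast" guess above (memory traffic aside).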
 
And I know for a fact that tape-outs of high-end parts seem to be the type of thing my source(s) do not like to leak.

No surprise there. I don't think it's any different at other IHVs.

What I can confirm for the NV40 right now is that nVidia is preparing developer documentation (mostly done)...

I doubt others would be much behind in that department, if at all. Availability is still the key word for each one of them.

It then comes down to definitions, too. By my definition there will most likely be no PS/VS 3.0 parts on shelves before this year runs out. Frankly, if your speculations on R4xx are accurate, then I personally can't see much reason why it should come that much later than the NV40.
 
Pavlos said:
In my previous post NUM_PIXELS_X and NUM_PIXELS_Y are not the dimensions of the final image but the dimensions of the pixel block you are working on (2x2). So I think the memory and the bandwidth requirements are not an issue since you need only 4 copies of the shader state.
Oh, sorry, I know I must have misunderstood that...
The other approach (the one I was referring to) is to shade all the pixels of this 2x2 block at the same time. Every instruction of the shader then must operate on a 2x2 grid of pixels, not a single pixel. So the instructions must take as input a 2x2 matrix and output a 2x2 matrix of values (scalars, vectors, matrices and whatever d3d defines). Much like the examples I gave in my previous post.
I've got good news and bad news. ;)

The bad news is that I only have 8 SSE registers, which each have 4 floating-point components. Most shader operations take two 4D vectors as input and one as output. If you do that operation on 2x2 pixels at once, you would need 12 registers (or 8 for 3D vectors, or 8-6 when two operands are equal). The problem with that is that for every instruction I mostly need to load and store all registers. Some data can be kept in registers, but it won't be an exception that 256 bytes have to be moved from and to the cache. For a simple add instruction that's unattractive...

The good news is that the Pentium has a feature to make up for its relatively low register count, namely register renaming. Two data-independent instructions that operate on the same architectural registers can execute in parallel by using different physical registers. For example, this means that two iterations of a very tight loop can be in flight at the same time if they have no data dependency!

So my best idea was to process the pixels in a 2x2 block sequentially, but only the part before the dsx/dsy. This way very little register loading/storing between instructions is required, and physically it can still execute independently in parallel. Once we reach the dsx/dsy instruction, we store the register which we wish to differentiate. Once the 2x2 block is done, we continue with the rest of the shader, starting with the dsx/dsy.
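A rough plain-C sketch of that scheme (the function names are hypothetical stand-ins for the compiled shader fragments, not part of any real API):

```c
#include <assert.h>

typedef struct { float r; } PixelState;   /* the register dsx will read */

/* Stand-in for "everything in the shader before the dsx". */
static float shade_until_dsx(float input)
{
    return input * input;
}

/* Lanes 0,1 = top row of the 2x2 block, lanes 2,3 = bottom row. */
static void shade_quad(const float in[4], float dsx_out[4])
{
    PixelState quad[4];

    /* Phase 1: run each pixel sequentially up to the dsx, keeping
       intermediates in registers, and park only the value the
       derivative needs. */
    for (int i = 0; i < 4; i++)
        quad[i].r = shade_until_dsx(in[i]);

    /* Phase 2: once the whole quad has arrived, form the horizontal
       differences (dsy would difference lanes i and i+2). */
    for (int i = 0; i < 4; i++) {
        int left = i & ~1;
        dsx_out[i] = quad[left + 1].r - quad[left].r;
    }

    /* Phase 3 (not shown): resume the rest of the shader per pixel,
       starting from the dsx result. */
}
```

Only one register per pixel has to be spilled at the synchronisation point, instead of the whole register file, which is the attraction of deferring the derivative this way.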
As for the implementation, you can very easily convert the examples I gave (and the rest of the instructions) to a sequence of SSE instructions and then use Softwire to compile them at load time (are there any problems with that?).
SoftWire is (not meaning to give myself too much credit) extremely suited to the situation. Its built-in automatic register allocator made it possible to work with symbolic names instead of directly with registers (although that's still possible), and still have the performance of hand-written assembly. So it's very easy to redesign things without the trouble of remembering which registers hold which data. I also plan an automatic scheduler, so that in my above design dependencies can be avoided and parallel execution improved even more.
Also note that the only way to take full advantage of the SSE instructions is to shade 4 or more pixels at once (not sequentially), since shaders usually use scalars and three component vectors* and it's sub-optimal to use SSE for operations between two operands.
I've discussed this with Dio as well, but I'm not convinced it would give a big performance increase. I've even done some tests and they clearly showed that memory operations, even when it's in the cache, have a considerable latency. In the worst case using your implementation would result in three memory operations and only one arithmetic operation per pixel for a simple add instruction. In my implementation there's a much greater chance that the data is already in registers and it translates to only one arithmetic operation.
As for the pixels in the group outside of the triangle, you can set the corresponding is_active[][] flag (see my previous post) to zero and the shader will never touch them.
With my implementation that's even simpler. Every instruction can have its own control which is only needed at branch instructions (and dsx/dsy and writing results).
As far as I can see, the only shortcoming of this approach (also used by Pixar and many others) is the small overhead it introduces on every instruction, to check whether each pixel is active.
I wouldn't call it a small overhead, since it's incurred for every pixel and every shader instruction: an extra index calculation, memory lookup, compare and jump. In my implementation the shader can just keep executing for the same pixel until a new branch or dsx/dsy is reached. Although more complex to implement, that requires minimal extra overhead at shader execution time.
Of course, if you find something else I'm very interested to hear it. I'm facing the same problems with my RenderMan renderer and the corresponding instructions, and I want the shaders to execute as fast as possible.
Well, I'm probably very demanding, but I was actually asking how to do things even faster. :D My main problem is that it takes so much work to execute these dsx/dsy instructions correctly. Especially since the 2x2 blocks have so many pixels that fall outside the polygon, it seems very suboptimal. It's horrible not just for tiny polygons, but also for 'medium' or even 'large' ones. I can draw cases of polygons with 50 pixels that need 30 extra, or 200 pixels with 60 extra. I don't know what the 'average' polygon size is, but polygons are getting smaller with every generation of games, and you could easily be computing 20% too many pixels. I know some people who would kill for a performance increase like that. :devilish:
* I'm not sure if d3d exposes only four-component vectors, but if that's the case then it will probably change in future versions. RenderMan doesn't even define a four-component vector, and GLslang also defines vec2 and vec3 datatypes. Four-component vectors aren't very useful for shading.
There are also scalar SSE instructions, so on average I'm only losing 1/4 of the performance, but that's only on the arithmetic operations and is nothing compared to the extra memory instructions needed when processing pixels in parallel. Furthermore, I attempt to pack some interpolants together.

Anyway, I'm having exams now so I can't experiment with it all...

Your SRT renderer is very impressive!
 
I don't expect drivers to be up to snuff upon the card's release. Maybe 3-6 months after release to get most of the quirks worked out. Though they do have ver. 2.0 experience to fall back on. Is nVidia's ARB shader path still 50% behind their vendor-specific path in speed? Both JC and Joe (abducted) mentioned this, so I wonder if it's still true.
 
Nick said:
The bad news is that I only have 8 SSE registers, which each have 4 floating-point components. Most shader operations take two 4D vectors as input and one as output. If you do that operation on 2x2 pixels at once, you would need 12 registers (or 8 for 3D vectors, or 8-6 when two operands are equal).
This is a fallacy.... take this (typical) instruction stream
Code:
mul r0, r0, r1
mad r0, r0, r1, r2
add r1, r0, r2
can be compiled to
Code:
; initial state is all empty
; read mul operation; 2 input operands; destination is r0
; allocate xmm0-3 to result of this instruction
; look up r0. Not in register cache - so load up
movaps xmm0, [r0_r]
movaps xmm1, [r0_g]
movaps xmm2, [r0_b]
movaps xmm3, [r0_a]
; Look up r1. Not in register cache. Apply directly to intermediate
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
; End of instruction - mark that r0 is now xmm0-3
; read mad operation: three input operands, output is r0
; Look up r0 - cached
; skip reading r0, we know it is in xmm0-3
; Look up r1. Not in register cache. Apply directly to intermediate
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
; Look up r2. Not in register cache. Apply directly to intermediate
addps xmm0, [r2_r]
addps xmm1, [r2_g]
addps xmm2, [r2_b]
addps xmm3, [r2_a]
; End of instruction - mark that r0 is now xmm0-3
; read add operation: two input operands, output is r1
; Look up r0. In register cache, in xmm0-3. r0 is not overwritten by this instruction, so it is still live and must be written back. We can then use xmm0-3 as temporary registers
movaps [r0_r], xmm0
movaps [r0_g], xmm1
movaps [r0_b], xmm2
movaps [r0_a], xmm3
; Look up r2. Not in register cache. Apply directly to intermediate
addps xmm0, [r2_r]
addps xmm1, [r2_g]
addps xmm2, [r2_b]
addps xmm3, [r2_a]
; End of instruction - mark that r1 is now xmm0-3
and so on...

Why do you need 12 registers? You have to use the load-execute form of instructions to make SSE efficient. Count the cycles to execute that, and you'll find it's many times faster than doing 4 pixels in AoS. And I've still got 4 registers free - and the closest any reload of r0 can be scheduled to the writes is probably 8 instructions and at least 10 cycles.

To summarise: it is impossible to need more than 4 register loads and 4 register stores per (basic) DirectX assembly instruction. The 4 stores are used to free up to 4 registers in order to allocate a new destination operand; the 4 loads load up one of the two or more source registers.
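That bookkeeping can be modelled in a few lines of plain C (a toy sketch, not SoftWire's actual allocator: xmm0-3 act as one "bank" caching the four component planes of a single DirectX register, and every other source is folded into a load-execute form):

```c
#include <assert.h>

enum { NONE = -1 };

typedef struct { int dst, src0, src1, src2; } Instr;  /* NONE = unused */

static int cached = NONE;            /* which DX register the bank holds */
static int dirty  = 0;
static int loads  = 0, stores = 0;   /* movaps instructions emitted */

static int reads_reg(const Instr *i, int r)
{
    return i->src0 == r || i->src1 == r || i->src2 == r;
}

static void compile_instr(const Instr *i)
{
    /* If the cached register isn't a source, one source's four
       planes must be brought into the bank: at most 4 loads. */
    if (cached == NONE || !reads_reg(i, cached))
        loads += 4;
    /* If the bank is repurposed away from a live value, its four
       planes must be written back first: at most 4 stores. */
    if (cached != NONE && cached != i->dst && dirty)
        stores += 4;
    cached = i->dst;                 /* the bank now holds the result */
    dirty = 1;
}
```

Running the three-instruction stream from the example through this model emits exactly 4 plane loads (for the initial r0) and 4 plane stores (spilling r0 before the add), matching the hand-compiled listing and the 4-loads/4-stores bound.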
 
PaulS said:
...
Given that the ATi part is commonly believed to be shipping at least 2 or 3 months after the NV40, it seems sensible that the tapeout dates would be similar distances apart.

Just a gentle reminder that what is often "commonly believed" is often commonly not the case...;)

It was commonly believed for a good portion of 2002 that nv30 would ship by Xmas 2002 *at the latest*. It was commonly believed that nv30 would run rings around R300. The commonly believed, but erroneous, hardware specs for nv30 are too numerous to recount. It was commonly believed that nv35 would ship in volume in June 2003, but nVidia's recent reported comments on its 0.13 micron yield situation indicate the September time frame at the earliest for volume shipment. It was also briefly and erroneously commonly believed that IBM is currently fabbing nv40 instead of other nv3x variants. There are many more examples which I'm sure you're aware of.
 
Dio said:
This is a fallacy.... take this (typical) instruction stream
<snip>
can be compiled to
<snip>
You're right, you don't need that many registers if you use memory operands all the time...

But I did a test, again. ;) I was convinced that if you replaced all memory accesses with registers, things would run much faster. And your example lends itself perfectly to this, because the memory operands can be replaced by xmm4-xmm7. My results were shocking:

Memory operands: 0.038 s
Register operands: 0.599 s

I used an accurate timing method, one million iterations, and the whole loop written in assembly. It was run on my Celeron 1200, and a little calculation shows that the first result is within expectations. But what would make the version with register operands more than ten times slower? I've been programming assembly for quite some time, but I've never experienced behaviour like this. Does anyone have a logical explanation for it?

I don't get it. What I expected was that the register version would be at least as fast as the memory version. I just wanted to test how much of a 'speedup' it would be. I made sure that I didn't make any stupid mistakes. Does anyone have VTune to check if there are any serious penalties?

Anyway, that's when it started to get interesting. I thought maybe there's some 'register read limit', so I replaced the xmm4-xmm7 with xmm0-xmm3 to see if that made things worse. Here are the results:

Memory operands: 0.038 s
Register operands: 0.029 s

The same happens when only using xmm4-xmm7. That's more like what I expected for the first test as well (or even better, since there are fewer dependencies). A 30% performance increase could be worth it in this situation. Of course I won't argue that most algorithms can be done more efficiently with SoA, but I believe that's only possible when enough registers are available.
To summarise: it is impossible to need more than 4 register loads and 4 register stores per (basic) DirectX assembly instruction. The 4 stores are used to free up to 4 registers in order to allocate a new destination operand; the 4 loads load up one of the two or more source registers.
I would count a load-execute instruction as a register load as well. In microcode it probably looks very similar to a load followed by the arithmetic instruction, so theoretically the overhead is the same. The only advantage is that you save registers because they are 'internal'. This all assumes, of course, that the above results are anomalies...

Take for example this section from your code:
Code:
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
That's eight memory loads for only four vectors. This can be four memory loads by loading them into xmm4-xmm7 and reusing them. I tested that as well and the results were 14 milliseconds for your version and 13 for mine. A small difference, but it was consistent. And it gets worse if registers are reused more than once, and with 50+ shader instructions that's obviously the case. In my implementation there are only memory accesses for the first read, the spills (which are few) and the last write. In your example the reuse is maybe two on average, and you have to write extra optimization passes to detect and make use of it.

But you've convinced me that SoA could have great potential. Thanks a lot for that clear example! It's certainly worth it to try it out and I'll do it as soon as I finish my exams...
 
Nick said:
My results were shocking:

Memory operands: 0.038 s
Register operands: 0.599 s

I used an accurate timing method, one million iterations, and the whole loop written in assembly. It was run on my Celeron 1200, and a little calculation shows that the first result is within expectations. But what would make the version with register operands more than ten times slower? I've been programming assembly for quite some time, but I've never experienced behaviour like this. Does anyone have a logical explanation for it?
My immediate guess would be that the register-operand version has some kind of long dependency chain that doesn't exist in the memory-operand version.

I would count a load-execution instruction as a register load as well. In microcode it probably looks very similar to a load followed by the arithmetic instruction so theoretically the overhead is the same.
Yes, it is. But the loads are free, because they are not on the dependent path and so can be hoisted during the waits, and as you say the register is an 'internal' one that therefore eases your register pressure.

It's reasonably safe with Pentium 4 SSE code to assume it is looking 10-15 instructions ahead - so if the latency between the 'current' instruction and that later one is less than 10 (the load is only 6) then it's free.

That's eight memory loads for only four vectors. This can be four memory loads by loading them into xmm4-xmm7 and reusing them. I tested that as well and the results were 14 milliseconds for your version and 13 for mine. A small difference, but it was consistent.
There might be a small difference - I'd have expected about 5%, you measured it a little larger than that. If you know you have free registers you can improve your register alloc to optimise this out. (Or maybe do it as a post-processing pass?). Of course, assembly code can be twitchy to +- 10% under some circumstances, so it might just be noise.

And it gets worse if registers are reused more than once
Does it? :) I'm not sure it would... of course, it might.

In your example the reuse is maybe two on average, and you have to write extra optimization passes to detect and make use of it.
Yep, it's pretty limited if the code's working on 4D vectors, although both myself and Pavlos think that 4D vectors are rare in pixel shader code - I generally see a mix of 3D vectors and scalars. It's not much better for 3D vectors, of course!

But you've convinced me that SoA could have great potential. Thanks a lot for that clear example! It's certainly worth it to try it out and I'll do it as soon as I finish my exams...
At last! :) But it's my fault - I should have posted this code example before. It took a similar example to knock me out of AoS into SoA, and the instant I implemented it I was a complete convert.
 
Nick said:
Take for example this section from your code:
Code:
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
mulps xmm0, [r1_r]
mulps xmm1, [r1_g]
mulps xmm2, [r1_b]
mulps xmm3, [r1_a]
That's eight memory loads for only four vectors. This can be four memory loads by loading them into xmm4-xmm7 and reusing them. I tested that as well and the results were 14 milliseconds for your version and 13 for mine.
There are always better ways to do things. For example, this could be a much faster way:
Code:
movaps xmm4, [r1_r]
(repeat for gba)
mulps xmm4, xmm4
(repeat)
mulps xmm0, xmm4
(repeat)
because then you have shortened the main dependency chain by moving work onto a short side chain (the short chain in the original case being just a load; here it becomes a load plus a squaring operation). You'd need a long chain of code on the xmm0 chain to show the gain, of course, and it would be more valuable in AoS code than SoA, because you tend to have less interleaved code in AoS.

When you get to writing optimisers, dependency chain analysis and shortening can get quite valuable when you have architectures like SSE...
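To make that last point concrete, here's a small plain-C sketch of dependency-chain analysis (the latencies below are made-up round numbers, not real Pentium 4 figures): each node is an instruction with up to two inputs, listed in program (topological) order, and the critical path is the longest latency chain through the DAG.

```c
#include <assert.h>

#define MAXI 16
enum { NONE = -1 };

typedef struct { int in0, in1, latency; } Node;

/* Length in cycles of the longest dependency chain.
   Instructions are assumed to be topologically ordered. */
static int critical_path(const Node *n, int count)
{
    int finish[MAXI];
    int best = 0;
    for (int i = 0; i < count; i++) {
        int start = 0;   /* earliest cycle all inputs are ready */
        if (n[i].in0 != NONE && finish[n[i].in0] > start)
            start = finish[n[i].in0];
        if (n[i].in1 != NONE && finish[n[i].in1] > start)
            start = finish[n[i].in1];
        finish[i] = start + n[i].latency;
        if (finish[i] > best)
            best = finish[i];
    }
    return best;
}
```

With, say, 10 cycles of prior work on xmm0, 3 for a load and 6 for a multiply, the original two dependent mulps give a 22-cycle chain, while the restructured version comes to 16, because the load-and-square side chain overlaps the prior work.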
 