What are Extreme Pipelines?

DemoCoder,
Short response: read my post again. :-?

...

That isn't usually successful, so I'll explain in more detail:

DemoCoder said:
demalion said:
The NV40 was 1.32 IPC at full precision, and 1.96 at partial precision, which I presume is due to extracting normalization for its special functionality. Based on that assumption: partial precision performance outside of the assumed normalization functionality, calculated by counting normalization as 1 extracted instruction instead of however many assembly instructions it takes, indicates an IPC of about 1.27.
Why would you extract free norm vs any of the other "special functions" like POW, LIT, SINCOS, etc?

Well...I targeted partial precision normalization specifically because...the NV40 is stated to have "free" partial precision normalization. It seems clear that discounting fp16 normalization as a factor is an attempt to isolate fp16 IPC apart from fp16 normalization. I labelled that data as such, and included it...that's pretty much the story.
I'm confused as to why you are asking about POW, LIT, SINCOS when I explicitly stated I was isolating partial precision normalization.

Further explanation, if that isn't clear...

If you just want to measure vector IPC, you also need to correct for the special functions of ATI's scalar ALUs as well; otherwise that doesn't seem to be a fair comparison.
Hmm? I only tried to isolate fp16 normalization specifically, as I tried to make clear by...mentioning that straight out. :-?
I don't see how the reason is mysterious at all, nor do I think the 1.96 figure and my explanation of what it seems to indicate were so obscure as to warrant making this the issue you have.

Stating what I thought was obvious: 1.96 is a higher IPC than anything else listed, and my stated presumption was that it was due to the NV40's partial precision normalization showing its benefit. I didn't find an IPC benefit surprising, or feel a need to point out that 1.96 was greater than various other numbers, as it seems obvious to me that the normalization functionality is a boon when it can be used, and that a higher IPC goes along with that idea.
I listed the IPC without adjustment first, prominently, and with an explanation of what I think primarily accounted for the jump for partial precision in comparison to full precision.
I then made sure to provide the other _pp figure only after an explanation of exactly what it discounted, to give some indication, for context, of partial precision IPC outside of this beneficial partial precision normalization feature (and accordingly annotated: "for partial precision performance outside of the assumed normalization functionality...").

Did this really need to be explained again?
...
Perhaps if someone didn't read anything but "1.27", there might be some unfairness in what that person took away from the figures. As it stands, I only see another piece of data for analysis, listed after other data, with an explanation of what each piece of data was.
Normalization takes 3 instructions. This wouldn't lower IPC, it would raise it.
...
Yes, 1.96 is indeed higher than the other IPC numbers I mentioned. :oops: Also, 1.27 is fairly close to 1.32. Things could be proposed about the NV40 if it were the topic under discussion.
However, my purpose here was limited to providing context for the R420's evolution from the R3xx.
 
Sorry, I did misread your original post, but I still don't see how you arrive at the 1.27 figure. How was that calculated from 1.96?
 
DaveBaumann said:
With R300 ATI had a regionalised quad dispatch system, such that screen space is tiled and the tiles are assigned to different pipelines; this is how the R300 can be scalable both internally and externally, in that the tiles can be allocated to quad rendering pipelines internally within a chip and externally across multiple chips. R420 adopts the same type of quad dispatch system, which is how the design was easily extended to 4 quads; however, it has been slightly altered to allow for programmable tile sizes, so the load balancing between the pipes can be controlled in a much finer way and potentially altered according to resolution. Reducing the tile size allows for higher efficiency with smaller triangles, while larger tiles favour texturing efficiency. This quad dispatch system can result in different quad pipelines operating on the same triangle if the triangle's coverage is larger than the currently set tile size; otherwise the quad rendering pipelines will be operating on completely different triangles at any one time.
This is pretty extreme functionality, and it explains something about high performance at high resolutions (where triangles are effectively bigger).
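
To illustrate the idea, here's a toy Python sketch of such a tile-to-quad-pipeline assignment. The tile size, quad count, and interleave pattern are assumptions picked purely for illustration, not ATI's actual dispatch logic:

TILE_SIZE = 16        # assumed; R420's tile size is said to be programmable
QUAD_PIPES = 4        # four quad pipelines, as in a 16-pipe part

def owning_quad_pipe(x, y, tile_size=TILE_SIZE):
    """Return which quad pipeline owns the tile containing pixel (x, y)."""
    tile_x, tile_y = x // tile_size, y // tile_size
    # Interleave tiles across the quad pipelines in a checkerboard-like
    # pattern, so neighbouring tiles land on different pipes.
    return (tile_x + 2 * tile_y) % QUAD_PIPES

# A triangle covering more than one tile straddles tiles owned by different
# quad pipelines, so several pipes can work on the same triangle at once; a
# triangle inside a single tile is handled by one pipe while the others
# process different triangles.
pixels = [(5, 5), (40, 5), (5, 40), (200, 130)]
print([owning_quad_pipe(x, y) for x, y in pixels])

Shrinking the tile size spreads small triangles across more pipes (better load balancing), while growing it keeps each pipe's texture accesses more local, which is presumably the trade-off behind making the size programmable.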
 
DemoCoder said:
Sorry, I did misread your original post, but I still don't see how you arrive at the 1.27 figure. How was that calculated from 1.96?

No problem as long as we can move on to useful stuff like exposing any errors in my calculations :p and making analysis more fruitful for whichever card.

I calculated the figures from MDolenc's fillrate numbers, based on the thoughts in the linked post: I took a fillrate-per-pipe figure, multiplied it by the instruction count, and divided by clock speed (the Megas cancelling).
For the shader in question, I got 20 as the instruction count, and for that IPC figure I discounted each dp3/rsq/mul sequence as 1 nrm instruction (I counted the "add" separately for dp3/rsq/mad sequences).
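
To make the arithmetic concrete, here's a minimal Python sketch of that calculation. The fillrate, pipe count, and clock below are made-up values chosen only so the quoted figures fall out, not MDolenc's actual measurements, and the collapsed instruction count of 13 is likewise just an assumption:

def ipc_per_pipe(fillrate_mpix_s, instructions, pipes, clock_mhz):
    # Mpixels/s per pipe, times instructions executed per pixel,
    # divided by Mcycles/s: the "Megas" cancel, leaving instructions/clock.
    return (fillrate_mpix_s / pipes) * instructions / clock_mhz

PIPES, CLOCK_MHZ = 16, 400      # assumed NV40-style configuration
fillrate = 627.2                # hypothetical measured fillrate, Mpixels/s

# Counting every assembly op, dp3/rsq/mul normalizations included:
print(ipc_per_pipe(fillrate, 20, PIPES, CLOCK_MHZ))   # -> 1.96

# Collapsing each dp3/rsq/mul sequence to a single nrm op shrinks the
# instruction count (to an assumed 13 here), and the computed IPC with it:
print(ipc_per_pipe(fillrate, 13, PIPES, CLOCK_MHZ))   # -> ~1.27

The direction of the adjustment follows from the formula: the fillrate and clock are fixed by the measurement, so counting fewer instructions per pixel can only lower the computed IPC.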
 
DaveBaumann said:
Xmas said:
Hm, I wonder why the ALU1 blocks are smaller than the ALU2 ones... maybe it's not that much different from R300 after all.

Read the review! ;)
Done :D
Unfortunately it's not very precise on what ALU1 can really do.

But from a pure looking-at-the-pipeline-configuration POV, I'd expect NV40 to have a significantly higher IPC on average. That is, if the compiler does its work properly.
 