#176 | |
|
Senior Member
|
Quote:
That is, if this really is a matter of the units being present but not being fed the data they need in time.
__________________
English is not my native tongue. Before flaming, please consider the possibility that I did not mean to say what you might have read from my posts.
Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
#177 |
|
Regular
|
Since R300 at least, ATI GPUs have had dedicated interpolators. Prolly all of them.
Since not all attributes are for texture coordinates, it wouldn't make sense to do interpolation in the TUs. Jawed |
#178 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,105
|
The idea that the last 2 SIMDs are "bonus" SIMDs is also supported by the discrepancy in the L1 and L2 bandwidths stated by AMD.
480 GiB/sec L1 and 384 GiB/sec L2
480/10 = 48
2*48 = 96
and
480 - 384 = 96
If the extra SIMDs had not been laid down, the L1 and L2 bandwidths would have been matched.
Is there any data on the size of L2 transfers? The numbers seem to indicate each L2 cache quadrant can transfer 128 bytes/cycle. If it's one transfer per section, that means 4 SIMDs can be fed 128 bytes a cycle. If the sections are dual ported, it's 8 SIMDs that can be fed 64 bytes a cycle. Either way, the last two SIMDs are a minor source of asymmetry.
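The arithmetic in this post can be checked with a short sketch. It assumes a 750 MHz core clock (not stated in the post) and treats the quoted bandwidths as decimal GB/s. The 10 SIMDs and 4 L2 quadrants come from the post itself. The function and constant names are made up for illustration.

```python
def bytes_per_clock(bandwidth_gb_s, clock_mhz):
    """Convert an aggregate bandwidth in GB/s to bytes moved per core clock."""
    return bandwidth_gb_s * 1e9 / (clock_mhz * 1e6)


CLOCK_MHZ = 750     # assumption: core clock, not given in the post
L1_GB_S = 480       # aggregate L1 bandwidth quoted in the post
L2_GB_S = 384       # aggregate L2 bandwidth quoted in the post
SIMDS = 10
L2_QUADRANTS = 4

per_simd_gb_s = L1_GB_S / SIMDS        # share of L1 bandwidth per SIMD
surplus_gb_s = L1_GB_S - L2_GB_S       # L1 bandwidth in excess of L2

l1_bytes_per_simd_clock = bytes_per_clock(L1_GB_S, CLOCK_MHZ) / SIMDS
l2_bytes_per_quadrant_clock = bytes_per_clock(L2_GB_S, CLOCK_MHZ) / L2_QUADRANTS

if __name__ == "__main__":
    print("L1 per SIMD (GB/s):", per_simd_gb_s)
    print("L1 surplus over L2 (GB/s):", surplus_gb_s)
    print("L1 bytes per SIMD per clock:", l1_bytes_per_simd_clock)
    print("L2 bytes per quadrant per clock:", l2_bytes_per_quadrant_clock)
```

Under those assumptions the surplus equals exactly two SIMDs' worth of L1 bandwidth, which is the basis for the "bonus SIMDs" reading.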
__________________
Dreaming of a .065 micron etch-a-sketch.
Last edited by 3dilettante; 21-Jul-2008 at 15:07. Reason: switched from \ to /
#179 | |
|
hardware monkey
Join Date: Mar 2007
Posts: 3,904
|
Quote:
I'm done with this discussion. If you don't want to admit ATi fixed their lack of texture capability with RV770 - ok. What's so bad about acknowledging you've addressed your weaknesses anyway? Most people would consider that a GOOD thing. |
#180 |
|
Member
Join Date: Nov 2006
Location: San Antonio, TX
Posts: 931
|
So it would be a combination of two things: the TUs are not attached to the interpolators but scale evenly with them anyway, and two extra SIMDs were added, which is why the interpolators stayed at 32?
__________________
ATi Beta Tester AMD Panelist Member |
#181 | ||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
The only problem with this logic is that some texture operations aren't ever affected by AF, like shadow map samples and post-processing. Given the prevalence of these in recent games, you may be right that no-AF/AF doesn't help us much.
I suppose you can still look at the difference between NVidia and ATI, though, to see if AF is ever the reason that ATI's chips become less competitive than NVidia's. I remember this being the case with Crysis with R600 at one point, but it may have changed with drivers.
Quote:
#182 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
First of all, each batch takes 16 clocks to be serviced by the TMU quad in the SIMD, so it's 400 clocks of latency hiding with 10-register shaders.
Second of all, even if you are latency limited due to 20+ regs per shader, it doesn't change the fact that clustered dependent tex ops are no different than evenly spread out tex ops followed by math dependent on each result. Statistically, you'll only be idling either the TMU or the ALU, not both, so the overall ratio is all that matters. You'd need a pretty pathological shader for even 10 batches to be insufficient for this to be the case.
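A rough sketch of where the 400-clock figure can come from. The 16 clocks per batch is from the post. The budget of 256 vec4 registers per pixel slot is an inference, not a vendor figure: it is chosen because it reproduces both the 25 batches implied here (400 / 16) and the 32 batches quoted elsewhere in the thread for an 8-register shader. Function names are hypothetical.

```python
REG_BUDGET = 256          # assumption: vec4 registers available per pixel slot
CLOCKS_PER_BATCH = 16     # from the post: TMU quad service time per batch


def batches_in_flight(regs_per_pixel, reg_budget=REG_BUDGET):
    """How many batches fit when each pixel needs `regs_per_pixel` registers."""
    return reg_budget // regs_per_pixel


def latency_hidden(batches, clocks_per_batch=CLOCKS_PER_BATCH):
    """Clocks of fetch latency covered while the other batches are serviced."""
    return batches * clocks_per_batch


if __name__ == "__main__":
    for regs in (8, 10, 20):
        n = batches_in_flight(regs)
        print(regs, "regs ->", n, "batches ->", latency_hidden(n), "clocks hidden")
```

The point of the sketch is only that register pressure sets the batch count, and the batch count times the per-batch service time sets how much latency can be covered.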
#183 | |
|
Member
Join Date: May 2002
Location: Santa Clara
Posts: 584
|
Quote:
#184 | |
|
Regular
|
Quote:
Really we should be talking about Interpolator:Fragment ratio. Is RV670 commonly interpolator-bound?...
It's worth noting that GPUSA provides an indicator for pixel shader code when it is interpolator-bound.
Messing about with some silly shaders in GPUSA, it appears that both RV670 and RV770 have 32 interpolators:
Code:
struct vertex
{
    float4 colorA : color0;
    float4 colorB : color1;
    float4 colorC : color2;
    float4 colorD : color3;
};
float4 main(vertex IN) : COLOR
{
    float4 A = IN.colorA.xyzw;
    float4 B = IN.colorB.xyzw;
    float4 C = IN.colorC.xyzw;
    float4 D = IN.colorD.xyzw;
    // 2 cycles of math are used to check that the GPU is not interpolator bound
    // Note that ALU:Fragment ratios vary across these GPUs
    // RV610 and RV630 - which are Interpolator bound?
    //return A*A; // Both GPUs run at full speed here
    // 2 attributes makes both GPUs interpolation bound:
    //return A*B; // 1 instruction group
    // RV610 needs 2 cycles
    //return A*B*B; // 2 instruction groups - 2 cycles on RV610
    // find the limit for RV630:
    //return 1/(A-B); // 5 instruction groups - interpolation bound
    //return A/(A-B); // 6 instruction groups - 2 cycles on RV630
    // RV670 and RV770
    // 4 attributes makes both GPUs interpolation bound:
    //return A*(A-B)*(A-B)*(C-D)/C; // 7 instruction groups
    // RV670 needs 2 cycles for 4 attributes, while RV770 is interpolation bound:
    //return A*B*(A-B)*(C-D)/C; // 8 instruction groups - 2 cycles on RV670
    // RV770 needs 2 cycles for 4 attributes:
    return A*A*B*B*C*D*exp(B)*exp(C)/exp(D); // 20 instruction groups - 2 cycles on RV770
}
So the smaller GPUs are 1:1 Interpolator:Fragment while the bigger GPUs are 2:1.
Jawed
#185 | |
|
Regular
|
Quote:
You can't average-out this kind of bottleneck unless you have substantially more batches in flight. The Veined Marble (Procedural Stone) shader in Shadermark is a good example of this. The hardware's effective ALU:TEX ratio (>1) looks fine... Jawed |
#186 | |
|
Senior Member
|
Quote:
#187 | |||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
If you're saturating the TMUs, you can't avoid having the ALU idle, and vice versa. Give me some more details in your example. You mentioned 10 registers and 2-level dependency, but give me an instruction stream characterization like X ALU ops, Y dependent tex ops, Z ALU ops.
Quote:
I almost feel like I need to make a Flash animation of GPU scheduling to prove this to you.
Quote:
#188 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
This, along with the scalar setup, may be one of the reasons that NVidia can get away with smaller register files, but they have to keep vertex attribute information around for a lot longer and worry about feeding it into the SF units in the SPs. OTOH, ATI can discard it as soon as the batches are initialized, keeping the data flow neat and simple. |
#189 | |||||
|
Regular
|
Quote:
Quote:
Quote:
Quote:
The same thing goes for shaders with texture heavy parts and math heavy parts. The overall ratio is all that matters because there are enough batches in flight to statistically even this out.
With dependent texturing, 3D textures and dynamic branching, texturing latency can be severely increased. When this happens the shader's apparent ALU:TEX ratio becomes useless and the "shader-wide averaging" you allude to is no longer meaningful.
Quote:
http://forum.beyond3d.com/showthread...94#post1164394
which uses 8 vec4 fp32s under PS2.0 with hardware ALU:TEX of 1.69 on RV670. Note if you compile for PS3.0 then the shader uses even more registers.
As far as I can tell the poor performance of the shader (i.e. it is not ALU-limited as suggested by the ALU:TEX ratio) is down to the dependent fetches into the 3D texture.
I know you're going to retort "of course a TEX-limited shader is going to leave the ALUs idling". But I'm correcting your earlier statement about the overall ALU:TEX ratio of a shader, which is clearly not true in general. The shader's overall ratio can easily lie due to short sections of unruly code which turn out to have a radically different ratio.
Jawed
#190 | |
|
Regular
|
Quote:
Code:
RS Instructions:
rs 00: r00.rgb- = txc00
rs 01: r01.rgb- = txc01
rs 02: r02.rgb- = txc02
rs 03: r03.rgb- = txc03
Not sure why GPUSA doesn't show this stuff for R600 and later.
Jawed
#191 | ||||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
There are 4 possibilities about utilization:
A) TMU is saturated
B) ALU is saturated
C) Both are saturated
D) Neither are saturated
C) is extremely rare, as it can only happen for a small subset of shaders. If a GPU is running a shader at the maximum possible speed, then it will be either A or B. Your job is to tell me why your example shader would fall into class D.
Quote:
Again, your job is to tell me why you'd be in situation D.
Quote:
Quote:
Last edited by Mintmaster; 22-Jul-2008 at 17:30. |
#192 | |||||
|
Regular
|
Quote:
Quote:
I'm talking about a class B shader as defined by its overall ALU:TEX ratio which executes as a class A shader due to sustained worst-case TEX throughput that exhausts the available ALU batches. The PS2.0 compilation of the veined marble shader, with only 8 registers, exhausts all 32 batches in flight.
Quote:
If the texture operations are cache-thrashing dependent-2D, GPUSA certainly can't take account of extra cycles caused by repeated L2 misses.
Quote:
Quote:
Jawed |
#193 | |
|
Member
Join Date: Nov 2006
Posts: 128
|
Quote:
#194 |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Vertex attribute information is going to be in the post-transform cache for both methods, but in ATI's case the rasterizer only has to access at most three of those vertices every clock cycle. Nothing downstream from the rasterizer needs that, and it processes triangles serially so data flow is exceedingly simple. You may have quads from different triangles in a batch, and a triangle can have quads in different batches, but the rasterizer will only be working on one triangle at a time and won't move to the next one until the current one is completely fed to the shader arrays.
With NVidia's method, each cluster that got quads from a triangle will have to store a copy of its vertex attributes, as I seriously doubt the SMs all have direct access to the post-transform cache. Each quad is going to have to keep pointers to the vertices so that the on-the-fly interpolation can be done.
It's not just how long, but how many have to be kept for that long. Again, I'm not talking about the post-transform cache. With ATI, the rasterizer only needs data from 3 verts at a time. With NVidia, each of the 16 SMs could need tens or even hundreds of vertices available to feed the SF/interpolator unit.
#195 | |||||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Quote:
Quote:
Just think about it: If there are no available ALU instructions, then one of three things happened:
A) This is a texture throughput limited shader. All the threads in the world couldn't help you, as you're limited by texture filtering and/or the TMU data bus.
B) You just got astronomically unlucky. In a theoretically ALU limited shader, at any given time a batch ready for the sequencer to dispatch has a greater chance of waiting for the ALU to service it than the TEX unit (I didn't quite describe that right, but hopefully you know what I mean). For 32 batches, we're looking at less than a one in a billion chance of all needing to go to the TMU.
C) Latency. Maybe you got unlucky and 5 batches were just given to the TMU in the last few cycles. The other 27, though, have had the TMU make memory requests for them, but nothing has come back yet. That's 432 cycles, though. Not the 128 figure your math comes up with.
I really think an animation would help explain all this, but I don't know how to do it without taking up gobs of time. Maybe I can come up with a text-based simulation that's descriptive enough.
Quote:
Quote:
|
#196 | |||
|
Regular
|
Quote:
Quote:
http://forum.beyond3d.com/showthread...32#post1165632 Quote:
The curious thing is, if you are doing one-triangle-at-a-time rasterisation and you are doing screen-space tiled rasterisation, why can't you have multiple rasterisers running in parallel, one per screen-space tile with the triangles per tile serialised? Jawed |
#197 | ||||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Quote:
Quote:
Quote:
|
#198 | |||||
|
Regular
|
Quote:
Quote:
Veined Marble is just a gross example of the way that instantaneous ALU:TEX ratio can radically differ from a shader's overall ALU:TEX, making estimates of performance tricky. Quote:
Quote:
Using the X1800XT numbers that Carsten posted here: http://forum.beyond3d.com/showthread...50#post1194350 and seeing that X1800XT runs Car Surface and Veined Marble at almost exactly the same speed (implying that Veined Marble is, at least, very close to being ALU-limited), it's worth noting that GPUSA says that this shader is 6.75 ALU:TEX on X1800XT. Since the hardware is 1:1, that implies that these volume texture lookups are taking way more than 2 clocks on average! Quote:
Jawed |
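The ratios quoted in this exchange appear to be the shader's raw ALU:TEX ratio divided by the hardware's ALU:TEX ratio. A small sketch of that normalisation follows. The 4:1 hardware ratio for RV670 is an assumption inferred from the thread's own numbers (6.75 becoming 1.69), not something the posts state, and the helper names are made up.

```python
def normalised_ratio(shader_alu_per_tex, hw_alu_per_tex):
    """Shader ALU:TEX ratio expressed relative to the hardware's ALU:TEX ratio."""
    return shader_alu_per_tex / hw_alu_per_tex


def limited_on_paper(shader_alu_per_tex, hw_alu_per_tex):
    """Naive classification that ignores filtering cost and fetch latency."""
    r = normalised_ratio(shader_alu_per_tex, hw_alu_per_tex)
    return "ALU" if r > 1 else "TEX"


if __name__ == "__main__":
    # 6.75 is the GPUSA figure quoted for X1800XT (1:1 hardware).
    print("X1800XT:", normalised_ratio(6.75, 1), limited_on_paper(6.75, 1))
    # 4:1 for RV670 is an assumption; it gives the 1.69 quoted earlier.
    print("RV670:  ", normalised_ratio(6.75, 4), limited_on_paper(6.75, 4))
```

On paper the shader is ALU-limited on both parts, which is why measured results that look TEX-bound point to lookups costing more than one clock each.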
#199 | |
|
Member
Join Date: Feb 2002
Location: LA, California
Posts: 826
|
Quote:
Tying the interpolation rate to the number of TA units in each cluster (regardless of whether interpolation is done just in time or up front) seems like the way to go - am I missing something? The fact that NV sometimes makes multipliers otherwise used for interpolation available for general shading would seem to support that they actually have too much interpolation capacity.
So I guess my question is, would it be beneficial to have dedicated interpolation HW per cluster? This would basically involve dropping the 8 "sometimes missing" multipliers per SM, and some scheduler (and possibly register file) simplifications.
#200 | ||||||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Quote:
Take shader A, which has clustered texture access, and create a new shader B that has the same number/type of instructions (including the same sampling characteristics) but with a more even distribution. My claim is that the execution time of A will not differ from the execution time of B. Local ALU:TEX ratio is inconsequential. Quote:
In this case, it's class A 100% of the time. A shader is only class C if it is class A and B all the time. In retrospect I never should have included it, and instead should have called it class A-B or something. Quote:
Just out of curiosity, why do you call it 0.85 instead of 3.4? Quote:
But yes, it does look like 4+ clocks per lookup. Trilinear filtering would take 4 clocks, FYI. Quote:
The G80 results you found suggest the same thing. Even if we assume that it can compile to needing only 16 FP32 registers due to scalar architecture, it can only have 4092 pixels in flight, which is half of what R600 can hold. It gets twice the framerate, though, meaning it has 1/4 the time to get the fetches from memory. I'm very sure that this shader is filtering throughput limited.
Anyway, look closer at C. Do you have any objection to the 432 cycle figure? That's how long a batch has been waiting for data. That's how long memory latency has to be for the 27 batches to be insufficient when situation A doesn't apply, i.e. the shader is ALU throughput limited. I assumed the bare minimum ALU instructions, too, as more instructions would mean more latency hiding.
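A text-based simulation of the kind mentioned earlier in the thread might start from a toy like the one below. It is a sketch under strong assumptions, not a model of any real sequencer: one ALU and one TMU, each serving one batch at a time in arrival order, 4 clocks per ALU op and 16 per TEX op, a fixed number of resident batches refilled as batches retire, and no memory latency at all. The two example programs have the same instruction mix, clustered versus evenly spread, so they can be compared directly.

```python
from collections import deque


def simulate(program, total_batches, in_flight, alu_clocks=4, tex_clocks=16):
    """Return the clocks needed to run `program` over `total_batches` batches.

    program: string of 'A' (ALU op) and 'T' (TEX op), executed in order.
    in_flight: number of batches resident at once (refilled on retirement).
    Only unit occupancy is modelled; memory latency is NOT.
    """
    if not program or set(program) - {"A", "T"}:
        raise ValueError("program must be a non-empty string of 'A' and 'T'")
    if in_flight < 1 or alu_clocks < 1 or tex_clocks < 1:
        raise ValueError("in_flight and clock costs must be at least 1")
    cost = {"A": alu_clocks, "T": tex_clocks}
    slots = min(in_flight, total_batches)
    pc = [0] * slots                      # next instruction per resident batch
    ready = {"A": deque(), "T": deque()}  # batches queued for each unit
    for slot in range(slots):
        ready[program[0]].append(slot)
    waiting = total_batches - slots       # batches not yet resident
    running = {"A": None, "T": None}      # [slot, clocks_left] or None
    retired = 0
    clock = 0
    while retired < total_batches:
        # Issue: an idle unit takes the oldest batch queued for it.
        for unit in ("A", "T"):
            if running[unit] is None and ready[unit]:
                running[unit] = [ready[unit].popleft(), cost[unit]]
        clock += 1
        # Advance: each busy unit does one clock of work.
        for unit in ("A", "T"):
            job = running[unit]
            if job is None:
                continue
            job[1] -= 1
            if job[1] > 0:
                continue
            running[unit] = None
            slot = job[0]
            pc[slot] += 1
            if pc[slot] < len(program):
                ready[program[pc[slot]]].append(slot)
            else:
                retired += 1
                if waiting > 0:           # bring in a fresh batch
                    waiting -= 1
                    pc[slot] = 0
                    ready[program[0]].append(slot)
    return clock


if __name__ == "__main__":
    clustered = "TTTT" + "A" * 8   # all texture ops first
    spread = "TAA" * 4             # same mix, evenly distributed
    for name, prog in (("clustered", clustered), ("spread", spread)):
        print(name, simulate(prog, total_batches=256, in_flight=32))
```

Whatever the ordering, the total can never fall below the busier unit's workload, so any gap between the two programs comes from start-up and drain effects or from too few batches in flight. Adding a latency term to the TEX path would be the natural next step for testing the case C argument.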
| Tags |
| amd, speculation |