> Dealing with the SF is more of an issue with throughput rather than latency, isn't it?

Yeah, precisely, you want to have instructions available to co-issue on the MAD if at all possible. I was distinguishing between 16-vertex batches and 32-fragment warps, and suggesting that vertex instructions tend to be wider (hence more likely to keep the MAD occupied) while fragment instructions tend to be narrower, hence a lower chance of keeping the MAD busy - corrected by doubling the number of fragments, i.e. by joining two batches together.
> Per 8 ALUs, you only get two SF results per clock.

Either 1 or 2 results per clock, depending on the instruction. So that impacts dependency - you can't use the result of an SF until at least 16 of them have been produced.
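To put rough numbers on that stall, here's a minimal sketch using only the figures above (2 SF results per clock per 8-wide SIMD, 32-fragment warps) - illustrative, not measured:

```python
# Rough model of SF (special-function) throughput for one 8-wide SIMD,
# using the figures from the posts above.
SF_RESULTS_PER_CLOCK = 2
WARP_SIZE = 32

# Clocks until all SF results for one warp exist.
print(WARP_SIZE / SF_RESULTS_PER_CLOCK)   # 16.0 clocks
# Clocks until the first 16 results exist - the earliest point a
# dependent instruction could start consuming them, per the post above.
print(16 / SF_RESULTS_PER_CLOCK)          # 8.0 clocks
```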
> The ALUs have 8 clock latency too, don't they?

I'm unclear whether you're referring to the duration of an instruction from fetch to retire, or to read-after-write latency.
> Anyway, this seems like a very minor consideration to me, focussed too narrowly on worst case scenarios.

Well, if you can come up with a better reason for warps being 32 fragments when the batch size of G80 is 16, then you're welcome... It seems to me it's the same reason CUDA uses warps of 32. It can't be that minor if it squanders dynamic branching performance.
> Yup, I know that. ALU clocks of latency hiding isn't particularly relevant, though. You're trying to saturate the texture unit, which fetches values for 4 pixels each clock. With a ~200 clock latency, you need a minimum of ~800 threads per cluster to avoid stalls. That's a lot more than the 192x2 needed to avoid those register limitations, which shouldn't change with a double-width SIMD array. The only thing that will change is the number of threads needed in flight to saturate the ALUs, and that's still a lot less than ~800.

We started discussing this originally because of the idea of doubling the width of the ALU pipeline to 16 objects. If you do that, then you consume fragments twice as fast. The 768 fragments which in G80 provide a minimum of 96 clocks of latency hiding will only provide 48 clocks of latency hiding in the new GPU. So the register file has to double its fragment capacity. And that's before you clock the new GPU's ALU core to 2GHz+...
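To make both sides of this argument concrete, here's a quick sketch using only the figures quoted above (4 texels fetched per clock, an assumed ~200 clock fetch latency, 768 fragments in flight, 8-wide vs. 16-wide SIMD); treat the numbers as illustrative:

```python
# Threads needed to keep the texture unit saturated (the "~800" figure).
TEX_LATENCY_CLOCKS = 200     # assumed fetch round-trip latency
TEX_PIXELS_PER_CLOCK = 4     # TMU throughput per cluster
print(TEX_LATENCY_CLOCKS * TEX_PIXELS_PER_CLOCK)   # 800 threads per cluster

# Clocks of ALU latency hiding from a fixed pool of 768 fragments,
# before and after doubling the SIMD width.
FRAGMENTS = 768
for simd_width in (8, 16):
    print(simd_width, FRAGMENTS // simd_width)     # 8 -> 96, 16 -> 48
```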
> Fewer texture units on ATI hardware is something that should allow it to get away with a much smaller register file than NVidia, but they seem to go the other way. Why on earth does RV580 handle 512 batches of 48 pixels?

Per ALU pipeline, it's 128 batches.
> That's enough for 1536 cycles between texture lookups! For each TMU-quad, G80 has a max of 48 warps while R580 has 128 batches, with the latter bigger than the former to boot.

You can't spread the latency around like you're doing. e.g. G80's two SIMDs per cluster each have to hide their own texturing latency, independently. Threads can't be shared across the two SIMDs.
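For reference, the 1536-cycle figure falls out of the per-TMU-quad numbers quoted here, assuming 48-pixel batches, 32-fragment warps, and 4 texture fetches per clock:

```python
# Fragments in flight per TMU-quad, from the figures in the thread.
FETCHES_PER_CLOCK = 4   # one quad of texels per base clock
for name, groups, size in (("R580", 128, 48),   # 128 batches of 48
                           ("G80",   48, 32)):  # 48 warps of 32
    fragments = groups * size
    # Clocks for the TMU to service every fragment once, i.e. the gap
    # between successive lookups for any one batch.
    print(name, fragments, fragments // FETCHES_PER_CLOCK)
# R580: 6144 fragments -> 1536 clocks; G80: 1536 fragments -> 384 clocks
```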
For the "no HDR" scores I doubt it. The 8800GTS wouldn't perform similarly to the HD2900XT in that case. Remember that anything above 130 fps is getting a lot of CPU limitation. If you plotted framerate vs. CPU speed for any video card (or framerate vs. video card speed for any CPU), it would gradually level off, not be a sudden transition from linear to plateau. You start having parts of the test that are CPU limited and parts that are GPU limited instead of being 90% one or the other.But I have some doubt, because looking across the cards/resolutions/settings I see bandwidth playing a factor which makes me wonder if fillrate is dominating.
> 7950GT should be ~44% faster if bilinear texturing rate were dominant here. Trouble is, G7x texturing isn't a bastion of efficiency, and R580's out of order texturing may be making a huge difference (even though this is supposedly "DX8" type code), so this datapoint may be moot.

Yeah, I'll agree with that. Remember how texturing takes away ALU ability and vice versa for G7x, and stalls can't be hidden. I don't even know if I'd attribute it to "out of order" processing - even R300 could do math while the texture units were doing their thing.
> So you're saying that G84's lacklustre performance advantage over RV630 is because it has too much texture filtering/addressing/fetching? Maybe the ALU pipeline clocks in G84 are way lower than NVidia was hoping to get. Or maybe the core clock just goes much higher than they were expecting.

There's definitely something weird about G84's texturing. On pretty much every NVidia GPU for the last 5 years you'd get >95% efficiency in multitexturing tests. G84 only manages 70% efficiency (or, according to RightMark, only 50%!).
I was under the impression that G84's co-issued MUL works much better than G80's. Maybe that's a driver problem yet to be fixed; it is still pretty new, after all.
> I see there's some debate on the single-textured test, oh well... I don't understand why the RightMark test doesn't show any sign of the TA:TF ratio as the number of textures increases - unless each texture has independent coordinates.

All texturing tests out there measure bilinear filter speed. Even if they enabled trilinear, they'd have to make sure that the pixel:texel ratio was right so that even optimized drivers would sample from two mipmaps. Otherwise you wouldn't see the TA:TF ratio exposed.
> I presume you meant INT16 filtering.

Nope, I meant INT8, though for some reason ATI does INT16 filtering at half the speed of FP16, whereas G80 does both at the same speed.
G80 has 136% of the theoretical trilinear/AF/fp16 capability of G71. Out of order threading should provide 30-50% more performance on top of that, so G80 is in the region of being theoretically twice as capable as G71 (1.36 x ~1.5 = ~2). What was the matter with 2x faster? Why go through the pain of shoving NVIO off die? 3x the TF when you're aiming to construct a 2x faster GPU is just bizarre.
> I wouldn't be surprised if ALU latency is MUCH more than 8 cycles.

Well, I'm told it's 8 in Xenos, and in Cell it's 7 cycles for MADD. 8 stages is probably enough to get 1.5 GHz, and they certainly want to make it as small as possible if they want small batches, or else they need to flip back and forth between more batches.
> Well, if you can come up with a better reason for warps being 32 fragments when the batch size of G80 is 16, then you're welcome... It seems to me it's the same reason CUDA uses warps of 32. It can't be that minor if it squanders dynamic branching performance.

Maybe a warp count limitation? Maybe NVidia would rather have 16-object granularity for the VS instead of extra texture latency hiding. High register usage in the VS (lots of inputs and outputs) probably precludes them from having lots of threads anyway.
> We started discussing this originally because of the idea of doubling the width of the ALU pipeline to 16 objects. If you do that, then you consume fragments twice as fast. The 768 fragments which in G80 provide a minimum of 96 clocks of latency hiding will only provide 48 clocks of latency hiding in the new GPU. So the register file has to double its fragment capacity. And that's before you clock the new GPU's ALU core to 2GHz+...

You just told me that 192 fragments (6 warps) is enough to hide the register file limitations and keep the ALUs fed. Double the warp size and SIMD width, and now 384 fragments is enough. That's still less than what G80 can currently handle.
> Per ALU pipeline, it's 128 batches.

Hence my next paragraph.
> You can't spread the latency around like you're doing. e.g. G80's two SIMDs per cluster each have to hide their own texturing latency, independently. Threads can't be shared across the two SIMDs.

I know the threads can't be shared, but it doesn't matter. TMU throughput is 4 pixels per base clock. If you have 500 threads in each SIMD array needing a texture request, regardless of what order you choose it will take 250 clocks to get through them. Those 500 threads are more than enough per SIMD to do some math in the meantime, and you can double the SIMD width and still have enough threads. By the time the result for the first texture request comes back (~200 cycles), the TMU is almost ready to do more fetching.
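A quick sketch of that arithmetic, assuming two SIMDs sharing one TMU-quad as described (all figures from the post above):

```python
# Two SIMDs per cluster share one TMU that serves 4 pixels per base clock.
THREADS_PER_SIMD = 500
SIMDS_PER_CLUSTER = 2
TMU_PIXELS_PER_CLOCK = 4
TEX_LATENCY = 200            # assumed fetch round-trip, in base clocks

total_requests = THREADS_PER_SIMD * SIMDS_PER_CLUSTER
clocks_to_issue_all = total_requests // TMU_PIXELS_PER_CLOCK   # 250 clocks
# The first result returns after ~200 clocks, i.e. just before the TMU
# finishes issuing the whole queue - so it is barely ever left idle.
print(clocks_to_issue_all, clocks_to_issue_all - TEX_LATENCY)  # 250 50
```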
Good thing they weren't aiming for 8800GTS performance, then.
> Yeah, I'll agree with that. Remember how texturing takes away ALU ability and vice versa for G7x, and stalls can't be hidden. I don't even know if I'd attribute it to "out of order" processing - even R300 could do math while the texture units were doing their thing.

But this is a pure bilinear texturing case. The texturing latency should be hidden 100% of the time, and losing the top MAD ALU to TEX still leaves 1 instruction co-issuing.
> There's definitely something weird about G84's texturing. On pretty much every NVidia GPU for the last 5 years you'd get >95% efficiency in multitexturing tests. G84 only manages 70% efficiency (or, according to RightMark, only 50%!).

Yeah, I decided "come back in a few months".
> One thing that bugs me is that you need to iterate 2 texture coordinate values in order to fetch a texture. If G84 can only do 4 per cluster per ALU clock (like G80), then it should be limited to 5800 Mtex/s.

Ah, you think that it's attribute interpolation that's slowing down G84? That's a good thought. Though in the past I've speculated that NVidia has the option to perform interpolation "early" and dump the results in CB (or, worst case, RF). I think a consensus arose that NVidia is storing 1/w, but I'm not sure now.
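The 5800 Mtex/s figure can be reconstructed from the quoted assumption (4 interpolated values per cluster per ALU clock, 2 values per 2D texture coordinate) plus 8600GTS's published 1450MHz shader clock and 2 clusters - a sketch of the hypothesis, not a confirmed limit:

```python
# Hypothetical interpolation-limited texturing rate for 8600GTS.
ALU_CLOCK_MHZ = 1450          # 8600GTS shader clock
CLUSTERS = 2                  # 32 SPs at 16 per cluster
INTERP_PER_CLUSTER_CLOCK = 4  # assumed, same as G80
VALUES_PER_TEXCOORD = 2       # u and v for a 2D fetch

fetches_per_clock = CLUSTERS * INTERP_PER_CLUSTER_CLOCK / VALUES_PER_TEXCOORD
print(fetches_per_clock * ALU_CLOCK_MHZ)   # 5800.0 Mtex/s
```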
> Does G84 have more SF units per cluster than G80? That might explain better MUL co-issue, I guess, and why they'd put so much texture addressing there.

I don't know why G84 is supposed to be better at MUL co-issue. I'm waiting for the journalists at this site to dig; they've got the hardware and the connections ...
> Nope, I meant INT8, though for some reason ATI does INT16 filtering at half the speed of FP16, whereas G80 does both at the same speed.

Hmm, I don't get what you meant by "adds a little more math for the INT8 filtering". Whereas we know that NVidia must have added math for INT16 filtering, because ATI is running at half rate, and INT16 requires more precision than FP16.
> Anyway, I was saying an ATI TMU quad and G80 TMU quad both have the same 4-channel filtering rate for FP16, FP32, sRGB, and other new >8bpc HDR formats. The only difference is with INT8 filtering, and then only for aniso, trilinear, and volume textures.

Maybe that gives the lie to the "G7x is wonderfully efficient per mm" bullshit then, because HQ texturing was, until the bitter end, not something enabled in benchtests on G71. And G71 still fell as much as 50% behind R580 - though I wouldn't put the blame on texturing per se.
It's not bizarre because this particular addition is fairly cheap. The G71 had very shimmer-prone AF and that star-shaped pattern. G80 gives you 2x the performance and better image quality.
It's a very sensible design decision for high end cards where people care about AF. I personally think it's important for low end cards too, but my guess is that sales aren't particularly dependent on it.
> I'm prolly in the minority; I think R600's AF looks better than G80's. Those 3DMk06 tunnel tests show R600 producing cleaner results and producing detail further into the tunnel. But I haven't seen that test in motion (does it move?).

Yeah, it can move. I'll see if I can knock up some movies for you, actually (and some from a game or two as well).
> I know the threads can't be shared, but it doesn't matter. TMU throughput is 4 pixels per base clock. If you have 500 threads in each SIMD array needing a texture request, regardless of what order you choose it will take 250 clocks to get through them.

G80 has 768 threads (fragments) per multiprocessor. It takes 226 base clocks to get through them, worst case (8800GTX, 2.35:1 ALU:core clock ratio).
> Now, it's true that you can get away with fewer threads if you can do multiple fetches per pixel before needing a result. We don't really know how long it takes for a fetch to come back, but judging from G7x it appears 220 cycles is the minimum latency hiding that NVidia feels is necessary.

I don't think that's necessarily a good guide. The reason I say that is that the G7x architecture is scaled up and down for various SKUs, but is a common pipeline architecture. Different SKUs have different clocks and a variety of DDR types, and therefore the ratio of core:memory clocks - or, ultimately, the ratio of instruction-throughput:memory-latency - varies. It'd be nice to see an analysis of this SKU-scaling effect on architecture...
> Dependent texturing tests with a controlled number of registers are probably the best way to find out how long the latency is.

Yes, dependent texturing is another area where GPUs seem infeasibly good...
> Yeah, it can move. I'll see if I can knock up some movies for you, actually (and some from a game or two as well).

One of the things that puzzles me about those 3DMk06 images is why they're non-circular. What the hell is going on there?
> Well, I'm told it's 8 in Xenos, and in Cell it's 7 cycles for MADD. 8 stages is probably enough to get 1.5 GHz, and they certainly want to make it as small as possible if they want small batches, or else they need to flip back and forth between more batches.

This describes an 8-stage main ALU:
Certainly, with these G84 texturing results (which I wasn't aware of) the mystery deepens.
> But, if I read the B3D article correctly, G84 has more texture address units per cluster, but not more texture filtering units per cluster, with respect to G80. So, total filtering power per clock should be 1/4 of G80's but "texture addressing power" should be 1/2.

In the second comparison posted, 8600GTS is being compared against 8800GTS. It has 90% of the theoretical bilinear rate (10800 MT/s versus 12000 MT/s). The 8-textures case shows 8600GTS achieving 47% of 8800GTS. 1 texture is 52%, 2 textures is 54%, and 3 textures is 56% (the best).
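Those headline rates check out against the published clocks and filter-unit counts (675MHz x 16 bilerps/clock for 8600GTS, 500MHz x 24 for 8800GTS); a quick verification:

```python
# Theoretical bilinear rates from public specs: (core MHz, bilerps/clock).
cards = {"8600GTS": (675, 16), "8800GTS": (500, 24)}
rates = {name: clk * units for name, (clk, units) in cards.items()}
print(rates)                                # 10800 vs. 12000 MT/s
print(rates["8600GTS"] / rates["8800GTS"])  # 0.90, the quoted 90%
# Yet the measured multitexturing ratio above is only 47-56%.
```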
> Incidentally, G84-GTS has 50% of the bandwidth of G80-GTS. If you run a bandwidth test, you shouldn't be surprised that things scale by the bandwidth ratio...

Multitexturing tests with more than 1 or 2 textures are never bandwidth limited unless you use an enormous texture (netting you a 1:1 pixel:texel ratio) and no compression. I'm 99% sure that RightMark and 3DMark do not fall into that category. That's why nearly every video card reaches its peak when running them.
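For what it's worth, the 50% bandwidth figure follows directly from the published memory specs (8600GTS: 128-bit at 2000MHz effective; 8800GTS: 320-bit at 1600MHz effective):

```python
# Memory bandwidth from public specs: bus width (bits), effective rate (MHz).
specs = {"8600GTS": (128, 2000), "8800GTS": (320, 1600)}
for name, (bus_bits, mhz) in specs.items():
    gb_per_s = bus_bits / 8 * mhz * 1e6 / 1e9   # bytes/transfer * transfers/s
    print(name, gb_per_s)                       # 32.0 vs. 64.0 GB/s
```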
And you can't say that with a completely straight face either. There's nothing easy in writing drivers for new GPU architectures, and the opportunities to leverage expertise in writing a driver for an old arch on the same API aren't as wide ranging as you might think. It'd be nice to see AMD or NVIDIA publish a diagram or two of the driver stack that highlights the reality of the complexity they contain.
> But this is a pure bilinear texturing case. The texturing latency should be hidden 100% of the time, and losing the top MAD ALU to TEX still leaves 1 instruction co-issuing.

We're still talking about SS2, right? Games are never purely a bilinear texturing test. Moreover, drivers and benchmark choice are always a potential factor. Everything scales as expected here, for example.
> Ah, you think that it's attribute interpolation that's slowing down G84? That's a good thought.

Sort of, but if you work out the numbers, that doesn't seem like the right theory either, unless 3DMark is re-using interpolators in some non-integer ratio.
> Hmm, I don't get what you meant by "adds a little more math for the INT8 filtering". Whereas we know that NVidia must have added math for INT16 filtering, because ATI is running at half rate, and INT16 requires more precision than FP16.

INT16 filtering can be done by chaining together two INT8 units at minimal cost (I worked out the logic myself), because the interpolation factors are constant width (8-bit?) for all texture formats. Xenos does it the same way, with bilinear filtering speed always dependent on the total width of the format (except for FP16).
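A sketch of why the weight width matters more than the texel width here: a fixed-point lerp keeps its 8-bit interpolation factor regardless of texel precision, so the multiplier cost scales with texel width - which is the sense in which two INT8-sized units chain into one INT16 unit. Illustrative only, not the actual hardware logic:

```python
def lerp_fixed(a: int, b: int, w8: int, texel_bits: int) -> int:
    # One linear-interpolation step with an 8-bit weight (0..255), as in
    # bilinear filtering. The multiply is texel_bits x 8 wide: doubling
    # texel precision roughly doubles multiplier cost while the weight
    # stays 8-bit, so INT16 filtering = two chained INT8-sized multiplies.
    assert 0 <= w8 <= 255 and 0 <= a < 2**texel_bits and 0 <= b < 2**texel_bits
    return (a * (255 - w8) + b * w8 + 127) // 255

print(lerp_fixed(0, 255, 128, 8))      # ~midpoint of two INT8 texels
print(lerp_fixed(0, 65535, 128, 16))   # same 8-bit weight, INT16 texels
```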
> Maybe that gives the lie to the "G7x is wonderfully efficient per mm" bullshit then, because HQ texturing was, until the bitter end, not something enabled in benchtests on G71. And G71 still fell as much as 50% behind R580 - though I wouldn't put the blame on texturing per se.

HQ on both cards is not apples to apples. R580 looked better due to better AF detection math, and NVidia probably disabled a bunch of optimizations (which ATI still mostly had on) that don't affect quality too much but tank performance.