Sir Eric Demers on AMD R600

Over 160 GB/s on R600 for the SGL part.

It's a pity that, due to driver issues, I can't complete the entire set of GPUbench. :rolleyes:

The DX version in CVS will work, but it's a bit of a pain to get built and working. You need to build the binaries with Visual Studio (patching your lib/bin directories) and copy all the shaders to the DX9 directory. I haven't had time to clean up the build and tests... (If someone wants to fixup the build system like the GL side, I'm happy to accept patch sets and help where I can :devilish: ) We also haven't gone through the same depth of testing and tweaking on the DX9 benchmarks as compared to the GL ones, but CTM and DX tests seem to agree modulo CTM-specific tweaks.

AMD has been told about the issues with the GL driver and there are drivers that behave and will run all the tests, but DX9 has the memory controller better tuned from what I've seen thus far.
 
What's your point? You were claiming that "G80 is so wildly wasteful of its texturing and raster capabilities". I was demonstrating to you that it probably isn't.
This is getting there:

http://www.ixbt.com/video/itogi-video/0607/itogi-video-ss1-wxp-2560-pcie.html

but 8800GTX is 35% faster than HD2900XT, despite its 55% theoretical advantage in bilinear texturing.

I don't think shadowmap fillrate is much of a limitation on the high end cards. Even a 2048x2048 shadow map with 2x overdraw (remember that aerial views have low overdraw) would get filled in under a millisecond on R580, so that's 57fps vs. 60fps for an infinitely fast fillrate. R600 has 2.3x that rate, G80 much more. Triangle setup is usually the big cost for shadow maps.
OK, so setup rate and texel fetch rate are the key bottlenecks, for current "simple PCF" type shadowing in games. It'd be nice to see some results that confirm this...
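
A quick back-of-envelope check on the quoted shadow-map numbers (a sketch only; the Z-only fill rate assumed below is 16 ROPs at 650 MHz, which is an assumption rather than a measured figure):

```python
# Back-of-envelope check of the shadow-map fill cost quoted above.
# Assumption: ~10.4 Gzixels/s of usable Z-only fill on R580 (16 ROPs * 650 MHz);
# the real figure depends on double-rate Z, compression, etc.
shadow_zixels  = 2048 * 2048 * 2             # 2048x2048 map with 2x overdraw
z_fill_rate    = 16 * 650e6                  # ~10.4 Gzixels/s (assumed)

fill_ms        = shadow_zixels / z_fill_rate * 1000
frame_ms_60fps = 1000 / 60
fps_with_fill  = 1000 / (frame_ms_60fps + fill_ms)

print(f"fill time: {fill_ms:.2f} ms")                       # ~0.81 ms, i.e. under a millisecond
print(f"60 fps baseline becomes: {fps_with_fill:.1f} fps")  # ~57 fps
```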

I was just talking about a test where we can take G80's 1:2 TA:TF advantage out of the equation. In such a scenario the GTS is still close to R600 in games, despite having much lower math ability, bandwidth, vertex setup, etc.
Math is moot, it's hard to find a game that's math-bound (certain driving games that are known for their high math load rarely seem to be benchmarked). Bandwidth is also moot, because without AA/AF the bottleneck moves somewhere else, e.g. CPU or zixel rate. Though that SS2 test I linked above hints that bandwidth may be very important - but I dunno what's happening with fillrate there, either.

It's also why I introduced the G84/RV630 comparison, since G84 is 1:1. But looking at scalings comparing the two couplings (G80 v R600 and G84 v RV630) makes me think results are sabotaged by drivers. 8600GTS holds a bigger theoretical advantage over HD2600XT-GDDR4 across the board (69% bilinear, 69% colour fillrate) than 8800GTX does over HD2900XT (55% bilinear, 16% colour fillrate), yet it's the latter coupling that shows the bigger performance margin in games.
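
For reference, here's roughly where those percentages come from; the unit counts and clocks below are the commonly quoted ones and should be treated as assumptions:

```python
# Rough reconstruction of the theoretical-advantage percentages quoted above.
# (bilinear texels/clk, colour pixels/clk, core MHz) - assumed figures.
cards = {
    "8800GTX":        (32, 24, 575),
    "HD2900XT":       (16, 16, 742),
    "8600GTS":        (16,  8, 675),
    "HD2600XT GDDR4": ( 8,  4, 800),
}

def advantage(a, b):
    ta, ra, ca = cards[a]
    tb, rb, cb = cards[b]
    bilinear = (ta * ca) / (tb * cb) - 1
    fill     = (ra * ca) / (rb * cb) - 1
    print(f"{a} vs {b}: +{bilinear:.0%} bilinear, +{fill:.0%} colour fillrate")

advantage("8800GTX", "HD2900XT")        # ~+55% bilinear, ~+16% colour fillrate
advantage("8600GTS", "HD2600XT GDDR4")  # ~+69% bilinear, ~+69% colour fillrate
```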

It could be that G84's ALU:TEX ratio, being effectively lower for bilinear than G80, is running into the "not enough latency hiding" problem you mentioned earlier. Whoops. Not enough objects in flight. Yet those patents imply that math throughput in G84 should be higher than in G80.

Don't forget hierarchical stencil. That makes a huge difference in Doom3. I don't think this game's performance is indicative of texture fetch performance at all.
It was the closest I could find with AF/AA off :cry: I don't know enough about SS2's engine to analyse the new data.

Effectively, yeah. I just avoided "TA rate" because ATI claims more than 16 per clock when really that isn't true for pixel shader texturing.
Yeah, point filtering with existing games seems to be moot, except perhaps when dealing with shader-filtered shadows (rather than PCF). Which prolly only ever happens on ATI hardware because of NVidia's historical advantage with PCF.

Not sure what you mean by "G80 falls down", though. It's still 55% more than R600 for the GTX, and matches R600 in the GTS.
I'm thinking of the percentage of the time that 32 of G80's TFs are idle because the GPU is doing bilinear or point-sampling. We also don't know if vertex texture fetch is performed using the TMU hardware in G80...

Everyone's made a ballyhoo about the "100% utilisation" of the ALU pipeline in G80 (which is far from true - hello SFU). I think TMUs prolly use up dramatically more die space, but hey, I don't know of any decent NVidia GPU die shots.

For G84, I think it's partly because reviewers don't enable AF for cards of that budget as consistently as they do for the high end cards (or at least that's what NVidia thought at design time).
It's got a vast amount of theoretical filtering rate, 91% of HD2900XT for bilinear and 82% more trilinear :oops:

It seems like people really want to run at native LCD resolutions, even if they have to disable AF/AA (which I think is stupid, but whatever).
Native LCD resolutions do seem to have mucked-up the stratification of GPUs, what with 1280x1024 being practically "entry-level" for LCDs. But, yeah, that's a whole other topic.

Right now, yeah, but ATI also wanted to avoid halving the filter rate for these formats, and they did it in a way that didn't have the auxiliary benefits of NVidia's approach. If both IHVs are doing this, they clearly think there's some future for these formats.
Right now I'm thinking ATI's ratio is the right way. Xmas's point about RGBE, which is also relevant to sRGB-aware texture filtering (sRGB will presumably become the majority of the usage once devs embrace the concept), is fairly important. But those are both D3D10-specific arguments for 32-bits per texel formats, which need "fp16" capable TUs.

G80's TA:TF ratio just seems to be a matter of circumstance. I'd be surprised to see it retained in the future.

The "backwall" has little to do with AF load.
I disagree. Driving games for example already show that AF runs out well short of the view distance. Though you might argue that only affects a relatively small portion of the screen area (one that grows perceptually as resolution increases) - it's just a pity that it's where you're looking...

The reason AF load per pixel drops is that all the near pixels don't need as much AF, so the "frontwall" (i.e. the point where at least 2xAF is needed) moves back. In any case, I believe that texture size will increase faster than resolution, and thus AF load per pixel will increase, all else being equal.
I agree about the front wall, and that's where a dramatic difference due to higher-resolution texturing appears. And, yeah, screen resolution has hit 2560x1600 and it seems like it'll be years before we get better than that.

However, your second point makes sense, so all else is not equal. Assuming they enable AF selectively (say, on just the base texture), you're probably right about AF performance playing a lesser role.
And deferred rendering will also chip away at this issue, with overdraw falling off dramatically.

I also wonder if MegaTexturing changes the landscape of AF load :LOL:

Again, there's too many variables to make any conclusion like that. D3 performance is also heavily dependent on early fragment rejection (Z and stencil), which is quite different between R600 and G80 (rate, granularity, corner cases, etc). Add in drivers (sometimes the GTX is more than 53% faster than the GTS!), memory efficiency, per-frame loads (resolution scaling isn't perfect), etc and now there are a lot of unquantified factors from which you're trying to isolate the effect of just one.
Hence my general displeasure with the "analysis" in the reviews out there, though ATI's drivers are a major source of grief, making analysis really difficult. Oh well.

You have no evidence for such a theory. Texturing tests show that efficiency is just fine.
http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=4

I'm not sure of the quality of the 3DMk06 single-texture test though.

Game tests have too many variables for you to blame TA ability.
Game tests show that 8800GTS is pissing its TF (and fillrate) up the wall, while HD2900XT is pissing its bandwidth up the wall :p

Perhaps this is why R6xx appears to be scaling down better - even though the advantage should lie with G8x based on theoreticals? All R6xx variants seem to have an "excess" of bandwidth against the competing NVidia parts.

Are you complaining about why the filtering units aren't saturated? Is that what this whole rant is about? You're complaining about why the GTX isn't 3x the speed of R600 in games???

Who cares? There's no reason to expect that. They're there to reduce the performance hit of high TF scenarios.
Relatively, they're expensive to build and fly in the face of "high utilisation" as a design goal. As I've said before, I think G80 just came out the way it did through "least resistance". Off-die NVIO is clearly a bugbear. My suggestions for alternative configurations (TA:TF 1:1 or 12 clusters) seem to be the victims of die size one way or another. Die size will be much less of a problem in the new variants, I trust.

I still don't see why, for a warp size of 16, these would affect scalar/vec2 code more so than vec4 code. I understand how instruction issue rate changes, but not register related issues. Are you talking about a latency between writing a result and reading it again? That's easy to solve with a temp register caching the write port, so this is not an issue that's holding NVidia back from reducing warp size.
It's true that's a solution (to latency), something you'll find in Xenos and R600. Wasn't one of the recent patents pointing at that? Anyway, the signs are that warp size will stay at 32 - it's my suspicion that batch size will increase from 16 to 32 (G80's warp is two batches married).

It's also about co-issue - MAD and SF co-issue breaks apart if you only have 16 objects (one batch, half a warp). SFs last for 4 or 8 clocks and co-issue becomes problematic if you run out of issuable instructions on the MAD pipe, when the MAD operands are scalars, not vec4s.

Finally, it's about RF address rate. If you have lots of scalar operands, then the RF is being pushed to access more addresses than it can handle. A 2-clock instruction (for a single batch) requiring 3 operands for 16 pixels is requesting 3 different addresses in 2 clocks :cry: A four-clock (2 batches in a warp) instruction requiring 3 operands for 32 pixels is requesting 3 different addresses in 4 clocks :D

Yeah, it's pretty trivial.
32 scalar operands, each of 32 bits, that's 1024 bits.

PDC might be different since I don't really know the details, but for a fixed number of ports in the RF and CB, doubling bus width is easy. An equally sized RF or CB with double the bus width and double the granularity simply halves the number of partitions you're selecting from and doubles some wires.
RF is banked. I think it's 16-banked in G80. Same for CB and PDC I expect. Each of these 3 data sources is presumably feeding an operand re-order buffer.

Also, you have to double the size of the RF (can't keep the size constant), because you want to double the number of objects in flight, since your ALU pipe is now chewing through them twice as fast. Otherwise you've just lost half your latency-hiding.

Doubling SIMD width and warp size is way cheaper than doubling the number of arrays. I can't see why you'd think otherwise.
I'm sceptical about the feasibility of the 3x1024-bit buses and the doubled scale of the operand re-ordering "crossbar" (doubling its bank/port count I guess, erm...). 4-clock instructions in the ALU pipe are even cheaper to implement, I would say...

So R600 has 16 of these 20-ALU blocks. G80 has 16 groups of 8-ALU array. Care to explain why it's so hard for NVidia to double the SIMD width next gen when it's still smaller than ATI this gen?
I don't know R600's RF fetch rate. At minimum it's presumably 640-bits wide (20x32) into the 8KB cache (which is used as an operand reordering buffer) - or at least a quarter of the 8KB cache, presumably dedicated to those 20 pipes. It might be double that, of course, 1280 bits.

Jawed
 
Why do you think that?
  • R600 has in-pipe registers to solve the read after write hazard (at least if the patent documents are to be believed - it's a fact in Xenos (though only for SF results), so very likely in R600)
  • the register file in R600 supports a large number of batches in flight making thread-swapping an option
  • R600 hides the latency of register spillage to video memory and fetching them back
  • the ALU pipeline consumes instructions in an AAAABBBBAAAABBBB... pattern (for batches A and B) easing addressing load
  • ATI has a history of ensuring that register bandwidth doesn't impact ALU pipeline throughput
Jawed
 
Jawed, I'm going to chop up my reply into three to make it more manageable in size. Hopefully we can put part of our discussion to rest.

I disagree. Driving games for example already show that AF runs out well short of the view distance. Though you might argue that only affects a relatively small portion of the screen area (one that grows perceptually as resolution increases) - it's just a pity that it's where you're looking...

...

And deferred rendering will also chip away at this issue, with overdraw falling off dramatically.
Even in driving games, the number of pixels with 16xAF is very small. Consider driving on an infinite plane with a 70 degree vertical FOV. Only 5% of the pixels have over 16:1 anisotropy (and I think clever texel walking can cover that area with only 8 bilinear samples, so maybe 2.5% is the relevant figure). More importantly, this is resolution independent - only the mipmap selection changes. It doesn't matter that 16xAF is inadequate for you in driving games. Hence the "backwall" does not make AF load go down with resolution as you claimed.
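
A sketch of the geometry behind that 5% figure, assuming the footprint anisotropy on a ground plane is roughly 1/sin of the grazing angle and approximating screen rows as linear in view angle:

```python
import math

# Pixels on a ground plane exceed 16:1 anisotropy once the grazing angle drops
# below asin(1/16); with the camera aimed at the horizon, that band occupies a
# fixed slice of the 70-degree vertical FOV regardless of resolution.
max_angle_deg    = math.degrees(math.asin(1 / 16))   # ~3.58 degrees
vertical_fov_deg = 70

fraction = max_angle_deg / vertical_fov_deg          # crude linear-angle approximation
print(f"grazing angle for 16:1 aniso: {max_angle_deg:.2f} deg")
print(f"approx. fraction of the screen: {fraction:.1%}")   # ~5%
```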

AF load decreases with res due to all the other pixels. Extra magnification of the same sized texture at high res leads to fewer pixels with 2xAF needed, and reduced AF needed for other pixels. That's why I think increased texture resolution is a bigger factor in future AF load than devs being more selective about textures needing AF.

Deferred rendering or a Z-only pass will reduce overdraw, but they barely affect the percentage of pixels needing AF. Remember that for DR you fetch the textures during the forward rendering part.
 
This is getting there:

http://www.ixbt.com/video/itogi-video/0607/itogi-video-ss1-wxp-2560-pcie.html

but 8800GTX is 35% faster than HD2900XT, despite its 55% theoretical advantage in bilinear texturing.
You're being disingenuous here. The GTS is barely any slower than the 2900XT in that same graph. CPU limitations are starting to kick in for the GTX (as seen here, and it's also evident in the minimal improvement of the Ultra over the GTX in your graph).

Math is moot, it's hard to find a game that's math-bound (certain driving games that are known for their high math load rarely seem to be benchmarked). Bandwidth is also moot, because without AA/AF the bottleneck moves somewhere else, e.g. CPU or zixel rate.
Bandwidth per clock rarely increases with AA/AF, as you have marginally more data per pixel spread over extra clocks. It is hard to isolate the effect of bandwidth, but look at the performance hit of AA/AF on the 9700 vs. the 9500 Pro. Twice the BW makes little difference in perf hit. Without AA/AF, the GTS is outclassed by the 2900XT in every respect except texturing, where they match. Nonetheless, it performs similarly (especially in SS2), so clearly it is not wasteful of its texturing ability. That you say math is moot only further weakens your argument.

8600GTS holds a bigger theoretical advantage over HD2600XT-GDDR4 across the board (69% bilinear, 69% colour fillrate) than 8800GTX does over HD2900XT (55% bilinear, 16% colour fillrate), yet it's the latter coupling that shows the bigger performance margin in games.
Too many factors here to pin it on texturing/fillrate. The math deficit of G84 vs. RV630 is huge compared to G80 vs R600, and don't tell me games aren't using enough math. G84 can do only 4.3 scalar instructions per tex fetch.
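
Where the 4.3 figure comes from, assuming 8600GTS-style clocks and unit counts (32 scalar ALUs at a 1.45 GHz shader clock, 16 bilinear fetches per 675 MHz core clock):

```python
# G84 (8600GTS) math-to-texture ratio, using assumed clocks and unit counts.
scalar_ops_per_s  = 32 * 1.45e9    # 32 scalar ALUs at 1.45 GHz shader clock
tex_fetches_per_s = 16 * 675e6     # 16 bilinear fetches per clock at 675 MHz core

print(f"{scalar_ops_per_s / tex_fetches_per_s:.1f} scalar instructions per fetch")  # ~4.3
```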

http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=4

I'm not sure of the quality of the 3DMk06 single-texture test though.
The single texturing test in every iteration of 3DMark is a measure of alpha-blended fillrate without Z, and G80 reaches its theoretical rate. In the multitexturing test G80 also reaches its theoretical texel rate.

Game tests show that 8800GTS is pissing its TF (and fillrate) up the wall, while HD2900XT is pissing its bandwidth up the wall :p

Relatively, they're expensive to build and fly in the face of "high utilisation" as a design goal. As I've said before, I think G80 just came out the way it did through "least resistance". Off-die NVIO is clearly a bugbear. My suggestions for alternative configurations (TA:TF 1:1 or 12 clusters) seem to be the victims of die size one way or another. Die size will be much less of a problem in the new variants, I trust.
Okay, so let me just make this clear. When you say "G80 is wildly wasteful of its texturing ability", are you referring solely to INT8 texture filtering ability? That's a more reasonable statement, but I still disagree as to its relevance.

As I've said before, TF is not that expensive. This is a 680M (+NVIO) transistor part with 64 TF units. NV10 had 8 TF units in only 23M trannies (sorry about the 15M number - conflicting info on the web). 8 times the TF in ~30 times the trannies is nothing. Yes, the filtering units are a bit more complicated now in the DX10 era, but the difference between NVidia's approach and ATI's approach is solely in INT8 filtering ability - the same thing you find in NV10.

NVidia uses the same bus widths per TU-quad as ATI, but adds a little more math for the INT8 filtering and a little more logic for locating a related second quad of texels for trilinear, aniso, or volume textures (ATI has to do these anyway, but can afford to do them at half the rate per TU-quad). The math is relatively tiny, and the hardest part is altering the texture cache to fetch twice as many half-width values.

You are not only exaggerating the cost of TF and fillrate, but you are also undermining their value in keeping the performance of the GTS close to the HD2900XT.
 
It's also about co-issue - MAD and SF co-issue breaks apart if you only have 16 objects (one batch, half a warp). SFs last for 4 or 8 clocks and co-issue becomes problematic if you run out of issuable instructions on the MAD pipe, when the MAD operands are scalars, not vec4s.
The ratio of SF-pipe to MAD-pipe instructions doesn't change with warp size. If you don't have enough MAD instructions in a 16-object group, you don't have enough for the 32-object group either.

Finally, it's about RF address rate. If you have lots of scalar operands, then the RF is being pushed to access more addresses than it can handle.
There's no difference between scalar and vector here, so I don't see your point. Whether it's a vec4 MADD or 4 scalar MADDs being run on 16 objects, the 8 ALUs still need to read 12 blocks of 512-bit operand data in 48 clocks.

32 scalar operands, each of 32 bits, that's 1024 bits.
So? It's a lot easier than your solution of doubling the number of shader arrays, resulting in 2x512-bit busses, each with half the granularity of this single 1024 bit bus. There's a whole bunch of other things you need too with your solution.

Also, you have to double the size of the RF (can't keep the size constant), because you want to double the number objects in flight, since your ALU pipe is now chewing through them twice as fast. Otherwise you've just lost half your latency-hiding.
No you don't, because the amount of latency hiding necessary is determined primarily by the number of texture units you have. ALU latency is what, 8 clocks? A G80 cluster, with 16 ALUs, only needs 128 pixels in flight. 4 texture lookups per clock, however, needs ~800 pixels in flight, assuming 200 clock latency and the ability to immediately use the texture's value without penalty. NVidia can easily double the SIMD width without doubling the register file. (They probably will, though, because 64KB of register file split among 800 threads is only 20 FP32 scalars.)
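
Spelling out the arithmetic in that argument (the 200-cycle latency, 4 fetches per clock, and 64KB register file are the assumptions stated above):

```python
# Threads needed per cluster to cover texture latency vs. ALU latency,
# and the resulting register budget per thread.
tex_lookups_per_clk = 4                     # one TMU-quad per cluster
tex_latency_clks    = 200                   # assumed round-trip latency
threads_for_tex = tex_lookups_per_clk * tex_latency_clks    # ~800 threads

alu_width   = 16                            # 16 ALUs per cluster
alu_latency = 8                             # assumed ALU pipeline latency
threads_for_alu = alu_width * alu_latency   # 128 threads

rf_bytes = 64 * 1024                        # 64KB register file per cluster (assumed)
scalars_per_thread = rf_bytes / (threads_for_tex * 4)       # FP32 scalars per thread

print(threads_for_tex, threads_for_alu, scalars_per_thread) # 800 128 ~20
```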

I'm sceptical about the feasibility of the 3x1024-bit buses and the doubled scale of the operand re-ordering "crossbar" (doubling its bank/port count I guess, erm...). 4-clock instructions in the ALU pipe are even cheaper to implement, I would say...
You're not making sense, Jawed. You're saying that four 8-wide SIMD arrays per cluster operating on 32-object warps is cheaper than two 16-wide SIMD arrays operating on 64-object warps? How? The latter is waaay cheaper. Half the warps to juggle. Half the scheduling logic. Etc, etc...

I don't know where 3x1024-bit vs. 640-bit comes from either. You're saying R600 only needs a 32-bit RF bus per ALU, but a G80 with double width SIMD arrays needs a 192-bit bus per ALU. Complete nonsense.
 
The single texturing test in every iteration of 3DMark is a measure of alpha-blended fillrate without Z, and G80 reaches its theoretical rate. In the multitexturing test G80 also reaches its theoretical texel rate.

In some practical tests with various 3DMarks' ST-Fills it seems to be scaling exceptionally well with available bandwidth.
 
Even in driving games, the number of pixels with 16xAF is very small. Consider driving on an infinite plane with a 70 degree vertical FOV. Only 5% of the pixels have over 16:1 anisotropy (and I think clever texel walking can cover that area with only 8 bilinear samples, so maybe 2.5% is the relevant figure). More importantly, this is resolution independent - only the mipmap selection changes. It doesn't matter that 16xAF is inadequate for you in driving games. Hence the "backwall" does not make AF load go down with resolution as you claimed.
I was thinking of this:

http://www.computerbase.de/bild/article/648/123/

and presuming that as resolution goes up that central "null" area (not the black, but the area that's "under-filtered") just grows. Yes, it's a small part of the screen.

AF load decreases with res due to all the other pixels. Extra magnification of the same sized texture at high res leads to fewer pixels with 2xAF needed, and reduced AF needed for other pixels. That's why I think increased texture resolution is a bigger factor in future AF load than devs being more selective about textures needing AF.
I wasn't trying to say "AF load has stopped increasing" or anything like that, just trying to identify why AF load growth might be slowed in comparison with hardware capability.

Deferred rendering or a Z-only pass will reduce overdraw, but they barely affect the percentage of pixels needing AF. Remember that for DR you fetch the textures during the forward rendering part.
Screen pixel count won't be reduced, no, but overdraw in forward renderers is still in the region of 5x, that's quite a difference.

Jawed
 
You're being disingenuous here. The GTS is barely any slower than the 2900XT in that same graph. CPU limitations are starting to kick in for the GTX (as seen here, and it's also evident in the minimal improvement of the Ultra over the GTX in your graph).
I'm not being disingenuous, I'm trying to find evidence one way or the other. I think it's pretty close to proving your point, which is why I said "This is getting there." But I have some doubt, because looking across the cards/resolutions/settings I see bandwidth playing a factor which makes me wonder if fillrate is dominating.

I wanted your opinion because I don't know what SS2 is doing, except that it has a reputation for being very hard on texturing hardware. e.g. what kind of shadowing (if any) is it doing?

Compare 7950GT (600/1430) and X1900XT 256MB (625/1450):

http://www.ixbt.com/video/itogi-video/0607/itogi-video-ss1-wxp-1600-pcie.html

7950GT should be ~44% faster if bilinear texturing rate were dominant here. Trouble is, G7x texturing isn't a bastion of efficiency, and R580's out of order texturing may be making a huge difference (even though this is supposedly "DX8" type code) so this datapoint may be moot :cry:
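
The 44% figure, sketched with assumed unit counts (24 bilinear TMUs on G71, 16 on R580) and the listed clocks:

```python
# Theoretical bilinear texel rates for the two cards compared above.
g71_texels_per_s  = 24 * 600e6    # 7950GT: 24 TMUs (assumed) at 600 MHz -> 14.4 Gtexels/s
r580_texels_per_s = 16 * 625e6    # X1900XT: 16 TMUs (assumed) at 625 MHz -> 10.0 Gtexels/s

print(f"7950GT advantage: {g71_texels_per_s / r580_texels_per_s - 1:.0%}")   # ~44%
```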

Too many factors here to pin it on texturing/fillrate. The math deficit of G84 vs. RV630 is huge compared to G80 vs R600, and don't tell me games aren't using enough math. G84 can do only 4.3 scalar instructions per tex fetch.
So you're saying that G84's lacklustre performance advantage over RV630 is because it has too much texture filtering/addressing/fetching? Maybe the ALU pipeline clocks in G84 are way lower than NVidia was hoping to get. Or maybe the core clock just goes much higher than they were expecting.

I was under the impression that G84's co-issued MUL works much better than G80's. Maybe that's a driver problem yet to be fixed, it is still pretty new after all.

The single texturing test in every iteration of 3DMark is a measure of alpha-blended fillrate without Z, and G80 reaches its theoretical rate. In the multitexturing test G80 also reaches its theoretical texel rate.
I see there's some debate on the single-textured test, oh well... I don't understand why the RightMark test doesn't show any sign of the TA:TF ratio as number of textures increases - unless each texture has independent coordinates.

NVidia uses the same bus widths per TU-quad as ATI, but adds a little more math for the INT8 filtering and a little more logic for locating a related second quad of texels for trilinear, aniso, or volume textures (ATI has to do these anyway, but can afford to do them at half the rate per TU-quad). The math is relatively tiny, and the hardest part is altering the texture cache to fetch twice as many half-width values.
I presume you meant INT16 filtering.

G80 has 136% more theoretical trilinear/AF/fp16 capability than G71. Out of order threading should provide 30-50% more performance. So G80 is in the region of being theoretically 200% more capable than G71. What was the matter with 2x faster? Why go through the pain of shoving NVIO off die? 3x the TF when you're aiming to construct a 2x faster GPU is just bizarre.
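
Rough combination of those two estimates (the trilinear ratio below assumes 64 TF units at 575 MHz for G80 against 24 TMUs at 650 MHz for G71, both at half rate for trilinear):

```python
# Combining the raw trilinear/fp16 capability ratio with the assumed gain
# from out-of-order threading.
tf_ratio = (64 * 575e6) / (24 * 650e6)           # ~2.36x, i.e. "136% more" (assumed unit counts)
for threading_gain in (1.30, 1.50):
    print(f"{tf_ratio * threading_gain:.2f}x")   # ~3.07x .. 3.54x, roughly "200% more capable"
```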

As it happens, I think G80 is basically the way it is through force of circumstance - the way the architecture's granularity is organised there isn't much choice.

ATI built vast amounts of bandwidth for no apparent reason, too. ~2x too much, if they were aiming for 8800GTS performance. The notable spikes beyond 8800GTS's performance hint at something else, but oh well.

Every so often I wonder if R6xx is wasteful of bandwidth (well I think it must be, 4xAA per loop should be in there :LOL: ). Again, something I'd like to see analysed, but the drivers...

Jawed
 
ATI built vast amounts of bandwidth for no apparent reason, too. ~2x too much, if they were aiming for 8800GTS performance. The notable spikes beyond 8800GTS's performance hint at something else, but oh well.


Good thing they weren't aiming for 8800GTS performance then ;)
 
I see there's some debate on the single-textured test, oh well... I don't understand why the RightMark test doesn't show any sign of the TA:TF ratio as number of textures increases - unless each texture has independent coordinates.
What are you expecting to see, double speed bilinear when you use the same coordinates for two different textures? That's not how it works. You need to calculate texel addresses and filter coefficients per texture fetch.
 
The ratio of SF-pipe to MAD-pipe instructions doesn't change with warp size. If you don't have enough MAD instructions in a 16-object group, you don't have enough for the 32-object group either.
This is analogous to SF being like an out of order texture fetch, which only lasts 4 or 8 clocks. To hide that latency, the MAD pipe can only issue instructions from the same batch/warp. If you only have scalar instructions to co-issue against the SF latency, then you're more likely to run out of MAD instruction clocks if you only have 16 objects in flight. It's a question of how tight the dependencies are within the code, e.g. once the SF result is returned for 16 of the 32 objects in a warp, the succeeding instructions for those objects become available to hide the latency of the SF for the second set of 16 objects.

There's no difference between scalar and vector here, so I don't see your point. Whether it's a vec4 MADD or 4 scalar MADDs being run on 16 objects, the 8 ALUs still need to read 12 blocks of 512-bit operand data in 48 clocks.
What I said about addressing rate is a load of bull, best ignore it :oops:

No you don't, because the amount of latency hiding necessary is determined primarily by the number of texture units you have. ALU latency is what, 8 clocks? A G80 cluster, with 16 ALUs, only needs 128 pixels in flight. 4 texture lookups per clock, however, needs ~800 pixels in flight, assuming 200 clock latency and the ability to immediately use the texture's value without penalty. NVidia can easily double the SIMD width without doubling the register file. (They probably will, though, because 64KB of register file split among 800 threads is only 20 FP32 scalars.)
G80 supports 768 pixels in flight, 24 warps of 32 each. At absolute minimum, those 24 warps provide 96 ALU clocks of latency hiding. There's 32KB of register file per multiprocessor (SIMD), with 2 of them per cluster.

CUDA Programming Guide said:
Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.

The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 concurrent threads per multiprocessor to hide them.
192 threads is 6 warps of 32, which amounts to 24 ALU clocks at 4 clocks per warp.
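
Spelling those numbers out (warp size and issue clocks as described above; the 768-thread and 192-thread figures are from the G80/CUDA material quoted):

```python
# Latency-hiding arithmetic for one G80 multiprocessor.
warp_size       = 32
clocks_per_warp = 4                               # one instruction issued over 4 ALU clocks

max_threads = 768
max_warps   = max_threads // warp_size            # 24 warps
print(max_warps, max_warps * clocks_per_warp)     # 24 warps -> 96 ALU clocks of cover

raw_threads = 192                                 # CUDA guide's threshold for RAW hazards
raw_warps   = raw_threads // warp_size            # 6 warps
print(raw_warps, raw_warps * clocks_per_warp)     # 6 warps -> 24 ALU clocks
```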

You're not making sense, Jawed. You're saying that four 8-wide SIMD arrays per cluster operating on 32-object warps is cheaper than two 16-wide SIMD arrays operating on 64-object warps? How? The latter is waaay cheaper. Half the warps to juggle. Half the scheduling logic. Etc, etc...
Sorry, I should never have said "4-clock instructions in the ALU pipe are even cheaper to implement, I would say... " because it's irrelevant in this context.

Yes, 4x SIMDs per cluster is very expensive, much more so than going from 3x512-bit buses to 3x1024-bit buses. But I think NVidia intends to stick with 32-object warps and I expect batches to be 32-objects, too. When talking about CUDA, they place a hell of a lot of emphasis on 32 for the future.

Having thought about it some more, I don't think the bus width is going to be a roadblock.

I don't know where 3x1024-bit vs. 640-bit comes from either. You're saying R600 only needs a 32-bit RF bus per ALU, but a G80 with double width SIMD arrays needs a 192-bit bus per ALU. Complete nonsense.
I was just making a stab in the right direction, as I don't think it's a single 7680-bit bus for a 16-SIMD (80 ALU pipes). I think it must be segmented in some way, like the bus to video memory is segmented. Also R600 uses cache for re-ordering RF and CB operands, so that has to be able to provide the full 3 operands per clock for the ALUs. Segmented into quarters, that's 1920 bits feeding the input of the 20 ALUs for a "quad".

See if you can untangle this patent application:

SIMD processor and addressing method

which seems to describe a 2-ported RF. I'm puzzled why you'd want memory accessible on byte boundaries, perhaps this memory is used like this for video data.

R600 only seems to have RF and CB as operand sources. Though vertex attributes, per pixel, seem to be stored somewhere. So maybe that's a third distinct source.

Jawed
 
I was thinking of this:

http://www.computerbase.de/bild/article/648/123/

and presuming that as resolution goes up that central "null" area (not the black, but the area that's "under-filtered") just grows. Yes, it's a small part of the screen.
Okay, but that area doesn't grow at all with resolution. You're talking about surfaces that form less than a 3.6 degree angle with the camera, and hence have more than 16:1 anisotropy. The fraction of pixels in this region stays constant with resolution.

Screen pixel count won't be reduced, no, but overdraw in forward renderers is still in the region of 5x, that's quite a difference.
True, but the percentage of pixels needing AF doesn't really change. If it really was 5x from the ROP's point of view, then people would use a Z-only pass all the time. In actuality, rough front to back sorting is enough to really chop that down.

Most importantly, deferred rendering still has a forward rendering component at the beginning of each frame. You have to fill the various full screen buffers that serve as the input for your deferred lighting pass. It is here that you fetch your textures for base, specular, etc. so the total number of cycles spent on AF isn't reduced. In fact, it will probably increase since the lighting math is deferred, and the AF penalty will be fully exposed instead of being partially hidden in traditional renderers.
 
In fact, it will probably increase since the lighting math is deferred, and the AF penalty will be fully exposed instead of being partially hidden in traditional renderers.
We need to add zprepasses to deferred renderers! ;)
 
We need to add zprepasses to deferred renderers! ;)
Or simply create better algorithms! *coughs at nAo to try to remind him of something* (oh, and making stencil-based branching less retarded. kthxbye! although I'm still pretty sure it's possible to get it working for more original stuff somehow, hmmm)
 
This is analogous to SF being like an out of order texture fetch, which only lasts 4 or 8 clocks. To hide that latency, the MAD pipe can only issue instructions from the same batch/warp.
Dealing with the SF is more of an issue with throughput rather than latency, isn't it? Per 8 ALUs, you only get two SF results per clock. The ALUs have 8 clock latency too, don't they? Anyway, this seems like a very minor consideration to me, focussed too narrowly on worst case scenarios.

G80 supports 768 pixels in flight, 24 warps of 32 each. At absolute minimum, those 24 warps provide 96 ALU clocks of latency hiding. There's 32KB of register file per multiprocessor (SIMD), with 2 of them per cluster.

192 threads is 6 warps of 32, which amounts to 24 ALU clocks at 4 clocks per warp.
Yup, I know that. ALU clocks of latency hiding isn't particularly relevant, though. You're trying to saturate the texture unit, which fetches values for 4 pixels each clock. With a ~200 clock latency, you need a minimum of ~800 threads per cluster to avoid stalls. That's a lot more than the 192x2 needed to avoid those register limitations, which shouldn't change with a double-width SIMD array. The only thing that will change is the number of threads needed in flight to saturate the ALUs, and that's still a lot less than ~800.

Fewer texture units on ATI hardware is something that should allow it to get away with a much smaller register file than NVidia, but they seem to go the other way. Why on earth does R580 handle 512 batches of 48 pixels? That's enough for 1536 cycles between texture lookups! For each TMU-quad, G80 has a max of 48 warps while R580 has 128 batches, with the latter bigger than the former to boot.
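
The in-flight arithmetic behind that comparison, with assumed unit counts (16 texture fetches per clock on R580, one TMU-quad plus two multiprocessors per G80 cluster):

```python
# In-flight thread counts behind the R580 vs G80 comparison above.
r580_pixels = 512 * 48                  # 512 batches of 48 pixels
r580_fetches_per_clk = 16               # assumed
print(r580_pixels // r580_fetches_per_clk, "cycles between texture lookups")   # 1536

r580_tmu_quads = r580_fetches_per_clk // 4
print(512 // r580_tmu_quads, "batches per TMU-quad on R580")                   # 128

g80_warps_per_mp    = 24                # 768 threads / 32-wide warps
g80_mps_per_cluster = 2                 # one TMU-quad per cluster (assumed)
print(g80_warps_per_mp * g80_mps_per_cluster, "warps per TMU-quad on G80")     # 48
```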
 