Upcoming ATI Radeon GPUs (45/40nm)

That usually only increases the size of the textures, not how many texture lookups are done per shader. In other words, an increase in texture bandwidth usage.
The texture lookups don't change, but the number of bilinear fetches can change because higher res textures mean fewer pixels have magnification. Trilinear and aniso affect more pixels.
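To make the magnification point concrete, here is a toy sketch in Python. It assumes a simplified isotropic LOD (log2 of texels per pixel along one axis), trilinear filtering with no aniso, and one bilinear fetch per mip level sampled. Real hardware LOD selection is more involved, and the function names are hypothetical.

```python
import math

def mip_lod(texels_across: float, pixels_across: float) -> float:
    """Simplified isotropic LOD: log2 of texels per pixel along one axis (assumption)."""
    return math.log2(texels_across / pixels_across)

def bilinear_fetches(texels_across: float, pixels_across: float) -> int:
    """Bilinear fetches per lookup with trilinear filtering and no aniso (toy model)."""
    if mip_lod(texels_across, pixels_across) <= 0.0:
        # Magnification: only the top mip is sampled, one bilinear fetch.
        return 1
    # Minification: trilinear blends two adjacent mips, one bilinear fetch each.
    return 2

# One surface drawn at several on-screen widths (in pixels).
spans = [128, 256, 512, 768, 1024, 2048]
for texture_width in (512, 1024):
    fetches = [bilinear_fetches(texture_width, p) for p in spans]
    print(texture_width, fetches, sum(fetches))
```

With the higher-resolution texture, two more of the six on-screen sizes fall into minification, so the number of lookups stays the same while the number of bilinear fetches rises.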
 
http://en.hardspell.com/doc/showcont.asp?news_id=3768

What I can ascertain is 12 units, whereas RV770 has 10.

48 TMUs, therefore keeping the ratio.

140mm2 should be small enough to make a part under 75W.

My guesses would be 1.25 TFLOPS (652 MHz) and 1.5 TFLOPS (782 MHz) parts.
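The arithmetic behind those guesses can be checked with a short sketch. It assumes "12 units" means 12 RV770-style SIMDs, each with 16 five-wide units plus 4 TMUs, and counts a multiply-add as two FLOPs. chip_specs is a hypothetical helper, not anything published by AMD.

```python
# Assumed RV770-style building block: one SIMD = 16 five-wide VLIW units + 4 TMUs.
LANES_PER_SIMD = 16 * 5
TMUS_PER_SIMD = 4
FLOPS_PER_LANE_PER_CLOCK = 2  # a multiply-add counted as two FLOPs

def chip_specs(simds: int, core_mhz: float) -> dict:
    """Peak ALU/TMU figures for a hypothetical chip built from `simds` SIMDs."""
    lanes = simds * LANES_PER_SIMD
    tmus = simds * TMUS_PER_SIMD
    gflops = lanes * FLOPS_PER_LANE_PER_CLOCK * core_mhz / 1000.0
    return {"alu_lanes": lanes, "tmus": tmus,
            "alu_per_tmu": lanes / tmus, "gflops": gflops}

print(chip_specs(10, 625))  # RV770 as clocked on the HD 4850
for mhz in (652, 782):      # the guessed 12-SIMD part
    print(mhz, chip_specs(12, mhz))
```

Under those assumptions, 12 SIMDs give 960 ALU lanes and 48 TMUs, the same 20:1 ALU:TMU ratio as RV770, and roughly 1.25 and 1.5 TFLOPS at 652 and 782 MHz.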

Is it conceivable that RV740 is this part, RV870 is a dual-core part, and R800 is a dual-package dual-core part? *thinks back to ATi saying octo-fire is in the works...*

Methinks a dual-core x4 or a quad-core x2 setup at the high end of support seems possible. Also, if Nvidia's next part is indeed the 3 TFLOPS monster listed in this article... God save AMD if they want a true halo product to compete with it without doing so.

If such a situation came to be (dual-core R870 vs. 2xGT200 = GT300), the R800 would seemingly be worth its $300-400 with 1GB, a 384-SP GT300 its $499 with 1792MB, and a full 480-SP part its $599+ with 2GB... This generation's price structure, only making a heck of a lot more sense from a price/performance perspective.

I just keep thinking that with AMD switching to MCM for the octo- and 12-core CPUs, and ATi wanting to simplify their product stack, a single-core, dual-core, and quadfire-on-a-stick line-up would make sense. Given the die area, I imagine power could be kept under 300W for a dual dual-core part. I'd think: a single-core product with no power connector for the low end with 512MB, a single-core product with a power connector for the mid-range with 512MB, a dual-core (2x75W) with two 6-pin connectors for the performance segment with 1GB, and a 2x2 MCM quad-core package with one 6-pin and one 8-pin connector for the extreme market with 2GB...
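Those connector tiers map onto the PCI Express board power limits (75 W from the x16 slot, 75 W per 6-pin and 150 W per 8-pin connector). A minimal sketch of that budget follows; the line-up names are just labels for the hypothetical cards described above, and giving the mid-range card a single 6-pin connector is an assumption.

```python
# PCI Express power sources (W): x16 slot, 6-pin and 8-pin auxiliary connectors.
SLOT_WATTS = 75
CONNECTOR_WATTS = {"6-pin": 75, "8-pin": 150}

def board_power_ceiling(connectors: list[str]) -> int:
    """Maximum in-spec board power for a given set of auxiliary connectors."""
    return SLOT_WATTS + sum(CONNECTOR_WATTS[c] for c in connectors)

# Hypothetical line-up; one 6-pin for the mid-range card is an assumption.
lineup = {
    "low-end single core (no connector)": [],
    "mid-range single core": ["6-pin"],
    "performance dual core": ["6-pin", "6-pin"],
    "extreme 2x2 MCM quad core": ["6-pin", "8-pin"],
}
for card, connectors in lineup.items():
    print(f"{card}: up to {board_power_ceiling(connectors)} W")
```

That gives ceilings of 75, 150, 225 and 300 W, so the 6-pin plus 8-pin quad-core card would top out right at the 300W figure.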

Perhaps I'm overly optimistic, but a 6 TFLOPS part sure would be tits, and I'd be damned if Nvidia could do anything to compete with it at 55nm. Granted, if they shrank the rumored next-gen Nvidia design to 40nm and made a GX2-style part, you'd be right back where you started with a dual-core versus that single-core monster... except no longer at 1/2 the price... give or take this, that, and the other thing. ;)
 
Thanks 3d, you've explained this far better than I.



Yes! Perfect summary explanation.

Wouldn't that change once the new mid-range/low-end parts arrive? You still have to make sure that at least the mid-range has a balanced mix that makes the most of every mm^2. If all RV770 did was scale up while keeping that ratio intact, RV730 will show that clearly (even better if it had 320 ALUs to boot).

EDIT: Oops, this post is rather redundant; it's all been said already. Must have missed a few pages, darn.
 
If there was going to be a change within the 5-wide vector units - which way would be the one to go?

I guess 5 seems to be the magic number if you want to avoid a severe reorganization of the control logic, since otherwise you'd have to break vectors apart and feed them serially into the ALUs (which is why Nvidia has such relatively high control overhead).

Wouldn't it be feasible then to widen them to 7? That would enable maximum flexibility with a wide range of vector-size pairings: 5+2, 4+3, 3+3, 3+2+2, 2+2+2, and of course filling the remaining slots with scalars or issuing seven independent scalars side by side.
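One way to compare bundle widths is to enumerate which combinations of vector ops fit into a single bundle. The sketch below is a toy packing check only (it ignores register ports, dependencies and the transcendental unit), and it uses the vector sizes from the pairings listed above, including 5, even though shader vectors normally top out at 4 components. packings is a hypothetical helper.

```python
from itertools import combinations_with_replacement

def packings(width: int, sizes=(2, 3, 4, 5)):
    """Every multiset of vector ops (sizes taken from `sizes`) that fits in one bundle.

    Returns (vector_sizes, spare_lanes) pairs; spare lanes could take scalars.
    """
    ordered = sorted(sizes, reverse=True)
    max_ops = width // min(sizes)  # most vectors that could possibly fit
    result = []
    for n in range(1, max_ops + 1):
        for combo in combinations_with_replacement(ordered, n):
            used = sum(combo)
            if used <= width:
                result.append((combo, width - used))
    return result

for width in (5, 7):
    multi = [c for c, _ in packings(width) if len(c) > 1]
    print(width, multi)
```

A 5-wide bundle only admits 3+2 and 2+2 as multi-vector pairings, while a 7-wide one admits all five combinations listed in the post plus a few more.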
 
If we take a SIMD in isolation then, yes, TEX and ALU instructions will execute simultaneously. However if the batch contains instructions that are primarily texture heavy then the ALUs in that SIMD will see lower utilization over the course of executing that batch to completion; if the batch is ALU heavy then that SIMD may experience lower utilization of the texture units.
I'm pretty sure you need almost all batches in flight at a given point in time to become texture heavy for this to happen. Single batches will just average out.

The same thing goes for shaders with texture heavy parts and math heavy parts. The overall ratio is all that matters because there are enough batches in flight to statistically even this out.
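The averaging argument can be illustrated with a toy simulation. It assumes each in-flight batch is independently in a TEX-heavy phase with some fixed probability; real batches from the same draw call are correlated, which is exactly the case where all of them could turn texture-heavy at once. The parameter values are arbitrary.

```python
import random
import statistics

def tex_heavy_share(batches_in_flight: int, steps: int = 2000,
                    p_tex_heavy: float = 0.4, seed: int = 1):
    """Mean and spread of the fraction of in-flight batches that are TEX-heavy.

    Toy model (assumption): at every step each batch is independently in a
    TEX-heavy phase with probability p_tex_heavy, otherwise ALU-heavy.
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(steps):
        heavy = sum(rng.random() < p_tex_heavy for _ in range(batches_in_flight))
        shares.append(heavy / batches_in_flight)
    return statistics.mean(shares), statistics.pstdev(shares)

for n in (1, 4, 16, 64):
    mean, spread = tex_heavy_share(n)
    print(f"{n:3d} batches in flight: mean {mean:.2f}, spread {spread:.3f}")
```

The mean share stays put while its spread shrinks roughly as one over the square root of the number of batches in flight, which is why only the overall ratio matters when batches are independent.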
 
I think you're oversimplifying this, Wavey. R420 did address R300's relative lack of shading compute power, so in a sense it was a "correction". A better example of another correction would be R520->R580 addressing the same lack of shading power. Given the fact that ATi has addressed a shortcoming in shading compute power twice, I can understand why they shifted to a high ALU:TEX philosophy. Unfortunately, having been burned twice they over-corrected, which caused the relative lack of texturing ability in R6xx. R7xx is a correction of the failed R6xx design philosophy. I know you can't admit this as a representative of ATi, but there's no use denying it either.
Shaidar, there is no correction here. I bet that RV770 will show equal or lower full-frame percent utilization in the TEX and ALU than RV670.

RV770 is just an example of superior design over the previous-gen product. The ROPs are similarly sized yet 2x faster with AA. The memory controller gets higher utilization. The texture units are smaller with the same perf. The ALUs are smaller with the same perf.

Aside from setup and color fill speed w/o AA, almost everything in RV770 became twice as fast per clock as in RV670. It's really quite silly to say texturing speed is the main reason that it's faster. If RV770 was 350 mm2, it really wouldn't be impressive because it's basically just RV670 doubled (except for no-AA fillrate and triangle setup).

BTW, R420 has twice the texturing power of R300 as well. It did not cause a "relative lack of texturing".
 
If there was going to be a change within the 5-wide vector units - which way would be the one to go?

I guess 5 seems to be the magic number if you want to avoid a severe reorganization of the control logic, since otherwise you'd have to break vectors apart and feed them serially into the ALUs (which is why Nvidia has such relatively high control overhead).
Did you by any chance read my suggestion and its subsequent discussion?

I go into a bit more detail here:
http://forum.beyond3d.com/showpost.php?p=1180701&postcount=4545
 
Shaidar, there is no correction here. I bet that RV770 will show equal or lower full-frame percent utilization in the TEX and ALU than RV670.

If I'm wrong, so be it. I'd like to see results from the thread I started before I'll admit to such though ;)

RV770 is just an example of superior design over the previous-gen product. The ROPs are similarly sized yet 2x faster with AA. The memory controller gets higher utilization. The texture units are smaller with the same perf. The ALUs are smaller with the same perf.

I've never said anything to contradict this, at least not IMO. I still believe it has overcome bottlenecks in the previous architecture.

Aside from setup and color fill speed w/o AA, almost everything in RV770 became twice as fast per clock as in RV670. It's really quite silly to say texturing speed is the main reason that it's faster. If RV770 was 350 mm2, it really wouldn't be impressive because it's basically just RV670 doubled (except for no-AA fillrate and triangle setup).

How is it silly to say the increase in tex rate is not the primary contributor to perf. increases? I said in non-AA-rate-bound scenarios, and non shadow-bound scenarios, tex perf. was the bottleneck. Given just how slow the AA perf. on R6xx was (perf. decreases of 50%+ were not uncommon @ > 2xAA), many R6xx owners simply went without. Shadows rarely seem to be a primary bottleneck in games (other than D3-engine, of course), so while the Z-rate increases are welcome, I don't think they're as commonly useful as the increased tex rate.

BTW, R420 has twice the texturing power of R300 as well. It did not cause a "relative lack of texturing".

When did I say there was a relative lack of texturing power in R300? I said a relative lack of shading power (meaning the shader core was the primary bottleneck in most scenarios).
 
If I'm wrong, so be it. I'd like to see results from the thread I started before I'll admit to such though ;)
Just what are you asking for there? We don't have access to internal hardware registers. We don't have a specific breakdown of workload in all games. We know that we can write texture limited shaders and RV770 will be 2.5x the speed, but that also applies to ALU limited shaders. What result would validate your claims?

How is it silly to say the increase in tex rate is not the primary contributor to perf. increases?
Not at all, because everything increased in performance.
I said in non-AA-rate-bound scenarios, and non shadow-bound scenarios, tex perf. was the bottleneck.
Really? Then show me one game benchmark that is non-AA-rate-bound, non shadow-bound, and has RV770 performing well over twice as fast as RV670.

Without AA we're seeing the 4850 be 30% faster than the RV670. If we assume that the 4850 is no faster than the 3870 in all non-texturing-limited situations combined, then that means the 3870 is texture limited only 38% of the time. That's less than half of the render time, and if we make a more realistic assumption that the 4850 is faster at the other stuff overall (due to faster real-world color and z fillrate, and way faster math speed), then that figure drops even more.
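The 38% figure comes from an Amdahl-style split of frame time. A minimal sketch, assuming the texture-limited portion speeds up by one fixed factor and everything else by another. The 38% result is reproduced when texturing is taken to be 2.5x faster, the figure quoted earlier in the thread for texture-limited shaders, and the output is sensitive to that input; texture_limited_fraction is a hypothetical helper.

```python
def texture_limited_fraction(overall: float, tex_speedup: float,
                             other_speedup: float = 1.0) -> float:
    """Fraction f of the old frame time that was texture limited, solving
    1/overall = f/tex_speedup + (1 - f)/other_speedup for f."""
    new_time = 1.0 / overall  # new frame time relative to the old one
    return (1.0 / other_speedup - new_time) / (1.0 / other_speedup - 1.0 / tex_speedup)

print(round(texture_limited_fraction(1.3, 2.5), 3))       # everything else no faster
print(round(texture_limited_fraction(1.3, 2.5, 1.2), 3))  # everything else 20% faster (illustrative)
```

With everything else no faster, about 38% of the old frame time was texture limited; if everything else is assumed to be 20% faster, that drops to roughly 15%.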

Shadows rarely seem to be a primary bottleneck in games (other than D3-engine, of course)
It doesn't have to be the primary bottleneck to still be significant, and possibly more so than texturing.
so while the Z-rate increases are welcome, I don't think they're as commonly useful as the increased tex rate.
ROP redesign (Z-rate and MSAA resolve) is almost entirely responsible for RV770's much improved performance in games with AA enabled. I don't know why you keep mentioning no-AA scenarios, because given the specs RV770 isn't outperforming RV670 by much.

When did I say there was a relative lack of texturing power in R300? I said a relative lack of shading power (meaning the shader core was the primary bottleneck in most scenarios).
You said R300->R420 increased compute power, and again in R520->R580, and that doing so twice led them to overcorrect. R300->R420 was just a doubling in everything except BW.
 
How is it silly to say the increase in tex rate is not the primary contributor to perf. increases? I said in non-AA-rate-bound scenarios, and non shadow-bound scenarios, tex perf. was the bottleneck. Given just how slow the AA perf. on R6xx was (perf. decreases of 50%+ were not uncommon @ > 2xAA), many R6xx owners simply went without. Shadows rarely seem to be a primary bottleneck in games (other than D3-engine, of course), so while the Z-rate increases are welcome, I don't think they're as commonly useful as the increased tex rate.

Let me suggest a simple test then:
- Choose a handful of games you suspect to be TEX bound
- Make sure the game settings result in a fully VGA-bound situation
- Test twice - once with bilinear texture filtering, once with 16xAF

If there's a serious difference between the two results in more than 1-2 games, then you're right.
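The proposed test boils down to comparing two frame rates per game. A small sketch of the bookkeeping, where the 10% cut-off for a "serious difference" and the "more than 2 games" rule are assumptions read from the wording above, and evaluate_af_test is a hypothetical helper:

```python
def af_slowdown(fps_bilinear: float, fps_16xaf: float) -> float:
    """Fractional frame-rate loss when going from bilinear to 16xAF."""
    return 1.0 - fps_16xaf / fps_bilinear

def evaluate_af_test(results: dict, threshold: float = 0.10, tolerated: int = 2):
    """results maps game name -> (fps with bilinear, fps with 16xAF).

    Returns (claim_supported, games_with_serious_drop). A "serious difference"
    is assumed to mean a loss above `threshold`; both knobs are assumptions.
    """
    flagged = sorted(game for game, (bi, af) in results.items()
                     if af_slowdown(bi, af) > threshold)
    return len(flagged) > tolerated, flagged
```

Feed it frame rates measured at GPU-bound settings with bilinear filtering and with 16xAF; the flagged list names the games whose drop exceeds the threshold.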
 
Just what are you asking for there? We don't have access to internal hardware registers. We don't have a specific breakdown of workload in all games. We know that we can write texture limited shaders and RV770 will be 2.5x the speed, but that also applies to ALU limited shaders. What result would validate your claims?

I don't know what results to expect. If I did, I would've simply provided them already.

Not at all, because everything increased in performance.

Now I can't help but think you're being obtuse about this...

Really? Then show me one game benchmark that is non-AA-rate-bound,

Easy - don't enable AA and it won't become a bottleneck ;)

non shadow-bound, and has RV770 performing well over twice as fast as RV670.

Were I able to do so, there would be no need for the thread I have created.

Without AA we're seeing the 4850 be 30% faster than the RV670. If we assume that the 4850 is no faster than the 3870 in all non-texturing-limited situations combined, then that means the 3870 is texture limited only 38% of the time. That's less than half of the render time, and if we make a more realistic assumption that the 4850 is faster at the other stuff overall (due to faster real-world color and z fillrate,

A bottleneck that is present 38% of the time is hugely significant, given just how many potential bottlenecks there are in RT graphics rendering. I find your omission of percentages of other bottlenecks to be rather curious, given this fact, and believe your analysis to be disingenuous in this regard.

and way faster math speed), then that figure drops even more.

Incorrect. RV770 does not have "faster math speed", it has more math resources working in parallel. This is an important distinction. Faster math speed would be a result of higher clock rates or various uarch enhancements which speed the return of results from the shader core.

It doesn't have to be the primary bottleneck to still be significant, and possibly more so than texturing.

Pardon? If bottleneck A is a bottleneck for less time than bottleneck B, would that not make bottleneck A less significant?

Feel free to show me another bottleneck that accounts for > 38% and I'll shut my big fat mouth :D

ROP redesign (Z-rate and MSAA resolve) is almost entirely responsible for RV770's much improved performance in games with AA enabled. I don't know why you keep mentioning no-AA scenarios, because given the specs RV770 isn't outperforming RV670 by much.

I mention no-AA because AA is not a primary factor in gaming (and R6xx performs so poorly with it). It's an IQ enhancer. If you lack AA capability you can simply choose not to use it. The same can't be said for texturing capabilities.

You said R300->R420 increased compute power, and again in R520->R580, and that doing so twice led them to overcorrect. R300->R420 was just a doubling in everything except BW.

R300 was primarily math-bound. R420 more than doubled math power, resulting in a chip which was bottlenecked by its math rate less often than the previous chip. Does that clear things up?

On a personal note, I'd just like to thank everyone that has participated in this discussion so far. I'm enjoying this immensely, and I really hope to get to the bottom of the matter.
 
Let me suggest a simple test then:
- Choose a handful of games you suspect to be TEX bound
- Make sure the game settings result in a fully VGA-bound situation
- Test twice - once with bilinear texture filtering, once with 16xAF

If there's a serious difference between the two results in more than 1-2 games, then you're right.

Sounds like a reasonable test case to me. Would you mind posting that in the other thread?
 
When you have no AA, of course the bottleneck shifts more to texturing. That said, 38% is by no means doom. I'm under the assumption that RV770 is benefiting more from the much improved Z/depth rate in the RBEs than anything else.
 
When you have no AA, of course the bottleneck shifts more to texturing. That said, 38% is by no means doom. I'm under the assumption that RV770 is benefiting more from the much improved Z/depth rate in the RBEs than anything else.

Stop. Think about this for a second. If your AA perf is so low as to be unusable, what is the point of discussing the fact that it is a bottleneck?
 