R520 die size and transistor count redux

geo said:
Yeah, that memory controller on 130nm wouldn't have left much for it to control, would it?
Some of the centre must be the scheduler, I reckon. The scheduler and the memory controller both need to share information on task timing and priority.

I look at that and think ATI does have their NV30. . .from an architectural-shift point of view, one that will continue to impact their chips for years to come. Only theirs is at least competitive, and sometimes superior, to the competition immediately - it's a better baseline, so to speak.
R300 you mean :oops:

Though I would counter that slightly, and say that there appears to be a continuum of architectural change spread over R520, RV530, R580 and R600.

Jawed
 
Kaotik said:
Could it be that the current XTs draw so much power to somehow work around the "sucky" core revision? The way I understood it, all the current cores are the ones that got stalled and "fixed on the fly" or something
In theory ATI should have a few "completely fixed" XT cores available - just not enough to release to the world's shops. It's only a few weeks away from the shops and some of that time is just transportation.

Jawed
 
Jawed said:
Some of the centre must be the scheduler, I reckon. The scheduler and the memory controller both need to share information on task timing and priority.

R300 you mean :oops:

No, I meant what I said - it's just that you can't mention the-chip-we-do-not-name-other-than-to-have-a-bashfest without the emotional reaction of all that baggage. And as I was careful to point out directly, I don't mean the performance aspects. And as I didn't point out, but maybe should have, I also didn't mean all that other stuff people get their knickers in a knot over re that chip.

Let me try it this way - it's a much more successful major technology transition than the one NV attempted with NV30, with some of the same "goods" in mind (FP32, pipeline flexibility) but implemented far more effectively. There, does that help to turn off the flashing red lights? :LOL:
 
Tim said:
The core should use (2.0^2)/(1.8^2)*625/500 = 1.54 times more power; with twice the RAM for the XT, the difference is not completely out of line. Anyway, 2.0V (and even 1.8V) seems a bit extreme for a 90nm chip - is it the TSMC process that requires high voltages, or is ATI still having troubles reaching the needed clock speed? (I would expect something like 1.4-1.6V for a 90nm chip.)

The RAM requires 1.8V or 2.0V, not the chip itself.
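
For reference, that figure is just the usual dynamic-power scaling, P ~ V^2 * f. A quick sanity check (ballpark only - it ignores leakage and any process differences, and uses the voltages and clocks quoted above):

Code:
v_old, f_old = 1.8, 500e6   # volts, Hz - baseline quoted above
v_new, f_new = 2.0, 625e6   # volts, Hz - the XT figures quoted above

# Dynamic power scales with voltage squared times frequency.
ratio = (v_new / v_old) ** 2 * (f_new / f_old)
print(f"core power ratio ~ {ratio:.2f}x")   # ~1.54x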
 
My guess is

A = fragment shaders (the small boxes are the four groups of ALUs)
B = Texture Unit or ROPs (depends on which you would put nearer to memory)
C = no idea, Rasterizer or HZ related
D = ROPs or Texture Unit
E = no idea, Rasterizer or HZ related

Where are the missing vertex shaders? The 8 of them should be about as big as eight of those small boxes in A. One option is that the vertex shaders are also distributed per quad core (but that would require some kind of bus to dispatch vertices to all the other quad cores).

Where is the fourth E? Scattered around C in the bottom right section?

What is that zone to the left of centre that hasn't been marked yet? No idea.

Then there is the video-related stuff (the DAC, the video decoding processor and so on) and the PCIe interface. Likely all that is located in the bottom right corner.
 
Geo, not really - R300's precision and pipeline/tile flexibility is more than a match for the wayward posturing of NV30. It's a huge step beyond R200, made with care and foresight.

Why construct a GPU capable of running hundreds of instructions in a shader, for example, when the GPU isn't fast enough? Why build FP32 precision when the longest practical shaders show no merit beyond FP24? etc.

There are key reasons why NV30 was a disaster - it pushed in all the wrong directions. NV30 is an exemplar of one thing only, how to get caught up in features to the detriment of viability. Too ambitious, by far, for its own good.

I don't see NVidia's superscalar-with-knobs-on (or cut-down VLIW, if you prefer) architecture lasting beyond the first iteration of DX10. As the pressure to decouple texturing from shading grows (otherwise more and more shader pipelines stall from a deficit in texture bandwidth), the case for a "wide" pipeline in which an ALU also serves as the texture address processor diminishes. Once that goes, I don't see the wide pipeline lasting. The risk of idle ALUs (or other units) grows, because shader programs fit less and less into the neat little packages that existed back in the DX8 days.
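
To make the decoupling argument concrete, here's a toy utilisation model (illustrative numbers only - this is not any real NV or ATI pipeline). If one ALU in the wide pipe doubles as the texture address processor, every texture op steals an ALU issue slot:

Code:
# Hypothetical shader: 6 ALU ops and 2 texture ops per pixel.
alu_ops, tex_ops = 6, 2

# Coupled: the texture address is issued through a shared ALU slot,
# so each texture op costs a cycle of ALU issue.
coupled_cycles = alu_ops + tex_ops

# Decoupled: a separate address unit handles texturing, so the ALUs
# only ever see ALU work (assuming enough batches to hide latency).
decoupled_cycles = max(alu_ops, tex_ops)

print(f"coupled:   {coupled_cycles} cycles, ALU utilisation {alu_ops / coupled_cycles:.0%}")
print(f"decoupled: {decoupled_cycles} cycles, ALU utilisation {alu_ops / decoupled_cycles:.0%}")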

There's nothing novel or difficult about G70 architecturally that couldn't have been implemented back in NV30. You have to ask why it took 3 years to get there.

Jawed
 
Jawed said:
I agree that "texture units are decoupled" in the sense of batch scheduling - but it seems that each shader quad "owns" a texture quad.

A die shot of RV530 would be very much appreciated, as there is no ownership of texture quads by shader quads...

Jawed

That is unfortunate. A unified block of 4 quad texture samplers would be able to do 2x AF with trilinear in one cycle.
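
The arithmetic behind that, as I see it: 2x AF takes two probes per pixel, trilinear needs two bilinear mip samples per probe, and a quad is four pixels:

Code:
aniso_taps     = 2   # 2x AF -> 2 probes per pixel
mip_samples    = 2   # trilinear -> 2 bilinear samples per probe
pixels_in_quad = 4

bilinear_per_cycle = aniso_taps * mip_samples * pixels_in_quad   # 16
quad_samplers      = bilinear_per_cycle // 4                     # 4 quad units
print(quad_samplers, "quad samplers cover one quad in a single cycle")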

v
 
vb said:
That is unfortunate. A unified block of 4 quad texture samplers would be able to do 2x AF with trilinear in one cycle.
And a unified block like that would be less efficient than four separate quad texture samplers. Since your texture sampling performance in general is going to be limited by memory bandwidth these days, there's not all that much incentive to add many more than we have in today's architectures.
 
Jawed said:
Geo, not really - R300's precision and pipeline/tile flexibility is more than a match for the wayward posturing of NV30.

You say tomayto, I say tomahto. The R300 still carries around the "phase"d baggage from the R200 era, and you can see it in the dependent lookup limitation. NV3x, although a failure, gave NV the valuable experience they needed to make NV4x. Sometimes you make a 1.0 version of something that contains a lot of risky decisions, then you use the experience from that to find out what works and what doesn't, and make a much better 2.0 version.

Why construct a GPU capable of running hundreds of instructions in a shader, for example, when the GPU isn't fast enough? Why build FP32 precision when the longest practical shaders show no merit beyond FP24? etc.

Because NVidia was also trying to go after the DCC market. They know a 4000-instruction shader can't be used in games, but it can be used in offline rendering - e.g. Gelato. Rather than design a separate "Quadro" core, they added cheap support for long shaders.



There's nothing novel or difficult about G70 architecturally that couldn't have been implemented back in NV30. You have to ask why it took 3 years to get there.

And nothing in the R520 scheduler architecture could not have been implemented 3 years ago either. Why did it take us three years to get here? Because ATI needed the experience it gained over the last 3 years of optimizing the R2xx/R3xx architectures and the Xenos project to arrive where it is today. Just as ATI will use its Xenos experience on the XB360 to gain insights into how to make a better Longhorn chip too.

The G70 is a safe, mature tweak of the architectural line that started with the NV4x, just as the R420 was a safe tweak. I think the R520 is relatively safe as well. I think both ATI and NVidia are working on more radical architectures behind the scenes, and this G70 vs R520 business is just a sideshow.

I think the expectation of top-down miracle designs which work with 100% success out of the box is a little naive. In most cases, feedback from the successes and failures of real implementations is needed. Design in the real world needs bottom-up inputs, which cannot always be had by whiteboarding and lab simulations.
 
It's interesting that RV515, somewhere in the vicinity of a 9700 Pro in terms of performance, is "too small" to utilise all elements of R520's architecture - i.e. no ring bus.

I'm not sure if that means it would have also been "too small" back in the 150nm days - since the transistor count for RV515 is in the same ballpark as R300.

But it's interesting to ponder.

But anyway, that's not the scheduler, which is what you're talking about.

ATI's patents on this kind of scheduler go back to 2003 (so the work goes back further), so it could be a case of choosing the right moment to implement it. Maybe it wasn't seen as a performance priority back then.

R400 was canned for some reason or other...

I see the scheduler as the key to fully decoupling the shader and texture pipes. Which has a knock-on effect of providing for dynamic branching.

Jawed
 
Chalnoth said:
And a unified block like that would be less efficient than four separate quad texture samplers. Since your texture sampling performance in general is going to be limited by memory bandwidth these days, there's not all that much incentive to add many more than we have in today's architectures.

But in R520, considering a 1:1 PS-to-texture-op ratio, most of the time the texture units would be idle. Having them all work on a quad when they actually get some work should provide some benefit. It is strange that RV530 and R580 might be getting that option when they won't need it.
 
vb said:
But in R520, considering a 1:1 PS-to-texture-op ratio, most of the time the texture units would be idle. Having them all work on a quad when they actually get some work should provide some benefit. It is strange that RV530 and R580 might be getting that option when they won't need it.
This is the detail about R520 scheduling that I find hard to understand.

Xenos seemingly can assign texture work to the texture pipes "as a group", independently of the shaders executing in the shader pipelines.

But it's not clear whether R520 can or can't. The implication is that once a batch requires texturing, the texture instruction will be fired off independently, and the batch will then keep on executing ALU instructions until the first instruction that's dependent on the texture result.

At that point the scheduler should issue another batch for the shader pipelines (one quad).

But the big question is whether R520 can search the batch queue for batches that require texturing while the ALU pipeline is occupied with a shader that isn't running texture instructions.

I'm getting really wound up not knowing this. I have a nasty feeling R520's scheduler/architecture is quite a bit dumber than Xenos in this respect. The "texture pipes owned by shader pipes" concept sounds bad to me :cry:
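
Just to pin down the behaviour I'm describing, here's a toy sketch (hypothetical names and latencies - nothing from ATI): a batch runs until the first instruction that depends on an outstanding texture fetch, then parks so another batch can be issued.

Code:
from collections import deque

# Toy batch-switching model. "TEX" starts a fetch, "DEP" is the first
# instruction that needs the fetch's result.
def run(batches, tex_latency=8):
    ready, waiting, cycle = deque(batches), [], 0
    while ready or waiting:
        for b in list(waiting):              # wake batches whose fetch landed
            if cycle >= b["ready_at"]:
                waiting.remove(b)
                ready.append(b)
        if not ready:
            cycle += 1                       # nothing schedulable: ALUs idle
            continue
        b = ready.popleft()
        while b["pc"] < len(b["ops"]):
            op = b["ops"][b["pc"]]
            if op == "DEP" and cycle < b.get("ready_at", 0):
                waiting.append(b)            # park; resume at this op later
                break
            b["pc"] += 1
            cycle += 1
            if op == "TEX":
                b["ready_at"] = cycle + tex_latency   # fetch now in flight
    return cycle

def make_batch():
    return {"ops": ["TEX", "ALU", "DEP", "ALU"], "pc": 0}

print(run([make_batch()]), "cycles, 1 batch (ALUs stall on the fetch)")
print(run([make_batch() for _ in range(4)]), "cycles, 4 batches (latency mostly hidden)")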

Jawed
 
vb said:
But in R520, considering a 1:1 PS-to-texture-op ratio, most of the time the texture units would be idle. Having them all work on a quad when they actually get some work should provide some benefit. It is strange that RV530 and R580 might be getting that option when they won't need it.
No. Just as the pixel shader units achieve high utilization through multithreading, so do the texture units.
 
Jawed said:
But it's not clear whether R520 can or can't. The implication is that once a batch requires texturing, the texture instruction will be fired off independently, and the batch will then keep on executing ALU instructions until the first instruction that's dependent on the texture result.
There'd be no point to multithreading if this were all that the R520 could do, and it wouldn't explain the performance benefit the R520 is showing in many games.

The primary purpose in tying the texture units to specific quad pipelines would be to reduce the amount of traffic that has to be ferried across the chip.

I'm sure that while the texture units are getting some texture fetches done for one thread, the pixel shader units can be operating on something else entirely: pixels that have already done the texture read and are ready to move forward in processing, pixels that still need to execute a few instructions before the texture read, or pixels executing an entirely different pixel shader but grouped together by ATI's tile-based rendering. That's the only thing that can explain both the low branching penalty and the performance increase in current Direct3D games.
 
I was kind of wondering how the Xenos handles transporting texture data back to the shader arrays. If 16 pixels are operated on at the same time, then it seems to me that a texture block should be able to deliver at least 16*32 bits of filtered results to a shader array (or 16*64 if filtering 4 channel fp16 textures is to cause no performance penalty). The address bandwidth is even murkier to me - where does the texture address calculation take place? In the shader array, or in the texture units?
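
Putting rough numbers on that (my arithmetic; the 500MHz core clock is assumed purely for illustration):

Code:
pixels_per_cycle = 16
clock_hz = 500e6   # assumed clock, for illustration only

for name, bits in (("32-bit filtered result", 32),
                   ("fp16 4-channel result", 64)):
    bits_per_cycle = pixels_per_cycle * bits
    gbytes_per_s = bits_per_cycle / 8 * clock_hz / 1e9
    print(f"{name}: {bits_per_cycle} bits/cycle, ~{gbytes_per_s:.0f} GB/s internal")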
 
Chalnoth, I'm asking the question out loud. I agree it wouldn't make sense for R520 to be "dumbed down". The thing is, I haven't seen any convincing evidence that R520's TMUs are more freely schedulable.

X1600XT performance, in particular, just seems way off beam to me. It's hurting in seemingly texture-heavy scenarios far more than it should, in my view.

But the thing is, I can't find any detailed analysis, so it's just a waiting game... I haven't really got any evidence one way or another...

And what's puzzling me is ATI has had 10 months to sort the drivers out against real hardware.

Jawed
 
Jawed said:
X1600XT performance, in particular, just seems way off beam to me. It's hurting in seemingly texture-heavy scenarios far more than it should, in my view.
Bear in mind that the X1600XT only has four texture units. If it were a traditional architecture, you'd call it a four-pipeline architecture with three full ALUs per pipeline.

I think from that perspective it is a very efficient piece of hardware (provided you're running in Direct3D...).
 
psurge said:
I was kind of wondering how the Xenos handles transporting texture data back to the shader arrays. If 16 pixels are operated on at the same time, then it seems to me that a texture block should be able to deliver at least 16*32 bits of filtered results to a shader array (or 16*64 if filtering 4 channel fp16 textures is to cause no performance penalty). The address bandwidth is even murkier to me - where does the texture address calculation take place? In the shader array, or in the texture units?
Presuming you really do mean Xenos, not R520: I imagine it's much like R520, in fact. The texture pipes can send data directly to the register array.

In order for this to work, the batch that wants that texture data needs to be in context, so the scheduler would have to time the batch's execution so that it's ready to receive, too. So there might be a buffer on the texture pipes' output to soak up the unknowable latency of the texture operation.

Xenos's texture pipes have their own address calculation ALUs - i.e. they're decoupled in a similar way to those in R520.
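
If it helps, the buffering idea might look something like this (a minimal sketch with hypothetical names - not anything from ATI's documentation): results queue up as fetches complete, then drain into the register array once the consuming batch is back in context.

Code:
from collections import deque

result_fifo = deque()   # soaks up the variable texture latency

def on_fetch_complete(batch_id, texels):
    # Called whenever memory delivers, at an unpredictable time.
    result_fifo.append((batch_id, texels))

def on_batch_in_context(batch_id, register_array):
    # Drain only this batch's results straight into its registers.
    for entry in list(result_fifo):
        if entry[0] == batch_id:
            result_fifo.remove(entry)
            register_array.extend(entry[1])

on_fetch_complete(7, [0.25, 0.50])   # fetch for batch 7 lands early
regs = []
on_batch_in_context(7, regs)         # batch 7 is swapped back in
print(regs)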

[attached image: 012l.jpg]


Jawed
 