R5xx Pipeline Transistor Counts

Jawed

b3d77.gif

Previously we've had an estimate of 2M transistors per ALU pipeline (including its associated register file) for R5xx GPUs.

With RV570, it seems to me we can now get an estimate of the number of transistors in the TMU and ROP pipelines, jointly: about 8M transistors per TMU+ROP pipeline.

That then leaves a baseline of about 168M transistors for the remainder of the die: fixed function pipelines, vertex shader pipelines, memory controller, ring bus, PCI Express interface, AVIVO, etc.
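A minimal sketch of how those three figures fall out, assuming a simple linear model (total = ALU pipes x ALU cost + TMU/ROP pipes x TMU+ROP cost + baseline) and using the totals and pipeline counts quoted in the tables in this thread. The function and variable names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# chip: (total transistors in millions, ALU pipelines, TMU+ROP pipelines)
# Values as quoted in the tables in this thread.
CHIPS = {
    "RV570": (330, 36, 12),
    "R580": (384, 48, 16),
    "R520": (321, 16, 16),
}


def solve_per_pipeline(chips=CHIPS):
    """Solve the assumed 3-unknown linear model by elimination."""
    t_rv570, a_rv570, p_rv570 = chips["RV570"]
    t_r580, a_r580, p_r580 = chips["R580"]
    t_r520, a_r520, p_r520 = chips["R520"]

    # R580 and R520 differ only in ALU count, which isolates the ALU cost.
    alu = Fraction(t_r580 - t_r520, a_r580 - a_r520)
    # R580 and RV570 differ in both counts; remove the ALU part first.
    tex_rop = (Fraction(t_r580 - t_rv570) - alu * (a_r580 - a_rv570)) / (
        p_r580 - p_rv570
    )
    # Whatever is left over on R580 is the baseline.
    base = t_r580 - alu * a_r580 - tex_rop * p_r580
    return alu, tex_rop, base


alu, tex_rop, base = solve_per_pipeline()
print(f"ALU pipeline:     {float(alu):.2f}M")      # 1.97M
print(f"TMU+ROP pipeline: {float(tex_rop):.2f}M")  # 7.59M
print(f"Baseline:         {float(base):.2f}M")     # 168.00M
```

With only three chips and three unknowns the fit is exact by construction, so it says nothing about how good the linear assumption is.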

There's got to be some error in here, as I can't account for the setup engine architecture. The setup engine, being between VS and PS, could scale depending upon the number of PS pipelines (16 in R520 and R580, 12 in RV570). There'll be other errors, I'm sure, simply due to various scaling factors and revisions over the lifetime of R5xx.

One notable revision is the Fetch 4 hardware in the TMUs, something that's missing from R520.

Jawed
 
Very interesting exercise!

I copied your approach and came up with the same results. Then I extended the table further:

Code:
M/unit          2.03    4.40    11.8    12.4

        Trans   ALU     TEX+ROP VS      MC      Fixed (M trans)
RV570   330     32      12      8       8       20
R580    384     48      16      8       8       20
R520    321     17.6    16      8       8       20
RV530   157     13.2    3.6     5       3.6     10
RV515   100     4.8     3.2     2       3.4     10

For RV530 and RV515, I decreased the size of TEX+ROP due to smaller caches. Similar for the MC.
For the chips with fewer pixel shaders, I increased the effective pixel shader count to account for inefficiencies.

My cost function is the squared relative difference between the modelled and actual number of transistors, with a weighting factor of 0.6 for the first 3 chips (because I don't want 3 quite similar entries to overwhelm their smaller brothers).
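A minimal sketch of that cost function, using the per-unit costs and effective unit counts from the table above. The weight of 1.0 for the two small chips, the plain sum over chips, and all names are assumptions for illustration; the post only states the 0.6 factor.

```python
# Per-unit costs (M transistors) for ALU, TEX+ROP, VS, MC, from the table above.
PER_UNIT = (2.03, 4.40, 11.8, 12.4)

# chip: (actual total, effective units (ALU, TEX+ROP, VS, MC), fixed, weight)
ROWS = {
    "RV570": (330, (32, 12, 8, 8), 20, 0.6),
    "R580": (384, (48, 16, 8, 8), 20, 0.6),
    "R520": (321, (17.6, 16, 8, 8), 20, 0.6),
    "RV530": (157, (13.2, 3.6, 5, 3.6), 10, 1.0),  # weight 1.0 is assumed
    "RV515": (100, (4.8, 3.2, 2, 3.4), 10, 1.0),   # weight 1.0 is assumed
}


def predicted(units, fixed, per_unit=PER_UNIT):
    """Modelled transistor count (millions) for one chip."""
    return sum(n * c for n, c in zip(units, per_unit)) + fixed


def cost(rows=ROWS, per_unit=PER_UNIT):
    """Weighted sum of squared relative errors over all chips."""
    total = 0.0
    for actual, units, fixed, weight in rows.values():
        rel = (predicted(units, fixed, per_unit) - actual) / actual
        total += weight * rel ** 2
    return total


for name, (actual, units, fixed, _) in ROWS.items():
    print(f"{name}: actual {actual}M, model {predicted(units, fixed):.1f}M")
print("cost:", cost())
```

Minimising `cost` over `PER_UNIT` (and the effective unit counts) is what "playing with the parameters" amounts to; the sketch only evaluates it at the values in the table.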

The problem is that, no matter what parameters I play with, I always end up with humongous sizes for the vertex shaders and with MCs that seem smallish...

Edit: I also added a Fixed column.
 
I considered including RV515 and RV530, but:
  1. RV530 has double-rate Z ROPs
  2. RV515 has no ring bus
  3. I can't think of a way to account for their VS pipes, 2 and 5 respectively
From the die photo of R520:

die.jpg

it's estimated that the memory controller is 10% of the die. I interpret that as excluding the ring stops, which logically should be placed local to each of the four quad pixel pipes. At least the central MC looks "8-way" which is sorta handy!

So, here's my revision:

Code:
M/unit          1.97    7.59    1.5     4

        Trans   ALU     TEX+ROP VS      MC      Fixed
RV570   330     36      12      8       8       124
R580    384     48      16      8       8       124
R520    321     16      16      8       8       124
RV530   157     12      4       5       4       79
RV515   100     4       4       2       3       47

I've sized MC to be ~10% of R520's die, guesstimated VS at < PS ALU (it should be way less, I suspect) and then mopped up the rest of the fixed-functionality as a big heap of die :!:

So, clearly it's still wrong, but it doesn't look terrible I hope.
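A quick sanity check of the revised table, sketched under the assumption that each row is meant to add up as units x per-unit cost + Fixed. The names are hypothetical; the numbers are copied from the table.

```python
# Per-unit costs in millions of transistors, from the revised table.
PER_UNIT = {"ALU": 1.97, "TEX+ROP": 7.59, "VS": 1.5, "MC": 4.0}

# chip: (quoted total, ALU, TEX+ROP, VS, MC unit counts, Fixed)
TABLE = {
    "RV570": (330, 36, 12, 8, 8, 124),
    "R580": (384, 48, 16, 8, 8, 124),
    "R520": (321, 16, 16, 8, 8, 124),
    "RV530": (157, 12, 4, 5, 4, 79),
    "RV515": (100, 4, 4, 2, 3, 47),
}


def row_total(alu, tex_rop, vs, mc, fixed):
    """Sum one row of the table back into a transistor count (millions)."""
    return (
        alu * PER_UNIT["ALU"]
        + tex_rop * PER_UNIT["TEX+ROP"]
        + vs * PER_UNIT["VS"]
        + mc * PER_UNIT["MC"]
        + fixed
    )


def mc_share(chip):
    """Fraction of the quoted total taken by the memory controller."""
    total, _, _, _, mc, _ = TABLE[chip]
    return mc * PER_UNIT["MC"] / total


for chip, (total, *counts) in TABLE.items():
    print(f"{chip}: quoted {total}M, rebuilt {row_total(*counts):.1f}M")
print(f"R520 MC share: {mc_share('R520'):.1%}")
```

Every row rebuilds to within about 1M of the quoted total, and the R520 memory controller lands at roughly 10% as intended.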

Jawed
 
Armed with this faux knowledge:
Code:
M/unit          1.97    7.59    1.5     4
        Trans   ALU     TEX+ROP VS      MC      Fixed
Units           96      24      0       16
R600    685     189     182     0       64      250

Let's not kid ourselves. We all knew that this was the intention right from the start! :D
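For anyone wanting to redo the sum, a minimal sketch of how the R600 row comes out, assuming the R5xx per-unit costs carry over unchanged and that Fixed is simply whatever remains of the 685M total. The unit counts and the total are the ones in the table; the names are hypothetical.

```python
# Per-unit costs (M transistors) for ALU, TEX+ROP, VS, MC, taken from the R5xx fit.
PER_UNIT = (1.97, 7.59, 1.5, 4.0)

# Unit counts and total used for R600 in the table above.
R600_UNITS = (96, 24, 0, 16)
R600_TOTAL = 685


def breakdown(units, total, per_unit=PER_UNIT):
    """Return the per-block transistor counts and the leftover 'Fixed' amount."""
    parts = [n * c for n, c in zip(units, per_unit)]
    fixed = total - sum(parts)
    return parts, fixed


parts, fixed = breakdown(R600_UNITS, R600_TOTAL)
for label, value in zip(("ALU", "TEX+ROP", "VS", "MC"), parts):
    print(f"{label}: {value:.0f}M")
print(f"Fixed: {fixed:.0f}M")
```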
 
Ha, well actually I was thinking about process-to-process scalings and the effect on potential die size and got diverted when I realised that RV570 solved an "unknown".

I can't see much of it translating to R600. Prolly only the MC can :!: :LOL: and it'll end up about 10% of the die again hahahaha...

ALU architecture is prolly radically different (some form of scalar?) and there's integer in there too. The TMUs have extra formats to filter, and the ROPs have new formats. etc.

Overall it's pretty striking that the TMUs and ROPs combined utterly dominate the die in comparison with the ALUs. 4x bigger per pipeline is really a hell of a lot. I dare say that in terms of shared functionality (address generation, fetching, caching, blending, writing), TMUs and ROPs are very similar in complexity and so fairly equal in size. As for what differentiates them (filtering, z/stencil testing), I can't think how to get close to splitting them up.

But it seems reasonable to think that a TMU and a ROP each have ~2x the transistor count per pipeline of an ALU.
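The arithmetic behind the "4x" and "~2x" figures, as a small sketch. The even split between TMU and ROP is the assumption made above, not something the data can confirm.

```python
# Per-pipeline estimates in millions of transistors, from the revised table.
ALU = 1.97
TEX_ROP = 7.59

# One TMU+ROP pair relative to one ALU pipeline.
combined_ratio = TEX_ROP / ALU

# Assumption: the TMU and the ROP each take half of the joint figure.
each = TEX_ROP / 2
each_ratio = each / ALU

print(f"TMU+ROP vs ALU: {combined_ratio:.2f}x")
print(f"TMU or ROP alone: {each:.2f}M, {each_ratio:.2f}x an ALU")
```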

It's worth pointing out that the "ALUs + Registers" count doesn't include instruction issue, decode, load-balancing, and other fiddly things to do with the operation of the PS pipelines (ALU + TMU). So the Fixed count includes a considerable amount of pixel shader hardware. Once all these gubbins have been included the total count of transistors per ALU in R580 might increase by 50%... who knows.

The other thing that's missing is pipeline redundancy. For example, R580 might have 13 ALU pipelines for every 12 (i.e. one spare) and 1 spare pipeline for every 4 of both TMU and ROP.

So, overall, a fun diversion.

Jawed
 