AMD: R7xx Speculation

Nite_Hawk · Jun 21, 2008

Hi Guys,

I just got back from Best Buy; I got in on the VisionTek 25% off sale and picked up a 4850 for $150.

I'm going to install it now, so if there are any tests people want run, let me know. I don't have any recent games, so only tests you can provide for free, please.

Nite_Hawk

Jawed · Jun 21, 2008

Mintmaster said:
The problem with that type of layout is the crazy data movement that needs to happen across blocks. Think about swizzling and dot products and channel replication.

The killer is that T can only get register operands from the other ALUs. Literals, constants and previous results are also possible of course.

Maybe the 55nm process is mature enough that they just skipped a lot of the redundancy, or limited it to register files and parts of the shader core instead of its entirety.

At a cost of 6% of the ALUs or 2.4% of the die, it seems doubtful. If the dark bits are logic, there's a hell of a lot of logic in each SIMD to go wrong.

I think some of those T-labelled regions are texture units. I don't see why the TU quads wouldn't show up on the die with a nice regular pattern.

I certainly won't discount this, as it was my original guess when the "finger-and-thumb" die shot appeared - 10 regular-looking things that are TUs are desperately needed.

There's L1, vertex data cache and GDS to account for too. L1 could be 32KB, vertex data cache the same.

Each SIMD should have 256KB of register file (unless they've changed that too), i.e. 4 lots of 64KB, implying that each block in each corner is 16KB. But there's still a big question mark over multi-porting - are there really 4 physical banks per logical bank? If so the L1 caches and other SIMD memory should look piddly...
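For anyone following the arithmetic, here is a minimal sketch of the register-file split using only the figures in the paragraph above. The 256KB total is itself a guess carried over from R600, and reading the die shot as four corner blocks per 64KB lot is an assumption, not a confirmed layout.

```python
# Register-file sizing per SIMD, using only the figures quoted in the post.
KB = 1024

simd_register_file = 256 * KB  # per-SIMD total (assumed unchanged from R600)
logical_banks = 4              # "4 lots of 64KB"
blocks_per_bank = 4            # one block in each corner (assumed reading of the die shot)

bank_size = simd_register_file // logical_banks
block_size = bank_size // blocks_per_bank

print(f"logical bank: {bank_size // KB} KB")
print(f"corner block: {block_size // KB} KB")
```

If each logical bank really is built from 4 physical banks for multi-porting, the 16KB corner blocks would be those physical banks, which is the open question above.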

I have to say, right now I'm leaning towards your interpretation. EDIT: [strike]There's nothing to say that 20 MADs can't have 1 ALU lane of redundancy[/strike]

Jawed

Jawed · Jun 21, 2008

Recall said:
Guys, I am in the market for a new GPU. I currently have an 8800GTS 640MB; looking at benchmarks, the 4850 does not represent a big enough boost for me.

With the superior core speed and gigantic memory clocks, can we expect 20-30% increase in performance with the 4870 over the 4850?

With 16xAF and 4xMSAA I'm hoping for way more than 30%. The Extreme preset in Vantage is rumoured to be 60% faster. Can't believe we don't have any more benchmarks.

Also, are ATI now using hardware resolve for processing AA, or is it still done through shaders?

They've changed so much, we're still guessing.

Anyway, no point in rushing...

Jawed

hoom · Jun 21, 2008

I think more like this:

Blue = 5SP block (1 spare?)
Red = LDS/L1?
Pink = TMU quad
Green = Scheduler/Dispatch

silent_guy · Jun 21, 2008

CarstenS said:
No you didn't. But until quite recently, one was supposed to think that every kind of crossbar was old-fashioned compared to a modern concept like ring-bus.

Yes, that was a great marketing story but it's really incorrect. If you can pull it off wrt layout, for similar performance, a crossbar is much better in terms of performance per area than a ring bus, assuming an evenly balanced load. With a ring bus, you have to go out of your way architecturally to avoid severe stalling or even deadlock conditions, typically by overdesigning it. Crossbars have no such problems.

With the area required for a crossbar increasing in a linear fashion while chip area is on a quadratic path, the layout area required to create channels on a chip to place the wires is really a non-issue these days.

As for its impact on overall GPU performance: an interconnect should never be something that determines performance. Once you're able to feed the agents at their maximum capacity, it doesn't matter whether you do so with a crossbar or anything else. It's reasonable to assume that neither ATI nor Nvidia has ever been so stupid as to underdesign the bandwidth of their interconnects. That's why that whole hoopla about the existence of the ring bus has always been baffling to me: at the end of the day, it has zero impact on how well your GPU will work. It's not going to make FP calculations any faster...

3dilettante said:
What is it then, if not a crossbar, a switch fabric, ring bus?

FWIW, at my job, we'd use 'crossbar' and 'switch fabric' very often interchangeably to describe the same thing. I'd say a switch fabric is at the core of a crossbar.

kyetech · Jun 21, 2008

pjbliverpool said:
But how do you know Xenos is v.good in the context of those PC GPUs? We don't actually know how it performs compared to those others. At best we can say it probably performs a bit better than a 500MHz 128-bit G71 with 8 ROPs. Hardly earth shattering.

I happen to think it was good in the context of consoles... Not just in terms of performance, but also in terms of functionality.

Look at Gears of War 2 and remind yourself this chip is v.good for its size, power budget and timeframe.

3dcgi · Jun 21, 2008

pjbliverpool said:
But how do you know Xenos is v.good in the context of those PC GPUs? We don't actually know how it performs compared to those others. At best we can say it probably performs a bit better than a 500MHz 128-bit G71 with 8 ROPs. Hardly earth shattering compared to R600, especially when you consider that R600 also supports DX10, which takes a lot of transistors.

This has always been the problem I've had with the general consensus with regards to Xenos. On paper, it looks incredible. But on paper, R600 looks earth-shatteringly incredible! If we didn't actually get to see R600 benchmarks in the cold light of day, especially in comparison to G80, then we would probably still be happily toddling along thinking it's the best thing since sliced bread.

We've never seen any Xenos benchmarks though.

Judging performance of Xenos from any PC part is a waste of time. There were even more changes from Xenos to R600 than there were from R600 to RV770.

leoneazzurro · Jun 21, 2008

silent_guy said:
Yes, that was a great marketing story but it's really incorrect. If you can pull it off wrt layout, for similar performance, a crossbar is much better in terms of performance per area than a ring bus, assuming an evenly balanced load. With a ring bus, you have to go out of your way architecturally to avoid severe stalling or even deadlock conditions, typically by overdesigning it. Crossbars have no such problems.

With the area required for a crossbar increasing in a linear fashion while chip area is on a quadratic path, the layout area required to create channels on a chip to place the wires is really a non-issue these days.

AFAIK a crossbar does not scale linearly with number of units, while a bus can (not necessarily a "ring" one). For all other considerations, I agree.
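A toy count of switching elements shows the scaling difference being argued here. This is only a sketch: it models a full N x N crossbar and a single unidirectional ring, ignores wire width, layout channels and arbitration, and the function names are made up for illustration.

```python
def crossbar_crosspoints(agents: int) -> int:
    """Switch points in a full N x N crossbar: one per (source, destination) pair."""
    return agents * agents


def ring_segments(agents: int) -> int:
    """Point-to-point links in a single unidirectional ring: one per stop."""
    return agents


for n in (4, 8, 16, 32):
    print(f"{n:>2} agents: crossbar {crossbar_crosspoints(n):>4} crosspoints, "
          f"ring {ring_segments(n):>2} segments")
```

Doubling the number of agents quadruples the crosspoint count but only doubles the ring segments. Whether that matters in practice depends on how much of the die the interconnect occupies in the first place, which is the other side of the argument above.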

mczak · Jun 21, 2008

Jawed said:
The killer is that T can only get register operands from the other ALUs. Literals, constants and previous results are also possible of course.

I must have missed this, why does T not have its own register file? And if that's the case, wouldn't that mean it has to be very close to the other units?
Maybe what you labeled T-Mad could be texture filter (16 fp16 "bilerp units") which would fit the mostly logic look of this area (with the area to the right of it being texture address, texture fetch including L1 cache, and at the left the sequencer - not much storage there though and a huge amount of logic...).
Or not...

Mariner · Jun 21, 2008

OpenGL guy said:
The 24xCFAA (and 12xCFAA mode for 4xAA) mode doesn't cause any extra blurriness at all. Great pains are taken to make sure that only the edges are filtered.

Performance of the CFAA modes on 48xx parts should be surprising

Erm, unless I've missed this somewhere has anybody with one of the 4850s actually tested CFAA performance after this obvious hint?

Comparison with standard AA and versus 3870 performance would be interesting. :smile:

CarstenS · Jun 21, 2008

Mariner said:
Erm, unless I've missed this somewhere has anybody with one of the 4850s actually tested CFAA performance after this obvious hint?

Comparison with standard AA and versus 3870 performance would be interesting. :smile:

Yeah, that'd be very interesting. Unfortunately, most reviewers were seemingly caught by surprise to see the Performance-NDA lift on the third day (about five days earlier than was communicated initially) after being briefed about the new products...

igg · Jun 21, 2008

The card is scheduled for a July launch: Source.

It would be awesome if that's true.

Tchock · Jun 21, 2008

Mariner said:
Erm, unless I've missed this somewhere has anybody with one of the 4850s actually tested CFAA performance after this obvious hint?

Comparison with standard AA and versus 3870 performance would be interesting. :smile:

One post in Lowyat.net (think of it as an even cruder VR-Zone. Yes, sites like these actually exist.)

ben3003 on LYN said:
My newest crysis score:

Run #1- DX9 1280x1024 AA=No AA, 32 bit test, Quality: High ~~ Overall Average FPS: 46.42
Run #2- DX9 1280x1024 AA=4x, 32 bit test, Quality: High ~~ Overall Average FPS: 23

He later said that these were the ED (edge-detect) scores; box-filter 4xAA scores over 30 FPS.
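To put those numbers in perspective, a quick sketch of the relative AA hit. It uses only the two posted runs plus the "over 30 FPS" remark, treated here as a lower bound of exactly 30, so the box-filter figure is an upper limit on the loss rather than a measurement.

```python
def aa_cost(no_aa_fps: float, aa_fps: float) -> float:
    """Fraction of frame rate lost when AA is enabled."""
    return 1.0 - aa_fps / no_aa_fps


no_aa = 46.42     # Run #1, no AA
aa_4x = 23.0      # Run #2, 4xAA as posted
box_floor = 30.0  # "over 30 FPS", taken as a lower bound (assumption)

print(f"4xAA as posted: {aa_cost(no_aa, aa_4x):.1%} slower than no AA")
print(f"box 4xAA: at most {aa_cost(no_aa, box_floor):.1%} slower than no AA")
```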

pjbliverpool · Jun 21, 2008

kyetech said:
I happen to think it was good in the context of consoles... Not just in terms of performance, but also in terms of functionality.

Look at Gears of War 2 and remind yourself this chip is v.good for its size, power budget and timeframe.

It's a great chip compared to RSX, that's for sure (at least in terms of its design and functionality), but that's more because RSX was pretty poor for its timeframe.

What I mean is, Xenos is clearly a great design on paper, and it also comes packed with great functionality but the same can be said of R600. We mark R600's "greatness" down because it didn't perform as well as we expected. I'm just not seeing why we should assume Xenos is a superior implementation of the architecture when R600 came second and had time to learn from and refine the Xenos design.

E.g. in terms of overall efficiency of the implementation it looks like:

R600 -> R670 -> R770

That also matches the timing of their releases, which is to be expected as each evolved from the one before. Xenos performance is an unknown, but timing-wise it does slot into the above picture before R600, so if we're going to make assumptions about its efficiency it seems more sensible that those assumptions fit into the above picture. Assuming Xenos is as efficient an implementation of that basic architecture as R770 seems a bit baseless to me. More likely it's an implementation as efficient as R600, or less so.

CarstenS · Jun 21, 2008

pjbliverpool said:
I'm just not seeing why we should assume Xenos is a superior implementation of the architecture when R600 came second and had time to learn from and refine the Xenos design.

IMO: Greatness only derives from comparison with the alternatives.

Skinner · Jun 21, 2008

igg said:
The card is scheduled for a July launch: Source.

It would be awesome if that's true.

I wonder if it only has a 512MB framebuffer?

Jawed · Jun 21, 2008

hoom said:
I think more like this:

Blue = 5SP block (1 spare?)
Red = LDS/L1?
Pink = TMU quad
Green = Scheduler/Dispatch

That's clever!

I like that, particularly the solution to redundancy.

Jawed

Jawed · Jun 21, 2008

mczak said:
I must have missed this, why does T not have its own register file?

The register file is vec4, I guess. As far as I can tell the register file is banked into 1KB sections, each section being 1 vec4 register * 64 elements (64 * 16 bytes).

If you download the CAL SDK you can see a hell of a lot of detail about R600 from the ISA document.
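A minimal sketch of the bank arithmetic above, for reference. The 16 bytes per register is read as four 32-bit components, and the 256KB per-SIMD total is the guess from earlier in the thread, so treat both as assumptions rather than anything confirmed for RV770.

```python
BYTES_PER_COMPONENT = 4    # one 32-bit value (assumed from "16 bytes" per vec4)
COMPONENTS = 4             # vec4: x, y, z, w
ELEMENTS_PER_SECTION = 64  # "64 elements"

register_bytes = BYTES_PER_COMPONENT * COMPONENTS       # one vec4 for one element
section_bytes = register_bytes * ELEMENTS_PER_SECTION   # one banked section

simd_register_file = 256 * 1024  # per-SIMD guess from earlier in the thread
sections_per_simd = simd_register_file // section_bytes

print(f"section size: {section_bytes} bytes")
print(f"sections per SIMD: {sections_per_simd}")
```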

And if that's the case, wouldn't that mean it has to be very close to the other units?
Maybe what you labeled T-Mad could be texture filter (16 fp16 "bilerp units") which would fit the mostly logic look of this area (with the area to the right of it being texture address, texture fetch including L1 cache, and at the left the sequencer - not much storage there though and a huge amount of logic...).
Or not...

Anyway, I'm abandoning my theory, I like Hoom's theory very much.

Jawed

MfA · Jun 21, 2008

Skinner said:
I wonder if it only has a 512MB framebuffer?

If they ditched AFR much more of the aggregate memory would be available than on the old X2 cards, so a per chip 512 MB buffer wouldn't be so bad.

If AMD against all odds has ditched AFR they should dust off the FASN8 motherboard ... nothing could stand against them in the benchmarks then, nothing could even get close.

Tchock · Jun 21, 2008

MfA said:
If they ditched AFR much more of the aggregate memory would be available than on the old X2 cards, so a per chip 512 MB buffer wouldn't be so bad.

If AMD against all odds has ditched AFR they should dust off the FASN8 motherboard ... nothing could stand against them in the benchmarks then, nothing could even get close.

Wait... what about 2x GT200 and SmackOver? The latter would propel the combo to something better than FASN8 I suppose.

That was a conclusion made too soon.
