AMD: R7xx Speculation

MfA · Jun 16, 2008

Jawed said:
NVidia appears to have spent a lot on CUDA-related functionality

On the other hand ATI could focus on improving what was there, since they already did DP and scatter before ... let's just wait for some benchmarks.

Jawed · Jun 16, 2008

Arun said:
That's called marketing...

The patent document seems to imply that the output from these disparate units can be returned to the "cluster" (register file, presumably) - i.e. the staging is "real".

Just like the 'Integer' path in the GT200 diagrams is pure marketing spin, at least according to John Nickolls.

Haven't a clue who he is.

Exact same thing as in, uhhh, Voodoo Graphics, Riva 128, Rage, GeForce 256, Radeon, [...], R300, R520, NV30, NV40, G70, ...

Even G70 doesn't use the same hardware for vertex data fetches as for texture filtering, does it?

If anything, as I said many times already, I very heavily suspect that there is a significant amount of sharing between the addressing & filtering parts on all DX10 NVIDIA chips, except maybe G80.

Kinda disappointed you guys haven't wheedled this stuff out...

Yes, it seems they finally decided they could no longer afford to have an incredibly inefficient TMU architecture that made neither theoretical nor practical sense. Shock, horror?

Until it's benchmarked we won't know the practicalities - clearly total throughput is up for both point and filtered sampling.

Hopefully the ROPs are sufficiently improved too so they can get nearer or even catch NVIDIA in these two respects,

Yeah, sadly at least some of the 9800GTX 16xAF/4xAA benchmarks are blighted by the card apparently running out of memory, so the comparisons with HD4850 are unreliable.

but given the die size what seems really really impressive IMO are the ALUs as there's a ridiculous amount of them and they presumably aren't any weaker (if anything, the FP64 numbers would imply they're stronger). They're definitely back in the game.

Double-precision was already there in RV670, so that particular aspect shouldn't have changed...

I am intrigued by the idea of 10 SIMDs though - that's quite a lot of control overhead in comparison with 4 SIMDs.

Jawed

leoneazzurro · Jun 16, 2008

Jawed said:
I was just assuming the same "horizontal" arrangement we saw in R6xx

If each SIMD gets a dedicated TU ("vertical" - though it's horizontal on this diagram if the SIMDs are now read as rotated 90 degrees) then that's a bit of a change - though in terms of cache I guess the effect of the change is very limited because L1 is basically only big enough for a small region of texels and any one texel will find itself in multiple L1s anyway, whether it's a horizontal or vertical mapping from SIMDs to TUs.

Jawed

It seems to me that the "global data share" could be a shared link through which every SIMD can quickly access data in the other SIMDs (and reach the TMU blocks attached to other SIMDs). That is, it's the "vertical" link between the SIMDs, whereas the "horizontal" link is between the shaders in a SIMD and its dedicated TMU block.
Is that possible?

Sunday · Jun 16, 2008

v_rr said:
And by that time RV770 comes in 40nm, and then we will see who catches whom.

40nm is the half node of 45nm, which is a new process node. It dramatically decreases die size.

It was reported on this site that TSMC is skipping 45nm and heading directly to 40nm (presumably in H2 2009).
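
For a sense of scale on the "dramatically" part: an ideal optical shrink scales area with the square of the feature-size ratio. A rough sketch, assuming RV770 starts on 55nm (my assumption, it isn't stated in this thread) and perfect scaling, which real shrinks never reach because pads, analog and I/O don't shrink with the logic. The function name is made up for illustration.

```python
def ideal_area_scale(old_node_nm, new_node_nm):
    """Fraction of the original die area left after a perfect linear shrink."""
    return (new_node_nm / old_node_nm) ** 2

# Assumption: 55nm starting point; 40nm target as discussed above.
scale = ideal_area_scale(55, 40)
print(f"ideal area after shrink: {scale:.0%} of the original")
```

Treat the result as an upper bound on the saving, not a prediction.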

Mintmaster · Jun 16, 2008

Jawed said:
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate-out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something). I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.

I wasn't implying anything so elaborate. Just talking about the strategy of making units much smaller at the cost of some efficiency/functionality.

ZerazaX · Jun 17, 2008

Well with GDDR5 supporting:

6. Clamshell mode(x16 mode)
Graphics system designers expect GDDR5 standard to offer high flexibility in terms of framebuffer and bandwidth variation. GDDR5 supports this need for flexibility in an outstanding way with its clamshell mode. The clamshell mode allows 32 controller-I/Os to be shared between two GDDR5 components. In clamshell mode each GDDR5 DRAM's interface is reduced to 16 I/Os. 32 controller I/Os can, therefore, be populated with two GDDR5 DRAMs, while DQs are single loaded and the address and command bus is shared between the two components. Operation in clamshell mode has no impact on system bandwidth.
Example: System configurations with 512M GDDR5 device using a controller with 256 bit interface:
A) 8pcs of 512M GDDR5 in standard mode Framebuffer: 512 MB
B) 16pcs of 512M GDDR5 in clamshell mode Framebuffer: 1 GB
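
The arithmetic behind that example can be sketched as follows. This assumes 32 I/Os per device in standard mode and 16 in clamshell mode, as the quoted text describes, and densities given in Mbit; the function and parameter names are mine, not Qimonda's.

```python
def gddr5_config(bus_width_bits, device_density_mbit, clamshell=False):
    """Return (device_count, framebuffer_MB) for a given controller width."""
    ios_per_device = 16 if clamshell else 32   # x16 in clamshell, x32 otherwise
    devices = bus_width_bits // ios_per_device
    framebuffer_mb = devices * device_density_mbit // 8  # Mbit -> MB
    return devices, framebuffer_mb

print(gddr5_config(256, 512))                  # standard mode
print(gddr5_config(256, 512, clamshell=True))  # clamshell mode
```

The controller still drives the same 256 data pins in both cases, which is why the whitepaper says bandwidth is unaffected: only the number of chips, and so the capacity, doubles.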

That crossbar might just mean shared memory does indeed occur

Anarchist4000 · Jun 17, 2008

http://www.qimonda-news.com/download/Qimonda_GDDR5_whitepaper.pdf

Page 7.

That does make a lot of sense though. Each GPU would only have half the bandwidth to each memory chip but with twice as many chips all things are equal. Routing on the PCB I see being an utter nightmare. Getting traces around the second GPU and attached to its memory should be interesting.

You'd have to imagine one of the IHVs requesting a spec like that for it to get included in GDDR5. Since Nvidia likes GDDR3 and is making GPUs roughly 1 Kardashian in size I'd think it's obvious who would have asked for it.

The only interconnect would be the control bus. That and some link or partitioning between the schedulers would make a little sense.

Jawed · Jun 17, 2008

I don't get what you guys are describing. Clamshell mode is designed to let one memory channel interface with 1 or 2 DRAM chips. The command bus is common for both configurations, while the data bus is split in two for clamshell configuration.

Jawed

Anarchist4000 · Jun 17, 2008

After further reading it's not quite what I thought.

Clamshell would cut the data bus in half allowing the possibility of two GPUs connected to each chip. What I'm still digging for is if there is any control logic that determines which half of the data bus gets used. Then they could alternate between controllers every other clock etc. If that was the case they'd just have to work out timing and sharing of the control bus.

MfA · Jun 17, 2008

The data bus is point to point; you can't just bodge an extra couple of tens of cm of trace from another GPU onto those DRAMs' data pins and hope it will still run at 4 GHz.

Anyway, there is no real point to doing it like that even if you could. You are still losing bandwidth that way ... if you are going to lose bandwidth anyway you could just directly connect one partition of the memory interface on both GPUs, without a DRAM in between.

Anarchist4000 · Jun 17, 2008

GDDR5 SGRAM will be operated in both ODT Enable (terminated) and ODT Disable (unterminated) modes. For highest data rates it is recommended to operate in the ODT Enable mode. ODT Disable mode is designed to reduce power and may operate at reduced data rates. There exist situations where ODT Enable mode can not be guaranteed for a short period of time, i.e. during power up.

After going through a little reading the buses can be tri-stated so the point-to-point might not be necessary. The question is how much it affects performance. I'd agree this is pushing things a bit but it's a possible option.

My original idea was that you'd have twice the effective bandwidth, but available only half the time as it alternated between controllers. There is also a mirror option, but that appears to flip all the pins, not just the data pins.

Arty · Jun 17, 2008

Thought this was interesting:

Jason Cross said:
Then there's the question of what ATI has up its sleeves, given that they're on the verge of releasing their new graphics cards based on the new RV770 chip. ATI tells us they're not going to compete in this really high-end of the market with those products.

Rather, they promise we'll get close to the performance of the GTX 200 cards (say within 20%) at dramatically lower prices and power. Certainly that targets a much larger segment of the market, but the worth of that strategy all hinges on its real relative performance. For the high-end, ATI is still a couple months away from the release of their card containing two RV770 GPUs.

We don't know what shape that will take, only that it will combine two GPUs on a card in a substantially different fashion than the Radeon HD 3870 X2, and that ATI tells us that with not-yet-fully optimized drivers it already scores over 6,000 in 3DMark Vantage on Extreme settings. We've heard promises of future greatness before, and reserve judgment until we get the cards in our own hands to run our own tests.

http://www.extremetech.com/article2/0,2845,2320134,00.asp

ZerazaX · Jun 17, 2008

So what's the "Global Data Share" that the supposed-Crossbar leads to for? I just realized it wasn't in the R600 drawing...

Anarchist4000 · Jun 17, 2008

My guess would be some sort of global cache for inter-thread communication.

I don't suppose anyone has done a pin count on RV770? For a unified memory architecture I'd assume there has to be some form of high speed interconnect and an abundance, or lack thereof, might give an idea on just what they're doing.

Jawed · Jun 17, 2008

ZerazaX said:
So what's the "Global Data Share" that the supposed-Crossbar leads to for? I just realized it wasn't in the R600 drawing...

If we knew what the Local Data Share is, maybe we could make a decent guess.

I'm thinking that LDS might be the name for the per SIMD register file.

But GDS is placed alongside texture caches, which implies it's texture-data related. Hell it might be nothing more than memory used to hold addresses or filtering coefficients or something, stuff that's been computed but needs to be kept around for later usage.

If that's the case then LDS might just be data that's specific to a batch for the purposes of texturing or vertex data fetching.

Maybe related to texture arrays and cubemap arrays?

Jawed

v_rr · Jun 17, 2008

ZerazaX said:
So what's the "Global Data Share" that the supposed-Crossbar leads to for? I just realized it wasn't in the R600 drawing...

Does that picture come from AMD?
It lists 40 TMUs, but according to tests so far the HD 4850 looks to have 32 TMUs.

So True/Fake?

satein · Jun 17, 2008

v_rr said:
Does that picture come from AMD?
It lists 40 TMUs, but according to tests so far the HD 4850 looks to have 32 TMUs.

So True/Fake?

Which test?

If it is GPU-Z, the reading relies on a database. So at this point I think we need to wait until the launch date to confirm all the info about RV770. It won't be a long wait.

This round, AMD/ATI are playing a good game of keeping information under wraps.

Wirmish · Jun 17, 2008

satein said:
Which test?

Perlin Noise

trinibwoy · Jun 17, 2008

I think no-x posted some 3dmark fillrate numbers a while back that pointed to 32.
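
For reference, the way a fillrate number points to a unit count is just peak texel rate = TMUs x core clock. A rough sketch, assuming a 625 MHz core clock for the HD 4850 (my assumption, not a figure from this thread) and one bilinear texel per TMU per clock; the function names are made up for illustration and no measured numbers are filled in.

```python
def peak_texel_rate_mtexels(tmus, core_clock_mhz):
    """Theoretical peak, one texel per TMU per clock, in MTexels/s."""
    return tmus * core_clock_mhz

def implied_tmus(measured_mtexels, core_clock_mhz):
    """Unit count a measured fillrate would imply at perfect efficiency."""
    return measured_mtexels / core_clock_mhz

ASSUMED_CLOCK_MHZ = 625  # assumption, see above
for tmus in (32, 40):
    rate = peak_texel_rate_mtexels(tmus, ASSUMED_CLOCK_MHZ)
    print(f"{tmus} TMUs -> {rate} MTexels/s peak")
```

A real test never hits 100% of peak, so a measured result gives a floor on the unit count: a number just under the 32-unit peak is consistent with either 32 efficient units or 40 badly fed ones.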

LordEC911 · Jun 17, 2008

Wirmish said:
Perlin Noise

Ummm... Isn't Perlin Noise a shader intensive bench?
