AMD: R8xx Speculation

Neb · Jul 20, 2009

gongo said:
There is one thing I don't get about GPUs and DX: I just saw Resident Evil 5 benchmarks and it ran notably slower on the HD4800 series than on the GT200 series. Is this a driver bug, or is something about Capcom's programming off? I would have thought a DX game should at least run similarly on hardware of the same DX class; isn't that the purpose of a standard API?

RE5 doesn't work correctly with AA in DX10 mode due to driver problems and/or game bugs, as there is no reason why it would hit ~15fps on test 3 but ~65fps in DX9 mode with the same amount of AA.

trinibwoy · Jul 20, 2009

Jawed said:
If this MCM is enhanced-sideport, is it AFR or is it something else? If something else, can the 4-chip board be anything but AFR based?

Heh, I'm surprised you guys are giving this rumour so much credit. I would wager that even if AMD were to develop an elegant solution for inter-die communication on top of a shared memory pool they would do so with a 2-chip design first before jumping all the way to 4.....at least that's what my uninformed common sense is whispering

keritto · Jul 20, 2009

nagus said:
http://we.pcinlife.com/thread-12119...e carbonized bacteries on jupiter yeah :rofl:

keritto · Jul 20, 2009

Jawed said:
Alternatively, the 192-bit, 960-lane chip merely sounds like a refresh part that will be scheduled for March 2010, or something...

2 clusters more than RV740 is only about 11-14mm² I reckon, leaving 30mm² for D3D11 changes + enhanced sideport. Does RV740 have a sideport? Seems unlikely...

What do you mean, that 181mm² couldn't have a full-length 192-bit bus, when RV670 (similarly sized, 187mm² afair) had a 256-bit GDDR3/4 bus and GDDR5 needs only some 30% more pins per chip?

Nope, RV740 is w/o sideport. But does RV790 have a sideport? I think it also doesn't have one, like its little brother. And the sideport didn't work at all in R700's case, so it's sideport rev. 1.05a that actually works, unlike RV770's one.

willardjuice · Jul 20, 2009

sideport didn't work at all

There's a difference between "didn't work at all" and "working but just unused".

Jawed · Jul 20, 2009

trinibwoy said:
Heh, I'm surprised you guys are giving this rumour so much credit. I would wager that even if AMD were to develop an elegant solution for inter-die communication on top of a shared memory pool they would do so with a 2-chip design first before jumping all the way to 4.....at least that's what my uninformed common sense is whispering

Yeah, that's what I'm thinking, 2-chip is the biggest MCM. And even if that doesn't use AFR, then AFR would be required for a dual-MCM board (X2, as it were, with 4 chips).

I don't know if it's credible. I wonder about the complexity of interconnecting 20 clusters. The overhead of a very high bandwidth sideport (10x the speed of the sideport on RV770?) in a trade against AMD manufacturing only two chips?: 64-bit (no sideport) and 128-bit for single chip, dual-chip (MCM) and quad-chip (two MCMs?) boards.

If interconnecting 20 clusters is very costly (dunno) then it might be better to have two semi-independent groups of 10 clusters. Once you've done that, what's the trade-off between going single-chip with a 3-chip line-up versus multi-chip with a 2-chip line-up?

A high-bandwidth connection between two chips on an MCM should be easier (in the sense of physical properties) than getting high bandwidth between a GPU and GDDR5. GDDR5 is capable of >7G bits per pin per second.
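For a sense of scale, peak bandwidth is simply bus width times per-pin rate. A rough sketch, treating the 7 Gbit/s per pin figure above as an upper bound rather than a shipping speed; the function name is made up for illustration:

```python
def bus_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a bus of the given width and per-pin data rate."""
    # bits per second across all data pins, divided by 8 to get bytes
    return bus_width_bits * gbit_per_pin / 8.0

# Bus widths that come up in this thread, all at the assumed 7 Gbit/s per pin.
for width in (64, 128, 192, 256):
    print(f"{width}-bit: {bus_bandwidth_gb_s(width, 7.0):.0f} GB/s")
```

Any on-package link between two dies would have to be judged against numbers of this order if it is meant to stand in for a shared memory pool.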

Maybe if we keep wondering if it's possible it'll eventually come to pass

Jawed

Jawed · Jul 20, 2009

keritto said:
What do you mean, that 181mm² couldn't have a full-length 192-bit bus, when RV670 (similarly sized, 187mm² afair) had a 256-bit GDDR3/4 bus and GDDR5 needs only some 30% more pins per chip?

I don't know if it's possible to fit 256-bit bus and a sideport into 181mm².

I'm still unclear on whether a single "layer" of physical I/O round the edge of a chip is the limit. e.g. it could be possible to implement a 256-bit interface in ~the same perimeter as a 128-bit interface, but using more depth, e.g. about 2mm depth in total instead of about 1mm as is the case currently.

The PCI-Express IO on GT215 is configured in this dual-layer approach.

Nope, RV740 is w/o sideport. But does RV790 have a sideport? I think it also doesn't have one, like its little brother. And the sideport didn't work at all in R700's case, so it's sideport rev. 1.05a that actually works, unlike RV770's one.

Apparently RV790 does have the sideport, but that may be solely because it was easiest to leave it in place while adding the cap ring.

Jawed

dkanter · Jul 20, 2009

trinibwoy said:
Heh, I'm surprised you guys are giving this rumour so much credit. I would wager that even if AMD were to develop an elegant solution for inter-die communication on top of a shared memory pool they would do so with a 2-chip design first before jumping all the way to 4.....at least that's what my uninformed common sense is whispering

Heh, I think you're being too kind. Going for a shared memory pool across two GPUs would be horribly slow, since the access would have to be arbitrated by PCI-E.

Unless you want to design in a cache coherency protocol, which would be ridiculously complex and totally inconsistent with AMD's design philosophy.

David

Silent_Buddha · Jul 20, 2009

dkanter said:
Heh, I think you're being too kind. Going for a shared memory pool across two GPUs would be horribly slow, since the access would have to be arbitrated by PCI-E.

Unless you want to design in a cache coherency protocol, which would be ridiculously complex and totally inconsistent with AMD's design philosophy.

David

You mean like the R600 ringbus?

Regards,
SB

MfA · Jul 20, 2009

dkanter said:
Unless you want to design in a cache coherency protocol

Cache coherency is not a factor in rendering: there are de facto no cached reads after writes, apart perhaps from the ROPs (and those are assigned non-overlapping regions, so they can never interfere with each other). Any read after write requires either uncached accesses or big state changes; the former will still be uncached with multiple GPUs, and the latter provide plenty of time for write-back.

Jawed · Jul 20, 2009

dkanter said:
Unless you want to design in a cache coherency protocol, which would be ridiculously complex and totally inconsistent with AMD's design philosophy.

Texture caches are read-only, so the coherency is relatively easy, isn't it? With tiled formatting of textures in memory (i.e. deterministic texel addresses), I'm guessing that there's only one cache to query for the presence of any quad of texels, rather than having to broadcast a request. Obviously all the TUs will be fetching multiple quads of texels in parallel, so the rate of queries is quite high.

Writes to render targets are tiled, i.e. any caching of those targets is localised, obviating coherency.
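The tiling argument can be sketched as an ownership function: if every pixel maps to exactly one unit, no two units ever write the same location, so there is nothing to keep coherent. The tile size and the checkerboard interleave below are assumptions for illustration, not anything known about the actual hardware:

```python
TILE = 32  # tile edge in pixels (assumed)

def owner(x, y, num_units, tile=TILE):
    """Return the index of the unit (ROP group or chip) that owns pixel (x, y)."""
    # Tiles are dealt out in a checkerboard-style interleave (hypothetical pattern).
    return (x // tile + y // tile) % num_units

def pixels_per_unit(width, height, num_units):
    """Count the pixels each unit owns; every pixel lands in exactly one bucket."""
    counts = [0] * num_units
    for y in range(height):
        for x in range(width):
            counts[owner(x, y, num_units)] += 1
    return counts

# A 128x128 target split between two units.
print(pixels_per_unit(128, 128, 2))
```

The same function works whether the "units" are ROP groups on one die or two dies on a package; what changes is only how far the data has to travel.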

It's hard to work out what's needed to make each of the graphics-pipeline inter-stage buffers work. I'm still quite fuzzy on how VS/GS work is distributed across the clusters of a single GPU, so don't know what implications there are for multi-chip processing of geometry.

And I'm not clear if it's meaningful to have TS fixed-function units, on each of multiple GPUs, running in parallel, due to the locality/neighbourhood properties of patch processing (similar question mark for GS since locality of primitives is a key concept).

On the other hand, with post-transform caches being very small in GPUs, with implicit re-shading of vertices that have been evicted due to LRU or other policy, absolute throughput for VS/GS seems a lower priority. Maybe this will change with D3D11 due to the intensity of tessellation-based pipelines? I don't know how often typical games are VS or geometry bound in some way, generally.

---

Under GPGPU video memory is effectively read-write. This type of access is uncached in current GPUs, so requires plenty of threads to hide the resulting latency. Hard to know if there are any plans to implement caching for read-write accesses :???:

---

Why do you say PCI-Express arbitration is required for multi-chip? The buffers are assigned by the host, which requires commands sent over PCI-Express, but after that what kind of arbitration are you referring to? Why would two chips be "horribly" slower than one chip?

Jawed

trinibwoy · Jul 20, 2009

Jawed said:
A high-bandwidth connection between two chips on an MCM should be easier (in the sense of physical properties) than getting high bandwidth between a GPU and GDDR5. GDDR5 is capable of >7G bits per pin per second.

Yeah that's my thought as well. If you can manage to implement such high speed off package buses based on industry standard protocols it surely must be relatively easy to implement some sort of proprietary on package communication.

dkanter said:
Heh, I think you're being too kind. Going for a shared memory pool across two GPUs would be horribly slow, since the access would have to be arbitrated by PCI-E.

To be honest I don't see the point of any sort of MCM approach if it's not going to lead to a shared memory pool (which is surely a requirement for truly seamless load balancing?). Going MCM just to improve inter-die bandwidth doesn't seem like a worthy goal on its own considering the reported adequacy of PCI-E for AFR. Also, as framebuffer requirements increase it will become exceedingly wasteful to give each chip its own full complement of dedicated memory.

Jawed said:
On the other hand, with post-transform caches being very small in GPUs, with implicit re-shading of vertices that have been evicted due to LRU or other policy, absolute throughput for VS/GS seems a lower priority. Maybe this will change with D3D11 due to the intensity of tessellation-based pipelines?

Just curious but why would it change? An increase in the number of vertices passing through the chip doesn't inherently increase the probability of reuse of any given vertex does it?

keritto · Jul 20, 2009

Jawed said:
I don't know if it's possible to fit 256-bit bus and a sideport into 181mm².

But I asked about 192-bit (in the part you cut), not 256-bit, because the 55nm RV770 has a 256-bit bus, GDDR5 needs only 30% more pins than GDDR3/4, and it's a full new half node since 55nm.
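The pin argument is easy to put in numbers. A back-of-the-envelope sketch that takes the "30% more pins" figure at face value (it is the poster's estimate, not a datasheet value) and expresses a GDDR5 bus as the GDDR3/4 bus width with the same pin count:

```python
GDDR5_PIN_OVERHEAD = 1.30  # the "30% more pins" figure from the post, not a datasheet value

def pins_relative_to_gddr3(gddr5_bus_bits, overhead=GDDR5_PIN_OVERHEAD):
    """Pin cost of a GDDR5 bus, as the GDDR3/4 bus width with the same pin count."""
    return gddr5_bus_bits * overhead

for bits in (128, 192, 256):
    print(f"{bits}-bit GDDR5 ~ {pins_relative_to_gddr3(bits):.0f}-bit GDDR3/4 in pins")
```

On that assumption a 192-bit GDDR5 interface costs slightly fewer pins than a 256-bit GDDR3/4 one, which is the point being made; it says nothing about whether a sideport fits alongside it.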

Jawed said:
It's hard to work out what's needed to make each of the graphics-pipeline inter-stage buffers work. I'm still quite fuzzy on how VS/GS work is distributed across the clusters of a single GPU, so don't know what implications there are for multi-chip processing of geometry.

Well, if cache reads are easy, along with everything else you mention above, then I don't see why chips couldn't share instructions in the DX11 ATI architecture, given that ATI claims inter-cluster threading is improved on those geary slides. It sounds like déjà vu to me. So if they implement that, there's no reason clusters spread across different physical dies couldn't do the same, provided the inttakonnekshen is perfect enough. And after all, they would bring us what they promised a year and a half ago: "small cheaper dies" (if someone considers 181mm² small) that they use as building blocks for their whole product range. Don't think it's any better for us, but it's certainly better for them.

LordEC911 · Jul 20, 2009

nagus said:
RV870 is a package with two RV840, Sideport enhanced high-speed Internet, so that from a practical application, performance-driven point of view including the RV870 looks like a GPU, the RV840 specifications for the double 1600SP/256bit, A11 XT version of 3Dmark Vantage at least P17000 +.

See I interpret that bolded part as the dualGPUs being recognized as a single GPU.

Could it be that these last 3-4 pre-release rumors/speculation on MCM being recognized as a single GPU were all thanks to R800 being in development?
The first rumors of MCM were back in 2H '07, when R800 was well underway, ~1 year at least.

kemosabe · Jul 21, 2009

First rumors of a launch delay, albeit a minor one.

gongo · Jul 21, 2009

Can we say HD5870 will be delayed till the end of the year? If so I might join in the cheapo HD4890 fun. Sounds to me the HD5870 is not the leap in performance over 4890 by the expected double figures. We need the MCM HD5870x2?

neliz · Jul 21, 2009

gongo said:
Sounds to me the HD5870 is not the leap in performance over 4890 by the expected double figures. We need the MCM HD5870x2?

I guess you're not reading it right. We're getting an RV840 that performs in between the 4850 and 4890, and we'll be getting a new kind of "two chips in one package" RV870 that would perform at 180% of the RV840.

A product like that is as far from crossfire as it is from a single chip product.

LordEC911 · Jul 21, 2009

kemosabe said:
First rumors of a launch delay, albeit a minor one.

Not quite, we got the first a few months ago in Q1 '09.

neliz · Jul 21, 2009

Delay? Looky here:

AFTER MUCH sniffing about, the INQ can finally reveal the names of AMD's much anticipated 40nm DX11-based Evergreen products.

A product launch is thought to be imminent (think late September), and AMD reckons these products will fundamentally change the graphics industry and give it an advantage over arch rival Nvidia.

The highest-end enthusiast offerings are purportedly called Cypress, with performance offerings dubbed Redwood, mainstream offerings called Juniper and Cedar, and low level entry offerings named after the poisonous shrubbery, Hemlock. [bwing me a shwubbery!]

We've heard AMD have already received a wafer back from TSMC and that it's alive, healthy and pretty much ready for ramping production. So the 40nm fully DX11-compliant chips will be ready for launch by the end of September, even slightly ahead of Windows 7.

Nvidia is still a way behind on DX11, and from what we're seeing, AMD seems confident - nay, cocky - that Evergreen will deliver a punch to the Goblin it may take a while to recover from. Let's just hope NV doesn't get ever-green with envy

Jawed · Jul 21, 2009

trinibwoy said:
Just curious but why would it change? An increase in the number of vertices passing through the chip doesn't inherently increase the probability of reuse of any given vertex does it?

As geometry gets more complex (tessellation) it becomes progressively more and more expensive to have to re-create vertices because PTVC missed.

Indeed, I'm now wondering if this, in itself, is such a fundamental issue that a severe re-design is required.

In the good old days a vertex started in memory, went through VS and then proceeded to setup via PTVC.

Now a vertex can "appear" as a result of TS, which means that pre-processing through VS and HS had to occur - both of these are "low-frequency" (e.g. 1/10th or 1/100th). After appearing in TS it then goes through DS and GS. DS is reasonably costly as all attributes have to be, at minimum, interpolated - otherwise I guess DS functions as the main VS. GS can be pass-through - I'm not sure what other things one would do with GS after tessellating.

So, maybe TS-centric geometry isn't hugely costly, per vertex - but it still seems like there'd be quite a bit of latency there. I suppose it's possible that the tessellation process, within a patch, orders vertices in a PTVC-friendly manner. But the ordering of patches across a surface and during tessellation seems to imply that there will inevitably be vertices that are shaded multiple times.

If patches are "square" then the vertex at a given corner will end up being shaded four times. Arguably it doesn't make sense to try to cache such vertices because the interval between successive processing could be hundreds of vertices.
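The corner-vertex effect can be checked with a toy model. This is only a sketch under loud assumptions: a grid of square patches visited row by row, vertices within a patch emitted in simple row-major order, and a plain FIFO standing in for the PTVC. Real tessellator output order and real cache policy are not known here, and `shade_counts` is a made-up name.

```python
from collections import OrderedDict, Counter

def shade_counts(patches_x, patches_y, n, cache_size):
    """Count how often each vertex is shaded.

    Each patch is tessellated into (n+1) x (n+1) vertices, so neighbouring
    patches share their edge and corner vertices. A FIFO of `cache_size`
    entries stands in for the post-transform vertex cache.
    """
    cache = OrderedDict()
    counts = Counter()
    for py in range(patches_y):
        for px in range(patches_x):
            for j in range(n + 1):
                for i in range(n + 1):
                    v = (px * n + i, py * n + j)  # global vertex coordinate
                    if v in cache:
                        continue  # hit: reuse the already-transformed vertex
                    counts[v] += 1  # miss: the vertex is shaded again
                    cache[v] = True
                    if len(cache) > cache_size:
                        cache.popitem(last=False)  # evict the oldest entry
    return counts

# 3x3 patches, 8 segments per edge, 16-entry cache (all assumed values).
counts = shade_counts(3, 3, 8, 16)
print(counts[(8, 8)], counts[(4, 4)])  # shared corner vs. patch-interior vertex
```

With these numbers the corner shared by four patches is shaded four times while an interior vertex is shaded once, and the re-shading only disappears when the cache is made large enough to span whole rows of patches, which is the "interval of hundreds of vertices" problem.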

I'm wondering now if the introduction of GS and the way ATI buffers GS output through video memory (which increases latency) means that a structural or at least capacity change was implemented for PTVC.

Jawed
