AMD: R7xx Speculation

Status
Not open for further replies.
Well, yes, I'm just extremely skeptical of the 480 SPs stuff. What makes me even more dubious now is the supposed X2 delay. Seems like a perfectly phased rumour, given that it came slightly before you might have started to expect seeing board pic leaks or AIBs getting briefed... So yeah, call me crazy, but I'm still giving a 65%+ chance of RV770 being monolithic (350mm²+) *shrugs*. Might be usable for a X2 with some units disabled down the line though, who knows.

Once again, I could be horribly wrong here, but I wouldn't be surprised at all if everyone has been misled. This would hardly be unprecedented in the history of the industry either, so I don't think it'd be horribly surprising...
I wish I could hop on your train :p but knowing AMD I'd be setting myself up for disappointment...

edit - I'll go ahead and stick my neck out. I'm 99% sure that RV770 is 480 SPs and about 250mm².
 
Well, yes, I'm just extremely skeptical of the 480 SPs stuff.
I dare say I'm pretty confident in the number Pande stated, I think he's leaked stuff before.

I can't shake off the feeling that ATI GPUs are destined to be stuck at 16 TUs and 16 RBEs for, well, forever. It's been 4 years now... Texturing performance will merely scale with clock.

What makes me even more dubious now is the supposed X2 delay. Seems like a perfectly phased rumour, given that it came slightly before you might have started to expect seeing board pic leaks or AIBs getting briefed...
X2 should be quite different from HD3870X2 if the reasonably firm rumours that it'll be "more like a single GPU" hold up - therefore a delay due to driver-quality seems likely. I'm not in the least surprised that X2 is a few months down the road.

So yeah, call me crazy, but I'm still giving a 65%+ chance of RV770 being monolithic (350mm²+) *shrugs*. Might be usable for a X2 with some units disabled down the line though, who knows.

Once again, I could be horribly wrong here, but I wouldn't be surprised at all if everyone has been misled. This would hardly be unprecedented in the history of the industry either, so I don't think it'd be horribly surprising...
I think it's merely drowning in wishful thinking, a perennial state of affairs since R600's delay began.

Jawed
 
Arun, there are three strong points that go against that though

First, performance, unless it's severely bottlenecked somewhere or the chips that are out there are very underclocked or heavily disabled.

Second, the RV770 is based on the RV670. If the SPs are now 2 flops each, that means no more vec5, which would be a big change to its shader array.

Third, the pricing rumors and recent slides would all have to be fake.

*mass confusion* :p :D:LOL::devilish:
 
Anyway, so the bottlenecking point of 3870's were thought to be their texture fill rates, yeah?
Number 2 after z fillrate in my opinion.

3870 has a fillrate of 775*16 = 12400MT/s and a processing rate of 775*320*2 = 496GFLOPS

4870 has a fillrate of 850*32 = 27200MT/s and a processing rate of 1050*480*2 = 1008GFLOPS
TU and ALU having different clocks is a seriously unlikely prospect - ATI's never done it and I don't see why RV770 would be the start. Plus I think the register file wouldn't be happy - but that's just another guess.
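For anyone who wants to check the quoted figures, here is a minimal sketch of the arithmetic. It only reproduces the numbers in the quote; the function names are made up for illustration, and the 4870 inputs (850MHz TUs, 1050MHz ALUs, 32 TUs, 480 lanes) are the rumoured values under discussion, not confirmed specs.

```python
def texel_rate_mtexels(tu_clock_mhz, texture_units):
    """Peak texel rate in MT/s: one texel per TU per clock (assumed)."""
    return tu_clock_mhz * texture_units


def alu_rate_gflops(alu_clock_mhz, alu_lanes, flops_per_lane_per_clock=2):
    """Peak ALU rate in GFLOPS, counting a multiply-add as 2 flops."""
    return alu_clock_mhz * alu_lanes * flops_per_lane_per_clock / 1000


# HD3870 figures as quoted in the thread.
hd3870 = {
    "tex": texel_rate_mtexels(775, 16),
    "alu": alu_rate_gflops(775, 320),
}

# Rumoured HD4870 figures as quoted, including the split TU/ALU clocks.
hd4870_rumoured = {
    "tex": texel_rate_mtexels(850, 32),
    "alu": alu_rate_gflops(1050, 480),
}

tex_ratio = hd4870_rumoured["tex"] / hd3870["tex"]
alu_ratio = hd4870_rumoured["alu"] / hd3870["alu"]
print(f"texel rate ratio: {tex_ratio:.2f}x, ALU rate ratio: {alu_ratio:.2f}x")
```

On those rumoured inputs both ratios come out a little over 2x, which is exactly why the split-clock assumption matters so much to the result.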

I've missed a lot of rumors about 4870's power, but it's probably double the 3870's;
HD3870 is about 115W I think, HD4870 appears to be up to 160W, so about 40% extra.

Jawed
 
First, performance, unless it's severely bottlenecked somewhere or the chips that are out there are very underclocked or heavily disabled.
All the indications are that RV770 will be significantly less than 2x the performance of RV670.

I believe that simple cherry-picking of 4xMSAA benchmarks with RV770 specified as 480 ALU lanes, 16 TUs, 16 RBEs (with 4x Z per clock) is all that AMD needs to claim the performance gains they're indicating.

Jawed
 
ATi has found the solution for this with R700 AKA 4870 X2. They've ditched NUMA altogether and gone to a unified memory space for multiple GPUs. This will provide performance gains in many scenarios. You can count on it.
NUMA means that access time to memory varies depending upon address - which is normally a consequence of having distributed MCs.

:LOL: I suppose you could say that the MCs dotted around R6xx's ring bus constitute a NUMA configuration.
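To make the definition concrete, here is a toy sketch of "access time depends on address" for clients and memory controllers sitting on a ring. Everything in it is an assumption for illustration: the number of ring stops, the address-to-MC mapping, and the latency figures are invented and are not R6xx numbers.

```python
def ring_hops(src_stop, dst_stop, stops):
    """Shortest hop count between two stops on a bidirectional ring."""
    distance = abs(src_stop - dst_stop) % stops
    return min(distance, stops - distance)


def access_latency_ns(client_stop, address, stops=4,
                      bytes_per_mc=128 * 2**20, base_ns=100, hop_ns=10):
    """Latency for a client at client_stop to reach the MC owning address.

    Hypothetical model: memory is carved into equal contiguous slices,
    one per MC, and each ring hop adds a fixed cost.
    """
    owning_mc = (address // bytes_per_mc) % stops
    return base_ns + hop_ns * ring_hops(client_stop, owning_mc, stops)


slice_size = 128 * 2**20
for mc in range(4):
    latency = access_latency_ns(client_stop=0, address=mc * slice_size)
    print(f"address in MC {mc}'s slice: {latency} ns")
```

Same client, same kind of request, different latency purely because of where the address lives. That is all "non-uniform" means here, whether the MCs are on one chip or two.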

Jawed
 
NUMA means that access time to memory varies depending upon address - which is normally a consequence of having distributed MCs.

:LOL: I suppose you could say that the MCs dotted around R6xx's ring bus constitute a NUMA configuration.

Jawed

But doesn't NUMA mean that in a multi-processor system there are separate memory chips for each processor? Or is its definition simply that "the memory access time depends on the memory location"?

I've also heard that the 4870 X2 will be a 1GB card (and effectively 1GB, not 512MB like the 3870 X2), which means there will be only one memory space but two memory buses for two GPUs.
 
I wouldn't say significantly less than 2x the performance of the RV670. It's less, but not that much less than double.
With the MSAA/Z fillrate bottlenecks lifted (more than 2x faster than RV670), but with texturing performance only improved by clock rate (<40%) then it'll just be TEX limited the whole time - hence overall performance gain will be far short of 2x.
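A rough way to put numbers on that argument is an Amdahl-style sketch: split frame time into per-unit shares and divide each share by that unit's speedup. This is a deliberately crude model (real GPUs overlap these stages rather than running them back to back), and the time shares and speedup factors below are hypothetical, picked only to show the shape of the result.

```python
def overall_speedup(time_fractions, unit_speedups):
    """Overall speedup when each share of frame time shrinks by its own factor.

    time_fractions: share of the old frame time limited by each unit (sums to 1).
    unit_speedups:  how much faster each unit is on the new chip.
    """
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    new_time = sum(share / unit_speedups[unit]
                   for unit, share in time_fractions.items())
    return 1.0 / new_time


# Hypothetical split of frame time on the old chip.
shares = {"tex": 0.5, "z": 0.3, "alu": 0.2}

# Hypothetical gains: TEX scales with clock only, Z and ALU roughly double.
gains = {"tex": 1.1, "z": 2.2, "alu": 2.0}

speedup = overall_speedup(shares, gains)
print(f"overall speedup: {speedup:.2f}x")
```

With half the frame TEX-limited and TEX up by only 10%, the overall gain lands around 1.45x even though Z and ALU have doubled. Shift the shares and the answer moves a lot, which is why the estimates in this thread are so spread out.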

---

I have been wondering if an increase in ALU:TEX ratio is capable of improving TU utilisation. I also wonder if an increase in the number of threads in flight per SIMD might help TU utilisation. In other words I think there are mechanisms by which TEX throughput could scale better than expected simply because of threading inefficiencies in RV670.

Jawed
 
But doesn't NUMA mean that in a multi-processor system there are separate memory chips for each processor? Or is its definition simply that "the memory access time depends on the memory location"?
http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

I've also heard that the 4870 X2 will be a 1GB card (and effectively 1GB, not 512MB like the 3870 X2), which means there will be only one memory space but two memory buses for two GPUs.
It's certainly expected.

I'm still unclear on the mechanism by which the two GPUs on HD3870X2 access memory - are they really locked-out of each other's space? Do all data transfers between them depend upon the CPU initiating PCI Express operations?

Jawed
 
Jawed: If your guess is correct and 4870 is only 40% faster than 3870, I don't think AMD should even bother with an SRP of $350. And I don't understand why texturing performance scales only with core clock. From the Expreview chart it looks like texture units have doubled and it's the z-fillrate that only increases with core clock. What am I seeing wrong?

I agree that introducing an ALU:TEX ratio (which is bigger than 1) doesn't make sense for ATI, it should have been the other way around if they were to be asynchronous at all.
 
I can't edit my posts :S Anyway, I wanted to say that whether it's the Z-fillrate or the texture fillrate that only benefits from the clock increase, that would mean only a 40% increase in performance either way, since they were both bottlenecking the R600 family. But then is the superbly increased (more than 2x) math processing rate all a waste? Why would they do that?

Or are all of these speculations wrong, including the shader count and the clocks?
 
Jawed: If your guess is correct and 4870 is only 40% faster than 3870,
Put me down for 20-80% on average :smile:

I don't think AMD should even bother with an SRP of $350. And I don't understand why texturing performance scales only with core clock. From the Expreview chart it looks like texture units have doubled and it's the z-fillrate that only increases with core clock. What am I seeing wrong?
I'm going out on a limb, suggesting that RV770 is only a minor architectural change (4x Z per clock instead of 2x) with an increase in ALUs and clocks.

Jawed
 
Copying-across from the other thread, via a low-bandwidth bridge:
Any moment now one of the mods is going to notice we now have two R7xx threads.

Before that happens though ... the most straightforward way to do better than crossfire/"SLI" is simply adding a bridge on the ring bus and connecting the chips through it, this requires minimum invasiveness in the rest of the design (everything else could be done in software). You would still need the bridge chip though.
This simple bridge is what I've been thinking too, but I don't understand the requirement for a bridge chip.

Come to think of it, my previous estimates of the needed bandwidth were inflated ... you would only need enough bandwidth for vertices and to copy finalized render buffers.
If the memory space is unified then the bandwidth associated with reading textures from the foreign MCs would have to be counted wouldn't it?
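A back-of-envelope sketch of the two traffic types being argued about, for scale. All inputs are hypothetical: the resolution, frame rate, texture traffic figure and the share of fetches that land on the other chip's MCs are placeholders, not measurements of any real board.

```python
def buffer_copy_bandwidth_gbs(width, height, bytes_per_pixel, frames_per_second):
    """Bandwidth in GB/s to ship one finalized render buffer per frame."""
    return width * height * bytes_per_pixel * frames_per_second / 1e9


def foreign_texture_bandwidth_gbs(texture_traffic_gbs, foreign_fraction):
    """Bridge bandwidth in GB/s if some share of texture reads hit the other chip's MCs."""
    return texture_traffic_gbs * foreign_fraction


# Hypothetical: 1920x1200, 32-bit colour, 60 frames per second.
copy_gbs = buffer_copy_bandwidth_gbs(1920, 1200, 4, 60)

# Hypothetical: 40 GB/s of texture traffic, half of it served by the foreign MCs.
texture_gbs = foreign_texture_bandwidth_gbs(40.0, 0.5)

print(f"render buffer copies: {copy_gbs:.2f} GB/s")
print(f"foreign texture reads: {texture_gbs:.1f} GB/s")
```

On these placeholder numbers the finalized-buffer copies are well under 1 GB/s, while unreplicated texture reads across the bridge would be tens of GB/s, so whether textures are shared or replicated dominates the bridge requirement.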

Jawed
 
The chip still has to be set up and receive commands from the host; simply reading and writing on the ring bus won't let you do that.

You could just replicate the textures (except for the rendered ones).
 