AMD: R8xx Speculation

Dave Baumann · Feb 16, 2010

FrameBuffer said:
Good enough for some.. sure.. for someone who wants a compact, silent htpc card without having to kill most post processing at HD..

And those type of features are more suited to lower quality SD (or lower) sources than cleaner HD sources to begin with.

mczak · Feb 16, 2010

Mindfury said:
The RV870 Story: AMD Showing up to the Fight

http://anandtech.com/video/showdoc.aspx?i=3740

Very interesting indeed.
Though I'm really wondering how they got die size down so much. By reading that, it seems like they didn't cut down on the numbers of simds, which is sort of hard to believe given size needed to go down from 400+ to 330 mm². So what else? Sideport is mentioned, but that's probably only good for 10mm² or so at best. What else could have been in there? Wider internal data paths (though the article says "features" had to go)? Cache sizes (not that they take up a whole lot of space)? In any case it can't have been something which required rebuilding of whole blocks, as that would have led to a much larger delay.
Also, I'd say they missed the 2x performance target. Granted it's there in theoretical flops but not really in practice, though maybe the 2x target was only in theoretical flops...

Silent_Buddha · Feb 16, 2010

FrameBuffer said:
Good enough for some.. sure.. for someone who wants a compact, silent htpc card without having to kill most post processing at HD.. hardly and most certainly not over previous gen (with the sole exception being bitstreaming) such as the 4550 (from $19.99 to 44.99 vs 44.99 to 82.99 for 5450).

And despite that it's still by far the best solution for HTPC in that price range and more than enough for almost everyone building an HTPC. The only card better is the 5570, but there you're getting into more heat and power consumption, something that isn't always conducive to small and silent. It can either be small and actively cooled, or silent and passively cooled but requiring a larger enclosure with active cooling.

Thanks, I'll go for the 5450 any day.

Regards,
SB

Jawed · Feb 16, 2010

mczak said:
Very interesting indeed.
Though I'm really wondering how they got die size down so much. By reading that, it seems like they didn't cut down on the numbers of simds, which is sort of hard to believe given size needed to go down from 400+ to 330 mm².

Size was more like ~480mm² if the diagram is to be believed. ~30% cut in die size. The cut was prolly more significant than that, because we know that the via solution for RV740 was "a doubling of vias" which made RV740 grow.

So what else? Sideport is mentioned, but that's probably only good for 10mm² or so at best. What else could have been in there? Wider internal data paths (though the article says "features" had to go)? Cache sizes (not that they take up a whole lot of space)? In any case it can't have been something which required rebuilding of whole blocks, as that would have led to a much larger delay.

For the sideport to be useful it prolly needs to be much meatier than that seen in RV770, because that sideport's bandwidth is nothing to write home about (it is literally superfluous). 10x more bandwidth?

Also I think a complete revamp of the cache system is due. Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs. A cache system with one set of atomics close to the ALUs would do all this, making the atomics run on L1 which is dual-purpose L1/LDS. We get back to the old topic of making such atomics globally coherent, something discussed at length in the GT300 thread, which is a serious problem.

Getting rid of the ROPs also has implications for early-Z.

The "peculiarly asymmetric" handling of caches for RW that Gipsel and I have been discussing might be a side effect of AMD simply deleting the fancy cache stuff. Retreating to a slightly enhanced version of what's in R700? R700 only supports a single UAV, but Evergreen has to support 8 for D3D11 compliance.

There might be some clues in the "reserved" gaps in the opcodes seen in the ISA

e.g. these 12 values:

19 EXPORT_RAT_INST_DEC_UINT: dst = ((dst==0) | (dst > src)) ? src : dst-1.
31:20 Reserved.
32 EXPORT_RAT_INST_NOP_RTN: Internal use by SX only (flush+ack with no opcode). Return dword.

The RAT ID actually has space for 16 UAVs (coming in D3D11.1?) but the bit range there is 9 bits ([8:4] is unused).
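As a rough illustration of the quoted DEC_UINT rule (wrap to src when the counter is 0 or above src, otherwise decrement), here is a minimal Python sketch. The function names and the 32-bit mask are assumptions for illustration only and are not taken from the ISA document; it also checks that the reserved range 31:20 covers 12 opcode values and that a 9-bit field with [8:4] unused leaves 4 bits, i.e. 16 IDs.

```python
MASK32 = 0xFFFFFFFF  # assumption: operands are treated as 32-bit unsigned values


def rat_dec_uint(dst: int, src: int) -> int:
    """New destination value under the quoted rule:
    dst = ((dst == 0) | (dst > src)) ? src : dst - 1
    """
    dst &= MASK32
    src &= MASK32
    # Wrap back to src when the counter is 0 or already above the limit.
    return src if (dst == 0 or dst > src) else dst - 1


def inclusive_range_size(hi: int, lo: int) -> int:
    """Number of values in an inclusive range such as 31:20."""
    return hi - lo + 1


def ids_in_field(total_bits: int, unused_bits: int) -> int:
    """Distinct IDs addressable once the unused bits are discounted."""
    return 1 << (total_bits - unused_bits)


if __name__ == "__main__":
    value = 2
    for _ in range(4):
        value = rat_dec_uint(value, 3)
        print(value)  # counts down to 0, then wraps back to src
    print(inclusive_range_size(31, 20))
    print(ids_in_field(9, 5))
```

The wrap-around means a counter driven by this rule cycles through src, src-1, ..., 0 and back to src.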

Then you get into the whole topic of whether faster setup is required. And whether that's predicated on enhancements in the cache system.

Also, I'd say they missed the 2x performance target. Granted it's there in theoretical flops but not really in practice, though maybe the 2x target was only in theoretical flops...

It seems to me "shader core" is the basis for that.

Jawed

ferro · Feb 16, 2010

Dave Baumann said:
And those type of features are more suited to lower quality SD (or lower) sources than cleaner HD sources to begin with.

Agreed. Post processing options that are supposed to "enhance" an original digital image are the first I turn off in any device (Blu-Ray player, cable/satellite receiver, TV, AVR, ...). To me the 5450 is the perfect HTPC card. It was a real life-saver when it turned out that clarkdale does not do 23.976Hz.

nAo · Feb 16, 2010

Jawed said:
Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs. A cache system with one set of atomics close to the ALUs would do all this, making the atomics run on L1 which is dual-purpose L1/LDS. We get back to the old topic of making such atomics globally coherent, something discussed at length in the GT300 thread, which is a serious problem.

I believe there is a third set of atomics units that handle atomics on the GDS memory.

MfA · Feb 16, 2010

Jawed said:
Also I think a complete revamp of the cache system is due. Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs.

So does Fermi ... even if you did make a multilevel globally coherent cache and allowed global memory atomics at L1 you would probably want to keep units for atomic operation on the memory controller caches as well.

Coherency is not free, some of the lighter weight methods area wise will increase the costs runtime wise. If you (the developer or compiler) know cacheline/atomic ownership would just ping-pong between SIMDs it would almost certainly be better to just keep it at the memory controller. It's not like the operations occupy a lot of area.

Jawed · Feb 16, 2010

nAo said:
I believe there is a third set of atomics units that handle atomics on the GDS memory.

Sigh, I missed those. I thought the atomics were just on semaphores/counters.

Jawed

CarstenS · Feb 16, 2010

Kaotik said:
At least as far as I've understood it, the microstuttering for example is a side effect from the fact that the GPUs don't run exactly synchronized

AFAIU, that's true: As long as you're not ready to give up some of the performance gain from AFR to synchronize, you'll end up oftentimes with different frame rendering times. Thus, maximum performance for a Fraps/benchmark measurement and maximum user experience for those who actually play the games exclude each other. Shame.
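To make the trade-off concrete, here is a small Python sketch with made-up frame intervals (the 20 ms and 5/35 ms figures are purely illustrative, not measurements). Both sequences report the same average frame rate, yet the unsynchronised one has a much longer worst-case gap between frames, which is what a player would perceive as microstutter.

```python
def timestamps_from(intervals_ms):
    """Build presentation timestamps from a list of frame-to-frame intervals."""
    stamps = [0.0]
    for dt in intervals_ms:
        stamps.append(stamps[-1] + dt)
    return stamps


def frame_intervals(timestamps_ms):
    """Time between consecutive frames."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def average_fps(timestamps_ms):
    """Frames per second as a Fraps-style average over the whole run."""
    elapsed_ms = timestamps_ms[-1] - timestamps_ms[0]
    return 1000.0 * (len(timestamps_ms) - 1) / elapsed_ms


# Hypothetical numbers: evenly paced frames vs. two GPUs finishing close together.
even = timestamps_from([20.0] * 8)
uneven = timestamps_from([5.0, 35.0] * 4)

if __name__ == "__main__":
    for name, stamps in (("even", even), ("uneven", uneven)):
        gaps = frame_intervals(stamps)
        print(name, average_fps(stamps), max(gaps))
```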

Jawed · Feb 16, 2010

MfA said:
So does Fermi ... even if you did make a multilevel globally coherent cache and allowed global memory atomics at L1 you would probably want to keep units for atomic operation on the memory controller caches as well.

Coherency is not free, some of the lighter weight methods area wise will increase the costs runtime wise. If you (the developer or compiler) know cacheline/atomic ownership would just ping-pong between SIMDs it would almost certainly be better to just keep it at the memory controller. It's not like the operations occupy a lot of area.

Centralised global atomics incur a high serialisation latency. Ignoring cache miss latency there's still the length of the queue/throughput.

In theory, wherever the latency lies it can be hidden with enough threads, but throughput is going to be better with L1 based atomics.

Larrabee appears to be going for high throughput at some (unknown) coherency-latency, based on L2 being close to the core. I know you don't like that but so far we only have Cypress to play with... Well, there's GT200's global memory atomics too, but there's no caching at all.

We've already seen GT200 wasted by a CPU when doing histogramming:

http://forum.beyond3d.com/showthread.php?p=1305053#post1305053

(later posts get improved performance on both platforms). The cached atomics on these SM5 GPUs should be quite interesting - though the tweaked local memory approach that performs best might still be the ultimate.
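For readers unfamiliar with the two histogram strategies being compared, here is a minimal Python sketch. It is not the benchmark code from the linked thread; the function names, the round-robin assignment of values to groups, and the tiny bin count are all assumptions made for illustration. It counts how many updates would hit the shared (global) histogram when every increment is a global atomic, versus when each group accumulates privately and merges once at the end.

```python
def shared_histogram(values, bins):
    """Every increment goes straight to the shared histogram (one global atomic each)."""
    hist = [0] * bins
    global_updates = 0
    for v in values:
        hist[v % bins] += 1
        global_updates += 1
    return hist, global_updates


def privatised_histogram(values, bins, groups):
    """Each group accumulates into its own local copy, then merges non-zero bins once."""
    local = [[0] * bins for _ in range(groups)]
    for i, v in enumerate(values):
        local[i % groups][v % bins] += 1  # assumption: round-robin work split
    hist = [0] * bins
    global_updates = 0
    for copy in local:
        for b, count in enumerate(copy):
            if count:
                hist[b] += count  # one global atomic add per non-zero local bin
                global_updates += 1
    return hist, global_updates


if __name__ == "__main__":
    data = [1, 1, 1, 1, 2, 2, 3, 3] * 100
    print(shared_histogram(data, 4))
    print(privatised_histogram(data, 4, 4))
```

The private copies only pay off when they fit in local memory, which is exactly why a very large bin count makes the comparison interesting.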

Jawed

3dilettante · Feb 16, 2010

Larrabee's scheme is to require the programmer/software layer to pin threads in such a way that atomics reside within a single core's cache.
In that instance, the latency is whatever handful of cycles the cache operation takes.

Anything outside of this ideal case is going to be slow. I'd be curious on the exact steps and what the costs would have been if Larrabee had been released.
Something like a broadcast invalidate and attempt to get a cache line in an exclusive state could be, in a naive implementation without a directory or help from the memory controller, worse than what GPUs would get with uncached global memory atomics.
Depending on the capability of the cache controller and ring bus, the controller may be monopolized servicing all the broadcast responses for all the cores, and the ring bus at that stop filled with nothing but coherence/invalidate traffic for tens of cycles just from one core's miss.
So, it would be really important to keep the atomic on one core, and thus more closely approximate an LDS atomic than a global one.

MfA · Feb 16, 2010

Jawed said:
Centralised global atomics incur a high serialisation latency. Ignoring cache miss latency there's still the length of the queue/throughput.

An atomic operation on a cacheline currently in another SIMD's L1 cache including the transfer of ownership to the local L1 incurs much more overhead. If you know ahead of time the atomic is going to be accessed incoherently by the different SIMDs it's best to just put it in L2 and avoid the need to move shit around.
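A toy cost model of that argument, in Python. All cycle counts are hypothetical placeholders rather than figures for any real GPU, and the model ignores queueing and the initial miss; it only shows how ownership ping-pong can swamp the benefit of a fast L1 hit, while pinned access favours the L1 scheme.

```python
def ownership_transfers(access_sequence):
    """Count how often the cache line has to move to a different SIMD's L1."""
    transfers = 0
    owner = None
    for simd in access_sequence:
        if owner is not None and simd != owner:
            transfers += 1
        owner = simd
    return transfers


def l1_atomic_cost(access_sequence, hit_cycles, transfer_cycles):
    """Total cycles when atomics execute in whichever L1 currently owns the line."""
    moves = ownership_transfers(access_sequence)
    return len(access_sequence) * hit_cycles + moves * transfer_cycles


def central_atomic_cost(access_sequence, l2_cycles):
    """Total cycles when every atomic is serviced at the L2 / memory controller."""
    return len(access_sequence) * l2_cycles


# Hypothetical costs, chosen only to illustrate the shape of the trade-off.
HIT, TRANSFER, L2 = 4, 100, 40

pinned = [0] * 8        # one SIMD owns the atomic throughout
ping_pong = [0, 1] * 4  # two SIMDs alternate on the same line

if __name__ == "__main__":
    for name, seq in (("pinned", pinned), ("ping_pong", ping_pong)):
        print(name, l1_atomic_cost(seq, HIT, TRANSFER), central_atomic_cost(seq, L2))
```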

Jawed · Feb 16, 2010

MfA said:
An atomic operation on a cacheline currently in another SIMD's L1 cache including the transfer of ownership to the local L1 incurs much more overhead. If you know ahead of time the atomic is going to be accessed incoherently by the different SIMDs it's best to just put it in L2 and avoid the need to move shit around.

I think it's too early to judge best as this is uncharted territory. Intel clearly thinks 32+ cores are viable based on distributed L2 with coherency.

Though Intel's research projects look more and more like the Transputer

Dammit, I'd love to see what you could do with that architecture, 3 billion transistors and 3GHz...

Jawed

MfA · Feb 16, 2010

They have nearly a full year process node advantage ... they don't need efficiency, just execution. Just because a familiar cache coherency mechanism suits them best doesn't mean it's best, even if it successfully competes.

3dilettante · Feb 16, 2010

The general solution with bog-standard broadcast cache coherency is going to be pinning threads and writing your code so atomics do not hop cores.

The L2 doesn't help too much with core bouncing, as other cores do not have write access to a local tile and will need to go through an invalidate broadcast+load with attempt at exclusivity.
The L2 would help reduce the amount of time it would take to perform an atomic on a cache line that gets evicted from the L1. It would be possibly ~10 cycles to service in that case.

Jawed · Feb 17, 2010

I can only reference the shock and awe that arose out of the actual experiments with the 65536-bin histogram: this is uncharted territory. There were accusations of unfairness to GPUs floating around.

Anyway, this discussion can't be about just atomics, since arbitrary RW of multiple UAVs is the more general issue.

Jawed

dkanter · Feb 17, 2010

3dilettante: Pinning threads to cores is fine for LRB....that's what the uOS and driver are there for. Unlike with a CPU, you don't need to worry about the OS doing crazy stuff and moving your threads around.

MfA: Who do you think has a full year process advantage? Intel certainly does over TSMC/GF, but they aren't going to ship LRB on cutting edge silicon. The die is too large...maybe a 6 month lead.

DK

mczak · Feb 17, 2010

Jawed said:
Size was more like ~480mm² if the diagram is to be believed. ~30% cut in die size. The cut was prolly more significant than that, because we know that the via solution for RV740 was "a doubling of vias" which made RV740 grow.

I think the diagram was just made up by anandtech. The text talks about 20x20 chip, with potential to grow to 22x22 (which gives you the ~480mm²) if necessary. Nowhere does it say how big it actually was at that point.
Maybe some fancy cache system indeed went away, but it doesn't look to me like that would be the only other thing (beside sideport) either, no matter if you assume 400 or 480 mm²...
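The arithmetic behind the two starting points, as a quick Python check. The 330 mm² final size and the 20x20 / 22x22 edge lengths are the figures quoted in this thread; the helper names are just for illustration.

```python
def square_die_area(edge_mm):
    """Area in mm² of a square die with the given edge length."""
    return edge_mm * edge_mm


def percent_cut(before_mm2, after_mm2):
    """Reduction in area as a percentage of the starting size."""
    return 100.0 * (before_mm2 - after_mm2) / before_mm2


if __name__ == "__main__":
    final = 330
    for edge in (20, 22):
        start = square_die_area(edge)
        print(edge, start, round(percent_cut(start, final), 1))
```

So a 22x22 starting point gives roughly the ~30% cut Jawed mentions, while a 20x20 starting point implies a cut of under 20%.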

Lightman · Feb 17, 2010

Sorry to post this here but I didn't know where to put this!

To the point!
Does anyone know anything about this ATi/XFX/EC event in Coventry/UK on Friday 19th Feb.?
http://www.eclipsecomputers.com/images/eshot/ee16022010/eshot.html

What it will be, what will be shown, who will be attending?
Is it worth taking a day off work to go there (not to mention driving 200 miles)?

Answers please!

Florin · Feb 17, 2010

Lightman said:
What it will be, what will be shown, who will be attending?
Is it worth taking day off work to go there (not to mention driving 200miles)?

Answers please!

AvP?
