AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

Total voters: 155. Poll closed.
Good enough for some.. sure.. for someone who wants a compact, silent HTPC card without having to kill most post-processing at HD..
And those types of features are more suited to lower-quality SD (or lower) sources than to cleaner HD sources to begin with.
 
The RV870 Story: AMD Showing up to the Fight

http://anandtech.com/video/showdoc.aspx?i=3740

Very interesting indeed.
Though I'm really wondering how they got the die size down so much. Reading that, it seems like they didn't cut down on the number of SIMDs, which is sort of hard to believe given that the size needed to go down from 400+ mm² to 330 mm². So what else? The sideport is mentioned, but that's probably only good for 10 mm² or so at best. What else could have been in there? Wider internal data paths (though the article says "features" had to go)? Cache sizes (not that they take up a whole lot of space)? In any case it can't have been something which required rebuilding of whole blocks, as that would have led to a much larger delay.
Also, I'd say they missed the 2x performance target. Granted it's there in theoretical flops but not really in practice, though maybe the 2x target was only in theoretical flops...
 
Good enough for some.. sure.. for someone who wants a compact, silent HTPC card without having to kill most post-processing at HD.. hardly, and most certainly not over the previous gen (with the sole exception being bitstreaming) such as the 4550 ($19.99 to $44.99, vs. $44.99 to $82.99 for the 5450).

And despite that it's still by far the best solution for HTPC in that price range and more than enough for almost everyone building an HTPC. The only card better is the 5570, but there you're getting into more heat and power consumption, something that isn't always conducive to small and silent. It can either be small and actively cooled, or silent and passively cooled but requiring a larger enclosure with active cooling.

Thanks, I'll go for the 5450 any day.

Regards,
SB
 
Very interesting indeed.
Though I'm really wondering how they got the die size down so much. Reading that, it seems like they didn't cut down on the number of SIMDs, which is sort of hard to believe given that the size needed to go down from 400+ mm² to 330 mm².
Size was more like ~480mm² if the diagram is to be believed. ~30% cut in die size. The cut was prolly more significant than that, because we know that the via solution for RV740 was "a doubling of vias" which made RV740 grow.

So what else? The sideport is mentioned, but that's probably only good for 10 mm² or so at best. What else could have been in there? Wider internal data paths (though the article says "features" had to go)? Cache sizes (not that they take up a whole lot of space)? In any case it can't have been something which required rebuilding of whole blocks, as that would have led to a much larger delay.
For the sideport to be useful it prolly needs to be much meatier than that seen in RV770, because that sideport's bandwidth is nothing to write home about (it is literally superfluous). 10x more bandwidth?

Also, I think a complete revamp of the cache system is due. Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs. A cache system with one set of atomics close to the ALUs would cover both, making the atomics run on an L1 that is dual-purpose L1/LDS. We get back to the old topic of making such atomics globally coherent, something discussed at length in the GT300 thread, which is a serious problem.
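To make the distinction concrete, here's a minimal CUDA-flavoured sketch (kernel names invented; AMD's IL/ISA obviously looks nothing like this): today the hardware needs separate units to service the two kinds of atomics below, whereas a dual-purpose L1/LDS with its own atomic units could service both close to the ALUs.

// bump_local leans on LDS/shared-memory atomics (the units beside the ALUs);
// bump_global hits global-memory atomics (serviced further out, e.g. ROP-side
// on Evergreen). Kernel names are made up for illustration.
__global__ void bump_local(unsigned int *out)
{
    __shared__ unsigned int counter;            // lives in LDS
    if (threadIdx.x == 0) counter = 0;
    __syncthreads();
    atomicAdd(&counter, 1u);                    // LDS-side atomic unit
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(out, counter);                // one global atomic per block
}

__global__ void bump_global(unsigned int *out)
{
    atomicAdd(out, 1u);                         // every thread hits the global unit
}

The hard part, as above, is the global case: once global atomics run in a per-SIMD L1, those L1s have to be kept coherent.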

Getting rid of the ROPs also has implications for early-Z.

The "peculiarly asymmetric" handling of caches for RW that Gipsel and I have been discussing might be a side effect of AMD simply deleting the fancy cache stuff. Retreating to a slightly enhanced version of what's in R700? R700 only supports a single UAV, but Evergreen has to support 8 for D3D11 compliance.

There might be some clues in the "reserved" gaps in the opcodes seen in the ISA :LOL: e.g. these 12 values:

19    EXPORT_RAT_INST_DEC_UINT: dst = ((dst == 0) | (dst > src)) ? src : dst - 1
31:20 Reserved.
32    EXPORT_RAT_INST_NOP_RTN: Internal use by SX only (flush+ack with no opcode). Return dword.

The RAT ID actually has space for 16 UAVs (coming in D3D11.1?) but the bit range there is 9 bits ([8:4] is unused).
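Incidentally, that DEC_UINT expression (with the parentheses balanced) is the same wrapping decrement CUDA exposes as atomicDec(). A plain-C restatement of my reading of the quoted line:

// EXPORT_RAT_INST_DEC_UINT as I read it: decrement, but reset to src when
// dst is 0 or has run past src (the atomicity plumbing is omitted here).
unsigned int dec_uint(unsigned int dst, unsigned int src)
{
    return (dst == 0 || dst > src) ? src : dst - 1;
}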

Then you get into the whole topic of whether faster setup is required. And whether that's predicated on enhancements in the cache system.

Also, I'd say they missed the 2x performance target. Granted it's there in theoretical flops but not really in practice, though maybe the 2x target was only in theoretical flops...
It seems to me "shader core" is the basis for that.

Jawed
 
And those types of features are more suited to lower-quality SD (or lower) sources than to cleaner HD sources to begin with.

Agreed. Post-processing options that are supposed to "enhance" an original digital image are the first thing I turn off in any device (Blu-ray player, cable/satellite receiver, TV, AVR, ...). To me the 5450 is the perfect HTPC card. It was a real life-saver when it turned out that Clarkdale does not do 23.976 Hz.
 
Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs. A cache system with one set of atomics close to the ALUs would cover both, making the atomics run on an L1 that is dual-purpose L1/LDS. We get back to the old topic of making such atomics globally coherent, something discussed at length in the GT300 thread, which is a serious problem.
I believe there is a third set of atomic units that handles atomics on the GDS memory.
 
Also, I think a complete revamp of the cache system is due. Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs.
So does Fermi ... even if you did make a multilevel globally coherent cache and allowed global-memory atomics at L1, you would probably want to keep units for atomic operations at the memory-controller caches as well.

Coherency is not free: some of the methods that are lighter-weight area-wise will increase costs runtime-wise. If you (the developer or compiler) know cacheline/atomic ownership would just ping-pong between SIMDs, it would almost certainly be better to just keep it at the memory controller. It's not like the operations occupy a lot of area.
 
At least as far as I've understood it, the microstuttering for example is a side effect of the fact that the GPUs don't run exactly synchronized.

AFAIU, that's true: as long as you're not ready to give up some of the performance gain from AFR to synchronize, you'll often end up with different frame rendering times. Thus, maximum performance in a Fraps/benchmark measurement and maximum user experience for those who actually play the games are mutually exclusive. Shame.
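A toy worked example of why the averages hide it (all numbers invented): two GPUs in AFR alternating 10 ms and 23 ms frame-to-frame gaps still report a healthy average, but perceived smoothness tracks the worst gap.

// Host-side toy: alternating AFR frame gaps vs. the average a benchmark reports.
#include <cstdio>

int main()
{
    const double gaps_ms[] = { 10.0, 23.0, 10.0, 23.0, 10.0, 23.0 };  // invented
    double sum = 0.0, worst = 0.0;
    for (double g : gaps_ms) { sum += g; if (g > worst) worst = g; }
    const double avg = sum / 6.0;
    std::printf("avg %.1f ms (~%.0f fps), worst %.1f ms (~%.0f fps)\n",
                avg, 1000.0 / avg, worst, 1000.0 / worst);
    // -> avg ~16.5 ms (~61 fps), yet every other frame feels like ~43 fps.
    return 0;
}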
 
So does Fermi ... even if you did make a multilevel globally coherent cache and allowed global-memory atomics at L1, you would probably want to keep units for atomic operations at the memory-controller caches as well.

Coherency is not free: some of the methods that are lighter-weight area-wise will increase costs runtime-wise. If you (the developer or compiler) know cacheline/atomic ownership would just ping-pong between SIMDs, it would almost certainly be better to just keep it at the memory controller. It's not like the operations occupy a lot of area.
Centralised global atomics incur a high serialisation latency. Ignoring cache-miss latency, there's still the length of the queue/throughput to contend with.

In theory, wherever the latency lies it can be hidden with enough threads, but throughput is going to be better with L1-based atomics.
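As a rough Little's-law illustration of that latency hiding (all numbers invented): operations in flight = latency × throughput, so a long round-trip to a centralised atomic unit demands a lot of threads just to stay busy.

// Little's law toy: atomics in flight = latency x sustained throughput.
#include <cstdio>

int main()
{
    const int latency_cycles = 400;  // assumed round-trip to a central atomic unit
    const int ops_per_cycle  = 1;    // desired sustained atomic throughput
    std::printf("~%d atomics must be in flight to hide the latency\n",
                latency_cycles * ops_per_cycle);
    return 0;
}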

Larrabee appears to be going for high throughput at some (unknown) coherency-latency, based on L2 being close to the core. I know you don't like that, but so far we only have Cypress to play with... Well, there's GT200's global memory atomics too, but there's no caching at all.

We've already seen GT200 wasted by a CPU when doing histogramming:

http://forum.beyond3d.com/showthread.php?p=1305053#post1305053

(later posts get improved performance on both platforms). The cached atomics on these SM5 GPUs should be quite interesting - though the tweaked local memory approach that performs best might still be the ultimate.
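For reference, the "tweaked local memory approach" is the usual privatised-histogram pattern: per-block bins in LDS/shared memory, with global atomics only for the final merge. A minimal CUDA sketch (256 bins for brevity; the 65536-bin case in that thread is hard precisely because it no longer fits in local memory):

// Per-block histogram privatised in shared memory; global atomics are only
// used for the merge, so contention on the central units stays low.
__global__ void hist256(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          // cheap local atomic

    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);           // few global atomics per block
}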

Jawed
 
Larrabee's scheme is to require the programmer/software layer to pin threads in such a way that atomics reside within a single core's cache.
In that instance, the latency is whatever handful of cycles the cache operation takes.

Anything outside of this ideal case is going to be slow. I'd be curious about the exact steps and what the costs would have been if Larrabee had been released.
Something like a broadcast invalidate and an attempt to get a cache line in an exclusive state could be, in a naive implementation without a directory or help from the memory controller, worse than what GPUs would get with uncached global memory atomics.
Depending on the capability of the cache controller and ring bus, the controller may be monopolized servicing all the broadcast responses from all the cores, and the ring bus at that stop could be filled with nothing but coherence/invalidate traffic for tens of cycles, just from one core's miss.
So it would be really important to keep the atomic on one core, and thus more closely approximate an LDS atomic than a global one.
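A back-of-envelope model of that worry (every number below is invented for illustration): if a contended line costs a ring traversal plus a response from every core, the core-local case wins by an order of magnitude.

// Toy cost model for the two cases above; all cycle counts are invented.
#include <cstdio>

int main()
{
    const int cores    = 32;   // assumed Larrabee-class core count
    const int ring_hop = 2;    // cycles per ring stop (invented)
    const int l2_hit   = 10;   // core-local L2 atomic (invented)

    // Naive broadcast invalidate: walk the ring, collect a response from
    // every core, then finally perform the operation locally.
    const int bounce = cores * ring_hop + cores + l2_hit;

    std::printf("core-local atomic: ~%d cycles\n", l2_hit);
    std::printf("bouncing line:     ~%d cycles\n", bounce);
    return 0;
}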
 
Centralised global atomics incur a high serialisation latency. Ignoring cache-miss latency, there's still the length of the queue/throughput to contend with.
An atomic operation on a cacheline currently in another SIMD's L1 cache, including the transfer of ownership to the local L1, incurs much more overhead. If you know ahead of time that the atomic is going to be accessed incoherently by the different SIMDs, it's best to just put it in L2 and avoid the need to move shit around.
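One software-side way to express "don't move it around" (a sketch, not anyone's actual API): give each block/SIMD its own cache-line-sized slot so no line is ever shared, and combine afterwards.

// Each block gets a slot padded to an assumed 128-byte line, so no two
// blocks ever contend for the same cache line; the host sums the slots.
struct __align__(128) Slot { unsigned int count; };

__global__ void count_per_block(const int *flags, int n, Slot *slots)
{
    unsigned int hits = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        hits += (flags[i] != 0);
    atomicAdd(&slots[blockIdx.x].count, hits);   // stays within one block's line
}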
 
An atomic operation on a cacheline currently in another SIMD's L1 cache, including the transfer of ownership to the local L1, incurs much more overhead. If you know ahead of time that the atomic is going to be accessed incoherently by the different SIMDs, it's best to just put it in L2 and avoid the need to move shit around.
I think it's too early to judge what's best, as this is uncharted territory. Intel clearly thinks 32+ cores are viable based on a distributed L2 with coherency.

Though Intel's research projects look more and more like the Transputer :LOL: Dammit, I'd love to see what you could do with that architecture, 3 billion transistors and 3GHz...

Jawed
 
They have nearly a full year's process-node advantage ... they don't need efficiency, just execution. Just because a familiar cache coherency mechanism suits them best doesn't mean it's best, even if it successfully competes.
 
The general solution with bog-standard broadcast cache coherency is going to be pinning threads and writing your code so atomics do not hop cores.

The L2 doesn't help too much with core bouncing, as other cores do not have write access to a local tile and will need to go through an invalidate broadcast plus a load attempting exclusivity.
The L2 would help reduce the amount of time it takes to perform an atomic on a cache line that has been evicted from the L1. It would possibly take ~10 cycles to service in that case.
 
I can only reference the shock and awe that arose out of the actual experiments with the 65536-bin histogram: this is uncharted territory. There were accusations of unfairness to GPUs floating around :oops:

Anyway, this discussion can't be about just atomics, since arbitrary RW of multiple UAVs is the more general issue.

Jawed
 
3dilettante: Pinning threads to cores is fine for LRB... that's what the uOS and driver are there for. Unlike with a CPU, you don't need to worry about the OS doing crazy stuff and moving your threads around.

MfA: Who do you think has a full year's process advantage? Intel certainly does over TSMC/GF, but they aren't going to ship LRB on cutting-edge silicon. The die is too large... maybe a 6-month lead.


DK
 
Size was more like ~480mm² if the diagram is to be believed. ~30% cut in die size. The cut was prolly more significant than that, because we know that the via solution for RV740 was "a doubling of vias" which made RV740 grow.
I think the diagram was just made up by Anandtech. The text talks about a 20x20 mm chip, with the potential to grow to 22x22 mm (which gives you the ~480 mm²) if necessary. Nowhere does it say how big it actually was at that point.
Maybe some fancy cache system indeed went away, but it doesn't look to me like that would be the only other thing (besides the sideport) either, no matter whether you assume 400 or 480 mm²...
 