AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Hiroshi Goto's look at the Graphics Core Next architecture (compared to Cayman):

http://translate.google.com/transla...co.jp/docs/column/kaigai/20110622_454826.html
[attached diagram: 70901478.jpg]


Why does the new architecture still support GDS, as shown above?
Doesn't the coherent L2 cache eliminate the need for dedicated (i.e. non-standard) global sync memory anyway?
 
How's about we start begging for a die shot now?
I'm starting to think the RV770 one was only released because some guy at ATI got drunk and accidentally sexted what he thought was a picture of his junk.

I'm thinking this design would look interesting in a side-by-side comparison.

Agreed, AMD definitely needs to hire Anthony Weiner.
 
http://img837.imageshack.us/img837/484/70901478.jpg

Why does the new architecture still support GDS, as shown above?
Doesn't the coherent L2 cache eliminate the need for dedicated (i.e. non-standard) global sync memory anyway?
Compatibility with existing stuff?

By the way, I think the ROPs still retain their dedicated color/Z caches (I think it was even mentioned during the presentation). They are kind of specialized in streaming tiles of the framebuffer in and out (and also handle its compression in the case of MSAA render targets), which would only pollute the L2 cache. The CUs have two ways to write to memory (there are even two separate pipelines for handling this): the vector memory pipeline, which handles input and output through the L1/L2 caches and ensures coherency, and the export pipeline, which is there specifically to write to the ROPs for graphics workloads.

Edit:
And wasn't the vertex cache unified with the texture cache in R700 already?
 
Compatibility with existing stuff?
I'd say that's a good reason.
Another is that the design seems kind of top-heavy with regard to the number of (write-through) L1 clients versus L2 slices. Fermi also burdens its L2 caches with multiple clients that contend for them, though there it's 16x6, so the ratio isn't as high as in AMD's solution, and the write-through seems to default to per-kernel synchronization as opposed to per-write. On the other hand, Fermi leans on the cache hierarchy for data-share duty as well.

The L2 in both designs looks like it cheaps out on coherence by linking each segment to a single memory controller and portion of the address space.

I'm very curious how AMD balances complexity and die area for a read/write crossbar between 32 write-through L1s and some number of L2 caches (4?) and how much contention this could produce for them.

By the way, I think the ROPs still retain their dedicated color/Z caches (I think it was even mentioned during the presentation). They are kind of specialized in streaming tiles of the framebuffer in and out (and also handle its compression in the case of MSAA render targets), which would only pollute the L2 cache.
In addition, it would be yet another write burden on the heavily outnumbered L2 sections, and the ROPs would be generating quite a bit of traffic even if they didn't pollute the cache too much.

Edit:
And wasn't the vertex cache unified with the texture cache in R700 already?
According to the Cypress overview, it was present in RV770; however, it was disabled in later driver revisions.
 
Why does the new architecture still support GDS, as shown above?

Because a (theoretically) software-controlled global scratchpad is pretty damn nice, IMHO. They also have a bunch of hardware mechanisms in place there that make it fast for special cases (Append/Consume buffers, counters, etc.). Now if only they'd actually get the stuff exposed to the programmer in some way... a zillion years after promising they'd do it.
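For what it's worth, here is roughly what an Append buffer boils down to if you have to emulate it yourself with a plain global atomic instead of the GDS-backed counters. Just an illustrative CUDA-flavoured sketch (all names made up by me), not anything AMD exposes:

__global__ void append_if_positive(const float *in, float *out,
                                    unsigned int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        // One global atomic per appended element; the dedicated GDS counters
        // are there to make exactly this pattern cheaper.
        unsigned int slot = atomicAdd(count, 1u);
        out[slot] = in[i];
    }
}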
 
How are they going to do that without changing the OpenCL spec or breaking compatibility with other OpenCL vendors?
 
Extension. :)

Speaking of extensions, I think it would be a good idea for Khronos to expose intra-warp/wave intrinsics. These instructions are too useful not to be included in the spec.
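Something like this is what I mean (CUDA-flavoured sketch; take the exact intrinsic names as illustrative rather than as any particular vendor's API): a warp-wide reduction and a warp-wide vote done entirely with register exchanges, no shared memory and no barrier.

__device__ float warp_sum(float v)
{
    // Butterfly reduction across the 32 lanes of a warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;   // lane 0 ends up with the warp-wide sum
}

__device__ bool warp_any_negative(float v)
{
    // Warp-wide vote: did any lane see a negative value?
    return __ballot_sync(0xffffffffu, v < 0.0f) != 0u;
}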
 
Indeed, but there is no vertex cache inside each SIMD either way. If there is one (not all models have one), it's a single cache shared between all of them.
Fermi also burdens its L2 caches with multiple clients that contend for them, though there it's 16x6, so the ratio isn't as high as in AMD's solution, and the write-through seems to default to per-kernel synchronization as opposed to per-write.
As far as I understood the presentation, GCN also writes the L1's dirty content to the L2 at the end of the kernel (for each wavefront), so it is a delayed write-through as well. As you have no good means of synchronizing between different CUs (actually you could do it with GWS/GDS), that's quite acceptable. Current programming models assume no dependencies between different thread groups / thread blocks (each consisting of multiple wavefronts/warps which are processed on the same CU/SM).
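To make that programming-model point concrete, a minimal CUDA-flavoured sketch (block size and names are just illustrative): synchronization is only expressible within a block/work-group, so a write-through that is delayed until the end of a wavefront doesn't break anything the model promised.

__global__ void scale_tile(float *data)      // assumes blockDim.x == 256
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];
    __syncthreads();                 // legal: sync within one block/work-group
    data[i] = 2.0f * tile[threadIdx.x];

    // There is no cross-block sync inside a kernel; writes from other blocks
    // are only guaranteed visible at the next kernel launch, which is exactly
    // the granularity a delayed write-through at wavefront end lines up with.
}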
The L2 in both designs looks like it cheaps out on coherence by linking each segment to a single memory controller and portion of the address space.
Isn't every cache that isn't fully associative doing that (maybe not linked to the memory controller, but having sets that are each responsible for some part of the memory space)?
 
As far as I understood the presentation, GCN also writes the L1's dirty content to the L2 at the end of the kernel (for each wavefront), so it is a delayed write-through as well.
I interpreted the slides as indicating it's write-through after the end of every wavefront, and that there were multiple wavefronts per kernel, so there would be more writes in flight.

Isn't every cache that isn't fully associative doing that (maybe not linked to the memory controller, but having sets that are each responsible for some part of the memory space)?
There is no guarantee between normal caches that what appears in one cache will never appear in the other. I interpreted the description as meaning that cache slices themselves are 16-way associative within their allocated space, and they are controlled independently.
This works, but the storage offered by the L2 is less flexible than a general 16-way cache.

I wonder how they allocate the address space amongst them to help minimize the chance of traffic falling too heavily on one section.
 
I interpreted the slides as indicating it's write-through after the end of every wavefront, and that there were multiple wavefronts per kernel, so there would be more writes in flight.
As you need to synchronize reads/writes within a thread group (several wavefronts) either way (if you access the same addresses), it amounts to basically the same thing, I think. It's not like it switches off the coalescing of the writes.
There is no guarantee between normal caches that what appears in one cache will never appear in the other.
Actually there is in a set-associative cache. At least if you don't think of the slices as separate caches but just as sub-sets of the cache. Then you have your guarantee ;)
In a set-associative cache, you always have sets of (fully associative) sub-caches, each owning a part of the memory space (meaning that what is in one set can't be in another), just like the slices on the GPUs. The decision about which set memory at a certain address gets cached in is basically made the same way as the mapping to a certain slice on GPUs. Some part of the address (let's say bits 6 to 8) gives you the slice number, which would lead to ...
I wonder how they allocate the address space amongst them to help minimize the chance of traffic falling too heavily on one section.
... an interleaving of the addresses between the slices at cacheline granularity (this can be chosen differently to increase the number of bytes transferred in a single transaction, or to avoid aliasing with certain common strides, though I think the latter reason is more relevant to L1 caches).
I interpreted the description as meaning that cache slices themselves are 16-way associative within their allocated space, and they are controlled independently.
This works, but the storage offered by the L2 is less flexible than a general 16-way cache.
Not necessarily.
I missed the 16-way associativity. So let's do some math with it, assuming a 512 kB total size and a 64 byte line size.
16 * 64 = 1024 bytes per set. That means you have 512 fully associative sets in your cache and need to choose 9 bits of the memory address to determine the exact set a line will be cached in. As long as you don't plan to build a GPU with more than 512 memory channels, this won't limit you in any fundamental way. The slices just group several sets together; it is more an issue of the physical implementation. With 8 slices, each slice would consist of 64 sets of the cache.
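Or the same thing as a few lines of code, just to make the bit fields explicit (the 512 kB / 16-way / 64 byte / 8-slice numbers are the assumptions above, nothing AMD has confirmed):

#include <stdint.h>
#include <stdio.h>

enum { LINE = 64, WAYS = 16, SIZE = 512 * 1024, SLICES = 8 };
enum { SETS = SIZE / (LINE * WAYS) };        // 512 sets -> 9 index bits
enum { SETS_PER_SLICE = SETS / SLICES };     // 64 sets per slice

static void decode(uint64_t addr)
{
    uint64_t offset = addr & (LINE - 1);                         // bits 0..5
    uint64_t slice  = (addr / LINE) % SLICES;                    // bits 6..8
    uint64_t set    = (addr / (LINE * SLICES)) % SETS_PER_SLICE; // bits 9..14
    printf("addr %5llu -> slice %llu, set %2llu, offset %2llu\n",
           (unsigned long long)addr, (unsigned long long)slice,
           (unsigned long long)set, (unsigned long long)offset);
}

int main(void)
{
    // Consecutive cachelines interleave across the 8 slices.
    for (uint64_t a = 0; a < 8 * LINE; a += LINE)
        decode(a);
    return 0;
}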
 
Actually there is in a set-associative cache. At least if you don't think of the slices as separate caches but just as sub-sets of the cache. Then you have your guarantee ;)
In a set-associative cache, you always have sets of (fully associative) sub-caches, each owning a part of the memory space (meaning that what is in one set can't be in another), just like the slices on the GPUs.
I see how this interpretation works. My initial position came from the point of view that these slices were effectively independent and could manage their traffic with no common physical control.


How is that cheaping out? To do anything else would be silly.
That came out more harshly than I intended.
It's economizing on the hardware at hand.
My primary question about maintaining the static link is that it means the L2 does not scale independently of the memory-controller count, though for a bandwidth-maximizing architecture I suppose that would follow.

If the L2 were decoupled, however, each slice could cover the whole address space, meaning that not having a direct connection to a slice would not make an L1 suddenly lose visibility of 1/N of the address space. Perhaps a collection of four 8x1 crossbars plus some secondary fabric, instead of a 32x4 global crossbar (not counting the instruction and scalar caches).

I seem to be having trouble interpreting slides this time around.
I have been trying to compare the possible bandwidth numbers for the L1/L2 interface versus earlier chips.
Is this bandwidth going up at all?
In the 4 channel, 4 slice L2 case, at 64B a cycle each, I get numbers that seem too low.
 
I seem to be having trouble interpreting slides this time around.
I have been trying to compare the possible bandwidth numbers for the L1/L2 interface versus earlier chips.
Is this bandwidth going up at all?
In the 4 channel, 4 slice L2 case, at 64B a cycle each, I get numbers that seem too low.
Probably the organization is eight 64-byte slices - one slice per channel, where each GDDR5 controller hosts two channels (2x 32-bit devices) and, correspondingly, two L2 slices. I think this has been the case since Cypress: four 64-bit controllers and eight 32-bit channels.
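Back-of-the-envelope, that organization gives (taking 64 bytes per slice per core clock as read; the clocks are just Cypress/Cayman-like placeholders, nothing announced):

#include <stdio.h>

int main(void)
{
    const double slices = 8.0, bytes_per_clk = 64.0;      // assumed organization
    const double clocks_ghz[] = { 0.850, 0.880, 1.000 };  // placeholder clocks

    for (int i = 0; i < 3; ++i) {
        double gbs = slices * bytes_per_clk * clocks_ghz[i];  // GB/s aggregate
        printf("%4.0f MHz: %6.1f GB/s aggregate L2->L1 bandwidth\n",
               clocks_ghz[i] * 1000.0, gbs);
    }
    return 0;
}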
 
So the bandwidth stays the same per clock.
Any improvements would depend on clock, the shift to using compressed textures in the L1, and any bandwidth that a somewhat higher number of hits in the larger L1 can provide.
Though that does make for a crossbar that has 32 clients on one side and eight on the other, just counting the L1s. The instruction caches and scalar caches are shared amongst groups of 4 CUs and would also be present on the read path.
 
I know it's not the silly season yet, but ...

Anyone willing to wager on the number of functional units (CUs etc.) it would take to roughly double the shader/texture performance of Cypress (keeping the same clocks)?
 
I know it's not the silly season yet, but ...

Anyone willing to wager on the number of functional units (CUs etc.) it would take to roughly double the shader/texture performance of Cypress (keeping the same clocks)?

Cypress or Cayman?
If Cayman, I'll throw out a poor guess of ~40 CUs.
 
Fermi is faster than Cayman and Fermi is about 15 CUs wide after normalizing for clock speed. 30 CUs should do it.
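Rough sketch of how I normalize it (the GTX 580's 512 ALUs at a 1544 MHz hot clock, an 880 MHz Cayman-class core clock and 64 lanes per CU are my inputs, so take them as assumptions):

#include <stdio.h>

int main(void)
{
    const double fermi_alus = 512.0, fermi_clk_mhz = 1544.0;  // GTX 580
    const double amd_clk_mhz = 880.0, lanes_per_cu = 64.0;    // Cayman-class

    double equiv_lanes = fermi_alus * (fermi_clk_mhz / amd_clk_mhz);
    printf("GTX 580 ~ %.0f clock-normalized lanes ~ %.1f CUs\n",
           equiv_lanes, equiv_lanes / lanes_per_cu);   // ~898 lanes, ~14 CUs
    return 0;
}

Which lands around 14, call it 15 with a bit of headroom, hence 30 for twice that.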
 
The GPU contingent from AMD won't be intimidated by such floppery though. They told me that the next version of the FireStream HPC product will have twice the performance of the current model. The 9350 and 9370 products being shipped today deliver 528 DP gigaflops and 2.64 SP teraflops. The new FireStreams will be announced this fall -- around SC11, I'm guessing -- and will start shipping sometime in early 2012.
http://www.hpcwire.com/hpcwire/2011..._international_supercomputing_conference.html
GCN is half rate DP, right?
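If it is, the quoted numbers make the arithmetic simple (half-rate DP is the only assumption I'm adding; the FireStream figures are from the article above):

#include <stdio.h>

int main(void)
{
    const double sp_now = 2640.0, dp_now = 528.0;   // FireStream 9370, GFLOPS

    printf("current DP:SP ratio = 1:%.0f\n", sp_now / dp_now);        // 1:5

    double dp_target = 2.0 * dp_now;      // "twice the performance" in DP
    double sp_needed = 2.0 * dp_target;   // what half-rate DP implies for SP
    printf("doubled DP = %.0f GFLOPS needs %.0f GFLOPS SP at 1/2 rate\n",
           dp_target, sp_needed);         // 1056 DP from just 2112 SP
    return 0;
}

So doubling DP wouldn't even require more SP throughput than the current part if the ratio really moves from 1:5 to 1:2.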
 