AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

This type of prefetching is strictly in the programmer's hands, so it's safe to shift this burden onto them.

The kind of prefetching I'm talking about is also in the hands of programmers. It is the dominant form used.
It's safe to assume the vector load case would not be used as a prefetch, or much less than it would be otherwise.
Loops that prefetch far ahead of the actual usage may take on too much overhead if some kind of bounds checking is included, or the data region must be padded enough that the corner case can't wander a stride too far. It's also mostly useless to make a safety check on the SIMD architecture, since in this case divergence breaks up the prefetch run anyway.
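To make the trade-off concrete, here is a minimal sketch of the far-ahead prefetch pattern being discussed, in plain C. The function name, the `PF_DIST` distance, and the clamping scheme are all assumptions for illustration; the point is the per-iteration cost of the bounds check that keeps the prefetch from striding past the buffer (on many ISAs prefetch instructions are non-faulting anyway, which is why padding, or just letting them run past, is the usual alternative):

```c
#include <stddef.h>

/* Hypothetical sketch: prefetch a fixed distance ahead of the loop body.
 * Clamping the prefetch index keeps the corner case from wandering past
 * the array, at the cost of an extra compare per iteration -- the
 * overhead the post is talking about. */
#define PF_DIST 16  /* assumed prefetch distance, in elements */

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        size_t pf = i + PF_DIST;
        if (pf >= n)            /* bounds check so the prefetch */
            pf = n - 1;         /* can't stride past the buffer */
        __builtin_prefetch(&a[pf], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}
```

`__builtin_prefetch` is the GCC/Clang builtin; the result is identical to a plain summation loop, only the memory behavior differs.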
 
It's safe to assume the vector load case would not be used as a prefetch, or much less than it would be otherwise.

nvcc tries to bunch up loads and nv recommends that the programmer bunch up loads as much as possible.
 
That would mean there are only two desktop GPUs (Tahiti and Lombok) but four mobile ones. I would say that doesn't add up, as long as the other parts are not coming significantly later. They weren't added in the last three months or so since Tahiti appeared.
My guess: Thames and Lombok are the same GPU. Thames is simply the mobile version of Lombok. AMD usually lists the mobile version IDs above the desktop ones, which is the case here as well.

Lombok is probably the last VLIW4-based GPU. AMD needs a mainstream VLIW4 GPU to crossfire with Trinity; I cannot imagine it would work that well with any non-VLIW4 GPU. Being VLIW4-based would also explain why it appears so much earlier than the other SI mainstream GPUs. It may very well take on the same role RV740 did for 40nm.

btw, this is what I found in the file atikmdag.sys:
(...) Tahiti2@ý9$³CapeVe÷rde1A16 P?itcairU@2@ÿ5 (...)
So I think the [desktop] codenames for the three GCN SI GPUs are Tahiti, Cape Verde and Pitcairn.

Interestingly, Devastator is mentioned there as well, while Lombok is not (at least I couldn't find it).
 
My guess: Thames and Lombok are the same GPU. Thames is simply the mobile version of Lombok. AMD usually lists the mobile version IDs above the desktop ones, which is the case here as well.

Lombok is probably the last VLIW4-based GPU. AMD needs a mainstream VLIW4 GPU to crossfire with Trinity; I cannot imagine it would work that well with any non-VLIW4 GPU. Being VLIW4-based would also explain why it appears so much earlier than the other SI mainstream GPUs. It may very well take on the same role RV740 did for 40nm.

btw, this is what I found in the file atikmdag.sys:
So I think the [desktop] codenames for the three GCN SI GPUs are Tahiti, Cape Verde and Pitcairn.

Interestingly, Devastator is mentioned there as well, while Lombok is not (at least I couldn't find it).
In my version of that file it reads:
atikmdag.sys said:
Wrestler Sumo SuperSumo(3SIMD) SuperSumo(4SIMD) SuperSumo [..] Tahiti #9 CapeVerde #16 Pitcairn #5 Devastator(4662)
Can anybody make some sense out of this SuperSumo stuff? Especially as Wrestler is Ontario's GPU and Sumo is Llano (which, depending on the version, has 5, 4, 3, or even only 2 SIMDs [E2-350]). SuperSumo sounds like it should be an improved version, doesn't it? But why does it mention the number of SIMDs explicitly? As far as I can remember, some OpenCL SDK beta release referred to the last unused VLIW5 ID as SuperSumo.

And by the way, as there are only two VLIW4 IDs in total, one already in use for Cayman, it appears a bit improbable that both Trinity and Lombok/Thames are VLIW4 (and Dave Bauman just confirmed here in the forum that Trinity is HD6900 architecture based, aka VLIW4). I doubt they would share an ID. That only happened once, with the RV740 and RV770/RV790, where the RV740 was obviously a straight RV770 shrink (with 2 SIMDs removed) from the functional perspective, so the shader compiler didn't need to distinguish between them (it doesn't do that for parts with deactivated SIMDs either). So if Thames/Lombok is indeed VLIW4, it may be a shrunk Cayman with 20 SIMDs or something like that, reusing the Cayman ID.
Edit: Or it is a shrunk Barts reusing that ID and keeping VLIW5 alive a bit longer :rolleyes:
 
nvcc tries to bunch up loads and nv recommends that the programmer bunch up loads as much as possible.
Which makes complete sense, although it increases register pressure and the likelihood of thrashing caches.
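A small illustration of the trade-off just mentioned, as a plain C stand-in for the CUDA case (the function name and shapes are made up for the example): issuing all the independent loads before any of the dependent arithmetic lets the machine overlap their latencies, but every in-flight value then holds a live register until it is consumed, which is exactly where the register pressure comes from.

```c
/* Load-batching sketch: four independent loads issued up front
 * (four live registers), then the dependent arithmetic.  Compare
 * with the naive form that interleaves each load with its use. */
float dot2_batched(const float *a, const float *b)
{
    /* batch the loads so their latencies can overlap */
    float a0 = a[0], a1 = a[1];
    float b0 = b[0], b1 = b[1];
    /* ...then consume them */
    return (a0 * b0) + (a1 * b1);
}
```

Scaling the batch from 4 loads to 16 or 32 is what pushes per-thread register counts up and, on a CPU, starts evicting other working-set lines from cache.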
 
I'm looking at this image, and I'm wondering how GCN is going to fit into the picture.

[image: evolving2.jpg]


Is it possible for a GCN CU to function as an FPU for a Bulldozer core? Maybe if they beefed up the scalar unit or something, it could serve both CPU and GPU as needed.

It just seems like a natural progression to integrate the two architectures.
 
I think the potential exists for something like a CU to operate as a coprocessor, which AMD's architecture already does for the FPU.
The CU is designed to handle multiple clients, so it could offload for a CPU and the graphics pipeline.

A CU's ability to perform its own branching and instruction fetch puts it a step higher than an FPU, however, the design is very different from a CPU and the latency probably means it can't be as tightly coupled.
The difference is enough that the CU as currently described would at best be an ancestor to whatever future design AMD hopes to integrate.
 
Yeah, that makes sense; all that logic would have to be added to the CPU branch unit and scheduler then.

So it would have to be an entirely different architecture anyway, to share cache and that sort of thing.
 
The control logic could be combined with the CPU. There is a benefit to keeping the CU separate, though. A CU that can do its own branching and fetching can maintain its throughput focus without hamstringing the CPU's latency focus.

If the CU was sent instructions by the CPU, the CPU would need to keep track of the status of the very long-lived vector instructions. A CU could routinely take dozens to hundreds of cycles to handle a memory operation or a cluster of ALU ops. Just one vector load would be enough to stall an OoO CPU when the reorder buffer fills up.

If the CU was kept semi-autonomous, the CPU would have an instruction that basically says "tell the CU to start here" or "give the CU this data" and then it could keep on going.
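The fire-and-forget model described above can be sketched as a simple command ring buffer. This is a minimal single-threaded simulation, with all names (`cu_queue`, `submit`, `drain_queue`) hypothetical: the CPU side is one cheap store-and-return per launch, while the "CU" side drains commands on its own schedule, so the CPU never has to track the long-lived vector work in its reorder buffer.

```c
#include <stddef.h>

enum { QUEUE_SLOTS = 8 };

struct cu_cmd   { int start_pc; int data; };
struct cu_queue {
    struct cu_cmd slots[QUEUE_SLOTS];
    size_t head, tail;
};

/* CPU side: "tell the CU to start here" -- one store, then keep going */
int submit(struct cu_queue *q, int start_pc, int data)
{
    if (q->head - q->tail == QUEUE_SLOTS)
        return 0;                       /* queue full; caller retries later */
    q->slots[q->head % QUEUE_SLOTS] = (struct cu_cmd){ start_pc, data };
    q->head++;
    return 1;
}

/* "CU" side (simulated): drains pending commands asynchronously
 * from the CPU's point of view; returns how many were executed */
int drain_queue(struct cu_queue *q)
{
    int executed = 0;
    while (q->tail != q->head) {
        struct cu_cmd c = q->slots[q->tail % QUEUE_SLOTS];
        (void)c;                        /* a real CU would fetch from c.start_pc */
        q->tail++;
        executed++;
    }
    return executed;
}
```

In real hardware the queue would live in memory visible to both sides and the drain would run concurrently, but the decoupling is the same: submission cost is constant regardless of how many cycles the vector work takes.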
 
It's step 2, "separate but equal": there is still segregation, but you get shared memory pointers.
Maybe the GPU starts accessing I/O? I wonder if the GPU can access, via DMA, data coming from raw storage, from a video capture card, a sound card, or generic signals that you could do GPU stream computing on.

Another cool thing is that with the IOMMU on the motherboard's chipset, the GPU's resources become fully accessible in virtual machines.
 
Looks like they are targeting a Q4 2011 release date using again a dual architecture split (like Barts/Caymans).
The latter part is just speculation on their side; the interview itself gives no hint of that. And considering the ASIC IDs for the shader compiler, it appears rather unlikely, unless they do a straight shrink of Cayman (maybe with reduced SIMD count) with no functional changes and reuse its ID (like the RV740 reused the RV770 ID).
 
Looks like they are targeting a Q4 2011 release date using again a dual architecture split (like Barts/Caymans).

The Interview said:
The subject of the IQ wars, and Anistropic Filtering came up, and it turns out that AMD, internally, take criticism badly; that is, they take it personally and to heart. They are working hard to provide the best quality products they can, and they are not going to stop trying to deliver.

Me likes, much. :)
 
Taking Charlie's information at face value, a straight shrink of Cayman would make the most sense to me right now. You don't cannibalize your inventory, you don't make your partners angry, you gain experience with the general 28 nm technology, and you probably make more money because of the smaller chips, enabling you to be more aggressive in a price war: positioning your higher end chips against their mid-range, making your chips look better, selling for the same or less, and ending up increasing market and mind share. Lower power would also enable further SKUs like an 18 to 20 MCU Cayman, which we haven't seen yet.

But I do hope that we get to see GCN in Q3 despite that.
 
The use of a low power process sounds strange; can that really be good enough for high end chips?
 