AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Does anyone else think it's cool that we're back to the R3xx series? I wonder if this one will be as good as the last? If so, count me in.
 
Sea Islands CU: 4 tiles, 16 ALUs per tile (4 threads); branching is expensive.

From AMD's "Heterogeneous System Coherence" paper
Slide: http://i.imgur.com/nkBqAIi.jpg

1 CU = 16 tiles, each tile controlling 4 ALUs (or more); more threads, more efficient branching.
So the ALU count won't be the same for each GPU type:
one GPU could have 64 ALUs per CU, or
1 CU could have 128 ALUs, or 1 CU could have 64 tiles.
(64 ALUs vs. 128 ALUs is not much different in die area; SRAM is what takes up most of the die area.)

Slide: http://i.imgur.com/XKGOFal.jpg
It shows the system spec from the Heterogeneous System Coherence paper; the GPU is fully coherent.
The DRAM is stacked DDR3 with 16 channels, and bandwidth is Pascal-like, 700-800 GB/s.
L3 = embedded memory, not eSRAM.

There is also one console with a fully coherent system, and it certainly won't use GDDR5.
 
Check the CodeXL 1.4 PDF:
you will see Volcanic Islands vs. Hawaii / Sea Islands (ASIC).

Check the C++ AMP lambda API, which is also for Xbox One:
you will see AMD "next generation tiled GPU" vs. older "untiled GPU".

AMD will move to a tiled architecture.

That (if present) might not mean what you think it means. There's more than one way to tile a workload. Just saying.
 
Well, there's definitely some truth to the rumors:

Screen_Shot258.png
 
Of course, but an EX unit representing 4 ALUs is a hardware view, not a software view.

Even in the software view, you are bound by the hardware's capability.
Sea Islands is 4x SIMD16; you cannot tile that into 16x SIMD4.

That is also present in Mike Mantor's paper from 2013:
they try to increase the number of threads but reduce the ALUs per thread.
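
To put a rough number on the "narrower tiles branch more efficiently" argument, here is a toy model in Python. It is only a sketch under my own assumptions, not anything taken from the slides: 64 lanes per CU, each lane takes a branch independently with a fixed probability, and a SIMD group whose lanes disagree has to issue both sides of the branch. Real GCN hardware diverges at wavefront granularity, so treat `simd_width` here as a hypothetical divergence granularity; all function names are made up for illustration.

```python
import random

def divergent_cost(lane_taken, simd_width):
    """Count issue slots for one if/else across all lanes.

    Toy model (assumption): a SIMD group whose lanes all agree issues one
    side of the branch (1 slot); a group with mixed lanes issues both (2 slots).
    """
    slots = 0
    for start in range(0, len(lane_taken), simd_width):
        group = lane_taken[start:start + simd_width]
        slots += 1 if all(group) or not any(group) else 2
    return slots

def average_overhead(simd_width, lanes=64, p_taken=0.1, trials=2000, seed=0):
    """Average slots used relative to the divergence-free case (1.0 = ideal)."""
    rng = random.Random(seed)
    groups = lanes // simd_width
    total = 0
    for _ in range(trials):
        lane_taken = [rng.random() < p_taken for _ in range(lanes)]
        total += divergent_cost(lane_taken, simd_width)
    return total / (trials * groups)

if __name__ == "__main__":
    # Compare the 4x SIMD16 layout with a hypothetical 16x SIMD4 layout.
    for width in (16, 4):
        print(f"SIMD{width}: {average_overhead(width):.2f}x slots vs. ideal")
```

Under these assumptions the narrower groups waste fewer slots on divergent branches, which is the effect being claimed; how much of that survives in real hardware depends on scheduling details the model ignores.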
 
You are speculating, right?!
 
About what?

16 EX is clearly on that slide,

and the GCN CU as we know it so far is only
4 EX units x SIMD16.


Please provide me a link to the full paper. Also, your last sentence:

There is one console also have full coherency system, certainly wont use GDDR5
is in contrast with what we know about the PS4, which has hUMA and incidentally (!) uses GDDR5.
 
Sometimes there are only so many little boxes that can fit on a single slide.
 
Well, there's definitely some truth to the rumors:

Screen_Shot258.png

Looks like Iceland will replace low-end Oland and Cape Verde (i.e. R5 250 and R7 240/250), while Tonga should replace higher-end Curacao and Tahiti (R7 265/270/270X and R9 280/280X) to bring them up to GCN 1.1 spec R7 260 and R9 29x.

I'd imagine these new desktop parts could be R5 255/255X and R7 245/245X for Iceland, and R7 275/275X and R9 285/285X for Tonga.
 
FWIW, you can actually run analyze and look at the generated ISA.

For a ray-tracing kernel, I observed the following:

  • Instruction encoding with Tonga is longer compared to Hawaii (2152 bytes on Hawaii vs. 2232 bytes on Tonga). This is (as far as I can tell) only due to buffer loads, which are twice the size with the offset being now a separate 32-bit dword.
  • A nop after 3 loads which was required on Hawaii is missing on Tonga.
  • Scalar register usage is way up: 42 on Hawaii vs. 94 on Tonga. This won't fly if they don't expand the number of SGPRs.
  • Some mac instructions got replaced by mad (no mac anymore, maybe).
  • Some very minor control flow changes. A complex loop is set up slightly differently (mostly, instructions have been shuffled).
  • There's a v_add_u32 which was not present on Hawaii, and a v_mul_lo_u32 (possibly even more)
Overall, the changes look pretty minor, except for the load instruction encoding and the scalar register usage.
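
For a sense of scale on those two numbers, here is a small Python sketch. The encoding sizes and SGPR counts are the ones quoted above; the SGPR file size (512 per SIMD), the allocation granularity (8) and the wave cap (10 per SIMD) are assumptions based on values commonly quoted for GCN 1.x, not anything confirmed for Tonga, and the function names are hypothetical.

```python
def waves_per_simd(sgprs_per_wave, sgpr_file=512, granularity=8, max_waves=10):
    """Estimate how many wavefronts fit per SIMD given scalar register usage.

    sgpr_file, granularity and max_waves are assumptions (GCN 1.x-style
    values), not numbers confirmed for Tonga.
    """
    # Round the per-wave allocation up to the assumed granularity.
    allocated = -(-sgprs_per_wave // granularity) * granularity
    return min(max_waves, sgpr_file // allocated)

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

if __name__ == "__main__":
    # Kernel size in bytes, as reported for the ray-tracing kernel.
    print(f"encoding growth: {percent_change(2152, 2232):.1f}%")
    # SGPR usage as reported: 42 on Hawaii, 94 on Tonga.
    for name, sgprs in (("Hawaii", 42), ("Tonga", 94)):
        print(f"{name}: {sgprs} SGPRs -> {waves_per_simd(sgprs)} waves per SIMD")
```

With an unchanged register file, the higher SGPR count would roughly halve the number of resident waves in this model, which is why a larger SGPR file on Tonga seems plausible.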
 
Please, provide me a link to the full paper. Also your last sentence:

is in contrast with what we know about the PS4, which has hUMA and incidentally (!) uses GDDR5.

The PS4 is certainly not fully coherent.


The paper is from MICRO-46 (Dec 2013),
called "Heterogeneous System Coherence",
and it uses stacked DDR3 + EmbRAM/eSRAM (depending on which term you like to use).


for the foreseeable future [21], and to ease software’s burden. Additionally, AMD and other HSA Foundation members have committed to providing hardware coherence for heterogeneous systems. Hardware coherence can enable new classes of applications to take advantage of GPGPU computing through low-overhead fine-grained data sharing. Supporting coherence between CPUs and GPUs is challenging.

And from a console architecture panel:
"we have to invest a lot in coherency throughout the chip, so there's been I/O coherency for a while, but we really wanted to get the software out of the mode of managing caches and put in hardware coherency for the first time on the mass scale in the living room on the GPU"
 
The article you are referring to is here: http://research.cs.wisc.edu/multifacet/papers/micro13_hsc.pdf

Though there is a table showing the setup used for their testing, I don't think that implies that
a) a hardware design must mimic the setup they used for testing to obtain full coherence
b) their setup uses both stacked DDR3 + embedded RAM. I've interpreted this as _just_ stacked DDR3 and a large 16 MB L3 cache, unless I'm missing something.

edit: if there is one thing I found really interesting about the article, it is that
700 GB/s of bandwidth was required to not bottleneck 32 CUs; or, as they wrote, with 700 GB/s of bandwidth available they were able to perform all tasks within the limitations of 32 CUs.

I don't necessarily want to go off topic here, but here comes some painfully inaccurate math:
12 CUs / 32 CUs = 37.5%
37.5% of 700 GB/s is about 262 GB/s

X1 total system theoretical bandwidth is 204 (102 both ways) + 67.8 = ~272 GB/s
or 192 (MS release #) + 68 = ~260 GB/s
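
The same back-of-the-envelope math as a few lines of Python, in case anyone wants to plug in other CU counts. This is only a sketch: it assumes the bandwidth needed scales linearly with CU count, which is a crude simplification, and the function name is made up. The input figures are the ones from this post.

```python
def scaled_bandwidth(ref_bw_gbs, ref_cus, target_cus):
    """Scale a reference bandwidth linearly by CU count (crude assumption)."""
    return ref_bw_gbs * target_cus / ref_cus

if __name__ == "__main__":
    # 700 GB/s for 32 CUs in the paper's setup, scaled down to 12 CUs.
    needed = scaled_bandwidth(700.0, 32, 12)
    print(f"scaled requirement for 12 CUs: {needed:.1f} GB/s")

    # The two X1 totals quoted above, in GB/s: embedded memory + DDR3.
    for label, total in (("theoretical", 204.0 + 67.8), ("MS figure", 192.0 + 68.0)):
        print(f"X1 {label}: {total:.1f} GB/s")
```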
 
The most important bit in the paper is the cache coherence protocol that allows a region to be temporarily removed from the coherence domain, along with the design of the related facilities that make the protocol work. The setup is not really important after all, nor is the architecture used in the paper, though combined with other papers it shows the research trend (or even the architecture trend) at AMD. QuickRelease (2014) is another paper from the same team of researchers.

For PlayStation 4, if one assumes it is architecturally the same as Kaveri, it should be "fully coherent", or more specifically it is capable of accessing the coherent system memory by bypassing all GPU caches. Still fully coherent, yah?
 