AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

CNCAddict · May 17, 2014

Looked up HBM memory and found these links interesting.

http://sites.amd.com/us/Documents/TFE2011_006HYN.pdf
https://www.skhynix.com/gl/products/graphics/graphics_info.jsp
http://www.3dincites.com/2013/10/the-many-flavors-of-3d-dram/

pjbliverpool · May 17, 2014

Does anyone else think it's cool that we're back to the R3xx series? I wonder if this one will be as good as the last? If so, count me in.

Wynix · May 17, 2014

CNCAddict said:
Looked up HBM memory and found these links interesting.

http://sites.amd.com/us/Documents/TFE2011_006HYN.pdf

On page 12 I think they are referring to laptop products, so will high-end laptops be the first to get HBM?

kotakaja · May 17, 2014

Kaotik said:
Only reference to Volcanic Island, which Hawaii should be too, is that 1.4 adds support for Volcanic Islands.

Where exactly?

Check M. Mantor's slides on SlideShare:
R9 290X is Sea Islands

From CodeXL 1.4
http://i.imgur.com/8LjI06P.jpg

Hawaii is Sea Islands
Volcanic Islands is something else

Alexko · May 17, 2014

So what's in that Volcanic Islands menu?

And by the way, what's Kalindi?

Nemo · May 17, 2014

Alexko said:
So what's in that Volcanic Islands menu?

Tonga?

kotakaja · May 17, 2014

Alexko said:
So what's in that Volcanic Islands menu?

And by the way, what's Kalindi?

Kalindi is the Sea Islands GPU in Kabini/Temash.
I am not so sure about Beema/Mullins.

kotakaja · May 17, 2014

Sea Islands CU: 4 tiles, 16 ALUs per tile (4 threads), branching is expensive

From AMD, Heterogeneous Coherency System
Slide: http://i.imgur.com/nkBqAIi.jpg

1 CU = 16 tiles, each tile controls 4 ALUs (or more), more threads, branching is more efficient
So the ALU count won't be the same for each GPU type:
one GPU could have 64 ALUs per CU, or
1 CU could have 128 ALUs, or 1 CU could have 64 tiles
(64 ALUs vs 128 ALUs is not much different in die area; SRAM is what takes up most of the die area)

Slide: http://i.imgur.com/XKGOFal.jpg
It shows the Heterogeneous Coherence System spec; the GPU is fully coherent
DDR is stacked, DDR type is DDR3, 16 channels, BW like Pascal: 700-800 GB/sec
L3 = embedded memory, not eSRAM

There is also one console that has a fully coherent system, and it certainly won't use GDDR5
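
Editor's sketch: the CU layouts speculated about in this post can be compared with a few lines of Python. This is only arithmetic on the numbers quoted above. The `CULayout` class, the layout names, and the "16 x SIMD8" entry (one way to reach 128 ALUs) are illustrative assumptions, not confirmed hardware, and the 64-wide wavefront is carried over from current GCN without any evidence that a tiled design would keep it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CULayout:
    """A hypothetical compute-unit layout: `tiles` SIMD units of `alus_per_tile` lanes."""
    name: str
    tiles: int
    alus_per_tile: int

    @property
    def total_alus(self) -> int:
        # Total ALU lanes in the CU.
        return self.tiles * self.alus_per_tile

    def cycles_per_wavefront(self, wavefront_size: int = 64) -> int:
        # Cycles one tile needs to step a whole wavefront through a single
        # instruction, assuming one work-item per lane per cycle.
        return -(-wavefront_size // self.alus_per_tile)  # ceiling division


layouts = [
    CULayout("4 x SIMD16 (GCN as known today)", tiles=4, alus_per_tile=16),
    CULayout("16 x SIMD4 (speculated)", tiles=16, alus_per_tile=4),
    CULayout("16 x SIMD8 (speculated, 128 ALUs)", tiles=16, alus_per_tile=8),
]

for layout in layouts:
    print(f"{layout.name}: {layout.total_alus} ALUs per CU, "
          f"{layout.cycles_per_wavefront()} cycles per 64-wide wavefront")
```

The first two layouts have the same ALU count per CU; what changes is how many independent instruction streams the CU can hold and how long each one occupies a tile.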

AlexV · May 17, 2014

kotakaja said:
Check the CodeXL 1.4 PDF
you will see Volcanic Islands vs Hawaii Sea Islands (ASIC)

Check the C++ AMP Lambda API, which is also for Xbox One
You will see AMD's next-generation tiled GPU vs the older untiled GPU

AMD will move to a tiled architecture

That (if present) might not mean what you think it means. There's more than one way to tile a workload. Just saying.

Alexko · May 17, 2014

Well, there's definitely some truth to the rumors:

kotakaja · May 17, 2014

AlexV said:
That (if present) might not mean what you think it means. There's more than one way to tile a workload. Just saying.

Of course, but an EX unit representing 4 ALUs is a hardware view, not a software view.

Even in the software view, you are bound by hardware capability:
Sea Islands is 4 x SIMD16, and you cannot tile that into 16 x SIMD4

That is also present in Mike Mantor's 2013 paper:
they try to increase the thread count but reduce the ALUs per thread
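
Editor's sketch: the claim that narrower execution groups make branching cheaper can be illustrated with a toy divergence model. It assumes work-items are split into groups of `group_width` that share one program counter, so a group whose items disagree on a branch has to run both sides. The function names and the branch pattern are made up for illustration. On real GCN the lockstep granularity is the 64-wide wavefront rather than the SIMD width, so the width is treated here as a free parameter.

```python
def branch_passes(taken: list[bool], group_width: int) -> int:
    """Count lockstep passes needed to run an if/else over all work-items.

    A group needs one pass if every item agrees on the branch, and two
    passes (then-side plus else-side) if the items disagree.
    """
    passes = 0
    for start in range(0, len(taken), group_width):
        group = taken[start:start + group_width]
        diverged = any(group) and not all(group)
        passes += 2 if diverged else 1
    return passes


def lane_utilisation(taken: list[bool], group_width: int) -> float:
    """Useful lane-slots divided by issued lane-slots for the branch."""
    issued = branch_passes(taken, group_width) * group_width
    return len(taken) / issued


# Hypothetical branch pattern: runs of 4 work-items alternate between the
# two sides of the branch (TTTTFFFF repeated across 64 work-items).
taken = [i % 8 < 4 for i in range(64)]

for width in (64, 16, 4):
    print(width, branch_passes(taken, width), lane_utilisation(taken, width))
```

With this particular pattern, groups of 64 or 16 always contain both outcomes and waste half their lane-slots, while groups of 4 never diverge. A different pattern would give different numbers, which is why this is a sketch and not a prediction.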

mosen · May 17, 2014

kotakaja said:
Sea Islands CU: 4 tiles, 16 ALUs per tile (4 threads), branching is expensive

From AMD, Heterogeneous Coherency System
Slide: http://i.imgur.com/nkBqAIi.jpg

1 CU = 16 tiles, each tile controls 4 ALUs (or more), more threads, branching is more efficient
So the ALU count won't be the same for each GPU type:
one GPU could have 64 ALUs per CU, or
1 CU could have 128 ALUs, or 1 CU could have 64 tiles
(64 ALUs vs 128 ALUs is not much different in die area; SRAM is what takes up most of the die area)

Slide: http://i.imgur.com/XKGOFal.jpg
It shows the Heterogeneous Coherence System spec; the GPU is fully coherent
DDR is stacked, DDR type is DDR3, 16 channels, BW like Pascal: 700-800 GB/sec
L3 = embedded memory, not eSRAM

There is also one console that has a fully coherent system, and it certainly won't use GDDR5

You are speculating, right?!

kotakaja · May 17, 2014

mosen said:
You are speculating, right?!

About what?

16 EX is clearly on that slide,

and the GCN CU as we know it so far is only
4 EX units x SIMD16

mosen · May 17, 2014

kotakaja said:
About what?

16 EX is clearly on that slide,

and the GCN CU as we know it so far is only
4 EX units x SIMD16

Please, provide me a link to the full paper. Also your last sentence:

There is also one console that has a fully coherent system, and it certainly won't use GDDR5

Is in contrast with what we know about the PS4, which has hUMA and accidentally (!) uses GDDR5.

AlexV · May 17, 2014

kotakaja said:
About what?

16 EX is clearly on that slide,

and the GCN CU as we know it so far is only
4 EX units x SIMD16

Sometimes there are only so many little boxes that can fit on a single slide.

DmitryKo · May 18, 2014

Alexko said:
Well, there's definitely some truth to the rumors:

Looks like Iceland will replace the low-end Oland and Cape Verde (i.e. R5 250 and R7 240/250), while Tonga should replace the higher-end Curacao and Tahiti (R7 265/270/270X and R9 280/280X), bringing them up to the GCN 1.1 spec of the R7 260 and R9 29x.

I'd imagine these new desktop parts could be R5 255/255X and R7 245/245X for Iceland, and R7 275/275X and R9 285/285X for Tonga.

Anteru · May 19, 2014

FWIW, you can actually run analyze and look at the generated ISA.

For a ray-tracing kernel, I observed the following:

Instruction encoding with Tonga is longer compared to Hawaii (2152 bytes on Hawaii vs. 2232 bytes on Tonga). This is (as far as I can tell) only due to buffer loads, which are twice the size, with the offset now being a separate 32-bit dword.
A nop after 3 loads which was required on Hawaii is missing on Tonga.
Scalar register usage is way up: 42 on Hawaii vs 94 on Tonga. This won't fly if they don't expand the number of SGPRs.
Some mac instructions got replaced by mad (maybe no mac anymore).
Some very minor control flow changes. A complex loop is set up slightly differently (mostly, instructions have been shuffled).
There's a v_add_u32 which was not present on Hawaii, and a v_mul_lo_u32 (possibly even more).

Overall, the changes look pretty minor, except for the load instruction encoding and the scalar register usage.
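
Editor's sketch: to put the two headline differences in proportion, the quoted figures can be turned into relative changes. The dictionary layout and the `relative_change` helper are illustrative; the numbers themselves are the ones reported in this post for a single ray-tracing kernel, so nothing here generalises beyond that kernel.

```python
# Figures quoted in the post for one ray-tracing kernel.
hawaii = {"code_bytes": 2152, "sgprs": 42}
tonga = {"code_bytes": 2232, "sgprs": 94}


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (0.05 means +5%)."""
    return (new - old) / old


code_delta = tonga["code_bytes"] - hawaii["code_bytes"]
code_growth = relative_change(hawaii["code_bytes"], tonga["code_bytes"])
sgpr_ratio = tonga["sgprs"] / hawaii["sgprs"]

print(f"code size: +{code_delta} bytes ({code_growth:.1%})")
print(f"SGPR usage: {sgpr_ratio:.2f}x")
```

The encoding grows by only a few percent, while scalar register usage more than doubles, which matches the post's point that the SGPR count is the change that stands out.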

kotakaja · May 21, 2014

mosen said:
Please, provide me a link to the full paper. Also your last sentence:

Is in contrast with what we know about the PS4, which has hUMA and accidentally (!) uses GDDR5.

The PS4 certainly is not fully coherent

The paper is from MICRO-46 (Dec 2013),
called "Heterogeneous System Coherence",
and it uses stacked DDR3 + EmbRAM/eSRAM (depending on which term you like to use)

for the foreseeable future [21], and to ease software’s burden. Additionally, AMD and other HSA Foundation members have committed to providing hardware coherence for heterogeneous systems. Hardware coherence can enable new classes of applications to take advantage of GPGPU computing through low-overhead fine-grained data sharing. Supporting coherence between CPUs and GPUs is challenging.

And from Some console architecture panel:

we have to invest a lot in coherency throughout the chip, so there's been io coherency for a while but we really wanted to get the software out of the mode of managing caches and put in hardware coherency for the first time on the mass scale in the living room on the gpu

iroboto · May 22, 2014

kotakaja said:
The PS4 certainly is not fully coherent

The paper is from MICRO-46 (Dec 2013),
called "Heterogeneous System Coherence",
and it uses stacked DDR3 + EmbRAM/eSRAM (depending on which term you like to use)

And from Some console architecture panel:

The article you are referring to is here: http://research.cs.wisc.edu/multifacet/papers/micro13_hsc.pdf

Though there is a table indicating the setup used for their testing, I don't think that ensures that
a) a hardware design must mimic the setup they used for testing to obtain full coherence
b) I still don't see that their setup uses both stacked DDR3 + embedded RAM; I've interpreted this as _just_ stacked DDR3 and a large 16 MB L3 cache, unless I'm missing something.

edit: one thing I found really interesting about the article is that
700 GB/s of bandwidth was required to not bottleneck 32 CUs, or as they wrote, with 700 GB/s of bandwidth available they were able to perform all tasks within the limitations of 32 CUs.

I don't necessarily want to go off topic here, but here comes some painfully inaccurate math:
12 CUs / 32 CUs is 37.5%
37.5% of 700 GB/s is 262.5 GB/s

X1 total system theoretical bandwidth is 204 (102 each way) + 67.8 = ~271.8 GB/s
or 192 (MS release #) + 68 = ~260 GB/s
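
Editor's sketch: the back-of-the-envelope math above can be written out so the steps are easy to check. It assumes, as the post does, that required bandwidth scales linearly with CU count, which the paper does not claim. The variable names are illustrative and the bandwidth figures are the ones quoted in the post.

```python
# Figures from the post: the paper's simulated system and the X1's CU count.
PAPER_CUS = 32
PAPER_BANDWIDTH_GBS = 700.0
X1_CUS = 12

# Assumption (the poster's, not the paper's): bandwidth need is linear in CUs.
cu_fraction = X1_CUS / PAPER_CUS
scaled_bandwidth = PAPER_BANDWIDTH_GBS * cu_fraction

# Two ways of totalling X1 bandwidth, as given in the post.
x1_peak_total = 204.0 + 67.8      # peak embedded-memory figure + DDR3
x1_ms_total = 192.0 + 68.0        # "MS release #" figure + DDR3 (rounded)

print(f"CU fraction: {cu_fraction:.1%}")
print(f"Scaled bandwidth for {X1_CUS} CUs: {scaled_bandwidth:.1f} GB/s")
print(f"X1 totals: {x1_peak_total:.1f} GB/s or {x1_ms_total:.1f} GB/s")
```

Under that linear assumption the scaled figure and both X1 totals land within a few percent of each other, which is the coincidence the post is pointing at.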

pTmdfx · May 27, 2014

iroboto said:
Though there is a table indicating the setup used for their testing, I don't think that ensures that
a) a hardware design must mimic the setup they used for testing to obtain full coherence
b) I still don't see that their setup uses both stacked DDR3 + embedded RAM; I've interpreted this as _just_ stacked DDR3 and a large 16 MB L3 cache, unless I'm missing something.

The most important bit in the paper is the cache coherence protocol that allows a region to be temporarily removed from the coherence domain, along with the design of the related facilities that make the protocol work. The setup is not really important after all, and neither is the architecture used in the paper, though combined with other papers it shows the research trend (or even the architecture trend) at AMD. QuickRelease (2014) is another paper from the same team of researchers.

For the PlayStation 4, if one assumes it is architecturally the same as Kaveri, it should be "fully coherent", or more specifically, it is capable of accessing coherent system memory by bypassing all GPU caches. Still fully coherent, yeah?
