AMD: Southern Islands (7*** series) Speculation/Rumour Thread

Well, in earlier architectures, one workgroup couldn't span multiple SIMDs.

I was wondering what the case is here, i.e. whether a workgroup can span multiple SIMDs or not.

I now think that what was a "SIMD engine" in Cayman is now a CU in SI.
 
Yep, because each "engine" only had one SIMD. The easiest way to look at it is via the LDS. Each CU has an LDS shared across all four SIMDs, so the CU is the unit in the hierarchy that workgroups are bound to, very much like nVidia's SM. Interesting that there was no mention of dedicated transcendental units; I guess those instructions will run on the ALUs as well.
 
I am wondering if the TMUs are still in the CUs :?: Not a single slide mentioned them.
The TMUs ("Filter") are mentioned in this slide:

[attached slide image: img0032683_1rjgu.jpg]


So AMD is going to unify the general-purpose L1 cache and the texture L1 (and keep the LDS separate), while Fermi unified local memory and the GP L1 and kept the texture L1 separate. And AMD is going to use a compressed L1 for textures, like nVidia. The current L1 appears to be an uncompressed, fully (128-way) associative 8 kB cache. That may enable more efficient bandwidth use.
 
If a kernel declares a workgroup needs only 16 kB of it, you can run 4 groups without breaking any spec.

What about kernels written assuming 32 kB of local memory (DX11)?

Anyway, now the redesign of the ALU organization is clear, which was my original question. Instead of one vector thread issuing a 4-wide VLIW bundle per clock over four cycles, it now runs 4 vector threads, each issuing a single instruction per clock.
 
I think it's one instruction from four threads over four cycles. The batch/workgroup size is still 64.

If instructions are sourced from 4 different threads, they might as well be from 4 different IPs each. I think the organization is similar to Fermi, which dual-issues from 2 warps. Here it quad-issues from 4 different wavefronts.
 
If instructions are sourced from 4 different threads, they might as well be from 4 different IPs each. I think the organization is similar to Fermi, which dual-issues from 2 warps. Here it quad-issues from 4 different wavefronts.

I meant one instruction from each hardware thread, with each instruction taking four cycles.
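The issue pattern described in these posts can be sketched numerically. A minimal sketch, assuming the figures quoted in the thread (4 SIMDs per CU, 16 lanes per SIMD, 64-item wavefronts) and a simple round-robin issue across SIMDs; the exact scheduler behavior is not specified in the slides:

```python
# Hypothetical model of the issue pattern discussed above: a CU has
# 4 SIMDs, each a 16-lane vector unit. A 64-work-item wavefront needs
# 64/16 = 4 cycles per instruction, and rotating issue across the
# 4 SIMDs means one instruction is issued somewhere every cycle.
NUM_SIMDS = 4
LANES_PER_SIMD = 16
WAVEFRONT_SIZE = 64
CYCLES_PER_INSTR = WAVEFRONT_SIZE // LANES_PER_SIMD  # = 4

def issue_schedule(n_cycles):
    """Return which SIMD receives a new instruction on each cycle."""
    return [cycle % NUM_SIMDS for cycle in range(n_cycles)]

# Each SIMD starts a new instruction every 4 cycles, exactly when its
# previous 4-cycle instruction finishes, so the vector units stay busy.
print(issue_schedule(8))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With these numbers the four-cycle instruction latency per SIMD exactly hides behind the four-way rotation, which is why "one instruction from four threads over four cycles" and "quad issue from 4 wavefronts" describe the same machine.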
 
What about kernels written assuming 32 kB of local memory (DX11)?
You can run at least two.
A kernel still declares how much it needs. If the kernel at hand uses less than the full 32 kB, there is still the opportunity to run more than a single workgroup, even on Evergreen. The specification only limits the maximum size a single workgroup can use; you can write a kernel where a workgroup uses only 256 bytes, for instance. One doesn't have to assume each workgroup will use 32 kB: a kernel declares its local memory usage, and that declaration determines the maximum number of simultaneous workgroups.
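The dispatch-time accounting this post describes is simple division. A sketch, assuming a 64 kB LDS per CU (which is what "16 kB gets you 4 groups" earlier in the thread implies) and the 32 kB per-workgroup API limit; both numbers are taken from the thread, not from a spec document:

```python
# Hypothetical occupancy accounting: the number of workgroups that can
# run concurrently on a CU is the LDS capacity divided by the LDS each
# workgroup declares. Capacity assumed to be 64 kB (implied by the
# "16 kB -> 4 groups" example above); API limit is 32 kB per group.
LDS_PER_CU = 64 * 1024
MAX_LDS_PER_GROUP = 32 * 1024

def max_workgroups(declared_lds_bytes):
    """Max simultaneous workgroups per CU, limited by declared LDS."""
    declared = max(declared_lds_bytes, 1)  # avoid division by zero
    assert declared <= MAX_LDS_PER_GROUP, "exceeds the per-group limit"
    return LDS_PER_CU // declared

print(max_workgroups(16 * 1024))  # 4
print(max_workgroups(32 * 1024))  # 2  (the "at least two" above)
print(max_workgroups(256))        # 256
```

In practice other resources (registers, wavefront slots) cap the count well before 256, but the LDS term of the calculation works like this.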
 
While that would work, I don't think any IHV will try such a solution.
Why not? :D We know the LDS usage at compilation time, so we can easily manage LDS resources at dispatch time either in the GPU or in the driver.

If a kernel declares 32KiB of LDS usage, then you would only get one wavefront per SIMD, but if you only used 1 KiB of LDS then we could schedule up to 32 wavefronts per SIMD.
 
What do you mean? Isn't that how it works now?!? It does not matter if it is nVidia or AMD, R700/Evergreen/NI or G80/Fermi, it always works the same way.

Yep, been that way since G80 and CUDA 1.0.
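Taking the numbers in the quoted post at face value (32 KiB of declared LDS limits you to 1 wavefront per SIMD, 1 KiB allows up to 32), the relationship is the same divide-and-clamp every vendor has used since G80. A sketch using only the two data points given in the post; the 32 KiB budget and 32-wavefront cap are inferred from those figures, not stated anywhere:

```python
# Hypothetical wavefront-occupancy limit per SIMD, inferred from the
# quoted post: 32 KiB declared -> 1 wavefront, 1 KiB -> 32 wavefronts.
# That implies a 32 KiB LDS budget per SIMD's wavefront pool and a cap
# of 32 wavefront slots; both numbers are assumptions for illustration.
LDS_BUDGET = 32 * 1024
MAX_WAVES_PER_SIMD = 32

def waves_per_simd(lds_per_wavefront):
    """Wavefronts per SIMD, limited by LDS and by the slot cap."""
    by_lds = LDS_BUDGET // max(lds_per_wavefront, 1)
    return min(by_lds, MAX_WAVES_PER_SIMD)

print(waves_per_simd(32 * 1024))  # 1
print(waves_per_simd(1024))       # 32
```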

I like AMD's approach of having specialized units all leveraging a shared pool of L2 and CU's. Very clean and very scalable. The shared, coherent cache is a real enabler.
 
David Kanter said:
don't know about schedule, but probably 28nm so late this year maybe. VLIW4 was a small change, but a precursor to the new uarch.
http://twitter.com/#!/DKrwt
Pretty amazing if VLIW4 was always planned to be just a one-generation stopgap; AMD did a great job of not letting that one out of the bag!


Really looking forward to your article, David!
 
Indeed, AMD is getting frighteningly good at keeping secret things secret. I don't think anything at all had leaked about SI before this event.
 
Yeah it should be a gem, looking forward to it too. The scalar unit isn't exposed by any of the compute APIs but it's probably a boon to the driver team.
 
Why not? :D We know the LDS usage at compilation time, so we can easily manage LDS resources at dispatch time either in the GPU or in the driver.

If a kernel declares 32KiB of LDS usage, then you would only get one wavefront per SIMD, but if you only used 1 KiB of LDS then we could schedule up to 32 wavefronts per SIMD.

Sure, but designing your hardware assuming devs will use little LDS when the spec exposes 32 kB is a poor design choice, although it would work.

This micro discussion on LDS is getting derailed.
 