AMD: Sea Islands R1100 (8*** series) Speculation/ Rumour Thread

rpg.314 · Dec 31, 2011

trinibwoy said:
Yeah I agree but why isn't it possible to have multiple 1-pixel sized triangles in a draw call?

In the pre-DX11 days, batching at stamp granularity was fine. In the era of single-pixel triangles, not packing every lane with fragments is a huge waste.

fellix · Dec 31, 2011

Jawed said:
And, of course, it's possible to have multiple triangles per thread, but I haven't seen any evidence that ATI does this.

I think G70 did this to compensate for its rather large batch size. Not sure about NV40.

trinibwoy · Dec 31, 2011

rpg.314 said:
In the pre-DX11 days, batching at stamp granularity was fine. In the era of single-pixel triangles, not packing every lane with fragments is a huge waste.

Right but, like Jawed said, we don't have any evidence of batching at higher than rasterization granularity from either AMD or nVidia.

Psycho · Dec 31, 2011

I think I remember we already figured it out to be the case when discussing evergreen.
Otherwise it should be detectable by benchmarking fillrate on a *very* heavy pixel shader while decreasing triangle size.

rpg.314 · Dec 31, 2011

I guess what needs to be done is to render 64 single-pixel triangles with a *very* long pixel shader, and then measure the variation in rendering time as the triangle size is increased. If there is batching at stamp size, there would be step-function jumps in render times.
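A toy model of that experiment, as a sketch only. It assumes a 64-lane wavefront, 2x2-pixel quads, and that each "triangle" covers a quad-aligned square of side x side pixels. The function names are made up for illustration, and the two hypotheses (one triangle per wave versus quads packed across triangles) are just the two cases being argued about here, not confirmed hardware behaviour. With a very long pixel shader, render time should scale roughly with the wave count.

```python
import math

WAVE_LANES = 64   # assumed wavefront width
QUAD_LANES = 4    # a 2x2 pixel quad occupies four lanes


def quads_covered(side_px):
    """Quads touched by a quad-aligned square of side_px x side_px pixels.

    The square is only a stand-in for a small triangle.
    """
    per_axis = math.ceil(side_px / 2)
    return per_axis * per_axis


def waves_unpacked(num_tris, side_px):
    """Hypothesis A: a wave never holds fragments from more than one triangle."""
    lanes = quads_covered(side_px) * QUAD_LANES
    return num_tris * math.ceil(lanes / WAVE_LANES)


def waves_packed(num_tris, side_px):
    """Hypothesis B: quads from different triangles may share a wave."""
    lanes = num_tris * quads_covered(side_px) * QUAD_LANES
    return math.ceil(lanes / WAVE_LANES)


if __name__ == "__main__":
    # Print wave counts for 64 triangles at a few sizes under both hypotheses.
    for side in (1, 2, 4, 8, 9, 16):
        print(side, waves_unpacked(64, side), waves_packed(64, side))
```

Under these assumptions the unpacked count stays at 64 waves from 1x1 up to 8x8 pixels and then doubles at 9x9, while the packed count grows with covered area (4 waves at 1x1, 100 at 9x9). A flat-then-step timing curve would point to the first case.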

Shtal · Dec 31, 2011

ATI Radeon 8000 series GPU

http://developer.amd.com/archive/legacydemos/pages/ATIRadeon8000SeriesDemos.aspx

Shtal · Dec 31, 2011

fellix said:
As much as the manufacturing process allows, I guess. Tahiti is "only" 365mm², so there's a lot of margin left. TSMC is also moving very fast to 450mm wafers, remember. Large chips ought to get cheaper in the long term.

I guess ATI/AMD could go for a 512-bit memory bus this time with Sea Islands, since they did before in the days of R600.

3dcgi · Dec 31, 2011

CarstenS said:
With GCN being the basis for a few years to come, I hope the next major overhaul will be in the front-end.

Why do you think it needs a major overhaul?

Jawed said:
ATI still appears to be designed for "8-fragment triangles" (arguably 16-fragment triangles) - I haven't seen anything in GCN that changes this. And a pixel shader thread that contains one of these then runs at 12.5% efficiency. That's dark ages.

In order to achieve peak fill rate with AMD hardware you need 16+ pixel triangles, but small triangles don't limit the efficiency of packing pixel waves if they are a multiple of 4 pixels. The PS is packed by quads from up to 16 triangles. It's been like this since Xenos at least.
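To put rough numbers on the two views, here is a small sketch. It assumes a 64-lane wave made of 16 quad slots and takes the 16-triangle limit from the description above. The function names are hypothetical, and the model only covers how lanes get filled, not rasteriser throughput (the 16+ pixel requirement for peak fill rate is a separate matter).

```python
WAVE_LANES = 64
QUAD_LANES = 4
MAX_TRIS_PER_WAVE = 16   # limit quoted in the post above


def efficiency_one_triangle_per_wave(pixels):
    """Useful-lane fraction if a wave holds fragments from a single triangle."""
    return min(pixels, WAVE_LANES) / WAVE_LANES


def efficiency_quad_packed(pixels_per_tri, quads_per_tri):
    """Useful-lane fraction when quads from up to 16 triangles fill one wave.

    Assumes every triangle has the same shape: pixels_per_tri covered pixels
    spread over quads_per_tri quads.
    """
    quads_per_wave = WAVE_LANES // QUAD_LANES
    if not 1 <= quads_per_tri <= quads_per_wave:
        raise ValueError("quads_per_tri must be between 1 and 16")
    if not 1 <= pixels_per_tri <= quads_per_tri * QUAD_LANES:
        raise ValueError("pixels_per_tri does not fit in quads_per_tri quads")
    tris = min(MAX_TRIS_PER_WAVE, quads_per_wave // quads_per_tri)
    return tris * pixels_per_tri / WAVE_LANES


if __name__ == "__main__":
    print(efficiency_one_triangle_per_wave(8))   # the 8-fragment case
    print(efficiency_quad_packed(1, 1))          # single-pixel triangles
    print(efficiency_quad_packed(4, 1))          # one full quad per triangle
```

In this model an 8-fragment triangle alone in a wave gives the 12.5% figure, single-pixel triangles packed by quad give 25% (one live pixel per quad), and triangles that fill whole quads pack to 100%.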

Jawed · Jan 1, 2012

3dcgi said:
The PS is packed by quads from up to 16 triangles.

Is there evidence or documentation for this?

3dcgi · Jan 1, 2012

Jawed said:
Is there evidence or documentation for this?

Just my working for AMD and knowing that's how it works. I'd be surprised if Nvidia hardware doesn't work the same way.

Also, we wouldn't have papers from Stanford about quad fragment merging if we wasted an entire wavefront/warp when rendering a quad as this would be lower hanging fruit.

rpg.314 · Jan 1, 2012

Shtal said:
I guess ATI/AMD could go for a 512-bit memory bus this time with Sea Islands, since they did before in the days of R600.

They would have to make a gigantic chip (~500 mm²) to fit a 512-bit bus.

Shtal · Jan 1, 2012

rpg.314 said:
They would have to make a gigantic chip (~500 mm²) to fit a 512-bit bus.

Why would you say that? Did you forget the ATI R600, at 420 mm² with a 512-bit bus?
http://www.beyond3d.com/content/reviews/16/2

rpg.314 · Jan 1, 2012

It also had too much bandwidth for its shader core.

Dooby · Jan 1, 2012

64 ROPs!!!

Shtal · Jan 1, 2012

rpg.314 said:
It also had too much bandwidth for its shader core.

True, but if Sea Islands is still based on the GCN "1D" architecture and the unified shader count is increased to 3072, plus 48 ROPs and 192 TMUs, then GDDR5 at 1375MHz over a 512-bit bus will give 352.0 GB/s of bandwidth. It might be slightly overkill, but I feel it would still be okay!
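For anyone checking the bandwidth figure, a minimal sketch of the arithmetic. It assumes the 1375MHz number is the GDDR5 command clock, with four data transfers per pin per clock; the function name is just for illustration.

```python
def gddr5_bandwidth_gb_s(memory_clock_mhz, bus_width_bits):
    """Peak GDDR5 bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # GDDR5 moves four bits per pin per command-clock cycle.
    transfers_per_second = memory_clock_mhz * 1e6 * 4
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


if __name__ == "__main__":
    print(gddr5_bandwidth_gb_s(1375, 512))   # the speculated 512-bit part
    print(gddr5_bandwidth_gb_s(1375, 384))   # same clock on a 384-bit bus
```

At the same memory clock, the HD 7970's 384-bit bus works out to 264 GB/s, so the 512-bit configuration would be a one-third increase.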

rpg.314 · Jan 1, 2012

Charlie posted a pic tagged as hd 8k with an interposer. I am rooting for interposers to go mainstream around that time.

Shtal · Jan 1, 2012

Can someone speculate what the die size could be based on the specs below, or how much bigger it would be than the AMD Radeon HD 7970?

Sea Islands (Radeon HD 8970)
28nm tech
900MHz GPU
3072 unified shaders (GCN "1D" architecture)
48 ROPs
192 TMUs
GDDR5 1375MHz on 512-bit bus (352.0 GB/s bandwidth)

CarstenS · Jan 1, 2012

3dcgi said:
Why do you think it needs a major overhaul?

Because from reading a couple of reviews and from first-hand experience, there seems to be quite a bit of performance left on the table in real-world gaming scenarios (i.e. not canned benchmarks but in-game stuff measured with Fraps). Tahiti oftentimes fails to reach its nominal advantage over Cayman, in the ballpark of +40%, even before counting its more efficient architecture.

Also some synthetic tests show that most of the basic blocks like shader core including texturing or the Render Back-Ends seem to function very well when isolated (but in some cases not even that).

Try running this on Tahiti for example:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

With Cayman you'll get 780/1030/1500/1770/1800 of sustained GFLOPS for the five predefined n's.
With Tahiti, you should be able to beat that by at least 40% (clocks & unit count).
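The 40% comes straight from clocks and unit counts. A quick sketch of that arithmetic, assuming the reference-card figures for the HD 6970 (1536 ALUs at 880MHz) and HD 7970 (2048 ALUs at 925MHz) and one multiply-add per ALU per clock. Scaling the Cayman results this way is only a naive target, not a measurement.

```python
CAYMAN = {"alus": 1536, "clock_mhz": 880}   # Radeon HD 6970 reference specs (assumed)
TAHITI = {"alus": 2048, "clock_mhz": 925}   # Radeon HD 7970 reference specs (assumed)


def peak_gflops(chip):
    """Peak single-precision GFLOPS: one multiply-add (2 flops) per ALU per clock."""
    return chip["alus"] * 2 * chip["clock_mhz"] / 1000.0


# Ratio of theoretical peaks, i.e. the nominal Tahiti advantage.
scale = peak_gflops(TAHITI) / peak_gflops(CAYMAN)

# Sustained Cayman results quoted above, scaled naively by that ratio.
cayman_sustained = [780, 1030, 1500, 1770, 1800]
tahiti_target = [round(g * scale) for g in cayman_sustained]

if __name__ == "__main__":
    print(round(scale, 3))
    print(tahiti_target)
```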

Jawed · Jan 1, 2012

3dcgi said:
Just my working for AMD and knowing that's how it works.

Thanks very much for the insight.

Jawed · Jan 1, 2012

CarstenS said:
Try running this on Tahiti for example:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

With Cayman you'll get 780/1030/1500/1770/1800 of sustained GFLOPS for the five predefined n's.
With Tahiti, you should be able to beat that by at least 40% (clocks & unit count).

That's fairly heavy on transcendentals and I dare say that the compiler might be outputting "compute" transcendentals rather than graphics transcendentals. I suspect the former are much slower.

Though GCN is supposed to have "helpers" for compute transcendentals as far as I understand it, to reduce the hit of "precise" transcendentals that's seen on prior GPUs.

But this application is written with IL (it appears) which throws extra variables into the mix. e.g. I don't know if this is written using PS compute (unlikely, but not impossible) or CS. The former would be more likely to use graphics transcendentals.

In GCN I can hypothesise that a transcendental (SQRT say) that appears in a compute shader is always "slow", because the compiler sees that it is compute, not pixel shading. Whereas on R700/Evergreen the compiler would issue the literal instruction, the standard "imprecise" SQRT regardless of PS or CS mode that the IL is written in.

---

Another issue is register allocation. GCN's peak register allocation per work item is half R600's. Brute force N-Body is a perfect place to do maximal vectorisation, i.e. stuffing as many particles into a work item as is possible.

GCN may be over-stuffed and so running at much lower performance, with too few threads per core.

Horrible register allocation is a long standing problem for AMD. GCN may be extra-horrible in its youth. It should improve...
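A rough way to see how register pressure would cut the thread count, as a sketch under assumptions that are not from this thread: 256 vector registers available per lane on each GCN SIMD, a cap of 10 waves per SIMD, and allocation in blocks of 4 registers. The function name is hypothetical.

```python
import math

VGPRS_PER_LANE = 256      # assumed vector registers per lane on one SIMD
MAX_WAVES_PER_SIMD = 10   # assumed hardware cap on resident waves
ALLOC_GRANULARITY = 4     # assumed register allocation block size


def waves_per_simd(vgprs_per_work_item):
    """Resident waves per SIMD for a kernel using this many vector registers."""
    if not 1 <= vgprs_per_work_item <= VGPRS_PER_LANE:
        raise ValueError("register count out of range")
    # Round the request up to the allocation block size.
    allocated = math.ceil(vgprs_per_work_item / ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


if __name__ == "__main__":
    for regs in (24, 64, 84, 128, 256):
        print(regs, waves_per_simd(regs))
```

On these numbers a kernel needs to stay at or below 24 registers per work item to keep all 10 waves resident, and a heavily vectorised N-body kernel at 128 registers would be down to 2 waves per SIMD.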
