AMD: Sea Islands R1100 (8*** series) Speculation/Rumour Thread

Yeah I agree but why isn't it possible to have multiple 1-pixel sized triangles in a draw call?

For the pre-DX11 days, batching at stamp granularity is fine. In the era of single-pixel triangles, not packing every lane with fragments is a huge waste.
 
For the pre-DX11 days, batching at stamp granularity is fine. In the era of single-pixel triangles, not packing every lane with fragments is a huge waste.

Right, but like Jawed said, we don't have any evidence of batching at higher than rasterization granularity from either AMD or nVidia.
 
I think I remember we already figured this to be the case when discussing Evergreen.
Otherwise it should be detectable by benchmarking fillrate on a *very* heavy pixel shader while decreasing triangle size.
 
I guess what needs to be done is to render 64 single-pixel triangles with a *very* long pixel shader, and then measure the variation in rendering time as the triangle size is increased. If there is batching at stamp size, there would be step-function jumps in render times.
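A toy model of what that measurement should show, purely illustrative: it assumes a 64-lane wavefront, that the long pixel shader dominates frame time, and it contrasts the two extreme packing behaviours being debated (waves packed with quads from many triangles versus a wave holding only one triangle's fragments):

```python
# Illustrative model only: 64 triangles of increasing size, and two hypotheses
# about how their fragments get packed into 64-lane pixel wavefronts.
WAVE_LANES = 64
NUM_TRIANGLES = 64

def waves_quad_packed(tri_pixels):
    # Hypothesis A: waves are packed with quads from many triangles.
    quads_per_tri = -(-tri_pixels // 4)          # ceil(pixels / 4)
    total_quads = NUM_TRIANGLES * quads_per_tri
    return -(-(total_quads * 4) // WAVE_LANES)   # ceil(total lanes / 64)

def waves_one_triangle_per_wave(tri_pixels):
    # Hypothesis B: a wave only ever holds fragments of a single triangle.
    return NUM_TRIANGLES * -(-tri_pixels // WAVE_LANES)

for size in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{size:4d} px/tri: quad-packed {waves_quad_packed(size):4d} waves,"
          f" one-tri-per-wave {waves_one_triangle_per_wave(size):4d} waves")
```

If the shader dominates, render time should scale with wave count: hypothesis B stays flat at 64 waves until the triangles exceed 64 pixels and then jumps in steps, while hypothesis A grows with the number of covered quads. That divergence is what the experiment would expose.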
 
As much as the manufacturing process allows, I guess. Tahiti is "only" 365mm², so there's a lot of margin left. TSMC is also moving very fast to 450mm wafers, remember. Large chips ought to get cheaper in the long term.

I guess ATI/AMD could go for a 512-bit memory bus this time with Sea Islands, since they did it before in the days of R600.
 
With GCN being the basis for a few years to come, I hope the next major overhaul will be in the front-end.
Why do you think it needs a major overhaul?

ATI still appears to be designed for "8-fragment triangles" (arguably 16-fragment triangles) - I haven't seen anything in GCN that changes this. And a pixel shader thread that contains one of these then runs at 12.5% efficiency. That's dark ages.
In order to achieve peak fill rate with AMD hardware you need 16+ pixel triangles, but small triangles don't limit the efficiency of packing pixel waves if they are a multiple of 4 pixels. The PS is packed by quads from up to 16 triangles. It's been like this since Xenos at least.
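To put numbers on the two claims above (a simple sketch, assuming a 64-lane wavefront and quad-granularity packing as described in the post):

```python
WAVE_LANES = 64

# If a whole wave only holds the fragments of one 8-fragment triangle:
one_triangle_per_wave = 8 / WAVE_LANES       # 0.125 -> the 12.5% figure

# If the wave is packed with quads from up to 16 triangles, the only loss is
# the idle lanes inside partially covered quads:
#   a 1-pixel triangle occupies 1 of the 4 lanes in its quad
single_pixel_tris = (16 * 1) / WAVE_LANES    # 16 quads, 1 live pixel each
#   a triangle covering a full quad (4 pixels) wastes nothing
full_quad_tris = (16 * 4) / WAVE_LANES

print(f"one triangle per wave : {one_triangle_per_wave:.1%}")
print(f"quad-packed, 1px tris : {single_pixel_tris:.1%}")
print(f"quad-packed, 4px tris : {full_quad_tris:.1%}")
```

Which is why packing quads from multiple triangles matters so much, and why triangles that are a multiple of 4 pixels (whole quads) don't hurt packing efficiency.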
 
Is there evidence or documentation for this?
Just from working at AMD and knowing that's how it works. I'd be surprised if Nvidia hardware doesn't work the same way.

Also, we wouldn't have papers from Stanford about quad fragment merging if we wasted an entire wavefront/warp when rendering a quad, as this would be lower-hanging fruit.
 
It also had too much bandwidth for its shader core.

True, but if Sea Islands is still based on the GCN "1D" architecture, with the unified shader count increased to 3072, plus 48 ROPs, 192 TMUs, and 1375MHz GDDR5 over a 512-bit bus giving 352.0 GB/s of bandwidth, it might be slightly overkill, but I feel it would still be okay! :)
 
Charlie posted a pic tagged as hd 8k with an interposer. I am rooting for interposers to go mainstream around that time.
 
Can someone speculate what die size it could be based on the specs below, or how much bigger it would be than the AMD Radeon HD 7970?

Sea Islands (Radeon HD 8970)
28nm tech
900MHz GPU
3072 unified shaders (GCN "1D" architecture)
48 ROPs
192 TMUs
GDDR5 1375MHz on a 512-bit bus (352.0 GB/s bandwidth)
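For what it's worth, the quoted bandwidth figure follows directly from that memory spec (GDDR5 transfers 4 bits per pin per cycle of the quoted 1375MHz clock):

```python
# Quick check of the 352.0 GB/s figure in the spec list above.
mem_clock_mhz = 1375
bus_width_bits = 512
data_rate_mtps = mem_clock_mhz * 4                           # 5500 MT/s effective
bandwidth_gbs = data_rate_mtps * bus_width_bits / 8 / 1000   # Mbit/s -> MB/s -> GB/s
print(f"{bandwidth_gbs:.1f} GB/s")                           # 352.0 GB/s
```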
 
Why do you think it needs a major overhaul?
Because from reading a couple of reviews and from first-hand experience, there seems to be quite a bit of performance left on the table in real-world gaming scenarios (i.e. not canned benchmarks but in-game stuff measured with Fraps). Tahiti oftentimes fails to reach its nominal advantage over Cayman in the ballpark of +40%, not even counting its more efficient architecture.

Also, some synthetic tests show that most of the basic blocks, like the shader core (including texturing) or the Render Back-Ends, seem to function very well when isolated (though in some cases not even that).

Try running this on Tahiti for example:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

With Cayman you'll get 780/1030/1500/1770/1800 sustained GFLOPS for the five predefined n's.
With Tahiti, you should be able to beat that by at least 40% (clocks & unit count).
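(The "at least 40%" is just the raw throughput ratio; a quick sanity check using the reference clocks and ALU counts of the HD 6970 and HD 7970:)

```python
# Peak single-precision throughput = ALUs x 2 FLOPs (MAD/FMA) x clock (GHz).
cayman_gflops = 1536 * 2 * 0.880   # HD 6970 (Cayman): ~2703 GFLOPS
tahiti_gflops = 2048 * 2 * 0.925   # HD 7970 (Tahiti): ~3789 GFLOPS
print(f"peak ratio: {tahiti_gflops / cayman_gflops:.2f}x")   # ~1.40x
```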
 
Try running this on Tahiti for example:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

With Cayman you'll get 780/1030/1500/1770/1800 sustained GFLOPS for the five predefined n's.
With Tahiti, you should be able to beat that by at least 40% (clocks & unit count).
That's fairly heavy on transcendentals, and I dare say that the compiler might be outputting "compute" transcendentals rather than graphics transcendentals. I suspect the former are much slower.

Though GCN is supposed to have "helpers" for compute transcendentals as far as I understand it, to reduce the hit of "precise" transcendentals that's seen on prior GPUs.

But this application is written with IL (it appears), which throws extra variables into the mix, e.g. I don't know if this is written using PS compute (unlikely, but not impossible) or CS. The former would be more likely to use graphics transcendentals.

In GCN I can hypothesise that a transcendental (SQRT, say) that appears in a compute shader is always "slow", because the compiler sees that it is compute, not pixel shading. Whereas on R700/Evergreen the compiler would issue the literal instruction, the standard "imprecise" SQRT, regardless of the PS or CS mode the IL is written in.

---

Another issue is register allocation. GCN's peak register allocation per work item is half R600's. Brute force N-Body is a perfect place to do maximal vectorisation, i.e. stuffing as many particles into a work item as is possible.

GCN may be over-stuffed and so running at much lower performance, with too few threads per core.
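To illustrate how that plays out, here's a toy occupancy model (a sketch only, assuming GCN's 256-VGPR-per-lane limit, a register file of 256 wave-wide registers per SIMD, and the usual cap of 10 waves per SIMD):

```python
# Toy GCN occupancy model: wavefronts a SIMD can keep in flight as a
# kernel's per-work-item VGPR usage grows (assumed limits: 256 VGPRs per
# lane, 256 wave-wide registers in the file, max 10 waves per SIMD).
REGFILE_WAVE_REGS = 256
MAX_WAVES_PER_SIMD = 10

def waves_in_flight(vgprs_per_item):
    if vgprs_per_item <= 0:
        return MAX_WAVES_PER_SIMD
    return min(MAX_WAVES_PER_SIMD, REGFILE_WAVE_REGS // vgprs_per_item)

for vgprs in (24, 32, 64, 84, 128, 200, 256):
    print(f"{vgprs:3d} VGPRs/work-item -> {waves_in_flight(vgprs):2d} waves per SIMD")
```

So an aggressively vectorised N-body kernel that eats 200+ registers per work item leaves a single wave in flight per SIMD, with nothing left to hide memory or transcendental latency: the "over-stuffed" case above.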

Horrible register allocation is a long-standing problem for AMD. GCN may be extra-horrible in its youth. It should improve...
 