http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71
Finally "scalar" ALUs! Looks good on paper so far
Finally "scalar" ALUs! Looks good on paper so far
Yes, GCN does attribute interpolation with ALUs (the compiler puts those instructions at the beginning of the shader). GCN also uses lots of ALUs for other seemingly innocent tasks, such as sampling a cube map (normalize the UV vector, cubeface, cubecoord). Double rate FP16 would help a lot in games; not all math needs to be full precision (to have any effect on the resulting image quality). I also like double rate 16 bit integer math (and native support for splitting a 32 bit register between two 16 bit variables). Modern rendering engines do a lot of integer math (in compute shaders).

Each of the four quad lanes in an execution engine has an FMA plus a separate ADD, so in terms of flop numbers, you should get that:
G71 at 1000 MHz should be able to do 32 * 12 * (2+1) * 1000 MHz = 1152 GFlops.
While this does fall a bit short of XB1/PS4, it should generally be a bit easier to get close to the theoretical numbers with smaller wavefronts. It's also been my understanding that the GCN architecture used in these consoles has to do attribute/varying interpolation in the shader cores, which further reduces the amount of GFlops actually available; G71 keeps dedicated varying units, so that this kind of interpolation doesn't take away from the GFlops number. Also, G71 has full FP16 support, using 2-component vectors in each 32-bit lane; using FP16 instead of FP32 will thus give you twice the paper GFlops.
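To make that arithmetic explicit, here's a minimal C sketch of the paper-GFlops estimate, assuming the configuration described in this thread (32 cores, 3 execution engines of 4 lanes each, an FMA plus a separate ADD per lane, 1000 MHz); the 2x FP16 figure simply reflects packing two 16-bit values into each 32-bit lane:

#include <stdio.h>

int main(void) {
    /* Assumed Mali-G71 MP32 configuration, as described in the comments above. */
    const double cores          = 32.0;    /* shader cores                      */
    const double lanes_per_core = 3 * 4;   /* 3 execution engines, 4-wide quads */
    const double flops_per_lane = 2 + 1;   /* FMA counts as 2 flops, ADD as 1   */
    const double clock_hz       = 1000e6;  /* 1000 MHz                          */

    double fp32_gflops = cores * lanes_per_core * flops_per_lane * clock_hz / 1e9;
    double fp16_gflops = 2.0 * fp32_gflops; /* two FP16 values per 32-bit lane  */

    printf("FP32: %.0f GFlops\n", fp32_gflops); /* prints 1152 */
    printf("FP16: %.0f GFlops\n", fp16_gflops); /* prints 2304 */
    return 0;
}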
Also, for heterogeneous compute, G71 now has full coherency, so that CPU and GPU can work on the same data without intervening cache maintenance operations; whether this is useful for anything that a console would normally do ... time will tell, I suppose.
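Roughly, the difference full coherency makes looks like the sketch below; gpu_dispatch, gpu_wait, cache_flush and cache_invalidate are hypothetical stand-ins for a driver API (stubbed so the sketch compiles), not real Mali driver calls:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for a driver API, stubbed so this compiles. */
static void gpu_dispatch(uint32_t *buf, size_t n) { for (size_t i = 0; i < n; ++i) buf[i] *= 2; }
static void gpu_wait(void) {}
static void cache_flush(const void *buf, size_t bytes) { (void)buf; (void)bytes; }
static void cache_invalidate(const void *buf, size_t bytes) { (void)buf; (void)bytes; }

/* Without coherency: caches must be maintained around every CPU/GPU hand-off. */
static void process_noncoherent(uint32_t *data, size_t n) {
    cache_flush(data, n * sizeof *data);       /* make CPU writes visible to the GPU */
    gpu_dispatch(data, n);
    gpu_wait();
    cache_invalidate(data, n * sizeof *data);  /* make GPU writes visible to the CPU */
    data[0] += 1;
}

/* With full coherency (what G71 advertises): both sides just use the buffer. */
static void process_coherent(uint32_t *data, size_t n) {
    gpu_dispatch(data, n);
    gpu_wait();
    data[0] += 1;
}

int main(void) {
    uint32_t a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
    process_noncoherent(a, 4);
    process_coherent(b, 4);
    printf("%u %u\n", (unsigned)a[0], (unsigned)b[0]); /* both paths print 3 3 */
    return 0;
}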
With the amount of spatial coherence typically in graphics, a SIMD4 seems like a strange choice. Imagine the amount of control logic required in a 32 core GPU! I don't think packing quads into a wavefront is a problem - only the last wavefront would be partially empty.
Keep in mind that it's a tile-based GPU; it wouldn't be able to pack quads from different tiles into a wavefront.
Is that really different to "traditional" renderers? I'm not convinced these can export to multiple RBEs from the same wave.
I don't know if that's the case. But even if the quads in a wavefront need to go to the same RBE, that's not the same as going to the same tile.
Tiles are 32x32 pixels these days. There should be plenty of opportunity to do 4 quads at a time.

So 256 quads without overdraw, or at least 64 hypothetical wavefronts of 4 quads. With each object added to the tile that gets more fragmented, I wouldn't be surprised to see utilisation well below 90% in average cases. Is that worth it? I don't know.
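Just to spell out that quad math as a small C sketch (the 32x32 tile size and the 4-quads-per-wavefront packing are the assumptions from the comments above):

#include <stdio.h>

int main(void) {
    const int tile_w = 32, tile_h = 32;  /* assumed 32x32 pixel tile            */
    const int pixels_per_quad = 4;       /* a quad is a 2x2 block of pixels     */
    const int quads_per_wave  = 4;       /* 4 quads packed into one wavefront   */

    int pixels = tile_w * tile_h;            /* 1024 */
    int quads  = pixels / pixels_per_quad;   /*  256 */
    int waves  = quads  / quads_per_wave;    /*   64 */

    printf("%d pixels -> %d quads -> %d full wavefronts per tile (no overdraw)\n",
           pixels, quads, waves);
    return 0;
}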
Well, normalizing on the basis of clock speed and number of GPU units does not say much. Power and die size are better indications of design efficiency.
ARM's new IPs are impressive on both the CPU side and the GPU side, and the A35 is not even in use yet.
I wonder when low- and mid-end SoCs will really move forward; it seems market segmentation is the only thing holding back their evolution at this point (and, I guess, some twisted customer perceptions wrt core count too).
How high are the chances exactly that 8 clusters clocked at 900MHz will consume less power than 6 clusters at 630+MHz (with the amount of TMUs per cluster being equal this time)?

It is a logical fallacy; I can't answer that, because the clusters are not the same. Now, the clock speed is likely to affect performance efficiency.
If power consumption were low enough, they would eventually clock even higher.
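For what it's worth, the reason cluster count and clock alone can't settle this is visible in the usual dynamic-power rule of thumb, P proportional to clusters * f * V^2. A toy C sketch, with made-up voltages (not measured figures for either GPU, and ignoring that the cluster designs differ):

#include <stdio.h>

/* Toy model: dynamic power is roughly proportional to clusters * f * V^2.
 * It ignores leakage and the fact that the cluster designs are different. */
static double rel_power(int clusters, double freq_mhz, double volts) {
    return clusters * freq_mhz * volts * volts;
}

int main(void) {
    /* Voltages below are invented for illustration only. */
    double p8_high_v = rel_power(8, 900.0, 0.95);
    double p8_low_v  = rel_power(8, 900.0, 0.70);
    double p6        = rel_power(6, 630.0, 0.80);

    printf("8 clusters @ 900 MHz, 0.95 V: %.0f\n", p8_high_v); /* ~6498 */
    printf("8 clusters @ 900 MHz, 0.70 V: %.0f\n", p8_low_v);  /* ~3528 */
    printf("6 clusters @ 630 MHz, 0.80 V: %.0f\n", p6);        /* ~2419 */
    /* The wider, faster point costs more in this toy model either way, but how
     * much more depends entirely on the voltage each design needs at that clock. */
    return 0;
}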
Your original post is extremely negative wrt the giant strides ARM's IPs are making, especially in the GPU realm. Furthermore, there are no proper tests yet of devices running that SoC.

So where's the "giant stride" exactly? I've commented on Huawei's claims, and I severely doubt that they've made any of those numbers up or would go through any effort to show their own products in a negative light.
If it does not beat Apple's custom GPU efforts, as well as their massively optimized software stack, it will be a surprise to nobody. Yet it is one hell of a jump from the previous arch.
What is that based on? (Referring to the claim that the amount of TMUs per cluster is equal this time.)