ARM Bifrost Architecture

They borrowed many good ideas from others. This presentation reminds me of the GCN presentation (http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah). Maxwell has a similar operand reuse cache (avoiding register access -> avoiding bank conflicts -> a simpler, more efficient register file design). Clauses aren't new either (AMD's Terascale had clauses... and while clauses are nice, they are not 100% problem free). SIMD4 is a natural fit for quad shading. Pixel shaders are still the most common workload on mobile GPUs. Other GPUs need to pack multiple quads into a single wave (IIRC Intel doesn't - it just issues a partially full SIMD8).
 
It looks good!

The other new "feature" is that you can easily calculate total AFUs by taking the number of Bifrost cores and multiplying times 12. :)

A Bifrost 32 core design would have 384 AFUs.
 
Could someone tell me the new architecture's floating-point throughput per core per clock?

The Anandtech article says that each Bifrost core has 3 quad execution engines, for a total of 12 fused multiply-add operations per clock.
Does each FMA correspond to two FLOPs, since it's a multiply and an add?

Then a G71MP32 at e.g. 1000MHz should be able to do 32*12*2*1000M = 768 GFLOP/s?
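
For what it's worth, here's that back-of-the-envelope calculation as a tiny Python sketch; the 12 FMAs per core come from the Anandtech article, and counting each FMA as 2 flops is my assumption:

# Rough theoretical FP32 throughput for a hypothetical G71MP32 at 1000 MHz.
# Assumptions from this thread, not official specs: 12 FMA lanes per core,
# each FMA counted as 2 flops (multiply + add).
cores = 32
fma_lanes_per_core = 12
flops_per_fma = 2
clock_ghz = 1.0

gflops = cores * fma_lanes_per_core * flops_per_fma * clock_ghz
print(gflops)  # -> 768.0 GFLOPS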

If so, it's still not a design that could match a current-gen console (unless taken to really high clocks?), but it could be an interesting match for the current 15W x86 APU offerings (Bristol Ridge and Skylake-U GT3e).
With 10nm, this could be very interesting for a handheld console.
 
Each of the four quad lanes in an execution engine has an FMA plus a separate ADD, so in terms of flop numbers, you should get that:

A 32-core G71 at 1000 MHz should be able to do 32*12*(2+1)*1000M = 1152 GFLOPS.

While this does fall a bit short of XB1/PS4, it should generally be a bit easier to get close to theoretical numbers with smaller wavefronts. It's also been my understanding that the GCN architecture used in these consoles has to do attribute/varying interpolation in the shader cores, which also reduces the amount of GFLOPS actually available; G71 keeps dedicated varying units, so that kind of varying interpolation doesn't take away from its GFLOPS number. Also, G71 has full FP16 support, using 2-component vectors in each 32-bit lane; using FP16 instead of FP32 will thus give you twice the paper GFLOPS.
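
To spell the arithmetic out, a small Python sketch of the FP32 and FP16 paper numbers under those assumptions (32 cores, 3 engines of 4 quad lanes each, FMA counted as 2 flops plus a separate ADD as 1 flop, 2-wide FP16 vectors per 32-bit lane):

# Theoretical throughput for a hypothetical G71MP32 at 1000 MHz.
cores = 32
lanes_per_core = 3 * 4         # 3 execution engines, 4 quad lanes each
flops_per_lane = 2 + 1         # FMA (2 flops) + separate ADD (1 flop)
clock_ghz = 1.0

fp32_gflops = cores * lanes_per_core * flops_per_lane * clock_ghz
fp16_gflops = 2 * fp32_gflops  # 2-component FP16 vectors per 32-bit lane
print(fp32_gflops, fp16_gflops)  # -> 1152.0 2304.0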

Also, for heterogeneous compute, G71 now has full coherency, so that CPU and GPU can work on the same data without intervening cache maintenance operations; whether this is useful for anything that a console would normally do ... time will tell, I suppose.
 
Yes, GCN does attribute interpolation with the ALUs (the compiler puts the interpolation instructions at the beginning of the shader). GCN also uses lots of ALU instructions for other seemingly innocent tasks, such as sampling a cube map (normalizing the UV vector, cube face selection, cube coordinate computation). Double-rate FP16 would help a lot in games; not all math needs full precision (to have any effect on the resulting image quality). I also like double-rate 16-bit integer math (and native support for splitting a 32-bit register between two 16-bit variables). Modern rendering engines do a lot of integer math (in compute shaders).
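
As a toy illustration of the "two 16-bit variables in one 32-bit register" idea (plain numpy on the CPU here, obviously not actual Bifrost/GCN code):

import numpy as np

# Pack two FP16 values into one 32-bit word, then unpack them again.
pair = np.array([1.5, -2.25], dtype=np.float16)
packed = int(pair.view(np.uint32)[0])     # one 32-bit "register" holding both halves
unpacked = np.array([packed], dtype=np.uint32).view(np.float16)
print(hex(packed), unpacked)              # on little-endian: 0xc0803e00 [ 1.5  -2.25]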

But theoretical FLOPS alone is unfortunately not enough to beat the consoles. Memory bandwidth plays a big role in running optimized shader code (designed solely for a single platform). Most shaders (on the critical path) are optimized until memory bandwidth becomes the hard limit. Xbox One bandwidth is 68 GB/s + ESRAM and PS4 is 176 GB/s. That is still a long way off. Tiling helps mobile GPUs of course, but ESRAM helps just as much and is much more flexible than tiling. Modern rendering engines spend 50%+ of their time running compute shaders, and tiling doesn't help compute shaders at all. AMD GCN shines in compute: it is highly flexible and very fine grained (higher GPU utilization can be reached by running shaders with different bottlenecks concurrently on each CU). GCN doesn't suffer much from LDS bank conflicts and has super fast LDS atomics (that don't even use VALU slots). I have seen a Radeon 7970 beat a GTX 980 (Maxwell) in code containing lots of LDS atomics (and it beats the 780 Ti by a HUGE margin, as Kepler has dead slow atomics).
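
To put the bandwidth point into rough numbers, a quick flops-per-byte comparison; the console figures are the commonly quoted theoretical ones, and the ~30 GB/s LPDDR4 bandwidth for a G71-class phone SoC is purely my assumption for illustration:

# Rough arithmetic intensity: theoretical GFLOPS per GB/s of DRAM bandwidth.
configs = {
    "Xbox One (DDR3, ESRAM not counted)": (1310, 68),
    "PS4 (GDDR5)":                        (1840, 176),
    "G71MP32 @ 1 GHz (assumed LPDDR4)":   (1152, 30),
}
for name, (gflops, gbps) in configs.items():
    print(f"{name}: {gflops / gbps:.1f} flops per byte")
# The higher this ratio, the sooner optimized shaders hit the bandwidth wall.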
 
With the amount of spatial coherence typically found in graphics, SIMD4 seems like a strange choice. Imagine the amount of control logic required in a 32-core GPU! I don't think packing quads into a wavefront is a problem - only the last wavefront would be partially empty.
 
Keep in mind that it's a tile-based GPU; it wouldn't be able to pack quads from different tiles into a wavefront.
 

Tiles are 32x32 pixels these days. There should be plenty of opportunity to do 4 quads at a time.

Looks more like a re-purposing of their existing SIMD-4 ALU (which might be a wise choice)

Cheers
 
Keep in mind that it's a tile-based GPU; it wouldn't be able to pack quads from different tiles into a wavefront.
Is that really different to "traditional" renderers? I'm not convinced these can export to multiple RBEs from the same wave.
 
Is that really different to "traditional" renderers? I'm not convinced these can export to multiple RBEs from the same wave.
I don't know if that's the case. But even if the quads in a wavefront need to go to the same RBE, that's not the same as going to the same tile.

Tiles are 32x32 pixels these days. There should be plenty of opportunity to do 4 quads at a time.
So 256 quads without overdraw, or at least 64 hypothetical wavefronts of 4 quads. With each object added to the tile that gets more fragmented; I wouldn't be surprised to see utilisation well below 90% in average cases. Is that worth it? I don't know.
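
A toy model of what I mean, assuming quads can only be packed into a 4-quad wavefront together with other quads from the same primitive (my guess at the constraint, not something ARM has stated):

import math

def utilisation(quads_per_primitive):
    # Pack one primitive's quads into wavefronts of 4 quads;
    # only the last wavefront per primitive is left partially empty.
    waves = math.ceil(quads_per_primitive / 4)
    return quads_per_primitive / (waves * 4)

for quads in (256, 32, 7, 3, 1):
    print(f"{quads:>3} quads per primitive -> {utilisation(quads):.0%} lane utilisation")
# 256 -> 100%, 32 -> 100%, 7 -> 88%, 3 -> 75%, 1 -> 25%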

On the other hand a lot of the control logic per core seems to be shared between the three quad engines.
 
http://www.golem.de/news/kirin-960-...ueber-seinen-smartphone-chip-1610-124061.html

Huawei is more than open about its SoCs; the Kirin 960 measures about 110mm² on TSMC's 16FFC process. Other than that it's T880MP4@900MHz vs. G71MP8@900MHz = +180% according to Huawei, and if you look at slide 7, the former gets around 19 fps in Manhattan 3.0 offscreen while the 960 climbs all the way up to 51 fps. On the other hand, the A10 GPU with just 6 clusters and a frequency I'd estimate around 630+MHz gets 64 fps. I don't think I need to normalize that on a hypothetical basis to the same number of clusters and the same frequency to make my point, I guess....
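
Spelling the normalization out anyway (the ~630 MHz A10 GPU clock is my estimate; the rest are the figures quoted above):

# Normalise Manhattan 3.0 offscreen scores to fps per cluster per GHz.
designs = {
    "G71MP8 @ 900 MHz (Kirin 960)": (51, 8, 0.90),
    "T880MP4 @ 900 MHz":            (19, 4, 0.90),
    "A10 GPU, 6 clusters":          (64, 6, 0.63),  # clock estimated
}
for name, (fps, clusters, ghz) in designs.items():
    print(f"{name}: {fps / (clusters * ghz):.1f} fps per cluster-GHz")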
 
Well, normalizing on the basis of clock speed and number of GPU units doesn't say much. Power and die size are better indications of design efficiency.
ARM's new IPs are impressive on both the CPU side and the GPU side, and the A35 is not in use yet.
I wonder when low- and mid-end SoCs will really move forward; it seems market segmentation is the only thing holding back their evolution at this point (and, I guess, some twisted customer perception wrt core count too).
 

How high are the chances exactly that 8 clusters clocked at 900MHz will consume less power than 6 clusters at 630+MHz (with the number of TMUs per cluster being equal this time)?
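
As a very crude first-order sanity check: dynamic power goes roughly as active units * f * V^2, and if you assume voltage has to rise more or less linearly with frequency in that range (a big simplification that ignores leakage, process and architectural differences), you get something like:

# Crude relative dynamic power: P ~ clusters * f * V^2, with V ~ f assumed.
def relative_power(clusters, freq_ghz):
    voltage = freq_ghz            # crude V-scales-with-f assumption, arbitrary units
    return clusters * freq_ghz * voltage ** 2

ratio = relative_power(8, 0.90) / relative_power(6, 0.63)
print(f"8 clusters @ 900 MHz draw ~{ratio:.1f}x the dynamic power of 6 @ 630 MHz")
# -> ~3.9x under these assumptions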
 
It is a logical fallacy; I can't answer because the clusters are not the same. Now, the clock speed is likely to affect efficiency.
Your original post is extremely negative with regard to the giant strides ARM's IP is making, especially in the GPU realm. Furthermore, there are no proper tests of devices running that SoC yet.
If it does not beat Apple's custom GPU efforts, as well as their massively optimized software stack, it will be a surprise to nobody. Yet it is one hell of a jump from the previous architecture.
 
It is a logical fallacy; I can't answer because the clusters are not the same. Now, the clock speed is likely to affect efficiency.
If power consumption were low enough, they would eventually clock even higher.
Your original post is extremely negative with regard to the giant strides ARM's IP is making, especially in the GPU realm.

Yes, the negativity is there because ARM traditionally bites off more than it can chew when it comes to marketing and presenting architectures, while reality has traditionally painted a different picture so far.
Furthermore, there are no proper tests of devices running that SoC yet.
So where's the "giant stride" exactly? I've commented on Huawei's claims, and I severely doubt that they've made any of those numbers up or would go through any effort to show their own products in a negative light.

If it does not beat Apple's custom GPU efforts, as well as their massively optimized software stack, it will be a surprise to nobody. Yet it is one hell of a jump from the previous architecture.

No ARM GPU IP has so far managed to beat any PVR GPU IP cluster for cluster and clock for clock. That won't change if someone compares performance numbers from other Rogue implementations either. And it's not me making those comparisons; it's Huawei in the given case, and the entire market that unfortunately takes Apple as its metric.
 