ARM Bifrost Architecture

Discussion in 'Mobile Graphics Architectures and IP' started by Ailuros, May 30, 2016.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    pixelio likes this.
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    They borrowed many good ideas from others. This presentation reminds me of the GCN presentation (http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah). Maxwell has a similar operand reuse cache (avoiding register access -> avoiding bank conflicts -> a simpler, more efficient register file design). Clauses aren't new either (AMD's TeraScale had clauses... and while clauses are nice, they are not 100% problem-free). SIMD4 is a natural fit for quad shading. Pixel shaders are still the most common workload on mobile GPUs. Other GPUs need to pack multiple quads into a single wave (IIRC Intel doesn't - it just issues a partially full SIMD8).
     
    pixelio and Ike Turner like this.
  3. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    It looks good!

    The other new "feature" is that you can easily calculate the total AFU count by taking the number of Bifrost cores and multiplying by 12. :)

    A Bifrost 32 core design would have 384 AFUs.
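    That rule of thumb can be sketched quickly; the 3 quad engines x 4 lanes split per core matches the discussion in this thread, and the function name is just illustrative:

    ```python
    # Bifrost AFU count rule of thumb: 3 quad execution engines per core,
    # 4 lanes per engine -> 12 AFUs per core.
    def total_afus(cores: int, engines_per_core: int = 3, lanes_per_engine: int = 4) -> int:
        return cores * engines_per_core * lanes_per_engine

    print(total_afus(32))  # a 32-core design -> 384 AFUs
    ```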
     
    #3 pixelio, May 30, 2016
    Last edited: May 30, 2016
  4. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,550
    Likes Received:
    4,214
    Could someone tell the new architecture's floating point throughput per-core-per-frequency?

    The Anandtech article says that each Bifrost core has 3 * quad execution engines, for a total of 12 fused multiply-add operations.
    Does each FMA correspond to two FLOPs, since it's a multiply and an add?

    Then a G71MP32 at e.g. 1000MHz should be able to do 32*12*2*1000M = 768 GFLOP/s?

    If so, it's still not a design that could match a current-gen console (unless taken to really high clocks?), but it could be an interesting match to the current 15W x86 APU offerings (Bristol Ridge and Skylake U GT3e).
    With 10nm, this could be very interesting for a handheld console.
     
  5. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    Gjethus, Norway
    Each of the four quad lanes in an execution engine has an FMA plus a separate ADD, so in terms of flop numbers, you should get that:

    A G71MP32 at 1000 MHz should be able to do 32*12*(2+1)*1000M = 1152 GFLOPS.

    While this does fall a bit short of XB1/PS4, it should generally be a bit easier to get close to theoretical numbers with smaller wavefronts. It's also my understanding that the GCN architecture used in these consoles has to do attribute/varying interpolation in the shader cores, which reduces the GFLOPS actually available for shading; G71 keeps dedicated varying units, so varying interpolation doesn't take away from its GFLOPS number. Also, G71 has full FP16 support, using 2-component vectors in each 32-bit lane; using FP16 instead of FP32 will as such give you twice the paper GFLOPS.

    Also, for heterogeneous compute, G71 now has full coherency, so that CPU and GPU can work on the same data without intervening cache maintenance operations; whether this is useful for anything that a console would normally do ... time will tell, I suppose.
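    As a sanity check, the arithmetic above can be sketched as follows, assuming 2 flops per FMA plus 1 for the separate ADD, and a 2x rate for FP16 as described (function and parameter names are illustrative):

    ```python
    def peak_gflops(cores: int, clock_ghz: float, lanes: int = 4,
                    engines: int = 3, flops_per_lane: int = 3,
                    fp16: bool = False) -> float:
        # flops_per_lane = 2 (FMA) + 1 (separate ADD), per the post above;
        # FP16 packs 2-component vectors into each 32-bit lane -> 2x rate.
        per_core = lanes * engines * flops_per_lane
        return cores * per_core * clock_ghz * (2 if fp16 else 1)

    print(peak_gflops(32, 1.0))             # FP32: 1152.0 GFLOPS
    print(peak_gflops(32, 1.0, fp16=True))  # FP16: 2304.0 GFLOPS
    ```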
     
    pixelio likes this.
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Yes, GCN does attribute interpolation with ALUs (the compiler puts the instructions at the beginning of the shader). GCN also uses lots of ALU cycles for other seemingly innocent tasks, such as sampling a cube map (normalize the UV vector, compute cubeface/cubecoord). Double rate FP16 would help a lot in games; not all math needs full precision to have any effect on the resulting image quality. I also like double rate 16-bit integer math (and native support for splitting a 32-bit register between two 16-bit variables). Modern rendering engines do a lot of integer math (in compute shaders).

    But theoretical FLOPS alone is unfortunately not enough to beat the consoles. Memory bandwidth plays a big role in running optimized shader code (designed solely for a single platform). Most shaders (on the critical path) are optimized until memory bandwidth becomes the hard limit. Xbox One bandwidth is 68 GB/s + ESRAM and PS4 is 176 GB/s. It is still a long way to get there. Tiling helps mobile GPUs of course, but ESRAM helps as much and is much more flexible than tiling. Modern rendering engines spend 50%+ of their time running compute shaders, and tiling doesn't help compute shaders at all. AMD GCN shines in compute: it is highly flexible and very fine-grained (higher GPU utilization can be reached by running shaders with different bottlenecks concurrently on each CU). GCN doesn't suffer much from LDS bank conflicts and has super fast LDS atomics (that don't even use VALU slots). I have seen a Radeon 7970 beat a GTX 980 (Maxwell) in code containing lots of LDS atomics (and it beats the 780 Ti by a HUGE margin, as Kepler has dead slow atomics).
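    One rough way to frame the bandwidth argument is flops-per-byte: how much math a GPU can afford per byte of DRAM traffic before bandwidth becomes the hard limit. The PS4 figure below uses its well-known ~1843 GFLOPS peak alongside the 176 GB/s quoted above; the mobile bandwidth figure is purely an assumed placeholder (64-bit LPDDR4 at 25.6 GB/s), not a real G71 spec:

    ```python
    def flops_per_byte(gflops: float, gbps: float) -> float:
        """Arithmetic-intensity budget: flops available per byte of
        DRAM traffic before memory bandwidth becomes the bottleneck."""
        return gflops / gbps

    # PS4: ~1843 GFLOPS over 176 GB/s -> ~10.5 flops/byte budget.
    print(flops_per_byte(1843, 176))
    # Hypothetical 1152-GFLOPS mobile part on assumed 25.6 GB/s LPDDR4:
    # a much larger 45 flops/byte budget, i.e. bandwidth bites far sooner.
    print(flops_per_byte(1152, 25.6))
    ```

    The point being that a mobile SoC with console-class FLOPS would still be starved on DRAM traffic unless tiling (or something ESRAM-like) absorbs most of it.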
     
    Pixel and Silent_Buddha like this.
  7. Forrest

    Newcomer

    Joined:
    Jul 22, 2008
    Messages:
    39
    Likes Received:
    0
    With the amount of spatial coherence typically in graphics, a SIMD4 seems like a strange choice. Imagine the amount of control logic required in a 32 core GPU! I don't think packing quads into a wavefront is a problem - only the last wavefront would be partially empty.
     
  8. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,294
    Likes Received:
    132
    Location:
    On the path to wisdom
    Keep in mind that it's a tile-based GPU; it wouldn't be able to pack quads from different tiles into a wavefront.
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,494
    Likes Received:
    806
    Tiles are 32x32 pixels these days. There should be plenty of opportunity to do 4 quads at a time.

    Looks more like a re-purposing of their existing SIMD-4 ALU (which might be a wise choice)

    Cheers
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,004
    Likes Received:
    109
    Is that really different to "traditional" renderers? I'm not convinced these can export to multiple RBEs from the same wave.
     
    jiaolu likes this.
  11. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,294
    Likes Received:
    132
    Location:
    On the path to wisdom
    I don't know if that's the case. But even if the quads in a wavefront need to go to the same RBE that's not the same as going to the same tile.

    So a 32x32 tile gives 256 quads without overdraw, i.e. at least 64 hypothetical wavefronts of 4 quads. With each object added to the tile that gets more fragmented; I wouldn't be surprised to see utilisation well below 90% in average cases. Is that worth it? I don't know.

    On the other hand a lot of the control logic per core seems to be shared between the three quad engines.
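    The packing math above can be sketched as follows, under the simplifying assumption that only quads from the same tile/object can share a wavefront, so the last wave per batch runs partially empty:

    ```python
    import math

    def wavefront_utilisation(quads_in_batch: int, quads_per_wave: int = 4) -> float:
        """Fraction of quad slots doing useful work when a batch of quads
        must be packed into fixed-size wavefronts and the last wave is padded."""
        waves = math.ceil(quads_in_batch / quads_per_wave)
        return quads_in_batch / (waves * quads_per_wave)

    print(wavefront_utilisation(256))  # a full 32x32 tile: 1.0
    print(wavefront_utilisation(5))    # small fragment: 5 quads in 2 waves -> 0.625
    ```

    With many small triangles per tile, average batch sizes shrink and the padded last wave starts to dominate, which is the fragmentation concern raised above.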
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    http://www.golem.de/news/kirin-960-...ueber-seinen-smartphone-chip-1610-124061.html

    Huawei is more than open about its SoCs; the Kirin 960 weighs in at about 110mm² on TSMC's 16FFC. Other than that, T880MP4@900MHz vs. G71MP8@900MHz = +180% according to Huawei, and if you look at slide 7 the former gets around 19 fps in Manhattan 3.0 offscreen while the 960 climbs all the way up to 51 fps. On the other hand the A10 GPU, with just 6 clusters and a frequency I'd estimate around 630+MHz, gets 64 fps. I don't think I need to normalize that on a hypothetical basis to the same amount of clusters and the same frequency to make my point....
     
    #12 Ailuros, Oct 28, 2016
    Last edited: Oct 28, 2016
    Lodix likes this.
  13. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,720
    Likes Received:
    193
    Location:
    Stateless
    Well, normalizing on the basis of clock speed and number of GPU units does not say much. Power and die size are better indications of design efficiency.
    ARM's new IPs are impressive on both the CPU side and the GPU side, and the A35 is not in use yet.
    I wonder when low- and mid-end SoCs will really move forward; it seems market segmentation is the only thing holding back their evolution at this point (I guess some twisted customer perception wrt core count too).
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
  15. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    How high exactly are the chances that 8 clusters clocked at 900MHz will consume less power than 6 clusters at 630+MHz (with the number of TMUs per cluster being equal this time)?
     
  16. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,720
    Likes Received:
    193
    Location:
    Stateless
    It is a logical fallacy; I can't answer, since the clusters are not the same. Now, the clock speed is likely to affect power efficiency.
    Your original post is extremely negative wrt the giant strides ARM's IPs are making, especially in the GPU realm. Furthermore, there are no proper tests of devices running that SoC yet.
    If it does not beat Apple's custom GPU efforts along with their massively optimized software stack, it will be a surprise to nobody. Yet it is one hell of a jump from the previous architecture.
     
    Lodix likes this.
  17. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    If power consumption were low enough, they would eventually clock it even higher.
    Yes, the negativity is there because ARM traditionally bites off more than it can chew when marketing and presenting architectures, while reality has so far traditionally painted a different picture.
    So where's the "giant stride" exactly? I've commented on Huawei's claims, and I severely doubt they've made any of those numbers up or would go to any effort to show their own products in a negative light.

    No ARM GPU IP has ever managed to beat any PVR GPU IP cluster for cluster and clock for clock so far. That won't change if someone compares performance numbers from other Rogue implementations either. It's not me that makes those comparisons, either; it's Huawei in the given case, and the entire market that unfortunately takes Apple as the metric.
     
    #17 Ailuros, Nov 1, 2016
    Last edited: Nov 1, 2016
  18. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    What is that based on?
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    Damn I've screwed up again :( Those claimed 27.2 GPixels for G71 are for 32 clusters and me dumbass calculated with 16 *kicks sand and shrugs*
     
  20. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania