ARM Midgard Architecture

Found it yesterday and some might find it interesting:

http://community.arm.com/thread/5688

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

  • 7: dot product (4 Muls, 3 adds)
  • 1: scalar add
  • 4: vec4 add
  • 4: vec4 multiply
  • 1: scalar multiply

So the formula is:

17 FP32 FLOPS/cycle per ALU * ALUs per core * core count * frequency (GHz) = peak FP32 GFLOPS

T604 MP4 : 17 * 2 * 4 * 0.533 = 72.488 FP32 GFLOPS
T628 MP6 : 17 * 2 * 6 * 0.533 = 108.732 FP32 GFLOPS

This is assuming FP32, but as the ALU's vector units are quite flexible, you can actually do more work in the vector units using FP16, or less using FP64. You can achieve 5 FP64 FLOPS per ALU per cycle, so that gives us:

T604 MP4 : 5 * 2 * 4 * 0.533 = 21.32 FP64 GFLOPS
T628 MP6 : 5 * 2 * 6 * 0.533 = 31.98 FP64 GFLOPS
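
To make the arithmetic above easy to re-run for other configurations, here is a minimal Python sketch of that formula. The 17/5 FLOPS-per-cycle figures, the 2 ALUs per core, and the 533 MHz clock are the values quoted above; nothing else is assumed.

    # Peak GFLOPS = FLOPS/cycle per ALU * ALUs per core * cores * frequency (GHz)
    def peak_gflops(flops_per_cycle, alus_per_core, cores, freq_ghz):
        return flops_per_cycle * alus_per_core * cores * freq_ghz

    # Values quoted above: 17 FP32 / 5 FP64 FLOPS per ALU per cycle, 2 ALUs per core, 533 MHz
    print(round(peak_gflops(17, 2, 4, 0.533), 3))  # T604 MP4, FP32: 72.488
    print(round(peak_gflops(17, 2, 6, 0.533), 3))  # T628 MP6, FP32: 108.732
    print(round(peak_gflops(5, 2, 4, 0.533), 3))   # T604 MP4, FP64: 21.32
    print(round(peak_gflops(5, 2, 6, 0.533), 3))   # T628 MP6, FP64: 31.98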
 
ARM's GPU compiler team were clearly never consulted before the Midgard ALU was designed, because not only is that arrangement really hard to do efficient codegen for in general, but it's also a chained arrangement with bypass paths, which I think is also encoded in the ISA.
 
Yep, that area annotation is wrong (excludes GMEM and isn't boundary accurate for the blocks it does enclose). It's a bit bigger.
 
I assume it's inaccurate for all SoCs they try to investigate? Assuming the G6430 is 23 mm², would that also mean it's roughly under 230 million transistors?
 
ARM Mali-T604

Don't really know for the HiSilicon chip; I've never seen my own shot, and that one looks like it's been badly delayered. The other areas are all inaccurate by some amount.
 
When you scale up the various cost metrics of a mobile GPU design (die area, heat dissipation, power consumption), the resulting SoC will start to run too hot to be practical for a range of end products long before the sub-dollar cost of a few extra square millimeters of GPU silicon prices it out of competing for those design wins.
 
So you think vendors have designed SoCs for certain market segments, but have never been able to find design wins because their power has ended up being too high?

That's a very rare occurrence in the grand scheme of things.
 
I think prioritizing performance per square millimeter over performance per milliwatt for mobile (whether by intention or simply by not having the architectural efficiencies to do otherwise) can result in a product along the lines of the K1, where the primary target market becomes a niche like tablets rather than mainstream or high-end smartphones.
 
You know, that would actually also apply to Apple's SoCs, since ever since the A4 (I think) they've been, in a relative sense, trading die area for more performance. The formula for the ULP SoC world is rather simple; it's called PPA, and in exactly that order:

Power
Performance
Area
 
?

I know Apple has designed for lower thermals/power by using more die area, which I'm saying is the correct priority for a mobile design. That's what I mean by prioritizing higher performance per milliwatt ahead of even performance per square millimeter.

I wonder if the start of that focus on larger silicon layouts was the Fast14-type technology they got from Intrinsity, as mentioned, with the Apple A4. Apple has managed to shrink silicon usage surprisingly with the A7, though, yet their priorities still seem to be in the right place.
 
I apologize; now that I'm re-reading it, I can see my blond moment.

Hard to tell without knowing the transistor counts of former Apple SoCs and/or SoCs competing with the A7. We know that the A7 has roughly 1 billion transistors spread over 101 mm², meaning roughly 9.9 million transistors per mm² on Samsung's 28nm, but that's all we know.
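
As a quick sanity check on those figures, a short Python sketch of the same arithmetic follows. The 1 billion / 101 mm² numbers are the A7 values quoted above; reusing that density for the ~23 mm² GPU block mentioned earlier is only a rough assumption, since it's a different design on a different foundry's 28nm process.

    # Transistor density implied by the A7 numbers quoted above
    a7_transistors = 1_000_000_000           # ~1 billion
    a7_area_mm2 = 101.0                      # ~101 mm^2
    density = a7_transistors / a7_area_mm2   # transistors per mm^2
    print(round(density / 1e6, 1))           # ~9.9 million per mm^2

    # Rough check of the earlier ~230 million estimate for a ~23 mm^2 GPU block,
    # assuming (loosely) a similar density on a comparable 28nm process
    gpu_area_mm2 = 23.0
    print(round(gpu_area_mm2 * density / 1e6))  # ~228 million transistors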
 
You aren't making any sense here. The Tegra K1 GPU does not prioritize perf. per mm^2 over perf. per watt (in fact, it is ~1.5x more power efficient than the best ultra mobile GPUs available today). The SoC die size of TK1 is ~50% larger than Tegra 4, with most of the increase likely going to the GPU. TK1 is also not confined to tablets (which is not a niche market to begin with), as it will make its way into portable gaming devices, high-res 4K TVs and monitors, high-end smartphones, automotive infotainment/navigation/advanced driver assistance systems, and embedded devices for robotics, medical, and military applications.
 
Yes, I too have read up on how nVidia's latest development hardware compares to the actual end products from last year's competition. I also observe that the OEMs who build smartphones at the highest performance end, where selecting an app processor without an integrated modem/baseband is a completely acceptable design decision, and who have used nVidia in this space before, are not selecting Tegra K1; nor are the MediaTeks, Rockchips, Broadcoms, Samsungs, TIs, etc. of the world licensing K1's GPU IP for their SoCs.
 
LOL. TK1 GPU perf. and perf. per watt are far superior to [the end of last year's] highest-end ultra mobile SoCs, and should be very competitive with any ultra mobile SoC for the duration of this year. You still aren't making any sense here. FYI, numerous high-end phone SoCs do not have a baseband modem integrated on die (including S600, S805, etc.). As for the licensing bit, that is aimed at a very select group of vertically integrated companies, and probably will not yield fruit until the Maxwell generation at the earliest, due to the timeframes involved in IP development.

Anyway, your previous statement about prioritizing perf. per mm^2 rather than perf. per watt is nonsensical given the data we have at this time.
 
It's also a generation ahead, so you'd expect it to be ahead anyway. Can we come back to Mali now, please?
 