ARM Midgard Architecture

tangey · Feb 23, 2014

Could be the first T-series graphics smartphone SoC to ship in volume.

Lazy8s · Mar 24, 2014

The quad-core Mali-T760 inside Rockchip's quick-to-market Cortex-A12 SoC hits the benchmarks:

http://gfxbench.com/result.jsp?benchmark=gfx30&test=545&order=score&base=gpu

Ailuros · Mar 24, 2014

Lazy8s said:
The quad-core Mali-T760 inside Rockchip's quick-to-market Cortex-A12 SoC hits the benchmarks:

http://gfxbench.com/result.jsp?benchmark=gfx30&test=545&order=score&base=gpu

Assuming the T760 runs at a frequency comparable to the T628MP6, the former should perform roughly as well with 4 clusters as the latter does with 6, but at the cost of significantly higher die area.

Ailuros · Jun 19, 2014

Found it yesterday and some might find it interesting:

http://community.arm.com/thread/5688

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

7: dot product (4 Muls, 3 adds)

1: scalar add

4: vec4 add

4: vec4 multiply

1: scalar multiply

So the formula is:

17 FP32 flops/cycle * ALU count * core count * frequency

T604 MP4 : 17 * 2 * 4 * 0.533 = 72.488 FP32 GFLOPS
T628 MP6 : 17 * 2 * 6 * 0.533 = 108.732 FP32 GFLOPS

This is assuming FP32, but as the ALU's vector units are quite flexible, you can actually do more work in the vector units using FP16, or less using FP64. You can achieve 5 FP64 FLOPS per ALU per cycle, so that gives us:

T604 MP4 : 5 * 2 * 4 * 0.533 = 21.32 FP64 GFLOPS
T628 MP6 : 5 * 2 * 6 * 0.533 = 31.98 FP64 GFLOPS
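The arithmetic in the quoted post can be reproduced with a short sketch. The function and variable names here are hypothetical, and the 2 ALUs per core and 0.533 GHz clock are simply the figures used in the post, not independently confirmed values.

```python
# Per-ALU FP32 operations per cycle, as itemized in the quoted ARM post.
FP32_OPS_PER_ALU = {
    "dot product (4 muls, 3 adds)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}

FP32_FLOPS_PER_ALU = sum(FP32_OPS_PER_ALU.values())  # 7 + 1 + 4 + 4 + 1
FP64_FLOPS_PER_ALU = 5  # figure stated in the post, not derived here


def peak_gflops(flops_per_alu_cycle, alus_per_core, cores, freq_ghz):
    """Theoretical peak: FLOPS/cycle/ALU * ALUs per core * cores * GHz."""
    return flops_per_alu_cycle * alus_per_core * cores * freq_ghz


# Configurations from the post: (name, core count); 2 ALUs per core assumed.
for name, cores in (("T604 MP4", 4), ("T628 MP6", 6)):
    fp32 = peak_gflops(FP32_FLOPS_PER_ALU, 2, cores, 0.533)
    fp64 = peak_gflops(FP64_FLOPS_PER_ALU, 2, cores, 0.533)
    print(f"{name}: {fp32:.3f} FP32 GFLOPS, {fp64:.2f} FP64 GFLOPS")
```

These are theoretical peaks only; they assume every slot in the ALU issues every cycle, which real shader code will rarely manage.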

Rys · Jun 20, 2014

ARM's GPU compiler team were clearly never consulted before the Midgard ALU was designed, because not only is that arrangement really hard to do efficient codegen for in general, but it's also a chained arrangement with bypass paths, which I think is also encoded in the ISA.

Nebuchadnezzar · Jun 20, 2014

This might be the most ghetto chip breakdown ever, but it's also the first time I've ever seen a die shot of a new Mali GPU: http://www.antutu.com/view.shtml?id=7879

Includes IP block size breakdowns: http://news.mydrivers.com/picture/309044/309044_36.html

Qualcomm is killing it if the 330 is really just 16mm².

Ailuros · Jun 20, 2014

Nebuchadnezzar said:
This might be the most ghetto chip breakdown ever, but it's also the first time I've ever seen a die shot of a new Mali GPU: http://www.antutu.com/view.shtml?id=7879

Includes IP block size breakdowns: http://news.mydrivers.com/picture/309044/309044_36.html

Excellent find Nebu.

Rys · Jun 20, 2014

Yep, that area annotation is wrong (excludes GMEM and isn't boundary accurate for the blocks it does enclose). It's a bit bigger.

Ailuros · Jun 20, 2014

Rys said:
Yep, that area annotation is wrong (excludes GMEM and isn't boundary accurate for the blocks it does enclose). It's a bit bigger.

I assume it's inaccurate for all the SoCs they try to investigate? Assuming the G6430 is 23mm², would that also mean it's roughly under 230 million transistors?

Rys · Jun 20, 2014

ARM Mali-T604

Don't really know for the HiSilicon chip; I've never seen my own shot, and that one looks like it's been badly delayered. The other areas are all inaccurate by some amount.

Lazy8s · Jun 23, 2014

When you scale up the cost metrics of a mobile GPU design (die area, heat dissipation, and power consumption), the resulting mobile SoC will start to run too hot to be practical for a range of end products long before the sub-dollar cost of several extra square millimeters of GPU silicon prices it out of competing for those design wins.

Rys · Jun 23, 2014

So you think vendors have designed SoCs for certain market segments, but have never been able to find design wins because their power has ended up being too high?

That's a very rare occurrence in the grand scheme of things.

Lazy8s · Jun 23, 2014

I think prioritizing performance per square millimeter over performance per milliwatt for mobile (whether by intention or by simply not having the architectural efficiencies to do otherwise) can result in a product along the lines of a K1 where the primary target market becomes a niche like tablets versus mainstream or high-end smartphones.

Ailuros · Jun 23, 2014

Lazy8s said:
I think prioritizing performance per square millimeter over performance per milliwatt for mobile (whether by intention or by simply not having the architectural efficiencies to do otherwise) can result in a product along the lines of a K1 where the primary target market becomes a niche like tablets versus mainstream or high-end smartphones.

You know that would actually also include Apple's SoCs, since ever since the A4 (I think) they've been, in a relative sense, sacrificing die area in order to save more power. The formula for the ULP SoC world is rather simple: it's called PPA, and in that exact order:

Power
Performance
Area

Lazy8s · Jun 23, 2014

?

I know Apple has designed for lower thermals/power by using more die area, which I'm saying is the correct priority for a mobile design. That's what I mean by prioritizing higher performance per milliwatt ahead of even performance per square millimeter.

I wonder if that focus on larger silicon layouts started with the Fast14-type technology they got from Intrinsity on, as mentioned, the Apple A4. Surprisingly, Apple has managed to shrink silicon usage with the A7, though, yet their priorities still seem to be in the right place.

Ailuros · Jun 23, 2014

Lazy8s said:
?

I know Apple has designed for lower thermals/power by using more die area, which I'm saying is the correct priority for a mobile design. That's what I mean by prioritizing higher performance per milliwatt ahead of even performance per square millimeter.

I apologize; now that I'm re-reading it, I can see my blond moment.

I wonder if that focus on larger silicon layouts started with the Fast14-type technology they got from Intrinsity on, as mentioned, the Apple A4. Surprisingly, Apple has managed to shrink silicon usage with the A7, though, yet their priorities still seem to be in the right place.

Hard to tell without knowing the transistor counts of former Apple SoCs and/or of SoCs competing with the A7. We know that the A7 has roughly 1 billion transistors spread over 101mm², which works out to roughly 9.9 million transistors per mm² on Samsung's 28nm, but that's all we know.
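As a rough sanity check on that density figure, here is a minimal sketch of the division. The helper name is hypothetical, and applying the A7's whole-die average to a single GPU block is only an assumption for illustration, since density varies between logic, SRAM, and I/O.

```python
def density_millions_per_mm2(transistors, area_mm2):
    """Average transistor density in millions of transistors per mm²."""
    return transistors / area_mm2 / 1e6


# A7 figures from the post: roughly 1 billion transistors over 101 mm².
a7_density = density_millions_per_mm2(1e9, 101)

# Hypothetical extrapolation: a 23 mm² GPU block at the same average density.
g6430_estimate_millions = 23 * a7_density

print(f"A7 average density: {a7_density:.1f} million/mm²")
print(f"23 mm² block at that density: {g6430_estimate_millions:.0f} million")
```

At that average density a 23mm² block comes out a little under 230 million transistors, which lines up with the earlier guess for the G6430, with the caveat that the area annotations themselves were called inaccurate above.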

ams · Jun 23, 2014

Lazy8s said:
I think prioritizing performance per square millimeter over performance per milliwatt for mobile (whether by intention or by simply not having the architectural efficiencies to do otherwise) can result in a product along the lines of a K1 where the primary target market becomes a niche like tablets versus mainstream or high-end smartphones.

You aren't making any sense here. The Tegra K1 GPU does not prioritize perf. per mm^2 over perf. per watt (in fact, it is ~1.5x more power efficient than the best ultra mobile GPUs available today). The SoC die size of TK1 is ~50% larger than Tegra 4, with most of the increase likely going to the GPU. TK1 is also not confined to tablets (which is not a niche market to begin with), as it will make its way into portable gaming devices, high res. 4K TVs and monitors, high end smartphones, automotive infotainment/navigation/advanced driver assistance systems, and embedded devices for robotics, medical, and military applications.

Lazy8s · Jun 24, 2014

Yes, I too have read up on how nVidia's latest development hardware compares to the actual end products from last year's competition. I also observe that the OEMs who build smartphones at the highest performance end, where selecting an app processor without an integrated modem/baseband is a completely acceptable design decision, and who've used nVidia in this space before, are not selecting Tegra K1, nor are the MediaTeks, Rockchips, Broadcoms, Samsungs, TIs, etc. of the world licensing K1's GPU IP for their SoCs.

ams · Jun 26, 2014

Lazy8s said:
Yes, I too have read up on how nVidia's latest development hardware compares to the actual end products from last year's competition. I also observe that the OEMs who build smartphones at the highest performance end, where selecting an app processor without an integrated modem/baseband is a completely acceptable design decision, and who've used nVidia in this space before, are not selecting Tegra K1, nor are the MediaTeks, Rockchips, Broadcoms, Samsungs, TIs, etc. of the world licensing K1's GPU IP for their SoCs.

LOL. TK1 GPU perf. and perf. per watt are far superior to [end of last year's] highest end ultra mobile SoCs, and should be very competitive with any ultra mobile SoC for the duration of this year. You still aren't making any sense here. FYI, numerous high end phone SoCs do not have a baseband modem integrated on die (including S600, S805, etc). As for the licensing bit, that is aimed at a very select group of vertically integrated companies, and probably will not bear fruit until the Maxwell generation at the earliest due to the timeframes involved with IP development.

Anyway, your previous statement about prioritizing perf. per mm^2 rather than perf. per watt is nonsensical given the data we have at this time.

Ailuros · Jun 26, 2014

It's also a generation ahead, so you'd expect it to run ahead anyway. Can we get back to Mali now, please?
