ARM Midgard Architecture

And get beaten by SGX 543MP4, a part a generation behind.

Not really, unless the SGX543MP4 is clocked at >400MHz.
Let's not mistake the 500MHz Cedar in E-350 with the 280MHz version in C-50/Z-01.

(EDIT: wrong codename)
 
ARM are moving to a yearly cycle for new GPU IP (another MP capable Mali T-6xx core will follow in 2013.)

They're all still a part of the Midgard architecture. A totally new architecture is planned to follow on the standard ~4 year cadence.
 
Not really, unless the SGX543MP4 is clocked at >400MHz.
Let's not mistake the 500MHz Cedar in E-350 with the 280MHz version in C-50/Z-01.

(EDIT: wrong codename)

28nm high end SoC GPUs aren't going to be clocked lower than 500MHz (in fact they'll start at even higher frequencies than that), as you can see from the 2.0 GPixels ARM states for the T604MP4.

Let's not go to the Rogue generation (which would be more fair in all honesty due to that one and T6x0 belonging to the same generation) and let's do the Series5XT trickery again at just 500MHz:

SGX554MP2@500MHz = 72 GFLOPs/s

Past generation stuff Exophase just picked a 543MP4 which you'd need a far higher frequency to exceed the T604 rate. Again, though, that's not particularly fair since T604 belongs to the next generation (don't forget that native FP64 support eats up a shitload of die area); it still remains a fact that T604 ALU throughput sounds low, especially if the Rogue in A9600 turns out to be an MP2@667; if so, that would be >105 GFLOPs per core.

Just for the record by the time T604 and Rogue will ship in actual devices, C50 and any of today's lower end SoCs will be quite old news. You wouldn't imagine that AMD will increase significantly in performance (at least by a factor of 2.0x for each market segment) under 28nm now would you?
 
Let us not mistake the mem bw of Zacate with mem bw of any SoC that might be expected to have T 604.
Zacate is single-channel 64bit DDR3 1066/1333MHz.
Exynos 4210 (Mali 400MP4) and OMAP4 are already dual-channel 32bit LPDDR2.
You really think the T604 is expected to be in a SoC with less bandwidth than that?


What would be the clock for T604 to reach 68Gflops in the first place?
I wouldn't know. I just compared the 68GFLOPS in the slide with the 80GFLOPs in a 500MHz Cedar.
Exophase said the T604 would be beaten by a SGX543MP4, which is not true for the rumoured ~200MHz clock in PSVita.
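
For what it's worth, the clock dependence can be sketched with the figures quoted in this thread. This is only a rough sketch: it assumes peak GFLOPS scale linearly with clock, and that a 543MP4 matches the 72 GFLOPs quoted for a 554MP2 at 500MHz. The function names are made up for illustration.

```python
# Figures quoted in the thread (theoretical peaks, not measurements).
SGX_REF_GFLOPS = 72.0    # SGX554MP2 @ 500MHz; assumed equal for a 543MP4 at the same clock
SGX_REF_MHZ = 500.0
T604_MP4_GFLOPS = 68.0   # from the ARM slide


def scaled_gflops(ref_gflops, ref_mhz, target_mhz):
    """Scale a peak GFLOPS figure linearly with clock (an assumption)."""
    return ref_gflops * target_mhz / ref_mhz


def break_even_mhz(ref_gflops, ref_mhz, target_gflops):
    """Clock at which the reference part would match target_gflops."""
    return ref_mhz * target_gflops / ref_gflops


vita_gflops = scaled_gflops(SGX_REF_GFLOPS, SGX_REF_MHZ, 200.0)
match_mhz = break_even_mhz(SGX_REF_GFLOPS, SGX_REF_MHZ, T604_MP4_GFLOPS)

print(f"543MP4 @ 200MHz: {vita_gflops:.1f} GFLOPS")
print(f"Clock needed to match 68 GFLOPS: {match_mhz:.0f} MHz")
```

On these assumptions a ~200MHz 543MP4 lands well under the 68 GFLOPS figure, and it would need to be clocked somewhere above 400MHz to match it.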



Just for the record by the time T604 and Rogue will ship in actual devices, C50 and any of today's lower end SoCs will be quite old news. You wouldn't imagine that AMD will increase significantly in performance (at least by a factor of 2.0x for each market segment) under 28nm now would you?

Nope, I'm hoping for Krishna and Wichita to have at least a Caicos GPU in it (160sp, 8 TMUs, 4 ROPs), since the difference in transistor count for Cedar (80sp) is only ~21%, which should be negligible given the smaller node (Brazos are still made in 40nm).
 
Past generation stuff Exophase just picked a 543MP4 which you'd need a far higher frequency to exceed the T604 rate.

Wouldn't ALU rate on the 543MP4 be the same as 554MP2 at the same clock? But yes, I was considering it the same clock speed.

Of course throwing around FLOPs counts as a means for comparison is pretty limited, when you have no idea of where those FLOPs are allocated.

As for native FP64, do you think ARM would really be allocating more than the bare minimum of necessary resources (ie, FP64 throughput at 1/4th FP32)? Although that'd still necessitate somewhat wider FP multipliers..
 
Nope, I'm hoping for Krishna and Wichita to have at least a Caicos GPU in it (160sp, 8 TMUs, 4 ROPs), since the difference in transistor count for Cedar (80sp) is only ~21%, which should be negligible given the smaller node (Brazos are still made in 40nm).

Well, going by unit count alone, a Caicos GPU would possibly be in the region of the Rogue in ST's A9600, which ST rates at over 210 GFLOPs/s and over 5.2GTexels/s (w/o overdraw). That's still quite a distance from 68 GFLOPs/s and 2.0GTexels/s.
 
Wouldn't ALU rate on the 543MP4 be the same as 554MP2 at the same clock? But yes, I was considering it the same clock speed.

Yes it would; but if anyone were just going for sterile ALU processing power I'd assume that a 554MP2 could be a better idea (fewer TMUs, z/stencil units and what not). From the sound of it T604 still sounds like a 1 TMU design like Mali400, unless I'm reading too much into those performance figures (2.0 GPixels / 4 = 500MHz).

In the case of the 543/4MP4 you'd have 8 TMUs at 500MHz and not just 4 ;)

Of course throwing around FLOPs counts as a means for comparison is pretty limited, when you have no idea of where those FLOPs are allocated.

Without knowing what each of them looks like in real-time throughput/efficiency, it's just a game of theoretical vs. theoretical numbers. However, regardless of how bad the ALU efficiency might be on next generation GPU IP, 17 GFLOPs/core is still low.

As for native FP64, do you think ARM would really be allocating more than the bare minimum of necessary resources (ie, FP64 throughput at 1/4th FP32)? Although that'd still necessitate somewhat wider FP multipliers..

Don't know if it's technically accurate but I'd argue that it's not even a 4:1 ratio since the whole thing sounds like 34 FLOPs throughput which doesn't evenly divide by 4. As you say it most likely has some sort of VecN+1 ALUs where the "1" might stand for an additional ADD or MUL. In other words it could be 34 FLOPs single precision and 8 FLOPs double precision.
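
The per-clock arithmetic behind that guess can be sketched as follows. It assumes the 68 GFLOPS figure is for an MP4 at 500MHz (the clock inferred elsewhere in this thread); the 8 FLOPs double precision figure is the speculation above, not anything ARM has stated.

```python
def flops_per_clock_per_core(total_gflops, cores, clock_mhz):
    """Peak FLOPs issued per clock by a single core."""
    return total_gflops * 1e9 / (cores * clock_mhz * 1e6)


fp32_per_clock = flops_per_clock_per_core(68.0, 4, 500.0)
gflops_per_core = 68.0 / 4

# A strict 4:1 FP32:FP64 ratio would need a whole number of FP64 FLOPs per clock.
strict_quarter_rate = fp32_per_clock / 4

# Speculated in the post: 8 double precision FLOPs per clock.
speculated_fp64_per_clock = 8
speculated_ratio = fp32_per_clock / speculated_fp64_per_clock

print(f"FP32 FLOPs/clock/core: {fp32_per_clock:.0f}")
print(f"GFLOPS per core: {gflops_per_core:.0f}")
print(f"Strict 1/4 rate would be {strict_quarter_rate} FLOPs/clock")
print(f"34:8 works out to {speculated_ratio}:1")
```

So 34 doesn't split cleanly four ways (8.5 per clock), and 34 single precision against 8 double precision would be a 4.25:1 ratio.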

Frankly I doubt any of the future architectures has gone to any lengths to achieve anything lower than 4:1. ARM itself has stated in one of their public writeups regarding T604 that FP64 might get used rarely in real time conditions in the embedded space, but when it does it's needed badly. I have no objection to that, and no, I personally wouldn't want valuable transistors invested in something like FP64 to a much higher degree just yet, at the cost of raw performance for the vast majority of cases.
 
Zacate is single-channel 64bit DDR3 1066/1333MHz.
Exynos 4210 (Mali 400MP4) and OMAP4 are already dual-channel 32bit LPDDR2.
You really think the T604 is expected to be in a SoC with less bandwidth than that?
There is still some difference in mem bw. Just run the numbers. Arguably, these things are vastly more mem limited than ALU limited.

I wouldn't know. I just compared the 68GFLOPS in the slide with the 80GFLOPs in a 500MHz Cedar.
T604 will need 500MHz to reach 68 gflops.
 
There is still some difference in mem bw. Just run the numbers.

I wouldn't be so sure. LPDDR2 right now is around 800MHz for the dual-channel 32-bit implementations IIRC.

A T604 in a Cortex A15 SoC might go dual-channel 1333MHz LPDDR2 and beyond.
Lower-power Wichita might never pass 1600MHz DDR3, so they may end up pretty close.
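
Running those numbers as theoretical peaks looks roughly like this. It's a sketch only: the rates are treated as effective transfer rates in MT/s, the configurations are the ones mentioned in this thread, and real sustained bandwidth will be lower. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(channels, bus_width_bits, transfer_rate_mt_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = channels * bus_width_bits // 8
    return bytes_per_transfer * transfer_rate_mt_s / 1000.0


# (label, channels, bus width in bits, effective rate in MT/s)
configs = [
    ("Zacate, 1 x 64-bit DDR3-1066", 1, 64, 1066),
    ("Zacate, 1 x 64-bit DDR3-1333", 1, 64, 1333),
    ("2 x 32-bit LPDDR2 @ 800", 2, 32, 800),
    ("2 x 32-bit @ 1333 (speculated)", 2, 32, 1333),
    ("1 x 64-bit DDR3-1600 (speculated)", 1, 64, 1600),
]

for label, channels, bits, rate in configs:
    print(f"{label}: {peak_bandwidth_gb_s(channels, bits, rate):.1f} GB/s")
```

Dual-channel 32-bit and single-channel 64-bit both move 8 bytes per transfer, so at this level the comparison comes down to transfer rate alone.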

T604 will need 500MHz to reach 68 gflops.
And where did that number come from? Honest question.
 
That's fairly easy to answer. When they state themselves that a MP4 reaches 2.0GPixels/s I wonder what meows on a hot tin roof (tip: it ain't me) :LOL:
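
Spelled out, the inference is just fill rate divided by core count. A minimal sketch, assuming each core writes one pixel per clock (an assumption carried over from Mali400, not something ARM has confirmed for T604):

```python
def implied_clock_mhz(fill_rate_gpixels_s, cores, pixels_per_clock_per_core=1):
    """Clock implied by a quoted peak fill rate."""
    pixels_per_clock = cores * pixels_per_clock_per_core
    return fill_rate_gpixels_s * 1e9 / pixels_per_clock / 1e6


t604_mp4_clock = implied_clock_mhz(2.0, 4)
print(f"Implied T604 MP4 clock: {t604_mp4_clock:.0f} MHz")
```

If a core could write two pixels per clock instead, the same 2.0 GPixels/s would imply only 250MHz, so the 500MHz figure stands or falls with that assumption.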
 
I wouldn't be so sure. LPDDR2 right now is around 800MHz for the dual-channel 32-bit implementations IIRC.

A T604 in a Cortex A15 might go dual-channel 1333MHz LPDDR2 and beyond.
Lower-power Wichita might never pass 1600MHz DDR3, so they may end up pretty close.
LPDDR2 stops at 1066 MHz effective frequency, so you need LPDDR3 for 1333+. There are other solutions to high bandwidth in SoCs that don't just chase DRAM frequency, which might be worth looking at.

An aside, but it's quite grating to read "T604 in a Cortex A15" :p
 
LPDDR2 stops at 1066 MHz effective frequency, so you need LPDDR3 for 1333+. There are other solutions to high bandwidth in SoCs that don't just chase DRAM frequency, which might be worth looking at.
Extra channels? More cache?


An aside, but it's quite grating to read "T604 in a Cortex A15" :p
Fixed. :p
 
As for native FP64, do you think ARM would really be allocating more than the bare minimum of necessary resources (ie, FP64 throughput at 1/4th FP32)? Although that'd still necessitate somewhat wider FP multipliers..
Doesn't FP64 in this context just mean support for 64bit HDR frame buffer formats and textures with four 16 bit components, so 4*FP16 RGBA? So what is needed to support that is to beef up the TMUs and ROPs, not the multipliers in the ALUs. It's not about double precision at all.
 
No, T6xx supports IEEE754 double precision computation.
 
I wouldn't be so sure. LPDDR2 right now is around 800MHz for the dual-channel 32-bit implementations IIRC.

A T604 in a Cortex A15 SoC might go dual-channel 1333MHz LPDDR2 and beyond.
Lower-power Wichita might never pass 1600MHz DDR3, so they may end up pretty close.

I think you guys might be confusing DDR clock speed and transfer rate. Current high-end ARM SoCs only support LPDDR2 up to 400MHz, and if you look at Micron's product pages for instance you'll see that's also the fastest they sell. OMAP4470 announced support for 466MHz, that should give you an idea of the roadmap.. JEDEC currently specifies up to 533MHz.
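
To make the distinction concrete, here is a small sketch converting the clocks above into effective transfer rates. The only rule used is that double data rate memory transfers twice per I/O clock; the function name is made up for illustration.

```python
def transfer_rate_mt_s(io_clock_mhz):
    """DDR transfers data on both clock edges: two transfers per I/O clock."""
    return 2 * io_clock_mhz


# I/O clocks mentioned above, in MHz.
lpddr2_clocks = {
    "current high-end SoCs": 400,
    "OMAP4470 (announced)": 466,
    "JEDEC maximum": 533,
}

for label, clock in lpddr2_clocks.items():
    print(f"{label}: {clock} MHz clock -> {transfer_rate_mt_s(clock)} MT/s")
```

So the "800MHz" LPDDR2 figure being thrown around is a 400MHz clock, and the 533MHz JEDEC ceiling is the 1066 effective rate.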

LPDDR2 is only 1.2V vs 1.8V for normal DDR2, so it's going to have more limited clocks. I wonder if maybe Brazos (or its successors) supports 1.35V DDR3L, which should help reduce power consumption a little, although relative to the consumption of the SoC it's probably pretty minor.

Right now Tegra 2 is hamstrung not only by having only one channel but being limited to 300MHz for LPDDR2 (or 333MHz for DDR2, which of course consumes much more power).. but it seems to be doing okay without much bandwidth.

Exynos supports DDR3 (and incidentally, so does i.MX53 of all things), and it looks like Tegra 3 and OMAP5 will as well. This might be the better choice for tablets. No idea how high the bus and memory speeds will actually go on these SoCs. I imagine that Wichita will still have the advantage for a good while. But its GPU will inherently need more bandwidth.
 
I think you guys might be confusing DDR clock speed and transfer rate. Current high-end ARM SoCs only support LPDDR2 up to 400MHz, and if you look at Micron's product pages for instance you'll see that's also the fastest they sell. OMAP4470 announced support for 466MHz, that should give you an idea of the roadmap..
That's why I said effective frequency ;)
 
That's why I said effective frequency ;)

Ah, yeah. And thanks to bad marketing I thought that with DDR3 it was the real frequency ~_~

Nonetheless, DDR3 can achieve clocks double DDR2 due to having double the prefetch width. It does sound like ARM SoCs are approaching it after all, but won't be hitting 667MHz any time soon.
 