ARM Midgard Architecture

arjan de lumens · Nov 10, 2010

ARM has now officially announced the Mali-T604, the successor to Mali-200 and Mali-400:

http://arm.com/about/newsroom/arm-h...ed-graphics-with-next-generation-mali-gpu.php

Edit:

More information at http://blogs.arm.com/multimedia/arm...itecture-for-highest-performance-flexibility/.

roninja · Nov 10, 2010

So at first glance this appears to be ARM's answer to Series 5 XT in the MP space. Doesn't sound like they've licensed it yet either. Let's see.

Exophase · Nov 10, 2010

Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide. It's interesting that ARM is now positioning architectural advantages by pairing the two.

Also looks like they're offering clear up to double precision operations, hopefully in the fragment shaders, which in Mali-400 are limited to FP16. Don't really see how GPGPU they can be if they only offer it in the vertex shaders, after all.

argor · Nov 10, 2010

Midgard architecture (“Midgard” is the home of the gods in Norse mythology).

They are lousy at Norse mythology; I expected more from a design team based in Norway.
Miðgarður is the human realm in Norse mythology,
while Ásgarður is the home of the gods.
Nice to see what they did with cache coherence and Cortex-A15.

wishiknew · Nov 11, 2010

Is this what's in the Samsung Orion SoC, with that "5 times the 3D graphics performance over the previous processor generation"?

iwod · Nov 11, 2010

Is there anything special about their Mali GPU compared to the PowerVR GPUs? Which is better?

And is there any reason why cache coherence won't work on PowerVR gen 6? (After all, SoCs like the A4 are custom designs anyway.)

Ailuros · Nov 11, 2010

Exophase said:
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide. It's interesting that ARM is now positioning architectural advantages by pairing the two.

Yeah that's definitely interesting.

Also looks like they're offering clear up to double precision operations, hopefully in the fragment shaders, which in Mali-400 are limited to FP16. Don't really see how GPGPU they can be if they only offer it in the vertex shaders, after all.

Well, I only skimmed the documents, but I couldn't find anything that suggests fragment and geometry processing are still in separate units. In fact the diagram mentions up to 4 shader cores (whatever that stands for) and shows no sign of a geometry unit like in past diagrams.

Maybe arjan can shed some light into that part and explain if those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.

JohnH · Nov 11, 2010

Exophase said:
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide.

Why would you "expect" that?

rpg.314 · Nov 11, 2010

Ailuros said:
Maybe arjan can shed some light into that part and explain if those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.

OCL requires full fp32 precision.

JohnH · Nov 11, 2010

Ailuros said:
Well, I only skimmed the documents, but I couldn't find anything that suggests fragment and geometry processing are still in separate units. In fact the diagram mentions up to 4 shader cores (whatever that stands for) and shows no sign of a geometry unit like in past diagrams.

Maybe arjan can shed some light into that part and explain if those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.

As you say, the absence of a separate vertex processor in the diagram suggests a unified design. They probably don't want to shout too loudly about this, as it flies in the face of their previous marketing.

To be honest imo this is all fluff at the moment, would be nice to see some numbers put beside the thing

arjan de lumens · Nov 11, 2010

Ailuros said:
Maybe arjan can shed some light into that part and explain if those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.

Yes, it's unified. As for floating-point data types support, it supports fp16, fp32 and fp64.
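
To put the fp16 / fp32 / fp64 distinction in concrete terms, here is a minimal sketch using Python's standard `struct` module, which can pack IEEE 754 half, single and double precision values. It says nothing about how Mali-T604 implements these types; `round_trip` and `precision_table` are hypothetical helper names, and the test values are chosen only because they sit just past the integer range each format can represent exactly.

```python
import struct

# struct format codes for IEEE 754 binary16, binary32 and binary64.
FORMATS = {"fp16": "<e", "fp32": "<f", "fp64": "<d"}


def round_trip(value, fmt):
    """Store a Python float in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def precision_table(value):
    """Return what each format actually stores for the same input value."""
    return {name: round_trip(value, fmt) for name, fmt in FORMATS.items()}


# fp16 has an 11-bit significand, so integers above 2048 are no longer
# all representable; 2049 is rounded to a neighbouring even value.
table = precision_table(2049.0)
for name, stored in table.items():
    print(name, stored)

# fp32 has a 24-bit significand and hits the same wall at 2**24.
print("fp32", round_trip(16777217.0, "<f"))
print("fp64", round_trip(16777217.0, "<d"))
```

This is the gap behind the "GPGPU needs more than FP16" remarks: a format that cannot hold 2049 exactly is fine for colour values but not for general-purpose arithmetic.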

Exophase · Nov 11, 2010

Seems like a pretty different design compared to Mali-400. One of the diagrams suggests a two ALU to TMU/ROP layout, making it similar to SGX540 per-core. I guess now vertex processing will be load balanced between the multiple cores, maybe that's a problem for drivers to deal with.

JohnH said:
To be honest imo this is all fluff at the moment, would be nice to see some numbers put beside the thing

At least it's more than IMG has said about Series 6 ;p

JohnH · Nov 11, 2010

Exophase said:
At least it's more than IMG has said about Series 6 ;p

MfA · Nov 11, 2010

Exophase said:
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide.

They might take a little longer, but the coherent link almost certainly just comes down to a ring bus with snooping ... connecting it to L2 texture cache isn't rocket science (and L1 will probably just be flushed when necessary). GPU writes are probably write through, if they get cached at all, so those stay coherent with the CPUs automatically.

Just because there is cache coherency doesn't mean there is a fully coherent multi-level fully read-write cache hierarchy inside the Mali ... there almost certainly isn't (if NVIDIA didn't do it, I don't see Mali doing it).
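
The write-through point can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a single-level cache in front of a shared memory, a CPU that reads memory directly, and no snooping at all. `ToyCache` and `cpu_sees_after_gpu_write` are hypothetical names, and nothing here reflects the actual Mali or Cortex-A15 design.

```python
class ToyCache:
    """One-level cache in front of a shared memory dict (hypothetical model)."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses not yet written back to memory

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value  # memory is updated immediately
        else:
            self.dirty.add(addr)       # memory stays stale until flush()

    def flush(self):
        """Write all dirty lines back to memory."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


def cpu_sees_after_gpu_write(write_through):
    """Return what memory holds before and after the GPU cache is flushed."""
    memory = {0x100: "old"}
    gpu_cache = ToyCache(memory, write_through)
    gpu_cache.write(0x100, "new")
    before_flush = memory[0x100]
    gpu_cache.flush()
    after_flush = memory[0x100]
    return before_flush, after_flush


print("write-through:", cpu_sees_after_gpu_write(True))
print("write-back:   ", cpu_sees_after_gpu_write(False))
```

In this model the write-through cache never leaves memory stale, while the write-back cache does until an explicit flush, which is the case a real coherency protocol would have to handle.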

Exophase · Nov 11, 2010

MfA said:
They might take a little longer, but the coherent link almost certainly just comes down to a ring bus with snooping ... connecting it to L2 texture cache isn't rocket science (and L1 will probably just be flushed when necessary). GPU writes are probably write through, if they get cached at all, so those stay coherent with the CPUs automatically.

It's still a matter of interfacing it specifically to an ARM coherency link, although I really have no idea how interested IMG is or isn't in doing something like this (you could say having standard parts that can interface with AMBA isn't different, or maybe this is just a glue implementation detail that can be handled by something else or the SoC implementers)

Coherency with writes out from the GPU probably barely matters. Isn't the real point of this to get parameter data directly to the GPU instead of crossing into main memory first?

MfA said:
Just because there is cache coherency doesn't mean there is a fully coherent multi-level fully read-write cache hierarchy inside the Mali ... there almost certainly isn't (if NVIDIA didn't do it, I don't see Mali doing it).

Well I don't know what nVidia's coherency is in this context, but at the very least Tegra 2 is "coherent" by default on account of sharing the L2 cache.

Ailuros · Nov 12, 2010

arjan de lumens said:
Yes, it's unified. As for floating-point data types support, it supports fp16, fp32 and fp64.

A-HA!!!

rpg.314 said:
OCL requires full fp32 precision.

See above

JohnH · Nov 12, 2010

Exophase said:
It's still a matter of interfacing it specifically to an ARM coherency link, although I really have no idea how interested IMG is or isn't in doing something like this (you could say having standard parts that can interface with AMBA isn't different, or maybe this is just a glue implementation detail that can be handled by something else or the SoC implementers)

Even the current SGX supports cache coherency; whether it gets hooked up to the CPU is a function of the SoC (and CPU support).

Coherency with writes out from the GPU probably barely matters. Isn't the real point of this to get parameter data directly to the GPU instead of crossing into main memory first?

The last thing you want to be doing is streaming GPU input parameters through the CPU cache; the volume of this data is likely to flush out all the stuff you do want in your cache. You also don't want the GPU constantly snooping the CPU cache, as it will kill the performance of both. This type of data is best streamed directly to memory using write combiners to maximise throughput (as it normally is in desktop space); doing otherwise is likely to hurt overall perf.

John.
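
The cache-pollution argument above can be sketched with a toy LRU cache. The assumptions are loud ones: a fully associative cache with strict LRU replacement, one "line" per address, and made-up sizes. `LRUCache` and `hot_hits_after_stream` are hypothetical names, and real CPU caches differ in associativity and replacement policy.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with strict LRU replacement (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch a line; return True on a hit, False on a miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return False


def hot_hits_after_stream(capacity, hot_lines, stream_lines):
    """Warm a hot working set, stream other data through, then count
    how many hot lines still hit."""
    cache = LRUCache(capacity)
    hot = range(hot_lines)
    for addr in hot:
        cache.access(addr)  # warm-up pass
    for addr in range(10_000, 10_000 + stream_lines):
        cache.access(addr)  # streamed data, touched once each
    return sum(cache.access(addr) for addr in hot)


print("small stream:", hot_hits_after_stream(64, 32, 16))
print("large stream:", hot_hits_after_stream(64, 32, 256))
```

With a stream that fits alongside the working set, every hot line still hits; once the stream exceeds the cache size, the hot lines are all gone, which is the effect write-combined streaming to memory avoids.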

MfA · Nov 12, 2010

If you are trying to do a low latency double buffered OpenCL operation on the GPU you might want to have the input or output bypass memory altogether.

Although it seems a bit far fetched for the moment.

JohnH · Nov 12, 2010

MfA said:
If you are trying to do a low latency double buffered OpenCL operation on the GPU you might want to have the input or output bypass memory altogether.

Although it seems a bit far fetched for the moment.

The data ultimately has to go back to memory (it's still a WB cache we're talking about here), but it's true that you might be able to avoid some round trips if you're doing a lot of passing data backwards and forwards in OpenCL.

Personally I'd be looking very hard at my algorithm if this is what's happening!

Interestingly some might argue that these scenarios would be better handled by extending the local memory hierarchy another level...

Anyway as I say, SGX already largely supports all this

John.

MfA · Nov 12, 2010

Doesn't ARM have page table attributes to determine caching behaviour BTW?
