ARM Midgard Architecture

So at first glance this appears to be ARM's solution up against Series 5 XT in the MP space. Doesn't sound like anyone has licensed it yet either. Let's see.
 
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide. It's interesting that ARM is now positioning architectural advantages by pairing the two.

Also looks like they're offering everything up to double precision operations, hopefully in the fragment shaders, which in Mali-400 are limited to FP16. Don't really see how GPGPU-capable they can be if they only offer it in the vertex shaders, after all.
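
For reference, double precision is an optional OpenCL feature (the cl_khr_fp64 extension), so once hardware shows up it's easy enough to sanity check what actually gets exposed. Something roughly like this untested sketch, using only the standard OpenCL host API, nothing Mali-specific in it:

```c
/* Untested sketch: probe the first GPU device for FP64 support using only
 * the standard OpenCL host API (link with -lOpenCL). */
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    char exts[4096] = {0};
    cl_uint dbl_width = 0;

    if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL GPU device found\n");
        return 1;
    }

    /* cl_khr_fp64 is optional; CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE is 0
     * when the device has no double precision support at all. */
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof exts, exts, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                    sizeof dbl_width, &dbl_width, NULL);

    printf("cl_khr_fp64 advertised: %s\n",
           strstr(exts, "cl_khr_fp64") ? "yes" : "no");
    printf("preferred double vector width: %u\n", dbl_width);
    return 0;
}
```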
 
Midgard architecture (“Midgard” is the home of the gods in Norse mythology).
They are lousy at Norse mythology; I expected more from a design team based in Norway.
miðgarður is the Human realm in Norse mythology
while ásgarður is the home of the gods in Norse mythology
Nice to see that they did cache coherence with the Cortex-A15.
 
Is this what's in the Samsung Orion SoC, with that "5 times the 3D graphics performance over the previous processor generation"?
 
Is there anything special about their Mali GPU compared to the PowerVR GPUs? Which is better?

And is there any reason why cache coherence won't work on PowerVR gen 6? (After all, SoCs like the A4 are custom designs anyway.)
 
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide. It's interesting that ARM is now positioning architectural advantages by pairing the two.

Yeah that's definitely interesting.

Also looks like they're offering everything up to double precision operations, hopefully in the fragment shaders, which in Mali-400 are limited to FP16. Don't really see how GPGPU-capable they can be if they only offer it in the vertex shaders, after all.

Well, I only skimmed the documents quickly, but I couldn't find anything that suggests that fragment and geometry are still in separate units. In fact the diagram mentions up to 4 shader cores (whatever that stands for) and no sign of a geometry unit like in past diagrams.

Maybe arjan can shed some light on that part and explain whether those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.
 
Well, I only skimmed the documents quickly, but I couldn't find anything that suggests that fragment and geometry are still in separate units. In fact the diagram mentions up to 4 shader cores (whatever that stands for) and no sign of a geometry unit like in past diagrams.

Maybe arjan can shed some light on that part and explain whether those are USC units now. I doubt they'd advertise GPGPU without being >FP16 in the fragment shaders.

As you say, the absence of a separate vertex processor in the diagram suggests a unified design; they probably don't want to shout too loudly about this as it flies in the face of their previous marketing.

To be honest imo this is all fluff at the moment, would be nice to see some numbers put beside the thing ;)
 
Seems like a pretty different design compared to Mali-400. One of the diagrams suggests a two-ALU-per-TMU/ROP layout, making it similar to SGX540 per core. I guess vertex processing will now be load balanced between the multiple cores; maybe that's a problem for the drivers to deal with.

To be honest imo this is all fluff at the moment, would be nice to see some numbers put beside the thing ;)

At least it's more than IMG has said about Series 6 ;p
 
Looks like this one can leverage cache coherence with Cortex-A15, a feature I don't expect IMG to provide.
They might take a little longer, but the coherent link almost certainly just comes down to a ring bus with snooping ... connecting it to L2 texture cache isn't rocket science (and L1 will probably just be flushed when necessary). GPU writes are probably write through, if they get cached at all, so those stay coherent with the CPUs automatically.

Just because there is cache coherency doesn't mean there is a fully coherent multi-level fully read-write cache hierarchy inside the Mali ... there almost certainly isn't (if NVIDIA didn't do it, I don't see Mali doing it).
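
To spell out the write-through point with a completely toy model (nothing to do with the real hardware, just the idea that a cache which never holds dirty data leaves nothing for anyone else to snoop):

```c
/* Toy model only: a GPU cache that writes through to memory never holds the
 * sole up-to-date copy, so other agents can just read memory (or the shared
 * L2) and stay coherent without snooping inside the GPU. */
#include <stdio.h>
#include <string.h>

#define MEM_WORDS  16
#define LINE_WORDS 4

static unsigned memory[MEM_WORDS];      /* stands in for DRAM / shared L2 */

struct gpu_line {                       /* a single write-through cache line */
    int valid;
    unsigned tag;
    unsigned data[LINE_WORDS];
};
static struct gpu_line gpu_cache;

static void gpu_write(unsigned addr, unsigned value)
{
    unsigned tag = addr / LINE_WORDS;
    if (gpu_cache.valid && gpu_cache.tag == tag)
        gpu_cache.data[addr % LINE_WORDS] = value; /* update the cached copy */
    memory[addr] = value;                          /* ...and always memory too */
}

static unsigned cpu_read(unsigned addr)
{
    /* The CPU side never has to look inside the GPU cache: memory is
     * guaranteed to hold the latest GPU write. */
    return memory[addr];
}

int main(void)
{
    gpu_cache.valid = 1;
    gpu_cache.tag   = 0;
    memset(gpu_cache.data, 0, sizeof gpu_cache.data);

    gpu_write(2, 0xCAFE);
    printf("CPU sees %#x at word 2\n", cpu_read(2));
    return 0;
}
```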
 
They might take a little longer, but the coherent link almost certainly just comes down to a ring bus with snooping ... connecting it to L2 texture cache isn't rocket science (and L1 will probably just be flushed when necessary). GPU writes are probably write through, if they get cached at all, so those stay coherent with the CPUs automatically.

It's still a matter of interfacing it specifically to an ARM coherency link, although I really have no idea how interested IMG is or isn't in doing something like this (you could say having standard parts that can interface with AMBA isn't any different, or maybe this is just a glue implementation detail that can be handled by something else or by the SoC implementers).

Coherency with writes out from the GPU probably barely matters. Isn't the real point of this to get parameter data directly to the GPU instead of crossing into main memory first?

Just because there is cache coherency doesn't mean there is a fully coherent multi-level fully read-write cache hierarchy inside the Mali ... there almost certainly isn't (if NVIDIA didn't do it, I don't see Mali doing it).

Well I don't know what nVidia's coherency is in this context, but at the very least Tegra 2 is "coherent" by default on account of sharing the L2 cache.
 
It's still a matter of interfacing it specifically to an ARM coherency link, although I really have no idea how interested IMG is or isn't in doing something like this (you could say having standard parts that can interface with AMBA isn't any different, or maybe this is just a glue implementation detail that can be handled by something else or by the SoC implementers).
Even the current SGX supports cache coherency; it's a function of the SoC (and CPU support) as to whether this gets hooked up to the CPU or not.
Coherency with writes out from the GPU probably barely matters. Isn't the real point of this to get parameter data directly to the GPU instead of crossing into main memory first?

The last thing you want to be doing is streaming GPU input parameters through the CPU cache; the volume of this data is likely to just flush all the stuff you do want in your cache out of it. You also don't want the GPU constantly snooping the CPU cache, as it will kill the performance of both. This type of data is best streamed directly to memory using write combiners to maximise throughput (as it normally is in the desktop space); doing otherwise is likely to hurt overall perf.
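
Roughly the sort of thing I mean, as a sketch only: a map_buffer_range style interface (ES 3.0 / GL_EXT_map_buffer_range here; fill_vertices is a made-up helper, and a current GL context is assumed) typically hands you write-combined driver memory, so you write it sequentially, never read it back, and none of it goes anywhere near the CPU cache.

```c
/* Sketch only: stream per-draw parameter data straight into driver memory.
 * Assumes a current GL context and an ES 3.0 / map_buffer_range style API;
 * fill_vertices() is a hypothetical application helper. */
#include <stddef.h>
#include <GLES3/gl3.h>

extern void fill_vertices(float *dst, size_t count);   /* hypothetical */

void stream_parameters(GLuint vbo, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* Orphan the old storage so we never stall waiting for the GPU. */
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);

    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    if (dst) {
        /* Sequential writes only: friendly to the write combiners, and the
         * data never displaces anything useful from the CPU caches. */
        fill_vertices((float *)dst, (size_t)bytes / sizeof(float));
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```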

John.
 
If you are trying to do a low latency double buffered OpenCL operation on the GPU you might want to have the input or output bypass memory altogether.

Although it seems a bit far fetched for the moment.
 
If you are trying to do a low latency double buffered OpenCL operation on the GPU you might want to have the input or output bypass memory altogether.

Although it seems a bit far fetched for the moment.

The data ultimately has to go back to memory (it's still a WB cache we're talking about here), but it's true that you might be able to avoid some round trips if you're doing a lot of passing data backwards and forwards in OpenCL.

Personally I'd be looking very hard at my algorithm if this is what's happening!

Interestingly some might argue that these scenarios would be better handled by extending the local memory hierarchy another level...

Anyway as I say, SGX already largely supports all this ;)
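
For what it's worth, a rough host-side sketch of what I mean by passing data backwards and forwards without a read-back per pass (the names and arguments are made up; a context, queue and a two-argument kernel are assumed to exist already):

```c
/* Sketch only: one upload, N kernel passes ping-ponging two device buffers,
 * one download at the end. The per-pass read-back round trips are what the
 * cache/coherency tricks would otherwise be saving you. */
#include <CL/cl.h>

void run_passes(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                const float *input, float *result,
                size_t nitems, int passes)
{
    size_t nbytes = nitems * sizeof(float);
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);

    /* single upload */
    clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nbytes, input, 0, NULL, NULL);

    for (int i = 0; i < passes; ++i) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);   /* input  */
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);   /* output */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &nitems, NULL,
                               0, NULL, NULL);
        cl_mem tmp = a; a = b; b = tmp;                  /* swap, no read-back */
    }

    /* single download: 'a' holds the last pass's output after the swap */
    clEnqueueReadBuffer(queue, a, CL_TRUE, 0, nbytes, result, 0, NULL, NULL);

    clReleaseMemObject(a);
    clReleaseMemObject(b);
}
```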

John.
 