ARM Midgard Architecture

Even the current SGX supports cache coherency; it's a function of the SoC (and CPU support) whether this gets hooked up to the CPU or not.

Oh, okay. I'm not really familiar with what's available.

The last thing you want to be doing is streaming GPU input parameters through the CPU cache; the volume of this data is likely to just flush all the stuff you do want in your cache out of it. You also don't want the GPU constantly snooping the CPU cache, as it will kill the performance of both. This type of data is best streamed directly to memory using write combiners to maximise throughput (as it normally is in the desktop space); doing otherwise is likely to hurt overall perf.
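As a rough illustration of that pattern, here is a minimal sketch of streaming through write-combined memory instead of the CPU cache; map_wc_buffer() and gpu_kick() are hypothetical stand-ins for whatever the real driver exposes, not any actual API:

```c
/* Sketch of the "stream through write-combined memory" pattern described
 * above. map_wc_buffer() and gpu_kick() are hypothetical placeholders for
 * the driver interface; the point is that the CPU writes sequentially into
 * uncached/write-combined memory and issues a barrier before telling the
 * GPU the data is ready, so no CPU cache lines are touched and no snooping
 * is required. */
#include <stddef.h>
#include <string.h>

extern void *map_wc_buffer(size_t bytes);              /* hypothetical: maps WC memory */
extern void  gpu_kick(const void *buf, size_t bytes);  /* hypothetical: submits to GPU */

static void submit_vertices(const float *verts, size_t n_floats)
{
    float *wc = map_wc_buffer(n_floats * sizeof(float));

    /* Sequential writes coalesce into full bursts in the write buffer;
     * nothing is allocated into the CPU caches. */
    memcpy(wc, verts, n_floats * sizeof(float));

    /* Drain the write-combining buffers before the GPU reads the data.
     * On ARM this would be a DSB/DMB; __sync_synchronize() stands in here. */
    __sync_synchronize();

    gpu_kick(wc, n_floats * sizeof(float));
}
```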

John.

I agree that you don't want it thrashing cache, and the improved latency doesn't get you anything. But wouldn't it be possible, at least in some scenarios, to get much higher throughput going straight to the chip instead of to memory and back? I know on mobiles memory is getting wider and memory buses are getting faster, maybe faster than the GPUs can even really keep up with, but it'd still be better to not have to spend the bandwidth on this.

In the desktop space you'd really have no better choice since you're going over PCI-E anyway.
 
I agree that you don't want it thrashing cache, and the improved latency doesn't get you anything. But wouldn't it be possible, at least in some scenarios, to get much higher throughput going straight to the chip instead of to memory and back? I know on mobiles memory is getting wider and memory buses are getting faster, maybe faster than the GPUs can even really keep up with, but it'd still be better to not have to spend the bandwidth on this.
Sure, you can construct scenarios where it's a net win, but in most real graphics use cases it isn't, either in overall system performance or power consumption (e.g. to snoop an L1 cache you may be bringing quite a lot of silicon out of a low power state).

It makes even less sense when you're already trying to feed multiple CPU cores from that cache infrastructure; stuffing high-BW GPU traffic through it is just going to kill your CPU performance.

IMO this stuff only really starts to work consistently when you have a proper shared memory hierarchy, for example a large shared cache or shared memory buffers, since you then remove the need for snooping: accesses are inherently coherent.

In the desktop space you'd really have no better choice since you're going over PCI-E anyway.

Even in desktop systems with shared local memory (integrated graphics) you will not typically see use of coherent accesses for the bulk of graphics; they still use uncached/write combined memory.

John.
 
Sure, you can construct scenarios where it's a net win, but in most real graphics use cases it isn't, either in overall system performance or power consumption (e.g. to snoop an L1 cache you may be bringing quite a lot of silicon out of a low power state).

It most likely won't be going to L1. For ARM systems, the L2 cache is considered a "system cache" accessed by the CPU as an external device. Likely these coherency protocols apply only to the L2 where it can act as a coherent on-chip scratch buffer. The L1 and L0 are usually tightly coupled with the CPU itself. There is the potential for a GPU to accidentally snoop to a matching L1 set, but if the code is written right, the CPU and GPU working spaces should always be in different pools.

It makes even less sense when you're already trying to feed multiple CPU cores from that cache infrastructure; stuffing high-BW GPU traffic through it is just going to kill your CPU performance.

It actually makes perfect sense if it only applies to the L2 (which, when coded right, it should), since all cores (not just the CPU, but other co-processors) share the L2 cache.

IMO this stuff only really starts to work consistently when you have a proper shared memory hierarchy, for example a large shared cache or shared memory buffers, since you then remove the need for snooping: accesses are inherently coherent.

Like the L2 cache on these SoC's :)

Even in desktop systems with shared local memory (integrated graphics) you will not typically see use of coherent accesses for the bulk of graphics; they still use uncached/write combined memory.

That's mostly because the buses in desktops are orders of magnitude slower than any of the endpoint devices' ability to read/write them, and also because memory bandwidth almost always matches bus bandwidth.

This isn't true of most ARM SoC's, as the external memory is likely to be very slow compared to the on-chip bus, and the on-chip bus generally runs at a frequency that's more than half the CPU clock speed.
 
By the way...

The unique tri-pipe architecture allows the Mali-T604 GPU to be used for general purpose computing...

Can someone explain what they could mean by "tri-pipe" architecture?
 
It most likely won't be going to L1. For ARM systems, the L2 cache is considered a "system cache" accessed by the CPU as an external device. Likely these coherency protocols apply only to the L2 where it can act as a coherent on-chip scratch buffer. The L1 and L0 are usually tightly coupled with the CPU itself. There is the potential for a GPU to accidentally snoop to a matching L1 set, but if the code is written right, the CPU and GPU working spaces should always be in different pools.
That's certainly how I would expect things to work, although I think there's the tendency from the CPU vendor side to try and come up with reasons to couple them more tightly (and bias importance towards the CPU unsurprisingly).

It actually makes perfect sense if it only applied to the L2 (which, when coded right, it should). Since all cores (not just the CPU, but other co-processors) share the L2 cache.
I disagree; the CPU vendor is designing a cache to meet the needs of the CPU, which are subtly different to those of the GPU. You really need one bulk level of caching between the CPU and any shared infrastructure to avoid the CPU being mangled by latency.
Like the L2 cache on these SoC's :)
Not the same thing: a true shared cache does not require snooping protocols; traffic just flows through the cache and is inherently coherent.
That's mostly because the buses in desktops are orders of magnitude slower than any of the endpoint devices' ability to read/write them, and also because memory bandwidth almost always matches bus bandwidth.

This isn't true of most ARM SoC's, as the external memory is likely to be very slow compared to the on-chip bus, and the on-chip bus generally runs at a frequency that's more than half the CPU clock speed.

Yes, external memory may be much slower than internal buses; however, that doesn't change the fact that pushing GPU traffic through the CPU cache would just thrash the cache to the extreme detriment of the CPU. Further, when you actually look at the typical bandwidth provided by the coherency buses, it is often poor in comparison to even the external memory bandwidth. Obviously the latter could be viewed as an implementation issue, but when you look at the problems associated with feeding a non-latency-tolerant CPU (none of them are), I doubt this is going to change significantly.

Note that I think once we get to systems with tens of MBs available for caches the dynamics of this are likely to change; unfortunately we're not quite at that point.

All IMHO, of course ;)
John.
 
That's certainly how I would expect things to work, although I think there's the tendency from the CPU vendor side to try and come up with reasons to couple them more tightly (and bias importance towards the CPU unsurprisingly).

ARM v7 has recommendations as far as cache implementation and many of its cache operations are built to distinguish between L2 as an external cache and any level above as tightly coupled. You can get around this, of course, but it's easier just to add an L0 as the nearest level cache and use L1 as you would L2.

In either case, most CPU designs for such an SoC are intended for a whole system, and the CPU design is unlikely to focus solely on CPU performance.

I disagree; the CPU vendor is designing a cache to meet the needs of the CPU, which are subtly different to those of the GPU. You really need one bulk level of caching between the CPU and any shared infrastructure to avoid the CPU being mangled by latency.

Yes and no. Again, you're thinking in terms of a desktop system, where the bus is incredibly high latency and low bandwidth compared to the last level of CPU cache. Typical ARM SoC's have high-speed and low-latency on-chip buses that can usually saturate the L2 array and provide acceptable (2-3 CPU cycles) latency for a read access.

And since we're talking ARM here, the application is always (for now) going to be an SoC. ARM is a system design company as much as a CPU design company. Their IP will take into account system needs as well as the needs of the CPU. Other ARM architectural licensees do the same.

Not the same thing: a true shared cache does not require snooping protocols; traffic just flows through the cache and is inherently coherent.

I'm not sure how you imagine a shared cache works, but snooping is definitely required at all times. Before write-back occurs to the shared cache, a snoop barrier must be sent to all other possible write sources lest there be a hazard. Additionally, snoop kills and invalidates must be supported in case you happen to write to a cached address that is in another core's L1 or L0.
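To make the snoop-kill/invalidate requirement concrete, here is a toy sketch of what every other agent's private cache has to do when someone else wants to write a line; it is a cartoon of a MESI-style protocol, not any particular ARM implementation:

```c
/* Toy illustration of the snoop traffic described above: before a core
 * (or the GPU) writes a line that may live in someone else's private
 * L0/L1, every other cache is probed and must drop (and, if dirty,
 * first write back) its copy. Purely illustrative. */
#include <stdbool.h>
#include <stdint.h>

enum line_state { INVALID, SHARED, MODIFIED };

struct cache_line {
    uint32_t        tag;
    enum line_state state;
};

/* Run against each other agent's private copy of the target line. */
static bool snoop_for_write(struct cache_line *line, uint32_t tag)
{
    bool writeback_needed = false;

    if (line->state != INVALID && line->tag == tag) {
        if (line->state == MODIFIED)
            writeback_needed = true;  /* dirty data must reach the shared level first */
        line->state = INVALID;        /* "snoop kill": the local copy is now stale */
    }
    return writeback_needed;
}
```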

Yes, external memory may be much slower than internal buses; however, that doesn't change the fact that pushing GPU traffic through the CPU cache would just thrash the cache to the extreme detriment of the CPU. Further, when you actually look at the typical bandwidth provided by the coherency buses, it is often poor in comparison to even the external memory bandwidth. Obviously the latter could be viewed as an implementation issue, but when you look at the problems associated with feeding a non-latency-tolerant CPU (none of them are), I doubt this is going to change significantly.

You'd be surprised. We're talking ~233MHz LP-DDR1/2 here. Perhaps 64-bit, but in most cases, 32-bit bus. Not what you'd see in a desktop. We're also talking DRAM that requires wake-from-low-power-state, which adds even more latency.

In contrast, the on-chip bus is generally ~800MHz 64-bit. Some high-end chips (I'd venture Marvell's) may expand this to dual buses with a 128-bit data path.

In ~1GHz CPU's, that's really not that much latency to go over the bus.
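Back-of-the-envelope arithmetic for the figures quoted above; this treats the LPDDR interface as double data rate and ignores refresh, turnaround and protocol overhead, so these are raw peaks rather than achievable throughput:

```c
/* Rough peak-bandwidth comparison for a ~233MHz 32-bit LPDDR interface
 * versus an ~800MHz 64-bit on-chip bus, per the figures above. Both
 * numbers ignore protocol overhead (an assumption, not a measurement). */
#include <stdio.h>

int main(void)
{
    double lpddr_gbs = 233e6 * 2 /* DDR */ * 4 /* 32-bit wide */ / 1e9;  /* ~1.9 GB/s */
    double axi_gbs   = 800e6 * 8 /* 64-bit wide */ / 1e9;                /* ~6.4 GB/s */

    printf("external LPDDR, 32-bit @ 233MHz (DDR): ~%.1f GB/s peak\n", lpddr_gbs);
    printf("on-chip bus,    64-bit @ 800MHz:       ~%.1f GB/s peak\n", axi_gbs);
    return 0;
}
```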

Note that I think once we get to systems with tens of MBs available for caches the dynamics of this are likely to change; unfortunately we're not quite at that point.

GPU's in these classes of SoC's aren't going to eat up a ton of memory, and they aren't going to eat it up in a haphazard, nondeterministic way like CPU code would. Generally speaking, while sharing the L2 does affect CPU performance by some percentage, system performance is more important. This isn't Intel. ARM vendors don't sell CPU's; they sell SoC's.
 
I disagree; the CPU vendor is designing a cache to meet the needs of the CPU, which are subtly different to those of the GPU. You really need one bulk level of caching between the CPU and any shared infrastructure to avoid the CPU being mangled by latency.
Sharing cache works pretty well in Intel's Sandy Bridge... :) CPU performance seems virtually unaffected compared to the previous generation, if not actually better.

You may protest it's not the same thing, as SB shares its L3 with the GPU, but if you consider SB is a generally much higher clocked/wider CPU than any ARM device ever designed maybe it's not so different after all. SB's higher data consumption rates would logically need beefier caching to let the chip perform properly.
 
ARM v7 has recommendations as far as cache implementation and many of its cache operations are built to distinguish between L2 as an external cache and any level above as tightly coupled. You can get around this, of course, but it's easier just to add an L0 as the nearest level cache and use L1 as you would L2.

In either case, most CPU designs for such an SoC are intended for a whole system, and the CPU design is unlikely to focus solely on CPU performance.
This differs from what we've tended to see to date within SOCs; for the vast majority, the L2 is _effectively_ tightly coupled to the CPU.
Yes and no. Again, you're thinking in terms of a desktop system, where the bus is incredibly high latency and low bandwidth compared to the last level of CPU cache. Typical ARM SoC's have high-speed and low-latency on-chip buses that can usually saturate the L2 array and provide acceptable (2-3 CPU cycles) latency for a read access.
Not sure why you'd think I'm talking in terms of desktop systems, given where most of our architectures end up. 2-3 cycles can only imply tight (physical) coupling; otherwise layout practicalities would be adding additional clocks from registering block I/O to make it possible to lay the SOC out. Further, 2-3 clocks is only going to be the case when nothing else is touching the cache; a burst from the GPU will increase that latency significantly, and it will significantly impact the CPU.
And since we're talking ARM here, the application is always (for now) going to be an SoC. ARM is a system design company as much as a CPU design company. Their IP will take into account system needs as well as the needs of the CPU. Other ARM architectural licensees do the same.
Again, I can only state what I've seen in real designs.
I'm not sure how you imagine a shared cache works, but snooping is definitely required at all times. Before write-back occurs to the shared cache, a snoop barrier must be sent to all other possible write sources lest there be a hazard. Additionally, snoop kills and invalidates must be supported in case you happen to write to a cached address that is in another core's L1 or L0.
You're only considering localised caches that allow sharing via snoop or similar capabilities; this is consistent with the L2 being local and hence tightly coupled to the CPU. A true shared cache does not require this; I'll let you work out why for yourself.
You'd be surprised. We're talking ~233MHz LP-DDR1/2 here. Perhaps 64-bit, but in most cases, 32-bit bus. Not what you'd see in a desktop. We're also talking DRAM that requires wake-from-low-power-state, which adds even more latency.

In contrast, the on-chip bus is generally ~800MHz 64-bit. Some high-end chips (I'd venture Marvell's) may expand this to dual buses with a 128-bit data path.

In ~1GHz CPU's, that's really not that much latency to go over the bus.
Yes, I'd agree with the rough ballpark for those figures, although I'd say at 233MHz you're very much looking at the lower end of memory speeds.
GPU's in these classes of SoC's aren't going to eat up a ton of memory, and they aren't going to eat it up in a haphazard, nondeterministic way like CPU code would. Generally speaking, while sharing the L2 does affect CPU performance by some percentage, system performance is more important. This isn't Intel. ARM vendors don't sell CPU's; they sell SoC's.

Sorry, but you need to have a closer look at the content that is actually being run today in handheld devices; it has long since exceeded a few thousand polygons with tiny, low-detail textures, and this is only going to increase with future generations of HW. Basically, mobile GPUs are already eating up enough data to thrash the sort of L2 cache sizes we see on these SOCs many times over.
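A quick sanity check on that claim; the resolution and L2 size below are assumptions chosen for illustration, not figures from the thread:

```c
/* A single full-screen 32-bit surface already exceeds a typical SoC L2
 * of this era, before counting textures, geometry or the depth buffer.
 * The 800x480 resolution and 1MB L2 are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    long framebuffer_bytes = 800L * 480 * 4;    /* ~1.5 MB for one RGBA surface */
    long l2_bytes          = 1L * 1024 * 1024;  /* 1 MB L2, generous for the time */

    printf("one 800x480 RGBA surface: %ld KB\n", framebuffer_bytes / 1024);
    printf("assumed L2 size:          %ld KB\n", l2_bytes / 1024);
    printf("surface / L2 ratio:       %.1fx\n",
           (double)framebuffer_bytes / (double)l2_bytes);
    return 0;
}
```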

John.
 
Sharing cache works pretty well in Intel's Sandy Bridge... :) CPU performance seems virtually unaffected compared to the previous generation, if not actually better.

You may protest it's not the same thing, as SB shares its L3 with the GPU, but if you consider SB is a generally much higher clocked/wider CPU than any ARM device ever designed maybe it's not so different after all. SB's higher data consumption rates would logically need beefier caching to let the chip perform properly.

Yes, it shares an L3; that's the right point in the hierarchy to be considering sharing a cache between the CPU and the GPU ;)
 
This differs from what we've tended to see to date within SOCs; for the vast majority, the L2 is _effectively_ tightly coupled to the CPU.

Which designs have you been working with? The L2 cache for ARM designs has, IIRC, always been an external cache hanging off the system bus.

Not sure why you'd think I'm talking in terms of desktop systems, given where most of our architectures end up. 2-3 cycles can only imply tight (physical) coupling; otherwise layout practicalities would be adding additional clocks from registering block I/O to make it possible to lay the SOC out. Further, 2-3 clocks is only going to be the case when nothing else is touching the cache; a burst from the GPU will increase that latency significantly, and it will significantly impact the CPU.

I see where the confusion is. I mean "an additional 2-3 cycles" vs what the L2 cache latency would be if it were hanging off the CPU directly.

Again, I can only state what I've seen in real designs.

I imagine most people follow ARM's design recommendations for where to put the L2 cache. I can only speak of one design that's out there currently though as I don't know the details of other vendors.

You're only considering localised caches that allow sharing via snoop or similar capabilities; this is consistent with the L2 being local and hence tightly coupled to the CPU. A true shared cache does not require this; I'll let you work out why for yourself.

I think that's just semantics. The L2 is no more "tightly coupled" to the CPU than it is to the DSP, or GPU, or codec pipeline. They all have equal, coherent access to the L2 cache and it all goes over the system bus. Whether you consider that a "true shared cache" is up to you.

Yes, I'd agree with the rough ballpark for those figures, although I'd say at 233MHz you're very much looking at the lower end of memory speeds.

I'm not aware of a mobile ARM SoC with memory speeds that are much higher. 45nm Snapdragon clocks its memory at around ~280MHz. The OMAPs run it even slower, IIRC.

Sorry, but you need to have a closer look at the content that is actually being run today in handheld devices; it has long since exceeded a few thousand polygons with tiny, low-detail textures, and this is only going to increase with future generations of HW. Basically, mobile GPUs are already eating up enough data to thrash the sort of L2 cache sizes we see on these SOCs many times over.

Perhaps. But that's how it's designed today as per ARM. Many vendors may choose to implement their cache hierarchy differently but judging from their various presentations, it doesn't look like they deviated from the ARM standard implementation.

I still contend that thrashing isn't as big of a problem as you may think. But I've yet to see a profile from a use case in which both the CPU and GPU are in contention for cache space so I'll defer making a definitive comment on that.
 
Which designs have you been working with? The L2 cache for ARM designs has, IIRC, always been an external cache hanging off the system bus.

I see where the confusion is. I mean "an additional 2-3 cycles" vs what the L2 cache latency would be if it were hanging off the CPU directly.

Cortex-A8's cache is internal and targeted for 8 cycles, but for some reason it appears to be more like around 20 cycles on OMAP3. A test number has shown up for OMAP4 claiming L2 latency is only 9 cycles (http://pandaboard.org/pbirclogs/index.php?date=2010-11-05), which would suggest only one more cycle than the minimum for Cortex-A8 despite hanging off of the AXI bus. But others are telling me that the number is probably broken somehow :/

I've seen another number of 25 cycles for Tegra 2, which has its L2 cache shared with the GPU:

http://forum.canardpc.com/showpost.php?p=3295102&postcount=491
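For reference, numbers like these are usually produced with a dependent-load (pointer-chasing) microbenchmark along the following lines. This is only a sketch: the sizes and timing source are assumptions, and a serious version would randomise the chain to defeat prefetching and convert nanoseconds to cycles using the measured CPU clock.

```c
/* Rough sketch of an L2 load-to-use latency measurement: chase pointers
 * through a footprint bigger than L1 but smaller than L2, so each load
 * should miss L1 and hit L2. Illustrative assumptions throughout. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE    128                 /* larger than a 32/64-byte cache line */
#define FOOTPRINT (256 * 1024)        /* > typical L1, < typical SoC L2      */
#define ITERS     (10 * 1000 * 1000L)

int main(void)
{
    size_t n = FOOTPRINT / STRIDE;
    char *buf = malloc(FOOTPRINT);
    if (!buf)
        return 1;

    /* Build a circular chain of pointers, one per STRIDE bytes. */
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * STRIDE) = buf + ((i + 1) % n) * STRIDE;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    void *p = buf;
    for (long i = 0; i < ITERS; i++)
        p = *(void **)p;              /* each load depends on the previous one */

    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load-to-use latency: %.1f ns (last ptr %p)\n", ns / ITERS, p);
    free(buf);
    return 0;
}
```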

I imagine most people follow ARM's design recommendations for where to put the L2 cache. I can only speak of one design that's out there currently though as I don't know the details of other vendors.

For Cortex-A8 they don't have a choice to make it external (although they could add an external L3 cache..) For Cortex-A9 they don't have a choice to make it internal, and they'll probably all have the same topology since they'll probably all be using ARM's L2 cache controller IP.

I think that's just semantics. The L2 is no more "tightly coupled" to the CPU than it is to the DSP, or GPU, or codec pipeline. They all have equal, coherent access to the L2 cache and it all goes over the system bus. Whether you consider that a "true shared cache" is up to you.

I'm not aware of L2 being shared on any ARM SoC outside of Tegra 2 (don't know about Tegra 1). On OMAP3 and OMAP4 the GPU and DSP are both connected to a bus called the L3 interconnect; both have their own L2 cache, and I'm pretty sure there isn't any cache coherency between them.

I'm not aware of a mobile ARM SoC with memory speeds that are much higher. 45nm Snapdragon clocks its memory at around ~280MHz. The OMAPs run it even slower, IIRC.

Tegra 2 is known to be running base clocks at 333MHz; OMAP4 claims 400MHz (but they're not running anywhere near that on the PandaBoard).
 
Cortex-A8's cache is internal and targeted for 8 cycles, but for some reason it appears to be more like around 20 cycles on OMAP3. A test number has shown up for OMAP4 claiming L2 latency is only 9 cycles (http://pandaboard.org/pbirclogs/index.php?date=2010-11-05), which would suggest only one more cycle than the minimum for Cortex-A8 despite hanging off of the AXI bus. But others are telling me that the number is probably broken somehow :/

I've seen another number of 25 cycles for Tegra 2, which has its L2 cache shared with the GPU:

http://forum.canardpc.com/showpost.php?p=3295102&postcount=491

I would be surprised if any L2 read managed to complete in 8 cycles. I only have one (well, two now) designs as my reference but the L2 to CPU interface can add quite a bit on top of the array latency itself.

Tegra 2 is known to be running base clocks at 333MHz; OMAP4 claims 400MHz (but they're not running anywhere near that on the PandaBoard).

Which, compared to the internal bus, is not that fast. The higher-end SoC's are running 128-bit AXI buses. Some at ~800MHz or more.

I'm not familiar with the memory interface of OMAP4, but I believe the OMAP3 was a 32-bit data interface. Same is true of the Samsung SoC as well as Snapdragon.
 
ARM definitely designed Cortex-A8 for 8 cycles, but apparently other penalties can come into play... It's not really that astonishing given that Atom is around 16 cycles at typical speeds upwards of 2x what Cortex-A8 was pitching.

What I remember is OMAP4 being capable of dual-channel 32-bit and Tegra 2 not being capable, but I have to double check on this. Yes, main memory bandwidth is going to pale in comparison to L2 bandwidth, over AXI or not. Linley group talked down Cortex-A9 for having L2 behind AXI and Snapdragon not, but I'm not sure how much it's really hurting it.
 
Which designs have you been working with? The L2 cache for ARM designs has, IIRC, always been an external cache hanging off the system bus.
Obviously I can't disclose which customer designs I may be privy to ;)

However, some system diagrams certainly may make it look like it's a shared bus, so I can understand why you might think that's the case; but the "system" bus is not always the fastest bus in the device. The better architectures "might" not actually push the bulk of the GPU traffic through this bus, although you would be correct to say that some do.
I see where the confusion is. I mean "an additional 2-3 cycles" vs what the L2 cache latency would be if it were hanging off the CPU directly.
Plus significant latency when the GPU is let in.

I imagine most people follow ARM's design recommendations for where to put the L2 cache. I can only speak of one design that's out there currently though as I don't know the details of other vendors.

I think that's just semantics. The L2 is no more "tightly coupled" to the CPU than it is to the DSP, or GPU, or codec pipeline. They all have equal, coherent access to the L2 cache and it all goes over the system bus. Whether you consider that a "true shared cache" is up to you.
It isn't semantics if the bulk of the GPU traffic doesn't use that path.
I'm not aware of a mobile ARM SoC with memory speeds that are much higher. 45nm Snapdragon clocks its memory at around ~280MHz. The OMAPs run it even slower, IIRC.

Perhaps. But that's how it's designed today as per ARM. Many vendors may choose to implement their cache hierarchy differently but judging from their various presentations, it doesn't look like they deviated from the ARM standard implementation.

I still contend that thrashing isn't as big of a problem as you may think. But I've yet to see a profile from a use case in which both the CPU and GPU are in contention for cache space so I'll defer making a definitive comment on that.

There is no perhaps about it ;)
 
Obviously I can't disclose which customer designs I may be privy to ;)

However, some system diagrams certainly may make it look like it's a shared bus, so I can understand why you might think that's the case; but the "system" bus is not always the fastest bus in the device. The better architectures "might" not actually push the bulk of the GPU traffic through this bus, although you would be correct to say that some do.

What is "the bulk of GPU traffic"? The most bandwidth intensive would be data to the display/frame buffer. I know of at least one SoC out there that does that through the AXI bus. The CPU to GPU traffic is best done through AXI (and L2 cache, in fact).

Plus significant latency when the GPU is let in.

Yes, but again, what is a scenario in a handset that would have that happen?

It isn't semantics if the bulk of the GPU traffic doesn't use that path.

Sure it is. Even the CPU doesn't always use L2 (or L0/L1 for that matter) for all of its traffic. Cache should only be used where it makes sense to use it. If there's traffic going to/from the GPU where it makes sense to put in shareable cache, then it should be put in shareable cache. If not, a separate memory pool or bus should be used.
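To put that "use cache only where it makes sense" point in concrete terms, the decision is typically made per buffer based on who touches it and how; a toy sketch follows, where the enum and helper are purely illustrative rather than any real driver interface:

```c
/* Illustrative per-buffer attribute choice: not a real API, just the
 * shape of the decision described above. */
enum mem_attr {
    MEM_CACHED_COHERENT,   /* small shared control structures the CPU reads back */
    MEM_WRITE_COMBINED,    /* bulk one-way CPU-to-GPU streaming (uploads, vertices) */
    MEM_UNCACHED           /* GPU-only working sets, framebuffers */
};

static enum mem_attr choose_attr(int cpu_reads_back, int bulk_stream)
{
    if (bulk_stream)
        return MEM_WRITE_COMBINED;   /* keep it out of the CPU caches entirely */
    if (cpu_reads_back)
        return MEM_CACHED_COHERENT;  /* worth paying the coherency/snoop cost */
    return MEM_UNCACHED;
}
```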
 