ARM Midgard Architecture

MfA · Nov 16, 2010

metafor said:
The CPU to GPU traffic is best done through AXI (and L2 cache, in fact).

Ignoring really close-knit CPU/GPU interaction, I don't see why this is necessarily so. For, say, command buffers the amount of bandwidth consumed is relatively small and it's streamed, so neither latency nor temporal coherence is really important either. It saves a little power I guess, but most of the time there is probably data which gets more bang out of cacheline bucks.

Simon F · Nov 16, 2010

metafor said:
What is "the bulk of GPU traffic"? The most bandwidth intensive would be data to the display/frame buffer.

Hardly, well, at least not in the leading mobile devices.

rapso · Nov 16, 2010

I would expect that most traffic on "leading mobile devices", using ImgTec GPUs, is vertex traffic, due to the interpolator-buffer that is streamed out and in again.

But I wouldn't expect that's the usual case for other GPUs. In my software renderer the most traffic is coming from the zbuffer (I don't have any hierarchical Z, tho), next on stage is the framebuffer and if you just consider passes that contribute to the final image, it's texture DMA, but if you take other passes into account e.g. shadows, it's Vertex data, and then texture data.
That's because of the usually good compression of textures. One vertex is easily 32 bytes, while texels are 4 or 8 bits. If you consider that ~half of the vertices are processed because of backfaces, you already have a 16:1 ratio of vertex vs texture.
I use some caches, but not because of lowering the latency, but to lower the bandwidth need, freeing some resources for other parts of the system.

I didn't read any paper regarding Mali, but I think someone said it's deferred, but not tiled as ImgTec, but rather binning drawcall based? In that case I wouldn't expect the framebuffer has any benefits using the cache. (Sorry if that was already said here, I wasn't reading all of the thread, just the last 10 posts, kinda.)
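The 16:1 figure above can be reproduced with a little arithmetic. This is only a sketch of one possible reading of the post: it assumes the ratio compares bits per vertex against bits per compressed texel, and that roughly half of the vertices count after backface culling. The function name and the 0.5 factor are illustrative, not something rapso specified.

```python
# Per-element size comparison using the figures quoted in the post.
VERTEX_BYTES = 32        # "one vertex is easily 32 bytes"
TEXEL_BITS = (4, 8)      # compressed texel sizes mentioned in the post
VISIBLE_FRACTION = 0.5   # assumption: about half the vertices count after backface culling


def vertex_to_texel_ratio(texel_bits: int,
                          visible_fraction: float = VISIBLE_FRACTION) -> float:
    """Bits per (culling-weighted) vertex divided by bits per texel."""
    vertex_bits = VERTEX_BYTES * 8
    return vertex_bits * visible_fraction / texel_bits


for bits in TEXEL_BITS:
    print(f"{bits}-bit texels -> {vertex_to_texel_ratio(bits):.0f}:1")
```

Under these assumptions the 8-bit case gives the 16:1 quoted in the post, and 4-bit texels double it.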

Ailuros · Nov 16, 2010

rapso said:
I would expect that most traffic on "leading mobile devices", using ImgTec GPUs, is vertex traffic, due to the interpolator-buffer that is streamed out and in again.

I'd love to stand corrected but I severely doubt that's true.

I didn't read any paper regarding Mali, but I think someone said it's deferred, but not tiled as ImgTec, but rather binning drawcall based? In that case I wouldn't expect the framebuffer has any benefits using the cache. (sorry if that was already said here, I wasn't reading all of the thread, just the last 10 post, kinda ).

I'm sure many will protest again, but the Mali strikes me as an early-Z IMR. Most definitely tile based but not necessarily a strict deferred renderer despite what their recent material might claim. Who cares really about those definitions in the end anyway? If any architecture doesn't have any severe disadvantages (which I don't think is the case with Mali) then definitions are the last thing I personally care about.

JohnH · Nov 16, 2010

metafor said:
What is "the bulk of GPU traffic"? The most bandwidth intensive would be data to the display/frame buffer. I know of at least one SoC out there that does that through the AXI bus. The CPU to GPU traffic is best done through AXI (and L2 cache, in fact).

This is wrong on two counts, a) in the architecture used in most mobile devices FB bandwidth is not the most intensive BW consumer, b) it would be idiotic to feed FB data through the L2 cache as this will almost invariably be large enough to result in thrashing.

Yes, but again, what is a scenario in a handset that would have that happen?

Any GPU access to the cache increases the probability of increasing the latency of CPU fetches; as GPU traffic is inherently high BW, it will pretty much happen all the time.

Sure it is. Even the CPU doesn't always use L2 (or L0/L1 for that matter) for all of its traffic. Cache should only be used where it makes sense to use it. If there's traffic going to/from the GPU where it makes sense to put in shareable cache, then it should be put in shareable cache. If not, a separate memory pool or bus should be used.

Consider a typical gaming scenario, the game engine will typically lean heavily on the L2 cache for its larger data structures, if you choose to also stream data through the cache to the GPU you will invariably, a) throw out data structures that the CPU needs to run the game engine and b) increase the latency on any fetches that still manage to hit. Basically it just isn't a performance or power win.

Now you might say that UI scenarios are different, as there is no game engine and the geometry load is low, so the geometry should be pushed through the cache. However, in this case the geometry load is irrelevant relative to things like texture BW, so you don't actually gain anything of significance by doing this.

John.

metafor · Nov 16, 2010

JohnH said:
This is wrong on two counts, a) in the architecture used in most mobile devices FB bandwidth is not the most intensive BW consumer, b) it would be idiotic to feed FB data through the L2 cache as this will almost invariably be large enough to result in thrashing.

I don't believe I suggested FB be sent through the L2. Actually, quite the opposite.

Any GPU access to the cache increases the probability of increasing the latency of CPU fetches; as GPU traffic is inherently high BW, it will pretty much happen all the time.

Yes but my question is when does the GPU and CPU contend for cachelines? I.e. in what scenario will the GPU be loading massive amounts of data from memory that isn't written by the CPU? I'll agree that when writing to the FB, it makes little to no sense to use the L2, but what other situation would the GPU be making data accesses that wasn't the working set of the CPU as well? More importantly, in cases where the GPU is highly utilized, how many scenarios also involve the CPU being highly utilized on a separate task that doesn't use the same working dataset?

Consider a typical gaming scenario, the game engine will typically lean heavily on the L2 cache for its larger data structures, if you choose to also stream data through the cache to the GPU you will invariably, a) throw out data structures that the CPU needs to run the game engine and b) increase the latency on any fetches that still manage to hit. Basically it just isn't a performance or power win.

What's the alternative here? Geometry data has to be sent from the CPU to GPU. Even more so, the CPU's current working set in this case is geometry data. Now, yes, the CPU also has to work on running the game (AI, controls, physics, objects, etc.) but are any of those more important than providing geometry data to the GPU for rendering? Is that even a bottleneck?

Would you advocate storing geometry data out to DRAM first and having the GPU read it?

JohnH · Nov 16, 2010

metafor said:
I don't believe I suggested FB be sent through the L2. Actually, quite the opposite.

True, you didn't specifically say it should be sent through the L2, my mistake. Although your original comment also suggests that you have a sample of 1 to base your opinion on.

Yes but my question is when does the GPU and CPU contend for cachelines? I.e. in what scenario will the GPU be loading massive amounts of data from memory that isn't written by the CPU? I'll agree that when writing to the FB, it makes little to no sense to use the L2, but what other situation would the GPU be making data accesses that wasn't the working set of the CPU as well? More importantly, in cases where the GPU is highly utilized, how many scenarios also involve the CPU being highly utilized on a separate task that doesn't use the same working dataset?

Typically the CPU is generating data for the GPU from a data set that the GPU does not need to access, i.e. although related they are not the same data. Further, the data generated by the CPU typically does not need to be read back by the CPU, so the CPU receives no benefit from writing through the cache; all you're doing is unnecessarily pushing out other data that is likely to be subsequently useful.

What's the alternative here? Geometry data has to be sent from the CPU to GPU. Even more so, the CPU's current working set in this case is geometry data. Now, yes, the CPU also has to work on running the game (AI, controls, physics, objects, etc.) but are any of those more important than providing geometry data to the GPU for rendering? Is that even a bottleneck?

Would you advocate storing geometry data out to DRAM first and having the GPU read it?

In a well-written engine much of the geometry does not come directly from the CPU; it comes from static VBOs and IBOs that are CPU write-once, GPU read-many. This data is typically many times the size of the L2 cache; it simply cannot live there.

Even if the amount of (CPU generated) dynamic geometry is significant, it still makes sense to send it via DRAM due to the thrashing reasons mentioned (several times) above. And if the amount of dynamic geometry is small, it becomes irrelevant how it's sent to the GPU.

As I originally said, when we get to having 10's of MB's of on SoC storage, either as caches or addressable storage then these arguments may change, but as I said before, we're not quite there yet.

metafor · Nov 16, 2010

JohnH said:
True, you didn't specifically say it should be sent through the L2, my mistake. Although your original comment also suggests that you have a sample of 1 to base your opinion on.

In previous replies, I specifically mentioned the FB as the one case I cannot see a shared cache being useful.

Typically the CPU is generating data for the GPU from a data set that the GPU does not need to access, i.e. although related they are not the same data. Further, the data generated by the CPU typically does not need to be read back by the CPU, so the CPU receives no benefit from writing through the cache; all you're doing is unnecessarily pushing out other data that is likely to be subsequently useful.

The benefit isn't for the CPU to re-read but rather for the GPU to be able to access without having to go through DRAM. Moreover, the CPU does not have to write that data to DRAM. What's the typical bottleneck in a game engine? Can a 15-20% increase in L2 miss for the CPU scene model really become the limiting factor compared to having to wait an atrocious amount of time before rendering the next frame?

In a well-written engine much of the geometry does not come directly from the CPU; it comes from static VBOs and IBOs that are CPU write-once, GPU read-many. This data is typically many times the size of the L2 cache; it simply cannot live there.

Being many times the size of L2 does not mean it can't be cached in L2. There'd be no need for memory then. That doesn't mean there aren't benefits to caching it. In fact, read-many is the primary reason for caching.

Basically, you have limited cache space and unlike a desktop scenario, no local memory for the GPU to read from. What's more bandwidth intensive? Where is the typical bottleneck? Is it the CPU's ability to model the scene and generate geometry information or the GPU's ability to read geometry data and render it?

I had always been under the assumption that it was the latter but you're welcome to tell me it's the former.

Even if the amount of (CPU generated) dynamic geometry is significant, it still makes sense to send it via DRAM due to the thrashing reasons mentioned (several times) above. And if the amount of dynamic geometry is small, it becomes irrelevant how it's sent to the GPU.

As I originally said, when we get to having 10's of MB's of on SoC storage, either as caches or addressable storage then these arguments may change, but as I said before, we're not quite there yet.

I can't say I agree with the "all or nothing" notion you have about caching.

MfA · Nov 16, 2010

metafor said:
In fact, read-many is the primary reason for caching.

No, locality of reference is.

Data in excess of cache size which is streamed in once every 60th of a second is read many times, but shouldn't be cached.
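The point about data larger than the cache being streamed once per frame can be illustrated with a toy model. This is a minimal sketch, assuming a fully associative cache with true LRU replacement and a stream that is read in the same order on every pass; real L2 caches are set associative and often use other policies, so the numbers are only indicative. The function name and sizes are made up for the example.

```python
from collections import OrderedDict


def lru_hit_rate(cache_lines: int, stream_lines: int, passes: int) -> float:
    """Hit rate of a fully associative LRU cache when the same
    `stream_lines` distinct lines are read in order on every pass."""
    cache = OrderedDict()  # keys are line addresses, ordered oldest -> newest
    hits = accesses = 0
    for _ in range(passes):
        for line in range(stream_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses


# Data set that fits: only the first pass misses.
print(lru_hit_rate(cache_lines=1024, stream_lines=512, passes=60))
# Data set four times the cache: each line is evicted before it is reused.
print(lru_hit_rate(cache_lines=1024, stream_lines=4096, passes=60))
```

In this model the oversized stream is read 60 times yet never hits, because every line is pushed out before the next pass reaches it. That is the sense in which "read many" alone does not justify caching.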

Exophase · Nov 17, 2010

Ailuros said:
I'm sure many will protest again, but the Mali strikes me as an early-Z IMR. Most definitely tile based but not necessarily a strict deferred renderer despite what their recent material might claim. Who cares really about those definitions in the end anyway? If any architecture doesn't have any severe disadvantages (which I don't think is the case with Mali) then definitions are the last thing I personally care about.

Who will protest? ARM themselves say it's an IMR. So long as we're being clear with our terminology there shouldn't be much room for argument. The framebuffer is rendered to on-chip tile memory. It doesn't perform deferred shading like IMG GPUs do.. in fact, I don't think anyone but IMG does.

I guess there's some controversy because ARM once referred to their architecture as deferred, but that was just in the sense that scene data is captured and binned before rendering begins. I guess one way to call it is that vertex data is deferred but fragment data isn't.

Of course, ARM could have changed something here.

JohnH · Nov 17, 2010

metafor said:
The benefit isn't for the CPU to re-read but rather for the GPU to be able to access without having to go through DRAM. Moreover, the CPU does not have to write that data to DRAM. What's the typical bottleneck in a game engine? Can a 15-20% increase in L2 miss for the CPU scene model really become the limiting factor compared to having to wait an atrocious amount of time before rendering the next frame?

I'd suggest that you need to go and look at both the volume of data involved and how a write-back cache actually works. The reality is more likely to be that you're not increasing the CPU L2 miss rate by 15-20% but you're actually pushing the misses on CPU fetches to 100%. And even where the CPU does hit, its latency is going to be increased by contention.

Being many times the size of L2 does not mean it can't be cached in L2. There'd be no need for memory then. That doesn't mean there aren't benefits to caching it. In fact, read-many is the primary reason for caching.

Not correct: streaming data that is larger than the cache through the cache will just result in all data being evicted and re-read each time it is read, i.e. you are thrashing the cache for zero benefit.

Basically, you have limited cache space and unlike a desktop scenario, no local memory for the GPU to read from. What's more bandwidth intensive? Where is the typical bottleneck? Is it the CPU's ability to model the scene and generate geometry information or the GPU's ability to read geometry data and render it?

I had always been under the assumption that it was the latter but you're welcome to tell me it's the former.

The key statement here is that you have limited cache memory, i.e. you do not have enough cache memory to store the data set that is going to be accessed on every frame, probably by a large factor. The result is that, however you express it, you end up pushing and pulling all that data in and out of DRAM anyway, thrashing the cache in the process. As such you're better off leaving the L2 to the CPU, which will benefit significantly.

I can't say I agree with the "all or nothing" notion you have about caching.

It isn't all or nothing; it's about not trying to squeeze something somewhere it doesn't fit. As I've said several times now, this gets more interesting when we get to 10's of MB of on-die memory.
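The claim that streaming GPU data through a shared L2 can push CPU misses towards 100% can be sketched with a toy shared-cache model. Everything here is an assumption made for illustration: a fully associative LRU cache, a CPU that cycles over a working set which fits in the cache on its own, and a GPU stream whose lines are never reused. The function name, sizes and interleaving ratio are hypothetical.

```python
from collections import OrderedDict
from itertools import count


def cpu_hit_rate(cache_lines: int, cpu_set_lines: int,
                 gpu_lines_per_cpu_access: int, cpu_accesses: int) -> float:
    """CPU hit rate in a shared LRU cache while a GPU streams
    never-reused lines through the same cache."""
    cache = OrderedDict()
    gpu_addr = count(start=1_000_000)  # GPU addresses, disjoint from the CPU set

    def touch(line: int) -> bool:
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
        return hit

    hits = 0
    for i in range(cpu_accesses):
        hits += touch(i % cpu_set_lines)   # CPU cycles over its working set
        for _ in range(gpu_lines_per_cpu_access):
            touch(next(gpu_addr))          # GPU streaming traffic
    return hits / cpu_accesses


print(cpu_hit_rate(1024, 512, 0, 51_200))  # CPU alone
print(cpu_hit_rate(1024, 512, 2, 51_200))  # CPU plus GPU stream
```

With these particular parameters the CPU alone hits on everything after its first pass, while two streamed GPU lines per CPU access are enough to evict every CPU line before it is touched again. A real system sits somewhere between those extremes, depending on associativity, replacement policy and the actual traffic mix.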

JohnH · Nov 17, 2010

Exophase said:
Who will protest? ARM themselves say it's an IMR. So long as we're being clear with our terminology there shouldn't be much room for argument. The framebuffer is rendered to on-chip tile memory.

It doesn't perform deferred shading like IMG GPUs do.. in fact, I don't think anyone but IMG does.

I guess there's some controversy because ARM once referred to their architecture as deferred, but that was just in the sense that scene data is captured and binned before rendering begins. I guess one way to call it is that vertex data is deferred but fragment data isn't.

Mali is not an IMR, it is not a hybrid IMR/TBR, it IS a tile-based renderer; all they've done is remove deferred texturing/shading (a key optimisation), and yes, IMG does still do deferred texturing/shading.

Of course, ARM could have changed something here.

Other than the detailed operation of their tiling engine and the move to unified shading (even though they have previously claimed it was the wrong thing to do) I think you'll find it's still pretty much the same.

John.

Exophase · Nov 17, 2010

JohnH said:
Mali is not an IMR, it is not a hybrid IMR/TBR, it IS a tile-based renderer; all they've done is remove deferred texturing/shading (a key optimisation), and yes, IMG does still do deferred texturing/shading.

Haha, I like the way you phrase that, "remove deferred texturing/shading", as if their design consisted of copying IMG then changing it. That's a really funny way to put it. Deferred shading does have drawbacks too, like the penalty for alpha test. Of course IMG still does it, no one was questioning that.

Honestly your response is kind of frustrating because I made it clear I understand how Mali works, I made it clear how I was using the definitions, it's like you saw "IMR" and stopped reading. There isn't some kind of industry standard for what "IMR" means, there isn't really anything wrong with defining it by order of fragment operations and not by scene gathering/lack thereof.

JohnH said:
Other than the detailed operation of their tiling engine and the move to unified shading (even though they have previously claimed it was the wrong thing to do) I think you'll find it's still pretty much the same.

John.

Do you know more than they've announced? At the very least, their arrangement of ALUs to TMUs and ROPs looks like it could be different (something more like 2:1 than 1:1).

Ailuros · Nov 17, 2010

Exophase said:
Who will protest? ARM themselves say it's an IMR. So long as we're being clear with our terminology there shouldn't be much room for argument. The framebuffer is rendered to on-chip tile memory. It doesn't perform deferred shading like IMG GPUs do.. in fact, I don't think anyone but IMG does.

Many did in the past and in more than one spot. As I said I couldn't care less what one calls A or B.

I guess there's some controversy because ARM once referred to their architecture as deferred, but that was just in the sense that scene data is captured and binned before rendering begins. I guess one way to call it is that vertex data is deferred but fragment data isn't.

Of course, ARM could have changed something here.

Advanced tile-based deferred rendering and local buffering of intermediate pixel states.

http://www.arm.com/products/multimedia/mali-graphics-hardware/mali-400-mp.php

First under "features". It's still there even today.

***edit: by the way I'm not particularly fond of public catfights: http://channel.hexus.net/content/item.php?item=27539

metafor · Nov 17, 2010

JohnH said:
I'd suggest that you need to go and look at both the volume of data involved and how a write-back cache actually works. The reality is more likely to be that you're not increasing the CPU L2 miss rate by 15-20% but you're actually pushing the misses on CPU fetches to 100%. And even where the CPU does hit, its latency is going to be increased by contention.

That depends entirely on how the cache is designed and its replacement algorithm. A system cache would not function exactly like a typical CPU L2 write-back cache.

Not correct: streaming data that is larger than the cache through the cache will just result in all data being evicted and re-read each time it is read, i.e. you are thrashing the cache for zero benefit.

That depends on read order, cache replacement algorithm and software. Scorpion, for example, provides cache-ops for set, way and MVA locking.

The key statement here is that you have limited cache memory, i.e. you do not have enough cache memory to store the data set that is going to be accessed on every frame, probably by a large factor. The result is that, however you express it, you end up pushing and pulling all that data in and out of DRAM anyway, thrashing the cache in the process. As such you're better off leaving the L2 to the CPU, which will benefit significantly.

Only with an LRU cache would this occur and only if the GPU reads the entire geometry map blindly in order.

It isn't all or nothing; it's about not trying to squeeze something somewhere it doesn't fit. As I've said several times now, this gets more interesting when we get to 10's of MB of on-die memory.

But it is "all or nothing". Your contention is that there is never, say, a 256KB piece of data that the GPU constantly accesses in which it is advantageous to keep locally in order to be read fast. It doesn't have to be the entire geometry map.

JohnH · Nov 18, 2010

metafor said:
That depends entirely on how the cache is designed and its replacement algorithm. A system cache would not function exactly like a typical CPU L2 write-back cache.

That depends on read order, cache replacement algorithm and software. Scorpion, for example, provides cache-ops for set, way and MVA locking.

Only with an LRU cache would this occur and only if the GPU reads the entire geometry map blindly in order.

Nonsense, if the data set being pulled through the cache is larger than the cache then there is no replacement policy that isn't going to result in thrashing of that cache.

Note that typically these types of caches use random or pseudo-LRU replacement policies.

But it is "all or nothing". Your contention is that there is never, say, a 256KB piece of data that the GPU constantly accesses in which it is advantageous to keep locally in order to be read fast. It doesn't have to be the entire geometry map.

If that piece of data represents a significant part of the required BW and other things don't result in it being evicted from the cache (which they will) then I'd agree with you, but generally this isn't the case.

As a matter of interest, you'll notice that ARM show the GPU with its own L2 cache; there are very good reasons why this is a separate cache instead of sharing the L2 with the CPU.
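Since the disagreement above turns partly on replacement policy, here is a rough sketch of the same cyclic stream under random replacement rather than LRU. It assumes a fully associative cache and a fixed random seed. The function name and sizes are invented for the example, and it models neither pseudo-LRU nor the set/way locking mentioned for Scorpion.

```python
import random


def random_replacement_hit_rate(cache_lines: int, stream_lines: int,
                                passes: int, seed: int = 0) -> float:
    """Hit rate of a fully associative cache with random replacement
    when the same stream is read in order on every pass."""
    rng = random.Random(seed)
    slots = []        # slot index -> resident line
    resident = set()  # fast membership test for resident lines
    hits = accesses = 0
    for _ in range(passes):
        for line in range(stream_lines):
            accesses += 1
            if line in resident:
                hits += 1
            elif len(slots) < cache_lines:
                slots.append(line)                     # cache still filling up
                resident.add(line)
            else:
                victim = rng.randrange(cache_lines)    # pick a random victim slot
                resident.discard(slots[victim])
                slots[victim] = line
                resident.add(line)
    return hits / accesses


print(random_replacement_hit_rate(1024, 512, 20))   # stream fits in the cache
print(random_replacement_hit_rate(1024, 4096, 20))  # stream is 4x the cache
```

In this toy model random replacement does rescue a few hits on the oversized stream, where strict LRU would get none, but the hit rate stays far below the fitting case. That is loosely consistent with both sides of the exchange: the policy changes the exact number, but not the overall picture when the streamed set is several times the cache size.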

JohnH · Nov 18, 2010

Exophase said:
Haha, I like the way you phrase that, "remove deferred texturing/shading", as if their design consisted of copying IMG then changing it. That's a really funny way to put it. Deferred shading does have drawbacks too, like the penalty for alpha test. Of course IMG still does it, no one was questioning that.

There is no additional cost for alpha test associated with being deferred; all that happens is that in the worst case you lose the benefit of the deferred rendering on those primitives.

Honestly your response is kind of frustrating because I made it clear I understand how Mali works, I made it clear how I was using the definitions, it's like you saw "IMR" and stopped reading. There isn't some kind of industry standard for what "IMR" means, there isn't really anything wrong with defining it by order of fragment operations and not by scene gathering/lack thereof.

I know what you meant, I was just re-iterating the point fully to try and avoid ambiguity. The distinction between IMR and TBR is the gathering of a whole scene's data; it has nothing to do with the order of fragment operations, the latter always being preserved (or the effect of) irrespective of the technology used, as this is required to retain API conformance.

Do you know more than they've announced? At the very least, their arrangement of ALUs to TMUs and ROPs looks like it could be different (something more like 2:1 than 1:1).

Maybe my statement was stronger than it should have been; however, the arrangement of these units implies nothing about the basic approach to rendering. The safest bet right now is that there aren't significant changes in this respect.

Exophase · Nov 18, 2010

JohnH said:
There is no additional cost for alpha test associated with being deferred; all that happens is that in the worst case you lose the benefit of the deferred rendering on those primitives.

There certainly is on SGX530 and presumably SGX535 - we've witnessed it first hand, and why else would IMG's and Apple's document insist on avoiding it for alpha blend instead? You make the whole thing much slower just by having it off, regardless of whether or not the alpha test passes or fails.

JohnH said:
I know what you meant, I was just re-iterating the point fully to try and avoid ambiguity. The distinction between IMR and TBR is the gathering of a whole scene's data; it has nothing to do with the order of fragment operations, the latter always being preserved (or the effect of) irrespective of the technology used, as this is required to retain API conformance.

The truth is, "IMR" and "deferred" don't have any kind of standardized definitions. Furthermore, something doesn't have to be tile based to be scene gathering. The order of the effect of fragment operations isn't what I'm talking about.

If you're going to use "immediate" to mean non-scene gathering (you need another term specifically for "non-tile based") then you need a term to describe not performing deferred shading.

JohnH said:
Maybe my statement was stronger than it should have been; however, the arrangement of these units implies nothing about the basic approach to rendering. The safest bet right now is that there aren't significant changes in this respect.

I agree, although I would say that not being limited to FP16 in the fragment pipeline makes the architecture seem far more viable.

JohnH · Nov 18, 2010

Exophase said:
There certainly is on SGX530 and presumably SGX535 - we've witnessed it first hand, and why else would IMG's and Apple's document insist on avoiding it for alpha blend instead? You make the whole thing much slower just by having it off, regardless of whether or not the alpha test passes or fails.

The cost you're seeing has nothing to do with the device being deferred.

The truth is, "IMR" and "deferred" don't have any kind of standardized definitions. Furthermore, something doesn't have to be tile based to be scene gathering. The order of the effect of fragment operations isn't what I'm talking about.

Hmm, to quote your previous post,

"There isn't some kind of industry standard for what "IMR" means, there isn't really anything wrong with defining it by order of fragment operations and not by scene gathering/lack thereof"

You're clearly referring to order of fragment operations here, there's not really any other way I can read that! Anyway...

If you're going to use "immediate" to mean non-scene gathering (you need another term specifically for "non-tile based") then you need a term to describe not performing deferred shading.

In this context IMR exclusively refers to "Immediate mode rendering", I'm pretty sure there is no contention within the industry as a whole as to what an immediate mode renderer is.

For the non-IMR case the collective term that was traditionally used was "scene capture" based devices. As it happens, there aren't any non-tile-based scene capture devices out there, or vice versa, so TBR/TBDR have ended up implying this. Anyone who has been in the industry for any length of time or is aware of the history will recognise and understand this.

Anyway this is just pointless semantics.

Exophase · Nov 18, 2010

Order of fragment operations and order of the EFFECT of fragment operations are two different things. Depth-test is a fragment operation, and therefore Mali and SGX perform fragment operations in separate order.

I obviously know what IMR stands for.. I don't think there's as much of a consensus as you think, as indicated by Ailuros's previous post.

I've read before about GMA945 being scene capturing; never heard anything on it being tile based. I assume that scanline based is being considered a type of tiling.
