Alternative AA methods and their comparison with traditional MSAA

Discussion in 'Rendering Technology and APIs' started by mitran, Nov 15, 2009.

  1. ihamoitc2005

    Veteran

    Cache miss

    I agree with this, my friend. But what worries me is the domino effect of these tiles occupying a large part of the (small) L2 and causing cache misses for the other threads.
     
  2. liolio

    liolio Aquoiboniste
    Legend

    You don't get streaming... nor the fact that bandwidth for moving 7 MB is not the limiting factor, especially over a window longer than 10 ms.
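    As a rough back-of-the-envelope for that point (the 7 MB figure is from the post above; the 10 GB/s sustained bandwidth below is only an assumed ballpark for illustration):
    [code]
    /* Back-of-the-envelope: time to move ~7 MB at a few GB/s.
       The 10 GB/s figure is an illustrative assumption, not a measured
       number for any particular console bus. */
    #include <stdio.h>

    int main(void)
    {
        const double buffer_mb     = 7.0;    /* size quoted in the thread   */
        const double bandwidth_gbs = 10.0;   /* assumed sustained bandwidth */
        const double window_ms     = 10.0;   /* time window quoted above    */

        /* GB/s -> MB/ms, then divide the buffer size by it. */
        double transfer_ms = buffer_mb / (bandwidth_gbs * 1024.0) * 1000.0;
        printf("Moving %.1f MB at %.1f GB/s takes ~%.2f ms (vs. a %.1f ms window)\n",
               buffer_mb, bandwidth_gbs, transfer_ms, window_ms);
        return 0;
    }
    [/code]
    Under those assumptions the transfer itself is well under a millisecond, which is why raw buffer movement is not where the time goes.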
     
  3. patsu

    Legend

    http://forum.beyond3d.com/showpost.php?p=1433406&postcount=443

    If the problem size is large enough, you can still spread the load across many, many processors, hence massively parallelizable. But there is an inherent data dependency due to the imposed order: you may need to write intermediate or final results back before fetching new data.

    I didn't say the bandwidth is the issue. I'm just saying you can't plug SPU numbers into the 360 to derive more 360 numbers. There may be assumptions that no longer hold, and the algorithm/implementation may also differ between architectures after optimization. E.g., the GPU MLAA paper seems to tweak the original algorithm to make it run fast on GPUs (there are some precomputed factors involved, but I'm not sure what they really are).

    If I remember correctly, the L1 cache/LocalStore is a few times faster than the L2 cache. The 4 ms (or 20 ms) figures for SPU MLAA are achieved in that kind of hardware environment. They also use DMA to load the LocalStore asynchronously and completely under the programmer's control, so the SPE can go ahead and do other work at the right time.
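    To illustrate the "DMA ahead while the SPU works" pattern being described, here is a minimal double-buffering sketch, assuming the CBE SDK's spu_mfcio.h intrinsics. The tile size, the effective-address layout and process_tile() are placeholders for illustration, not anything from an actual MLAA implementation:
    [code]
    /* Minimal double-buffered SPU loop: DMA the next tile into one
       LocalStore buffer while the SPU processes the other. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TILE_BYTES (16 * 1024)   /* 16 KB: the MFC's max single transfer */
    #define NUM_TILES  64            /* illustrative tile count              */

    static volatile uint8_t ls_buf[2][TILE_BYTES] __attribute__((aligned(128)));

    /* Stand-in for the real per-tile work (edge detection, blending, ...). */
    static void process_tile(volatile uint8_t *tile, unsigned bytes)
    {
        for (unsigned i = 0; i < bytes; ++i)
            tile[i] = (uint8_t)~tile[i];
    }

    static void mlaa_like_loop(uint64_t ea_base)
    {
        unsigned cur = 0;

        /* Prime the pipeline: fetch tile 0 into buffer 0 under tag 0. */
        mfc_get(ls_buf[cur], ea_base, TILE_BYTES, cur, 0, 0);

        for (unsigned i = 0; i < NUM_TILES; ++i) {
            unsigned next = cur ^ 1;

            if (i + 1 < NUM_TILES) {
                /* Make sure any earlier write-back from the other buffer has
                   drained, then start fetching the next tile into it. That
                   DMA overlaps with the processing below. */
                mfc_write_tag_mask(1u << next);
                mfc_read_tag_status_all();
                mfc_get(ls_buf[next], ea_base + (uint64_t)(i + 1) * TILE_BYTES,
                        TILE_BYTES, next, 0, 0);
            }

            /* Wait only for the tile we are about to process. */
            mfc_write_tag_mask(1u << cur);
            mfc_read_tag_status_all();

            process_tile(ls_buf[cur], TILE_BYTES);

            /* Write the result back asynchronously and move on. */
            mfc_put(ls_buf[cur], ea_base + (uint64_t)i * TILE_BYTES,
                    TILE_BYTES, cur, 0, 0);

            cur = next;
        }

        /* Drain all outstanding DMAs before returning. */
        mfc_write_tag_mask(3u);
        mfc_read_tag_status_all();
    }

    int main(unsigned long long spe_id, unsigned long long argp,
             unsigned long long envp)
    {
        (void)spe_id; (void)envp;
        mlaa_like_loop(argp);   /* argp: effective address of the buffer */
        return 0;
    }
    [/code]
    The point of the pattern is that the wait is on a tag group the programmer chose, so the fetch of tile i+1 proceeds while tile i is being processed, with no cache hierarchy in the way.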
     
  4. liolio

    liolio Aquoiboniste
    Legend

    OK, I give up; believe whatever you want... It won't change the fact that MLAA is not possible on Xenon, because it doesn't have the horsepower to do it.
    Patsu, what you say makes no sense; sorry to be blunt, but that's it.
    How can you say it's not possible to calculate the time to move a buffer based on bandwidth?
    Do you think MLAA is bandwidth limited?
    Obviously pixels are linked, but you still have to split the frame buffer. If the data set were large enough, with a lot of dependencies, it would not scale well. Your idea of data parallelism is messed up.

    The SPU doing something else while applying MLAA makes no sense either, IMHO, especially if you consider double buffering as a possibility.
     
    #644 liolio, Jul 9, 2010
    Last edited by a moderator: Jul 9, 2010
  5. patsu

    Legend

    You can. But moving a buffer in one go is not MLAA. If you want to derive bandwidth usage for MLAA, you would need to look at the MLAA implementation and its access pattern.

    See my posts above, especially "I didn't say the bandwidth is the issue."

    When there is an imposed order on the data, the CPU will need to write results back into main memory before processing the dependent data. This may trigger cache-coherency logic for the L2 (even if it's locked?), slowing down memory access compared to the completely separate, cache-logic-free LocalStore.

    As for the SPU doing something else, I meant the SPU can do something else while the DMA controller is fetching data into the LocalStore. A regular CPU must always go through the same automatic L1 -> L2 -> main memory chain, and sometimes it has to wait for the memory subsystem to come back, especially when cache logic is involved.
     
  6. liolio

    liolio Aquoiboniste
    Legend

    But which "memory"? What do they consider the neighbourhood of a pixel? Do you think a pixel would be compared to every single pixel of a 48x48 tile (half of Xenon's L1 data cache)? I think not, so the tile size could be really tiny and still fit in the L1 data cache (most likely the relevant tile could be far smaller than the 64x64 I stated).
    I still can't see the difference between internal bandwidth and external bandwidth, and you don't seem to understand data locality.
    Sorry, that's completely wrong.
    You don't know how coherency is handled in Xenon; as in other modern processors, it's neither strictly inclusive nor strictly exclusive, it's something cleverer. I don't know either, it's not public (nor is it public for Intel or AMD). But it's clear you didn't know what inclusive or exclusive L1 & L2 caches are.
    Say it's inclusive: one line of L1 is modified, so L2 has to be modified too. In our case that's "free"; the CPU still keeps loading data from the I$ and D$.
    There could be contention, but not in the case of a data-parallel workload: only one core works on a given set of data, so its L2 values won't get overwritten by another core.

    Overall there is no magic about the SPUs; they are like old processors that didn't have caches.
    As friendly advice, go find the old Ars Technica articles about CPUs, pipelining, etc.
    They put it really nicely; I managed to get through them with little prior knowledge.

    A coherent memory space and caches don't mean that developers control nothing (especially on a "primitive" CPU like Xenon, which is in-order, with no speculative execution, etc.) or that everything just works by itself. They just help (but cost silicon, power, etc.).
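    As a quick sanity check on the tile-size point, here is the footprint arithmetic for a few tile sizes against the 32 KB L1 data cache usually quoted for Xenon's cores; 4 bytes per pixel (e.g. an RGBA8 buffer) is an assumption for illustration:
    [code]
    /* Rough tile-footprint check against a 32 KB L1 data cache. */
    #include <stdio.h>

    int main(void)
    {
        const int l1_data_bytes   = 32 * 1024;  /* commonly quoted for Xenon */
        const int bytes_per_pixel = 4;          /* assumed: RGBA8            */
        const int tile_sizes[]    = { 32, 48, 64, 96 };
        const int n = sizeof tile_sizes / sizeof tile_sizes[0];

        for (int i = 0; i < n; ++i) {
            int t = tile_sizes[i];
            int footprint = t * t * bytes_per_pixel;
            printf("%3dx%-3d tile: %6d bytes (%5.1f%% of a 32 KB L1 D-cache)\n",
                   t, t, footprint, 100.0 * footprint / l1_data_bytes);
        }
        return 0;
    }
    [/code]
    Under those assumptions even a 64x64 tile is only half of the L1 data cache, so the working set per tile can indeed be kept cache-resident.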
     
  7. patsu

    Legend

    I have no idea. The GPU implementation uses a 512x512 precomputed area-table texture to filter lines of up to 512 pixels, for example. You'd have to look at the individual MLAA algorithms.

    It is true that modern CPUs have various write-back, write-through or even bypass schemes for their caches. I believe the PowerPC family uses a MERSI scheme, and that under this scheme you pay when you read a "dirty" location.

    They replaced the traditional memory hierarchy with the LocalStore to save space/heat, speed up access and avoid the "memory wall" problem (for multi-core access). In the process, they traded away ease of programming.
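    For anyone trying to follow what "the algorithm" actually involves: MLAA implementations broadly (1) find colour/depth discontinuities, (2) work out the length and shape of each edge run to derive blending weights (the GPU version looks these up in that precomputed area texture), and (3) blend each pixel with its neighbours. Below is a minimal sketch of just step (1) on a grayscale buffer; the threshold and the synthetic image are made up for illustration, and steps (2) and (3) are where the real work and the platform-specific tricks live:
    [code]
    /* Step 1 of an MLAA-style pass: mark discontinuities between a pixel
       and its right/bottom neighbours. Real implementations use colour
       and/or depth deltas and feed the mask into weight and blend passes. */
    #include <stdio.h>
    #include <stdlib.h>

    #define W 8
    #define H 8
    #define THRESHOLD 16   /* assumed luminance delta that counts as an edge */

    enum { EDGE_RIGHT = 1, EDGE_BOTTOM = 2 };

    static void detect_edges(const unsigned char *lum, unsigned char *edges)
    {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                unsigned char e = 0;
                int c = lum[y * W + x];
                if (x + 1 < W && abs(c - lum[y * W + x + 1]) > THRESHOLD)
                    e |= EDGE_RIGHT;
                if (y + 1 < H && abs(c - lum[(y + 1) * W + x]) > THRESHOLD)
                    e |= EDGE_BOTTOM;
                edges[y * W + x] = e;
            }
        }
    }

    int main(void)
    {
        unsigned char lum[W * H], edges[W * H];

        /* Synthetic image: dark left half, bright right half -> one edge. */
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                lum[y * W + x] = (x < W / 2) ? 40 : 200;

        detect_edges(lum, edges);

        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x)
                putchar((edges[y * W + x] & EDGE_RIGHT) ? '|' : '.');
            putchar('\n');
        }
        return 0;
    }
    [/code]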
     
  8. assen

    Veteran

    OK, gotta admit I was wrong on that. Looking forward to using what you and the SCEE team did one day in the future.
     
  9. Neb

    Neb Iron "BEAST" Man
    Legend

    It would be interesting to know how the AA solutions in current 360 games relate to Intel's MLAA or similar. There is the GOW "MLAA" (I put it in quotes because AFAIK it is a derivative of Intel's MLAA, for better or worse, with no public papers out). BTW, didn't the original Intel MLAA run like a dog at 120 ms on Cell at the initial stage? How would things look for the 360 regarding MLAA or "MLAA" with some work, maybe yes, maybe no? :smile:
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Interesting :) That's strange; is it from the presentation linked on Dave's blog?
    Interesting too. I did a search; for those interested, look up the MESI and MERSI protocols, it shouldn't hurt either.
    I had never read about that :)
    So let's try to work out what would happen in our case.
    * The data is present in both L1 and L2.
    So the cache lines should be in the S state: clean and readable (a write requires taking ownership).

    * Then the CPU issues a "read for ownership".
    So the L1 cache line is read and the L2 cache line is set to invalid.

    * Now the line is in the M state and differs from the value in RAM.
    The line is still readable, but the hardware has to block access to that line in main RAM until the data in RAM has been updated.

    From the third link I gave above: supposedly the MERSI protocol uses a write-back policy, but we only know that for sure for the G4.
    In that PDF I read that Xenon is MESI & write-through, with a lot of store buffering.

    * So we have two cases:
    - The cache line is only read, nobody tries to overwrite it. At some point, when the write-back happens, it will be changed to the E state (so it is writable, or can be evicted if needed).
    In that case everything is fine; the "cost" of coherency is essentially free, in the sense that you pay for it through higher latencies to your caches. For coherency traffic to main RAM you have 5.4 GB/s.

    - The CPU wants to overwrite the cache line while the same cache line is still in the M state... execution should stall until the state changes to E.
    Here it gets costly, as you lose cycles.
    From the same link as above:

    So that's where it's tricky, and maybe someone could chime in :)
    In Xenon, when is the cache line's state changed?
    At the moment the data is put 1) in the "store buffer", or 2) when it leaves it (towards the RAM)?
    Case 1) looks nice: execution can resume, and as long as snooping works properly it should be OK.

    Overall, if the caches work properly, I can't see coherency being what prevents Xenon from achieving MLAA @ 720p/30fps. From what I understand, you always pay the cost of coherency through higher latencies (or, depending on how Xenon works, extra silicon and power consumption).
    Tile size could be a problem (it should not exceed the L1 data cache).

    I still put my bet on raw horsepower as the limiting factor.
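    To make the state walk-through above easier to follow, here is a toy model of plain MESI transitions for a single cache line: local reads/writes plus snooped remote requests. It is a textbook simplification; Xenon's actual protocol and its store-buffer behaviour are not public, and MERSI's R state is left out:
    [code]
    /* Toy single-line MESI model: textbook transitions only, no store
       buffering and no MERSI 'R' state. */
    #include <stdio.h>

    typedef enum { I_STATE, S_STATE, E_STATE, M_STATE } mesi_t;

    static const char *name(mesi_t s)
    {
        static const char *n[] = { "I", "S", "E", "M" };
        return n[s];
    }

    /* Local read. other_sharers: does any other cache hold the line? */
    static mesi_t local_read(mesi_t s, int other_sharers)
    {
        if (s == I_STATE)                       /* miss: fill from memory/peer */
            return other_sharers ? S_STATE : E_STATE;
        return s;                               /* hit in S/E/M: no change     */
    }

    /* Local write. From I or S this needs a bus transaction
       (read-for-ownership / invalidate); from E it upgrades silently. */
    static mesi_t local_write(mesi_t s)
    {
        (void)s;
        return M_STATE;
    }

    /* Another core reads the line: M is written back first, then M/E drop to S. */
    static mesi_t snoop_read(mesi_t s)
    {
        return (s == I_STATE) ? I_STATE : S_STATE;
    }

    /* Another core requests ownership: we invalidate (writing back if M). */
    static mesi_t snoop_write(mesi_t s)
    {
        (void)s;
        return I_STATE;
    }

    int main(void)
    {
        mesi_t s = I_STATE;
        printf("start       : %s\n", name(s));
        s = local_read(s, 1);  printf("local read  : %s\n", name(s));
        s = local_write(s);    printf("local write : %s (dirty vs RAM)\n", name(s));
        s = snoop_read(s);     printf("remote read : %s (after write-back)\n", name(s));
        s = snoop_write(s);    printf("remote RFO  : %s\n", name(s));
        return 0;
    }
    [/code]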
     
  11. patsu

    Legend

    Someone posted the implementation history of the GoW MLAA here, but I can't find it at the moment. I suppose I could write an Intel MLAA that runs dog-slow on Cell too, but that would be missing the point. Once an implementation is optimized for Cell (20 ms for 1 SPU), the final performance numbers will most likely be tied to that platform's characteristics.

    Other architectures (GPU or regular CPU) will have to find their own ways to do MLAA. The GPU one seems to take some shortcuts, but can spread generously across GPU cores. I'm curious how that modified MLAA would turn out on Cell too.


    EDIT:
    In the first place, I didn't say which factor is the contributing one. I just don't think that plugging the performance numbers from Cell into a calculation of the 360's bottleneck is the right thing to do. FWIW, I think everything on Cell contributed together to make MLAA possible:
    http://forum.beyond3d.com/showpost.php?p=1445386&postcount=622

    So yes, computational power is indeed one of the major factors.
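    For what it's worth, the 20 ms and 4 ms figures quoted in the thread are consistent with a near-linear split of the job across SPUs; the five-SPU split and the perfect scaling below are assumptions for illustration, not quoted numbers:
    [code]
    /* Reconciling the figures quoted in the thread: ~20 ms on one SPU,
       ~4 ms when spread across SPUs. */
    #include <stdio.h>

    int main(void)
    {
        const double single_spu_ms = 20.0;          /* quoted in the thread */
        const int    spus          = 5;             /* assumed              */
        const double frame_ms_30   = 1000.0 / 30.0; /* 30 fps frame budget  */

        double parallel_ms = single_spu_ms / spus;  /* assumes linear scaling */
        printf("%.0f ms / %d SPUs = %.1f ms (%.1f%% of a %.1f ms frame)\n",
               single_spu_ms, spus, parallel_ms,
               100.0 * parallel_ms / frame_ms_30, frame_ms_30);
        return 0;
    }
    [/code]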
     
  12. Neb

    Neb Iron "BEAST" Man
    Legend

    It seems so, at least so far, for an unoptimised version with messy code. But what makes you think the GOW "MLAA" doesn't use shortcuts? Have the GOW MLAA papers been released? Same question for the other methods.
     
  13. patsu

    Legend

    Ah... nicely set up for someone in the know to take the question. Thank you. :)
     
  14. T.B.

    Newcomer

    And ruin all the fun? I couldn't possibly do that, now can I?
     
  15. liolio

    liolio Aquoiboniste
    Legend

    Well, I've been reading quite a bit about coherency and I'm really happy you brought up the coherency protocol :)
    As I was reading I passed by this post (T.B.'s) and came back to check something I had forgotten about Xenon, which can't help either: horrendous L1 latencies.
    Crap, 16 cycles; that's L2-level...
     
  16. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    On a side note, I can't remember if I mentioned it already, but there is a sample in the XDK for an edge filter/AA. The Gamefest presentation only briefly mentioned it a couple of years back, but I wonder if that's what AvP ended up using.
     
  17. patsu

    Legend

    Oh crap, the Jackalope didn't take the bait. I should have been more subtle. :runaway:
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Shortcuts are generally welcome when the results don't have (too many) negative side-effects. At the moment, the thing about GWAA isn't just that it's fast enough to be used, but also that it has the best IQ. The same can't be said of the current GPU implementation, which takes shortcuts to run quickly but doesn't produce results of the same quality.
     
  19. DeanA

    Newcomer

    No.. no you can't!

    :D
     
  20. patsu

    Legend

    I'll just thicken my skin and take it as a compliment. :p
    [size=-2]Thank you <3 XXOO[/size]
     