Alternative AA methods and their comparison with traditional MSAA

Why would we consider one of Xenon's VMX units to be the equivalent of an SPU?
For a best-case basis of comparison! In terms of float calculations they have pretty much the same raw capability IIRC, but availability of data and the SPU's dual-issue float+int could be a factor. That's why I said it'd be interesting to know what it is about SPUs that makes MLAA possible if it's not on other processors with similar wide vector units.
 
That's why I said it'd be interesting to know what it is about SPUs that makes MLAA possible if it's not on other processors with similar wide vector units.

Local store bandwidth, large register file + low instruction latency, dual issue and the wonders of the MFC (in combination with the rather good XDR bandwidth). :)

I suspect you could try making certain shortcuts to cut down on what you actually have to compute (that's seemingly what the GPU-MLAA guys did). I suspect if we ever see a CPU based AA scheme on the 360, that's what they are going to do.
 
Thanks T.B. And I suppose being able to bring latency down by spreading the workload over 4-5 SPUs is also very helpful ...
 
Local store bandwidth, large register file + low instruction latency, dual issue and the wonders of the MFC (in combination with the rather good XDR bandwidth). :)
That's nice to hear. Something to add to the "Is Cell actually good at anything?" thread. :D
 
Thanks T.B. And I suppose being able to bring latency down by spreading the workload over 4-5 SPUs is also very helpful ...

That is not really an "also". ;)
The higher your shared memory bandwidth is, the more processing units you can feed. This is where the rather fast XDR comes in. On the other hand, if you can move memory operations away from shared resources, the load on them is reduced and you can again feed more units.
We're very heavy users of memory bandwidth, but by far the largest part of that is luckily local.
 
Or, just as with exclusive titles, doesn't it show even more so that months and months of optimization by experienced, high-level developers at Santa Monica and SCEA (vs. a small team of researchers) are offering something special here, providing better quality with better efficiency; and doesn't this in turn point to the advantages of huge corporations with deep pockets for funding show-off projects?
Sorry, this is all wrong. It seems from a later post that you got this from misinterpreting Cedric talking about general development time and thinking that quotes like "very long time", "a few months", "few weeks later" referred to work spent on SPU AA. That's not the case. As a game developer you should know that stuff gets worked on, then off, then on-again, then off-again, and the SPU AA code was no exception.

The SPU AA library was written primarily by Matteo and Tobias at ATG, and on our end Cedric did the integration. Some other people helped out here and there with stuff, but when it comes down to it, it was these three guys doing the work. I can't say exactly how many hours Matteo and Tobias spent on their end (I know how much time Cedric spent in total) but there sure as heck was no "months and months" of writing/optimization; that's totally made up by misreading what was said.

The library has been cleaned up since the early version we got, and today, basic integration can be done in a day. In our case what took extra time was sorting out "early adopter" issues (changing requirements; lots of going back and forth, with a day wasted on almost every issue due to inter-continent collaboration; etc) and to totally rearrange our rendering loop, because we wanted to make sure we placed the SPU AA cycles in exactly the right part of our frame, so as not to introduce additional latencies (which, for us, led to a LOT of rearrangements for our already convoluted rendering frame).

So, no "huge corporations with deep pockets" are needed to pull this off. You just need a few really solid programmers spending a few weeks of concentrated time!
 
Sorry, this is all wrong. It seems from a later post that you got this from misinterpreting Cedric talking about general development time and thinking that quotes like "very long time", "a few months", "few weeks later" referred to work spent on SPU AA. That's not the case. As a game developer you should know that stuff gets worked on, then off, then on-again, then off-again, and the SPU AA code was no exception.

The SPU AA library was written primarily by Matteo and Tobias at ATG, and on our end Cedric did the integration. Some other people helped out here and there with stuff, but when it comes down to it, it was these three guys doing the work. I can't say exactly how many hours Matteo and Tobias spent on their end (I know how much time Cedric spent in total) but there sure as heck was no "months and months" of writing/optimization; that's totally made up by misreading what was said.

The library has been cleaned up since the early version we got, and today, basic integration can be done in a day. In our case what took extra time was sorting out "early adopter" issues (changing requirements; lots of going back and forth, with a day wasted on almost every issue due to inter-continent collaboration; etc) and to totally rearrange our rendering loop, because we wanted to make sure we placed the SPU AA cycles in exactly the right part of our frame, so as not to introduce additional latencies (which, for us, led to a LOT of rearrangements for our already convoluted rendering frame).

So, no "huge corporations with deep pockets" are needed to pull this off. You just need a few really solid programmers spending a few weeks of concentrated time!

You know, everything on the PS3 is possible only with a huge amount of smart developers, time & money :rolleyes:
 
You know, everything on the PS3 is possible only with a huge amount of smart developers, time & money :rolleyes:

Well, that's partly true. It took a considerable amount of smart developers' time (= money) to get this to work on God of War 3. Only after that could the technology be easily reused in LittleBigPlanet at extremely low cost and effort, which is great.
 
So, no "huge corporations with deep pockets" are needed to pull this off. You just need a few really solid programmers spending a few weeks of concentrated time!

Isn't that costly when you have small dev teams? "MLAA" is just one feature of hundreds to revise or implement in an engine, besides everything else. Actually, reading the game manuals for various games reveals most of them have very few graphics programmers, while the artist group is bigger, and so are most other groups.
 
Well, that's partly true. It took a considerable amount of smart developers' time (= money) to get this to work on God of War 3. Only after that could the technology be easily reused in LittleBigPlanet at extremely low cost and effort, which is great.

But Arwin, doesn't this conclusion hold for every new tech?
I mean, the guys with the first 3D engine needed some dev time to come up with the idea and the first implementation, right? Now it is standard!

The same goes for normal mapping, AA, HDR, "put fancy graphics tech in here", ... the guys who did it first suffer the most, until the tech matures and gets streamlined - all the other guys profit from their pioneer work!


If we believe the statements of Media Molecule, MLAA is now basically a drop-in feature; just remember the DF interview, guys:

Alex Evans said:
"We really got to ride on their shoulders there - when we got the MLAA code from Sony, it was already in a pretty usable and fast state. We dropped it in during an afternoon, I believe, and it did save us a little GPU time. As with any change, there are knock-ons, a bit of SPU rescheduling etc, but it's definitely a net win."
 
Isn't that costly when you have small dev teams? "MLAA" is just one feature of hundreds to revise or implement in an engine, besides everything else. Actually, reading the game manuals for various games reveals most of them have very few graphics programmers, while the artist group is bigger, and so are most other groups.
Don't lose sight of the argument. Christer's valuable, detailed explanation is in direct response to the question of MLAA's GPU implementation versus its Cell version.

Assen's point was it took a lot of people a lot of time and money to get MLAA up and running as efficiently and effectively in GOW (and subsequent titles) as they did; a considerable investment that isn't there for exploring MLAA implementations on GPUs. However, now we know GWAA is principally the work of two people in its Cell implementation, one of them being our very own T.B.; clearly not outside the realm of any other small-scale developer, technical enthusiast, or giant GPU IHV looking for a feature edge. Thus whatever results we get from GPU efforts, saying they aren't getting the same sort of backing as GWAA got is not a particularly valid explanation if the GPU results aren't as effective. In essence Christer has put paid to the "Cell gets more investment so of course it'll get better results" argument.
 
L2

So MLAA is not possible on 360 because of its weak CPU? Or because the 360's CPU is already pushed and used for other things in games? I was under the impression that Xenon is still under-utilized and most developers are relying on the easy and relatively strong Xenos - IIRC Capcom's MT Framework based games are an exception, but they still manage to add dynamic 2-4xAA in their games, resulting in very good IQ.

It's sad not to have a first party like ND or Guerilla on the 360 to finally see how good a game can look with an engine specifically built around the 360... Gears 1/2 and the upcoming third game look awesome but it's still UE3.5+, and Bungie's effort with Reach is kind of disappointing considering the game is again not running at 720p and is using temporal AA, especially on the system that is well known for its easy AA implementation.

My friend, many years ago I said that the L2 cache would be a problem for this CPU, and I still think it is.

This is why (rumor) they say that many do not use the CPU very well despite its simple homogeneous design (which Carmack likes).

I think for MLAA you have to move a lot of data, and with the SPUs maybe they create small tiles and feed them into the Cell (1.5MB of local store, net, to fill) to do whatever they want. This is probably difficult with Xenon: reserving enough L2 without "hitting" RAM.

Even if Xenon had 10MB of L2, the Xenon cores are still not the "smartest kid on the block" types.
 
My friend, many years ago I said that the L2 cache would be a problem for this CPU, and I still think it is.

This is why (rumor) they say that many do not use the CPU very well despite its simple homogeneous design (which Carmack likes).

I think for MLAA you have to move a lot of data, and with the SPUs maybe they create small tiles and feed them into the Cell (1.5MB of local store, net, to fill) to do whatever they want. This is probably difficult with Xenon: reserving enough L2 without "hitting" RAM.

Even if Xenon had 10MB of L2, the Xenon cores are still not the "smartest kid on the block" types.
???
Sorry but that doesn't make sense to me. Xenon's L2 has its shortcomings, but I don't think it's the culprit here.
Put one PPU aside on both Xenon and Cell and you're left with, respectively, 2 PPU/VMX units vs 6 SPUs: 3 times the raw power, and worse once you consider efficiency and bandwidth.
Neither the L2 nor the local stores save any RAM/VRAM traffic; the CPUs would read from RAM/VRAM and then overwrite it with the result. Having an L2 big enough to fit the RT/frame buffer would not change much, as I think that in that case the most limiting factor for the 360 CPU is raw processing power.
There is no culprit outside of the obvious: the Cell has more than double the throughput of Xenon, higher efficiency on top of it, and it is fed with more bandwidth.
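
As a back-of-envelope check on that "3 times the raw power" figure, here's the arithmetic as a tiny sketch, assuming both vector units do a 4-wide single-precision FMA per cycle at 3.2GHz (the usual quoted peaks; dual-issue, dot-product tricks, load/store and so on are ignored):

```c
/* Back-of-envelope peak single-precision throughput, assuming 3.2 GHz and a
 * 4-wide fused multiply-add (8 flops) per cycle per vector unit. These are
 * the commonly quoted peaks and ignore dual-issue quirks, dot products, etc. */
#include <stdio.h>

int main(void)
{
    const double clock_ghz     = 3.2;                       /* both chips            */
    const double flops_per_clk = 4.0 * 2.0;                 /* 4 lanes, mul+add      */
    const double per_unit      = clock_ghz * flops_per_clk; /* GFLOPS per vector unit */

    const int xenon_units = 2;   /* 2 PPU/VMX left after setting one core aside */
    const int cell_units  = 6;   /* SPUs typically available to a PS3 title     */

    printf("per vector unit: %5.1f GFLOPS\n", per_unit);
    printf("Xenon  (x%d)    : %5.1f GFLOPS\n", xenon_units, xenon_units * per_unit);
    printf("Cell   (x%d)    : %5.1f GFLOPS (%.1fx)\n", cell_units,
           cell_units * per_unit, (double)cell_units / xenon_units);
    return 0;
}
```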
 
Misunderstanding

???
Sorry but that doesn't make sense to me. Xenon's L2 has its shortcomings, but I don't think it's the culprit here.
Put one PPU aside on both Xenon and Cell and you're left with, respectively, 2 PPU/VMX units vs 6 SPUs: 3 times the raw power, and worse once you consider efficiency and bandwidth.
Neither the L2 nor the local stores save any RAM/VRAM traffic; the CPUs would read from RAM/VRAM and then overwrite it with the result. Having an L2 big enough to fit the RT/frame buffer would not change much, as I think that in that case the most limiting factor for the 360 CPU is raw processing power.
There is no culprit outside of the obvious: the Cell has more than double the throughput of Xenon, higher efficiency on top of it, and it is fed with more bandwidth.

I don't mean it will have a RAM/VRAM space problem, but a RAM read/write latency problem, and maybe even a domino side-effect of cache misses for other functions because of the excess L2 "locked" for this issue.

The more data you can have in cache, the fewer read/write problems you have.

For processing power, I think 2 PPUs with VMX are probably enough for MLAA if there is no data access problem.
 
I don't mean it will have a RAM/VRAM space problem, but a RAM read/write latency problem, and maybe even a domino side-effect of cache misses for other functions because of the excess L2 "locked" for this issue.

The more data you can have in cache, the fewer read/write problems you have.

For processing power, I think 2 PPUs with VMX are probably enough for MLAA if there is no data access problem.
No, I think you're wrong.
I've lost track of the information in this thread so I'm no longer sure which data are used for MLAA (it doesn't change much anyway). So say it's the color+depth buffer (1280x720x8 bytes => ~7.4MB).
Xenon has 10.8GB/s of bandwidth to Xenos, and thus to main RAM. That's up and down, so I'll consider 5.4GB/s up (reading the frame buffer / RT from main RAM).
7.4MB / 5.4GB/s is 1.37ms. That's the ideal figure (for streaming the frame buffer to the CPU).

It takes ~4ms to apply MLAA on the Cell. If it were bandwidth limited, it would only take ~1.8GB/s of bandwidth to stream the buffer to the CPU (7.4/0.004), which should not be that taxing on the 360 and a breeze on the PS3.

So clearly there would be no need to reserve that much of the L2, as streaming the framebuffer to the CPUs should not be a problem.

The problem is IMHO tied to computational resources, which the Cell has to spare. It takes 20ms of SPU time to apply MLAA; if the VMX units were as efficient as the SPUs at handling the task, that would mean locking two cores for 10ms (in which case bandwidth to main RAM is even more of a concern). That sounds like a lot to me, and it would most likely be more than that. At this point dealing with the 4xAA overhead may be tempting.
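
For reference, here is that arithmetic spelled out as a small sketch; the 5.4GB/s read share and the ~4ms Cell figure are just the assumptions quoted above:

```c
/* The streaming arithmetic from the post above: a 1280x720 color+depth buffer
 * at 8 bytes/pixel, an assumed 5.4 GB/s read share of Xenon's 10.8 GB/s FSB,
 * and the ~4 ms Cell MLAA wall-clock figure quoted earlier in the thread. */
#include <stdio.h>

int main(void)
{
    const double buffer_mb = 1280.0 * 720.0 * 8.0 / 1e6;  /* ~7.4 MB              */
    const double read_gbps = 5.4;                          /* assumed read share   */
    const double mlaa_ms   = 4.0;                          /* Cell wall-clock time */

    const double stream_ms   = buffer_mb / read_gbps;      /* GB/s == MB/ms */
    const double needed_gbps = buffer_mb / mlaa_ms;         /* MB/ms == GB/s */

    printf("buffer size      : %.1f MB\n", buffer_mb);
    printf("time to stream it: %.2f ms at %.1f GB/s\n", stream_ms, read_gbps);
    printf("bandwidth needed : %.2f GB/s to cover it in %.1f ms\n", needed_gbps, mlaa_ms);
    return 0;
}
```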
 
You may not be able to plug Cell timing numbers into the 360 (to calculate bandwidth consumption) because the architectures are different. The SPUs have DMA to double-buffer main memory access, and L1-level LocalStore performance, to achieve 20ms per SPU. They spread the load to 5 independent SPUs to achieve 4ms, with data dependencies in the calculations (the speed-up may not be linear).

Secondly, the bandwidth consumption should also be tied to the number of times memory is read and written (plus consideration for data-dependency "waits" and cache hits + misses). In short, the formula should depend on how exactly your algorithm works on the target machine.
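
For what it's worth, here's a minimal, platform-neutral sketch of that double-buffer pattern; dma_get_async(), dma_wait() and filter_tile() are stand-ins defined in the sketch, not the real MFC API an actual SPU would use:

```c
/* Double-buffering sketch: while tile N is being processed out of one local
 * buffer, the transfer for tile N+1 is already filling the other buffer,
 * so memory latency hides behind the compute. Everything here is a stand-in. */
#include <stdio.h>
#include <string.h>

#define TILE_BYTES (64 * 64 * 8)   /* one 64x64 color+depth tile, 32 KB     */
#define NUM_TILES  16              /* just a handful of tiles for the sketch */

static unsigned char main_ram[NUM_TILES][TILE_BYTES];  /* stands in for XDR         */
static unsigned char local_buf[2][TILE_BYTES];         /* stands in for local store */

/* Stand-ins for the real MFC calls: start an async transfer, wait on it. */
static void dma_get_async(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(int buf)             { (void)buf; /* would block on the DMA tag */ }
static void filter_tile(unsigned char *t) { (void)t;   /* MLAA-ish work on one tile  */ }

int main(void)
{
    int cur = 0;
    dma_get_async(local_buf[cur], main_ram[0], TILE_BYTES);    /* prime buffer 0 */

    for (int t = 0; t < NUM_TILES; ++t) {
        int next = cur ^ 1;
        if (t + 1 < NUM_TILES)                                  /* kick off tile t+1 ... */
            dma_get_async(local_buf[next], main_ram[t + 1], TILE_BYTES);
        dma_wait(cur);                                          /* ... while tile t is   */
        filter_tile(local_buf[cur]);                            /* processed here; the   */
        cur = next;                                             /* write-back DMA would  */
    }                                                           /* be double-buffered too */
    printf("processed %d tiles\n", NUM_TILES);
    return 0;
}
```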
 
You may not be able to plug Cell timing numbers into the 360 (to calculate bandwidth consumption) because the architectures are different. The SPUs have DMA to double-buffer main memory access, and L1-level LocalStore performance, to achieve 20ms per SPU. They spread the load to 5 independent SPUs to achieve 4ms, with data dependencies in the calculations (the speed-up may not be linear).
Sorry, but I don't see how this changes the fact that streaming, and thus bandwidth, is not the issue. So the issue raised about L2 cache size, etc., is not relevant to the problem.

I don't think there are data dependencies at all. I don't know what they do, but it looks like a completely data-parallel task. Each SPU/CPU core works on its own tile. The tile has to fit in Xenon's L1 data cache or the SPU local store.

Overall I would be surprised if a VMX/Px core were a match for an SPU, but I don't see what the L2 cache would have to do with it. The LS is a bit like a huge L1 cache in its characteristics, so the tile size may be more optimal on SPUs.
As an example: one 64x64-pixel tile (color+Z => 32KB) is enough to fill Xenon's L1 data cache, whereas you can put two in the LS (double buffering, as you said); so on Xenon, when you're done with a tile, you have to wait for data to be moved from L2 to L1. So you may end up using tinier tiles, which for whatever reason may end up less optimal (or execution stalls until data are ready in L1).
Same for the code size: you would want it to fit in the L1 instruction cache (my guess is that it should fit, but anyway) to make sure to avoid cache misses. You may end up optimizing the code for size, not for performance.
SPUs may be faster at executing some instructions, etc., and so on.

But no matter the architectural differences and SPU advantages in this regard, the point is that ihamoitc2005's point is most likely wrong: the main problem is the available processing power, which is already 2.5 times lower on Xenon without taking into account the other factors favouring the SPU architecture.

Secondly, the bandwidth consumption should also be tied to the number of times memory is read and written (plus consideration for data-dependency "waits" and cache hits + misses). In short, the formula should depend on how exactly your algorithm works on the target machine.
Indeed, see above, but one must not mix up bandwidth between the execution units and L1/LS, bandwidth between L1 and L2, and external bandwidth.
The problem you're raising concerns the first case. I'm not sure there is any relevant difference in this regard between Xenon's L1 and the SPU LS, nor that a difference here would explain the difference in performance between Xenon and the SPUs (versus other neat advantages: twice as big as L1 while being just as fast, and no cache misses).
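
To put that tile-size reasoning in numbers, a tiny sketch, assuming the commonly quoted 32KB L1 D-cache per Xenon core and 256KB SPU local store, and the 64x64 example tile above:

```c
/* How many 64x64 color+depth tiles (8 bytes/pixel) fit in a 32 KB Xenon L1
 * D-cache versus a 256 KB SPU local store. Sizes are the usual quoted ones. */
#include <stdio.h>

int main(void)
{
    const int tile_bytes = 64 * 64 * 8;   /* one 64x64 color+Z tile: 32 KB */
    const int l1_bytes   = 32 * 1024;     /* Xenon L1 D-cache per core     */
    const int ls_bytes   = 256 * 1024;    /* SPU local store               */

    printf("tile size   : %d KB\n", tile_bytes / 1024);
    printf("tiles in L1 : %d (no room left to double-buffer)\n", l1_bytes / tile_bytes);
    printf("tiles in LS : %d (minus whatever the code itself needs)\n", ls_bytes / tile_bytes);
    return 0;
}
```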
 
Not bandwidth

but RAM read/write waiting. This is why the Xenon CPU has a cache "lock" option for streaming data from the CPU to the GPU. But this also has a problem, because the "lock" can make cache-miss problems worse by leaving less L2 available for the other threads. Also, I don't know how flexible the "lock" is, so maybe you can or cannot use this feature to create a workspace inside the L2 for MLAA. I think I remember you can only use it one way, but I am not sure what is possible for the clever programmers. But the important thing is you never want the CPU to read/write directly from RAM (very, very, very slow), and that is what a cache miss can do.

The less direct read/write to RAM you have, the better. You always want a buffer.

The reason why Cell is so fast is not only the many SPUs but also the DMA/ring-bus design to manage RAM read/write waits.

In a way that is the best feature of Cell. Without it, the SPUs' raw power would be useless.

I agree that Cell is many times (I'd even say more than 3x) faster than Xenon for this kind of thing, but I still do not understand why you feel (especially if you believe the small cache is no problem) that 2 PPEs with VMX cannot do MLAA at 720p/30fps.
 
but RAM read/write waiting. This is why the Xenon CPU has a cache "lock" option for streaming data from the CPU to the GPU. But this also has a problem, because the "lock" can make cache-miss problems worse by leaving less L2 available for the other threads. Also, I don't know how flexible the "lock" is, so maybe you can or cannot use this feature to create a workspace inside the L2 for MLAA. I think I remember you can only use it one way, but I am not sure what is possible for the clever programmers. But the important thing is you never want the CPU to read/write directly from RAM (very, very, very slow), and that is what a cache miss can do.
The frame buffer/RT has to be in main RAM; no hardware aside from the RBEs has access to the EDRAM. There is no reading from the GPU. Cache locking is a PowerPC feature; it's not specific to Xenon. In our case I don't even know if it could be used, as I'm not sure more than one core at a time can do it. Your comment about creating a "workspace" in L2 got me thinking that you don't understand streaming and what a data-parallel problem is.
On top of that, I don't see how having tiles instead of whatever other data in L2 makes your point more valid.
And what is this idea about not writing to or reading from RAM? At some point you have to read data and export results. Did you understand what I wrote about tile size and L1?

The less direct read/write to RAM you have, the better. You always want a buffer.
The reason why Cell is so fast is not only the many SPUs but also the DMA/ring-bus design to manage RAM read/write waits.
In a way that is the best feature of Cell. Without it, the SPUs' raw power would be useless.
No, you want to maximize bandwidth where it matters: in our case, between L1 or the local store and the execution units.
And you're wrong about why the Cell, or I should say the SPUs, are fast: whereas a standard processor can work in a stream-processing manner, the Cell / Broadband Engine enforces this model. Code has to fit in local store and data is streamed into local store; the fact that this is backed up by a lot of bandwidth goes hand in hand with that design choice. In this design there are no cache misses, because the code has to fit in the local store; by itself the SPU is blind and has no clue of the RAM's existence. Data-parallel workloads are the best fit for stream processors, and post-processing like MLAA qualifies.

I agree that Cell is many times (I'd even say more than 3x) faster than Xenon for this kind of thing, but I still do not understand why you feel (especially if you believe the small cache is no problem) that 2 PPEs with VMX cannot do MLAA at 720p/30fps.
Cache size is not the problem; the problem is how you hide latency.
Xenon lacks the horsepower to match the Cell in this case.
Even if the resolution were higher but the effect less costly, so that Xenon were no longer limited by computation, it would not fare any better, as it would then become bandwidth constrained.
 
But the important thing is you never want the CPU to read/write directly from RAM (very, very, very slow), and that is what a cache miss can do.
Working in tiles on a framebuffer with no random-access patterns, the memory access patterns would be deterministic and there'd be no reason for a cache miss; the cache would be intelligently fetching the data ahead of use. If it doesn't, it's a broken cache. ;)
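
A minimal sketch of that "fetch ahead of use" idea for a linear walk over the resolved buffer; __builtin_prefetch is the GCC/Clang builtin and only stands in for whatever dcbt-style intrinsic a 360 title would actually use, and the prefetch distance is an illustrative tuning knob, not a measured value:

```c
/* Software prefetch over a deterministic, linear framebuffer walk: ask for
 * the data a fixed distance ahead so it is already in cache when the filter
 * reaches it. The filter itself is just a stand-in checksum here. */
#include <stdint.h>
#include <stdio.h>

#define WIDTH          1280
#define HEIGHT         720
#define PREFETCH_AHEAD 16          /* pixels ahead of use; a tuning knob */

static uint32_t color[WIDTH * HEIGHT];   /* stands in for the resolved render target */

int main(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < WIDTH * HEIGHT; ++i) {
        if (i + PREFETCH_AHEAD < WIDTH * HEIGHT)
            __builtin_prefetch(&color[i + PREFETCH_AHEAD], 0 /* read */, 1);
        sum += color[i];                 /* stand-in for the actual edge/blend work */
    }
    printf("checksum: %llu\n", (unsigned long long)sum);
    return 0;
}
```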
 