Alternative AA methods and their comparison with traditional MSAA

I'd like to say that this is my favorite thread on b3d, even though I don't grasp more than 10% of it when it goes down into caches and whatnot.

And I'll join Patsu on the plank: when and where can we expect to hear/see a talk about the GoW MLAA? :)
 
I never said bandwidth, my friend

You don't get streaming... nor the fact that bandwidth for moving 7MB is not the limiting factor, especially over a time span greater than 10ms (7 MB in 10 ms works out to only about 700 MB/s).

You need to consider what the requirements of streaming are.

Think about HD video playback (a perfect streaming example) on your computer. If your CPU has a small cache, even a fast CPU can have poor performance. If your CPU has a large cache, playback will be smooth.

This has nothing to do with "bandwidth" but with what you could call efficiency.
 
Actually, how useful is a large cache for decoding video, say... H.264? Is it done on the GPU these days or the CPU?
 
You need to consider what the requirements of streaming are.

Think about HD video playback (a perfect streaming example) on your computer. If your CPU has a small cache, even a fast CPU can have poor performance. If your CPU has a large cache, playback will be smooth.

Actually, a larger cache did not always equal better performance in games or decoding, though it could. Remember the P4/A64 era?

I even compared performance between two CPUs of the same architecture clocked at the same speed, one with twice the L2 cache, and got at best a 5-10% performance improvement across a wide range of applications.
 
P4

Actually, a larger cache did not always equal better performance in games or decoding, though it could. Remember the P4/A64 era?

I even compared performance between two CPUs of the same architecture clocked at the same speed, one with twice the L2 cache, and got at best a 5-10% performance improvement across a wide range of applications.

Actually, the Celeron vs. P4 is a perfect example for this kind of application (streaming video processing) while doing other tasks.
 
Actually, the Celeron vs. P4 is a perfect example for this kind of application (streaming video processing) while doing other tasks.

Maybe it helped out a bit, but they still underperformed greatly clock for clock vs. the A64 in everything from games to decoding to multitasking, despite the A64 having much less L2 cache.

Actually, how useful is a large cache for decoding video, say... H.264? Is it done on the GPU these days or the CPU?

Dunno, but with a well-optimised software decoder a low-end CPU like a Core2Duo at around 1.6-1.8GHz would have no trouble with 1080p 25fps 40Mbps H.264 decoding. The problem is that a lot of software decoders are really badly optimised while delivering worse IQ.

But yes, the GPU can help out quite a lot. For example, the latest Rambo movie on Blu-ray, 1080p 24Hz, avg. 20Mbps (read from disc, not ripped), plus image-enhancing features uses about 25-30% (at a 30Hz playback rate) of my E8400 dual-core's time. With GPU acceleration it is 3-5%. The image quality is top notch, on par with the best I was shown at the hi-fi center where I live.
 
Maybe it helped out a bit, but they still underperformed greatly clock for clock vs. the A64 in everything from games to decoding to multitasking, despite the A64 having much less L2 cache.
Indeed, but things have to be considered ceteris paribus (on top of that, decoding can be branchy and complex, and AMD gave Intel a run for their money back then).
See this quote: streaming has nothing to do with DMA vs. a coherent memory space.
But Cell does indeed have a lot more bandwidth and more execution resources. The size of the L2 is a bit irrelevant in that context.
Still, by design, to match the Cell on this kind of workload you would need as many cores, with a good cache hierarchy and characteristics, backed by enough bandwidth, etc. The CPU would end up much bigger and warmer than the Cell. Think of a Xenon with 7 cores: a lot of bandwidth is something you might need anyway, even with better CPU cores than the Px.

Back to the L2 cache: it is really not a problem. Modern CPUs are going with smaller L1 caches with better characteristics (Bulldozer won't come with 64KB I$ and D$ if rumours are to be trusted), and manufacturers have chosen to go with smaller L2 caches too (256KB is becoming a standard).
It's more complicated than size alone: there is the hierarchy, the latency and bandwidth offered by those caches, associativity, etc. For example, Intel's caches perform better than AMD's even though they are usually smaller.
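To make the latency point concrete, here is a rough, self-contained pointer-chasing micro-benchmark (my own illustration, not from any of the posts above; the working-set sizes are arbitrary). Each load depends on the previous one, so the time per step roughly tracks the load-to-use latency of whichever cache level the working set fits in, which is why size alone tells you little.

```c
/* Rough pointer-chasing micro-benchmark: each load depends on the previous
 * one, so the time per step approximates the load-to-use latency of whatever
 * cache level the working set fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STEPS (1 << 25)

static volatile size_t sink;  /* keeps the compiler from deleting the loop */

static double chase_ns(size_t elems)
{
    size_t *ring = malloc(elems * sizeof *ring);
    size_t i, next = 0;
    struct timespec t0, t1;

    /* Sattolo's algorithm: build one random cycle so the hardware
     * prefetcher cannot guess the access pattern. */
    for (i = 0; i < elems; i++)
        ring[i] = i;
    for (i = elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < STEPS; i++)
        next = ring[next];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = next;

    free(ring);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / STEPS;
}

int main(void)
{
    size_t kb;
    for (kb = 16; kb <= 32768; kb *= 4)
        printf("%6zu KB working set: %5.2f ns per dependent load\n",
               kb, chase_ns(kb * 1024 / sizeof(size_t)));
    return 0;
}
```

Run it with growing working sets and you can watch the per-load time jump as you fall out of L1, then L2/L3, then into RAM.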


EDIT
I've been re-reading my posts from Friday and realized that they were pretty aggressive; I must have been in a bad mood.
Anyway, no matter the reasons, Patsu and Ihamoict, forgive me :oops:
 
Indeed, but things have to be considered ceteris paribus (on top of that, decoding can be branchy and complex, and AMD gave Intel a run for their money back then).
See this quote: streaming has nothing to do with DMA vs. a coherent memory space.

That quote is meaningless. It only says a cache-based design can emulate a LocalStore design. It says nothing about relative performance in a specific application context (i.e., it's overgeneralized). The cache-less design is not popular because it's hard to program for, but the SPUs are specialized media processors and there is a PPU (with a cache!) to support general-purpose computing for regular workloads. That PPU is too weak.

DMA is relevant in streaming applications. As long as you have data locality, the Cell architecture will fly because of the simple memory hierarchy and fast LocalStore. My original question was: how much does cache size matter for video decoding?
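For reference, the streaming pattern being described looks roughly like this on an SPU. This is a minimal double-buffered DMA sketch against the spu_mfcio.h intrinsics; process_chunk() and the chunk size are made up for illustration, and real code would also respect MFC alignment rules and DMA results back with mfc_put().

```c
/* Double-buffered SPU streaming sketch: fetch the next chunk with DMA while
 * the current one is processed out of local store. */
#include <spu_mfcio.h>

#define CHUNK 16384  /* 16 KB: the maximum size of a single MFC transfer */

static unsigned char buf[2][CHUNK] __attribute__((aligned(128)));

/* Placeholder work function so the sketch is self-contained. */
static void process_chunk(unsigned char *data, unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size; i++)
        data[i] ^= 0xFF;  /* pretend this is real work (decode, filter, ...) */
}

void stream_process(unsigned long long ea, unsigned long long bytes)
{
    unsigned int tag[2] = { 0, 1 };
    int cur = 0;

    if (bytes == 0)
        return;

    /* Prime the pipeline: start fetching the first chunk into buffer 0. */
    mfc_get(buf[cur], ea, bytes < CHUNK ? (unsigned int)bytes : CHUNK,
            tag[cur], 0, 0);

    while (bytes > 0) {
        unsigned int size = bytes < CHUNK ? (unsigned int)bytes : CHUNK;
        unsigned long long left = bytes - size;
        int next = cur ^ 1;

        /* Kick off the DMA for the following chunk while this one is used. */
        if (left > 0)
            mfc_get(buf[next], ea + size,
                    left < CHUNK ? (unsigned int)left : CHUNK, tag[next], 0, 0);

        /* Block only on the current buffer's tag, then work from local store. */
        mfc_write_tag_mask(1 << tag[cur]);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], size);
        /* A real kernel would mfc_put() the results back out here. */

        ea    += size;
        bytes  = left;
        cur    = next;
    }
}
```

As long as the processing time per chunk exceeds the DMA latency, the SPU never waits on memory, which is the data-locality point above.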


Dunno, but with a well-optimised software decoder a low-end CPU like a Core2Duo at around 1.6-1.8GHz would have no trouble with 1080p 25fps 40Mbps H.264 decoding. The problem is that a lot of software decoders are really badly optimised while delivering worse IQ.

But yes, the GPU can help out quite a lot. For example, the latest Rambo movie on Blu-ray, 1080p 24Hz, avg. 20Mbps (read from disc, not ripped), plus image-enhancing features uses about 25-30% (at a 30Hz playback rate) of my E8400 dual-core's time. With GPU acceleration it is 3-5%. The image quality is top notch, on par with the best I was shown at the hi-fi center where I live.

Yup, CPU decoding may require turning off some intensive/advanced features. It usually can't do composition either, and CPU utilization may be too high to be comfortable.
 
Yup, CPU decoding may require turning off some intensive/advanced features. It usually can't do composition either, and CPU utilization may be too high to be comfortable.

Actually it has all the advanced features on, or else it would fall well under 20% CPU. Don't give too little credit to what CPUs are capable of with well-coded decoders that make good use of the CPU and all its features and extensions for a substantial performance boost, versus crap decoders like the QuickTime ones etc.

Reminds me how, several years ago, I was surprised at how easily a single-core A64 3200+ handled a 1080p 25Hz constant-bitrate 40Mbps MPEG-2 test video clip (or rather a short CGI movie). Somewhere around 50-70% CPU utilisation with no GPU acceleration. :)
 
... for the Core2Duo? I'll need to ask the people around my office area to see what the scoop is. I believe they licensed an H.264 software decoder for Intel CPUs somewhere.

EDIT:
Reminds me how, several years ago, I was surprised at how easily a single-core A64 3200+ handled a 1080p 25Hz constant-bitrate 40Mbps MPEG-2 test video clip (or rather a short CGI movie). Somewhere around 50-70% CPU utilisation with no GPU acceleration. :)

Ha ha... yeah. MPEG2 is a piece of cake these days compared to MPEG4.

Open source projects should have pretty good H.264 decoding these days.
 
Well of course, or any processor with enough capability. They are all x86 and most share the same instruction sets.

Ha ha... yeah. MPEG2 is a piece of cake these days compared to MPEG4.

Open source projects should have pretty good H.264 decoding these days.

Yeah good old days! :LOL:

But yeah, the codec pack with FFDSHOW by "Shark" is great, with a friendly GUI. Though I mostly stick to TotalMedia Theatre 3. It even has support for GPGPU processing for "expensive" upscaling as well as sharpening, dynamic light adjustment etc. via the SimHD plugin, for MPEG-2... so far. Thing is, I read a review some time ago regarding the plugin, which can be run in either CPU or GPU mode with the same quality. In that review, using an earlier version of the SimHD plugin, their CPU (Phenom II X3 720) did ~2fps while the GPU (9800GTX+) did ~25fps when upscaling to 1920x1080. Funny, because now with the revised code it takes only about ~20-30% of my CPU time to run the plugin without any assist from the GPU and upscale an 8Mbps DVD MPEG-2 movie to 1920x1080, solid and smooth and at the same quality of course!

Oh my, what optimisations and utilising CPU extensions can do. It's like how crippled games are with the DX9 API vs. DX10, DX10.1 and especially DX11, not utilising the HW functions at all; they just sit there idling etc.

But this is getting very OT so I'll chip it off here. :)
 
Maybe it helped out a bit, but they still underperformed greatly clock for clock vs. the A64 in everything from games to decoding to multitasking, despite the A64 having much less L2 cache.

Which ones? Each design had different models.

Generally, A64 had much larger L1 than P4.

Small L1 means L2 has to be much larger.

Xenon's L2 has a cache "lock" capability (requested by the GPU) to provide a dedicated cache area for streaming data directly to the GPU without using RAM. This gives smooth streaming and avoids RAM reads/writes.
 
Getting off topic, but interesting nonetheless.

There... some informal H.264 decoder benchmarks on Core2Duo and A64 (in the second one):

Dec 2006: http://www.anandtech.com/show/2132/4 (Need GPU to play High Profile H.264)

The problem is that the bottleneck is not CPU capability but rather the software itself. The H.264 codec in that version of PDVD is horribly inefficient (as are some later versions) while producing worse IQ than other H.264 codecs that run several times faster, just because those are better optimised to take advantage of the HW. I had PDVD before but ditched that program in favor of better performance and image quality.

This is one of those cases where the software is the limitation. Multiplatform games spring to mind? :)

The CPU they used (an X6800 as the main CPU) also doesn't differ that much from mine in performance, except for some refinements in my CPU revision, the SSE4 vs. SSE3 instruction sets and perhaps something more I just can't remember.

But as you said, OT, so perhaps for another thread, though I remember there was one already some time ago.
 
This is one of those cases where the software is the limitation. Multiplatform games spring to mind? :)

It'd be case by case. E.g., a cross-platform physics library on Cell runs well, and can scale to all 6 (free) SPUs and the PPU.

Multiplatform games involve the entire console, broader skillsets and workflows. So it'd be affected by other parts, budget, people and the techniques used. You'll be able to find titles that run well on either or both platforms for assorted reasons.

In the context of this thread, for example, MLAA was done first on Cell because it suits the system well. Saboteur (a cross-platform title) also uses a related technique. At this level, it depends on how the entire system works together (CPU + GPU + memory + ...).
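For a rough idea of what such a post-process technique starts from, here is a toy sketch of the luma edge-detection pass that MLAA-style filters begin with. This is not the actual GoW implementation (which runs on SPUs and goes on to do pattern classification and blending-weight computation); the threshold and pixel format here are arbitrary choices for illustration.

```c
/* Toy first stage of an MLAA-style filter: mark pixels whose luma differs
 * from their right/bottom neighbour by more than a threshold. */
#include <stdint.h>
#include <stdlib.h>

#define EDGE_THRESHOLD 16  /* arbitrary 8-bit luma delta */

/* Approximate BT.601 luma from an 8-bit RGBA pixel (integer weights). */
static uint8_t luma(const uint8_t *px)
{
    return (uint8_t)((77 * px[0] + 150 * px[1] + 29 * px[2]) >> 8);
}

/* For each pixel, set bit 0 if there is an edge to the right and bit 1 if
 * there is an edge below. 'rgba' is a tightly packed w*h RGBA8 image. */
void detect_edges(const uint8_t *rgba, uint8_t *edges, int w, int h)
{
    int x, y;
    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            int idx = y * w + x;
            uint8_t l = luma(&rgba[idx * 4]);
            uint8_t mask = 0;

            if (x + 1 < w && abs((int)l - luma(&rgba[(idx + 1) * 4])) > EDGE_THRESHOLD)
                mask |= 1;  /* edge between this pixel and its right neighbour */
            if (y + 1 < h && abs((int)l - luma(&rgba[(idx + w) * 4])) > EDGE_THRESHOLD)
                mask |= 2;  /* edge between this pixel and the one below */

            edges[idx] = mask;
        }
    }
}
```

The interesting (and expensive) part of real MLAA is what happens after this step, which is exactly the kind of work that maps nicely onto SPUs streaming tiles of the framebuffer through local store.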
 
Using ffmpeg-mt, which is the most efficient CPU-based decoder out there, a C2D at 1.6GHz decodes a very-high-bitrate 1080p MKV rip at around 85-90% CPU time on both cores. This task is also easily handled by the CPU in a WDTV Live, which is a lot less powerful than a C2D.
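For anyone who wants to try this themselves, here is a minimal frame-threaded decode loop. It uses the current libavcodec API on a recent FFmpeg (ffmpeg-mt's frame threading was merged into mainline long ago), the input path comes from the command line, and most error handling is omitted; it just counts decoded frames so you can watch CPU usage while it runs.

```c
/* Minimal sketch: decode a file's video stream with frame-threaded decoding
 * enabled, using the current libavformat/libavcodec API. */
#include <stdio.h>
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

int main(int argc, char **argv)
{
    AVFormatContext *fmt = NULL;
    const AVCodec *dec = NULL;
    AVCodecContext *ctx = NULL;
    AVPacket *pkt = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    int stream, frames = 0;

    if (argc < 2 || avformat_open_input(&fmt, argv[1], NULL, NULL) < 0)
        return 1;
    avformat_find_stream_info(fmt, NULL);

    stream = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, &dec, 0);
    if (stream < 0)
        return 1;

    ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(ctx, fmt->streams[stream]->codecpar);
    ctx->thread_count = 0;               /* 0 = let libavcodec pick the count */
    ctx->thread_type  = FF_THREAD_FRAME; /* frame-level threading, ffmpeg-mt style */
    if (avcodec_open2(ctx, dec, NULL) < 0)
        return 1;

    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == stream) {
            avcodec_send_packet(ctx, pkt);
            while (avcodec_receive_frame(ctx, frame) == 0)
                frames++;                /* a real player would convert/display here */
        }
        av_packet_unref(pkt);
    }
    /* Flush the decoder so delayed (threaded) frames are counted too. */
    avcodec_send_packet(ctx, NULL);
    while (avcodec_receive_frame(ctx, frame) == 0)
        frames++;

    printf("decoded %d frames\n", frames);

    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    av_frame_free(&frame);
    av_packet_free(&pkt);
    return 0;
}
```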
 
It'd be case by case. E.g., a cross-platform physics library on Cell runs well.

Multiplatform games involve the entire console, skills and workflow. So it'd be affected by other parts and the techniques used. You'll be able to find titles that run well on either or both platforms for assorted reasons.

Yes and remember how it was in the beginning?

Same with F@H, which has improved performance over the years by several times on the same CPUs just because they started to utilise them and their extensions better.

Case in point: even the latest PDVD gives worse IQ and worse performance vs. TMT3, which is what made me switch, though later PDVDs are more efficient and have better IQ than older versions.
 
Same with F@H, which has improved performance over the years by several times on the same CPUs just because they started to utilise them and their extensions better.

Sure, as long as developers continue to use Cell (well, they are sort of forced to use it anyway, given that the PS3 user base is growing -- see Valve), software techniques will evolve. Perhaps this is why you hear some PS3 developers say they expect to see more advancement on the platform. The hardware will run certain types of solutions better than others, though.

We will see software advancement on 360 too.
 
See this quote: streaming has nothing to do with DMA vs. a coherent memory space.
But Cell does indeed have a lot more bandwidth and more execution resources. The size of the L2 is a bit irrelevant in that context.

L2 size is important if you have many threads and are reusing data. This is why I feel it is difficult to make full use of all 3 Xenon cores.

Back to the L2 cache: it is really not a problem. Modern CPUs are going with smaller L1 caches with better characteristics (Bulldozer won't come with 64KB I$ and D$ if rumours are to be trusted), and manufacturers have chosen to go with smaller L2 caches too (256KB is becoming a standard).
It's more complicated than size alone: there is the hierarchy, the latency and bandwidth offered by those caches, associativity, etc. For example, Intel's caches perform better than AMD's even though they are usually smaller.

I think a small dedicated L2 per core is OK when you have a large L3. What is important is that you have enough buffer between execution and RAM, no?


I've been re-reading my posts from Friday and realized that they were pretty aggressive; I must have been in a bad mood.
Anyway, no matter the reasons, Patsu and Ihamoict, forgive me :oops:

My friend, it is good to have passion for your arguments. I think everyone sometimes can be frustrated and this is normal. I think on this forum I have seen only one person be very rude for no reason all the time and I will not say that person's name.
 
Using ffmpeg-mt, which is the most efficient CPU-based decoder out there, a C2D at 1.6GHz decodes a very-high-bitrate 1080p MKV rip at around 85-90% CPU time on both cores. This task is also easily handled by the CPU in a WDTV Live, which is a lot less powerful than a C2D.
I thought CoreAVC was the fastest CPU-based decoder? At least for h264/avc/mpeg4.

Anyway, sorry, OT.
 