Graphical effects rarely seen this gen that you expect/hope become standard next-gen

Neb · May 31, 2010

Billy Idol said:
Sorry guys, this I don't understand...look for instance at these pics by nightshade (maybe not the best example in the world, I admit)

http://forum.beyond3d.com/showpost.php?p=1416616&postcount=52

Is this effect only SSAA?
Hm, I also often have the feeling that the lightning looks somehow better...

They are offscreen and blurry and lighting "enhanced" due to camera objective, evironment light, monitor light and camera settings.

Regarding SSAA:
must the higher resolution before downsampling be an integer of the "goal" resolution...for instance a factor of 2?

No.

What about a factor of 1.5:
1920x1080 => 1280x720 would be a factor of 1.5

Sure.

Would you still get a decent effect? How would this compare to lets say standard 720p+2xMSAA?

Yes, better IQ becouse of detail density and all edges recieve AA + transparencies, and super sampling for post proccess buffers.

Arwin · May 31, 2010

Nebula said:
They could just run it on GPGPU (CUDA/ATI Stream) on DX9-DX11 on Nvidia 8xxx ATI 2xxx hardware. Though I wonder why Intel hasn't used this themselves as they put out the algorithm several years ago, no?

I was thinking DX11 because I don't know if in DX9 you have flexible enough access to the various stages of the graphics pipeline, but I guess for MLAA you probably only use simple line detection in one or two render-targets, so I guess that's probably no issue for DX9.

Shifty Geezer · May 31, 2010

Nebula said:
They could just run it on GPGPU (CUDA/ATI Stream) on DX9-DX11 on Nvidia 8xxx ATI 2xxx hardware. Though I wonder why Intel hasn't used this themselves as they put out the algorithm several years ago, no?

T.B. posted about this...

T.B. said:
...In any case, the algorithm used in GoW is not something easily ported to a GPU. I've been giving it some thought over the months and I don't see how you would do that efficiently, especially on Xenos. Of course, someone might surprise me.

So why haven't Intel or nVidia/AMD created an implementation on GPUs? At the moment I'd have to say because the GPUs cannot run it efficiently enough to be worth bothering with. I'd be surprised if the reason is absolutely no-one has tried to get MLAA working well on GPUs, as it's a superb IQ tool and the GPU manufacturer who can offer an implementation in the drivers would have a significant advantage over their rival.

That's not to say GPU's won't find an MLAA implemetation, but at the moment it's not a case of, "oh, just run it on CUDA on a DX9 GPU."

Neb · May 31, 2010

@ Arwin

I remember a pdf document by ATI that talked about edge detection and showed different forms of edge detection methods performed via DX8.1, PS1.4. Might be the initial steps for their edge detect mode.

EDIT: http://ati.amd.com/developer/shaderx/shaderx_imageprocessing.pdf

Neb · May 31, 2010

Shifty Geezer said:
T.B. posted about this...

But that is the GOW solution done for PS3 architecture. And nothing prevents to acheive even better MLAA solution on other hardware by taking advantage of the hardware features. I heared/read other devs say how X was impossible or very hard to do on X platform and yet it appeared in some form..

So why haven't Intel or nVidia/AMD created an implementation on GPUs? At the moment I'd have to say because the GPUs cannot run it efficiently enough to be worth bothering with. I'd be surprised if the reason is absolutely no-one has tried to get MLAA working well on GPUs, as it's a superb IQ tool and the GPU manufacturer who can offer an implementation in the drivers would have a significant advantage over their rival.

Maybe becouse it is a hit or miss regarding visual improvements. Might also need to be implemented via game engine and not forced from drivers as that would probably not work well. Like ATI's edge detect mode algorithm dont work in all games. But there are some MLAA alike solutions like for example Metro 2033 has for IIRC DX9-DX11.

Atleast ATI is doing some form of custom AA with edge detection and more with shaders.

http://developer.amd.com/assets/Architecture_Overview_RH.pdf (page 42 and beyond)

That's not to say GPU's won't find an MLAA implemetation, but at the moment it's not a case of, "oh, just run it on CUDA on a DX9 GPU."

Well if algorithm developer says algorithm is better suited for GPU than CPU (PC). And a CPU of the type Quad-core 3.0GHz needs ~5ms for Intel MLAA...

...can be implemented on the GPU even if the main algorithm runs on the CPU.

...MLAA, even in its current un-optimized implementation, is reasonably fast, processing about 20M pixels per second on a single 3GHz core.

...It is embarrassingly parallel and on a multicore machine can be used to achieve better load balancing by processing the final output image in idle threads.

...To avoid mixing integer and float values, for the CPU version of MLAA we did not assign different weights to different channels, deferring more advanced luminance processing until GPU implementation.

...We did not try to implement other MLAA steps using SSE® instructions (though it might be possible to do it using SSE4 operations), opting instead for preserving the universal nature of the algorithm. The upcoming Larrabee chip, as well as modern GPU cards, are capable of handling 8-bit data extremely efficiently, so our algorithm will benefit from porting to these architectures.

http://visual-computing.intel-research.net/publications/mlaa.pdf

Shifty Geezer · May 31, 2010

Nebula said:
But that is the GOW solution done for PS3 architecture. And nothing prevents to acheive even better MLAA solution on other hardware by taking advantage of the hardware features.

Possibly. I have repeatedly said we may yet see an MLAA algorithm running on GPUs. However, it is wrong to say at this point, "it can be done on a DX9 GPU," when the algorithm and hardware architecture and current lack of examples do not support that view. It's also wrong to say nothing prevents other architectures running a better solution. Architectures have inherent limits due to the compromises made to make them ideally suited for their designed workloads, programmability versus performace regards GPUs. If GPUs were as capable as CPUs for running 'CPU-code', we wouldn't have CPUs now, would we?

Again, that's not to say the task cannot be mapped onto GPUs; I have remarkable faith in the ingenuity of developers and researchers! But let's not jump the gun and say DX9 GPUs are capable of producing better IQ MLAA than GOW3 on faith alone.

Maybe becouse it is a hit or miss regarding visual improvements.

How is it hit or miss? It's at worse no more hit-and-miss that HDR breaking MSAA.

Atleast ATI is doing some form of custom AA with edge detection and more with shaders.
http://developer.amd.com/assets/Architecture_Overview_RH.pdf (page 42 and beyond)

That's edge-detection for an optimised multiple sampling. MLAA requires information for the whole edge, not just the immediate locality. For this reason GPUs are a very poor fit. Optimised edge-based multisampling will provide better quality of sub-pixel meshes than MLAA, and may well offer excellent IQ that's a suitable alternative. Again though, nothing yet points to GPUs being able to implement the MLAA concept and filling in the antialised detail based on edge information.

Neb · May 31, 2010

Shifty Geezer said:
Possibly. I have repeatedly said we may yet see an MLAA algorithm running on GPUs. However, it is wrong to say at this point, "it can be done on a DX9 GPU," when the algorithm and hardware architecture and current lack of examples do not support that view.

First of all I said...

They could just run it on GPGPU (CUDA/ATI Stream) on DX9-DX11 on Nvidia 8xxx ATI 2xxx hardware.

..which is a big difference from your sentence morphing.

"it can be done on a DX9 GPU,"

But let's not jump the gun and say DX9 GPUs are capable of producing better IQ MLAA than GOW3 on faith alone.

Never have I stated it would do better... nor that it would be DX9 GPUs.

How is it hit or miss? It's at worse no more hit-and-miss that HDR breaking MSAA.

Tell that to the sub pixel sized polygons and transparency.

Again though, nothing yet points to GPUs being able to implement the MLAA concept and filling in the antialised detail based on edge information.

The Intel MLAA concept seems well suited for GPUs or are the Intel MLAA algorithm developer(s) lying in their document? (see comments from previous post)

patsu · May 31, 2010

Nebula said:
Well if algorithm developer says algorithm is better suited for GPU than CPU (PC). And a CPU of the type Quad-core 3.0GHz needs ~5ms for Intel MLAA...

http://visual-computing.intel-research.net/publications/mlaa.pdf

I think on PC, they prefer a general solution since the hardware has an abundant and upgradable amount of spare capacity. They can always scale MSAA higher and higher to have similar results. So the benefits may be lesser. However, Intel's MLAA should be doable in modern GPUs.

The real question is not whether it can be done on certain GPUs. It's whether it can be done within the allocated time, and reduce/minimize the so-called strobing effects and support sub-pixel handling (if necessary). Why did the Metro 2033 guys not implement MLAA ?

Neb · May 31, 2010

patsu said:
I think on PC, they prefer a general solution since the hardware has an abundant and upgradable amount of spare capacity. They can always scale MSAA higher and higher to have similar results. So the benefits may be lesser. However, Intel's MLAA should be doable in modern GPUs.

Well might be for pushing sales on hardware. But just speculation!

Yes either on GPU or CPU as it has low impact on CPU perfomance and many games dont use all threads nor keep threads fully occupied. Think about Crysis only uses 2 threads leaving a Quad with 2 idling threads and 2 threads in use are not keept fully busy!

The real question is not whether it can be done on certain GPUs. It's whether it can be done within the allocated time, and reduce/minimize the so-called strobing effects and sub-pixel handling (if necessary). Why did the Metro 2033 guys not implement MLAA ?

The perfomance should give them a wide playfield which will change its form depeding on efficiency. About Metro 2033 I dont know but perhaps a solution suitable on multiplatform. According to devs quality should be slightly worse than for MLAA on certain angles. Think they mentioned ~5ms perfomance hit on Xenos.

patsu · May 31, 2010

Nebula said:
Well might be for pushing sales on hardware. But just speculation!

Yes either on GPU or CPU as it has low impact on CPU perfomance and many games dont use all threads nor keep threads fully occupied. Think about Crysis only uses 2 threads leaving a Quad with 2 idling threads and 2 threads in use are not keept fully busy!

I am not familiar with PC pipeline these days.
Don't you have to let the CPU read and update video memory efficiently first, especially when people love 60fps or higher on PCs with everything turned on ?

The perfomance should give them a wide playfield which will change its form depeding on efficiency. About Metro 2033 I dont know but perhaps a solution suitable on multiplatform. According to devs quality should be slightly worse than for MLAA on certain angles. Think they mentioned ~5ms perfomance hit on Xenos.

I remember some other posters complained that aliasing is visible for near vertical (or horizontal ?) lines. How does Metro AA work w.r.t. Intel MLAA ?

Johnny_Physics · May 31, 2010

legendCNCD said:
http://capped.tv/fairlight_cncd-agenda_circling_forth
Our fast made fluid particle demo won this years BP10, though its very gpu consuming, not for faint (under gtx260)

This kind of thing I'd really like to see next-gen. Enough for particles

That demo is totally awesome.
For the ones on slower connections(like me) or slower computers/laptops(like me) here's another link:
http://www.youtube.com/watch?v=ON4N0yGz4n8

legendCNCD, would you mind if I PM:ed you about about a few things?

T.B. · May 31, 2010

Nebula said:
The Intel MLAA concept seems well suited for GPUs or are the Intel MLAA algorithm developer(s) lying in their document? (see comments from previous post)

That's some strong wording you got there. Alex Reshetov thinks it should be portable to GPUs, I think it's not a problem well suited for them. I can see that that's confusing, but it's no reason to start calling people liars, especially when talking about their published research.

So, I just had a quick look at Alex' code and it seems that he's distributing work in blocks of 8 (horizontal or vertical) scanlines(*). So, is that embarrassingly parallel? For very large numbers of lines, it pretty much is. For small numbers of lines, it isn't. So, given a 720p image, I get a maximum of 90 blocks for the vertical case. Is that enough for GPU parallelization? Not on CUDA it isn't. But on larrabee it just might be.

In other words: Different perspectives, different opinions.

(* No, that's not what we do.)

patsu · May 31, 2010

Ok, I see where the implementation discrepancies lie now. Thanks for clarifying, T.B.

Neb · May 31, 2010

T.B. said:
That's some strong wording you got there. Alex Reshetov thinks it should be portable to GPUs, I think it's not a problem well suited for them. I can see that that's confusing, but it's no reason to start calling people liars, especially when talking about their published research.

And neither have I called anyone liar.

...can be implemented on the GPU even if the main algorithm runs on the CPU.

...MLAA, even in its current un-optimized implementation, is reasonably fast, processing about 20M pixels per second on a single 3GHz core.

...It is embarrassingly parallel and on a multicore machine can be used to achieve better load balancing by processing the final output image in idle threads.

...To avoid mixing integer and float values, for the CPU version of MLAA we did not assign different weights to different channels, deferring more advanced luminance processing until GPU implementation.

...We did not try to implement other MLAA steps using SSE® instructions (though it might be possible to do it using SSE4 operations), opting instead for preserving the universal nature of the algorithm. The upcoming Larrabee chip, as well as modern GPU cards, are capable of handling 8-bit data extremely efficiently, so our algorithm will benefit from porting to these architectures.

http://visual-computing.intel-research.net/publications/mlaa.pdf

patsu · May 31, 2010

Nebula said:
http://visual-computing.intel-research.net/publications/mlaa.pdf

It looks like they are referring to pure CPU performance on an image when talking about their multicore implementations. In a game, the entire run-time needs to be integrated and shared with the GPU pipeline. Someone should already be experimenting with the setup on the PC side as we speak.

The author mentioned that his current CPU framework can be used for raytracing too. But that does not necessarily mean that it can run a raytraced game efficiently. He may be talking about something of a much larger scale (hence massively parallelizable). This would tie in with T.B.'s observation (that it may be difficult for a small dataset, but can scale better for large dataset): [size=-2]Remember, we are talking about a research paper in general. The application area is open and left for implementors to tackle on a case-by-case basis.[/size]

Morphological antialiasing (MLAA) has a set of unique characte- ristics distinguishing it from other algorithms. It is completely independent from the rendering pipeline. In effect, it can be used for both rasterization and ray tracing applications, even though we consider it naturally aligned with ray tracing algorithms, for which there is no hardware acceleration available. It represents a single post-processing kernel, which can be used in any ray tracing application without any modifications and, in fact, can be imple- mented on the GPU even if the main algorithm runs on the CPU.

...

Even though ray tracing rendering is highly parallelizable as such, creating or updating accelerating structures for dynamic models has certain scalability issues [Wald et al. 2007]. MLAA is well positioned to use these underutilized cycles to achieve zero impact on overall performance.

EDIT:
This line of discussion reminds me of Guerilla's slides and comments about dynamic radiocity:
http://forum.beyond3d.com/showpost.php?p=1434701&postcount=209

What can be done if the CPU, GPU and memory are "tightly coupled" ?

Neb · May 31, 2010

patsu said:
It looks like they are referring to pure CPU performance on an image when talking about their multicore implementations. In a game, the entire run-time needs to be integrated and shared with the GPU pipeline. Someone should already be experimenting with the setup on the PC side as we speak.)

Yes the other link has one example of perfomance with Intel MLAA on PC.

The author mentioned that his current CPU framework can be used for raytracing too. But that does not necessarily mean that it can run a game efficiently. He may be talking about something of a much larger scale (hence massively parallelizable). This would tie in with T.B.'s observation (that it may be difficult for a small dataset, but can scale better for large dataset):

With software rendering SSAA, MSAA(?) would cost tremendously and perhaps what MLAA was mainly targeted for. Anyway they give some mp numbers for what a single thread can proccess per second with un-optimised code, single CPU thread. And aint such HW like Nvidia and ATI GPUs mass multi-core hardware that relies on large scale parallelisation?

patsu · May 31, 2010

Nebula said:
Yes the other link has one example of perfomance with Intel MLAA on PC.

With software rendering SSAA, MSAA(?) would cost tremendously and perhaps what MLAA was mainly targeted for. Anyway they give some mp numbers for what a single thread can proccess per second with un-optimised code, single CPU thread. And aint such HW like Nvidia and ATI GPUs mass multi-core hardware that relies on large scale parallelisation?

Yes, single thread implementation means he's not relying much on parallelization (now).

And if he sets his eyes on grand challenges, his parallelization numbers and approach may not be applicable for small problems like games.

There is overhead in parallelization. It's easier to hide and spread the overhead for larger datasets (and longer time horizon).

Neb · May 31, 2010

patsu said:
Yes, that means he's not relying much on parallelization (now).

And if he sets his eyes on grand challenges, his parallelization numbers and approach may not be applicable for small problems like games.

Well it is an example for perfomance but it mention it's 'embarrassingly parallel'. Anyway doing it for a scene in a modelling program or game, both are images per second with X amount of mega pixels per frame. The MLAA would be applied to final output frames.

I find this interesting.

...We did not try to implement other MLAA steps using SSE® instructions (though it might be possible to do it using SSE4 operations), opting instead for preserving the universal nature of the algorithm. The upcoming Larrabee chip, as well as modern GPU cards, are capable of handling 8-bit data extremely efficiently, so our algorithm will benefit from porting to these architectures.

patsu · May 31, 2010

Nebula said:
Well it is an example for perfomance but it mention it's 'embarrassingly parallel'.

Exactly. I don't think it conflicts with T.B.'s observation/findings. For a large enough dataset and time horizon, you can splurge on the cores. There are data dependencies/ordering in the dataset, so for a small problem set, you may not gain as much. Note that the author never promise that his algorithm will be applicable in all situations. He uses 8-bit data so that it can be ported to more platforms (CPU and GPU).

Neb · May 31, 2010

patsu said:
Exactly. I don't think it conflicts with T.B.'s observation/findings. For a large enough dataset and time horizon, you can splurge on the cores. There are data dependencies/ordering in the dataset, so for a small problem set, you may not gain as much.

As I understand it, it scales better for higher-resolutions. Example of 720p not being suitable for CUDA but how about resolutions that is used on PC which ranges in the 2-4x pixel amount?

There are data dependencies/ordering in the dataset, so for a small problem set, you may not gain as much.

But would you really have to gain a lot to have it run fast? The example of a Q3.0GHz having the MLAA take 5ms render time on a 1024x1024 render.

Note that the author never promise that his algorithm will be applicable in all situations. He uses 8-bit data so that it can be ported to more platforms (CPU and GPU).

That is understandable and brings the question to how to find a solution? Adapt to hardware. Isn't GOW3s MLAA a modified version of Intel MLAA, I mean it aint running a chalk copy of the Intel MLAA 'un-optimised' algorithm or? What about the other games solution on all platforms?

Graphical effects rarely seen this gen that you expect/hope become standard next-gen

Neb

Iron "BEAST" Man

Arwin

Now Officially a Top 10 Poster

Shifty Geezer

uber-Troll!

Neb

Iron "BEAST" Man

Neb

Iron "BEAST" Man

Shifty Geezer

uber-Troll!

Neb

Iron "BEAST" Man

patsu

Neb

Iron "BEAST" Man

patsu

Johnny_Physics

T.B.

patsu

Neb

Iron "BEAST" Man

patsu

Neb

Iron "BEAST" Man

patsu

Neb

Iron "BEAST" Man

patsu

Neb

Iron "BEAST" Man