Is free AA really worth it?

Status
Not open for further replies.
BenQ said:
On top of that, it is only the uninformed among us who believe that the entire 100 million transistors making up the daughter die are used for "just" AA. :LOL:
And it is only the most wishful thinking x360 fans that believe it is used for anything incredibly significant.

Seriously, every x360 fan is expecting ATI to have delivered the miracle GPU that has incredible shading power and 'free' AA, that whomps RSX. When the reality is that RSX is a beast that I have no doubt will trump Xenos in many departments.

You can't have everything ;)

(Ps; Far Cry, HL2 and WoW are some of the most graphically intensive titles out atm and didn't take much of a hit. And with PS3 being a closed environment the GPU can be utilised much more efficiently for the application of AA.)
 
Nicked said:
When the reality is that RSX is a beast that I have no doubt will trump Xenos in many departments.
Maybe you should justify your opinion..
 
Nicked said:
BenQ said:
On top of that, it is only the uninformed among us who believe that the entire 100 million transistors making up the daughter die are used for "just" AA. :LOL:
And it is only the most wishful thinking x360 fans that believe it is used for anything incredibly significant.

Seriously, every x360 fan is expecting ATI to have delivered the miracle GPU that has incredible shading power and 'free' AA, that whomps RSX. When the reality is that RSX is a beast that I have no doubt will trump Xenos in many departments.

You can't have everything ;)

(Ps; Far Cry, HL2 and WoW are some of the most graphically intensive titles out atm and didn't take much of a hit. And with PS3 being a closed environment the GPU can be utilised much more efficiently for the application of AA.)

A fanboy's response such as yours is hardly worth a lengthy response, so here's the one you get.

You claim that the daughter die will not be used for anything significant and most likely believe that it's "just there for AA".

But perhaps you should ask yourself what those 192 tiny CPUs surrounding the embedded eDRAM are for?

And then ask yourself why they needed 256 GB/s of bandwidth for "just AA."

If that were the case, there would be NO AA performed by ANY GPU today, and AA would be little more than a feature we dream of having in the far distant future.

And after you've done that, go here and inform yourself......

http://www.beyond3d.com/articles/xenos/
 
BenQ said:
It was taking a 17% hit at 1024 x 768 and a whopping 44% hit at 1600 x 1200.
It's also a game with very limited shader usage compared to what I expect to see on average in upcoming games. As far as I'm concerned D3 is just as irrelevant as the other games on that chart with respect to RSX and PS3 performance.
 
BenQ said:
AA is FAR from "free" for the G70. I'm sure you're thinking of those charts nVidia released, right?
Alas, those charts are going to leave a lasting legacy of FUD it seems in the latest 'my xxx is better than your xxx' 'debate' :rolleyes:
 
BenQ said:
A fanboy's response such as yours is hardly worth a lengthy response, so here's the one you get.

You claim that the daughter die will not be used for anything significant and most likely believe that it's "just there for AA".

But perhaps you should ask yourself what those 192 tiny CPUs surrounding the embedded eDRAM are for?

And then ask yourself why they needed 256 GB/s of bandwidth for "just AA."

If that were the case, there would be NO AA performed by ANY GPU today, and AA would be little more than a feature we dream of having in the far distant future.

And after you've done that, go here and inform yourself......

http://www.beyond3d.com/articles/xenos/

There is not a single programmable part on the eDRAM die. It's all fixed function: Color/Alpha-Blends, z-Stencil Test and FSAA.
Calling them SIMD elements is actually weird, but I guess that way they count towards the 1 Teraflop PR figure. Calling them CPUs is just outright wrong.

You should take your own advice and read the article you posted, together with some research into what SIMD, CPU and programmable actually mean :?
 
Talking of Xenos and AA, is this 95% figure believable with tiling?
http://www.beyond3d.com/articles/xenos/index.php?p=05#tiled
So in terms of supporting FSAA the developers really only need to care about whether they wish to utilise this tiling solution or not when deciding what depth of FSAA to use (with consideration to the depth of the buffers they require as well). ATI have been quoted as suggesting that 720p resolutions with 4x FSAA, which would require three tiles, has about 95% of the performance of 2x FSAA.
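The "three tiles" figure can be sanity-checked with some quick arithmetic. This is only a rough sketch: it assumes the 10 MB eDRAM size mentioned elsewhere in this thread, and it assumes 4 bytes of colour plus 4 bytes of Z/stencil per sample, which the quoted article text does not state. The function name `tiles_needed` is made up for illustration.

```python
import math

# Assumption: 10 MB of eDRAM, the figure mentioned elsewhere in this thread.
EDRAM_BYTES = 10 * 1024 * 1024

def tiles_needed(width, height, samples, bytes_colour=4, bytes_depth=4):
    """Number of eDRAM-sized tiles needed for a multisampled colour + Z buffer.

    Assumption: every sample stores bytes_colour of colour and
    bytes_depth of Z/stencil, with no compression.
    """
    buffer_bytes = width * height * samples * (bytes_colour + bytes_depth)
    return math.ceil(buffer_bytes / EDRAM_BYTES)

for samples in (1, 2, 4):
    print(f"720p, {samples}x FSAA: {tiles_needed(1280, 720, samples)} tile(s)")
```

Under those assumptions 720p fits in one tile without AA, needs two tiles at 2x and three at 4x, which lines up with the quoted article.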
 
BenQ said:
And then ask yourself why they needed 256 GB/s of bandwidth for "just AA."
Alas, those charts are going to leave a lasting legacy of FUD it seems in the latest 'my xxx is better than your xxx' 'debate' :p

FYI the 256 GB/s isn't needed for AA. You can do AA with less bandwidth. The 256 GB/s internal eDRAM bandwidth rate is more than enough for AA requirements. The internal logic gates are very specialist single function processing units and not little CPUs, the idea being to stop thrashing RAM bandwidth by coupling these functions with their own high speed RAM so they can do their work optimally without affecting system BW. I don't know the full set of functions these eDRAM logic circuits do, but Npl gives a good idea above - it's testing and blending functions.
 
It could also be that the 10MB framebuffer is divided into, say, 16 banks, each with an identical but separate block of logic. That would leave 16GB/s and 12 "logic units" per bank. This should simplify the logic, as each group of 12 "logic units" only has to access a smaller block of RAM (with lower latencies too).

Just guessing
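For what it's worth, the per-bank numbers in that guess are just an even split of the totals quoted in the thread (256 GB/s, 192 logic units). A tiny sketch, where the 16-bank layout is the poster's speculation and `per_bank` is a hypothetical helper:

```python
def per_bank(total_bandwidth_gb_s=256, total_logic_units=192, banks=16):
    """Evenly split bandwidth and logic units across a guessed number of banks."""
    if total_logic_units % banks:
        raise ValueError("logic units do not divide evenly across banks")
    return total_bandwidth_gb_s / banks, total_logic_units // banks

bandwidth, units = per_bank()
print(f"{bandwidth:.0f} GB/s and {units} logic units per bank")
```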
 
Shifty Geezer said:
FYI the 256 GB/s isn't needed for AA.

At 4GP/s with 4 AA samples per pixel (at 4 bytes per sample), to update the z-data requires a read of 64GB/s and a write of 64GB/s - that's 128GB/s.

That's simply to perform AA.

Jawed
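The arithmetic in that post can be written out as a short sketch. It simply multiplies the figures given above (4 GP/s, 4 samples per pixel, 4 bytes per sample) and assumes every Z sample is read once and written once; the function name is made up for illustration.

```python
def z_bandwidth_gb_s(pixels_per_second, samples_per_pixel=4, bytes_per_sample=4):
    """Return (read, write, total) Z traffic in GB/s.

    Assumption: each sample is read once and written once per pixel,
    with no compression.
    """
    one_way = pixels_per_second * samples_per_pixel * bytes_per_sample / 1e9
    return one_way, one_way, 2 * one_way

read, write, total = z_bandwidth_gb_s(4e9)
print(f"read {read:.0f} GB/s + write {write:.0f} GB/s = {total:.0f} GB/s")
```

That gives 64 GB/s each way and 128 GB/s in total, matching the numbers in the post.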
 
I like how the ps3 is painted as giving developers a choice, but oh btw, you don't get to choose what shaders do what, unlike the Xbox360's unified design. You could go back and forth all day.

Both machines are *complete* designs, balanced from their own perspective. You can't just take 100m from the daughter die and plug it into the gpu, because the C1 was designed with the bandwidth savings of that 100m. You might as well suggest that Sony take one or two SPEs and plug those transistors into RSX, since it's been suggested by certain sites *cough* that the SPEs are just wasted die space anyway.

The AA vs no AA argument doesn't make any sense either, because the edram module doesn't do just AA.
 
Jaws said:
blakjedi said:
...
The 192 highspeed SIMD units on the EDRAM can be used for anything...

Sorry, I must've missed something but where did you get the idea that those 192 ALUs are programmable SIMD units?

Npl said:
There is not a single programmable part on the eDRAM die. It's all fixed function: Color/Alpha-Blends, z-Stencil Test and FSAA.
Denoting it SIMD-Elements is actually weird, but I guess that would mean it counts for the 1 Teraflop PR. Calling them CPUs is just outright wrong.

Ahem. 8) Without copying the entire article I will only quote the parts that are relevant to my statements... These quotes come directly from Dave's article.

In simple terms the MEMEXPORT function is a method by which Xenos can push and pull vectorised data directly to and from system RAM. This becomes very useful with vertex shader programs as with the capabilities to scatter and gather to and from system RAM the graphics processor suddenly becomes a very wide processor for general purpose floating point operations.

MEMEXPORT expands the graphics pipeline further forward and in a general purpose and programmable way.

Other examples for its use could be to provide image based operations such as compositing, animating particles, or even operations that can alternate between the CPU and graphics processor.

With the capability to fetch from anywhere in memory, perform arbitrary ALU operations and write the results back to memory, in conjunction with the raw floating point performance of the large shader ALU array, the MEMEXPORT facility does have the capability to achieve a wide range of fairly complex and general purpose operations; basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU.

More people need to actually read the article instead of just trying to put other people down. :rolleyes:
 
blakjedi said:
Jaws said:
blakjedi said:
...
The 192 highspeed SIMD units on the EDRAM can be used for anything...

Sorry, I must've missed something but where did you get the idea that those 192 ALUs are programmable SIMD units?

Npl said:
There is not a single programmable part on the eDRAM die. It's all fixed function: Color/Alpha-Blends, z-Stencil Test and FSAA.
Denoting it SIMD-Elements is actually weird, but I guess that would mean it counts for the 1 Teraflop PR. Calling them CPUs is just outright wrong.

Ahem. 8) Without copying the entire article I will only quote the parts that are relevant to my statements... These quotes come directly from Dave's article.

In simple terms the MEMEXPORT function is a method by which Xenos can push and pull vectorised data directly to and from system RAM. This becomes very useful with vertex shader programs as with the capabilities to scatter and gather to and from system RAM the graphics processor suddenly becomes a very wide processor for general purpose floating point operations.

MEMEXPORT expands the graphics pipeline further forward and in a general purpose and programmable way.

Other examples for its use could be to provide image based operations such as compositing, animating particles, or even operations that can alternate between the CPU and graphics processor.

With the capability to fetch from anywhere in memory, perform arbitrary ALU operations and write the results back to memory, in conjunction with the raw floating point performance of the large shader ALU array, the MEMEXPORT facility does have the capability to achieve a wide range of fairly complex and general purpose operations; basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU.

More people need to actually read the article instead of just trying to put other people down. :rolleyes:

Err...those articles are referring to the unified shaders, not the ALUs on the eDram. AFAIK, those ALUs are not programmable in any significant way.
 
Those quotes are talking about the shaders, not the logic on the eDRAM, as far as I can see.

JAWED : I don't know the maths for rendering too well, and you identify bandwidth usage nicely - thanks. My perspective is that existing GPUs can do AA without that much BW, ergo it isn't all needed. It'll be good to see how much is used when we get performance metrics for the hardware in operation. As you've written it, it looks like AA is totally impossible at high res on existing GPUs and RSX, which have only half or even a third of that 128 GB/s required BW.

Also, why 4 billion pixels? 720p ~ 1 million pixels x 60 fps x 4 for AA = 240 million pixels a second, 1/16th your 4 GPixel figure, which places BW needs at 4 GB/s for Z only; which explains why current hardware can manage it with 30ish GB/s BW.
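Here is that back-of-envelope estimate as a sketch. It uses the rounded figures from the post (about one million pixels per frame, 60 fps, 4 AA samples) and then scales the 64 GB/s one-way Z figure from Jawed's post by the resulting fraction, which is one reading of how the roughly 4 GB/s number comes out. It counts each displayed pixel once per frame, so overdraw is not included. The names are made up for illustration.

```python
def samples_per_second(pixels_per_frame, fps, aa_samples):
    """Samples touched per second if every displayed pixel is drawn once per frame."""
    return pixels_per_frame * fps * aa_samples

PEAK_RATE = 4e9            # 4 GPixel/s figure from the earlier post
PEAK_Z_ONE_WAY_GB_S = 64   # one-way Z traffic at that peak, from the earlier post

rate = samples_per_second(1_000_000, 60, 4)
fraction = rate / PEAK_RATE
scaled_z = PEAK_Z_ONE_WAY_GB_S * fraction

print(f"{rate / 1e6:.0f} million samples/s")
print(f"fraction of the 4 GPixel/s peak: {fraction:.2f}")
print(f"scaled one-way Z bandwidth: {scaled_z:.2f} GB/s")
```

With these inputs the rate is 240 million per second, about 0.06 of the peak (close to 1/16), and the scaled Z figure lands a little under 4 GB/s.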
 
blakjedi said:
...
More people need to actually read the article instead of just trying to put other people down. :rolleyes:

Who's putting you down?

Your quotes from Dave's article are referring to Xenos' shader core ALUs, i.e. the 48 vec4+48 scalar units being programmable SIMD units NOT the 192 ALUs in the EDRAM module that you were originally referring to. You're misunderstanding the article and what the GP execution units are.
 
Jaws said:
blakjedi said:
...
More people need to actually read the article instead of just trying to put other people down. :rolleyes:

Who's putting you down?

Your quotes from Dave's article are referring to Xenos' shader core ALUs, i.e. the 48 vec4+48 scalar units being programmable SIMD units NOT the 192 ALUs in the EDRAM module that you were originally referring to. You're misunderstanding the article and what the GP execution units are.

Shifty Geezer said:
Those quotes are talking about the shaders, not the logic on the eDRAM, as far as I can see.

Yup! You're right. My fault. :oops: :D
 
Shifty Geezer said:
JAWED : I don't know the maths for rendering too well, and you identify bandwidth usage nicely - thanks. My perspective is that existing GPUs can do AA without that much BW, ergo it isn't all needed. It'll be good to see how much is used when we get performance metrics for the hardware in operation. As you've written it, it looks like AA is totally impossible at high res on existing GPUs and RSX, which have only half or even a third of that 128 GB/s required BW.

Two things:

1. current GPUs don't actually run at their peak fill-rate in any games. There's always a bottleneck somewhere to prevent that.

Xenos's relatively low peak fill-rate is actually achievable and not only that but it's achievable with AA, which no other current GPU can sustain.

2. current GPUs make use of various forms of compression to avoid consuming vast amounts of bandwidth (whether it's for AA or just plain z-testing).

Xenos doesn't compress data within the EDRAM module - saving transistors both for packing and un-packing data as well as avoiding the need to create buffers to support the packing and un-packing. This also reduces latency meaning there is no time spent idle within the EDRAM, waiting for data.

Compression efficiency falls off as triangles get smaller or as the poly count increases, so as games progress throughout the next gen bandwidth demand goes up (even without AA, due to higher framebuffer workload).

The effect in Xenos is that all framebuffer tasks, including AA processing, cannot slow down Xenos because the peak bandwidths generated in dealing with these tasks at 4GP/s are designed-in to the EDRAM unit.

Jawed
 
blakjedi said:
Jaws said:
blakjedi said:
...
The 192 highspeed SIMD units on the EDRAM can be used for anything...

Sorry, I must've missed something but where did you get the idea that those 192 ALUs are programmable SIMD units?

Npl said:
There is not a single programmable part on the eDRAM die. It's all fixed function: Color/Alpha-Blends, z-Stencil Test and FSAA.
Denoting it SIMD-Elements is actually weird, but I guess that would mean it counts for the 1 Teraflop PR. Calling them CPUs is just outright wrong.

Ahem. 8) Without copying the entire article I will only quote the parts that are relevant to my statements... These quotes come directly from Dave's article.

In simple terms the MEMEXPORT function is a method by which Xenos can push and pull vectorised data directly to and from system RAM. This becomes very useful with vertex shader programs as with the capabilities to scatter and gather to and from system RAM the graphics processor suddenly becomes a very wide processor for general purpose floating point operations.

MEMEXPORT expands the graphics pipeline further forward and in a general purpose and programmable way.

Other examples for its use could be to provide image based operations such as compositing, animating particles, or even operations that can alternate between the CPU and graphics processor.

With the capability to fetch from anywhere in memory, perform arbitrary ALU operations and write the results back to memory, in conjunction with the raw floating point performance of the large shader ALU array, the MEMEXPORT facility does have the capability to achieve a wide range of fairly complex and general purpose operations; basically any operation that can be mapped to a wide SIMD array can be fairly efficiently achieved and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor and is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU.

More people need to actually read the article instead of just trying to put other people down. :rolleyes:

:rolleyes: oh dear, oh dear...
 
Shifty Geezer said:
My perspective is existing GPUs can do AA without that much BW, ergo it isn't all needed.
The eDRAM block is designed to provide the maximum bandwidth needed by the ROPs operating at their peak efficiency.
The point is not whether this peak bandwidth usage will ever happen (it won't); it's to guarantee the ROPs will never stall waiting for FB/Z memory accesses.

With your more standard GPU, on-demand loaded caches take the role of feeding the processing units faster. Unlike eDRAM they only need to be small in size to give good utilization, but they do not guarantee anything - a badly behaved application can become completely limited by external memory bandwidth, giving you poor GPU utilization.
 
gurgi said:
I like how the ps3 is painted as giving developers a choice, but oh btw, you don't get to choose what shaders do what, unlike the Xbox360's unified design. You could go back and forth all day.

Both machines are *complete* designs, balanced from their own perspective. You can't just take 100m from the daughter die and plug it into the gpu, because the C1 was designed with the bandwidth savings of that 100m. You might as well suggest that Sony take one or two SPEs and plug those transistors into RSX, since it's been suggested by certain sites *cough* that the SPEs are just wasted die space anyway
You can't choose what each individual Xenos pipeline does (three arrays).

On Xenos, transistors for FSAA (and other fixed functions) will have influenced the budget on the main process. If Microsoft had insisted on matching GPU costs with RSX, we might have seen a 300m Xenos process without a daughter die.

It seems that Nvidia pursues pixel processing more (in their decision not to include eDRAM), while ATI stresses post-processing; not to say that AA will be difficult on RSX either.
 