Is HDR "Free" for the Xenos?

... meaning to say I haven't seen anyone work the math out.
Well, lemme start by kicking off some math. You guys are free to tweak variables as needed.

Let's assume: 4xAA, 32-bpp texels, 24-bit depth, 8-bit stencil. 720p (1280x720), 60 frames/second.

Let's also assume that we do a Z-only first pass with an average depth complexity of 2, then do a color write pass such that early-Z / hier-Z culls all non-visible pixels in that pass for free.

Let's also ignore the effects of tiling, and assume everything fits in the eDRAM (i.e. the eDRAM is infinitely large).

The initial Z passes will use up: 4 bytes * 1280 * 720 * 60 frames * 2 (avg complexity) * 4 samples = 1.65 GB/sec.

If AA compression is 90% efficient (meaning the three extra samples per pixel cost only 10% of their raw bandwidth, with the first sample always at full cost), bandwidth for Z drops to just 0.53 GB/sec.

The color passes would then need: 4 bytes * 1280 * 720 * 60 * 4 samples = 0.82 GB/sec (0.27 GB/sec with 90% effective AA compression).

Let's assume you blend a lot, so AA compression drops to 80% (more edge intersections with blending) and the bandwidth requirements otherwise double (blending needs a read-modify-write).

You end up needing 0.66 GB/sec for color.
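
Here's the arithmetic above as a quick Python sketch, for anyone who wants to tweak the variables. (The compression model - first sample at full cost, the three extra samples at 1 minus the efficiency - is my reading of the figures above, so treat it as an assumption.)

```python
# Back-of-envelope check of the eDRAM figures above, in GiB/sec.
GiB = 2 ** 30
pixels = 1280 * 720
fps = 60
bpp = 4        # bytes per sample (32-bit colour, or 24-bit Z + 8-bit stencil)
samples = 4    # 4xAA

def pass_bw(depth_complexity, compression=0.0, rmw=1):
    # rmw=2 models the read-modify-write traffic that blending adds.
    one_sample = bpp * pixels * fps * depth_complexity * rmw
    # First sample at full cost; extra samples cost (1 - compression) each.
    return one_sample * (1 + (samples - 1) * (1 - compression)) / GiB

print(round(pass_bw(2), 2))                          # Z, uncompressed  -> 1.65
print(round(pass_bw(2, compression=0.9), 2))         # Z, 90% efficient -> 0.54 (0.53 above)
print(round(pass_bw(1, compression=0.9), 2))         # colour, 90%      -> 0.27
print(round(pass_bw(1, compression=0.8, rmw=2), 2))  # blended, 80%     -> 0.66
```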

Total eDRAM bandwidth so far: 2.32 GB/sec.

Now let's throw in tiling due to a limited eDRAM size, but ignore the additional VS work: we need to read the full eDRAM (10MB per tile pass), then write it out to main memory. At 3 tiles * 60 fps, that's 1.8 GB/sec for reads.

Total eDRAM bandwidth so far: 4.12 GB/sec.

The eDRAM has 256 GB/sec.

Even if we factor in inefficiency in the eDRAM implementation, and assume scenes 10x more complex, it's hard to see the effective 256 GB/sec ever being reduced to the ~40 GB/sec such scenes would demand.

Clearly, the eDRAM is likely never the bottleneck.

What about the external memory though?

The bandwidth there is 22.4 GB/sec.

If we read 1.8GB/sec from the eDRAM (to page out 3 tiles @ 60fps), then downsample, then write the resolved colors back to main memory, we end up with a needed local DRAM bandwidth of:
1280 * 720 * 4 * 60 ≈ 211 MB/sec.

Thus, there is 22.4 - 0.2 = 22.2 GB/sec of bandwidth left (99%) for texturing or other applications.

Assuming a memory efficiency of 85%, we get: 0.85 * 22.4 - 0.2 = 18.84 GB/sec (still ~99% of what's available).
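
Same sanity check for the external bus, as a sketch (the 22.4 GB/sec and 85% efficiency are the assumptions already stated):

```python
# Resolved colour written to main memory after the AA downsample.
MiB = 2 ** 20
resolve_write = 1280 * 720 * 4 * 60      # bytes/sec
print(round(resolve_write / MiB))        # ~211 MB/sec, i.e. ~0.2 GB/sec

bus = 22.4                               # GB/sec of external bandwidth
print(round(bus - 0.2, 2))               # 22.2 GB/sec left for texturing
print(round(0.85 * bus - 0.2, 2))        # 18.84 GB/sec at 85% efficiency
```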

That number is independent of the scene complexity (assuming you still run at a constant 60fps).
 
The copy from the EDRAM back buffer into the front buffer is just 8 bits per colour channel per pixel, i.e. 3 bytes per pixel.

As I've just posted on the previous page, performance appears to be dependent on the number of triangles that cross tile boundaries.

I was sorta hoping you'd have a go at working out how many triangles might do so... :cry:

Jawed
 
The copy from the EDRAM back buffer into the front buffer is just 8 bits per colour channel per pixel, i.e. 3 bytes per pixel.
Yup. I used 4 bytes/pixel instead, but the difference is small enough not to matter.
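
For the curious, the difference as a quick sketch:

```python
MiB = 2 ** 20
px_per_sec = 1280 * 720 * 60             # 720p at 60 fps
print(round(3 * px_per_sec / MiB))       # ~158 MB/sec at 3 bytes/pixel
print(round(4 * px_per_sec / MiB))       # ~211 MB/sec at 4 bytes/pixel
# ~53 MB/sec apart, on a 22.4 GB/sec bus: noise either way.
```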

I was sorta hoping you'd have a go at working out how many triangles might do so...

That entirely depends on how the triangles flow through the pipeline. Anyone have a good detailed article on the topic?
 
bob, what's with the 'AA compression'? There's no such thing in C1 to my knowledge - if we can talk about any compression there, it comes from the embedded ROPs - i.e. you save bandwidth from not having to send all 4 subsamples to the framebuffer, but within the (embedded) framebuffer there's no compression whatsoever.
 
Bob said:
... meaning to say I haven't seen anyone work the math out.

Total eDRAM bandwidth so far: 4.12 GB/sec.

The eDRAM has 256 GB/sec.

Even if we factor in inefficiency in the eDRAM implementation, and assume scenes 10x more complex, it's hard to see the effective 256 GB/sec ever being reduced to the ~40 GB/sec such scenes would demand.

Clearly, the eDRAM is likely never the bottleneck.

What about the external memory though?

The bandwidth there is 22.4 GB/sec.

I thought the 256 GB/s was only the speed between the eDRAM and the logic?
 
bob, what's with the 'AA compression'? There's no such thing in C1 to my knowledge - if we can talk about any compression there, it comes from the embedded ROPs
Sure, you can remove AA compression from my numbers. It doesn't change much.

1.65 GB/sec for Z + 1.65 GB/sec for color (0.82 doubled for blending) + 1.8 GB/sec for tiling (worst case) == 5.1 GB/sec.

For a scene 10x more complex, the Z and color traffic scale but the tile read-out doesn't: 10 * 3.3 + 1.8 = 34.8 GB/sec, well under the 256 GB/sec available.

I thought the 256 GB/s was only the speed between the eDRAM and the logic?
The eDRAM die contains the logic - that's how you effectively get the 256 GB/sec. Btw, it's taken into consideration in my numbers.
 
Fafalada said:
Turning off the AA doesn't increase the peak fill-rate in Xenos.
Nope, but turning off FP16 does :p

Yep!!! Is 2GP/s enough for any 720p FP16 HDR game?...

Might it be the increase in texturing bandwidths that kills performance, rather than a slowdown in the export to ROPs or within the ROPs themselves?

I suppose cross-platform games like Splinter Cell might stick with FP16, but XB360 exclusives will stick to FP10.

Jawed
 
Jawed said:
Yep!!! Is 2GP/s enough for any 720p FP16 HDR game?...
Personally I think it's plenty. There are only a few places where we need all the fillrate we can get - one area is shadow rendering, and that doesn't require FP buffers, so it'll trivially work at max fill.
The other stuff, like lots of transparent particles and postprocessing, is up for debate - if you really spend a lot of time doing those then you may consider giving up FP16, but I think in most cases it shouldn't be an issue.
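
To put the 2 GP/s in perspective, here's a back-of-envelope fill budget (ignoring every other bottleneck, so an upper bound):

```python
fill = 2e9                # pixels/sec with FP16 targets (half the 4 GP/s peak)
layer = 1280 * 720 * 60   # pixels/sec consumed by one full-screen layer at 720p60
print(fill / layer)       # ~36 full-screen layers of fill per frame
```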

I think that the biggest speed advantage of FP10 over FP16 is most likely not fillrate - but reduced number of tiles that you need to fill the screen.
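
A rough sketch of the tile-count point, assuming a 10 MB eDRAM, a 4-byte Z/stencil sample alongside each colour sample, and tiles that can split at arbitrary sizes (real tile dimensions are constrained, so actual counts may differ):

```python
import math

EDRAM = 10 * 2 ** 20    # 10 MB of eDRAM, in bytes
PIXELS = 1280 * 720     # 720p
SAMPLES = 4             # 4xAA
Z_BYTES = 4             # 24-bit Z + 8-bit stencil per sample

def tiles(colour_bytes_per_sample):
    per_pixel = (colour_bytes_per_sample + Z_BYTES) * SAMPLES
    return math.ceil(PIXELS / (EDRAM // per_pixel))

print(tiles(4))   # FP10 or 8:8:8:8 -> 3 tiles
print(tiles(8))   # FP16            -> 5 tiles
```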
 
Fafalada said:
I think that the biggest speed advantage of FP10 over FP16 is most likely not fillrate - but reduced number of tiles that you need to fill the screen.

Do you think the impact of FP16 textures on overall bandwidth would be insignificant?

Jawed
 
Jawed said:
The 5% seems to relate to the count of triangles that are transformed-and-lit in vertex shader code. I'm guessing that ATI estimates that up to 5% of the triangles in a scene will cross either of the 2 tile boundaries in a 3-tile frame-buffer. That would cost both in terms of the bandwidth consumed in reading the vertex data for those triangles and in executing the vertex shaders and running the triangles through the various set-up and rasteriser engines.

Jawed

The problem with tiling is mainly outside the eDRAM itself: it has a cost on the shared 22.4 GB/s bandwidth of the Xbox 360, and a cost on the vertex shaders. ATI claims that 4xAA @ 720p runs at 95-99% of the effective performance of no AA (so the average hit according to ATI is 2.5%, the worst case 5%, the best case 1%). I don't believe this, because if it were true then no developer would ever use 2xAA on Xbox 360 (in the worst scenario a 5% hit means you go from, for example, 30 fps to 28.5 fps, or from 60 fps to 57 fps - and that's the worst case according to ATI), which won't be the case... and you will see for yourself at the Xbox 360 launch...
 
fouad said:
The problem with tiling is mainly outside the eDRAM itself: it has a cost on the shared 22.4 GB/s bandwidth of the Xbox 360, and a cost on the vertex shaders. ATI claims that 4xAA @ 720p runs at 95-99% of the effective performance of no AA (so the average hit according to ATI is 2.5%, the worst case 5%, the best case 1%). I don't believe this, because if it were true then no developer would ever use 2xAA on Xbox 360 (in the worst scenario a 5% hit means you go from, for example, 30 fps to 28.5 fps, or from 60 fps to 57 fps - and that's the worst case according to ATI), which won't be the case... and you will see for yourself at the Xbox 360 launch...

Once again fouad... you make a statement with no backup. Indulge us... why will we see that it "won't be the case" at the 360 launch? You have proven you don't understand the theory behind how Xenos works, and very few people have seen it in practice... so how do YOU know?
 
blakjedi said:
fouad said:
The problem with tiling is mainly outside the eDRAM itself: it has a cost on the shared 22.4 GB/s bandwidth of the Xbox 360, and a cost on the vertex shaders. ATI claims that 4xAA @ 720p runs at 95-99% of the effective performance of no AA (so the average hit according to ATI is 2.5%, the worst case 5%, the best case 1%). I don't believe this, because if it were true then no developer would ever use 2xAA on Xbox 360 (in the worst scenario a 5% hit means you go from, for example, 30 fps to 28.5 fps, or from 60 fps to 57 fps - and that's the worst case according to ATI), which won't be the case... and you will see for yourself at the Xbox 360 launch...

Once again fouad... you make a statement with no backup. Indulge us... why will we see that it "won't be the case" at the 360 launch? You have proven you don't understand the theory behind how Xenos works, and very few people have seen it in practice... so how do YOU know?

Well, it won't be the case, because you will see some Xbox 360 games at launch running at only 2xAA @ 720p, and not 4xAA. I think this is proof that the performance hit on Xbox 360 when going to 4xAA @ 720p is more than 1-5%, because if it were only 1-5%, then every Xbox 360 game would have 4xAA. Don't you agree?
 
It's very apparent, as I pointed out earlier, that fouad's entire premise is 'if 4xAA were that cheap, devs would be using it and ATi wouldn't need to encourage them' - a 'theory' founded purely on skepticism and no technical know-how.

I suggest leaving him/her to his/her theories and let this one slide...
 
Shifty Geezer said:
It's very apparent, as I pointed out earlier, that fouad's entire premise is 'if 4xAA were that cheap, devs would be using it and ATi wouldn't need to encourage them'

This is part of the truth, and I don't find any problem with my logic. If you have a problem with it, please tell me: what's the problem?


a 'theory' founded purely on skepticism and no technical know-how.

This is not true.
Though I can't technically prove that 4xAA @ 720p hits performance by more than 5%, no one can prove that it's 5% or less either. So what's the problem with referring our debate to developers?
 
fouad said:
Well, it won't be the case, because you will see some Xbox 360 games at launch running at only 2xAA @ 720p, and not 4xAA. I think this is proof that the performance hit on Xbox 360 when going to 4xAA @ 720p is more than 1-5%, because if it were only 1-5%, then every Xbox 360 game would have 4xAA. Don't you agree?

So you've switched arguments completely once again? Now it's that 4xAA can't possibly be only a 5% hit, because otherwise it would be the standard and not 2xAA?

Man, this must be the 5th time you've changed your argument in this thread alone.

Did you ever consider that MS doesn't want to force developers to take a hit if they don't want to? 5% is still 5%.

If you are a developer struggling with framerates, and you are only hitting 24 FPS while needing 30 FPS, then that "little" 5% hit might mean a lot. It's one step closer to achieving your target framerate.

Games like Heavenly Sword, I think, would be an example of this: the game is running extremely slowly, and the graphics don't need AA; they look excellent as-is with no AA at all. So this is a situation where 2xAA would surely suffice, and the extra 5% framerate boost would be appreciated.

So long story short, it's entirely conceivable that MS is requiring 2xAA because it has literally no impact (<1%), while 4xAA does have an impact, however small; they are leaving that option open for developers.

Makes sense to me.

p.s. we can speculate all we want, but ATI has stated 4xAA is 95% efficient. So, unless you have PROOF that they are lying, your argument is nothing more than baseless skepticism and doubt.

We take Nvidia's word when they talk about RSX, and IBM's word when they release CELL benchmarks. If you start claiming companies are lying, where do you stop?
 
Turning off the AA doesn't increase the peak fill-rate in Xenos.
Who said anything about peak fillrate changing as a result of antialiasing? I'm talking about the fact that it IS a fixed limit. If you have 7 or 8 accumulated passes all antialiased vs. all non-antialiased, you will be fillrate-bound. There's no such thing as AA that doesn't require more fillrate to perform... unless someone finds a way to scale up the framebuffer contents in such a way as to get correct predicted results and then downsample the end results.

And anytime there's a claim that it's "impossible" to be bound by a particular limit, that claim is invariably a 100% lie. If 4 GP/sec sounds like a lot, just wait long enough and someone will be limited by it.

Either way, it's not like Xenos is the weakest link in the box. That would have to be the CPU<->memory junction, the way things look right now.
 
ShootMyMonkey said:
Turning off the AA doesn't increase the peak fill-rate in Xenos.
Who said anything about peak fillrate changing as a result of antialiasing? I'm talking about the fact that it IS a fixed limit. If you have 7 or 8 accumulated passes all antialiased vs. all non-antialiased, you will be fillrate-bound.

There will be no difference in the fill-rate limit between AA and non-AA.

The parent->daughter bandwidth, 32GB/s, is enough for 4GP/s including 4xAA sample data.

Thereafter, all of the extreme bandwidth demands of AA are taken care of inside the EDRAM unit; that's what the 256GB/s is there for.

It makes no difference to the fill-rate.
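
A quick consistency check on that, assuming the link carries one 4-byte colour and one 4-byte Z value per pixel, with the 4x sample expansion happening on the daughter die:

```python
fill = 4e9                     # pixels/sec peak fill-rate
link_bytes_per_pixel = 4 + 4   # one colour value + one Z value per pixel
print(fill * link_bytes_per_pixel / 1e9)   # 32.0 GB/sec = parent->daughter bus
```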

Read the Xenos article.

Jawed
 
scooby_dooby said:
fouad said:
Well, it won't be the case, because you will see some Xbox 360 games at launch running at only 2xAA @ 720p, and not 4xAA. I think this is proof that the performance hit on Xbox 360 when going to 4xAA @ 720p is more than 1-5%, because if it were only 1-5%, then every Xbox 360 game would have 4xAA. Don't you agree?

So you've switched arguments completely once again? Now it's that 4xAA can't possibly be only a 5% hit, because otherwise it would be the standard and not 2xAA?

Man, this must be the 5th time you've changed your argument in this thread alone.
NO, I am not switching arguments, I am adding arguments.

Did you ever consider that MS doesn't want to force developers to take a hit if they don't want to? 5% is still 5%.

If you are a developer struggling with framerates, and you are only hitting 24 FPS while needing 30 FPS, then that "little" 5% hit might mean a lot. It's one step closer to achieving your target framerate.

Games like Heavenly Sword, I think, would be an example of this: the game is running extremely slowly, and the graphics don't need AA; they look excellent as-is with no AA at all. So this is a situation where 2xAA would surely suffice, and the extra 5% framerate boost would be appreciated.

So long story short, it's entirely conceivable that MS is requiring 2xAA because it has literally no impact (<1%), while 4xAA does have an impact, however small; they are leaving that option open for developers.

Makes sense to me.

p.s. we can speculate all we want, but ATI has stated 4xAA is 95% efficient. So, unless you have PROOF that they are lying, your argument is nothing more than baseless skepticism and doubt.

We take Nvidia's word when they talk about RSX, and IBM's word when they release CELL benchmarks. If you start claiming companies are lying, where do you stop?

It's not 24 fps but 28.5 fps in the worst case, so that scenario is impossible.

And do you believe the "1080p 4xAA gameplay is here" chart from NVIDIA?!
Did you believe the real-time Toy Story claim from Sony?!
...etc.

Please answer me.
 