Is AF a bottleneck for Xenos?

Jawed said:
"Average" 170KB of L2 cache per thread, plus 16KB instruction and 16KB data cache = 202KB of cache per thread, with support for data to be read from memory direct into L1 without consuming L2, and for data to be written direct into L2 (aka cache locking) and be consumed by the GPU without touching memory.

The L2 cache is 8-way, so some threads will have more while others have less.

If the XB360 cache model were as naive as you paint it, then maybe it would be in trouble.

Jawed

Not to mention that it's an in-order processor too. Cache is there to hide latency and one could argue an in-order processor needs to hide latency better than an OOOe processor and it would need more cache. I'm also fully aware that more cache doesn't necessarily mean better too, but, heck, you have HT P4's with 2MBs. That's 1MB per thread...

Only my opinion...
 
Jaws said:
Not to mention that it's an in-order processor too. Cache is there to hide latency and one could argue an in-order processor needs to hide latency better than an OOOe processor and it would need more cache. I'm also fully aware that more cache doesn't necessarily mean better too, but, heck, you have HT P4's with 2MBs. That's 1MB per thread...

Only my opinion...

And I've been out of the loop for a while, but weren't both the Cell PPE's and Xenon cores' caches higher latency than usual, lowering performance in these?
 
Jaws said:
Not to mention that it's an in-order processor too. Cache is there to hide latency and one could argue an in-order processor needs to hide latency better than an OOOe processor and it would need more cache. I'm also fully aware that more cache doesn't necessarily mean better too, but, heck, you have HT P4's with 2MBs. That's 1MB per thread...
As I always like to point out at this juncture, often beaten by single-threaded A64s with 512KB of cache.

And with 6 threads on Xenon, each thread is proceeding at an effective 1.6GHz. This obviously paints quite a different picture in terms of cycles of latency with L1 or L2 misses. A 41-cycle L1 miss in Xenon becomes a 21-cycle miss for a single thread if a core is running two threads symmetrically. And if a pre-fetch is set up correctly, this miss affects that thread only, with the other thread carrying on regardless.

It's up to the devs to tweak their code using pre-fetching, hardware-thread priorities, cache-line data alignment, data-tiling, blah blah in order to maximise performance and minimise miss-induced stalls or flushes.
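
(As a concrete illustration of the kind of pre-fetching being described, and not anything from Jawed's post: a minimal C++ sketch using GCC's __builtin_prefetch. The 128-byte line size and the prefetch distance are assumptions; on Xenon you would reach for the toolchain's dcbt-style intrinsic instead.)

Code:
#include <cstddef>

// Hypothetical streaming loop: walk a large array while pre-fetching a few
// cache lines ahead, so the long miss latency overlaps with work on data
// that is already resident.
void accumulate(const float* src, float* dst, std::size_t count)
{
    const std::size_t kLineFloats   = 128 / sizeof(float); // 32 floats per 128-byte line (assumed)
    const std::size_t kPrefetchDist = 8 * kLineFloats;     // ~8 lines ahead; tune per title

    for (std::size_t i = 0; i < count; ++i)
    {
        if (i + kPrefetchDist < count)
            __builtin_prefetch(&src[i + kPrefetchDist], 0 /*read*/, 1 /*moderate locality*/);

        dst[i] += src[i];
    }
}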

It's pretty pointless saying the caches are too small without quantifying it - and we're utterly in the dark on that.

A doubling of cache size generally brings 10% more performance - but again, that's a rule of thumb that doesn't necessarily account for code being rewritten specifically to match the cache sizes under consideration.

Considering how big the dies are, modern GPUs really do have piddly amounts of cache. That's partly because they use vast register files (just a kind of memory accessed with a different set of patterns), but also because the useful lifetime of texels is pretty limited over the duration of a frame render.

Jawed
 
Jawed said:
As I always like to point out at this juncture, often beaten by single-threaded A64s with 512KB of cache.

No need to point it out to me because I've already mentioned more doesn't mean better. But I suspect it's for the benefit of the reader...

And with 6 threads on Xenon, each thread is proceeding at an effective 1.6GHz. This obviously paints quite a different picture in terms of cycles of latency with L1 or L2 misses. A 41-cycle L1 miss in Xenon becomes a 21-cycle miss for a single thread if a core is running two threads symmetrically. And if a pre-fetch is set up correctly, this miss affects that thread only, with the other thread carrying on regardless.

Yes, multi-threading is there to hide the latency too. But a G5's L2 latency is ~ 11 cycles...

It's up to the devs to tweak their code using pre-fetching, hardware-thread priorities, cache-line data alignment, data-tiling, blah blah in order to maximise performance and minimise miss-induced stalls or flushes.

I agree and maybe they are struggling...

It's pretty pointless saying the caches are too small without quantifying it - and we're utterly in the dark on that.

I believe my point was clear. I even said "SEEMS".

The 1MB cache for XeCPU, shared by 6 threads across 3 cores, *seems* too little to me and a likely cause of cache thrashing. The point being that cache thrashing causes increased and unpredictable B/W demand from XeCPU in a UMA, therefore Xenos being B/W starved, therefore lacking AF... the thread topic...
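
(To put a rough, entirely made-up number on that claim: if the combined working set of all six threads overflows the shared 1MB L2, every pass over that data is re-streamed from memory. A back-of-the-envelope sketch, where the working set size and pass count are hypothetical:)

Code:
// Made-up numbers, purely to illustrate the mechanism being claimed: once the
// combined working set of all six threads overflows the shared 1MB L2, every
// pass over that data gets streamed from GDDR3 again over the shared UMA bus
// instead of hitting cache.
constexpr double kWorkingSetMB   = 4.0;  // combined working set (hypothetical)
constexpr int    kPassesPerFrame = 3;    // times the data is walked per frame (hypothetical)
constexpr int    kFPS            = 30;

// Extra, bursty memory traffic caused by the re-fetches (~360 MB/s here);
// if the working set fit in L2, most of these accesses would be cache hits.
constexpr double kThrashMBps = kWorkingSetMB * kPassesPerFrame * kFPS;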
 
zidane1strife said:
And I've been out of the loop for a while, but weren't both the Cell PPE's and Xenon cores' caches higher latency than usual, lowering performance in these?

I can't remember what the CELL's PPE is, but from the recent leak, the XeCPU memory latency is 525 cycles and the G5's ~ 205 cycles...
 
^^ You were referring to the caches' latencies:

G5 and XeCPU L2 hit latency: 11 and 39 cycles respectively... (from the leak)...
 
The document I have supersedes the leak.

L1 miss is 41 cycles and L2 miss is 610 cycles - both are minima.

L1 miss is longer on cores 1 and 2 because core 0 is closer to L2.

All L2 misses are dependent on other workloads that L2 might be servicing as well as memory controller contention.

Jawed
 
Guilty Bystander said:
Good for him. I have the final game running on my Xbox 360 right now.
When you're crouched in the grass and you use the scope, GRAW only runs at like 10-15fps.

sorry to go off topic slightly here again but....

Maybe what you were experiencing was network lag.

I tried (8 hours yesterday) to replicate this "slow-down" and it was smooth as silk; in grass, scoped, crawling, firing, with smoke, explosions... you name it.

Now it does *by design* slow down your ability to move your weapon (aim) when zoomed in with a scope. Perhaps that's what you were noticing?
 
ROG27 said:
Larger caches and buffers in the GPU are going to be necessary in a closed-box system constrained by a 128-bit memory interface. In order to compensate for the latency, larger caches built into the RSX's pipelines will help keep things flowing in the RSX and between the RSX and CELL. If, like in PCs, the memory interface were 256 bits wide (or greater), this wouldn't be necessary and more of the transistor budget would be allocated to core logic. But because it is important that the console GPUs take easily to a die shrink for cost-reduction purposes (thus the 128-bit memory interface), this isn't the case.

IMO this will be the high-level feature set of the RSX:

-8 single issue vertex shader units
-24 dual issue pixel shader units
-larger than typical caches found in PC parts
-128-bit memory interface with GDDR3 memory
-128-bit interface with CELL
-logic that allows for the enabling of lockstepping between shader units and SPEs
-DMA controller
-FlexIO
-550MHz internal clock speed

In your dream world maybe...
RSX will save us ALL!!!

Seriously guys.... these are first-gen 360 titles... why don't we wait a little while longer to start bitching about stupid crap. Does the lack of AF make the games not enjoyable??? To me, no. Find me a game, IMHO, that looks as good as PGR3 or GRAW that's currently on the market...
 
persiannight said:
In your dream world maybe...
RSX will save us ALL!!!

Are you serious? That is quite a modest guesstimate of what the RSX might be, and it would only be logical that it would be that way given how the PS3's architecture is set up. I assumed hardware developers realized early on that the 128-bit interface with memory was going to be a bottleneck and that they designed the GPU with larger caches to try and "hide" the latency. Enabling lockstepping between shader units and SPEs makes sense because of the massive amount of parallelization. A DMA controller in the GPU might make sense if the GPU is supposed to make direct calls to main memory (the XDR pool)... I'm unsure whether or not the GPU can bypass CELL in the process. Maybe someone more knowledgeable can clear that up. We know FlexIO is going to be implemented on the RSX.

Thanks Jaws for clearing the CELL/RSX interface up for me. As for the Vertex and Pixel Shader Units, I was thinking and writing two different things. I meant to say single ALU for Vertex Shader Unit and dual ALU for Pixel Shader Unit.
 
Just being pedantic...

But bigger caches do not hide latency; they provide reduced latency for data that's in them.

Also, a bigger cache would not overcome a bandwidth deficit.
It would potentially allow you to sustain more textures accessed in a cache-friendly fashion (although this would be dependent on cache architecture as much as size), or have better performance on random access patterns.
It would also be useful if you had increased latency from your memory pool, since you'd require a larger number of outstanding requests to pre-fill cache lines.
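
(A rough back-of-the-envelope on that last point about outstanding requests, not from ERP's post: the data that must be in flight to cover a latency is just bandwidth times latency. The 128-byte line size, the ~525-cycle figure quoted earlier in the thread, and the 360's 22.4GB/s GDDR3 peak are used here only as stand-in numbers:)

Code:
// Bandwidth-delay product: how much data must be in flight to sustain peak
// bandwidth across a given memory latency. All numbers are stand-ins.
constexpr double kPeakGBps  = 22.4;          // 360's GDDR3 peak (128-bit @ 700MHz)
constexpr double kLatencyNs = 525.0 / 3.2;   // ~525 CPU cycles at 3.2GHz, ~164ns
constexpr double kLineBytes = 128.0;         // assumed cache-line size

constexpr double kBytesInFlight = kPeakGBps * kLatencyNs;         // GB/s * ns = bytes (~3.7KB)
constexpr double kLinesInFlight = kBytesInFlight / kLineBytes;    // ~29 outstanding line fills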

The reason that you don't need really huge caches for textures is that regular texture accesses are entirely predictable. The chip can prefetch all the data it's going to need, and that data is reused many times over a very short time span, so it doesn't lock the cache down.
Start not mipmapping, setting a negative LOD bias, or writing shaders that do random indirections, and all bets are off. A larger cache may limit the cost of these things.
 
Tap In said:
sorry to go off topic slightly here again but....

Maybe what you were experiencing was network lag.

I tried (8 hours yesterday) to replicate this "slow-down" and it was smooth as silk; in grass, scoped, crawling, firing, with smoke, explosions... you name it.

Now it does *by design* slow down your ability to move your weapon (aim) when zoomed in with a scope. Perhaps that's what you were noticing?

Thanks.
 
The Question:
persiannight said:
Does the lack of AF make the games not enjoyable?
The Answer:
Beyond3D Forum > Other 3D > Console Talk > Console Technology
 
Titanio said:
If it fundamentally boils down to bandwidth, you're not going to find any more over time (unfortunately).
nAo said:
C'mon Dave, have you checked all the racing games out there?
Obviously I can't be 100% sure they're being clever, but it does not take a rocket scientist to not waste tons of bandwidth on using max aniso (road markings are bloody perfect in those shots) on all the road.
I don't know why you guys are going on and on about bandwidth. Anisotropic filtering consumes clock cycles but not much bandwidth unless you're using uncompressed textures. Since you're consuming clock cycles, you have more bandwidth per pixel anyway.

Whether AF is enabled or not, Xenos can only do 16 bilinear samples per clock. With >2xAF (actual) applied per pixel, only 2 of the 4 texels used in each sample won't be shared with neighbouring pixels. At 4-bits per compressed texel, that's 8GB/s peak.
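
(Spelling that arithmetic out; the 500MHz core clock is an assumption, and the rest comes from the paragraph above:)

Code:
// Mintmaster's peak figure, spelled out. The 500MHz Xenos core clock is an
// assumption (it's what makes the result land on 8GB/s); the rest is from the
// post above: 16 bilinear samples per clock, ~2 of the 4 texels per sample not
// shared with neighbouring pixels under >2x aniso, 4 bits per compressed texel.
constexpr double kClockHz        = 500e6;
constexpr int    kBilinearPerClk = 16;
constexpr int    kNewTexels      = 2;
constexpr double kBytesPerTexel  = 0.5;   // DXT-style 4bpp

constexpr double kPeakTexelGBps =
    kClockHz * kBilinearPerClk * kNewTexels * kBytesPerTexel / 1e9;   // = 8.0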

nAo, your observation that geometry is used for the road lines is quite insightful though, and looks very likely because the grain within the road lines and markers definitely gets blurred as if without aniso. It seems pretty smart because roads are quite low angle and will take a big hit with AF. However, I have played racing games that don't do this and have very blurry road lines.

Still, it would be nice if the grain in the road was sharp everywhere. Even if only one of the textures used AF, and they used a stretched texture to reduce the impact, that would be nice.

(Aside: nAo, I was hoping you'd address my post in the other bandwidth thread, if you have time)
 
ERP said:
Just being pedantic...

But bigger caches do not hide latency; they provide reduced latency for data that's in them.

Also, a bigger cache would not overcome a bandwidth deficit.
Thank you!

Soooo many people here think caches will save the bandwidth problems of RSX. The only way it helps is if the entire texture fits in the cache (unlikely for any reasonably detailed texture); moreover, texture bandwidth won't be the limiting factor anyway, especially if devs use compressed textures very extensively.

The primary purpose of texture caches is to make sure data isn't loaded redundantly when accessed by a local group of pixels. This way the 4 samples needed by a bilinear filter don't need 4 times the BW of point sampling, but instead only a few percent more BW. Graphics cards get pretty close to the ideal minimum texture bandwidth: (texture data in a region) / (pixels in that region) * (pixel rate for a shader).
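
(A quick sketch of that ideal-minimum formula with illustrative numbers; the texel density, texture format, layer count and pixel rate are all assumptions:)

Code:
// Mintmaster's ideal minimum, (texture data in a region) / (pixels in that
// region) * (pixel rate), with illustrative numbers that are all assumptions:
// ~1:1 texel:pixel density (i.e. correct mipmapping), DXT-style 4bpp textures,
// two texture layers per pixel, and a made-up 4 Gpixel/s shading rate.
constexpr double kBytesPerTexel = 0.5;
constexpr double kTexelsPerPx   = 1.0;
constexpr int    kLayers        = 2;
constexpr double kPixelRate     = 4e9;

constexpr double kMinTextureGBps =
    kBytesPerTexel * kTexelsPerPx * kLayers * kPixelRate / 1e9;   // = 4.0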
 
Mintmaster said:
I don't know why you guys are going on and on about bandwidth.

Well, if you can give us a better potential explanation as to why it's just not being used in many games, do tell us?

I'm surprised myself because the theoretical bandwidth requirement I've seen quoted isn't THAT high. Which is why I'm wondering if things are so tight in terms of bw that even a small amount extra would be the straw that broke the camel's back in many of these cases. The cost could be small, yet still too much, if BW was sufficiently scarce.
 
Titanio said:
Well, if you can give us a better potential explanation as to why it's just not being used in many games, do tell us?

ERP already explained this.

If you're running behind schedule, and you've got performance problems, and you don't have time to figure out what the real problem is and fix it, you just start turning things off to make your ship date.

AF is one easy thing to turn off, and even if it gets you just 1% back, that's 1% closer to your ship criteria than you were before.
 
aaaaa00 said:
ERP already explained this.

If you're running behind schedule, and you've got performance problems, and you don't have time to figure out what the real problem is and fix it, you just start turning things off to make your ship date.

AF is one easy thing to turn off, and even if it gets you just 1% back, that's 1% closer to your ship criteria than you were before.
But why is it turned off in PR screenshots for games that might've had months to go before going gold?
 
aaaaa00 said:
ERP already explained this.

If you're running behind schedule, and you've got performance problems, and you don't have time to figure out what the real problem is and fix it, you just start turning things off to make your ship date.

AF is one easy thing to turn off, and even if it gets you just 1% back, that's 1% closer to your ship criteria than you were before.

This is all true, but it all fundamentally ties back to performance, which must have technical reasoning behind it, which I think is what we're trying to expose here, no?

Do you get that x% back because you've alleviated bandwidth, or something else..? And regardless, doesn't turning x, y and z off have to relieve the same bound that may be actually caused by a or b in order to have an impact? For example (and to keep with the bw theme), "something in our cpu code was causing excessive bandwidth consumption but instead of figuring that out, we turned off some other bw consumers like AF to relieve that bound"? For whatever reason a bound has been created in these titles that encourages AF to be turned off - or so it would seem- and I guess to find an explanation for that it's more interesting to look at where that bound is rather than what's causing it (be it poor coding or the 'productive demands' of the game, so to speak).

That said, it seems to be too much of a trend for each of these titles to have turned it off because they were blindly trying to get performance up to an acceptable level to meet a deadline (IMO).
 