Is AF a bottleneck for Xenos?

Discussion in 'Console Technology' started by tema, Mar 11, 2006.

  1. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,992
    Likes Received:
    137
    Not to mention that it's an in-order processor too. Cache is there to hide latency, and one could argue an in-order processor needs to hide latency better than an OOOe processor, so it would need more cache. I'm also fully aware that more cache doesn't necessarily mean better, but, heck, you have HT P4s with 2MB. That's 1MB per thread...

    Only my opinion...
     
  2. zidane1strife

    Banned

    Joined:
    Feb 27, 2002
    Messages:
    899
    Likes Received:
    1
    Location:
    End of time
    And I've been out of the loop for a while, but weren't both the Cell PPE's and Xenon cores' caches higher latency than usual, lowering performance in these?
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    As I always like to point out at this juncture, often beaten by single-threaded A64s with 512KB of cache.

    And with 6 threads on Xenon, each thread is proceeding at an effective 1.6GHz. This obviously paints quite a different picture in terms of cycles of latency with L1 or L2 misses. A 41-cycle L1 miss in Xenon becomes a 21-cycle miss for a single thread if a core is running two threads symmetrically. And if a pre-fetch is set up correctly, this miss affects that thread only, with the other thread carrying on regardless.

    It's up to the devs to tweak their code using pre-fetching, hardware-thread priorities, cache-line data alignment, data-tiling, blah blah in order to maximise performance and minimise miss-induced stalls or flushes.

    It's pretty pointless saying the caches are too small without quantifying it - and we're utterly in the dark on that.

    A doubling of cache size generally brings 10% more performance - but again, that's a rule of thumb that doesn't necessarily account for code being rewritten specifically to match the cache sizes under consideration.

    Considering how big the dies are, modern GPUs really do have piddly amounts of cache. That's partly because they use vast register files (just a kind of memory accessed with a different set of patterns), but also because the useful lifetime of texels is pretty limited over the duration of a frame render.

    Jawed
     
  4. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,992
    Likes Received:
    137
    No need to point it out to me because I've already mentioned more doesn't mean better. But I suspect it's for the benefit of the reader...

    Yes, multi-threading is there to hide the latency too. But a G5's L2 latency is ~ 11 cycles...

    I agree and maybe they are struggling...

    I believe my point was clear. I even said "SEEMS".

    The 1MB cache for XeCPU shared by 6 threads across 3 cores *seems* too little to me and a likely cause of cache thrashing. Point being, cache thrashing causes increased and unpredictable B/W demand from XeCPU in a UMA, therefore Xenos being B/W starved, therefore lacking AF... the thread topic...
     
  5. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,992
    Likes Received:
    137
    I can't remember what the Cell PPE's is, but from the recent leak, the XeCPU memory latency is 525 cycles and the G5's ~205 cycles...
     
  6. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,992
    Likes Received:
    137
    ^^ you were referring to the cache latencies,

    G5 and XeCPU L2 hit latencies are 11 and 39 cycles respectively... (from the leak)...
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The document I have supersedes the leak.

    L1 miss is 41 cycles and L2 miss is 610 cycles - both are minima.

    L1 miss is longer on cores 1 and 2 because core 0 is closer to L2.

    All L2 misses are dependent on other workloads that L2 might be servicing as well as memory controller contention.

    Jawed
     
  8. Tap In

    Legend

    Joined:
    Jun 5, 2005
    Messages:
    6,382
    Likes Received:
    65
    Location:
    Gravity Always Wins
    sorry to go off topic slightly here again but....

    Maybe what you were experiencing was network lag.

    I tried (8 hours yesterday) to replicate this "slow-down" and it was smooth as silk; in grass, scoped, crawling, firing, with smoke, explosions... you name it.

    Now it does *by design* slow down your ability to move your weapon (aim) when zoomed in with a scope. Perhaps that's what you were noticing?
     
  9. persiannight

    Newcomer

    Joined:
    Dec 30, 2004
    Messages:
    35
    Likes Received:
    0
    In your dream world maybe...
    RSX will save us ALL!!!

    Seriously guys.... these are first gen 360 titles... why don't we wait a little while longer to start bitching about stupid crap. Does the lack of AF make the games not enjoyable??? To me, no. Find me a game, IMHO, that looks as good as PGR3 or GRAW that's currently on the market...
     
  10. ROG27

    Regular

    Joined:
    Oct 27, 2005
    Messages:
    572
    Likes Received:
    4
    Are you serious? That is quite a modest guesstimate of what the RSX might be, and it would only be logical that it would be that way, seeing how the PS3's architecture is set up. I assumed hardware developers realized early on that the 128-bit interface with memory was going to be a bottleneck and that they designed the GPU with larger caches to try and "hide" the latency. Enabling lockstepping between shader units and SPEs makes sense because of the massive amount of parallelization. A DMA controller in the GPU might make sense if the GPU is supposed to make direct calls to main memory (the XDR pool)... I'm unsure whether or not the GPU can bypass Cell in the process. Maybe someone more knowledgeable can clear that up. We know FlexIO is going to be implemented on the RSX.

    Thanks Jaws for clearing the CELL/RSX interface up for me. As for the Vertex and Pixel Shader Units, I was thinking and writing two different things. I meant to say single ALU for Vertex Shader Unit and dual ALU for Pixel Shader Unit.
     
  11. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    Just being pedantic....

    But bigger caches do not hide latency, they provide reduced latency for data that's in them.

    Also, a bigger cache would not overcome a bandwidth deficit.
    It would potentially allow you to sustain more textures accessed in a cache-friendly fashion (although this would be dependent on cache architecture as much as size) or have better performance on random access patterns.
    It would also be useful if you had increased latency from your memory pool, since you'd require a larger number of outstanding requests to pre-fill cache lines.

    The reason that you don't need really huge caches for textures is that regular texture accesses are entirely predictable. The chip can prefetch all the data it's going to need, and that data is reused many times over a very short time span, so it doesn't lock the cache down.
    Start not mipmapping, setting a negative LOD bias, or writing shaders that do random indirections, and all bets are off. A larger cache may limit the cost of these things.
     
  12. Lysander

    Regular

    Joined:
    Sep 3, 2005
    Messages:
    532
    Likes Received:
    5
    Thanks.
     
  13. pipo

    Veteran

    Joined:
    Jun 8, 2005
    Messages:
    2,628
    Likes Received:
    30
    Found this little tidbit, make of it what you want. :)

    http://interviews.teamxbox.com/xbox/1190/Xbox-360-Interview-Todd-Holmdahl/p4/
     
  14. Farid

    Farid Artist formely known as Vysez
    Veteran Subscriber

    Joined:
    Mar 22, 2004
    Messages:
    3,844
    Likes Received:
    108
    Location:
    Paris, France
     
  15. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I don't know why you guys are going on and on about bandwidth. Anisotropic filtering consumes clock cycles but not much bandwidth unless you're using uncompressed textures. Since you're consuming clock cycles, you have more bandwidth per pixel anyway.

    Whether AF is enabled or not, Xenos can only do 16 bilinear samples per clock. With >2xAF (actual) applied per pixel, only 2 of the 4 texels used in each sample won't be shared with neighbouring pixels. At 4-bits per compressed texel, that's 8GB/s peak.

    nAo, your observation that geometry is used for the road lines is quite insightful though, and looks very likely because the grain within the road lines and markers definitely gets blurred as if without aniso. It seems pretty smart because roads are quite low angle and will take a big hit with AF. However, I have played racing games that don't do this and have very blurry road lines.

    Still, it would be nice if the grain in the road was sharp everywhere. Even if only one of the textures used AF, and they used a stretched texture to reduce the impact, that would be nice.

    (Aside: nAo, I was hoping you'd address my post in the other bandwidth thread, if you have time)
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Thank you!

    Soooo many people here think caches will save the bandwidth problems of RSX. The only way it helps is if the entire texture fits in the cache (unlikely for any reasonably detailed texture); moreover, texture bandwidth won't be the limiting factor anyway, especially if devs use compressed textures very extensively.

    The primary purpose of texture caches is to make sure data isn't loaded redundantly when accessed by a local group of pixels. This way the 4 samples needed by a bilinear filter don't need 4 times the BW of point sampling, but instead only a few percent more. Graphics cards get pretty close to the ideal minimum texture bandwidth: (texture data in a region) / (pixels in that region) * (pixel rate for a shader).
     
    #96 Mintmaster, Mar 12, 2006
    Last edited by a moderator: Mar 12, 2006
  17. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Well, if you can give us a better potential explanation as to why it's just not being used in many games, do tell us?

    I'm surprised myself because the theoretical bandwidth requirement I've seen quoted isn't THAT high. Which is why I'm wondering if things are so tight in terms of bw that even a small amount extra would be the straw that broke the camel's back in many of these cases. The cost could be small, yet still too much, if BW was sufficiently scarce.
     
    #97 Titanio, Mar 12, 2006
    Last edited by a moderator: Mar 12, 2006
  18. aaaaa00

    Regular

    Joined:
    Jul 24, 2002
    Messages:
    790
    Likes Received:
    23
    ERP already explained this.

    If you're running behind schedule, and you've got performance problems, and you don't have time to figure out what the real problem is and fix it, you just start turning things off to make your ship date.

    AF is one easy thing to turn off, and even if it gets you just 1% back, that's 1% closer to your ship criteria than you were before.
     
  19. zidane1strife

    Banned

    Joined:
    Feb 27, 2002
    Messages:
    899
    Likes Received:
    1
    Location:
    End of time
    But why is it turned off in PR screenshots for games that might've had months to go before going gold?
     
  20. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    This is all true, but it all fundamentally ties back to performance, which must have technical reasoning behind it, which I think is what we're trying to expose here, no?

    Do you get that x% back because you've alleviated bandwidth, or something else? And regardless, doesn't turning x, y and z off have to relieve the same bound that may actually be caused by a or b in order to have an impact? For example (and to keep with the bw theme), "something in our CPU code was causing excessive bandwidth consumption, but instead of figuring that out, we turned off some other bw consumers like AF to relieve that bound". For whatever reason, a bound has been created in these titles that encourages AF to be turned off - or so it would seem - and to find an explanation for that it's more interesting to look at where that bound is than at what's causing it (be it poor coding or the 'productive demands' of the game, so to speak).

    That said, it seems to be too much of a trend for each of these titles to have turned it off because they were blindly trying to get performance up to an acceptable level to meet a deadline (IMO).
     
    #100 Titanio, Mar 12, 2006
    Last edited by a moderator: Mar 13, 2006