Alternative AA methods and their comparison with traditional MSAA

Discussion in 'Rendering Technology and APIs' started by mitran, Nov 15, 2009.

  1. ihamoitc2005

    Veteran

    Cache miss

    I agree with this, my friend. But what worries me is the domino effect of these tiles occupying a large part of the (small) L2 and causing cache misses for the other threads.
     
  2. liolio

    liolio Aquoiboniste
    Legend

    You don't get streaming... nor the fact that bandwidth for moving 7 MB is not the limiting factor, especially over a window longer than 10 ms.
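    As a rough back-of-the-envelope for that point (the 7 MB figure is from the post above; the 10 GB/s sustained bandwidth below is only an assumed ballpark for illustration):
    [code]
    /* Back-of-the-envelope: time to move ~7 MB at a few GB/s.
       The 10 GB/s figure is an illustrative assumption, not a measured
       number for any particular console bus. */
    #include <stdio.h>

    int main(void)
    {
        const double buffer_mb     = 7.0;    /* size quoted in the thread   */
        const double bandwidth_gbs = 10.0;   /* assumed sustained bandwidth */
        const double window_ms     = 10.0;   /* time window quoted above    */

        /* GB/s -> MB/ms, then divide the buffer size by it. */
        double transfer_ms = buffer_mb / (bandwidth_gbs * 1024.0) * 1000.0;
        printf("Moving %.1f MB at %.1f GB/s takes ~%.2f ms (vs. a %.1f ms window)\n",
               buffer_mb, bandwidth_gbs, transfer_ms, window_ms);
        return 0;
    }
    [/code]
    Under those assumptions the transfer itself is well under a millisecond, which is why raw buffer movement is not where the time goes.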
     
  3. patsu

    Legend

    http://forum.beyond3d.com/showpost.php?p=1433406&postcount=443

    If the problem size is large enough, you can still spread the load across many, many processors, hence massively parallelizable. But there is an inherent data dependency due to the imposed order: you may need to write intermediate or final results back before fetching new data.

    I didn't say the bandwidth is the issue. I'm just saying you can't plug SPU numbers into the 360 to derive more 360 numbers. There may be assumptions that no longer hold, and the algorithm/implementation may also differ between architectures after optimization. E.g., the GPU MLAA paper seems to tweak the original algorithm to make it run fast on GPUs (there are some precomputed factors involved, but I'm not sure what they really are).

    If I remember correctly, the L1 cache/LocalStore is a few times faster than the L2 cache. The 4 ms (or 20 ms) figures for SPU MLAA are achieved in that kind of hardware environment. They also use DMA to load the LocalStore asynchronously and completely under the programmer's control, so the SPE can go ahead and do other work at the right time.
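    To illustrate the "DMA ahead while the SPU works" pattern being described, here is a minimal double-buffering sketch, assuming the CBE SDK's spu_mfcio.h intrinsics. The tile size, the effective-address layout and process_tile() are placeholders for illustration, not anything from an actual MLAA implementation:
    [code]
    /* Minimal double-buffered SPU loop: DMA the next tile into one
       LocalStore buffer while the SPU processes the other. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TILE_BYTES (16 * 1024)   /* 16 KB: the MFC's max single transfer */
    #define NUM_TILES  64            /* illustrative tile count              */

    static volatile uint8_t ls_buf[2][TILE_BYTES] __attribute__((aligned(128)));

    /* Stand-in for the real per-tile work (edge detection, blending, ...). */
    static void process_tile(volatile uint8_t *tile, unsigned bytes)
    {
        for (unsigned i = 0; i < bytes; ++i)
            tile[i] = (uint8_t)~tile[i];
    }

    static void mlaa_like_loop(uint64_t ea_base)
    {
        unsigned cur = 0;

        /* Prime the pipeline: fetch tile 0 into buffer 0 under tag 0. */
        mfc_get(ls_buf[cur], ea_base, TILE_BYTES, cur, 0, 0);

        for (unsigned i = 0; i < NUM_TILES; ++i) {
            unsigned next = cur ^ 1;

            if (i + 1 < NUM_TILES) {
                /* Make sure any earlier write-back from the other buffer has
                   drained, then start fetching the next tile into it. That
                   DMA overlaps with the processing below. */
                mfc_write_tag_mask(1u << next);
                mfc_read_tag_status_all();
                mfc_get(ls_buf[next], ea_base + (uint64_t)(i + 1) * TILE_BYTES,
                        TILE_BYTES, next, 0, 0);
            }

            /* Wait only for the tile we are about to process. */
            mfc_write_tag_mask(1u << cur);
            mfc_read_tag_status_all();

            process_tile(ls_buf[cur], TILE_BYTES);

            /* Write the result back asynchronously and move on. */
            mfc_put(ls_buf[cur], ea_base + (uint64_t)i * TILE_BYTES,
                    TILE_BYTES, cur, 0, 0);

            cur = next;
        }

        /* Drain all outstanding DMAs before returning. */
        mfc_write_tag_mask(3u);
        mfc_read_tag_status_all();
    }

    int main(unsigned long long spe_id, unsigned long long argp,
             unsigned long long envp)
    {
        (void)spe_id; (void)envp;
        mlaa_like_loop(argp);   /* argp: effective address of the buffer */
        return 0;
    }
    [/code]
    The point of the pattern is that the wait is on a tag group the programmer chose, so the fetch of tile i+1 proceeds while tile i is being processed, with no cache hierarchy in the way.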
     
  4. liolio

    liolio Aquoiboniste
    Legend

    OK, I give up; believe whatever you want... It won't change the fact that MLAA is not possible on Xenon, because it doesn't have the horsepower to do it.
    Patsu, what you say makes no sense; sorry to be blunt, but that's it.
    How can you say it's not possible to calculate the time to move a buffer based on bandwidth?
    Do you think MLAA is bandwidth limited?
    Obviously pixels are linked, but you still have to split the frame buffer. If the data set were large enough, with a lot of dependencies, it would not scale well. Your idea of data parallelism is messed up.

    The SPU doing something else while applying MLAA makes no sense either, IMHO, especially if you consider double buffering as a possibility.
     
    #644 liolio, Jul 9, 2010
    Last edited by a moderator: Jul 9, 2010
  5. patsu

    Legend

    You can. But moving a buffer in one go is not MLAA. If you want to derive bandwidth usage for MLAA, you would need to look at the MLAA implementation and its access pattern.

    See my posts above, especially "I didn't say the bandwidth is the issue."

    When there is an imposed order on the data, the CPU will need to write results back into main memory before processing the dependent data. This may trigger cache-coherency logic for the L2 (even if it's locked?), slowing down memory access compared to the completely separate, cache-logic-free LocalStore.

    As for the SPU doing something else, I meant the SPU can do something else while the DMA controller is fetching data into the LocalStore. A regular CPU must always go through the same automatic L1 -> L2 -> main memory chain, and sometimes it has to wait for the memory subsystem to come back, especially when cache logic is involved.
     
  6. liolio

    liolio Aquoiboniste
    Legend

    But which "memory"? What do they consider the neighbourhood of a pixel? Do you think a pixel would be compared to every single pixel of a 48x48 tile (half of Xenon's L1 data cache)? I think not, so the tile size could be really tiny and still fit in the L1 data cache (most likely the relevant tile could be far smaller than the 64x64 I stated).
    I still can't see the difference between internal bandwidth and external bandwidth, and you don't seem to understand data locality.
    Sorry, that's completely wrong.
    You don't know how coherency is handled in Xenon; as in other modern processors, it's neither strictly inclusive nor strictly exclusive, it's something cleverer. I don't know either, it's not public (nor is it public for Intel or AMD). But it's clear you didn't know what inclusive or exclusive L1 & L2 caches are.
    Say it's inclusive: one line of L1 is modified, so L2 has to be modified too. In our case that's "free"; the CPU still keeps loading data from the I$ and D$.
    There could be contention, but not in the case of a data-parallel workload: only one core works on a given set of data, so its L2 values won't get overwritten by another core.

    Overall there is no magic about the SPUs; they are like old processors that didn't have caches.
    As friendly advice, go find the old Ars Technica articles about CPUs, pipelining, etc.
    They put it really nicely; I managed to get through them with little prior knowledge.

    A coherent memory space and caches don't mean that developers control nothing (especially on a "primitive" CPU like Xenon, which is in-order, with no speculative execution, etc.) or that everything just works by itself. They just help (but cost silicon, power, etc.).
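    As a quick sanity check on the tile-size point, here is the footprint arithmetic for a few tile sizes against the 32 KB L1 data cache usually quoted for Xenon's cores; 4 bytes per pixel (e.g. an RGBA8 buffer) is an assumption for illustration:
    [code]
    /* Rough tile-footprint check against a 32 KB L1 data cache. */
    #include <stdio.h>

    int main(void)
    {
        const int l1_data_bytes   = 32 * 1024;  /* commonly quoted for Xenon */
        const int bytes_per_pixel = 4;          /* assumed: RGBA8            */
        const int tile_sizes[]    = { 32, 48, 64, 96 };
        const int n = sizeof tile_sizes / sizeof tile_sizes[0];

        for (int i = 0; i < n; ++i) {
            int t = tile_sizes[i];
            int footprint = t * t * bytes_per_pixel;
            printf("%3dx%-3d tile: %6d bytes (%5.1f%% of a 32 KB L1 D-cache)\n",
                   t, t, footprint, 100.0 * footprint / l1_data_bytes);
        }
        return 0;
    }
    [/code]
    Under those assumptions even a 64x64 tile is only half of the L1 data cache, so the working set per tile can indeed be kept cache-resident.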
     
  7. patsu

    Legend

    I have no idea. The GPU implementation uses a 512x512 precomputed area-table texture to filter lines of up to 512 pixels, for example. You'd have to look at the individual MLAA algorithms.

    It is true that modern CPUs have various write-back, write-through or even bypass schemes for their caches. I believe the PowerPC family uses a MERSI scheme, and that under this scheme you pay when you read a "dirty" location.

    They replaced the traditional memory hierarchy with the LocalStore to save space/heat, speed up access and avoid the "memory wall" problem (for multi-core access). In the process, they traded away ease of programming.
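    For anyone trying to follow what "the algorithm" actually involves: MLAA implementations broadly (1) find colour/depth discontinuities, (2) work out the length and shape of each edge run to derive blending weights (the GPU version looks these up in that precomputed area texture), and (3) blend each pixel with its neighbours. Below is a minimal sketch of just step (1) on a grayscale buffer; the threshold and the synthetic image are made up for illustration, and steps (2) and (3) are where the real work and the platform-specific tricks live:
    [code]
    /* Step 1 of an MLAA-style pass: mark discontinuities between a pixel
       and its right/bottom neighbours. Real implementations use colour
       and/or depth deltas and feed the mask into weight and blend passes. */
    #include <stdio.h>
    #include <stdlib.h>

    #define W 8
    #define H 8
    #define THRESHOLD 16   /* assumed luminance delta that counts as an edge */

    enum { EDGE_RIGHT = 1, EDGE_BOTTOM = 2 };

    static void detect_edges(const unsigned char *lum, unsigned char *edges)
    {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                unsigned char e = 0;
                int c = lum[y * W + x];
                if (x + 1 < W && abs(c - lum[y * W + x + 1]) > THRESHOLD)
                    e |= EDGE_RIGHT;
                if (y + 1 < H && abs(c - lum[(y + 1) * W + x]) > THRESHOLD)
                    e |= EDGE_BOTTOM;
                edges[y * W + x] = e;
            }
        }
    }

    int main(void)
    {
        unsigned char lum[W * H], edges[W * H];

        /* Synthetic image: dark left half, bright right half -> one edge. */
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                lum[y * W + x] = (x < W / 2) ? 40 : 200;

        detect_edges(lum, edges);

        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x)
                putchar((edges[y * W + x] & EDGE_RIGHT) ? '|' : '.');
            putchar('\n');
        }
        return 0;
    }
    [/code]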
     
  8. assen

    Veteran

    OK, gotta admit I was wrong on that. Looking forward to using what you and the SCEE team did one day in the future.
     
  9. Neb

    Neb Iron "BEAST" Man
    Legend

    It would be interesting to know how the AA solutions in current 360 games relate to Intel's MLAA or similar. There is the GOW "MLAA" (I put it in quotes because AFAIK it is a derivative of Intel's MLAA, for better or worse, with no public papers out). BTW, didn't the original Intel MLAA run like a dog at 120 ms on Cell at the initial stage? How would things look for the 360 regarding MLAA or "MLAA" with some work, maybe yes, maybe no? :smile:
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Interesting :) That's strange; is it from the presentation linked on Dave's blog?
    Interesting too. I did a search; for those interested, look up the MESI and MERSI protocols, it shouldn't hurt either.
    I had never read about that :)
    So let's try to work out what would happen in our case.
    * The data is present in both L1 and L2.
    So the cache lines should be in the S state: clean and readable (a write requires taking ownership).

    * Then the CPU issues a "read for ownership".
    So the L1 cache line is read and the L2 cache line is set to invalid.

    * Now the line is in the M state and differs from the value in RAM.
    The line is still readable, but the hardware has to block access to that line in main RAM until the data in RAM has been updated.

    From the third link I gave above: supposedly the MERSI protocol uses a write-back policy, but we only know that for sure for the G4.
    In that PDF I read that Xenon is MESI & write-through, with a lot of store buffering.

    * So we have two cases:
    - The cache line is only read, nobody tries to overwrite it. At some point, when the write-back happens, it will be changed to the E state (so it is writable, or can be evicted if needed).
    In that case everything is fine; the "cost" of coherency is essentially free, in the sense that you pay for it through higher latencies to your caches. For coherency traffic to main RAM you have 5.4 GB/s.

    - The CPU wants to overwrite the cache line while the same cache line is still in the M state... execution should stall until the state changes to E.
    Here it gets costly, as you lose cycles.
    From the same link as above:

    So that's where it's tricky, and maybe someone could chime in :)
    In Xenon, when is the cache line's state changed?
    At the moment the data is put 1) in the "store buffer", or 2) when it leaves it (towards the RAM)?
    Case 1) looks nice: execution can resume, and as long as snooping works properly it should be OK.

    Overall, if the caches work properly, I can't see coherency being what prevents Xenon from achieving MLAA @ 720p/30fps. From what I understand, you always pay the cost of coherency through higher latencies (or, depending on how Xenon works, extra silicon and power consumption).
    Tile size could be a problem (it should not exceed the L1 data cache).

    I still put my bet on raw horsepower as the limiting factor.
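    To make the state walk-through above easier to follow, here is a toy model of plain MESI transitions for a single cache line: local reads/writes plus snooped remote requests. It is a textbook simplification; Xenon's actual protocol and its store-buffer behaviour are not public, and MERSI's R state is left out:
    [code]
    /* Toy single-line MESI model: textbook transitions only, no store
       buffering and no MERSI 'R' state. */
    #include <stdio.h>

    typedef enum { I_STATE, S_STATE, E_STATE, M_STATE } mesi_t;

    static const char *name(mesi_t s)
    {
        static const char *n[] = { "I", "S", "E", "M" };
        return n[s];
    }

    /* Local read. other_sharers: does any other cache hold the line? */
    static mesi_t local_read(mesi_t s, int other_sharers)
    {
        if (s == I_STATE)                       /* miss: fill from memory/peer */
            return other_sharers ? S_STATE : E_STATE;
        return s;                               /* hit in S/E/M: no change     */
    }

    /* Local write. From I or S this needs a bus transaction
       (read-for-ownership / invalidate); from E it upgrades silently. */
    static mesi_t local_write(mesi_t s)
    {
        (void)s;
        return M_STATE;
    }

    /* Another core reads the line: M is written back first, then M/E drop to S. */
    static mesi_t snoop_read(mesi_t s)
    {
        return (s == I_STATE) ? I_STATE : S_STATE;
    }

    /* Another core requests ownership: we invalidate (writing back if M). */
    static mesi_t snoop_write(mesi_t s)
    {
        (void)s;
        return I_STATE;
    }

    int main(void)
    {
        mesi_t s = I_STATE;
        printf("start       : %s\n", name(s));
        s = local_read(s, 1);  printf("local read  : %s\n", name(s));
        s = local_write(s);    printf("local write : %s (dirty vs RAM)\n", name(s));
        s = snoop_read(s);     printf("remote read : %s (after write-back)\n", name(s));
        s = snoop_write(s);    printf("remote RFO  : %s\n", name(s));
        return 0;
    }
    [/code]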
     
  11. patsu

    Legend

    Someone posted the implementation history of the GoW MLAA here, but I can't find it at the moment. I suppose I could write an Intel MLAA that runs dog-slow on Cell too, but that would be missing the point. Once an implementation is optimized for Cell (20 ms for 1 SPU), the final performance numbers will most likely be tied to that platform's characteristics.

    Other architectures (GPU or regular CPU) will have to find their own ways to do MLAA. The GPU one seems to take some shortcuts, but can spread generously across GPU cores. I'm curious how that modified MLAA would turn out on Cell too.


    EDIT:
    In the first place, I didn't say which factor is the contributing one. I just don't think that plugging the performance numbers from Cell into a calculation of the 360's bottleneck is the right thing to do. FWIW, I think everything on Cell contributed together to make MLAA possible:
    http://forum.beyond3d.com/showpost.php?p=1445386&postcount=622

    So yes, computational power is indeed one of the major factors.
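    For what it's worth, the 20 ms and 4 ms figures quoted in the thread are consistent with a near-linear split of the job across SPUs; the five-SPU split and the perfect scaling below are assumptions for illustration, not quoted numbers:
    [code]
    /* Reconciling the figures quoted in the thread: ~20 ms on one SPU,
       ~4 ms when spread across SPUs. */
    #include <stdio.h>

    int main(void)
    {
        const double single_spu_ms = 20.0;          /* quoted in the thread */
        const int    spus          = 5;             /* assumed              */
        const double frame_ms_30   = 1000.0 / 30.0; /* 30 fps frame budget  */

        double parallel_ms = single_spu_ms / spus;  /* assumes linear scaling */
        printf("%.0f ms / %d SPUs = %.1f ms (%.1f%% of a %.1f ms frame)\n",
               single_spu_ms, spus, parallel_ms,
               100.0 * parallel_ms / frame_ms_30, frame_ms_30);
        return 0;
    }
    [/code]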
     
  12. Neb

    Neb Iron "BEAST" Man
    Legend

    It seems so, at least so far, for an unoptimised version with messy code. But what makes you think the GOW "MLAA" doesn't use shortcuts? Have the GOW MLAA papers been released? Same question for the other methods.
     
  13. patsu

    Legend

    Ah... nicely set up for someone in the know to take the question. Thank you. :)
     
  14. T.B.

    Newcomer

    And ruin all the fun? I couldn't possibly do that, now can I?
     
  15. liolio

    liolio Aquoiboniste
    Legend

    Well, I've been reading quite a bit about coherency and I'm really happy you brought up the coherency protocol :)
    As I was reading I passed by this post (T.B.'s) and came back to check something I had forgotten about Xenon, which can't help either: horrendous L1 latencies.
    Crap, 16 cycles; that's L2-level...
     
  16. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    On a side note, I can't remember if I mentioned it already, but there is a sample in the XDK for an edge filter/AA. The Gamefest presentation only briefly mentioned it a couple of years back, but I wonder if that's what AvP ended up using.
     
  17. patsu

    Legend

    Oh crap, the Jackalope didn't take the bait. I should have been more subtle. :runaway:
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Shortcuts are generally welcome when the results don't have (too many) negative side-effects. At the moment, the thing about GWAA isn't just that it's fast enough to be used, but also that it has the best IQ. The same can't be said of the current GPU implementation, which takes shortcuts to run quickly but doesn't produce results of the same quality.
     
  19. DeanA

    Newcomer

    No.. no you can't!

    :D
     
  20. patsu

    Legend

    I'll just thicken my skin and take it as a compliment. :p
    [size=-2]Thank you <3 XXOO[/size]
     