RSX: Vertex input limited? *FKATCT

Discussion in 'Consoles' started by Hypx, Dec 26, 2006.

  1. Rash'

    Newcomer

    Joined:
    Aug 7, 2006
    Messages:
    23
    Likes Received:
    0
    I'm confused. If the workings of RSX are that well known, then why are there disagreements about what does and doesn't work well on the GPU?

    Personally, I think it's because different developers have different creative solutions to the hurdles any hardware presents. What some describe as an impassable problem may just be an issue that needs a different creative resolution.

    This was the point I was fundamentally trying to make, which I think you overlooked because of the "radical design" remark. I accept your point that maybe the choice of word wasn't appropriate for all aspects of the PS3 design, but is it not premature to make comparisons on hardware that developers, whether they be first, second or third party, clearly haven't fully come to terms with?
     
  2. Fredrik

    Newcomer

    Joined:
    Jan 1, 2007
    Messages:
    3
    Likes Received:
    0
    I don't think they will re-evaluate their OS requirements. IMO 32MB of RAM out of 512MB will only make a small noticeable difference in texture quality; this may be compensated for with a higher poly count. What is now seen as a problem that penalizes the console may in the future be a big selling point: it's likely that they reserved that much because they have big things in mind. And having that much processor power and RAM set aside may let them do amazing stuff, and that could distinguish their console.

    I would say that in the short term we will see that SPE used to stream content to the PSP, which could allow the player to have a PS3 game running on the PS3 and play it on their PSP (with some sort of Location Free software). That is something that has been heavily hinted at by Sony. If I had to bet on more stuff coming for that GameOS, I would presume they have something in mind related to the EyeToy 2. I don't know if it could be done, but I suppose they could have a windowed video chat running on top of a PS3 game while you are playing. That's a closer game experience than pure online play: you see your friend playing as if he were sitting next to you.

    I'm not a dev, so I don't know how many things can be done using an SPE and 64MB of RAM, but I guess there's enough to do some amazing little things.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    If instruction bandwidth were the key issue, the outcome would be a less clear win for the SPE, since a performance-critical branch misprediction that missed cache would likely only miss cache once.

    If that branch is in some hot code that is run for a long time, it would remain in cache, and the longer latency of the local store would in the end turn out to be a performance loss for the SPE.

    On the other hand, things get fuzzier still if the hot code exceeds the size of the L1 instruction cache, in which case the slower L2 becomes a factor, depending on just how often the L1 I-cache misses.

    It seems from current trends that it's minimizing data load latency that's more important.

    Maybe he meant 50+ ns, which sounds reasonable.
     
  4. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    I think there's far more agreement than disagreement concerning what does and doesn't work well on RSX. On the other hand, there's plenty of disagreement as to how big a deal all of it is.
     
  5. Laa-Yosh

    I can has custom title?
    Legend Subscriber

    Joined:
    Feb 12, 2002
    Messages:
    9,568
    Likes Received:
    1,452
    Location:
    Budapest, Hungary
    I think you've got this wrong: the topic was that even though a Blu-ray disc can hold more assets - models, textures, levels, animations, sounds etc. - it still wouldn't make a game look better on the PS3, because the bottleneck is the smaller amount of available system RAM from which the game can display/use these assets.
     
  6. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,521
    Likes Received:
    852
    Slide 23 of this suggests a main memory-to-SPE latency of ~170ns for blocking read access (i.e. load-to-use latency). That seems crazy high.

    Just doing an inter-SPE DMA transfer is quite costly at ~100ns. Latency from communicating with the main memory modules themselves looks like ~70ns, which is reasonable I suppose.
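
    To put those numbers in cycle terms, a rough back-of-envelope assuming the 3.2 GHz Cell clock (3.2 cycles per ns):

    Code:
    const double cycles_per_ns = 3.2;                 // 3.2 GHz Cell clock
    const double mem_read  = 170.0 * cycles_per_ns;   // ~544 cycles: blocking main-memory read
    const double spe_dma   = 100.0 * cycles_per_ns;   // ~320 cycles: inter-SPE DMA
    const double dram_only =  70.0 * cycles_per_ns;   // ~224 cycles: the DRAM modules themselves
    // ...versus a fixed 6 cycles for a load from the local store.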

    The slides make it all the more clear that you would want to operate out of the LS and *only* the LS.

    Cheers

    edit: Or were you talking about PPU I$ latencies? If so, move along, nothing to see.
     
  7. rounin

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,251
    Likes Received:
    20
    I don't know. I would rather wait before saying that: what if doing so allows them to do some really amazing things later on?
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The quoted section contained two separate points.
    By being deterministic and optimized for larger DMA transfers, the LS can lower the average apparent memory latency for the data sets that can be broken down properly.
    A hundred loads from the LS at 6 cycles every time after a block fetch is better than a cache-unfriendly stream of loads that can take hundreds of cycles each, or use so many prefetches that it slaughters instruction bandwidth.
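
    A minimal sketch of the double-buffered DMA pattern that gets you that behaviour, using the SDK's MFC intrinsics (the chunk size and process_block are made up for illustration):

    Code:
    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA block; illustrative */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process_block(char *block);  /* hypothetical per-block work */

    void stream(unsigned long long ea, int nblocks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);          /* kick off first fetch */
        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nblocks)                          /* prefetch next block while we work */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);                 /* wait only for the current block */
            mfc_read_tag_status_all();
            process_block(buf[cur]);                      /* every access in here is a 6-cycle LS load */
            cur = nxt;
        }
    }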

    For instructions, the LS may be slightly less optimal than a good fast I cache. Since in-orders worry about data latency more than instruction fetch latency, it is probably not as important that the LS has a higher latency.

    The 50 ns portion was me trying to interpret what DeanoC meant by "~50+ cycles". If he's comparing the SPE to other chips, the A64 can get best-case latencies in the neighborhood of 50 ns.
     
  9. Tahir2

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,978
    Likes Received:
    86
    Location:
    Earth
    Hmm, joker has admitted he is new to PS3 coding, and there are developers here who have been having a stab at the behemoth that is the PS3 for a bit longer. I remember some of them stating quite clearly that the PS3 requires a rethink in coding, and some experienced problems that now seem to be resolved.

    I would describe these complaints by joker454 as exploratory steps into the world of PS3 .. in the most respectful and nicest way possible of course. :)

    And Xenos might be superior in certain ways to RSX but these devices do not act on their own in complex systems.

    Edit: replying to a post that has been already deleted but I am keeping this post as I think it might explain the reason for some of the complaints. No disrespect intended of course.
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,521
    Likes Received:
    852
    Sorry, you (you and Crossbar) got me confused by discussing I$ misses in an SPE context. You'd need to initiate a DMA request to load more code into the LS, and hence a "miss" *is* a data dependency on the i-stream.

    I think you're right: your core is already hosed by the mispredict penalty, so the extra latency of the initial i-fetch after a mispredict is probably in the noise.

    The only place I could see the 6 cycle LS latency have a significant impact on i-stream accesses is that it sets the lower bound for distance between software-BTB priming and branching.

    It's probably easier to ask DeanoC directly, since 50ns doesn't rhyme with anything. Mayhaps he meant a 50+ instruction penalty, equivalent to 25 cycles of dual-issue/commit (which seems too high though).

    Cheers
     
  11. ShootMyMonkey

    Veteran

    Joined:
    Mar 21, 2005
    Messages:
    1,177
    Likes Received:
    71
    Well, if it were up to me, I wouldn't really give a damn about having all sorts of ancillary functions running at the same time as a game, and would have preferred to just leave the resident OS be a kernel and nothing more (at least while a game is running). But I guess some people might have some desire to keep a webpage of cheat codes open at the same time they're playing the game, or whatever. And I can imagine they might take a small hit on account of electing not to lock down to all manner of proprietary peripherals, and thereby needing a more formal driver layer for various things... but I doubt that's a huge drain.

    All the same, I think it boils down to being unable to plan ahead of time what they intend to throw in feature-wise. They're overestimating so that they have breathing room. While I can believe there's room for them to decrease the memory requirements (and just have those later games require a certain update to be installed), I doubt it will ever happen. Of course, you could just as easily argue that Microsoft has the opposite problem in that if Sony develops some sort of killer app on PS3, 360's OS may not have the necessary memory space to follow suit. Again, something that I doubt will ever happen.
     
  12. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    Without thinking much about it, I'm not sure what RSX could be doing to be more efficient here. Is Shifty's memory faulty, or does someone have an explanation?
     
  13. ShootMyMonkey

    Veteran

    Joined:
    Mar 21, 2005
    Messages:
    1,177
    Likes Received:
    71
    Perhaps he's thinking of vertices not getting reprocessed as often, since RSX has rather large post-transform caches? It's true that if you partition the streams nicely enough to keep re-accessing verts that are sitting in the cache, you can get significantly better vertex throughput out of it than the raw vertex shading performance would suggest, and RSX's caches being a few times larger than Xenos' means your odds are a little better. There's not a whole lot else I can think of that sounds anything like what Shifty was saying.
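
    For the curious, the standard way to quantify that reuse is the average cache miss ratio (ACMR) of the index stream against a post-transform cache model. A toy simulator (the FIFO policy and cache size are simplifying assumptions; the real replacement behaviour on RSX/Xenos isn't public):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <vector>

    // Toy FIFO post-transform cache simulator. Returns shaded vertices per
    // triangle (ACMR): 3.0 means no reuse at all, ~0.5 is the practical
    // lower bound on a regular grid.
    double acmr(const std::vector<int>& indices, std::size_t cacheSize)
    {
        std::deque<int> cache;
        std::size_t misses = 0;
        for (std::size_t i = 0; i < indices.size(); ++i) {
            int idx = indices[i];
            if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
                ++misses;                      // vertex has to be re-shaded
                cache.push_back(idx);
                if (cache.size() > cacheSize)
                    cache.pop_front();         // FIFO eviction
            }
        }
        return double(misses) / (indices.size() / 3.0);
    }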
     
  14. Crossbar

    Veteran

    Joined:
    Feb 8, 2006
    Messages:
    1,821
    Likes Received:
    12
    I downloaded some JPEG libs from Intel just for fun.

    And you are right, from running some rough benchmarks it seems that the Pentium performs much better if the output fits within the cache. My Pentium 4 at 3.2 GHz (Presler, 1 MB cache) decompressed a 15 kB JPEG file into a 250 kB (24-bit) bitmap in 2.8 ms. You could probably do better on an SPU with some hand-tuned code taking advantage of the huge register file.

    Yes, you really would like to decompress it into a DXTC texture; perhaps there are better compression schemes for that. Huffman encoding (edit: or LZW or some other lossless compression, probably with some custom optimisations for the particular DXTC format in question) of a DXTC texture with some repetitive colours could perhaps give a good result, both with regard to compression rate and decompression speed?
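
    For reference, a DXT1 block is only 8 bytes - two 5:6:5 endpoint colours plus sixteen 2-bit selectors - so a lossless pass has nicely repetitive fields to work with (the layout below is the standard DXT1 format; the struct name is mine):

    Code:
    // Standard DXT1 block: 4x4 texels in 8 bytes (4 bits per pixel).
    struct Dxt1Block {
        unsigned short color0;     // endpoint colour 0, RGB 5:6:5
        unsigned short color1;     // endpoint colour 1, RGB 5:6:5
        unsigned int   selectors;  // 16 x 2-bit indices into the 4 derived colours
    };
    // Flat regions repeat endpoint values across neighbouring blocks, which is
    // exactly the redundancy a Huffman/LZW pass over the endpoint and selector
    // streams would exploit.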
     
  15. Kryton

    Regular

    Joined:
    Oct 26, 2005
    Messages:
    273
    Likes Received:
    8
    This is probably more appropriate in the Cell programming thread, but did the Intel library make use of SSE? JPEG has the quantization step, which could be tweaked nicely if it isn't vectorized already (on Intel chips).
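
    The dequantize step in particular vectorizes almost for free; a minimal SSE2 sketch (assuming baseline JPEG, where the products stay within 16 bits):

    Code:
    #include <emmintrin.h>

    // Dequantize one 8x8 block of JPEG DCT coefficients with SSE2:
    // coef[i] *= quant[i], eight 16-bit lanes per instruction.
    void dequant_8x8(short coef[64], const short quant[64])
    {
        for (int i = 0; i < 64; i += 8) {
            __m128i c = _mm_loadu_si128((const __m128i *)(coef + i));
            __m128i q = _mm_loadu_si128((const __m128i *)(quant + i));
            _mm_storeu_si128((__m128i *)(coef + i), _mm_mullo_epi16(c, q));
        }
    }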
     
  16. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Was that standard JPEG (not some JPEG2000 wavelet variant)? I'm just curious - the PS2 decodes a JPEG of that size (I assumed ~295x295 pixels based on your 24-bit size) in close to 0.8ms. Granted, it's a hardware decoder, but it's also over 6 years old now.
    I imagine an SPE should absolutely fly at DCT macroblock decoding though; I'm sure someone will have a go at it sooner or later.

    That depends on how much they overestimated and how much was the result of code like
    Code:
    char *KenIsGreat =  new char[1024*1024*16]; //never remove!
    For what it's worth - they did "unreserve" half of the kernel-reserved space on the PSP eventually.
     
  17. Crossbar

    Veteran

    Joined:
    Feb 8, 2006
    Messages:
    1,821
    Likes Received:
    12
    Yes, the Intel library contains tweaked code for almost every single Intel CPU, so you can be pretty sure it used the SSE instruction set.

    The mods can feel free to move this to the Cell programming thread, if they think it's appropriate. :)
     
  18. Crossbar

    Veteran

    Joined:
    Feb 8, 2006
    Messages:
    1,821
    Likes Received:
    12
    Yeah, just a standard JPEG. Maybe I'll give the jp2 version a go later today if I have the time.
     
  19. popper

    Newcomer

    Joined:
    Jul 22, 2006
    Messages:
    69
    Likes Received:
    3
    For what it's worth, there's been PPC/AltiVec DCT macroblock decoding in x264, ffmpeg, and mplayer/mencoder for a while now (don't know if it's profiled though), so it shouldn't be too hard to port/patch that to use the SPEs - just a little effort and time, plus users get to benefit if you do it.
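
    If anyone tries, the porting is less painful than it sounds: most AltiVec intrinsics have near 1:1 SPU equivalents (if I remember right, IBM's SDK even ships a vmx2spu.h mapping header). E.g. the multiply-add at the heart of an IDCT row pass:

    Code:
    /* PPU/VMX side (compiled for the PPE): */
    #include <altivec.h>
    vector float idct_madd_vmx(vector float a, vector float b, vector float c)
    {
        return vec_madd(a, b, c);   /* a*b + c across four float lanes */
    }

    /* SPU side (compiled separately for the SPE) - near enough a rename: */
    #include <spu_intrinsics.h>
    vector float idct_madd_spu(vector float a, vector float b, vector float c)
    {
        return spu_madd(a, b, c);   /* same fused multiply-add */
    }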
     
  20. Crossbar

    Veteran

    Joined:
    Feb 8, 2006
    Messages:
    1,821
    Likes Received:
    12
    Might as well do that once I get my hands on a Cell in a couple of months, I enjoy tweaking code. I've found the Cell performance thread really enjoyable. :)

    Is that code available at Sourceforge or elsewhere?
     