RSX: Vertex input limited? *FKATCT

Of the GPUs and CPUs on both consoles, I think many would agree with me that RSX is the least radical design; it's probably also the best known in terms of what it can do, what works well and what doesn't. Just my 2 cents on that :smile:
I'm confused. If the workings of RSX are that well known, then why are there disagreements about what does and doesn't work well on the GPU?

Personally, I think it's because different developers have different creative solutions to the hurdles any hardware presents. What some describe as an impassable problem may just be an issue that needs a different creative resolution.

This was the point I was fundamentally trying to make, which I think you overlooked because of the "radical design" remark. I accept your point that maybe the choice of word wasn't appropriate for all aspects of the PS3 design, but is it not premature to make comparisons on hardware that developers, whether they be first, second or third party, clearly haven't fully come to terms with?
 
[...]
Now it is 512MB (minus 60MB?) vs 512MB (minus 20MB?), with differences in memory organisation, pluses and minuses in bus speeds, and 7 usable CPUs vs 3 cores. (There is also the possibility Sony could re-evaluate the OS footprint needed while games are running, and future titles might find, from release 1.xx onwards, that they have more memory. A firmware upgrade could be enforced, as it is on the PSP, by bundling it with the game.)
[...]

I don't think they will re-evaluate their OS requirements. IMO 32MB of RAM out of 512MB will only make a small noticeable difference in texture quality; this may be compensated for with a higher poly count. What is now seen as a problem that penalizes the console may in the future be a big selling point: it's likely that they reserved that much because they have big things in mind. And having that much processor power and RAM set aside may let them do amazing stuff, and that could distinguish their console.

I would say that in the short term we will see an SPE used to stream content to the PSP, which could allow the player to have a PS3 game running on the PS3 and play it on their PSP (with some sort of location-free software). That is something that has been heavily hinted at by Sony. If I had to bet on more stuff coming to that GameOS, I would presume they have something in mind related to the EyeToy 2. I don't know if it could be done, but I suppose they could have a windowed video chat running on top of a PS3 game while you are playing. That's a closer game experience than pure online play: you see your friend playing as if he were sitting next to you.

I'm not a dev, so I don't know how many things can be done using an SPE and 64MB of RAM, but I guess there's enough to do some amazing little things.
 
Yeah, you are correct, the quote concerned "variable memory access patterns", meaning plenty of cache misses. I was actually thinking of branch misses to non-cached code when I wrote that, but failed to describe it.
If instruction bandwidth were the key issue, the outcome would be a less clear win for the SPE, since a performance-critical branch misprediction that missed cache would likely only miss cache once.

If that branch is in some hot code that is run for a long time, it would remain in cache, and the longer latency of the local store would in the end turn out to be a performance loss for the SPE.

On the other hand, things get fuzzier still if the hot code exceeds the size of the L1 instruction cache, in which case the slower L2 becomes a factor, depending on just how often the L1 I-cache misses.

It seems from current trends that it's minimizing data load latency that's more important.

Actually the 50+ cycle number suggested by DeanoC sounds a bit low for access to main RAM, and still too high to be the latency figure of the level 2 cache. Maybe some NDA margins in there? :smile:
Maybe he meant 50+ ns, which sounds reasonable.
 
I'm confused. If the workings of RSX are that well known, then why are there disagreements about what does and doesn't work well on the GPU?

I think there's far more agreement than disagreement concerning what does and doesn't work well on RSX. On the other hand, there's plenty of disagreement as to how big a deal all of it is.
 
By saying 'You can stream off a HDD on the 360 so that isn't in the PS3's favor. Blu-Ray won't increase graphic fidelity', are you referring to AVC/VC-1 streaming video content?

I think you've got this wrong; the topic was that even though a Blu-Ray disc can hold more assets - models, textures, levels, animations, sounds etc. - it still wouldn't make a game look better on the PS3, because the bottleneck is the smaller amount of available system RAM from which the game can display/use those assets.
 
It seems from current trends that it's minimizing data load latency that's more important.

Maybe he meant 50+ ns, which sounds reasonable.

Slide 23 of this suggests main memory-to-SPE latency of ~170ns for blocking read access (ie. load-to-use latency). That seems crazy high.

Just doing an inter-SPE DMA transfer is quite costly at ~100ns. Latency from communicating with the main memory modules themselves looks like ~70ns, which is reasonable I suppose.

The slides make it all the more clear that you would want to operate out of the LS and *only* the LS.

Cheers

edit: Or were you talking about PPU I$ latencies? If so, move along, nothing to see.
 
Sony must reduce its OS memory usage in order to be competitive. If you think about it, 64MB is all the memory the Xbox had!

I don't know. I would rather wait before saying that: what if keeping that memory reserved allows them to do some really amazing things later on?
 
Slide 23 of this suggests main memory-to-SPE latency of ~170ns for blocking read access (ie. load-to-use latency). That seems crazy high.

Just doing an inter-SPE DMA transfer is quite costly at ~100ns. Latency from communicating with the main memory modules themselves looks like ~70ns, which is reasonable I suppose.

The slides make it all the more clear that you would want to operate out of the LS and *only* the LS.

Cheers

edit: Or were you talking about PPU I$ latencies? If so, move along, nothing to see.

The quoted section contained two separate points.
By being deterministic and optimized for larger DMA transfers, the LS can lower the average apparent memory latency for the data sets that can be broken down properly.
A hundred loads from the LS at 6 cycles every time after a block fetch is better than a cache-unfriendly stream of loads that can take hundreds of cycles each, or use so many prefetches that it slaughters instruction bandwidth.
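
A minimal sketch of what that pattern looks like in practice, assuming the Cell SDK's spu_mfcio.h MFC intrinsics (block size and names here are purely illustrative):
Code:
#include <spu_mfcio.h>

#define TAG           1
#define BLOCK_FLOATS  4096   /* a 16 KB block, illustrative size */

/* Pull a whole block into the LS with one DMA, then do every load at the
   LS's fixed ~6 cycle latency instead of a cache-unfriendly stream of loads
   (or a pile of prefetches) against main memory. */
static float ls_block[BLOCK_FLOATS] __attribute__((aligned(128)));

float sum_block(unsigned long long ea)   /* effective address in main RAM */
{
    mfc_get(ls_block, ea, sizeof(ls_block), TAG, 0, 0);  /* one big DMA */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();           /* wait for the transfer to land */

    float sum = 0.0f;
    for (int i = 0; i < BLOCK_FLOATS; ++i)
        sum += ls_block[i];              /* every access now hits the LS */
    return sum;
}

In a real loop you would double-buffer (DMA block n+1 while crunching block n) so even that transfer latency is mostly hidden.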

For instructions, the LS may be slightly less optimal than a good fast I cache. Since in-orders worry about data latency more than instruction fetch latency, it is probably not as important that the LS has a higher latency.

The 50 ns portion was me trying to interpret what DeanoC meant by "~50+ cycles". If he's comparing the SPE to other chips, the A64 can get best-case latencies in the neighborhood of 50 ns.
 
Hmm, joker has admitted he is new to PS3 coding, and there are developers here that have been having a stab at the behemoth that is PS3 for a bit longer. I remember some of them stating quite clearly that PS3 requires a rethink of coding, and some experienced problems that seem to be resolved now.

I would describe these complaints by joker454 as exploratory steps into the world of PS3 .. in the most respectful and nicest way possible of course. :)

And Xenos might be superior in certain ways to RSX but these devices do not act on their own in complex systems.

Edit: replying to a post that has been already deleted but I am keeping this post as I think it might explain the reason for some of the complaints. No disrespect intended of course.
 
The quoted section contained two separate points.
By being deterministic and optimized for larger DMA transfers, the LS can lower the average apparent memory latency for the data sets that can be broken down properly.
A hundred loads from the LS at 6 cycles every time after a block fetch is better than a cache-unfriendly stream of loads that can take hundreds of cycles each, or use so many prefetches that it slaughters instruction bandwidth.

Sorry, you (you and Crossbar) got me confused by discussing I$ misses in an SPE context. You'd need to initiate a DMA request to load more code into the LS, and hence a "miss" *is* a data dependency - on the i-stream.

For instructions, the LS may be slightly less optimal than a good fast I cache. Since in-orders worry about data latency more than instruction fetch latency, it is probably not as important that the LS has a higher latency.

I think you're right; your core is already hosed by the mispredict penalty, so the extra latency of the initial i-fetch after a mispredict is probably in the noise.

The only place I could see the 6 cycle LS latency have a significant impact on i-stream accesses is that it sets the lower bound for distance between software-BTB priming and branching.
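
Just to make that concrete, a hedged sketch (assuming spu-gcc, where __builtin_expect is what ends up as the hbr hint emitted ahead of the branch):
Code:
/* The compiler places an 'hbr' hint for the expected path; it only helps if
   it lands far enough ahead of the branch, and the LS fetch latency is part
   of what sets that minimum distance. */
int process(int x)
{
    if (__builtin_expect(x > 0, 1)) {   /* tell the compiler this path is hot */
        return x * 2;                   /* hot path, fetched early */
    }
    return -x;                          /* cold path eats the mispredict */
}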

The 50 ns portion was me trying to interpret what DeanoC meant by "~50+ cycles". If he's comparing the SPE to other chips, the A64 can get best-case latencies in the neighborhood of 50 ns.

It's probably easier to ask DeanoC directly, since 50ns doesn't rhyme with anything. Mayhaps he meant a 50+ instruction penalty, equivalent to 25 cycles of dual-issue/commit (which seems too high though).

Cheers
 
I don't think they will re-evaluate their OS requirements. IMO 32MB of RAM out of 512MB will only make a small noticeable difference in texture quality; this may be compensated for with a higher poly count. What is now seen as a problem that penalizes the console may in the future be a big selling point: it's likely that they reserved that much because they have big things in mind. And having that much processor power and RAM set aside may let them do amazing stuff, and that could distinguish their console.
Well, if it were up to me, I wouldn't really give a damn about having all sorts of ancillary functions running at the same time as a game and would have preferred to just leave the resident OS as a kernel and nothing more (at least while a game is running). But I guess some people might have some desire to keep a webpage of cheat codes open at the same time they're playing the game or something or whatever. And I can imagine they might take a small hit on account of electing not to lock down to all manner of proprietary peripherals and thereby needing a more formal driver layer for various things... but I doubt that's a huge drain.

All the same, I think it boils down to being unable to plan ahead of time what they intend to throw in feature-wise. They're overestimating so that they have breathing room. While I can believe there's room for them to decrease the memory requirements (and just have those later games require a certain update to be installed), I doubt it will ever happen. Of course, you could just as easily argue that Microsoft has the opposite problem in that if Sony develops some sort of killer app on PS3, 360's OS may not have the necessary memory space to follow suit. Again, something that I doubt will ever happen.
 
Wasn't it also hinted that it was more efficient in selecting which vertices to actually render?
Without thinking much about it, I'm not sure what RSX could be doing to be more efficient here. Is Shifty's memory faulty, or does someone have an explanation?
 
Without thinking much about it, I'm not sure what RSX could be doing to be more efficient here. Is Shifty's memory faulty, or does someone have an explanation?
Perhaps he's thinking of vertices not getting reprocessed as often, since RSX has its rather large post-transform caches? Well, it's true that if you partition the streams nicely enough to keep re-accessing verts that are sitting in the cache, you can get significantly better vertex throughput out of it than the raw vertex shading performance would suggest, and RSX's caches being a few times larger than Xenos' means your odds are a little better. There's not a whole lot else I can think of that sounds anything like what Shifty was saying.
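
If anyone wants to play with the idea, a toy FIFO simulator of a post-transform cache shows the effect well enough (the cache size and FIFO policy here are assumptions, not RSX specifics):
Code:
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <deque>

/* Count how many vertices would actually get shaded for a given index order,
   given a small FIFO post-transform cache. */
static int count_transforms(const int* indices, std::size_t n, std::size_t cache_size)
{
    std::deque<int> cache;
    int transforms = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int v = indices[i];
        if (std::find(cache.begin(), cache.end(), v) == cache.end()) {
            ++transforms;                 /* miss: vertex gets (re)shaded */
            cache.push_back(v);
            if (cache.size() > cache_size)
                cache.pop_front();        /* FIFO eviction */
        }
    }
    return transforms;
}

int main()
{
    /* Same six vertices, two index orders; the locality-friendly order ends
       up shading fewer vertices once the cache is in play. */
    const int ordered[]  = {0,1,2, 1,2,3, 2,3,4, 3,4,5};
    const int shuffled[] = {0,1,2, 3,4,5, 1,2,3, 2,3,4};
    std::printf("ordered:  %d transforms\n", count_transforms(ordered, 12, 4));
    std::printf("shuffled: %d transforms\n", count_transforms(shuffled, 12, 4));
    return 0;
}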
 
[maven] said:
IME this would largely depend on how well you can get your entropy decoding scheme to run on the SPU; all the other elements of any form of transform coding should work very well on the SPU (as long as your data is tiled appropriately for the size of your local storage). Which reminds me that I still wanted to work on decompressing directly into DXTn...
I downloaded some jpeg libs from Intel just for fun.

And you are right; from running some rough benchmarks it seems that the Pentium performs much better if the output fits within the cache. My Pentium 4 at 3.2 GHz (Presler, 1 MB cache) decompressed a 15 kB JPEG file into a 250 kB (24-bit) bitmap in 2.8 ms. You could probably do better on an SPU with some hand-tuned code taking advantage of the huge register file.

Yes, you really would like to decompress it into a DXTC texture; perhaps there are better compression schemes for that. Huffman (edit: or LZW or some other lossless compression, probably with some custom optimisations for the particular DXTC format in question) encoding of a DXTC texture with some repetitive colours could perhaps give a good result, both with regard to compression rate and decompression speed?
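
Just to sketch that idea (zlib standing in for "Huffman/LZW or whatever lossless coder"; the names and layout here are illustrative, not a real scheme):
Code:
#include <zlib.h>
#include <stdint.h>

/* A DXT1 texture is just an array of 8-byte blocks, so a lossless decoder can
   expand the packed stream straight into it - no intermediate RGBA buffer and
   no re-encode needed. */
struct DXT1Block {
    uint16_t color0, color1;  /* RGB565 endpoints */
    uint32_t indices;         /* 2-bit palette index per texel of the 4x4 block */
};

int unpack_dxt1(const unsigned char* packed, unsigned long packed_size,
                DXT1Block* out_blocks, unsigned long out_bytes)
{
    uLongf dest_len = out_bytes;
    return uncompress(reinterpret_cast<Bytef*>(out_blocks), &dest_len,
                      packed, packed_size);   /* Z_OK on success */
}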
 
I downloaded some jpeg libs from Intel just for fun.

And you are right; from running some rough benchmarks it seems that the Pentium performs much better if the output fits within the cache. My Pentium 4 at 3.2 GHz (Presler, 1 MB cache) decompressed a 15 kB JPEG file into a 250 kB (24-bit) bitmap in 2.8 ms. You could probably do better on an SPU with some hand-tuned code taking advantage of the huge register file.

Yes, you really would like to decompress it into a DXTC texture; perhaps there are better compression schemes for that. Huffman (edit: or LZW or some other lossless coding) encoding of a DXTC texture with some repetitive colours could perhaps give a good result, both with regard to compression rate and decompression speed?

This is probably more appropriate in the Cell programming thread, but did the Intel library make use of SSE? JPEG has the quantization operation which could be tweaked nicely if it isn't vectorized already (on Intel chips).
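
For reference, the dequantization step is about the most SSE2-friendly part of the pipeline - something along these lines (just a sketch; I have no idea what the Intel library actually does internally):
Code:
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Multiply the 64 DCT coefficients of one 8x8 block by its quantization
   table, 8 coefficients per instruction (16-bit path). */
void dequantize_block(int16_t coeff[64], const uint16_t qtable[64])
{
    for (int i = 0; i < 64; i += 8) {
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&coeff[i]));
        __m128i q = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&qtable[i]));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&coeff[i]),
                         _mm_mullo_epi16(c, q));   /* low 16 bits of each product */
    }
}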
 
Crossbar said:
And you are right; from running some rough benchmarks it seems that the Pentium performs much better if the output fits within the cache. My Pentium 4 at 3.2 GHz (Presler, 1 MB cache) decompressed a 15 kB JPEG file into a 250 kB (24-bit) bitmap in 2.8 ms.
Was that standard JPEG (not some JPEG 2000 wavelet variant)? I'm just curious - the PS2 decodes a JPEG of that size (I assumed ~295*295 pixels based on your 24-bit size) in close to 0.8ms - granted, it's a hardware decoder, but it's also over 6 years old now.
I imagine the SPE should absolutely fly at DCT macroblock decoding though; I'm sure someone will have a go at it sooner or later.

ShootMyMonkey said:
I doubt it will ever happen
That depends on how much they overestimated and how much was the result of code like
Code:
char *KenIsGreat =  new char[1024*1024*16]; //never remove!
For what it's worth - they did "unreserve" half of the kernel-reserved space on the PSP eventually.
 
This is probably more appropriate in the Cell programming thread, but did the Intel library make use of SSE? JPEG has the quantization operation which could be tweaked nicely if it isn't vectorized already (on Intel chips).
Yes, the Intel library contains tweaked code for almost every single Intel CPU, so you can be pretty sure it used the SSE instruction set.

The mods can feel free to move this to the Cell programming thread, if they think it's appropriate. :)
 
For what it's worth, there's been PPC/AltiVec DCT macroblock decoding in x264, ffmpeg, and mplayer/mencoder for a while now (I don't know if it's profiled though), so it shouldn't be too hard to port/patch that to use SPEs - just a little effort and time, plus users get to benefit if you did it.
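
Most of that port would be a fairly mechanical intrinsic mapping - roughly this kind of thing (a hedged sketch assuming the stock altivec.h and spu_intrinsics.h headers; a real IDCT kernel has far more ops, but they translate the same one-for-one way):
Code:
#ifdef __SPU__
  #include <spu_intrinsics.h>
  typedef vec_short8 vs16;                /* 8 x 16-bit lanes on the SPU */
  static inline vs16 add8(vs16 a, vs16 b) { return spu_add(a, b); }
#else
  #include <altivec.h>
  typedef __vector signed short vs16;     /* 8 x 16-bit lanes on PPU/AltiVec */
  static inline vs16 add8(vs16 a, vs16 b) { return vec_add(a, b); }
#endif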
 
For what it's worth, there's been PPC/AltiVec DCT macroblock decoding in x264, ffmpeg, and mplayer/mencoder for a while now (I don't know if it's profiled though), so it shouldn't be too hard to port/patch that to use SPEs - just a little effort and time, plus users get to benefit if you did it.

Might as well do that once I get my hands on a Cell in a couple of months; I enjoy tweaking code. I've found the Cell performance thread really enjoyable. :)

Is that code available at Sourceforge or elsewhere?
 