All very fascinating information, thanks Nammo.
- No, that's using DRAM as the source and the on-chip line-buffer as destination.
- 7 cycles is the fastest real games went, indeed using that Atari-recommended trick where src=RISC SRAM and dest=DRAM.
That's a 5 cycle/pixel texture-mapping blit plus a 2 cycle/pixel copy blit.
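(That is: 2 cycles/pixel for the copy blit that loads the texture data into the SRAM, plus 5 cycles/pixel for the texture-mapping blit back out to DRAM, so 2 + 5 = 7 cycles/pixel end to end, SRAM prep included.)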
So, it sounds like if you pair internal SRAM with DRAM in any context it's just 5 cycles. I didn't think the 7 cycle figure included preparing the SRAM with the data. The 2 cycles there assume the blitter is in phrase mode, right? So 8 cycles per phrase? Then could you save some time there by using 8bpp pixels? Was that what Zero 5 did?
The RISC SRAM is single-ported, and the cycle stealing slows down both the RISC and the blitter. That makes #2 slow in practice - the normal way halts the RISC to blit one line, then halts the blitter while the RISC sets up the next line.
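In pseudocode, the normal way ends up completely serialized, something like this (C-style sketch only; these calls are placeholders, not the real register interface):

```c
/* Approach #2 the "normal" way, fully serialized (placeholder calls). */
extern void setup_line(int y);       /* GPU computes blit params; blitter idle */
extern void start_blit(int y);       /* blitter runs, stealing SRAM cycles     */
extern void wait_blit_done(void);    /* GPU is mostly stalled here             */

void draw_all_lines(int height)
{
    for (int y = 0; y < height; y++) {
        setup_line(y);       /* blitter waits on the GPU  */
        start_blit(y);
        wait_blit_done();    /* GPU waits on the blitter  */
    }
}
```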
When you mention the blitter halting the GPU, does this mean for the entire duration of the blit, or just for the 1 out of 5 cycles per pixel where it needs to access the SRAM?
The major constraint of approach #1 is that you're racing the beam. You can comfortably hit 256x240 at 60FPS, which is phenomenal for the Jaguar, but you need a sorted list of spans - in other words, a Quake-style span-sorting hidden-surface-removal engine. I imagined running that on the DSP (Jerry), but I only got as far as a brute-force sorter that could handle a few polygons, not the hundreds I dreamed of. Oh well, I'm no Carmack.
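For what it's worth, the per-scanline budget works out roughly like this (back-of-envelope only, using the 5 cycle/pixel DRAM-to-line-buffer figure from above and ignoring refresh and other bus traffic):

```c
/* Back-of-envelope scanline budget for approach #1; nothing here measured. */
#define SYS_HZ   26590000    /* ~26.59 MHz system clock                      */
#define LINE_HZ  15734       /* NTSC horizontal rate                         */
#define WIDTH    256
#define TEX_CYC  5           /* textured blit, DRAM src -> line buffer dst   */

int cycles_per_line = SYS_HZ / LINE_HZ;   /* ~1690 cycles per scanline       */
int cycles_to_fill  = WIDTH * TEX_CYC;    /* 1280, assuming zero overdraw    */
/* That's roughly 75% of the line, which is why 256x240@60 is "comfortable"
   as long as the sorted span list keeps overdraw at zero.                   */
```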
The major advantage of approach #1 is optimal use of the internal buses. The RISC and blitter both run in parallel at max theoretical speed (see point #3). The GPU has a dedicated bus to the blitter, so it can actually start setting blitter registers before the current span is done, without interrupting the blit in progress. With careful timing, the blitter never has to stop and span-setup/rasterization is fully pipelined in the GPU.
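To make that concrete, the GPU-side loop is shaped roughly like this (C-style sketch of the idea only; the types and calls are placeholders, not the real blitter register set):

```c
/* Approach #1's pipelined span loop; every name here is a placeholder. */
typedef struct { int y, x0, x1; /* plus texture gradients, etc. */ } Span;
typedef struct { long dst, src, count, cmd; /* register images  */ } BlitSetup;

extern BlitSetup compute_setup(const Span *s);            /* pure GPU math  */
extern int       blit_busy(void);
extern void      load_blitter_regs(const BlitSetup *s);   /* GPU-side bus   */
extern void      start_blit(void);

void draw_spans(const Span *spans, int count)
{
    for (int i = 0; i < count; i++) {
        BlitSetup s = compute_setup(&spans[i]);  /* overlaps with the blit
                                                    of span i-1              */
        while (blit_busy()) ;                    /* often already idle by
                                                    the time we get here     */
        load_blitter_regs(&s);                   /* over the dedicated bus;
                                                    in practice some of this
                                                    can start even before the
                                                    previous span finishes   */
        start_blit();                            /* span i draws while the
                                                    GPU moves on to span i+1 */
    }
    while (blit_busy()) ;                        /* drain the final span     */
}
```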
That's a very interesting approach.
If you could transform and sort into spans something like 1000 polygons at 15Hz, it'd produce results much better than anything I've seen on the Jaguar. You'd be wasting most of the 60Hz fillrate repeating the same spans over and over again, though. It's a compelling exercise to work out exactly how many triangles the DSP could Y-sort and span-intersect.
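Just to sketch the shape of that estimate (every per-item cycle count below is made up purely to show where the numbers would go, not a measurement):

```c
/* Back-of-envelope DSP budget; every per-item cost below is a guess. */
#define DSP_HZ      26590000   /* Jerry clock, ~26.59 MHz                   */
#define UPDATE_HZ   15         /* geometry update rate from above           */
#define CYC_VERTEX  120        /* assumed: transform + project one vertex   */
#define CYC_EDGE    80         /* assumed: set up one edge for stepping     */
#define CYC_SPAN    30         /* assumed: intersect/insert one span        */
#define AVG_HEIGHT  20         /* assumed: average triangle height in lines */

int budget  = DSP_HZ / UPDATE_HZ;              /* ~1.77M cycles per update  */
int per_tri = 3 * CYC_VERTEX + 3 * CYC_EDGE
            + AVG_HEIGHT * CYC_SPAN;           /* 1200 cycles per triangle  */
int tris    = budget / per_tri;                /* ~1480 with these guesses  */
```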
I take it there's no way you could use Z-buffering with the line buffer as the target, at least not without adding a second pass of the OPL from line buffer to line buffer to get rid of the Z data?
Looking again at the #2 scenario above, I wonder if, instead of copying to GPU SRAM as a temporary texture buffer, you could copy to the line buffers (the portion that's not visible on a low-resolution display) or to the palette RAM. That would stop the blitter from conflicting with the GPU during blits, and if you managed to fit multiple rows of textures into those buffers you could schedule a few blits back to back.
But then the problem with using the line buffer this way is that the OPL needs to read from it on 50% of the scanlines. I guess you'd need to synchronize the blits right after the flip and make sure you don't run past the point where it flips again, then synchronize again with the next flip (and move to the next buffer). The overhead could be pretty bad. But the palette RAM could be free if you're not using palettes anywhere, assuming it's designed to allow fast writes.
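Very roughly, the scheduling I'm imagining would look something like this (every call here is a placeholder; I'm not claiming it matches the real flip mechanics or registers):

```c
/* Placeholder sketch of squeezing copies in between line-buffer flips;
   none of these calls correspond to real registers.                        */
extern void wait_for_lb_flip(void);        /* however a flip gets detected   */
extern int  window_cycles_left(void);      /* time before the buffer is needed again */
extern int  row_copy_cost(int row);        /* estimated cycles for one copy  */
extern void blit_row_to_scratch(int row);  /* copy blit into the idle portion */

void feed_texture_rows(int first_row, int rows)
{
    int r = first_row;
    while (r < first_row + rows) {
        wait_for_lb_flip();                              /* window opens     */
        while (r < first_row + rows &&
               row_copy_cost(r) < window_cycles_left())
            blit_row_to_scratch(r++);      /* stop before the next flip      */
    }
}
```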
Side note: I designed a flash cart (Skunkboard) that enables 5 cycle textures from ROM to DRAM. Unfortunately, I used 2007 technology to do it. With 1994 ROMs, you could probably do 8 cycle texturing that way, without slowing your RISC, so it might still be a net win. I never heard of anyone doing that.
When you say 1994 ROMs, do you mean ROMs that could have been used back then, or the actual ROMs that shipped in Jaguar carts? It seems like texturing from ROM would (usually) be a no-brainer if there's no performance hit, and even more so if there's a performance advantage!
I don't know the PS1 except from Sony's inflated specs. But, I do know the OP is the best-optimized part of the Jaguar. With only a few objects per line, it can hit 3200 pixels per scan line including CLUT lookups when rendering 4bpp or 8bpp images in 16-bit RGB/CRY mode. The NeoGeo gets 1536.
The PS1's peak theoretical fillrate is 2 pixels/cycle for flat-shaded triangles and sprites, 1 pixel/cycle for shaded/textured triangles (which is also how you get scaled/rotated sprites), and 0.5 pixels/cycle for alpha-blended shaded/textured triangles. It has a 27% higher clock speed than the Jaguar, and unlike the OPL it can get meaningful fillrate all of the time, not just while scanlines are being displayed.
But the big downside is that textures and the framebuffer share the same memory space, so texture cache misses cut the fill rate considerably. There's also bound to be some memory overhead in advancing the destination to the next framebuffer row (which is a whole DRAM page away).
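Just to multiply those peak figures out per frame (using ~33.87 MHz, which is where that 27% comes from, and ignoring all the caveats):

```c
/* Straight multiplication of the peak figures above; nothing measured,
   and no real game ever sees these numbers.                              */
#define PSX_HZ  33868800L    /* ~27% above the Jaguar's ~26.59 MHz        */
#define FPS     60

long flat_px     = 2 * PSX_HZ / FPS;        /* ~1.13M pixels per frame    */
long textured_px = 1 * PSX_HZ / FPS;        /* ~565K pixels per frame     */
long blended_px  = 1 * PSX_HZ / (2 * FPS);  /* ~282K pixels per frame     */
/* Against 320x240 = 76,800 pixels touched once, that's roughly 7x (textured)
   to 14x (flat) worth of overdraw headroom at peak, before cache misses
   and page breaks eat into it.                                              */
```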
What makes a comparison even harder is that the PS1's performance wasn't even constant throughout the life of the system. Rather early on, Sony redesigned the GPU to use SGRAM instead of VRAM, which completely changed the performance. I've heard from Mednafen's author that texture cache misses took some 50% longer on the VRAM version, which makes sense given that SGRAM can have two banks open simultaneously. SGRAM also has a faster read/modify/write turnaround, which affected practical alpha-blending performance.
As you probably know, increasing the number of sprites/objects lowers performance. Developers who really pushed the OP (like Minter's D2K) did so by sorting their object lists into buckets to improve the pixel-to-object processing ratio. Minter once said he experimented with GPU RAM for the OP list to reduce page misses - no idea if that stuck or not. Some special effects halve performance, such as blending/semi-transparency - but I'm sure the PS1 has similar issues.
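The bucketing itself is roughly this shape (my sketch, not Minter's code; the emit_*/patch_* calls stand in for building real OP phrases, which I'm not spelling out here):

```c
/* Sketch of Y-bucketing the object list with branch objects; the emit_*
   and patch_* calls are placeholders for building real OP phrases.        */
#define BAND_H     16                   /* assumed bucket height, in lines  */
#define NUM_BANDS  (240 / BAND_H)

typedef struct { int y, h; /* plus bitmap pointer, width, ... */ } Obj;

extern void *emit_branch(void);         /* branch object, target patched later */
extern void  emit_bitmap(const Obj *o);
extern void  emit_stop(void);
extern void *list_pos(void);            /* current end of the list under build */
extern void  patch_branch(void *br, int top, int bot, void *skip_to);

void build_op_list(const Obj *obj, int n)
{
    for (int b = 0; b < NUM_BANDS; b++) {
        int top = b * BAND_H, bot = top + BAND_H;
        void *br = emit_branch();       /* skips this band when the beam is
                                           outside [top, bot)                 */
        for (int i = 0; i < n; i++)
            if (obj[i].y < bot && obj[i].y + obj[i].h > top)
                emit_bitmap(&obj[i]);   /* only objects overlapping the band  */

        patch_branch(br, top, bot, list_pos());
    }
    emit_stop();
    /* The OP now only walks a band's objects on the ~16 lines that can show
       them, instead of walking every object on every scanline.              */
}
```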
I always assumed splitting the object list into separate bins was the standard course of action for Jaguar games. That seemed like the whole point of the branch objects.