Atari Jaguar architecture discussion

I've been digging into some forum posts about the Jaguar, and apparently it's possible for the blitter to render from the GPU's internal SRAM instead of system DRAM. There's only 4KB of it, which already has to hold the GPU's code and data, and I'm sure this slows down the GPU's execution, but it's at least possible.

AtariAge poster kskunk, who appears to be extremely knowledgeable of Atari hardware, made an excellent post comparing the Jaguar and PS1's 3D capabilities:

http://atariage.com/forums/topic/143667-street-fighter-iii-for-jag/page-2#entry1751151

(to be fair there are some points that are too generous for PS1, but I'll save that for another time)

Here he gave a figure of 3.8 Mpixel/s for the blitter texturing from GPU RAM. If accurate, that works out to about 7 cycles per pixel. I'm trying to see if there are any other sources for this value.
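
The conversion is just clock rate divided by throughput. A quick sketch in Python, assuming the commonly cited ~26.59 MHz Jaguar system clock (an approximation on my part, not an official figure):

    JAGUAR_CLOCK_HZ = 26.59e6

    def cycles_per_pixel(mpixels_per_sec):
        # clock cycles available per pixel at a given fill rate
        return JAGUAR_CLOCK_HZ / (mpixels_per_sec * 1e6)

    def mpixels_per_sec(cycles):
        # fill rate achievable at a given cycles-per-pixel cost
        return JAGUAR_CLOCK_HZ / cycles / 1e6

    print(cycles_per_pixel(3.8))  # ~7.0 cycles/pixel for the 3.8 Mpixel/s figure
    print(mpixels_per_sec(5))     # ~5.32 Mpixel/s for a 5 cycle/pixel blit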
 
Nothing could have saved it with that game pad that looks like a TV remote. Oh well, it's just another in the long line of failed consoles.
 
Nothing could have saved it with that game pad that looks like a TV remote. Oh well, it's just another in the long line of failed consoles.

The funny thing is that the telephone-style numpad was used in all sorts of pre-NES controllers. The Atari 5200, Intellivision, Intellivision II, ColecoVision, Emerson Arcadia 2001, RCA Studio II, and Radofin 1292 all had it (as did the Video Touch Pad add-on for the 2600). You'd think Atari would have realized 10 years later that the idea was a dead end, but I guess "outdated relic from the early 80s" was their entire company image at that point. All it needed was a terrible non-centering analog stick to complete the set.
 
AtariAge poster kskunk, who appears to be extremely knowledgeable of Atari hardware, made an excellent post comparing the Jaguar and PS1's 3D capabilities:
I'm flattered you cited my post. I spent more time than I should developing hardware and software for the Jaguar.

The Jaguar is a fun console for hobbyist hackers, since Atari's collapse led to everything leaking out. Not just developer docs, but internal docs, game code, boot ROM and CD BIOS source, the code signing RSA key. Schematics for everything, even unreleased things, and best of all: The HDL source code for both custom chips. (Plus the half-finished Jaguar 2 ASICs.)

So, you can basically hack the Jaguar all the way to its DNA. By snooping around the HDL, I figured out the speed limit of the blitter: 5 cycles/pixel. The constraints are annoying, but you can free the RISC's on-die bus for pipelined rasterization, even at that speed. (Just don't touch external RAM.)

I also found an even faster full-screen rotation hack by running the blitter into the sprite engine.

There are more undocumented tricks, test modes, and surprises in the HDL. Most of them are semi-useful for tech demos, but too delicate or constraining for real games. In the Jaguar, the timing of several systems is exquisitely buggy - so, they stipulate limitations that slow you down. With cycle-accurate timing you can skip over bugs and limits, but you can easily end up with no CPU time for the game. There are other weird things they don't tell you about, like twice the documented CLUT RAM - internally split into odd/even banks to speed up the sprite engine - undocumented graphics modes - I could go on, but that's what AtariAge is for.

That summer of hacking made me wish I could get the HDL for other consoles. We know there are secrets inside them. (What became of the SNES's NES compatibility mode?) In the Jaguar HDL, I found commented-out features that didn't make it, mentions of bugs and fixes and plans that didn't work out.

Anyway, I tried to write a 3D engine for the Jaguar, and it was a great experience. I highly recommend it to all dead-console fanboys: No need to complain about lazy developers cheating you, when you can show everyone what the hardware can do. (Also a great way to learn humility - they weren't as lazy as you think.)
 
I'm flattered you cited my post. I spent more time than I should developing hardware and software for the Jaguar.

Wow, I'm glad I was able to summon you here with it ;) I've been kind of binging Jaguar info (not for the first time...) and your posts on Jaguar, Lynx, and Panther are definitely some of the best on AtariAge. And that's saying a lot given the crowd there.

The Jaguar is a fun console for hobbyist hackers, since Atari's collapse led to everything leaking out. Not just developer docs, but internal docs, game code, boot ROM and CD BIOS source, the code signing RSA key. Schematics for everything, even unreleased things, and best of all: The HDL source code for both custom chips. (Plus the half-finished Jaguar 2 ASICs.)

That's quite an unusual privilege, and I'm not aware of another console where that has happened, at least not with any possibility of legitimacy (or of not being sued into dust if it were revealed). Emulator authors would love to legitimately get their hands on that sort of thing. I had heard about the HDL, and even that there were plans to port the Jaguar 2 design to an FPGA and complete it. That sounded pretty ambitious, though; I would have been surprised if it had been completed.

So, you can basically hack the Jaguar all the way to its DNA. By snooping around the HDL, I figured out the speed limit of the blitter: 5 cycles/pixel. The constraints are annoying, but you can free the RISC's on-die bus for pipelined rasterization, even at that speed. (Just don't touch external RAM.)

So does that mean 5.32 MPixel/s if using GPU SRAM as both source and destination? Was the earlier number of 7 cycles correct if using GPU SRAM as source and main DRAM as destination? And is it right that the blitter accessing GPU SRAM will steal cycles from the GPU, or is the SRAM dual ported?

Of course, the issue with using the GPU SRAM as a destination buffer is that the result (probably?) can't stay there forever.

I also found an even faster full-screen rotation hack by running the blitter into the sprite engine.

You mean setting the visible line buffer as the blitter destination, or something even hackier than that? I figured the GPU IRQ object was for this purpose.

Anyway, I tried to write a 3D engine for the Jaguar, and it was a great experience. I highly recommend it to all dead-console fanboys: No need to complain about lazy developers cheating you, when you can show everyone what the hardware can do. (Also a great way to learn humility - they weren't as lazy as you think.)

I definitely believe it. It'd still be interesting to see the best that could be pulled off; I wonder what the best homebrew thus far is. I was kind of perplexed at how much homebrew devs were messing with running the GPU out of main RAM.

I've been thinking a lot about the 2D capabilities of the device too. It's sometimes heralded as this 2D powerhouse; some have even said it's a lot better than the Saturn. But I think the peak fillrate from the Object Processor isn't really that different from the PS1's, at least in theory. On the other hand, I think the PS1 could be severely hampered by sprite blits that have to stream through the texture cache, because of shared DRAM penalties...
 
So does that mean 5.32 MPixel/s if using GPU SRAM as both source and destination? Was the earlier number of 7 cycles correct if using GPU SRAM as source and main DRAM as destination? And is it right that the blitter accessing GPU SRAM will steal cycles from the GPU, or is the SRAM dual ported?
  1. No, that's using DRAM as the source and the on-chip line-buffer as destination.
  2. 7 cycles is the fastest real games went, indeed using that Atari-recommended trick where src=RISC SRAM and dest=DRAM.
    That's a 5 cycle/pixel texture-mapping blit plus a 2 cycle/pixel copy blit.
  3. The RISC SRAM is single-ported and the stealing slows down both the RISC and blitter.
    That makes #2 slow in practice - the normal way halts the RISC to blit one line, then halts the blitter while the RISC sets up the next line.
The major constraint of approach #1 is that you're racing the beam. You can comfortably hit 256x240 at 60FPS, which is phenomenal for the Jaguar, but you need a sorted list of spans. In other words, a Quake-style span-sorting hidden-surface-removal engine. I imagined running that on the DSP (Jerry), but I only got as far as a brute force sorter that could handle a few polygons, not the hundreds I dreamed of. Oh, well. I'm no Carmack.
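
For anyone following along, here is the rough per-line budget in Python; back-of-the-envelope only, ignoring blanking details and contention from the OP and DRAM refresh:

    CLOCK_HZ = 26.59e6
    LINE_RATE_HZ = 15734                       # approximate NTSC horizontal scan rate
    CYCLES_PER_PIXEL = 5                       # the 5 cycle/pixel limit described above

    cycles_per_line = CLOCK_HZ / LINE_RATE_HZ  # ~1690 clocks per scanline
    fill_cycles = 256 * CYCLES_PER_PIXEL       # 1280 clocks to draw one 256-pixel line
    print(cycles_per_line - fill_cycles)       # ~410 clocks of slack per line for setup/overdraw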

The major advantage of approach #1 is optimal use of the internal buses. The RISC and blitter both run in parallel at max theoretical speed (see point #3). The GPU has a dedicated bus to the blitter, so it can actually start setting blitter registers before the current span is done, without interrupting the blit in progress. With careful timing, the blitter never has to stop and span-setup/rasterization is fully pipelined in the GPU.

Side note: I designed a flash cart (Skunkboard) that enables 5 cycle textures from ROM to DRAM. Unfortunately, I used 2007 technology to do it. With 1994 ROMs, you could probably do 8 cycle texturing that way, without slowing your RISC, so it might still be a net win. I never heard of anyone doing that.

You mean setting the visible line buffer as the blitter destination, or something even hackier than that?
Hackier. My friend was working on a Rad Mobile style engine using the sprite scaling hardware, and I had some funny timing where the blitter grabbed lines out of the linebuffer just as they were rendered, skewing them on their way to DRAM, then kicking the OP to replace that line with the skewed version just before it was displayed. I even saved some DRAM by using a rolling skew-buffer that was shorter than the screen height, since we never tilted more than 30 degrees. (Sorry, it's been a while, I probably forgot something.)

But I think the peak fillrate from the Object Processor isn't really that different from the PS1's, at least in theory. On the other hand, I think the PS1 could be severely hampered by sprite blits that have to stream through the texture cache, because of shared DRAM penalties...
I don't know the PS1 except from Sony's inflated specs. But, I do know the OP is the best-optimized part of the Jaguar. With only a few objects per line, it can hit 3200 pixels per scan line including CLUT lookups when rendering 4bpp or 8bpp images in 16-bit RGB/CRY mode. The NeoGeo gets 1536.
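
To put that peak in per-clock terms (rough arithmetic in Python, assuming the OP can use most of the ~63.6 microsecond NTSC line time; the clock and line rate are my approximations):

    CLOCK_HZ = 26.59e6
    LINE_RATE_HZ = 15734                       # approximate NTSC horizontal scan rate
    cycles_per_line = CLOCK_HZ / LINE_RATE_HZ  # ~1690 clocks per scanline
    print(3200 / cycles_per_line)              # ~1.9 pixels written per clock at that peak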

As you probably know, increasing the number of sprites/objects lowers performance. Developers who really pushed the OP (like Minter's D2K) did so by sorting their object lists into buckets to improve the pixel-to-object processing ratio. Minter once said he experimented with GPU RAM for the OP list to reduce page misses - no idea if that stuck or not. Some special effects halve performance, such as blending/semi-transparency - but I'm sure the PS1 has similar issues.

As impressive as that all is, in practice it doesn't mean much. I could easily build a 1,000 sprite object list that moved 35 megapixels/second. But running interesting game logic (not to mention collisions) on 1,000 objects requires some CPU and bandwidth, which unfortunately is all used up rendering sprites.
 
All very fascinating information, thanks Nammo.

  1. No, that's using DRAM as the source and the on-chip line-buffer as destination.
  2. 7 cycles is the fastest real games went, indeed using that Atari-recommended trick where src=RISC SRAM and dest=DRAM.
    That's a 5 cycle/pixel texture-mapping blit plus a 2 cycle/pixel copy blit.

So it sounds like if you pair the internal SRAM with DRAM in either direction, the blit itself is just 5 cycles. I didn't think the 7-cycle figure included preparing the SRAM with the data. The 2 cycles there are with the blitter in phrase mode, right? So 8 cycles per phrase? Then you could save some time here by using 8bpp pixels? Was that what Zero 5 did?
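
Just to spell out the arithmetic behind my question (these are my own assumptions, nothing confirmed, beyond the fact that a Jaguar phrase is 64 bits):

    PHRASE_BITS = 64                            # a Jaguar phrase is 64 bits wide
    COPY_CYCLES_PER_PIXEL_16BPP = 2             # the 2 cycle/pixel copy figure above

    pixels_per_phrase_16bpp = PHRASE_BITS // 16 # 4 pixels per phrase
    pixels_per_phrase_8bpp = PHRASE_BITS // 8   # 8 pixels per phrase

    cycles_per_phrase = COPY_CYCLES_PER_PIXEL_16BPP * pixels_per_phrase_16bpp  # 8
    # If the copy cost is really fixed per phrase, 8bpp texels would halve the
    # per-pixel cost of staging texture data into SRAM:
    print(cycles_per_phrase / pixels_per_phrase_8bpp)  # 1.0 cycle/pixel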

The RISC SRAM is single-ported and the stealing slows down both the RISC and blitter. That makes #2 slow in practice - the normal way halts the RISC to blit one line, then halts the blitter while the RISC sets up the next line.

When you mention the blitter halting the GPU, do you mean for the entire duration of the blit, or just for the 1 out of 5 cycles per pixel where it should be accessing the SRAM?
The major constraint of approach #1 is that you're racing the beam. You can comfortably hit 256x240 at 60FPS, which is phenomenal for the Jaguar, but you need a sorted list of spans. In other words, a Quake-style span-sorting hidden-surface-removal engine. I imagined running that on the DSP (Jerry), but I only got as far as a brute force sorter that could handle a few polygons, not the hundreds I dreamed of. Oh, well. I'm no Carmack.

The major advantage of approach #1 is optimal use of the internal buses. The RISC and blitter both run in parallel at max theoretical speed (see point #3). The GPU has a dedicated bus to the blitter, so it can actually start setting blitter registers before the current span is done, without interrupting the blit in progress. With careful timing, the blitter never has to stop and span-setup/rasterization is fully pipelined in the GPU.

That's a very interesting approach.

If you could transform and span-sort something like 1000 polygons at 15Hz, the results would be much better than anything I've seen on the Jaguar. You'd be wasting most of the 60Hz fillrate repeating the same spans over and over again, though. It's a compelling exercise to work out exactly how many triangles the DSP could Y-sort and span intersect.
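
Here's the very rough version of that budget in Python (assuming Jerry runs at the same ~26.59 MHz clock as Tom; the 15Hz and 1000-polygon numbers are just the ones I suggested above):

    DSP_CLOCK_HZ = 26.59e6            # assuming Jerry runs at the same clock as Tom
    GEOMETRY_RATE_HZ = 15             # transform/sort pass at 15Hz as suggested above
    POLYGONS = 1000

    cycles_per_pass = DSP_CLOCK_HZ / GEOMETRY_RATE_HZ  # ~1.77 million clocks per pass
    print(cycles_per_pass / POLYGONS)                  # ~1770 clocks per polygon for
                                                       # transform, Y-sort, and span
                                                       # intersection, before counting
                                                       # traffic over the 16-bit bus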

I take it there's no way you could use Z-buffering with the line buffer target, at least not without adding a second pass of the OPL from line buffer to line buffer to get rid of the Z data?

Looking again at the #2 scenario above, I wonder if, instead of copying to GPU SRAM as a temporary texture buffer, you could copy to the line buffers (the portion that's not visible on a low-resolution display) or the palette RAM. That would stop the blitter from conflicting with the GPU during blits, and if you managed to fit multiple rows of textures into those buffers you could schedule a few blits back to back.

But then the problem with using the line buffer this way is that the OPL needs to read from it on 50% of the scanlines. I guess you'd need to synchronize the blits right after the flip, make sure you don't run over the period where it flips again, then synchronize again with the next flip (and move to the next buffer). The overhead could be pretty bad. But the palette RAM could be free if you're not using palettes anywhere, assuming it's designed to allow fast writes.

Side note: I designed a flash cart (Skunkboard) that enables 5 cycle textures from ROM to DRAM. Unfortunately, I used 2007 technology to do it. With 1994 ROMs, you could probably do 8 cycle texturing that way, without slowing your RISC, so it might still be a net win. I never heard of anyone doing that.

When you say 1994 ROMs, do you mean ones that could have been used then or actual Jaguar ROMs that were used? It seems like texturing from ROM would (usually) be a no-brainer if there's no performance hit, and even more so if there's a performance advantage!

I don't know the PS1 except from Sony's inflated specs. But, I do know the OP is the best-optimized part of the Jaguar. With only a few objects per line, it can hit 3200 pixels per scan line including CLUT lookups when rendering 4bpp or 8bpp images in 16-bit RGB/CRY mode. The NeoGeo gets 1536.

PS1's peak theoretical fillrate is 2 pixels/cycle for flat shaded triangles and sprites, 1 pixel/cycle for shaded/textured triangles (so scaled/rotated sprites) and 0.5 pixels/cycle for alpha blended shaded/textured triangles. It has a 27% higher clock speed than Jaguar, and unlike the OPL it can get meaningful fillrate all of the time and not just when scanlines are being displayed.
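
In absolute terms, my arithmetic in Python, using the ~33.87 MHz clock implied by that 27% comparison (these are the theoretical per-cycle figures above, not measured throughput):

    PS1_CLOCK_HZ = 33.8688e6               # the clock implied by the "27% higher" figure
    JAGUAR_CLOCK_HZ = 26.59e6

    print(2.0 * PS1_CLOCK_HZ / 1e6)        # ~67.7 Mpixel/s, flat shaded / sprites
    print(1.0 * PS1_CLOCK_HZ / 1e6)        # ~33.9 Mpixel/s, shaded/textured
    print(0.5 * PS1_CLOCK_HZ / 1e6)        # ~16.9 Mpixel/s, alpha blended shaded/textured
    print(PS1_CLOCK_HZ / JAGUAR_CLOCK_HZ)  # ~1.27, i.e. the 27% clock advantage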

But the big downside is that textures and the framebuffer share the same memory space, so texture cache misses decrease the fill rate by a lot. There's also bound to be some memory overhead in advancing the destination to the next framebuffer row (which is a whole DRAM page away).

What makes it even harder to make a comparison is that PS1's performance isn't even a constant throughout the life of the system. Rather early on Sony redesigned the GPU to use SGRAM instead of VRAM, which completely changed the performance. I've heard from Mednafen's author that texture cache misses took some 50% longer on the VRAM version, which makes sense given that SGRAM can have two banks open simultaneously. It also had a faster read/modify/write turnaround which affected practical alpha blending performance.

As you probably know, increasing the number of sprites/objects lowers performance. Developers who really pushed the OP (like Minter's D2K) did so by sorting their object lists into buckets to improve the pixel-to-object processing ratio. Minter once said he experimented with GPU RAM for the OP list to reduce page misses - no idea if that stuck or not. Some special effects halve performance, such as blending/semi-transparency - but I'm sure the PS1 has similar issues.

I always assumed splitting the object list into separate bins was the standard course of action for Jaguar games. That seemed like the whole point behind the branch objects.
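
Something like this is how I pictured the bucketing working; a conceptual Python sketch only, not the real 64-bit object-list encoding, and the bucket height is just a made-up tuning number:

    BUCKET_HEIGHT = 16   # scanlines per bucket; a tuning choice, not a hardware constant

    def build_buckets(sprites, screen_height=240):
        # Group sprites into vertical bands so a branch object at the head of each
        # bucket can make the OP skip the whole bucket when the beam is outside it.
        n = (screen_height + BUCKET_HEIGHT - 1) // BUCKET_HEIGHT
        buckets = [[] for _ in range(n)]
        for s in sprites:                                   # s = {'y': ..., 'height': ...}
            first = max(s['y'] // BUCKET_HEIGHT, 0)
            last = min((s['y'] + s['height'] - 1) // BUCKET_HEIGHT, n - 1)
            for b in range(first, last + 1):
                buckets[b].append(s)                        # an object can span buckets
        return buckets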
 
Was that what Zero 5 did?
On the Jaguar, smooth and flat shaded polygons are 10-20x faster than even "fast" texture mapping like we're discussing. So, Zero 5 used textures sparingly, like all the better-performing Jaguar games. 8-bit texture mapping would be a minor gain in that context.

There've been some interesting interviews where the designers explain that in 1990 when the Jaguar was designed, smooth shaded polygons seemed like the pinnacle of realistic 3D. Texture mapping wasn't even on their radar. Wolf3D didn't exist yet.

When you mention the blitter halting the GPU, do you mean for the entire duration of the blit, or just for the 1 out of 5 cycles per pixel where it should be accessing the SRAM?
It's software-halted for each line blitted. The blitter is slowed down while the GPU is running (since it adds cycles in bus arbitration), so somebody must have concluded it was better to avoid arbitration.

It's a compelling exercise to work out exactly how many triangles the DSP could Y-sort and span intersect.
Oh, yeah, I worked it out. There's a bottleneck in getting the data in and out of the DSP over its slow 16-bit bus, but it's manageable. You have all day to sort things inside the DSP local RAM.

I take it there's no way you could use Z-buffering with the line buffer target
Naw, the added cycles for Z-buffering cut deep into the horizontal resolution, and there's not time for much overdraw, either.

But the palette RAM could be free if you're not using palettes anywhere, assuming it's designed to allow fast writes.
Nope, it can only write 16-bits at a time, and reads don't work when the OP is not running, so you'd need to tread carefully.

When you say 1994 ROMs, do you mean ones that could have been used then or actual Jaguar ROMs that were used?
All commercial Jaguar games set the ROM cycle time to 375ns for 15 cycle/pixel texturing. 8 cycle texturing needs 188ns cycle time, which translates to access times around 135ns. 1994 ROMs could do that, but those were power hungry and possibly(?) more expensive.
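
For anyone following along, those cycle times are roughly whole numbers of ~26.59 MHz bus clocks; a quick Python conversion (my arithmetic, not figures from the docs):

    CLOCK_NS = 1e9 / 26.59e6   # ~37.6 ns per bus clock
    print(375 / CLOCK_NS)      # ~10 clocks per ROM access (the stock 375ns setting)
    print(188 / CLOCK_NS)      # ~5 clocks per ROM access (the faster 188ns setting)
    # The rest of the 15 and 8 cycle/pixel figures presumably goes to the DRAM
    # destination write and blitter overhead; that split is my assumption.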

In any case, the more sophisticated Jag games decompressed textures into RAM. ROM cost enough to make it worth it.

I always assumed splitting the object list into separate bins was the standard course of action for Jaguar games. That seemed like the whole point behind the branch objects.

It was more common for tilemaps. Lots of games didn't have more than 20-30 moving things on the screen, so they didn't bother sorting.
 
I apologize for necro-reviving this thread, which has been dead for nearly 6 years.

One of the links is now defunct, so here is an archived one:

I want to highlight this ancient website that has alleged pictures of the Jaguar 2, last updated 20 years ago:
Also, a datasheet for the Hitachi CMOS DRAM that was used in the Jaguar; I don't have the knowledge to calculate its bandwidth from it.

The Jaguar could have competed with the Sony PlayStation and Sega Saturn if Atari had had just one more year's worth of money and resources to invest in developing the Jaguar and fixing its hardware bugs.
Perhaps they could have replaced the Motorola 68000 with another Tom-like processor, modified so it could act as a dedicated CPU and take over the 68000's role as manager.
Samsung offered SDRAM in 1992 and it was in mass production by 1993; by 1994 SDRAM was affordable enough, as evidenced by its use in the Sega Saturn.

I have this as a reference for the bandwidth of FPM RAM, EDO RAM, and SDRAM, though note the timeline reflects approximately when home/personal computers started using each.
Again, the Sega Saturn used SDRAM en masse in 1994, while the source above says 1995 because it covers home/personal computers, and PCMag repeats the same date.

I hope there are people still interested in Atari's Jaguar.
 
Four 512KB DRAMs on a 64-bit bus and only 100MB/s of bandwidth doesn't sound right.
It turns out it's because of the Motorola 68000: they lowered the frequency to match it.
That effectively halved the system bandwidth and further crippled performance.

If the DRAM in the Jaguar could have operated at the same frequency as Tom and Jerry...
...then performance would have improved by at least 25 percent!
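
Rough arithmetic behind those bandwidth figures in Python (my estimate only, assuming a 64-bit DRAM bus):

    BUS_WIDTH_BYTES = 8                        # 64-bit DRAM bus
    print(BUS_WIDTH_BYTES * 13.295e6 / 1e6)    # ~106 MB/s at one transfer per two 26.59MHz clocks
    print(BUS_WIDTH_BYTES * 26.59e6 / 1e6)     # ~213 MB/s if one transfer completed every clock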
 
It seems bandwidth depended on the clock speed of the processors, since the only rating given there for FPM/EDO/BEDO DRAM is in nanoseconds, unlike SDRAM.
Anyway, there have been overclocks of Tom and Jerry to nearly 40MHz, the highest being 37MHz while passively cooled; the 68000 was the limitation.
The vast majority of games use the 68000 in some form, so only a few can work without it and have 40MHz forced.

If there had been no 68000, the Jaguar could have launched with Tom and Jerry clocked at 40MHz, with no 68K eating away at the bandwidth either.
Another point is that removing the 68K would have left enough space on the motherboard to, say, add two more 512KB FPM DRAM modules in its place.
If Jerry then had 1MB of DRAM dedicated to itself and Tom had 2MB all to itself, it would have been great bandwidth-wise.

Of course, Jerry would have needed a DRAM controller too.

But that is outside the scope of this thread.
 