I doubt this very much too. To my knowledge, the GBA had no hardware 3D acceleration and a very slow main CPU. Doing 3D math in software is slow as hell unless you've got some sort of on-chip vector math unit, like in the Sega Dreamcast's SH4.
It's not that slow. Peak poly counts were a nice marketing number, but they weren't really the bottleneck. My GBA engine could render 64k visible textured polys per second; if you add backface culling, it would peak at 128k/s. The real limitation was the fillrate.
I did all the math in 24:8 fixed point; sin/cos came from a 512-entry LUT. The only time I needed a division was during perspective projection, where I used a u16 reciprocal LUT of 2048 entries (if I recall correctly). As all memory was either SRAM or ROM, accessing it randomly wasn't much of an issue and was very predictable, so I had no worries about that.
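To make the fixed-point part concrete, here is a minimal sketch of 24:8 multiplication and a reciprocal-LUT perspective divide. The table size and the u16 entry type come from the post; the 0.16 scale of the reciprocals, the integer depth index and all names (fx_mul, rcp_lut, project) are assumptions for illustration, not the original engine's code. On real hardware the table would more likely be a precomputed const array in ROM.

```c
#include <stdint.h>

#define FX_SHIFT     8                 /* 24:8 format: 8 fractional bits */
#define FX_ONE       (1 << FX_SHIFT)
#define RCP_LUT_SIZE 2048              /* entry count mentioned in the post */
#define RCP_SHIFT    16                /* assumed scale: rcp_lut[z] ~= 65536 / z */

typedef int32_t fx_t;

static uint16_t rcp_lut[RCP_LUT_SIZE];

/* Fill the reciprocal table; values that do not fit in 16 bits are clamped. */
static void init_rcp_lut(void)
{
    rcp_lut[0] = 0xFFFF;               /* z = 0 is invalid, clamp */
    for (uint32_t z = 1; z < RCP_LUT_SIZE; z++) {
        uint32_t r = (1u << RCP_SHIFT) / z;
        rcp_lut[z] = (uint16_t)(r > 0xFFFF ? 0xFFFF : r);
    }
}

/* 24:8 multiply: widen to 64 bits, then drop the extra 8 fractional bits. */
static fx_t fx_mul(fx_t a, fx_t b)
{
    return (fx_t)(((int64_t)a * b) >> FX_SHIFT);
}

/* Perspective projection x * focal / z without a divide.
   x is 24:8, focal and z are plain integers, z must be in 1..2047.
   Relies on arithmetic right shift for negative x (true on common compilers). */
static int project(fx_t x, int focal, int z)
{
    return (int)(((int64_t)x * focal * rcp_lut[z]) >> (FX_SHIFT + RCP_SHIFT));
}
```

For example, project(100 * FX_ONE, 128, 64) looks up 65536/64 = 1024 and returns 200, the same as 100 * 128 / 64.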
For rasterization, I used a divide & conquer approach. The ARM instruction set lets quite a few instructions use a 'free' shift (nowadays on phones etc. it's an extra cycle, but on the GBA it really was free).
thus interpolating UVs was something like
Code:
UVnew = (UVleft + UVright) >> 1;
I had a tiny software stack in IWRAM to push one side onto, which, again on ARM, was just one instruction, because a memory load or store combined with an increment or decrement of the pointer is a single instruction.
Code:
*pUVStack++ = UVright; // one instruction
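One way to read the divide & conquer idea is the following sketch for a single span: take the midpoint, push the right end on a small stack, keep halving the left part, and pop when a piece is down to adjacent pixels. This is only an interpretation; span_subdivide, Edge and STACK_DEPTH are made-up names, and the real engine presumably did this in hand-written ARM assembly.

```c
#include <stdint.h>

#define STACK_DEPTH 16   /* hypothetical: about log2(span width) entries are needed */

typedef struct { int x; uint32_t v; } Edge;

/* Fill out[x0..x1] with values interpolated between v0 and v1
   using nothing but adds and shifts by one. */
static void span_subdivide(uint32_t *out, int x0, uint32_t v0, int x1, uint32_t v1)
{
    Edge stack[STACK_DEPTH];
    Edge *sp = stack;

    out[x0] = v0;
    out[x1] = v1;
    for (;;) {
        if (x1 - x0 > 1) {
            /* Split: compute the midpoint, remember the right end, go left. */
            int      xm = (x0 + x1) >> 1;
            uint32_t vm = (v0 + v1) >> 1;
            out[xm] = vm;
            sp->x = x1; sp->v = v1; sp++;     /* push the right side */
            x1 = xm; v1 = vm;
        } else {
            /* Left piece finished: continue with the pending right piece. */
            if (sp == stack)
                break;
            x0 = x1; v0 = v1;
            sp--; x1 = sp->x; v1 = sp->v;     /* pop */
        }
    }
}
```

For spans whose width is a power of two the result is exactly linear; for other widths the midpoints are rounded down, so the interpolation is only approximate.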
Another trick was to store U and V in the same register. There is this old trick to average two RGBA8 colors:
Code:
RGBAnew = ((RGBA0&0xfefefefe)>>1)+((RGBA1&0xfefefefe)>>1);
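As a small self-contained version of that line (rgba_avg is just an illustrative name): masking with 0xfefefefe clears the lowest bit of every channel, so the shift cannot leak a bit into the neighbouring channel, at the cost of rounding each channel down.

```c
#include <stdint.h>

/* Average two packed RGBA8 colors, all four channels at once. */
static uint32_t rgba_avg(uint32_t a, uint32_t b)
{
    return ((a & 0xFEFEFEFEu) >> 1) + ((b & 0xFEFEFEFEu) >> 1);
}
```

For example, rgba_avg(0x10203040, 0x30405060) gives 0x20304050.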
I did a similar thing with UVs, but spaced them by one bit, so UV would lie in a 32-bit register like
Code:
000000000000000UUUUUUUU0VVVVVVVV
That's why UVnew = (UVleft + UVright) >> 1; can interpolate both in one instruction.
there was some &0xfffeffff cleanup afterwards, of course.
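A minimal sketch of the packed midpoint, using exactly the bit layout drawn above (V in bits 0..7, one spacer bit, U in bits 9..16). The helper names are made up. Note that with this particular layout the bit that needs clearing after the shift is bit 8, where the lowest bit of the U sum lands; a mask of 0xfffeffff would fit a layout with the spacer at bit 16, so the mask here is an assumption tied to the diagram.

```c
#include <stdint.h>

#define V_BITS  8
#define U_SHIFT (V_BITS + 1)        /* one spacer bit between V and U */
#define SPACER  (1u << V_BITS)      /* bit 8 in the layout shown above */

static uint32_t uv_pack(uint32_t u, uint32_t v)
{
    return ((u & 0xFFu) << U_SHIFT) | (v & 0xFFu);
}

static uint32_t uv_u(uint32_t uv) { return (uv >> U_SHIFT) & 0xFFu; }
static uint32_t uv_v(uint32_t uv) { return uv & 0xFFu; }

/* Midpoint of both coordinates with one add and one shift.
   The spacer bit absorbs the carry of V during the add; after the shift
   it holds the lowest bit of the U sum, which is masked away. */
static uint32_t uv_mid(uint32_t a, uint32_t b)
{
    return ((a + b) >> 1) & ~SPACER;
}
```

For example, the midpoint of (U=10, V=200) and (U=21, V=255) comes out as (U=15, V=227), each coordinate rounded down.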
To avoid calculating a separate texture address, textures were interleaved on the U axis (like a texture atlas), which allowed me to fetch a texel directly with pTexture[UVnew];
Textures were stored in ROM. You could manually set up the ROM access speed, and those homebrew dev cartridges could run at the highest setting.
The only really fast memory was the IWRAM, which had a 32-bit bus. As ARM opcodes are 32 bits wide, that's the only place where your code would run at the full 16MHz; if you ran from ROM or the 256KB RAM, you'd effectively run at 8MHz at best (or at 16MHz with Thumb instructions, but that's not better either). Thus the fastest place for 'fillrate'-critical data was already taken up by the code. The second-best place was VRAM. It's 16-bit, but you can't write single bytes to it; you have to write 16 bits at a time, which made me snap my rasterization to 2-pixel boundaries.
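The 2-pixel snapping could look roughly like this, assuming an 8-bit paletted framebuffer (such as the GBA's mode 4) where each 16-bit word holds two pixels, left pixel in the low byte. The function name and parameters are hypothetical, and the buffer pointer is passed in so the sketch also runs off-target.

```c
#include <stdint.h>

/* Write two horizontally adjacent 8-bit pixels with one 16-bit store.
   x is snapped down to an even column; pitch_px is the row width in pixels. */
static inline void put_pixel_pair(volatile uint16_t *fb, int x, int y, int pitch_px,
                                  uint8_t left, uint8_t right)
{
    int word = (y * pitch_px + (x & ~1)) >> 1;      /* two pixels per word */
    fb[word] = (uint16_t)(left | (right << 8));     /* little-endian order */
}
```

A rasterizer built on this would start and end every span on an even column, so no read-modify-write of a half-filled word is ever needed.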
I made a kind of Wipeout/F-Zero hybrid with the engine. It was fully 3D, but the track had no height (similar to the F1 game). I have a backup of it all somewhere, but the only thing online is an early version that I used to attract some artists:
http://rapsooo.tripod.com/
At the bottom are 3 vids (the first time you click to download, tripod now seems to forward you to an ad page), recorded with my mighty 320x200 webcam, I think.
There was way more to it, e.g. my own movie codec. Funnily, I used what you'd nowadays call a Hadamard transform, but I actually derived it from the DCT and applied it to 4x4 blocks. MOST of the frame just used motion compensation, though. No, there was no residual encoding; it was really just either a set of four 4x4 transformed blocks or a memcpy of a previous-frame 8x8 block. There was also no smart quantization or zigzag encoding; I just stored the upper-left triangle of the transformed block. All data of a frame had a size limit, which implicitly limited the number of 4x4 Hadamard blocks. Which block was a motion vector and which was transformed was decided by the MSE to the reference block.
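For readers who haven't seen it, a 4x4 Hadamard transform needs only adds and subtracts. The sketch below is a generic version, not the codec's actual code: it orders the outputs by number of sign changes so that the low frequencies end up in the upper-left corner, and keep_triangle with its keep parameter is a hypothetical stand-in for "store only the upper-left triangle", since the post doesn't say how many coefficients were kept.

```c
#include <stdint.h>

/* 4-point Hadamard butterfly on p[0], p[stride], p[2*stride], p[3*stride].
   Outputs are in sequency order (0, 1, 2, 3 sign changes). */
static void wht4_1d(int32_t *p, int stride)
{
    int32_t a0 = p[0], a1 = p[stride], a2 = p[2 * stride], a3 = p[3 * stride];
    int32_t s01 = a0 + a1, d01 = a0 - a1;
    int32_t s23 = a2 + a3, d23 = a2 - a3;
    p[0]          = s01 + s23;   /* + + + + */
    p[stride]     = s01 - s23;   /* + + - - */
    p[2 * stride] = d01 - d23;   /* + - - + */
    p[3 * stride] = d01 + d23;   /* + - + - */
}

/* 2D transform of a row-major 4x4 block: rows first, then columns.
   Applying it twice returns the input scaled by 16. */
static void wht4x4(int32_t b[16])
{
    for (int r = 0; r < 4; r++) wht4_1d(b + 4 * r, 1);
    for (int c = 0; c < 4; c++) wht4_1d(b + c, 4);
}

/* Zero every coefficient outside the triangle row + col < keep.
   Returns the number of coefficients that were kept. */
static int keep_triangle(int32_t b[16], int keep)
{
    int kept = 0;
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            if (r + c < keep) kept++;
            else b[4 * r + c] = 0;
        }
    return kept;
}
```

Because the sequency-ordered matrix is symmetric, decoding is the same wht4x4 call followed by a shift right by 4, which suits a CPU without a hardware multiplier-heavy budget.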
The frame was then either Huffman or LZ77 encoded (whichever led to the smaller size), as the BIOS ROM already contained decompression routines for those. That saved quite some binary size in IWRAM.
I could go on forever if someone is not sleeping yet
But my point was: polycount wasn't the limiting factor; rather, it was the fillrate.