It isn't just fill and GFLOPS, it's bandwidth, it's a more flexible polygon pipeline, it's audio, storage media, and (dare I open this can of worms again) system memory. Even the GPU has more embedded memory. If you can find any metric by which the PS2 doesn't out spec the GameCube, please tell me.
I said it in an earlier post already: the CPU is a lot better at general purpose code that isn't highly data regular. I mean, if you want to pull out paper specs that don't mean anything, you could cite peak integer operations or independent loads/stores per second or something. But you really can't ignore the big difference in cache hierarchy. EE's 16KB and 8KB L1 caches are only 2-way set associative, very meager compared to Gekko's 32KB + 32KB 8-way L1 caches plus a 256KB L2 cache.
As for your other points:
The bandwidth numbers Sony came up with don't really make sense, since they add together peak color/depth RMWs with peak texture loads, but you can only get half the former if texturing is enabled. Both GC and PS2 claim the same texture memory bandwidth per clock, and GC has a higher GPU clock, so it actually wins there. The framebuffer bandwidth is really just another side of the argument around fillrate; they're synonymous. (I don't know how much bandwidth GC's FB eDRAM has, but I'm sure it's just enough to do 4 pixels/clock with RGBA + depth instead of 16 like PS2.)
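To make the double-counting concrete, here's a quick back-of-the-envelope in Python. The bus widths (1024-bit framebuffer read + 1024-bit write + 512-bit texture on the GS; 512-bit texture path on Flipper) and the clocks are the commonly cited figures, so treat them as assumptions rather than datasheet gospel:

```python
# Rough GS vs Flipper bandwidth math. Bus widths (bits/clock) and
# clocks are commonly cited figures -- treat them as assumptions.
GS_CLOCK = 150e6       # PS2 Graphics Synthesizer
FLIPPER_CLOCK = 162e6  # GameCube Flipper

def gbps(bits_per_clock, clock_hz):
    """Bandwidth in GB/s for a bus of the given width and clock."""
    return bits_per_clock / 8 * clock_hz / 1e9

# Sony's 48GB/s headline adds framebuffer read+write and texture reads
gs_fb = gbps(1024 + 1024, GS_CLOCK)   # 38.4 GB/s of color/depth RMW
gs_tex = gbps(512, GS_CLOCK)          # 9.6 GB/s of texture reads
print(gs_fb + gs_tex)                 # 48.0 -- the marketing number

# With texturing on, pixel rate (and thus FB traffic) halves
print(gs_fb / 2 + gs_tex)             # 28.8

# Same 512 bits/clock of texture bandwidth, but a higher clock on GC
flipper_tex = gbps(512, FLIPPER_CLOCK)
print(flipper_tex)                    # ~10.4, edging out the GS's 9.6
```

The point isn't the exact GB/s figures; it's that the headline number only exists when you count framebuffer traffic you can't actually sustain while texturing.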
For the main RAM, PS2 put its controller on the CPU side while GC put it on the GPU side, so they trade advantages there: PS2 has 2.4GB/s peak bandwidth to RAM vs GC's 1.2GB/s, while GC has 2.6GB/s of bandwidth to the GPU vs 1.2GB/s for PS2. IMO, GC makes the better tradeoff here, since that bandwidth is more easily utilized filling the texture buffer (which both have to do), and on the CPU side the L2 cache absorbs a lot of the bandwidth difference.
When you mention audio, let's be clear that you're talking about output format and not processing - and only a small number of games had DTS. If you look at audio acceleration, PS2's SPU2 is little more than two PS1 SPUs bolted together. At 48 voices/48kHz you'd be looking at about 35 cycles per sample for the GameCube's DSP, which is almost certainly enough to do a lot more than PS2's per-sample capability, even when you factor in ADPCM decoding (which in practice would only go to a small buffer, at far less than 1:1 input to output cycles). And of course the DSP is more flexible. PS2 has the IOP as well, but it's not as if the effect is additive; the IOP is overkill for what it'd naturally be used for (sequencing instruments on and off).
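The ~35 cycles/sample figure falls straight out of the clock math. A minimal sketch, assuming the commonly cited ~81MHz clock for the GameCube's DSP (an assumption on my part, not something from this thread):

```python
# Cycles-per-sample estimate for the GameCube audio DSP.
# The 81MHz clock is a commonly cited figure -- an assumption here.
DSP_CLOCK = 81e6
VOICES = 48
SAMPLE_RATE = 48_000  # Hz

# Cycles the DSP can spend on each voice for each output sample
cycles_per_voice_sample = DSP_CLOCK / (VOICES * SAMPLE_RATE)
print(round(cycles_per_voice_sample))  # ~35
```

That's the per-voice budget for mixing, envelopes, filtering, and any effects work, which is why 35 cycles goes a long way against fixed-function per-voice hardware.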
I'll give you storage space; that's an easy win for PS2. I won't give you system memory, though. There's simply some persistent state that needs to be in RAM but doesn't have to be accessed all that frequently. For an extreme example, look at the stuff that gets saved to the memory card.
Beyond the fact that it would be fillrate limited in most cases, tiling requires framebuffer alpha, and because GC/Wii can only output 24-bit, if you are using 8-bit for your alpha you are only getting 16-bit color. That's why Cube and Wii games end up with more banding. Using FSAA also makes the back buffer too large for the 2MB embedded framebuffer, so games would have to drop color depth and tile to remove jaggies, and that's why so few games used it.
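The FSAA size argument can be sanity-checked with quick arithmetic. This is a rough sketch: the ~2MB eFB size and the bytes-per-sample figures for the color+Z formats are assumptions for illustration, not exact hardware numbers:

```python
# Back-of-the-envelope eFB budget check. The 2MB capacity and the
# bytes-per-sample figures are assumptions for illustration.
EFB_BYTES = 2 * 1024 * 1024  # ~2MB embedded framebuffer

def fb_bytes(width, height, samples, bytes_per_sample):
    """Total eFB space for color+Z at the given AA sample count."""
    return width * height * samples * bytes_per_sample

# 24-bit color + 24-bit Z = 6 bytes/sample, no AA: fits
base = fb_bytes(640, 480, 1, 6)
# 2x AA at the same depth: blows the budget
aa_full = fb_bytes(640, 480, 2, 6)
# Even 16-bit color + 16-bit Z (4 bytes/sample) with 2x AA is too big
aa_16 = fb_bytes(640, 480, 2, 4)

print(base, base <= EFB_BYTES)        # fits
print(aa_full, aa_full <= EFB_BYTES)  # doesn't fit
print(aa_16, aa_16 <= EFB_BYTES)      # still doesn't fit -> must tile
```

So a full-resolution AA buffer can't fit in the eFB even at reduced color depth, which is why AA and tiling (and the color-depth drop) end up coupled together.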
What makes you think tiling requires framebuffer alpha?
You don't need dest alpha if you can avoid multipass rendering over the part where you need alpha, which with TEV you probably often could.