There are 5 processors involved in PS2 graphics:
The EE (CPU), which does normal CPU things. In this case its job is high-level culling of whole batches of polygons: it decides whether to draw a whole batch or skip it. Batches to be drawn are linked into a queue of instructions for the VIF.
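To make the batch-level culling concrete, here is a minimal sketch in Python. It assumes each batch carries a bounding sphere and that the view volume is given as planes with inward-pointing normals. The names (`Batch`, `build_draw_queue`) are made up for illustration and are not from any PS2 SDK.

```python
from dataclasses import dataclass

@dataclass
class Batch:
    name: str
    center: tuple   # bounding-sphere center (x, y, z), hypothetical layout
    radius: float

def sphere_outside_plane(center, radius, plane):
    # plane = (nx, ny, nz, d), normal pointing into the visible volume
    nx, ny, nz, d = plane
    distance = nx * center[0] + ny * center[1] + nz * center[2] + d
    return distance < -radius

def build_draw_queue(batches, frustum_planes):
    # Keep a batch unless its whole sphere is behind at least one plane.
    queue = []
    for batch in batches:
        if any(sphere_outside_plane(batch.center, batch.radius, p)
               for p in frustum_planes):
            continue  # skip the whole batch, no per-triangle work at all
        queue.append(batch.name)
    return queue

# Toy view volume: only a near plane (z >= 1) and a far plane (z <= 100).
planes = [(0.0, 0.0, 1.0, -1.0), (0.0, 0.0, -1.0, 100.0)]
batches = [
    Batch("crate", (0.0, 0.0, 10.0), 2.0),
    Batch("behind", (0.0, 0.0, -5.0), 1.0),
    Batch("straddling", (0.0, 0.0, 0.5), 1.0),
]
draw_queue = build_draw_queue(batches, planes)
```

Note that the "straddling" batch stays in the queue even though part of it is off-screen. Sorting out the individual triangles is left to the later stage.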
The VIF is a slightly programmable DMA engine. Basically, its job is to copy data from main memory to the internal memory of the VU. In the process it can do a little rearranging: offsets, strides, and packing or unpacking between the source and destination buffers. The VIF also sends code to the VU1. The VU1 only has 16 KB of memory for code and another 16 KB for data. So, you have to stream in code for specific situations, kind of like switching shaders.
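As a rough picture of the "rearranging while copying" idea, here is a toy sketch in Python. It expands tightly packed 16-bit x, y, z triples into 4-component rows and writes them at a chosen offset and stride in the destination. The function name, the row-based memory model and the `fill_w` default are all assumptions for illustration; the real VIF unpack formats and registers are not modeled here.

```python
import struct

def unpack_v3_16(src, dest, dest_offset, dest_stride, fill_w=0):
    """Copy packed little-endian int16 triples into 4-wide rows of dest."""
    count = len(src) // 6  # 3 components * 2 bytes each
    for i in range(count):
        x, y, z = struct.unpack_from("<3h", src, i * 6)
        # A stride > 1 leaves gaps for other per-vertex data (UVs, colors).
        dest[dest_offset + i * dest_stride] = (x, y, z, fill_w)
    return count

# Two packed vertices in "main memory".
packed = struct.pack("<6h", 1, 2, 3, 4, 5, 6)

# Toy "VU memory": 8 rows, each row standing in for one 4-component slot.
vu_mem = [None] * 8
written = unpack_v3_16(packed, vu_mem, dest_offset=1, dest_stride=2)
```

After this call the positions sit in rows 1 and 3, with rows 2 and 4 free for a second transfer to interleave another attribute.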
The VU1 receives code and a series of data chunks from the VIF. For each data chunk, the VIF can also start the code at a data-driven address. That's how you can have multiple routines in a single code chunk. The VU1's job is to prepare data for the GIF. The VU has a limited set of 16-bit general-purpose registers and instructions. Its real power is in vector instructions that can do large volumes of math. The VU handles vertex animation, lighting, UVs (including some of the mipmap math), etc. It also handles culling individual off-screen triangles, and it must manually clip triangles into sub-triangles against the edge of the screen. Basically, it handles everything to do with geometry. It's not a 1-in-1-out vertex setup like vertex shaders. It's "blob of bytes" in, many-triangles-out.
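The "clip triangles into sub-triangles" step can be sketched as follows. This is a generic Sutherland-Hodgman clip of one triangle against one plane, written in Python for readability. It is not actual VU microcode, and `clip_triangle` and `dist` are names chosen for this example. One input triangle can come out as zero, one or two triangles, which is why the stage is not 1-in-1-out.

```python
def clip_triangle(tri, dist):
    """Clip one triangle against one plane; dist(v) >= 0 means inside."""
    out = []
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        da, db = dist(a), dist(b)
        if da >= 0:
            out.append(a)  # keep vertices on the visible side
        if (da >= 0) != (db >= 0):
            # The edge crosses the plane: add the intersection point.
            t = da / (da - db)
            out.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    # Fan-triangulate the clipped polygon (0, 3 or 4 points).
    return [(out[0], out[i], out[i + 1]) for i in range(1, len(out) - 1)]

# Toy screen edge: everything with x >= 0 is visible.
left_edge = lambda v: v[0]

fully_inside = clip_triangle(((1, 0), (2, 0), (1, 1)), left_edge)
fully_outside = clip_triangle(((-1, 0), (-2, 0), (-1, 1)), left_edge)
one_corner_in = clip_triangle(((1, 0), (-1, 1), (-1, -1)), left_edge)
two_corners_in = clip_triangle(((1, 0), (1, 1), (-1, 0)), left_edge)
```

A full clipper would repeat this against every edge of the view volume and would also interpolate colors and UVs with the same `t`.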
The GIF is another DMA engine. It reads chunks of bytes from the VU1's memory and stuffs them into the registers of the GS. Like the VIF, you point it at a command queue containing a mixture of control bytes and data bytes for it to read through and interpret.
The GS is the rasterizer. It has no instruction set. You control it entirely by setting values in registers (via the GIF). There are registers to define a texture to read from the embedded DRAM. Registers to define the framebuffer in the same DRAM. Registers to define the blending/Z/interpolator actions (vertex color modes). And a register where you stuff vertex positions. Stuff that one register 3 times (yes, overwriting the first 2 values) and the GS will rasterize a triangle into the framebuffer according to the state defined in the rest of the registers. Alternatively, there is a triangle strip mode that only requires 3 positions to get the first triangle, then 1 per triangle after that. For a given triangle, the GS can only handle a single texture, a few options for how to incorporate the vertex colors, and a limited blend mode. So, if you want to use two textures, you need to draw the same triangle twice with different configurations. Fortunately, changing GS configs can be done by setting a few registers at a cost of 1 cycle each. So, the VU can transform a dozen triangles once, have the GS rasterize them, switch GS configs, and rasterize them again. That's twice the pixel work for the GS, but it's pretty cheap for the VU.
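Here is a toy model of that register-driven behavior, in Python. It only imitates the idea: writes to one vertex register fill a small queue, a triangle is "drawn" once three positions are queued, and strip mode keeps the last two. `ToyRasterizer` and its method names are invented for this sketch and do not correspond to real GS register names or timing.

```python
class ToyRasterizer:
    def __init__(self):
        self.state = {"texture": None}  # stands in for all config registers
        self.strip = False
        self.queue = []                 # recently written vertex positions
        self.drawn = []                 # (triangle, texture) pairs

    def set_register(self, name, value):
        self.state[name] = value

    def begin_primitive(self, strip):
        # Assumption of this model: starting a primitive empties the queue.
        self.strip = strip
        self.queue.clear()

    def write_vertex(self, xyz):
        self.queue.append(xyz)
        if len(self.queue) == 3:
            self.drawn.append((tuple(self.queue), self.state["texture"]))
            if self.strip:
                self.queue.pop(0)       # keep 2, next write makes a triangle
            else:
                self.queue.clear()      # independent triangles

verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 0, 0), (2, 1, 0)]

# Two passes over the same already-transformed vertices, one texture each.
gs = ToyRasterizer()
for texture in ("base", "lightmap"):
    gs.set_register("texture", texture)
    gs.begin_primitive(strip=False)
    for v in verts:
        gs.write_vertex(v)

# Strip mode: 5 positions give 3 triangles.
strip_gs = ToyRasterizer()
strip_gs.begin_primitive(strip=True)
for v in verts[:5]:
    strip_gs.write_vertex(v)
```

The two-pass loop is the multi-texture trick from the paragraph above: the vertex list is built once, and only the cheap register write changes between passes.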
So, that's a long, fun brain-dump just to say that the CPU isn't really burdened with calculating polygons. The VU1 is.
Your main intuition is correct though. The VU can and should do a whole lot of fine-grained culling before sending triangles to the rasterizer. That can explain a lot of the difference in polygon culling measured in the emulators.