I think I forgot one gate when I said that. So it will be even a bit slower than a ripple carry adder. I still think it will be slightly smaller though.
Do the div-3 as done by hand, and optimize as if your life depended on it.
Don't do multiplication by the inverse.
We want: Y=X/3
Y[n] = bit n of Y, Y[0] is lsb, Y[N] is msb. (N>0)
A[N] = B[N] = 0
Y[n] = (X[n] & A[n]) | B[n];
A[n-1] = X[n] ^ Y[n];
B[n-1] = (X[n] & Y[n]) ^ A[n];
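For anyone who wants to check the recurrence before building it in gates, here is a minimal Python sketch that runs it from the msb down and compares against ordinary integer division. The function name `div3_bitwise`, the `n_bits` parameter (equal to N+1 in the notation above) and the returned remainder are my own additions. Reading the final A/B pair as remainder = A + 2*B is something I got from tracing the states by hand, not something stated above.

```python
def div3_bitwise(x, n_bits):
    """Divide x by 3 with the A/B recurrence, processing bits msb first.

    Returns (quotient, remainder). n_bits is the number of bits in x.
    """
    a = 0  # A[N] = 0
    b = 0  # B[N] = 0
    y = 0
    for n in range(n_bits - 1, -1, -1):
        xn = (x >> n) & 1
        a_prev = a                 # A[n], needed for B[n-1] below
        yn = (xn & a) | b          # Y[n]   = (X[n] & A[n]) | B[n]
        a = xn ^ yn                # A[n-1] = X[n] ^ Y[n]
        b = (xn & yn) ^ a_prev     # B[n-1] = (X[n] & Y[n]) ^ A[n]
        y |= yn << n
    # Assumption from tracing the states: (A, B) = (1, 0) means a
    # remainder of 1 and (0, 1) means a remainder of 2.
    return y, a + 2 * b


if __name__ == "__main__":
    for x in range(1 << 10):
        assert div3_bitwise(x, 10) == (x // 3, x % 3)
    print("recurrence matches x // 3 for all 10-bit inputs")
```

Each output bit depends on the A/B pair from the bit above it, so the chain ripples from msb to lsb much like a carry chain, just in the opposite direction.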
Still, division by two is a lot simpler.
I agree with your second paragraph. And to put it in another way:
As long as all data that goes over an interface is packed in chunks that line up evenly with the interface sizes, you won't get problems from half-chopped chunks. (Ehm, seems kinda obvious when put that way.)
All data transferred over the interface should match the interface size.
The data should match all interfaces it travels over.
But you get less flexibility in how to use the interface with swizzles and things like that if the prime factorization of the bus width contains many different primes.
Or in other words, 2^x is still best.
If you use indexed vertices (like glDrawElements() under OpenGL) you no longer access the vertex data in a sequential fashion, and cache size, efficiency and data alignment become important.
Yes, and that's a reason why raw vertex data should be of 2^x size. But even if the data for vertices in the vertex sequence isn't placed at consecutive memory addresses, the vertices are still read in sequence, one after another. Vertices are too big to be read a bunch per clock, like pixels/texels. And that's the important reason why it works well with a non-2^x number of VS units.
If you don't use indexed vertices you are streaming data and as such need a FIFO rather than a cache for efficient operation, and data alignment won't really matter.
Yes, unless the hardware actually still reads one vertex at a time. (Only optimized for indexed arrays.) I'll refer to it as a cache even if the rules for it make it a FIFO. With non-2^x size vertices, you might get problems when the VS unit wants to read a vector from the cache. It could be misaligned, and need two clocks to be read.
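To put rough numbers on the misalignment point, here is a small Python sketch that counts how many vector reads straddle a cache line boundary when vertices are packed back to back. All the sizes (16-byte vectors, 16-byte lines, 24- versus 32-byte vertices) and both function names are made up for illustration, not taken from any real hardware.

```python
def reads_needed(start, size, line_width):
    """Number of cache lines touched by a read of `size` bytes at `start`."""
    first_line = start // line_width
    last_line = (start + size - 1) // line_width
    return last_line - first_line + 1


def count_split_reads(vertex_size, vector_size, line_width, num_vertices):
    """Count vertices whose first vector straddles a line boundary.

    Assumes vertices are stored contiguously and the vector of interest
    sits at offset 0 of each vertex (a simplification for this sketch).
    """
    return sum(
        1
        for v in range(num_vertices)
        if reads_needed(v * vertex_size, vector_size, line_width) > 1
    )


if __name__ == "__main__":
    # Hypothetical sizes: 16-byte vectors, 16-byte lines, 8 vertices.
    print(count_split_reads(32, 16, 16, 8))  # 2^x vertex size
    print(count_split_reads(24, 16, 16, 8))  # non-2^x vertex size
```

With these made-up sizes, the 32-byte vertices never split a read, while every second 24-byte vertex starts in the middle of a line and would need two reads.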