Why always 2^x in the computer industry, specifically GPUs?

pcchen said:
Considering you have a three-pixel-pipeline design, the only viable fixed pattern is 3x1 (or 1x3) blocks... You need a division operation for this.
Division by 3 isn't too bad (obviously worse than division by 2 though)
 
The Mathemagician once showed me a circuit he had for doing division by three. It was astonishingly simple.
 
Dividing an integer/address by a fixed power of 2 takes zero transistors in standard binary logic. A bit hard to beat, and very frequently used to save transistors all over the place (especially for memory/cache addressing).
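A tiny C sketch of what "zero transistors" means in practice (the function names are made up): dividing by a fixed power of 2 is just selecting the high bits, with the remainder sitting in the low bits, so in hardware it's pure wiring:

#include <stdint.h>

/* Quotient and remainder for division by 32 = 2^5: no arithmetic,
   just bit selection. */
uint32_t block_of(uint32_t addr)  { return addr >> 5; }   /* addr / 32 */
uint32_t offset_of(uint32_t addr) { return addr & 31; }   /* addr % 32 */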

Dividing a number x by 3 can be done in hardware as follows:
P1 = x/4 + x/16 (about 4 bits precision)
P2 = P1 + P1/16 (~8 bits precision)
P3 = P2 + P2/256 (~16 bits precision)
P4 = P3 + P3/65536 (~32 bits precision)
and so on, until you reach the desired precision. Takes 2 to 5 adders in hardware, depending on desired precision.
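As a sanity check, here's roughly what that chain looks like in C (a sketch only; div3_approx is a made-up name, and x is assumed to be 32-bit unsigned). Because the shifts truncate, the result slightly underestimates x/3 and needs a small final correction for an exact integer quotient:

#include <stdint.h>

uint32_t div3_approx(uint32_t x)
{
    uint64_t p = ((uint64_t)x >> 2) + ((uint64_t)x >> 4);  /* P1 = x/4 + x/16 */
    p += p >> 4;                                           /* P2, ~8 bits     */
    p += p >> 8;                                           /* P3, ~16 bits    */
    p += p >> 16;                                          /* P4, ~32 bits    */
    uint64_t r = x - 3 * p;      /* p <= x/3, so r is small and non-negative */
    while (r >= 3) { p++; r -= 3; }                        /* exact fixup     */
    return (uint32_t)p;
}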

Similar tricks are generally available in hardware for dividing by numbers of the form 2^n + 1 or 2^n - 1 (and, of course, such numbers multiplied by powers of 2); dividing by other constant numbers generally requires at least a full multiplier.
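For instance, 5 divides via 15 = 2^4 - 1: since 1/5 = 3/15, a multiply by 3 (one adder) followed by the same refinement chain does it. A hedged sketch along the same lines as above:

#include <stdint.h>

uint32_t div5_approx(uint32_t x)
{
    uint64_t p = (3 * (uint64_t)x) >> 4;   /* 3x/16 ~= x/5, ~4 bits  */
    p += p >> 4;                           /* *(1 + 1/16),  ~8 bits  */
    p += p >> 8;                           /* *(1 + 1/256), ~16 bits */
    p += p >> 16;                          /* ~32 bits               */
    uint64_t r = x - 5 * p;                /* p never overshoots x/5 */
    while (r >= 5) { p++; r -= 5; }        /* exact fixup            */
    return (uint32_t)p;
}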

Another power-of-2 issue is that for texturing, using a texture map whose dimensions are not a power of 2 means that you cannot set up a mipmap pyramid for mipmapping/trilinear interpolation.
 
There's no reason why you can't set up a mipmap pyramid. You just need a method that drops the fractional parts appropriately. It's a lot messier.
 
Which leaves the question of what such a method might look like. If the base mipmap level is e.g. a 5x13 texture map, what is then an appropriate size for the next mipmap level? Do you pad out the 5x13 to, say, 6x14, then generate a 3x7 mipmap level from the padded-out version, and in a similar way proceed with 2x4, 1x2 and 1x1 levels? And if so, what data do you use for the padding?

In any case, addressing a mipmapped non-power-of-2 texture sounds ... incredibly hairy.
 
Pretty much. The usual formula for generating the next mip level down (which works for both power-of-2 and non-power-of-2 sizes) is s = (s+1)>>1. This is an iterative process - I'm sure the experts could fix it up so it wasn't.
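In C, for the 5x13 example from above, that iteration looks like this (and prints 5x13, 3x7, 2x4, 1x2, 1x1):

#include <stdio.h>

int main(void)
{
    unsigned w = 5, h = 13;
    for (;;) {
        printf("%ux%u\n", w, h);
        if (w == 1 && h == 1)
            break;
        w = (w + 1) >> 1;   /* s = (s+1)>>1: halve, rounding up */
        h = (h + 1) >> 1;
    }
    return 0;
}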

I think 'stick to powers of 2' is still a good idea, myself.
 
If you don't mind ripple-carry-like behaviour, it's possible to do a div-3 unit that is smaller than an adder. But that's not the only problem. Even if the silicon gods gave you a 0-gate, 0-time division unit, there would still be reasons not to use 3 pixel pipelines.
The fact that memory buses are 2^x sized means that a 3x1 pixel block would straddle the memory block boundaries every now and then, making the logic more complex. And then there's the argument that a 2x2 pixel block is a very nice unit to calculate in parallel (mainly for miplevel calculation).
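To illustrate the miplevel argument: with a 2x2 quad, the screen-space derivatives needed for mip selection fall out of simple differences between neighbouring pixels. A rough sketch of the textbook LOD formula (not any particular chip's method; u and v are assumed to be texel-space coordinates per quad pixel):

#include <math.h>

/* u[y][x], v[y][x]: texel-space coordinates of the four quad pixels. */
float mip_level(const float u[2][2], const float v[2][2])
{
    float dudx = u[0][1] - u[0][0], dvdx = v[0][1] - v[0][0];
    float dudy = u[1][0] - u[0][0], dvdy = v[1][0] - v[0][0];
    float rho = fmaxf(sqrtf(dudx * dudx + dvdx * dvdx),
                      sqrtf(dudy * dudy + dvdy * dvdy));
    return log2f(rho);   /* clamp to the available mip range in practice */
}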


Having textures and mipmaps in 2^x sizes makes calculations much easier. But it's actually possible to have other sizes. (Note: we lose a lot of efficiency here, so hardware generally doesn't like this.) In theory, each mipmap level doesn't have to be exactly half the size of the previous one. I have even seen (software) experiments that had a ~sqrt(2) length ratio between mipmap levels. (It costs a lot of memory, but can give better quality.) Hmmm, that's actually possible to do with pixel shaders (at a high cost). :)
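(Back-of-envelope on the memory cost: with the standard 2x ratio each level has 1/4 the area of the previous one, so the whole chain costs 1 + 1/4 + 1/16 + ... = 4/3 of the base texture. With a ~sqrt(2) ratio each level has 1/2 the area, so the chain costs 1 + 1/2 + 1/4 + ... = 2x the base, i.e. 50% more memory than a normal mip chain.)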


But there are things that work fine even in non-2^x sizes. These are typically things that work in series in some way.

The number of instructions in a PS/VS program.

The number of VS units.
While the VS units work in parallel, they do work on a 1-dimensional sequence of vertices. The dataset for each vertex is large enough, and stored in such a way, that it's probably best to read all vertex data for one vertex before starting on the next. So the memory accesses for the different VS units are done in sequence. There's still a 2^x rule here though: the data for each vertex should be 2^x sized for optimal performance (so that cache lines don't straddle vertex data borders).

The number of TMUs.
Again, units that work in parallel. But there's a serial part to it. The different TMUs load their respective caches in series: the first TMU loads a bunch (2^x) of texels, then the second TMU, and so on. It's pretty much forced to do so with a UMA, since you could get a lot of page misses otherwise. And when the main memory interface is "serialized" between the TMUs, the main reason for a 2^x number of TMUs is gone.
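As an illustration of the vertex-size point above, a hypothetical vertex layout whose stride is a power of two (32 bytes here, assuming 4-byte floats), so cache lines never straddle vertex boundaries:

typedef struct {
    float pos[3];      /* 12 bytes */
    float normal[3];   /* 12 bytes */
    float uv[2];       /*  8 bytes -> 32-byte stride, a power of two */
} Vertex;

/* A 24-byte vertex (say, position + normal only) would be padded to 32. */
_Static_assert(sizeof(Vertex) == 32, "keep the stride a power of two");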
 
Basic said:
If you don't mind ripple-carry-like behaviour, it's possible to do a div-3 unit that is smaller than an adder.
How? Just curious.
The fact that memory buses are 2^x sized means that a 3x1 pixel block would straddle the memory block boundaries every now and then, making the logic more complex. And then there's the argument that a 2x2 pixel block is a very nice unit to calculate in parallel (mainly for miplevel calculation).
A 3x1 or 3x2 pixel block architecture should, at least in principle, match up rather nicely with a 96- or 192-bit memory bus for framebuffer accesses. Although addressing of non-framebuffer data gets a bit hard in such a setup (except possibly 24-bit textures).
The number of VS units.
While the VS units work in parallel, they do work on a 1-dimensional sequence of vertices. The dataset for each vertex is large enough, and stored in such a way, that it's probably best to read all vertex data for one vertex before starting on the next. So the memory accesses for the different VS units are done in sequence. There's still a 2^x rule here though: the data for each vertex should be 2^x sized for optimal performance (so that cache lines don't straddle vertex data borders).
If you use indexed vertices (like glDrawElements() under OpenGL) you no longer access the vertex data in a sequential fashion, and cache size, efficiency and data alignment become important. If you don't use indexed vertices you are streaming data and as such need a FIFO rather than a cache for efficient operation, and data alignment won't really matter.
 
I think I forgot one gate when I said that. So it will actually be even a bit slower than a ripple-carry adder. I still think it will be slightly smaller, though.

Do the div-3 as done by hand, and optimize as if your life depended on it. :) Don't do multiplication by the inverse.

We want: Y = X/3
Y[n] = bit n of Y; Y[0] is the lsb, Y[N] the msb (N > 0). A[n] and B[n] carry the running remainder (r = A + 2B) down from the higher-order bits.
A[N] = B[N] = 0

Y[n] = (X[n] & A[n]) | B[n];
A[n-1] = X[n] ^ Y[n];
B[n-1] = (X[n] & Y[n]) ^ A[n];

Still, division by two is a lot simpler. :)
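For the curious, the equations can be checked in software. A C sketch that walks the bits MSB-first, with A and B ripple-carrying the running remainder (r = A + 2B) down the chain:

#include <stdint.h>

uint32_t div3_serial(uint32_t x)
{
    uint32_t y = 0;
    unsigned a = 0, b = 0;                 /* A[N] = B[N] = 0 */
    for (int n = 31; n >= 0; n--) {
        unsigned xn = (x >> n) & 1;
        unsigned yn = (xn & a) | b;        /* Y[n]   */
        unsigned an = xn ^ yn;             /* A[n-1] */
        unsigned bn = (xn & yn) ^ a;       /* B[n-1] */
        y |= (uint32_t)yn << n;
        a = an;
        b = bn;
    }
    return y;                              /* equals x / 3 */
}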


I agree with your second paragraph. And to put it another way:
As long as all data that goes over an interface is packed in chunks that divide evenly into the interface size, you won't get problems from chopped-up chunks. (Ehm, seemed kinda obvious when put that way. :))
All data transferred over the interface should match the interface size.
The data should match all interfaces it travels over.
But you get less flexibility in how to use the interface, with swizzles and things like that, if the prime factorization of the bus width contains many different primes. :D Or in other words, 2^x is still best. ;)


If you use indexed vertices (like glDrawElements() under OpenGL) you no longer access the vertex data in a sequential fashion, and cache size, efficiency and data alignment become important.
Yes, and that's a reason why raw vertex data should be of 2^x size. But even if the data for the vertices in the vertex sequence isn't placed at consecutive memory addresses, the vertices are still read in sequence, one after another. Vertices are too big to be read a bunch per clock, the way pixels/texels are. And that's the important reason why it works well with a non-2^x number of VS units.

If you don't use indexed vertices you are streaming data and as such need a FIFO rather than a cache for efficient operation, and data alignment won't really matter.

Yes, unless the hardware actually still reads one vertex at a time. (Only optimized for indexed arrays.) I'll refer to it as a cache even if the rules for it make it a FIFO. With non-2^x-sized vertices, you might get problems when the VS unit wants to read a vector from the cache: it could be misaligned, and need two clocks to be read.
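A back-of-envelope C illustration of that (hypothetical numbers: 24-byte vertices, 32-byte cache lines, 16-byte vector reads):

#include <stdio.h>

int main(void)
{
    for (unsigned v = 0; v < 4; v++) {
        unsigned start = v * 24;             /* vertex base offset   */
        unsigned first = start / 32;         /* cache line of byte 0 */
        unsigned last  = (start + 15) / 32;  /* line of byte 15      */
        printf("vertex %u: %s\n", v,
               first == last ? "1 clock" : "2 clocks (straddles a line)");
    }
    return 0;
}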
 