http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71
Finally "scalar" ALUs! Looks good on paper so far
Finally "scalar" ALUs! Looks good on paper so far
Yes, GCN does attribute interpolation with ALUs (the compiler puts those instructions at the beginning of the shader). GCN also uses lots of ALUs for other seemingly innocent tasks, such as sampling a cube map (normalize the UV vector, cubeface, cubecoord). Double rate FP16 would help a lot in games; not all math needs to be full precision (to have any effect on the resulting image quality). I also like double rate 16 bit integer math (and native support for splitting a 32 bit register between two 16 bit variables). Modern rendering engines do a lot of integer math (in compute shaders).

Each of the four quad lanes in an execution engine has an FMA plus a separate ADD, so in terms of flop numbers, you should get that:
G71 at 1000 MHz should be able to do 32 * 12 * (2+1) * 1000 MHz = 1152 GFlops.
While this does fall a bit short of XB1/PS4, it should generally be a bit easier to get close to the theoretical numbers with smaller wavefronts. It's also been my understanding that the GCN architecture used in these consoles has to do attribute/varying interpolation in the shader cores, which further reduces the amount of GFlops actually available; G71 keeps dedicated varying units, so that this kind of interpolation doesn't take away from the GFlops number. Also, G71 has full FP16 support, using 2-component vectors in each 32-bit lane; using FP16 instead of FP32 will thus give you twice the paper GFlops.
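To make that arithmetic explicit, here's a minimal C sketch of the paper-GFlops estimate, assuming the configuration described in this thread (32 cores, 3 execution engines of 4 lanes each, an FMA plus a separate ADD per lane, 1000 MHz); the 2x FP16 figure simply reflects packing two 16-bit values into each 32-bit lane:

#include <stdio.h>

int main(void) {
    /* Assumed Mali-G71 MP32 configuration, as described in the comments above. */
    const double cores          = 32.0;    /* shader cores                      */
    const double lanes_per_core = 3 * 4;   /* 3 execution engines, 4-wide quads */
    const double flops_per_lane = 2 + 1;   /* FMA counts as 2 flops, ADD as 1   */
    const double clock_hz       = 1000e6;  /* 1000 MHz                          */

    double fp32_gflops = cores * lanes_per_core * flops_per_lane * clock_hz / 1e9;
    double fp16_gflops = 2.0 * fp32_gflops; /* two FP16 values per 32-bit lane  */

    printf("FP32: %.0f GFlops\n", fp32_gflops); /* prints 1152 */
    printf("FP16: %.0f GFlops\n", fp16_gflops); /* prints 2304 */
    return 0;
}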
Also, for heterogeneous compute, G71 now has full coherency, so that CPU and GPU can work on the same data without intervening cache maintenance operations; whether this is useful for anything that a console would normally do ... time will tell, I suppose.
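Roughly, the difference full coherency makes looks like the sketch below; gpu_dispatch, gpu_wait, cache_flush and cache_invalidate are hypothetical stand-ins for a driver API (stubbed so the sketch compiles), not real Mali driver calls:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for a driver API, stubbed so this compiles. */
static void gpu_dispatch(uint32_t *buf, size_t n) { for (size_t i = 0; i < n; ++i) buf[i] *= 2; }
static void gpu_wait(void) {}
static void cache_flush(const void *buf, size_t bytes) { (void)buf; (void)bytes; }
static void cache_invalidate(const void *buf, size_t bytes) { (void)buf; (void)bytes; }

/* Without coherency: caches must be maintained around every CPU/GPU hand-off. */
static void process_noncoherent(uint32_t *data, size_t n) {
    cache_flush(data, n * sizeof *data);       /* make CPU writes visible to the GPU */
    gpu_dispatch(data, n);
    gpu_wait();
    cache_invalidate(data, n * sizeof *data);  /* make GPU writes visible to the CPU */
    data[0] += 1;
}

/* With full coherency (what G71 advertises): both sides just use the buffer. */
static void process_coherent(uint32_t *data, size_t n) {
    gpu_dispatch(data, n);
    gpu_wait();
    data[0] += 1;
}

int main(void) {
    uint32_t a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
    process_noncoherent(a, 4);
    process_coherent(b, 4);
    printf("%u %u\n", (unsigned)a[0], (unsigned)b[0]); /* both paths print 3 3 */
    return 0;
}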
With the amount of spatial coherence typically in graphics, a SIMD4 seems like a strange choice. Imagine the amount of control logic required in a 32 core GPU! I don't think packing quads into a wavefront is a problem - only the last wavefront would be partially empty.
Keep in mind that it's a tile-based GPU; it wouldn't be able to pack quads from different tiles into a wavefront.
Is that really different to "traditional" renderers? I'm not convinced these can export to multiple RBEs from the same wave.
I don't know if that's the case. But even if the quads in a wavefront need to go to the same RBE, that's not the same as going to the same tile.
Tiles are 32x32 pixels these days. There should be plenty of opportunity to do 4 quads at a time.

So 256 quads without overdraw, or at least 64 hypothetical wavefronts of 4 quads. With each object added to the tile that gets more fragmented, I wouldn't be surprised to see utilisation well below 90% in average cases. Is that worth it? I don't know.
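Just to spell out that quad math as a small C sketch (the 32x32 tile size and the 4-quads-per-wavefront packing are the assumptions from the comments above):

#include <stdio.h>

int main(void) {
    const int tile_w = 32, tile_h = 32;  /* assumed 32x32 pixel tile            */
    const int pixels_per_quad = 4;       /* a quad is a 2x2 block of pixels     */
    const int quads_per_wave  = 4;       /* 4 quads packed into one wavefront   */

    int pixels = tile_w * tile_h;            /* 1024 */
    int quads  = pixels / pixels_per_quad;   /*  256 */
    int waves  = quads  / quads_per_wave;    /*   64 */

    printf("%d pixels -> %d quads -> %d full wavefronts per tile (no overdraw)\n",
           pixels, quads, waves);
    return 0;
}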
Well, normalizing on the basis of clock speed and number of GPU units does not say much. Power and die size are better indications of design efficiency.
ARM's new IPs are impressive on both the CPU side and the GPU side, and the A35 is not even in use yet.
I wonder when low- and mid-end SoCs will really move forward; it seems market segmentation is the only thing holding back their evolution at this point (and, I guess, some twisted customer perceptions wrt core count too).
How high are the chances exactly that 8 clusters clocked at 900MHz will consume less power than 6 clusters at 630+MHz (with the amount of TMUs per cluster being equal this time)?

It is a logical fallacy; I can't answer that, because the clusters are not the same. Now, the clock speed is likely to affect performance efficiency.
If power consumption were low enough, they would eventually clock even higher.
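For what it's worth, the reason cluster count and clock alone can't settle this is visible in the usual dynamic-power rule of thumb, P proportional to clusters * f * V^2. A toy C sketch, with made-up voltages (not measured figures for either GPU, and ignoring that the cluster designs differ):

#include <stdio.h>

/* Toy model: dynamic power is roughly proportional to clusters * f * V^2.
 * It ignores leakage and the fact that the cluster designs are different. */
static double rel_power(int clusters, double freq_mhz, double volts) {
    return clusters * freq_mhz * volts * volts;
}

int main(void) {
    /* Voltages below are invented for illustration only. */
    double p8_high_v = rel_power(8, 900.0, 0.95);
    double p8_low_v  = rel_power(8, 900.0, 0.70);
    double p6        = rel_power(6, 630.0, 0.80);

    printf("8 clusters @ 900 MHz, 0.95 V: %.0f\n", p8_high_v); /* ~6498 */
    printf("8 clusters @ 900 MHz, 0.70 V: %.0f\n", p8_low_v);  /* ~3528 */
    printf("6 clusters @ 630 MHz, 0.80 V: %.0f\n", p6);        /* ~2419 */
    /* The wider, faster point costs more in this toy model either way, but how
     * much more depends entirely on the voltage each design needs at that clock. */
    return 0;
}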
Your original post is extremely negative wrt the giant strides ARM's IPs are making, especially in the GPU realm. Furthermore, there are no proper tests yet of devices running that SoC.

So where's the "giant stride" exactly? I've commented on Huawei's claims, and I severely doubt that they've made any of those numbers up or would go through any effort to show their own products in a negative light.
If it does not beat Apple's custom GPU efforts, as well as their massively optimized software stack, it will be a surprise to nobody. Yet it is one hell of a jump from the previous arch.
What is that based on? (Referring to the claim that the amount of TMUs per cluster is equal this time.)