Good article on emulating the gamecube gpu

Davros · Mar 17, 2014

https://dolphin-emu.org/blog/2014/03/15/pixel-processing-problems/

swaaye · Mar 17, 2014

So I guess this means that Direct3D 8.1 integer features aren't adequate for Hollywood/Flipper emulation?

Davros · Mar 17, 2014

Well they claim integer features wernt added until directx 10

Zeross · Mar 17, 2014

They're right, Dx 8 GPU used fixed-point (FX9 for nvidia and FX16 for ATI)

pjbliverpool · Mar 17, 2014

Cheers, just downloaded the latest version and it runs much better. I'm running at maximum image quality settings with full speed. Image quality at these settings is just crazy insane.

Davros · Mar 18, 2014

sort of strange but youd think integers would be supported before floating point.

MJP · Mar 18, 2014

Davros said:
sort of strange but youd think integers would be supported before floating point.

Not really, considering the typical workload of a GPU (particularly one from the era of early programmable shaders). Shaders primarily deal with things like positions, directions, and colors, and for those things you want floating point. Even on the latest Nvidia/AMD hardware the ALU's are heavily geared towards floating-point ops, and integer ops are generally 2-4x slower than their floating point counterparts.

swaaye · Mar 18, 2014

Yeah I suppose if they want to accurately emulate TEV, this is what they need to do.

Davros · Mar 18, 2014

@mjp
I mean from the even earlier days, colors aka textures were certainly integer, I would assume lights would of been integer ect

fellix · Mar 19, 2014

In those days, an FP pipeline (from end to end) was too much of a cost. The manufacturing process and die sizes didn't allow for such complete implementation and that's why the pixel portion of the pipeline kept the cheaper FX or INT formats for longer. Vertex and geometry logic were all FP for obvious reasons. Even after the first generation of D3D9 hardware, the FP pixel pipe was laced with all sorts of cheap hacks to keep the logic size in reasonable dimensions, like partial precision, non-IEEE numerical op's and so on.

DavidGraham · Mar 19, 2014

MJP said:
Even on the latest Nvidia/AMD hardware the ALU's are heavily geared towards floating-point ops, and integer ops are generally 2-4x slower than their floating point counterparts.

You mean only Int32 ops?
Shouldn't the FP32 units be able to process Int24 with the same speed?
I remember the old special function unit in AMD's VLIW5 was capable of doing both FB32 and Int32, it had a width of 40-bits.

mczak · Mar 21, 2014

swaaye said:
So I guess this means that Direct3D 8.1 integer features aren't adequate for Hollywood/Flipper emulation?

Well the d3d 8.1 "integers" (really fixed point) were not very well defined (different range, different number of fractional bits etc.). Thus if you rely on integer add wrapping around for instance as is indicated by that article on Dolphin, it won't work on other hardware.
IIRC those graphic chips were quite close to how r200 radeons work, it is possible r200 could actually emulate it well. But neither ps 1.4 nor OGL actually gave you full control over hw capabilities, though they were close. One such issue I remember is that the hw had saturate control, you could either clamp to 0/1, clamp to -8/8 (the hw range) or just let it wrap around - but the -8/8 clamp wasn't exposed (and worse the ATI_fragment_shader spec didn't mention if this should always be enabled or disabled...). But any small hardware difference could mean you can't emulate it on r200 anyway (say number of fixed point bits is different - well you possibly could account for things like that but you might run out of instructions pretty soon if you need more instructions for emulating it perfectly...). It is also quite possible flipper/hollywood exceed r200 capabilities in some areas too. Or I might be completely wrong it operates like r200 anyway

.

mczak · Mar 21, 2014

DavidGraham said:
You mean only Int32 ops?
Shouldn't the FP32 units be able to process Int24 with the same speed?

Generally yes but that relies on both the gpu actually exposing this capability (some minimal logic is still required for being able to treat Int24-in-a-32bit value as a float32), and the high level API you're using supporting it (dx10 does not). The driver shader compiler _could_ theoretically use some tricks to figuring out if your numbers can't be larger in some cases, but overall I doubt it's going to do much.
I think the 4 times hit with ints was mostly for muls (though I'd guess this is exactly what you're going to need, the small shaders are probably mostly consisting of MADs). In any case since these chips should have had a much lower ALU:anythingelse ratio than todays gpus I'm not surprised it doesn't seem to make much of a difference in performance either way.

swaaye · Mar 21, 2014

mczak said:
But any small hardware difference could mean you can't emulate it on r200 anyway (say number of fixed point bits is different - well you possibly could account for things like that but you might run out of instructions pretty soon if you need more instructions for emulating it perfectly...). It is also quite possible flipper/hollywood exceed r200 capabilities in some areas too. Or I might be completely wrong it operates like r200 anyway .

Yeah the article certainly focuses on how they want accurate emulation instead of trying to work with any more approximations. D3D8 could have been used before so it must be quite inadequate for their needs.

mczak · Mar 22, 2014

swaaye said:
Yeah the article certainly focuses on how they want accurate emulation instead of trying to work with any more approximations. D3D8 could have been used before so it must be quite inadequate for their needs.

Well any kind of d3d8 shader is simply going to use ordinary floats on modern hardware, so there's no way you can really make this work in general.

DavidGraham · Apr 11, 2014

mczak said:
Generally yes but that relies on both the gpu actually exposing this capability (some minimal logic is still required for being able to treat Int24-in-a-32bit value as a float32), and the high level API you're using supporting it (dx10 does not). The driver shader compiler _could_ theoretically use some tricks to figuring out if your numbers can't be larger in some cases, but overall I doubt it's going to do much.

Thanks.

Is this the reason why FPU SIMDs in the CPU (say 128-bit width) don't bother handle both floats and integers at the same time?

mczak · Apr 14, 2014

DavidGraham said:
Thanks.

Is this the reason why FPU SIMDs in the CPU (say 128-bit width) don't bother handle both floats and integers at the same time?

I don't really see the connection there between cpu design and d3d implementation needs but here it goes.
int and fp simd instructions require the same execution ports so you cannot issue such instructions in parallel (well you can mix them, one int simd op and one fp simd op execute just fine in parallel on a 2 pipe design like jaguar for instance, just not 2 int simd ops and 2 fp simd ops in parallel). This has a lot more to do with simplifying scheduling and sharing resources though (same 128bit data paths) rather than anything else, ultimately the pure alu bits is just a small portion. And I guess such mixed workloads aren't really all that common. Note that in fact pre-Kaveri (Steamroller) Bulldozer family indeed do not really follow that pattern, since fp ops mostly execute in pipe 0/1 whereas int simd ops execute on pipe 2/3 so for getting peak performance out of the simd cluster both fp and int ops are needed. This design was arguably using too much die space for little benefit hence Steamroller reducing the number of simd pipes from 4 to 3 with very little impact in real world code (some actually got a bit faster, some a bit slower as far as I can tell).

DavidGraham · Apr 14, 2014

Isn't that a little extreme? forgive my lack of proper knowledge on this subject, but aren't CPUs have separate Integer and FP units(especially intel's) why share the same execution port? just to simplify scheduling?

And my original question is about why can't CPUs exploit the FPU unit at it's disposal to maximize integer throughput? by running integer code on the FPU unit when fp code is not needed.

3dilettante · Apr 14, 2014

DavidGraham said:
Isn't that a little extreme? forgive my lack of proper knowledge on this subject, but aren't CPUs have separate Integer and FP units(especially intel's) why share the same execution port? just to simplify scheduling?

Adding execution ports increases complexity for scheduling and instruction issue. The scheduling logic needs to be able to send out as many operations per clock simultaneously as there are ports.
A unified scheduler has a bigger set of operations it needs to scan through, and more places to put them.
A non-unified scheduler would tend to split things right where they are now.
The gains are also limited by the ability of the core to provide operations (decode/rename) and absorb them (bypass/retire).
Adding ports adds expense and can compromise the OoO engine's clock ceiling, and without expanding the pipeline sections before and after--some of which can be expensive to scale and require even more elements of the pipeline further away to scale up--not much would come of it.

And my original question is about why can't CPUs exploit the FPU unit at it's disposal to maximize integer throughput? by running integer code on the FPU unit when fp code is not needed.

The FP unit's pipeline, latencies, and exception behavior are different from what the INT units need. It does simplify scheduling if the scheduler doesn't need to mix the requirements of the two.
Using the FP unit as an extra integer unit also means running against the separation of the register files, instruction issue, and bypass networks.
There are latency penalties when moving data between domains, but they exist because FP and INT units generally don't need to send results to each other that much.
The FP and INT units for these cores also don't have a direct means of reading from the register files for each type without an explicit operation moving the data from one to the other.

Accepting the penalty for crossing means each side has its own register file and result forwarding.
That's a bonus in terms of giving more registers without expanding register identifiers, and the register files and bypass networks can grow more independently. Two full-width domains can do much more for their own data types, while combining them can mean less performance in aggregate because a combined scheme puts pressure on expanding resources that can scale quadratically in cost.

DavidGraham · Apr 15, 2014

Thanks mczak and 3dilettante for the great answers, that was immensely informative.

Good article on emulating the gamecube gpu

Davros

swaaye

Entirely Suboptimal

Davros

Zeross

pjbliverpool

B3D Scallywag

Davros

MJP

swaaye

Entirely Suboptimal

Davros

fellix

DavidGraham

mczak

mczak

swaaye

Entirely Suboptimal

mczak

DavidGraham

mczak

DavidGraham

3dilettante

DavidGraham

Similar threads