Only as long as your algorithm has absolutely no dependency on memory performance. If there is any dependence, of course it's a matter of how much marginal improvement there is to be had in vectorising the data accesses.
Yes, you can always wait a year or two for NVidia's performance to catch up; in the meantime you've got a cosy coding environment and all the other niceties of NVidia's solution. And with CUDA specifically (in theory, anyway), NVidia is going places that AMD won't be bothered about for a couple of years - and there's a decent chance those things will improve performance due to better algorithms, so your loss is likely lower if your problem is at all complex.
Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms (that's partly because I don't know whether AMD has done lots of compute-specific tweaks - there are only hints, and D3D11/OpenCL leave plenty of room). Gotta wait and see.
Why should it? With the addition of atomics, GPUs became as general as needed IMO. It's just about performance now: better caching, lower-cost fencing, lower-cost atomics. The L2 should help a lot in that regard ... perhaps when constructing bounding volume hierarchies? Or building the data structure for an irregular Z-buffer?
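To make that concrete, here's a minimal sketch (mine, not from the thread) of the kind of irregular build step that leans on atomics: counting how many primitives land in each node of a spatial structure with atomicAdd before allocating storage. The binning function and sizes are purely illustrative.

```cuda
// Hypothetical illustration: the contended read-modify-write on node_counts is
// exactly the operation that cheaper atomics and a shared L2 should speed up.
__global__ void count_per_node(const float3 *centroids, int n_prims,
                               int *node_counts, int n_nodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_prims) return;

    // Illustrative binning: map a primitive's centroid to some node index.
    int node = (int)(centroids[i].x * n_nodes) % n_nodes;
    if (node < 0) node += n_nodes;

    atomicAdd(&node_counts[node], 1);   // many threads hit the same counter
}
```

A prefix sum over node_counts then gives each node its slot in the final structure; the cheaper the atomics and fences, the cheaper every pass of a build like this gets.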
There's always a trade-off amongst capacity, granularity, bandwidth and latency.
X - 2 3 7
Y - 3 4 8
Z - 1 5 9
W - 1 2 6
Larrabee's cache architecture is a bit murky at the moment. I expect GF100's cache is inclusive for texels at least. In ATI, the L2 and L1 dedicated to texels are inclusive. Indeed, texels will appear in multiple L1s in normal rendering.
And I can't help thinking that R900 is an overhaul.
It appears that L2 is read-write, shared by texels/fetches and render-targets/global-memory-resources.

Memory is another problem, and here I think Fermi's advantage is even bigger. Fermi has 128KB of L2 cache per 64-bit memory channel (so 768KB across a 384-bit bus), which should help a lot.
I won't disagree, but it's still too early to compare.

My point is, for some applications, NVIDIA may already have a performance advantage, so you don't have to wait a year or two.
I'm pessimistic.

That's true. I hope when AMD release their OpenCL implementations, they can also release some hints about optimizing for RV870, like NVIDIA's performance guide. Anything will help.
For example, AMD added a SAD instruction, which "makes motion estimation faster in video-encoding". That's uber-specific.
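For readers who haven't met it, here's a rough sketch (mine, with an illustrative block size and layout) of what that instruction accelerates - the inner loop of block-matching motion estimation is just a sum of absolute differences, and CUDA exposes a per-element form of the same operation as __sad():

```cuda
// Hypothetical 8x8 block SAD: one subtract/abs/accumulate per pixel.
// A hardware SAD instruction folds those steps into a single op.
__device__ unsigned int block_sad_8x8(const unsigned char *cur,
                                      const unsigned char *ref,
                                      int stride)
{
    unsigned int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad = __sad(cur[y * stride + x], ref[y * stride + x], sad);
    return sad;
}
```

A motion-estimation kernel evaluates something like this for many candidate offsets per block, which is why a dedicated instruction pays off.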
I don't think you understand what I'm saying. I'm not saying you need 4 times the register file, nor 4 times the BW.
As an example, if you have 64 batches needing 4kB of register space each (16 floats per thread), then the current design will only ever have to access from an 8 kB subset of the 256 kB register file during any 8-cycle period. My proposal will have to access from a 32 kB subset. Both designs fetch 1 kB of data per cycle.
Theoretically this shouldn't cost anything at all, because AFAIK registers are accessed on a simple bus connecting memory locations together. But I could be wrong.
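Spelling out the arithmetic behind the 8 kB / 32 kB figures above (the 64-wide batch and the 4 operand-floats per thread per cycle are my assumptions, chosen to reproduce the post's numbers):

```cuda
#include <cstdio>

int main()
{
    const int batches           = 64;
    const int threads_per_batch = 64;  // assumed batch/wavefront width
    const int floats_per_thread = 16;
    const int bytes_per_float   = 4;

    // 64 threads * 16 floats * 4 bytes = 4 KB per batch, 256 KB in total.
    const int per_batch_bytes = threads_per_batch * floats_per_thread * bytes_per_float;
    const int reg_file_kb     = batches * per_batch_bytes / 1024;

    // Working set over an 8-cycle window: 2 batches today vs 8 in the proposal.
    printf("register file: %d KB, per batch: %d KB\n", reg_file_kb, per_batch_bytes / 1024);
    printf("current: %d KB, proposal: %d KB\n",
           2 * per_batch_bytes / 1024, 8 * per_batch_bytes / 1024);

    // 64 threads * 4 floats * 4 bytes fetched per cycle = 1 KB either way.
    printf("fetch per cycle: %d KB\n", threads_per_batch * 4 * bytes_per_float / 1024);
    return 0;
}
```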
Lots of games, going back, don't scale so well with new GPUs from prior generations, so I'm less bothered by the scaling. There are games that are scaling by 80%+ on HD5870 over HD4890.

Is it just the perf-scaling disappointment from Cypress, or the rumours of an upcoming real DX11 GPU?
Ok, I see your point. I've been looking at it too much from a micro level.

The P4 has 4 instruction streams (2 threads/core), and can simultaneously execute 128b SIMD ops from each instruction stream.
I was being a bit facetious, sorry.

How is Fermi MIMD in a way that Nehalem or a dual-core P4 isn't?
In double-precision, ATI's ALUs are scalar for MULs and MADs and vec2 for ADDs. GF100 at 1.5GHz will be slower for DP-ADD than HD5870.
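A rough back-of-the-envelope check of that claim (the 320 VLIW units, 850 MHz, 512 lanes and half-rate DP are my assumptions about the two chips, not figures from the post):

```cuda
#include <cstdio>

int main()
{
    // HD5870: 320 VLIW units, 2 DP ADDs per unit per clock, 850 MHz.
    const double hd5870_dp_add = 320 * 2 * 0.85e9;   // ~544 G DP adds/s

    // GF100: 512 SP lanes at a ~1.5 GHz hot clock, DP at half the SP rate.
    const double gf100_dp = (512 / 2) * 1.5e9;        // ~384 G DP ops/s

    printf("HD5870 DP-ADD: %.0f Gops/s\n", hd5870_dp_add / 1e9);
    printf("GF100  DP:     %.0f Gops/s\n", gf100_dp / 1e9);
    return 0;
}
```

On those assumptions the DP-ADD peak does indeed favour HD5870.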
Broad comparison of compute at the core level:
I don't think it's been commented on explicitly so far in this thread, but NVidia has got rid of the out-of-order instruction despatch, which scoreboarded each instruction (to assess its dependency on prior instructions). Now NVidia is scoreboarding threads, which should save a few transistors, as operand readiness is evaluated per thread and instructions are issued purely sequentially.
- ATI (mostly ignoring control flow processor and high level command processor)
- thread size 64
- in-order issue 5-way VLIW
- slow double-precision
- "statically" allocated register file, with spill (very slow spill?) and strand-shared registers
- large register file (256KB) + minimal shared memory (32KB) + small read-only L1 (8KB?)
- high complexity register file accesses (simultaneous ALU, TU and DMA access), coupled with in-pipe registers
- separate DMA in/out of registers instead of load-store addresses in instructions
- stack-based predication (limited capacity of 32) for stall-less control flow (zero-overhead)
- static calls, restricted recursion
- 128 threads in flight
- 8 (?) kernels
- Intel (ignoring the scalar x86 part of the core)
- thread size 16
- in-order purely scalar-issue (no separate transcendental unit - but RCP, LOG2, EXP2 instructions)
- half-throughput double-precision
- entirely transient register file
- small register file, large cache (256KB L2 + 32KB L1) (+ separate texture cache inaccessible by core), no dedicated shared memory
- medium complexity register file (3 operands fetch, 1 resultant store)
- branch prediction coupled with 16 predicate registers (zero-overhead apart from mis-predictions)
- dynamic calls, arbitrary recursion
- 4 threads in flight
- 4 kernels
- NVidia (unknown internal processor hierarchy)
- thread size of 32
- in-order superscalar issue across a three-SIMD vector unit: 2x SP-MAD + special function unit (not "multi-function interpolator")
- half-throughput double-precision
- "statically" allocated register file, with spill (fast, cached?)
- medium register file + medium-sized multi-functional cache/shared-memory
- super-scalar register file accesses (for ALUs; TUs too?)
- predicate-based stall-less branching (with dedicated branch evaluation?)
- dynamic calls, arbitrary recursion
- 32 threads in flight
- 1 kernel
Jawed
Comments:
ATI's cache hierarchy has a number of peculiarities. Please correct me as I may be wrong.
1. For ATI, there is also a 64KB GDS (global data share), which should be usable on RV8xx.
No, some PowerPoint presentations Intel released, when turned into a wav file and played backwards, prove without a doubt Larrabee is half rate.

I'm not so sure that Larrabee is actually half the rate for double precision. If that's half the vector width, it doesn't sound like more than 1/4 rate in real terms.
I think that's one of the factors in why Nvidia is pushing 3D Vision and PhysX as hard as they do.
I was when I thought ATI's architecture already does this, but all along I've been proposing that we fetch operands at the same rate as now. All I'm asking for is the ability to execute serially dependent instructions.

You are proposing fetching from 3 addresses per clock, i.e. like Larrabee.
Brilliant post there, Jawed. One worthy of archiving.
One minor comment. Aren't the caches on Intel CPUs inclusive? That would make L1 a part of L2, so it would not be 32K L1 + 256K L2.