If a workload scales well enough with cores, then straight-line performance is not the be-all and end-all.
Which parallel workloads are unforgiving of reduced straight line perf?
Remote computation for any user-interactive workload is going to have a latency maximum. Perhaps a server used for OTOY could service 100,000 simultaneous users on a little platformer game if the response time or frame rate were allowed to drop to 1 frame per second, but at that point it would no longer be useful for the intended workload.
In somewhat related hardware, various web server and transaction benchmarks, and by extension the consumers of such services, have maximum allowable response times.
With an ISA like ARM, each instruction can do a lot more. So dual issue for ARM makes much more sense.
That's not necessarily true, and not necessarily a good thing from a silicon perspective.
High-performance ARM implementations may split some of the more anachronistic features, like the built-in shift option, into a separate internal operation, because it is a potential timing liability.
For Cortex-A9, it says it does OoO and superscalar execution. I dunno if it is on the market yet. I thought it was.
The products I heard rumors about were for next year.
Maybe branch prediction is larger than decode. But the decoders are more complex than they are for a typical RISC, and AFAIK ARM does not use microcode.
I know some older, simpler ARM chips did not have microcode. I do not know whether that is true of the more recent ones.
Once again, let's look at NV40. It has quite a big register file, but it makes poor use of it. Subsequent architectures improved on it a lot. The amount of data hasn't really gone down, and the relative transistor budget for registers is roughly the same, but the available storage is used much more efficiently, allowing each strand to use more registers without crippling performance. So this is clear proof that architectural improvements can be a better option than just throwing more silicon at it.
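To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The register-file size and per-strand register counts are made-up illustrative numbers, not NV40 or G80 figures; the point is only that, for a fixed register budget, the number of strands available to hide latency falls as each strand claims more registers, so using the same storage more efficiently buys real performance.

    # For a fixed register file, strands in flight fall as each strand
    # claims more registers. All numbers are illustrative assumptions.
    REGISTER_FILE_WORDS = 8192          # hypothetical 32-bit registers per SIMD core

    def strands_in_flight(regs_per_strand):
        """How many strands can be resident at once."""
        return REGISTER_FILE_WORDS // regs_per_strand

    for regs in (4, 8, 16, 32):
        print(f"{regs:2d} regs/strand -> {strands_in_flight(regs):4d} strands to hide latency with")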
Do you mean NV45 or G80?
G80 would not be a data point in your favor in terms of strands in flight or the number and size of the registers.
Same thing for texture caches, even in modern architectures. If the average latency is too high then you'll run out of threads to hide that latency.
Those caches are there more for saving bandwidth than hiding latency. The optimal thread counts and the assumed arithmetic density for the bulk of GPU work are proportioned so that a trip to DRAM is accounted for.
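As a rough illustration of that proportioning, here is a Little's-law style estimate in Python. The latency and arithmetic-density numbers are assumptions chosen for illustration, not figures for any particular GPU.

    # Threads needed per lane to cover a DRAM trip is roughly
    # (miss latency) / (ALU cycles each thread can do between fetches).
    # All numbers below are illustrative assumptions.
    DRAM_LATENCY_CYCLES = 400    # hypothetical round trip to DRAM
    ALU_OPS_PER_FETCH   = 10     # assumed arithmetic density of the workload

    threads_needed = DRAM_LATENCY_CYCLES / ALU_OPS_PER_FETCH
    print(f"~{threads_needed:.0f} resident threads per lane to keep the ALU busy "
          f"across a {DRAM_LATENCY_CYCLES}-cycle miss")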
So some minor changes to the prefetching heuristics (speculation!) can make a big difference. Likewise, an increase in size might cost maybe 1% of additional die space, but if that fixes the biggest bottleneck, those are transistors very well spent.
That would be a different kind of speculation than what was focused on earlier. Interestingly, once bandwidth utilization becomes high even without prefetching, it is often better to disable it, at least in the case of multi-socket CPU work.
Once there is no spare bandwidth, speculation becomes a liability, and the upside is much smaller for architectures that already tolerate a massive amount of latency. Since speculation and prefetching for CPUs can easily increase bandwidth demands 2-3 times, this might be why Larrabee's hardware has not been disclosed as doing much of this.
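As a toy illustration of why that happens, here is a sketch of a stride-style prefetcher's effect on traffic; the depth and accuracy numbers are invented for illustration. Prefetched lines that go unused are pure extra DRAM traffic, which is harmless while there is spare bandwidth and a liability once the bus is saturated.

    # Toy model: each demand miss triggers DEPTH prefetched lines; only a
    # fraction of them are ever used. The rest inflate DRAM traffic.
    # Depth and accuracy are illustrative assumptions.
    PREFETCH_DEPTH    = 3      # lines speculatively fetched per demand miss
    PREFETCH_ACCURACY = 0.4    # fraction of prefetched lines actually used

    useful = PREFETCH_DEPTH * PREFETCH_ACCURACY          # replace future demand misses
    wasted = PREFETCH_DEPTH * (1.0 - PREFETCH_ACCURACY)  # pure overhead
    inflation = (1.0 + useful + wasted) / (1.0 + useful)
    print(f"DRAM traffic is {inflation:.2f}x what the same work needs without prefetching")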
RV770 does prefetch based on what the triangle setup hardware determines are the necessary texture coordinates. It's not the stride-based prefetching CPUs do, as it actually knows what those addresses are. It still might be speculative; there might be ways to branch away for some lanes that would have consumed the fetches, but I'm not versed in those particulars.
So at least RV770 speculates in a narrow sense for about 5% of the die.
The UVD decoder might have some amount, as it's a specialized MIPS-based core of some sort, I think. I haven't seen much analysis on that one.
Larrabee is leading the way. It uses the L2 cache for many different purposes. It stores unprocessed vertices, processed vertices, primitive gradients, framebuffer data, shader constants, spilled registers, etc.
Textures?
And no bottlenecks?
I don't recall that the first step in eliminating bottlenecks when working with many processes is to make them all share the same resource.
There are always bottlenecks, unless the design purposefully leaves performance on the table everywhere.
None of that data poses a bottleneck any more, and storage not used by one task is automatically used to assist other tasks.
From a software standpoint, it sounds like this all comes totally free...
But when branches are killing performance because you've run out of threads, resorting to speculation doesn't sound so bad.
That's not really the reason why branches kill performance in SIMD architectures.
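A minimal sketch of the actual mechanism, assuming a hypothetical 16-wide SIMD: divergent lanes are handled by executing both sides of the branch under a mask, so the cost is serialization of the paths within a vector, independent of how many threads are resident.

    # Divergence in a SIMD machine: if lanes disagree on a branch, both
    # paths are issued for the whole vector with inactive lanes masked off.
    import random

    WIDTH = 16                                   # hypothetical SIMD width
    lanes = [random.random() for _ in range(WIDTH)]
    mask_then = [x > 0.5 for x in lanes]
    mask_else = [not m for m in mask_then]

    passes = 0
    if any(mask_then):
        passes += 1    # 'then' side issued for the whole vector
    if any(mask_else):
        passes += 1    # 'else' side issued for the whole vector

    print(f"{sum(mask_then)}/{WIDTH} lanes took the branch; {passes} passes executed")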
It has an implementation cost too, but with the things we'd like to run on the GPU becoming ever more complex there's bound to be a day when speculation is cheaper than adding more context storage.
That's a pretty fuzzy statement.
Any unspecified amount of X can be made cheaper than any unspecified addition of Y.
Factors like arithmetic throughput growth, bandwidth constraints, and power constraints would typically make this case unlikely for most reasonable scenarios.
Absolutely. And in a few years time you'll say exactly the same thing about today's architectures.
It will get said of all architectures at some point. Designs are meant to target the current workload and a small window of time in the future.
The product in question had serious problems at the outset.
I know.
The Unabridged Pentium 4 mentions 128 integer alias data registers and 128 floating-point alias data registers. But they didn't grow the physical register file when adding Hyper-Threading. Also, P6 had 40 alias data registers, in total...
The P4 serves as a data point where speculation had been taken to an excessive level. A lot of on-chip storage did not grow, and the interference from extreme speculation made performance highly unpredictable. The later Prescott actually fixed some of those problems, but it was thermally limited to the point that it mostly didn't matter.
My point is that adding out-of-order execution and speculation don't force you to have a physical register file much larger than the logical register file.
The register file for a Tomasulo-style OoO engine is the most obvious source of context growth.
Changing to a different method still adds context, be it a central table or other forms of status tracking.
Are there methods that add less context? Yes, but they are far less effective without register renaming and would encourage the 64-128 ISA-visible registers we see anyway.
The tiny register file of x86 might be a benefit, as context switches are much less painful as a result, but it wouldn't be a benefit if it weren't for register renaming.
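For illustration, here is a tiny sketch of that renaming, with invented sizes: eight architectural registers mapped onto a larger physical pool, where each speculative write gets a fresh physical register and the old mapping is only freed at retirement. The logical file stays small; only the physical pool and the tracking state grow.

    # Tiny rename sketch: arch registers map onto a larger physical pool.
    # Sizes are illustrative, not taken from any real core.
    ARCH_REGS, PHYS_REGS = 8, 32
    free_list = list(range(ARCH_REGS, PHYS_REGS))           # unallocated physical regs
    rename_table = {f"r{i}": i for i in range(ARCH_REGS)}   # arch name -> phys index

    def rename_write(arch_reg):
        """Allocate a fresh physical register for a speculative write."""
        phys = free_list.pop(0)
        old = rename_table[arch_reg]
        rename_table[arch_reg] = phys
        return phys, old        # 'old' is freed once this write retires

    new, old = rename_write("r3")
    print(f"r3 now maps to p{new}; p{old} is reclaimed at retirement")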
Last but not least, the law of diminishing returns tells GPU designers that they can't keep turning the same knobs to achieve higher performance. So one day they'll have to look at the remaining ones as well...
The primary knobs they've been looking at most recently are improving on-chip communications, handling divergence, and getting smart about memory traffic.
OoO might mitigate some of the first by making some latency hiccups more tolerable.
OoO and speculation tend to be negatives on the rest.
Anyway, unless there is some glaring evidence this will never happen, I'll leave the discussion at this. It has cost enough trees already and it looks like we're not going to reach consensus. Which is fine. After all we're speculating about speculation.
As you wish. The thread's a bit off on a tangent, anyway.