> I am not clear on the latter claim. Both banking (pseudo-dual porting, as in AMD CPUs) and multiporting involve performing addressing on more than one access per cycle. What is the "lots of additional logic"?

The addressing logic for an entire vector. With gather/scatter as supposedly implemented by Larrabee, you only need to compute one full address per cycle. The rest only requires comparing the upper bits of the offsets to check whether they translate to the same cache line, and using the lower bits for addressing within the cache line. It does slow down in conflict cases.
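For illustration, here is a minimal model of that coalescing check (my own sketch, not Larrabee's actual implementation; it assumes 64-byte cache lines and a 16-lane gather):

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical model: one full address computed per cycle, one distinct
// cache line serviced per cycle. Only the upper offset bits decide
// whether lanes share a line; the low 6 bits just select data within it.
int gatherCyclesModel(uint32_t base, const uint32_t offsets[16]) {
    std::unordered_set<uint32_t> lines;
    for (int lane = 0; lane < 16; ++lane)
        lines.insert((base + offsets[lane]) >> 6); // compare upper bits only
    return static_cast<int>(lines.size()); // 1 best case, 16 in full conflict
}
```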
> There must be reasons why it is not available. Innovation usually happens as people work around problems and constraints. It is difficult to predict innovations by ignoring those constraints.

Yes, there absolutely has to be a reason, but design constraints are just one of many possible reasons. So again, by itself, something not being available yet is never an argument against it.
> If there is locality within the address space, why can't a few wide loads and shuffling the values around suffice?

Because at the software level you don't know in advance which vectors to load and how to shuffle them. When sampling 16 texels they could all be located in one vector, so you need just one vector load and one shuffle operation, or they could be spread further apart and require 16 loads and shuffles.
Of course gather/scatter can be implemented in software too, but you absolutely won't achieve a peak throughput of 1 vector per cycle.
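To make that concrete, a 4-wide software gather on an ISA without the instruction ends up as one load (plus an insert) per lane. A sketch using SSE4.1 intrinsics (my own example; the function name is made up):

```cpp
#include <smmintrin.h>  // SSE4.1

// Software gather: four separate loads and three inserts to fill a
// single 4-lane vector, even when all elements share one cache line.
__m128 softwareGather(const float* base, const int idx[4]) {
    __m128 v = _mm_load_ss(&base[idx[0]]);                   // lane 0
    v = _mm_insert_ps(v, _mm_load_ss(&base[idx[1]]), 0x10);  // lane 1
    v = _mm_insert_ps(v, _mm_load_ss(&base[idx[2]]), 0x20);  // lane 2
    v = _mm_insert_ps(v, _mm_load_ss(&base[idx[3]]), 0x30);  // lane 3
    return v;
}
```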
> We're on a utilization kick here; why are we now pulling peak performance (at the expense of utilization) into the argument?

Because they can't be separated. Low utilization of a dedicated rasterizer is fine as long as it is tiny and offers high peak performance. Likewise, not using all lanes in arithmetic vector operations is an acceptable compromise. And finally, requiring extra cycles to access multiple cache lines in a gather or scatter operation is fine as long as the hardware cost is reasonable and peak performance makes up for it.
> This has been asserted, not substantiated.

What did you expect, a netlist?
I've indicated that it requires only relatively simple functionality. So instead of counter-asserting it with nothing at all, please give me some real counter-arguments why it wouldn't be feasible.
> My statement was a shot at implementing a specialized load on a generic architecture, not a specialized design. We can get much better utilization of generic hardware with a stream of scalar loads.

I know, but it raises the question whether doubling the number of ports would have been too expensive. Note that these architectures can perform 2 FMA operations for every load/store. That's 256 bits of ALU input/output data for every 32 bits of load/store bandwidth. Correct me if I'm wrong, but that seems like it could be a severe bottleneck. Larrabee has twice the L1 bandwidth.
> I was curious about what settings you used to arrive at your numbers.

To prove/disprove what point?
Anyway, I'm short on time (too busy working on ANGLE), but feel free to test it yourself with the public evaluation demo. It runs Crysis 2 fine, and by running the benchmark with different RAM latencies you can evaluate how much out-of-order execution, prefetch and Hyper-Threading compensate for it. I'm curious about the exact results myself, although I'm quite confident that the effect of increasing RAM latency will be small.
> True multi-porting is more expensive than banking, since it increases the size of the storage cells and adds word lines.

Sure, but it's free of banking conflicts. And for gather/scatter it might be a necessary compromise to keep the throughput close to 1 vector per cycle. You need to weigh the cost against the gains.
In the case of SNB, the L1 cache is already dual-ported, so two 128-bit gather/scatter units (instead of a bigger 256-bit unit) would allow the operations to complete in 4 cycles in the worst case when all data is in L1. And with a best case of 1 cycle, the average throughput for accesses with high locality (like texture sampling) would be excellent. In my opinion it would make the IGP redundant.
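Under those assumptions the cycle count works out like this (a sketch only: 64-byte lines, an 8 x 32-bit gather split evenly across the two hypothetical 128-bit units, all data in L1):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>

// Each 128-bit unit handles 4 lanes and services one distinct cache
// line per cycle; the two units run in parallel on the dual-ported L1.
// Best case 1 cycle, worst case 4 cycles.
int dualUnitGatherCycles(uint32_t base, const uint32_t offsets[8]) {
    int cycles = 0;
    for (int unit = 0; unit < 2; ++unit) {
        std::unordered_set<uint32_t> lines;
        for (int lane = 0; lane < 4; ++lane)
            lines.insert((base + offsets[unit * 4 + lane]) >> 6); // 64B lines
        cycles = std::max(cycles, static_cast<int>(lines.size()));
    }
    return cycles;
}
```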
> In the absence of consideration for power, area, and overall performance within those constraints, probably true.

Fast forward ten years. What good would a power-efficient GPU with 100,000 ALUs be if you can't realistically reach good utilization? It seems well worth it to me to spend part of that area on techniques that speed up single-threaded performance.
> How would this be implemented? It sounds like it would need some kind of stateful load/store unit to know the proper mapping over varying formats and tiling schemes.

Why? Just let the software keep track of which version of the load/store instruction has to be used.
One generic implementation is to interleave the lower address bits:
... b11 b10 b9 b8 b7 b5 b6 b3 b4 b1 b2 b0
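Read literally, that pattern leaves b0 and everything from b7 upward in place and swaps the bit pairs (b1, b2), (b3, b4) and (b5, b6). A sketch of that permutation (my own code; the point is that plain shifts and masks implement the swizzle, so a software-selected "tiled" load/store variant stays cheap):

```cpp
#include <cstdint>

// Swizzle the low address bits as in the pattern above:
// b0 and b7..b31 keep their positions, while b1<->b2, b3<->b4 and
// b5<->b6 swap, interleaving the low bits for 2D locality.
uint32_t interleaveLowBits(uint32_t addr) {
    uint32_t fixed = addr & ~0x7Eu;        // b0 plus b7 and up, unchanged
    uint32_t up    = (addr & 0x2Au) << 1;  // b1, b3, b5 move up one slot
    uint32_t down  = (addr & 0x54u) >> 1;  // b2, b4, b6 move down one slot
    return fixed | up | down;
}
```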