#51 | |||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
That said, some reviews report that Intel puts a lot of load on the CPU while rendering 3D graphics (see "CPU Usage in Graphics"). Some even claim all geometry shaders execute on the CPU. In any case, to objectively compare pure software rendering against the IGP, I don't think we can neglect the many roles the CPU still plays in assisting the IGP. Unfortunately I don't have a Sandy Bridge system myself, so I can't provide any accurate numbers. Quote:
Quote:
Just look at the sheer computing power. An i7-2600 can do 218 GFLOPS (not counting any turbo mode). At 800x600, that's a staggering 450,000 floating-point operations per pixel per second, or a budget of 15,000 operations per pixel per frame at 30 frames per second. Currently a lot of this power goes to waste, though, because of the lack of gather/scatter (forcing some memory accesses to be serial scalar operations) and because the API demands certain detours. Quote:
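For anyone who wants to check the per-pixel budget quoted above, here is a minimal sketch of the arithmetic. The 4 cores and 3.4 GHz base clock are the i7-2600's published figures; the 16 single-precision operations per cycle per core (one 8-wide AVX add plus one 8-wide AVX multiply) is an assumption chosen because it reproduces the 218 GFLOPS number. The function names are made up for illustration.

```python
# Peak-throughput budget per pixel, using the figures quoted in the post.
CORES = 4               # i7-2600 core count
CLOCK_HZ = 3.4e9        # base clock, no turbo
FLOPS_PER_CYCLE = 16    # assumption: 8-wide AVX add + 8-wide AVX mul per cycle


def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak floating-point operations per second."""
    return cores * clock_hz * flops_per_cycle


def per_pixel_budget(flops, width, height, fps):
    """Return (operations per pixel per second, operations per pixel per frame)."""
    per_second = flops / (width * height)
    return per_second, per_second / fps


peak = peak_flops(CORES, CLOCK_HZ, FLOPS_PER_CYCLE)
per_second, per_frame = per_pixel_budget(peak, 800, 600, 30)
print(f"{peak / 1e9:.1f} GFLOPS, {per_second:,.0f} ops/pixel/s, {per_frame:,.0f} ops/pixel/frame")
```

This is a peak figure only; it says nothing about how much of it survives the serial loads and API detours mentioned in the post.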
|
|||||
|
|
|
|
|
#52 | |||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
Besides, it's a chicken-and-egg problem. There aren't many truly scalable multi-threaded applications yet because there's still a fairly low percentage of quad-core systems. But that's going to change in the next few years. Also note that there are very few consumer GPGPU applications, for the exact same sort of reason (few DX10+ capable systems). Developers simply won't invest in something that is not likely to pay off. But that doesn't mean we can't start looking at the sort of technology that will be most interesting for the future. Given that the CPU is ahead of the IGP in processing power (and there's more to come with FMA), but lacks some efficiency, it makes sense to add gather/scatter support, lower power consumption with FMA-1024, and replace the IGP with more CPU cores. Quote:
Quote:
Quote:
|
|||||
|
|
|
|
|
#53 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Yes, SIMD is underutilized, and the number one reason is that compilers have a really hard time parallelizing code. And that's because every scalar operation has a vector equivalent, except for load/store! Support for gather/scatter would fix that.
|
|
|
|
|
|
#54 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
If a hypothetical compiler is able to generate gather/scatter instructions for a given code sequence, then it would also be able to replace those instructions (if not supported) with loads and stores; it's not really rocket science. Performance might be less optimal, but it's certainly not the lack of gather/scatter instructions in some ISAs that makes life hard for certain parallelizing compilers.
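To make the fallback concrete, here is a rough sketch of what "replace those instructions with loads and stores" means semantically. It is plain Python standing in for the per-lane scalar code a compiler could emit; the names `gather` and `scatter` and the last-lane-wins rule for duplicate indices are assumptions for illustration, not any particular ISA's definition.

```python
def gather(memory, indices):
    """Emulated vector gather: result[lane] = memory[indices[lane]]."""
    # One scalar load per lane, which is what the fallback expands to.
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Emulated vector scatter: memory[indices[lane]] = values[lane]."""
    # One scalar store per lane, issued in lane order, so with duplicate
    # indices the highest lane wins (an assumed ordering rule).
    for i, v in zip(indices, values):
        memory[i] = v


table = [10, 20, 30, 40]
lanes = gather(table, [3, 0, 0, 2])
scatter(table, [1, 1, 2], [7, 8, 9])
print(lanes, table)
```

The result is the same with or without hardware support; what changes is how many instructions and memory-pipeline slots it takes to get there.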
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
#55 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
I didn't say gather/scatter support would make the compiler's life less hard per se, but it would make it a whole lot more effective. Currently a lot of the effort put into auto-vectorization simply goes to waste because the lack of gather/scatter negates the results.

Also note that it's not getting any better. FMA support will make Intel's architecture capable of 32 floating-point operations per cycle per core. Compared to the 18 instructions it takes to gather 8 values, that's like driving an F1 car with the parking brake on. AVX and FMA make the serial load/store bottleneck appear four times narrower. So it's clear that something needs to be done if they want this wide SIMD ISA to be utilized more and offer a return on their investment.

Fortunately Intel researchers appear to realize this too: "Atomic Vector Operations on Chip Multiprocessors". Note that gather/scatter units with a maximum throughput of 1 vector every cycle are considered perfectly feasible. And with AVX-1024 executed in four cycles on 256-bit execution units they'd get the same SIMD width as NVIDIA, while reducing the out-of-order execution overhead by a factor of four. FMA increases performance/Watt as well. It's all within reach.

So the question isn't whether or not this will one day be added to CPU architectures. The question is: what will GPU manufacturers do to compete with it? AMD is in a nice position because it can add these features to its CPU line too while at the same time offering GPUs that continue to target hardcore gamers. NVIDIA appears to be forced to sacrifice some graphics performance to increase GPGPU efficiency. Project Denver has the potential to conquer some desktop/laptop CPU market space, but they have a lot of catching up to do to design something like this, compensate for the process disadvantage, and get developers to program for it. The ARM architecture and NVIDIA's experience with throughput computing could result in a killer platform though. |
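A quick back-of-the-envelope sketch of where the "32 operations per cycle" and "four times narrower" figures can come from. The port counts are assumptions: two 8-wide FMA units per core (each FMA counted as two operations) for the AVX+FMA case, and one 4-wide add port plus one 4-wide multiply port for the SSE baseline. Only the 18-instruction gather cost is taken from the post.

```python
def peak_flops_per_cycle(lanes, ports, flops_per_lane_op):
    """Peak floating-point operations per cycle for one core."""
    return lanes * ports * flops_per_lane_op


# Assumed baseline: 128-bit SSE, separate add and multiply ports.
SSE = peak_flops_per_cycle(lanes=4, ports=2, flops_per_lane_op=1)
# Assumed future core: 256-bit AVX, two FMA ports, 2 flops per fused op.
AVX_FMA = peak_flops_per_cycle(lanes=8, ports=2, flops_per_lane_op=2)

GATHER_INSTRUCTIONS = 18   # figure quoted in the post
GATHER_LANES = 8
instructions_per_value = GATHER_INSTRUCTIONS / GATHER_LANES

print(SSE, AVX_FMA, AVX_FMA // SSE, instructions_per_value)
```

Under these assumptions the arithmetic side grows 4x while the emulated gather still costs more than two instructions per value, which is the imbalance the post is pointing at.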
|
|
|
|
|
|
#56 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#57 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259
|
The SIMD model simulated in the paper adds a second memory pipeline right next to the LSU, and adds an additional data path to check the contents of the load/store queues.
The core is assumed to be in-order, and the memory hierarchy expands the check process somewhat. It does assume a very heavily banked L2, and a directory-based coherence protocol with a smaller number of states than what is customary. Only one such operation can be issued at a time per thread, and the operations are blocking. I think there are significant barriers to implementing the scheme as described on a high-speed OoO design with a different memory pipeline coupled with heavy memory speculation.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#58 | |
|
Senior Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,017
|
Quote:
And how much would it improve the performance if the memory accesses cannot be coalesced? If they can, it should perform very well with the already existing loads, shouldn't it? And if they can't be coalesced, I would look up how much performance this costs on GPUs, for instance (occupying the ALUs isn't a problem in such cases). SRAM or DRAM arrays don't get more ports and a higher bandwidth just because an ISA supports gather/scatter.
__________________
x: RCP_sat R2.x, R1.y y: RCP_sat ____, R1.y z: RCP_sat ____, R1.y |
|
|
|
|
|
|
#59 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259
|
There is an instruction cache benefit if a bunch of loads can be represented by a single gather, even if internally the chip just ran a little microcoded loop and spat out scalar loads in sequence. Perhaps the scatter/gather could run through a similar process as some of the string operations that run in microcode.
Being able to coalesce would save on the number of cache accesses and shave off cycles. How aggressive the implementation is in pursuing coalescing opportunities and how well it can broadcast and permute from cache line to vector lane would determine how complex the memory pipeline would be. We may need to agree on what kind of implementation we are speculating on before guessing at numbers.
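As a rough illustration of what coalescing buys, the sketch below counts how many distinct cache lines a single gather touches; an implementation that coalesces perfectly would need that many cache accesses, while a serial one needs one per lane. The 64-byte line size, 4-byte elements, and the function name are assumptions for the example.

```python
def cache_lines_touched(base_addr, indices, elem_bytes=4, line_bytes=64):
    """Number of distinct cache lines covered by the gathered element addresses."""
    return len({(base_addr + i * elem_bytes) // line_bytes for i in indices})


contiguous = list(range(8))              # 8 neighbouring floats
strided = [i * 16 for i in range(8)]     # one float every 64 bytes

best_case = cache_lines_touched(0, contiguous)
worst_case = cache_lines_touched(0, strided)
print(best_case, worst_case)
```

The contiguous case collapses to one access and the strided case to eight, so any speedup estimate depends heavily on the index pattern and on how aggressive the hardware is about merging lanes.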
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#60 |
|
Senior Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,017
|
My point was that gather/scatter alone isn't going to win the game.
Of course one needs less space in the instruction cache, for instance. But that probably gives you a performance benefit only in the single-digit percentage range. And one occupies a lot fewer execution resources, true. But again, this won't be a game changer, even if it would buy 30% or even 50% performance on average (which it probably won't do). What would help is a cache/memory structure which can handle a lot of simultaneous requests to different addresses, i.e. a cache with, let's say, 8 read ports or something like that.

So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from gather/scatter additions to the ISA, in my opinion. It makes things simpler and also a bit faster, but not that much.

By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and did not integrate them into the general purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...
__________________
x: RCP_sat R2.x, R1.y y: RCP_sat ____, R1.y z: RCP_sat ____, R1.y |
|
|
|
|
|
#61 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#62 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259
|
The relatively small code expansion that resulted from transitioning from x86 to x86-64 was enough to negate the doubling of the register file. The difference either way was a few percent, yes.
The situation with a scatter/gather could involve 8, 16, or 32 instructions being folded into one, so the benefits could be stronger in performance critical code. The benefits would probably be stronger in an OoO processor than the in-order architecture used in the paper. One important consideration in an OoO implementation is that the scatter/gather should not be blocking. The penalty in the case that it remains blocking may be enough to make the OoO engine counterproductive in scatter/gather heavy code. The disambiguation hardware would be negated, and the core could not generate MLP from other instructions in the stream while the scatter/gather blocks issue. The debate about the nature of the memory subsystem has come up in other threads.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#63 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Sure, but the point was that Intel is well aware that "SIMD efficiency is compromised in the presence of irregular data access patterns", and is researching gather/scatter implementations. Given that AVX and FMA make the bottleneck worse, I'd be surprised if they haven't started giving it even more attention. The experience with Larrabee probably also helps bring it to a real product sooner rather than later.
|
|
|
|
|
|
#64 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Of course it's not all that simple in practice, but I hope this illustrates that the potential goes far beyond a few tens of percent. Even a simple serial implementation of gather/scatter would be a "game changer" when combined with AVX and FMA. Quote:
The situation is very different for the CPU. There's lots of available die space to deliver 'adequate' graphics. And people need a CPU anyway, so competitive graphics performance alone does not determine commercial success. There are clearly different trade-offs to be made, and personally I think it makes sense to keep it low risk and simply add some form of gather/scatter support. Adding texture lookup instructions to the ISA seems like a bad idea to me anyway, since it's very specialized and graphics is still evolving. At some point I'm expecting GPUs to also feature programmable texture filtering, and things like address calculations and perspective projection have already moved to the shader cores on some architectures. With a software renderer the majority of time is currently spent gathering texels, and besides, even with dedicated texture units there would still be a great need for gather/scatter in the rest of the graphics pipeline, not to mention for other applications. |
||
|
|
|
|
|
#65 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Regarding gather/scatter, this paper, detailing Tarantula, a vector extension to the Alpha architecture, is a very good read.
Highlights:
8192-bit vector registers, holding 128 doubles.
Vbox loads and stores access the L2 directly (bypassing L1).
The L2 is pseudo multi-ported, with 16 ways supporting up to 16 accesses in parallel.
Addresses from gathering loads/scattering stores are grouped to minimize way/bank conflicts.
Aggregate bandwidth of 512 bytes/cycle (256 read, 256 write) for stride-1 accesses.
All this in 2002.
Cheers
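To give a feel for what "grouped to minimize way/bank conflicts" could mean, here is a toy sketch that sorts gather addresses into rounds in which no two addresses hit the same bank. This is not the algorithm from the Tarantula paper; the bank mapping (cache-line index modulo 16), the 64-byte line size, and the function name are all assumptions for illustration.

```python
from collections import defaultdict

VECTOR_BITS = 8192
DOUBLES_PER_VECTOR = VECTOR_BITS // 64   # 128, matching the highlight above


def conflict_free_rounds(addresses, banks=16, line_bytes=64):
    """Split addresses into rounds with at most one access per bank per round."""
    per_bank = defaultdict(list)
    for addr in addresses:
        per_bank[(addr // line_bytes) % banks].append(addr)  # assumed mapping
    depth = max((len(queue) for queue in per_bank.values()), default=0)
    # Round r takes the r-th pending address from every bank that still has one.
    return [[q[r] for q in per_bank.values() if r < len(q)] for r in range(depth)]


spread = conflict_free_rounds([i * 64 for i in range(16)])        # one per bank
clumped = conflict_free_rounds([i * 64 * 16 for i in range(4)])   # all in bank 0
print(len(spread), len(clumped))
```

With the addresses spread over all 16 banks everything goes out in a single round, while addresses that all map to one bank serialize into one round each, which is why the grouping step matters for gather/scatter bandwidth.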
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
#66 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259
|
It was put on paper in 2002, at least. It would have gone on a 65nm process, which was not due for some time.
It would have been interesting to see if Tarantula would have met some of the projections, two process nodes past where EV8 would have been. First Intel at 90nm and then AMD at 65nm had significant problems with high-performance chips in the meantime. EV8 was a power monster at 130nm, and it was not confirmed EV8 would hit all its targets within 250W, let alone scale perfectly across two problematic nodes. Looking at the specs, we see numbers for some things like the registers, interconnect, and L2 that are several times more per-core than entire chips get all the way down to 22nm.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#67 | |
|
Senior Member
|
Quote:
For LRB1, it wouldn't matter if the scatter/gather was blocking, as it had lots of threads to hide latency with and it was in order anyway. The cost of non blocking scatter/gather is interesting to look at. What additional complications might arise while implementing it vis-a-vis in order scatter-gather? I guess non blocking gather and blocking scatter would be easier to make. It is interesting that even ARM's T604 GPU IP doesn't implement scatter/gather as they said that making these units is hard (or something like that). |
|
|
|
|
|
|
#68 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259
|
Blocking could negate the benefits of having gather. If no operations can reorder around it, a single gather element that drops to the L2 would force a whole memory pipeline stall for ten or so cycles in the case of a fast L2 like Sandy Bridge's, and 20 or more for an L2 like that of Bulldozer. That matters given the importance of the L/S pipes: if an optimized loop shaves off 10 instructions using gather and thus buys 2-5 cycles, the tens of stall cycles would add the penalty right back and then some.
I think it shouldn't be that restrictive for an OoO core. A speculative memory pipeline capable of reordering loads around stores would need to behave differently with a coalescing scatter/gather. If N addresses are coalesced into a single cache line scatter, each of the N addresses must be broken out and individually tracked by the alias predictor, and the store queue populated on a per-address basis to allow proper serialization. Coalesced gather needs per-address checks of the store queue and aliasing predictor. Forwarding may be applied here as well.

The worst and best-case scenarios for the memory pipeline and scatter/gather are mostly opposite. For the SGU, an operation that coalesces into a single cycle and single cache access is best. This is the worst case for the speculation and forwarding hardware, which speculates with at most 2 addresses. This can wind up kneecapping scatter/gather at far below its peak, or forcing a multiplication of the memory pipeline's speculation resources far beyond what the scalar case needs.

Perhaps it can be changed so that the pipeline is pessimistic when dealing with coalesced ops. The cache line's base address would be flagged as being for a scatter/gather, and anything in that range will register as a conflict regardless of the actual address. This would add an additional small check per prediction and would add some amount of storage per predictor entry. There could be a separate scatter/gather alias predictor, but that's adding a second check in a tightly timed memory pipeline. The simplest alternative besides not speculating is to globally shut down memory speculation when a scatter/gather is in flight, but that would compromise the benefit of the operation significantly.
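The pessimistic option described above can be sketched as a line-granular conflict check: an in-flight coalesced scatter flags its whole cache line, and any later access in that range reports a conflict whether or not the exact bytes overlap. This is only a behavioural model under assumed parameters (64-byte lines, a hypothetical `PessimisticTracker` class); real alias predictors and store queues are organized quite differently.

```python
LINE_BYTES = 64  # assumed cache line size


def line_base(addr):
    """Base address of the cache line containing addr."""
    return addr & ~(LINE_BYTES - 1)


class PessimisticTracker:
    """Tracks cache lines flagged by in-flight coalesced scatter/gather ops."""

    def __init__(self):
        self.flagged = set()

    def issue_coalesced(self, addresses):
        # Flag every line the coalesced op touches, not the individual addresses.
        self.flagged.update(line_base(a) for a in addresses)

    def retire_coalesced(self, addresses):
        self.flagged.difference_update(line_base(a) for a in addresses)

    def conflicts(self, addr):
        # Anything inside a flagged line registers as a conflict,
        # regardless of the actual address (false positives are accepted).
        return line_base(addr) in self.flagged


tracker = PessimisticTracker()
tracker.issue_coalesced([0x1004, 0x1010])   # scatter to two words in one line
same_line = tracker.conflicts(0x103C)       # never written, still a conflict
next_line = tracker.conflicts(0x1040)       # different line, no conflict
print(same_line, next_line)
```

The trade-off is visible even in the toy model: the check is one comparison per line instead of one per element, at the price of stalling loads that never actually aliased.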
__________________
Dreaming of a .065 micron etch-a-sketch. Last edited by 3dilettante; 17-May-2011 at 23:00. |
|
|
|
|
|
#69 |
|
Regular
|
A scatter/gather implemented either with serialization or a banked L2 has far too high a latency for speculative execution of it to ever make sense, IMO. Just use vertical multithreading.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#70 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Quote:
The solution, IMO, is to use a weak memory ordering model for gather/scatter. Alpha used a weak memory ordering model. For x86 I'd mark pages in the page table to use a weak memory ordering model (I'd make 'vector' pages bigger too). That way the vector loads and stores could bypass the normal scalar load/store unit. As long as loads and stores can't alias, gathers and scatters can proceed out of order. If your compiler can't prove that loads and stores won't alias, it would have to insert memory fences (effectively serializing accesses). Cheers
__________________
I'm pink, therefore I'm spam Last edited by Gubbi; 18-May-2011 at 09:53. |
|
|
|
|
|
|
#71 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#72 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#73 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Quote:
Why would you do that? Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#74 |
|
French frog
Join Date: Jun 2005
Location: France
Posts: 4,259
|
First, thanks for the answers; they made me a bit less dumb.
I'll dare to ask some other questions, though. Disclaimer: these questions assume that it can be relevant to use a CPU core as the basis of a design intended to achieve high throughput on data-parallel workloads (which doesn't look like an idea a lot of people agree with).
Do I understand the arguments properly?
Case 1: something CPU-based has no chance of being competitive, no matter when.
Case 2: given time, there's a chance.
* Case 2.1: it's even possible within a CPU with high single-thread performance:
** From what I read, it could prove really difficult.
** If it happens, it would have significant costs in perf per mm² / Watt, which has a chance of sending us back to case 1.
* Case 2.2: a simple, narrow CPU with multithreading is the way to go.
** Still costly and non-trivial.
** As "standard" CPU perf would crumble, it lowers the interest of using a CPU core in the first place.
** Mapping wider vectors to narrower SIMD units could help a lot (hiding latencies for ops such as scatter/gather).
** Doing scatter/gather from L2 would avoid paying an extra price on L1 (which would be more complex, slower, more power hungry, etc.). Using the previous point + SMT, there are ways to hide the extra latencies.
** Doing texturing work is out of the question.
Here it comes, the one thousand billion dollar question: do I properly understand the different opinions, given that I don't even try to read the various links, which are way too advanced for me?
A real question: is texturing really impossible (I mean with acceptable performance)? Is there nothing that can be added to an already specialized CPU core, such as a Larrabee one, that could help? Could a tiny read-only cache akin to those found in GPUs help? Could the scalar pipeline do the calculations? Or really no matter what you add to the design (tex units aside...
__________________
What's trying to be a bunch of presentations Sebbbi about virtual texturing Blessed is Leatrix Latency Fix |
|
|
|
|
|
#75 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|