16-May-2011, 01:22   #51
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by liolio View Post
I have a couple of "honest" questions. Some here are real software developers, like Nick; others seem to know their fair share about hardware/micro-electronics and software. I'm just a geek, so no offence.
I'm actually a computer engineer with a minor in embedded systems. But no offence taken.
Quote:
What is the cost of running SwiftShader "itself" on the CPU? Is it in the same ballpark as running the HD3000 drivers? Or higher, and if so, significantly?
Good question. The vast majority of execution time goes to dynamically generated processing routines. The rest is divided between some 'fixed-function' processing, format conversions, and the actual 'driver' and API functionality. The latter two (which is what I assume you meant by SwiftShader "itself") are really thin layers. There's a very short path between the application and starting the actual calculations.

That said, some reviews report that Intel puts a lot of load on the CPU while rendering 3D graphics: CPU Usage in Graphics. Some even claim all geometry shaders execute on the CPU.

In any case, to objectively compare pure software rendering against the IGP, I don't think we can neglect the many roles the CPU still plays in assisting the IGP. Unfortunately I don't have a Sandy Bridge system myself, so I can't provide any accurate numbers.
Quote:
Is SwiftShader optimized for AVX already?
No.
Quote:
What are your expectations for something like 3DMark06 if it were implemented, if not straight to the metal, then using various libraries? How close do you think it would come to the IGP/HD3000?
Hard to say. If I recall correctly it uses some blur filters which could be implemented way more efficiently with custom vector code instead of lots of texture lookups. But I'm sure that by having a full overview of the rendering process at an application level, there's a lot more that can be optimized by departing from the legacy graphics pipeline.

Just look at the sheer computing power. An i7-2600 can do 218 GFLOPS (not counting in any turbo mode). At 800x600, that's a staggering 450,000 floating-point operations per pixel per second, or a budget of 15,000 operations per pixel at 30 frames per second. Currently a lot of this power goes to waste though because of the lack of gather/scatter (forcing some memory accesses to be serial scalar operations), and because the API demands certain detours.
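The arithmetic behind those numbers can be reproduced in a few lines. This is only a sketch: the 3.4 GHz base clock, the 4 cores and the 16 single-precision operations per cycle per core (one 8-wide AVX multiply plus one 8-wide AVX add) are assumptions about where the 218 GFLOPS figure comes from, not something stated in the post.

```python
# Back-of-the-envelope budget for software rendering on an i7-2600.
# Assumed (not stated in the post): 3.4 GHz base clock, 4 cores, and
# 16 single-precision FLOPs per cycle per core (8-wide mul + 8-wide add).
CLOCK_HZ = 3.4e9
CORES = 4
FLOPS_PER_CYCLE_PER_CORE = 16

peak_flops = CLOCK_HZ * CORES * FLOPS_PER_CYCLE_PER_CORE

WIDTH, HEIGHT, FPS = 800, 600, 30
pixels = WIDTH * HEIGHT

flops_per_pixel_per_second = peak_flops / pixels
flops_per_pixel_per_frame = flops_per_pixel_per_second / FPS

print(f"peak: {peak_flops / 1e9:.1f} GFLOPS")
print(f"per pixel per second: {flops_per_pixel_per_second:,.0f}")
print(f"per pixel per frame at {FPS} fps: {flops_per_pixel_per_frame:,.0f}")
```

Under these assumptions the budget comes out at roughly 453,000 operations per pixel per second and about 15,100 per pixel per frame, in line with the rounded figures quoted above.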
Quote:
Basically, do you think it would be possible for a quad-core to achieve the "same" result as an IGP plus a dual-core?
With gather/scatter, FMA and AVX-1024 support, yes, I'm convinced that the IGP would be a waste of silicon. It might take many more years for gather/scatter support to be implemented though, so quad-cores are probably outdated by then. But given that the CPU is already ahead of the IGP in GFLOPS, FMA will double it again, the IGP is limited by bandwidth, and graphics itself is getting more generic, I think it's very doubtful that the IGP can outrun its fate.

16-May-2011, 02:10   #52
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by rpg.314 View Post
Sure, all you need is encoding space and the generalized shuffle networks don't cost a dime of L1D latency. Or area.
It's only two instruction encodings, not a big deal. And gather/scatter can have higher L1 latency than other memory accesses. Especially with AVX-1024 on 256-bit execution units that latency is easily hidden. And area shouldn't be too much of an issue either given that LRB3 apparently has wider shuffle networks and more cores.
Quote:
Where are these apps which scale with cpu cores and vector width, but not with GPU's, let alone scaling even more with GPU's? Really, where are they?
Why exclude applications which scale with the GPU? Every single GPGPU application is a really nice example of something that would greatly benefit from gather/scatter and extra cores.

Besides, it's a chicken-and-egg problem. There aren't many truly scalable multi-threaded applications yet because there's still a fairly low percentage of quad-core systems. But that's going to change in the next few years. Also note that there are very few consumer GPGPU applications, for the exact same sort of reason (few DX10+ capable systems). Developers simply won't invest into something that is not likely to pay off. But that doesn't mean we can't start looking at the sort of technology that will be most interesting for the future. Given that the CPU is ahead of the IGP in processing power (and there's more to come with FMA), but lacks some efficiency, it makes sense to add gather/scatter support, lower power consumption with FMA-1024, and replace the IGP with more CPU cores.
Quote:
Pointless as these costs apply to sw rendering as well.
Only partially, and it shifts the balance. If for instance the IGP itself consumes 20 Watt and the rest of the system consumes 30 Watt during rendering, then a total power consumption of 70 Watt for pure software rendering isn't all that bad. Some would incorrectly compare the 20 Watt against 70 Watt, while it's really 50 Watt versus 70 Watt. And when you look at the potential for doing more with less the balance can totally tip in favor of generic software.
Quote:
While that would be nice, that road has a lot of stumbling blocks. Games are made for consoles these days, with a few PC specific features used. With no console with a flexi arch around, who'll invest that much for one chip out of three?
The Xbox 360 has three CPU cores, the PlayStation 3 has Cell. Over the course of their existence game developers have started to use all this power, and the same multi-threaded engines were deployed on the PC as well. So even if the next generation of consoles doesn't have a fully homogeneous architecture, it's still quite likely that they'll sport more cores and continue to advance multi-threaded game development in the PC market as well.
Quote:
You expect Office to speed up with scatter/gather. Or Windows?
Absolutely. Any non-trivial codebase has loops which can be auto-vectorized a lot more effectively with gather/scatter.

16-May-2011, 02:16   #53
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by entity279 View Post
The vector part of our x86 CPUs has always seemed underutilized to me, by consumer apps at least. It's not always trivial to code for either, and you need to give the compiler hints.
Yes, SIMD is underutilized, and the number one reason is that compilers have a really hard time parallelizing code. And that's because every scalar operation has a vector equivalent, except for load/store! Support for gather/scatter would fix that.

16-May-2011, 04:32   #54
nAo
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308

Quote:
Originally Posted by Nick View Post
Yes, SIMD is underutilized, and the number one reason is that compilers have a really hard time parallelizing code. And that's because every scalar operation has a vector equivalent, except for load/store! Support for gather/scatter would fix that.
If a hypothetical compiler is able to generate gather/scatter instructions for a given code sequence, then it would also be able to replace those instructions (if not supported) with loads and stores; it's not really rocket science. Performance might be less optimal, but it's certainly not the lack of gather/scatter instructions in some ISAs that makes the life of certain parallelizing compilers hard.
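To make the substitution concrete, here is a rough sketch in Python rather than real vector code. `gather` stands in for a hypothetical single instruction, and `emulated_gather` for what a compiler would have to emit without it: one scalar load and one lane insert per element. Both names and the operation count are made up for illustration.

```python
# Sketch only: models an 8-lane gather and its scalar emulation.
# `gather` and `emulated_gather` are hypothetical names, not real intrinsics.

def gather(table, indices):
    """What a single gather instruction would do: out[i] = table[indices[i]]."""
    return [table[i] for i in indices]

def emulated_gather(table, indices):
    """Fallback without gather: one scalar load plus one lane insert per element."""
    out = [0.0] * len(indices)
    scalar_ops = 0
    for lane, index in enumerate(indices):
        value = table[index]   # scalar load
        out[lane] = value      # insert into the vector lane
        scalar_ops += 2
    return out, scalar_ops

table = [x * 0.5 for x in range(16)]
indices = [3, 0, 7, 7, 12, 1, 15, 4]

result, ops = emulated_gather(table, indices)
assert result == gather(table, indices)   # same values either way
print(result, ops)
```

The values are identical either way; what differs is the number of operations that have to be issued to produce them.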
__________________
More samples, we need more samples! [Dean Calver]
First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way

16-May-2011, 12:38   #55
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by nAo View Post
If a hypothetical compiler is able to generate gather/scatter instructions for a given code sequence, then it would also be able to replace those instructions (if not supported) with loads and stores; it's not really rocket science. Performance might be less optimal, but it's certainly not the lack of gather/scatter instructions in some ISAs that makes the life of certain parallelizing compilers hard.
Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines. That would have been "less optimal". Today's situation is just horrible.

I didn't say gather/scatter support would make the compiler's life less hard per se, but it would make it a whole lot more effective. Currently a lot of effort into auto-vectorization simply goes to waste because the lack of gather/scatter negates the results.

Also note that it's not getting any better. FMA support will make Intel's architecture capable of 32 floating-point operations per cycle per core. Compared to the 18 instructions it takes to gather 8 values, that's like driving an F1 car with the parking brakes on. AVX and FMA make the serial load/store bottleneck appear four times narrower. So it's clear that something needs to be done if they want this wide SIMD ISA to be utilized more and offer a return on their investment. Fortunately Intel researchers appear to realize this too:

Atomic Vector Operations on Chip Multiprocessors

Note that gather/scatter units with a maximum throughput of 1 vector every cycle are considered perfectly feasible. And with AVX-1024 executed in four cycles on 256-bit execution units they'd get the same SIMD width as NVIDIA, while reducing the out-of-order execution overhead by a factor of four. FMA increases performance/Watt as well. It's all within reach.
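The throughput figures in this post follow from simple arithmetic, sketched below. The unit counts per core (two vector units, and two FMA units in the future case) are assumptions chosen to reproduce the quoted 32 operations per cycle; they are not confirmed hardware specifications.

```python
# Rough peak-throughput arithmetic for the figures in this post.
# The unit counts per core are assumptions, not confirmed hardware specs.

def flops_per_cycle(lanes, units, ops_per_lane):
    """Peak single-precision FLOPs per cycle for one core."""
    return lanes * units * ops_per_lane

sse = flops_per_cycle(lanes=4, units=2, ops_per_lane=1)      # 128-bit mul + add
avx = flops_per_cycle(lanes=8, units=2, ops_per_lane=1)      # 256-bit mul + add
avx_fma = flops_per_cycle(lanes=8, units=2, ops_per_lane=2)  # two FMA units assumed

# Scalar load/store throughput stays the same, so relative to SSE the
# memory bottleneck looks this many times narrower.
narrowing = avx_fma // sse

# A 1024-bit logical operation sequenced over 256-bit execution units.
passes = 1024 // 256          # cycles per AVX-1024 instruction
logical_lanes = 1024 // 32    # 32-bit floats per AVX-1024 register

print(sse, avx, avx_fma, narrowing, passes, logical_lanes)
```

With these assumptions the arithmetic gives 8, 16 and 32 operations per cycle, a factor of four between SSE and AVX with FMA, and 32 lanes per AVX-1024 register, which matches the 32-wide warps NVIDIA uses.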

So the question isn't whether or not this will one day be added to CPU architectures. The question is what will GPU manufacturers do to compete with it? AMD is in a nice position because it can add these features to its CPU line too while at the same time offering GPUs that continue to target hardcore gamers. NVIDIA appears to be forced to sacrifice some graphics performance to increase GPGPU efficiency. Project Denver has the potential to conquer some desktop/laptop CPU market space, but they have a lot of catching up to do to design something like this, compensate for the process disadvantage, and get developers to program for it. The ARM architecture and NVIDIA's experience with throughput computing could result in a killer platform though.

16-May-2011, 15:50   #56
rpg.314
Senior Member
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by Nick View Post
Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines. That would have been "less optimal". Today's situation is just horrible.

I didn't say gather/scatter support would make the compiler's life less hard per se, but it would make it a whole lot more effective. Currently a lot of effort into auto-vectorization simply goes to waste because the lack of gather/scatter negates the results.

Also note that it's not getting any better. FMA support will make Intel's architecture capable of 32 floating-point operations per cycle per core. Compared to the 18 instructions it takes to gather 8 values, that's like driving an F1 car with the parking brakes on. AVX and FMA make the serial load/store bottleneck appear four times narrower. So it's clear that something needs to be done if they want this wide SIMD ISA to be utilized more and offer a return on their investment. Fortunately Intel researchers appear to realize this too:

Atomic Vector Operations on Chip Multiprocessors

Note that gather/scatter units with a maximum throughput of 1 vector every cycle are considered perfectly feasible. And with AVX-1024 executed in four cycles on 256-bit execution units they'd get the same SIMD width as NVIDIA, while reducing the out-of-order execution overhead by a factor of four. FMA increases performance/Watt as well. It's all within reach.

So the question isn't whether or not this will one day be added to CPU architectures. The question is what will GPU manufacturers do to compete with it? AMD is in a nice position because it can add these features to its CPU line too while at the same time offering GPUs that continue to target hardcore gamers. NVIDIA appears to be forced to sacrifice some graphics performance to increase GPGPU efficiency. Project Denver has the potential to conquer some desktop/laptop CPU market space, but they have a lot of catching up to do to design something like this, compensate for the process disadvantage, and get developers to program for it. The ARM architecture and NVIDIA's experience with throughput computing could result in a killer platform though.
That's a 3 year old paper using a sw simulator. Doesn't mean much for/against a real product.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.

16-May-2011, 17:07   #57
3dilettante
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259

The SIMD model simulated in the paper adds a second memory pipeline right next to the LSU, and adds an additional data path to check the contents of the load/store queues.

The core is assumed to be in-order, and the memory hierarchy expands the check process somewhat. It does assume a very heavily banked L2, and a directory-based coherence protocol with a smaller number of states than what is customary.
Only one can be issued at a time per-thread, and the operations are blocking.

I think there are significant barriers to implementing the scheme as described on a high-speed OoO design with a different memory pipeline coupled with heavy memory speculation.
__________________
Dreaming of a .065 micron etch-a-sketch.

16-May-2011, 18:48   #58
Gipsel
Senior Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,017

Quote:
Originally Posted by Nick View Post
Less optimal is a huge understatement. Emulating a 256-bit gather operation takes 18 instructions! Even a nearly braindead hardware implementation of it could have reduced it to two parallel sets of 4 serial load operations without occupying any ALU pipelines.
One simple question:
And how much would it improve performance if the memory accesses cannot be coalesced? If they can, it should perform very well with the already existing loads, shouldn't it? And if they can't be coalesced, I would look up how much performance this costs on GPUs, for instance (occupying the ALUs isn't a problem in such cases). SRAM or DRAM arrays don't get more ports and higher bandwidth just because an ISA supports gather/scatter.

16-May-2011, 19:14   #59
3dilettante
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259

There is an instruction cache benefit if a bunch of loads can be represented by a single gather, even if internally the chip just ran a little microcoded loop and spat out scalar loads in sequence. Perhaps the scatter/gather could run through a similar process as some of the string operations that run in microcode.

Being able to coalesce would save on the number of cache accesses and shave off cycles. How aggressive the implementation is in pursuing coalescing opportunities and how well it can broadcast and permute from cache line to vector lane would determine how complex the memory pipeline would be.
We may need to agree on what kind of implementation we are speculating on before guessing at numbers.
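As a starting point for that kind of agreement, the two extremes can be modelled by counting cache accesses. This is a sketch under stated assumptions: 64-byte cache lines, 4-byte elements, and a table that starts on a line boundary. The function names are made up for illustration.

```python
# Sketch: how many cache-line accesses a gather needs, depending on how
# aggressively the hardware coalesces. 64-byte lines, 4-byte elements and a
# line-aligned table are assumptions for illustration.
LINE_BYTES = 64
ELEM_BYTES = 4

def line_of(index):
    """Cache line that holds element `index` of a line-aligned table."""
    return (index * ELEM_BYTES) // LINE_BYTES

def accesses_serial(indices):
    """No coalescing: one cache access per element."""
    return len(indices)

def accesses_coalesced(indices):
    """Full coalescing: one access per distinct cache line touched."""
    return len({line_of(i) for i in indices})

contiguous = list(range(8))                 # all eight elements in one line
strided = [i * 16 for i in range(8)]        # every element in its own line
clustered = [0, 1, 2, 3, 40, 41, 42, 43]    # two groups, two lines

for name, idx in [("contiguous", contiguous), ("strided", strided), ("clustered", clustered)]:
    print(name, accesses_serial(idx), accesses_coalesced(idx))
```

A serial implementation always pays eight accesses here, while a fully coalescing one pays between one and eight depending on the access pattern. That is why the choice of implementation matters before guessing at numbers.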

16-May-2011, 20:06   #60
Gipsel
Senior Member
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,017

My point was that gather/scatter alone isn't going to win the game.

Of course one needs less space in the instruction cache, for instance. But that probably gives you a performance benefit in the single-digit percentage range. And one occupies a lot fewer execution resources, true. But again, this won't be a game changer, even if it bought 30% or even 50% performance on average (which it probably won't). What would help is a cache/memory structure which can handle a lot of simultaneous requests to different addresses, i.e. a cache with, let's say, 8 read ports or something like that. So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from adding gather/scatter to the ISA, in my opinion. It makes things simpler and also a bit faster, but not that much.

By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and did not integrate them into the general purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...

16-May-2011, 20:40   #61
rpg.314
Senior Member
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by Gipsel View Post
My point was that gather/scatter alone isn't going to win the game.

Of course one needs less space in the instruction cache, for instance. But that probably gives you a performance benefit in the single-digit percentage range. And one occupies a lot fewer execution resources, true. But again, this won't be a game changer, even if it bought 30% or even 50% performance on average (which it probably won't). What would help is a cache/memory structure which can handle a lot of simultaneous requests to different addresses, i.e. a cache with, let's say, 8 read ports or something like that. So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from adding gather/scatter to the ISA, in my opinion. It makes things simpler and also a bit faster, but not that much.

By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and did not integrate them into the general purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...
Well, the magic pixie dust in scatter gather is supposed to make MS word go faster...

16-May-2011, 20:58   #62
3dilettante
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259

The relatively small code expansion that resulted from transitioning from x86 to x86-64 was enough to negate the doubling of the register file. The difference either way was a few percent, yes.

The situation with a scatter/gather could involve 8, 16, or 32 instructions being folded into one, so the benefits could be stronger in performance critical code.
The benefits would probably be stronger in an OoO processor than the in-order architecture used in the paper.

One important consideration in an OoO implementation is that the scatter/gather should not be blocking.
The penalty in the case that it remains blocking may be enough to make the OoO engine counterproductive in scatter/gather heavy code. The disambiguation hardware would be negated, and the core could not generate MLP from other instructions in the stream while the scatter/gather blocks issue.


The debate about the nature of the memory subsystem has come up in other threads.

17-May-2011, 00:39   #63
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by rpg.314 View Post
That's a 3 year old paper using a sw simulator. Doesn't mean much for/against a real product.
Sure, but the point was that Intel is well aware that "SIMD efficiency is compromised in the presence of irregular data access patterns", and is researching gather/scatter implementations. Given that AVX and FMA make the bottleneck worse, I'd be surprised if they haven't started giving it even more attention. The experience with Larrabee probably also helps bring it to a real product sooner rather than later.

17-May-2011, 00:42   #64
Nick
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by Gipsel View Post
So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from adding gather/scatter to the ISA, in my opinion. It makes things simpler and also a bit faster, but not that much.
Let's take the example of a piecewise parabolic approximation of a function. This essentially requires two FMA operations, and three table lookups. Using SSE, this takes 28 instructions for 4 values. With AVX and support for FMA and gather/scatter, it would take 5 instructions for 8 values. So it could potentially be up to 11 times faster.

Of course it's not all that simple in practice, but I hope this illustrates that the potential goes far beyond a few tens of percent. Even a simple serial implementation of gather/scatter would be a "game changer" when combined with AVX and FMA.
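A minimal sketch of the piecewise parabolic example follows, written in Python with lists standing in for vector registers. The choice of function (`math.exp` on [0, 1)), the 16-entry tables and all names are assumptions for illustration. The point is only that each batch of values needs three table lookups (the gathers) and two multiply-adds, which is presumably where the count of 5 instructions comes from.

```python
import math

SEGMENTS = 16  # table size is an arbitrary choice for this sketch

def build_tables(f, segments=SEGMENTS):
    """Fit one parabola a*x*x + b*x + c per segment of [0, 1)."""
    a_tab, b_tab, c_tab = [], [], []
    for i in range(segments):
        x0 = i / segments
        x1 = (i + 1) / segments
        xm = 0.5 * (x0 + x1)
        d1 = (f(xm) - f(x0)) / (xm - x0)   # first divided differences
        d2 = (f(x1) - f(xm)) / (x1 - xm)
        a = (d2 - d1) / (x1 - x0)          # second divided difference
        a_tab.append(a)
        b_tab.append(d1 - a * (x0 + xm))
        c_tab.append(f(x0) - d1 * x0 + a * x0 * xm)
    return a_tab, b_tab, c_tab

def approx(xs, tables, segments=SEGMENTS):
    """Evaluate all lanes: three table lookups (gathers) and two multiply-adds."""
    a_tab, b_tab, c_tab = tables
    idx = [int(x * segments) for x in xs]                # segment index per lane
    a = [a_tab[i] for i in idx]                          # gather 1
    b = [b_tab[i] for i in idx]                          # gather 2
    c = [c_tab[i] for i in idx]                          # gather 3
    t = [ai * x + bi for ai, x, bi in zip(a, xs, b)]     # multiply-add 1
    return [ti * x + ci for ti, x, ci in zip(t, xs, c)]  # multiply-add 2

tables = build_tables(math.exp)
xs = [0.03, 0.21, 0.34, 0.5, 0.62, 0.77, 0.88, 0.99]
ys = approx(xs, tables)
max_error = max(abs(y - math.exp(x)) for x, y in zip(xs, ys))

# Instruction counts quoted in the post, expressed per value.
speedup = (28 / 4) / (5 / 8)
print(max_error, speedup)
```

The last line restates the post's arithmetic: 7 instructions per value against 0.625 per value, a ratio of 11.2.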
Quote:
By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and did not integrate them into the general purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...
Like I said before, Larrabee tried to compete against high-end GPUs. Texture units were a necessity, even though they're of little or no use for anything other than graphics. As we all witnessed, the failure as a GPU meant a commercial disaster.

The situation is very different for the CPU. There's lots of available die space to deliver 'adequate' graphics. And people need a CPU anyway, so competitive graphics performance alone does not determine commercial success. There are clearly different trade-offs to be made, and personally I think it makes sense to keep it low risk and simply add some form of gather/scatter support.

Adding texture lookup instructions to the ISA seems like a bad idea to me anyway since it's very specialized and graphics is still evolving. At some point I'm expecting GPUs to also feature programmable texture filtering, and things like address calculations and perspective projection have already moved to the shader cores on some architectures. With a software renderer the majority of time is currently spent gathering texels, and besides, even with dedicated texture units there would still be a great need for gather/scatter in the rest of the graphics pipeline, not to mention for other applications.

17-May-2011, 09:52   #65
Gubbi
Senior Member
Join Date: Feb 2002
Posts: 2,646

Regarding gather/scatter, this paper, detailing Tarantula, a vector extension to the Alpha architecture, is a very good read.

Highlights:
8192-bit vector registers, holding 128 doubles.
Vbox loads and stores access the L2 directly (bypassing L1).
The L2 is pseudo multi-ported, with 16 ways supporting up to 16 accesses in parallel.
Addresses from gathering loads/scattering stores are grouped to minimize way/bank conflicts.
Aggregate bandwidth of 512 bytes/cycle (256 read, 256 write) for stride-1 accesses.
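The address grouping can be illustrated with a small sketch. The 16 banks follow the description above; the 64-byte line size, the address-to-bank mapping and the greedy policy are assumptions for illustration and should not be read as Tarantula's actual algorithm.

```python
# Sketch: split a gather's addresses into rounds so that no two addresses in
# a round hit the same bank. 16 banks follows the L2 description above; the
# 64-byte line size, the bank mapping and the greedy policy are assumptions,
# not Tarantula's actual algorithm.
BANKS = 16
LINE_BYTES = 64

def bank_of(address):
    return (address // LINE_BYTES) % BANKS

def group_conflict_free(addresses):
    rounds = []  # each round maps bank -> address
    for address in addresses:
        bank = bank_of(address)
        for current in rounds:
            if bank not in current:
                current[bank] = address   # first round with this bank free
                break
        else:
            rounds.append({bank: address})  # every round has this bank taken
    return [sorted(r.values()) for r in rounds]

stride_one_line = [i * LINE_BYTES for i in range(16)]     # 16 different banks
same_bank = [i * LINE_BYTES * BANKS for i in range(4)]    # all map to bank 0

print(len(group_conflict_free(stride_one_line)))
print(len(group_conflict_free(same_bank)))
```

In this model, sixteen addresses spread over sixteen banks go out in a single round, while four addresses that collide on one bank need four rounds.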

All this in 2002.

Cheers
__________________
I'm pink, therefore I'm spam

17-May-2011, 17:20   #66
3dilettante
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259

It was put on paper in 2002, at least. It would have gone on a 65nm process, which was not due for some time.

It would have been interesting to see if Tarantula would have met some of the projections, two process nodes past where EV8 would have been. First Intel at 90nm and AMD at 65nm had significant problems with high-performance chips in the meantime. EV8 was a power monster at 130nm, and it was not confirmed EV8 would hit all its targets within 250W let alone scale perfectly for two problematic nodes.

Looking at the specs, we see numbers for some things like the registers, interconnect, and L2 that are several times more per-core than entire chips get all the way down to 22nm.

17-May-2011, 17:39   #67
rpg.314
Senior Member
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by 3dilettante View Post
One important consideration in an OoO implementation is that the scatter/gather should not be blocking.
The penalty in the case that it remains blocking may be enough to make the OoO engine counterproductive in scatter/gather heavy code. The disambiguation hardware would be negated, and the core could not generate MLP from other instructions in the stream while the scatter/gather blocks issue.
In that case, scatter/gather without careful (and possibly expensive) re-layout of data structures would just defeat OoO in applications that were written without the forethought of vectorization.

For LRB1, it wouldn't matter if the scatter/gather was blocking, as it had lots of threads to hide latency with and it was in order anyway.

The cost of non blocking scatter/gather is interesting to look at. What additional complications might arise while implementing it vis-a-vis in order scatter-gather? I guess non blocking gather and blocking scatter would be easier to make.

It is interesting that even ARM's T604 GPU IP doesn't implement scatter/gather as they said that making these units is hard (or something like that).

17-May-2011, 22:48   #68
3dilettante
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,259

Blocking could negate the benefits of having gather. If no operations can reorder around it, a single gather element that drops to the L2 would force a whole memory pipeline stall for ten or so cycles in the case of a fast L2 like Sandy Bridge's, or 20 or more for an L2 like that of Bulldozer. Given the importance of the L/S pipes, that is costly: if an optimized loop shaves off 10 instructions using gather and thus buys 2-5 cycles, the tens of stall cycles would add the penalty right back and then some.
I think it shouldn't be that restrictive for an OoO core.


A speculative memory pipeline capable of reordering loads around stores would need to behave differently with a coalescing scatter/gather.

If N addresses are coalesced into a single cache line scatter, each of the N addresses must be broken out and individually tracked by the alias predictor and the store queue populated on a per-address basis to allow proper serialization.
Coalesced gather needs per-address checks of the store queue and aliasing predictor. Forwarding may be applied here as well.

The worst and best-case scenarios for the memory pipeline and scatter/gather are mostly opposite.
For the SGU, an operation that coalesces into a single cycle and single cache access is best.
This is the worst-case for the speculation and forwarding hardware, which speculates with at most 2 addresses. This can wind up kneecapping scatter/gather at far below its peak, or forcing a multiplication of the memory pipeline's speculation resources far beyond what the scalar case needs.

Perhaps it can be changed so that the pipeline is pessimistic when dealing with coalesced ops. The cache line's base address would be flagged as being for a scatter/gather, and anything in that range would register as a conflict regardless of the actual address. This would add a small extra check per prediction and some amount of storage per predictor entry. There could be a separate scatter/gather alias predictor, but that adds a second check in a tightly timed memory pipeline.
The simplest alternative besides not speculating is to globally shut down memory speculation when a scatter/gather is in flight, but that would compromise the benefit of the operation significantly.
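The pessimistic scheme can be sketched as follows. This is a software model only: 64-byte cache lines are assumed, and the class and method names are hypothetical rather than a description of any real memory pipeline.

```python
# Sketch of the pessimistic check described above: a coalesced scatter/gather
# flags whole cache lines, and any later access that falls in a flagged line
# is treated as a conflict without comparing exact addresses.
# 64-byte lines and all names are assumptions for illustration.
LINE_BYTES = 64

def line_base(address):
    return address - (address % LINE_BYTES)

class PessimisticTracker:
    def __init__(self):
        self.flagged_lines = set()

    def issue_scatter_gather(self, addresses):
        """Record only the line base addresses touched by the operation."""
        self.flagged_lines.update(line_base(a) for a in addresses)

    def conflicts(self, address):
        """True if a scalar access must wait behind the in-flight operation."""
        return line_base(address) in self.flagged_lines

    def retire(self):
        """The scatter/gather has completed; its lines are no longer flagged."""
        self.flagged_lines.clear()

tracker = PessimisticTracker()
tracker.issue_scatter_gather([0x1000, 0x1004, 0x1038])   # all within line 0x1000

print(tracker.conflicts(0x1020))  # same line, different address: still a conflict
print(tracker.conflicts(0x1040))  # next line: no conflict
```

The access to 0x1020 is a false positive, which is the price of tracking one entry per line instead of one per element.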

Last edited by 3dilettante; 17-May-2011 at 23:00.

18-May-2011, 00:17   #69
MfA
Regular
Join Date: Feb 2002
Posts: 5,320

A scatter/gather implemented either with serialization or a banked L2 has far too high a latency for speculative execution of it to ever make sense, IMO. Just use vertical multithreading.
__________________
Cinematic is the new streamlined.

18-May-2011, 09:27   #70
Gubbi
Senior Member
Join Date: Feb 2002
Posts: 2,646

Quote:
Originally Posted by 3dilettante View Post
A speculative memory pipeline capable of reordering loads around stores would need to behave differently with a coalescing scatter/gather.
As you point out gather/scatter doesn't gel well with modern speculative load/store units. The advantage of gather/scatter is that you generate a fuckton of parallel memory requests. If you were to dump these in your normal speculating scalar load/store pipe, the structures used to track requests would be large and therefore slow.

The solution, IMO, is to use a weak memory ordering model for gather/scatter. Alpha used a weak memory ordering model. For x86 I'd mark pages in the page table to use a weak memory ordering model (I'd make 'vector' pages bigger too). That way the vector loads and stores could bypass the normal scalar load/store unit. As long as loads and stores can't alias, gathers and scatters can proceed out of order. If your compiler can't prove that loads and stores won't alias, it would have to insert memory fences (effectively serializing accesses).
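The aliasing hazard can be shown with a toy model. Everything here is illustrative: the function names are made up, and `may_alias` is a run-time stand-in for what a compiler would have to prove statically.

```python
# Sketch: why a weakly ordered scatter/gather pair needs a fence when the
# compiler cannot prove the index sets are disjoint. Names are hypothetical.

def scatter(memory, indices, values):
    for i, v in zip(indices, values):
        memory[i] = v

def gather(memory, indices):
    return [memory[i] for i in indices]

def may_alias(store_indices, load_indices):
    """Run-time stand-in for the compiler's alias analysis."""
    return not set(store_indices).isdisjoint(load_indices)

def run(store_idx, values, load_idx, reordered):
    memory = [0] * 8
    if reordered:                       # gather slips ahead of the scatter
        loaded = gather(memory, load_idx)
        scatter(memory, store_idx, values)
    else:                               # program order (what a fence enforces)
        scatter(memory, store_idx, values)
        loaded = gather(memory, load_idx)
    return loaded

disjoint = ([0, 1], [10, 11], [4, 5])       # (store indices, values, load indices)
overlapping = ([0, 1], [10, 11], [1, 2])

for store_idx, values, load_idx in (disjoint, overlapping):
    in_order = run(store_idx, values, load_idx, reordered=False)
    swapped = run(store_idx, values, load_idx, reordered=True)
    print(may_alias(store_idx, load_idx), in_order, swapped)
```

When the index sets are disjoint, both orders load the same values and no fence is needed. When they overlap, the reordered run misses the stored value, which is the case the fence has to prevent.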

Cheers

Last edited by Gubbi; 18-May-2011 at 09:53.

18-May-2011, 16:59   #71
rpg.314
Senior Member
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by MfA View Post
A scatter/gather implemented either with serialization or a banked L2 has far too high a latency for speculative execution of it to ever make sense, IMO. Just use vertical multithreading.
IOW, use a GPU.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 18-May-2011, 17:01   #72
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Quote:
Originally Posted by Gubbi View Post
As you point out, gather/scatter doesn't gel well with modern speculative load/store units. The advantage of gather/scatter is that you generate a fuckton of parallel memory requests. If you were to dump these into your normal speculating scalar load/store pipe, the structures used to track requests would be large and therefore slow.

The solution, IMO, is to use a weak memory ordering model for gather/scatter. Alpha used a weak memory ordering model. For x86 I'd mark pages in the page table to use a weak memory ordering model (I'd make 'vector' pages bigger too). That way the vector loads and stores could bypass the normal scalar load/store unit. As long as loads and stores can't alias, gathers and scatters can proceed out of order. If your compiler can't prove that loads and stores won't alias, it would have to insert memory fences (effectively serializing accesses).

Cheers
With a weak memory model, you can basically forget trivially autovectorizing stuff.
Old 19-May-2011, 08:00   #73
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,646
Default

Quote:
Originally Posted by rpg.314 View Post
With a weak memory model, you can basically forget trivially autovectorizing stuff.
Huh? Do you want a powerful vector ISA with gather/scatter to run C spaghetti?

Why would you do that?

And 'trivially autovectorizing', isn't that an oxymoron?

Cheers
Old 19-May-2011, 15:31   #74
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,259
Default

First, thanks for the answers; they made me a bit less dumb.

I'll dare to ask a few more questions, though. Disclaimer: these questions assume that it can make sense to use a CPU core as the basis of a design intended for high-throughput, data-parallel workloads (which doesn't look like an idea a lot of people agree with).
Do I understand the arguments properly?

Case 1: something CPU-based has no chance of being competitive, now or ever.

Case 2: given time, there's a chance.
* Case 2.1: it's even possible within a CPU with high single-thread performance:
** From what I read, it could prove really difficult.
** If it happens, it would carry a significant cost in performance per mm² and per watt, which could send us back to case 1.
* Case 2.2: simple, narrow CPU cores plus multithreading are the way to go.
** Still costly and non-trivial.
** As "standard" CPU performance would crumble, it reduces the appeal of using a CPU core in the first place.
** Mapping wider vectors to narrower SIMD units could help a lot (hiding latencies for ops such as scatter/gather).
** Doing scatter/gather from L2 would avoid paying an extra price on L1 (which would otherwise become more complex, slower, more power-hungry, etc.). Using the previous point plus SMT, there are ways to hide the extra latencies.
** Doing texturing work is out of the question.
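
The point in case 2.2 about mapping wider vectors onto narrower SIMD units can be put into a back-of-envelope sketch. This is a toy model under loudly stated assumptions, not a description of any real core: each logical vector instruction occupies the SIMD unit for `logical_width / hw_width` cycles, threads issue round-robin, and every other thread always has one independent instruction ready. The function names and the numbers are hypothetical.

```python
import math


def passes_per_instruction(logical_width, hw_width):
    """Cycles one logical vector instruction keeps the narrower SIMD unit busy."""
    return math.ceil(logical_width / hw_width)


def threads_to_hide(latency_cycles, logical_width, hw_width):
    """Threads needed so the others' work covers one thread's memory latency.

    Toy assumption: while one thread waits latency_cycles for a gather,
    each other thread contributes exactly one instruction's worth of cycles.
    """
    busy = passes_per_instruction(logical_width, hw_width)
    return 1 + math.ceil(latency_cycles / busy)


# Made-up numbers: a 20-cycle gather on a 16-lane SIMD unit.
print(threads_to_hide(20, 16, 16))  # logical width equals hardware width
print(threads_to_hide(20, 64, 16))  # 64-wide logical vectors, 4 passes each
```

Under these assumptions, widening the logical vector from 16 to 64 lanes cuts the thread count needed to cover the same latency, which is the sense in which the mapping "hides" scatter/gather latency.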

Here it comes, the one-thousand-billion-dollar question (are bankers indeed banksters?): do I understand the different opinions properly, given that I don't even try to read the various links, which are way too advanced for me?

A real question: is texturing really impossible (I mean with acceptable performance)? Is there nothing that could be added to an already specialized CPU core, such as a Larrabee one, that could help?
Could a tiny read-only cache, akin to those found in GPUs, help?
Could the scalar pipeline do the calculations?
Or is it really a lost cause no matter what you add to the design (tex units aside...)?
Old 19-May-2011, 17:49   #75
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Quote:
Originally Posted by Gubbi View Post
Huh? Do you want a powerful vector ISA with gather/scatter to run C spaghetti?

Why would you do that?

And 'trivially autovectorizing', isn't that an oxymoron?

Cheers
Because apparently there are lots of legacy apps out there that could use scatter/gather but wouldn't fit on GPUs.

All times are GMT +1. The time now is 16:18.

