SSE4, future processors and GPGPU thoughts

You really seem to have issues with understanding some basic concepts of electronics and physics.

For example, don't you realize that multiple CPUs on one board would:

1. Increase the cost of the board because of:
a) Much more expensive signal routing (probably more layers needed)
b) More elements required for decoupling of 4 sockets instead of 1 (solid or aluminium capacitors)
c) More power MOSFETs for powering 4 CPUs instead of 1
d) Bigger surface needed to accommodate all the elements
e) Nearly impractical synchronization for memory access

2. Increase the cost of the whole system because of:
a) A mainboard layout change needs a new case layout
b) You would need to install 4 coolers instead of 1 (also think of the heat and noise that come with the power required for the 1-cycle DOT you are so obsessed with)
c) RAM would have to be much faster (roughly 10 times today's bandwidth)

3. Increase the complexity of software design. Not to mention that they would have to develop the tools for programming that monstrosity first. Don't count on free compilers there.

Furthermore, you are ignoring one simple fact in your requests for more powerful instructions -- the GPU is much better at random access to memory than the CPU. That is because it has multiple smaller texture caches and the data layout in memory is optimized for certain types of accesses (data order is not linear AFAIK). On the CPU side you have 10 times slower RAM (8.5GB/sec or less versus ~85GB/sec on the 8800GTX) and one big cache, both delivering their greatest bandwidth in simple streaming (linear RAM access) operations. That is something that can't just be changed overnight.

Yep, that was what I originally thought... then I saw this:
http://en.wikipedia.org/wiki/Torrenza
http://www.dailytech.com/article.aspx?newsid=2642
http://pc.watch.impress.co.jp/docs/2006/0713/kaigai287.htm

You can also use the upcoming HyperTransport HTX slot to plug in all these coprocessors or math cards like the ClearSpeed...
http://www.dailytech.com/article.aspx?newsid=1276
http://en.wikipedia.org/wiki/HyperTransport
http://www.hypertransport.org/docs/wp/HTX_wp_final.pdf

Notice this is good, because the motherboard has only 2-5 HT 3.0 slots (not 900 ZIF sockets)... But you can plug in an auxiliary card with multiple multi-core CPUs and multiple sockets (again, like the ClearSpeed). You can put a PPU, GPU, raytracing hardware card or whatever there, and it talks directly with the main CPU in the same manner as a coprocessor.

Btw, Intel is preparing the same thing under the code name "Geneseo", but it is a bit different because it uses PCI Express to plug in the coprocessor cards.

About the costs... One capacitor costs $0.10, the routing is cheap, and yes, the extra layers increase the cost a bit... But all that is nothing compared with the $1200 of a quad-core CPU... I think what really increases the cost is the development team, not the manufacturing and packaging... And with these coprocessors the final Verilog/VHDL is much smaller, so the important part, the development cost, will be greatly reduced, because it is easier to develop a 30M-transistor CPU than an 800M one (and you can save silicon and get more CPUs per wafer)

However, I must admit all this could be a bit of vaporware... Torrenza/Geneseo are planned for 2008-2010 and might not be a success. We will see with time.

And about physics... I don't think integrating 900M transistors in a CPU will be good... In fact, the G80/R600 are going to be the last super-high-transistor-count GPUs... See this:

http://www.short-media.com/extendednews.php?n=5417
http://www.pro-networks.org/forum/viewstory.php?t=85588

About coolers... Forget them... A smaller GPU like the VirgeDX does NOT require a cooler... The G80 needs a cooler because it uses 800M transistors... Future CPUs are going to use nanotubes inside the chip package as a cooling system:

http://news.zdnet.co.uk/emergingtech/0,1000000183,39147421,00.htm
http://www.hardwaresecrets.com/news/713
http://www.frostytech.com/permalink.cfm?NewsID=54604
http://www.xbitlabs.com/news/coolers/display/20040326082724.html
http://www.physorg.com/news12109.html

The new Quantum Well transistors based on indium antimonide (InSb) can help too:
http://business.pcauthority.com.au/print.aspx?CIID=59355

Btw, the ClearSpeed card I mentioned before can do 25 GFLOPS per CPU and it doesn't use a cooler (because it runs at 250MHz and 0.80).

About the design software... I think it is much better to debug a 3M-line Verilog CPU than a 90-core CPU...
Also, I'm not sure how many cores we will be able to integrate into a CPU in the near future... but we will reach the limit, and then the coprocessor idea can be good. PCB routing debug for multiple sockets will be a bit painful, though I think simpler than a 900M-transistor CPU.

About GPU vs CPU, I think Uttar explained it very well... and I can agree... GPUs will be faster for now (we need to wait and see, because 4 ClearSpeed/Cell cards using HT 3.0 could wipe them all out... and we have to wait for CUDA and CTM performance too). But that's not really the question... the question is why we had to wait 10 years to get a decent basic DOT instruction when all the developers were requesting it (and they finally admitted we were right by including it... late... but better late than never) and the GeForce3/SuperH4 had it years ago (so it was technically possible).
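For reference, the DOT instruction that finally made it in is SSE4.1's DPPS. A minimal sketch of using it through the documented _mm_dp_ps intrinsic (assuming an SSE4.1-capable compiler; it is not single-cycle, but it is the instruction in question):

Code:
#include <smmintrin.h>  // SSE4.1 intrinsics

// Dot product of two 4-float vectors with a single DPPS instruction.
// Immediate 0xF1 = multiply all four lanes, write the sum into lane 0.
float dot4(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}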

Hey btw, I found this the other day
http://www.vr-zone.com/?i=4415

It appears AMD is going to include SSE4a too... good news.
 
About the costs... One capacitor costs $0.10, the routing is cheap, and yes, the extra layers increase the cost a bit... But all that is nothing compared with the $1200 of a quad-core CPU... I think what really increases the cost is the development team, not the manufacturing and packaging... And with these coprocessors the final Verilog/VHDL is much smaller, so the important part, the development cost, will be greatly reduced, because it is easier to develop a 30M-transistor CPU than an 800M one (and you can save silicon and get more CPUs per wafer)
From the standpoint of a board manufacturer, $0.10 extra for a capacitor over a production run of millions of boards is significant. The extra cost of layers is a fixed cost over millions of units.

Board manufacturers see no revenue from the sale of a $1200 quad core or 4 $300 smaller chips, so they can't eat the cost for more sockets just because it sells more CPUs.

Niche-market boards don't just cost an extra ten cents; manufacturers charge serious cash for specialty boards, and many-socket boards do not have the economies of scale that consumer boards do.

About coolers... Forget them... A smaller GPU like the VirgeDX does NOT require a cooler... The G80 needs a cooler because it uses 800M transistors... Future CPUs are going to use nanotubes inside the chip package as a cooling system:
A little speculative, considering that we still can't mass-produce quality nanotubes, much less make enough for the scale of chip manufacturing.
 
Notice this is good, because the motherboard has only 2-5 HT 3.0 slots (not 900 ZIF sockets)... But you can plug in an auxiliary card with multiple multi-core CPUs and multiple sockets (again, like the ClearSpeed). You can put a PPU, GPU, raytracing hardware card or whatever there, and it talks directly with the main CPU in the same manner as a coprocessor.

2-5 slots is too many given the current power consumption, heat and space figures. Unless we are talking about the server space, but I thought you were speaking of desktop CPUs.

Btw, Intel is preparing the same thing under the code name "Geneseo", but it is a bit different because it uses PCI Express to plug in the coprocessor cards.

IMO, the best way would be to have external PCI-Express links using optical cables. That way you could interface literally anything (serial bus, lower costs, etc). But unfortunately nobody asks me...

About the costs... One capacitor costs $0.10, the routing is cheap, and yes, the extra layers increase the cost a bit... But all that is nothing compared with the $1200 of a quad-core CPU...

As has already been said:

3dilettante said:
Board manufacturers see no revenue from the sale of a $1200 quad core or 4 $300 smaller chips, so they can't eat the cost for more sockets just because it sells more CPUs.

I would add that board manufacturers would actually like to sell more boards, not fewer boards per CPU, which is what would happen if they made one with several sockets.

easier to develop a 30M-transistor CPU than an 800M one (and you can save silicon and get more CPUs per wafer)

First, 45M transistors is roughly 1MB of L2 cache nowadays, so that chip would be useless. Second, when an 800M one has a defect, they turn off a quarter of the cache and ship it as a cheaper/slower part; when a 30M chip has a defect, it most likely doesn't work at all, so it goes in the trash. Third, more chips translates to more time and money for testing, binning, packaging and transport.
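(Rough sanity check on that figure: 1MB of 6T SRAM is 8,388,608 bits x 6 transistors ≈ 50M transistors before tags and control logic, so the estimate is in the right ballpark.)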

However, I must admit all this could be a bit of vaporware... Torrenza/Geneseo are planned for 2008-2010 and might not be a success. We will see with time.

That sounds more sane.

About coolers... Forget them... A smaller GPU like the VirgeDX does NOT require a cooler...

That is true but it is severely limited in what it can do. And the VirgeDX doesn't even have a programmable pipeline.

The G80 needs a cooler because it uses 800M transistors...

It is 681M, not 800M. The limit for the 90nm process was 700M; that is why the I/O chip got thrown out of it.

Future CPUs are going to use nanotubes inside the chip package as a cooling system

There are quite a few IFs and WHENs in that and you know it.

I'd rather cast my vote for corona discharge cooling in two years or so.

Btw, the ClearSpeed card I mentioned before can do 25 GFLOPS per CPU and it doesn't use a cooler (because it runs at 250MHz and 0.80).

But you have to have optimized software specifically targeted at that. Not going to happen. If something is too specialized it is doomed to failure, because by the time it gets widely adopted it is usually superseded by something way faster/better.

About SSE4, I think that it finally rounds out the instruction set. There is one instruction which is sorely missing and I don't think they will consider late additions.

We need a SIMD MOVPS instruction which can dereference four pointers held in another XMM register and fetch four floats from four completely different memory addresses. Now that would be useful.
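To sketch the idea in C terms: something like the following, where the gather_ps name and signature are invented for illustration, 32-bit pointers are assumed, and the body just emulates with four scalar loads what the proposed instruction would do in one op:

Code:
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Hypothetical 4-wide gather: 'addr' holds four 32-bit pointers, each
// dereferenced as a float. Emulated here with four scalar loads.
static inline __m128 gather_ps(__m128i addr)
{
    uint32_t p[4];
    _mm_storeu_si128((__m128i*)p, addr);               // spill the four pointers
    return _mm_set_ps(*(const float*)(uintptr_t)p[3],
                      *(const float*)(uintptr_t)p[2],
                      *(const float*)(uintptr_t)p[1],
                      *(const float*)(uintptr_t)p[0]); // lane 0 = first pointer
}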
 
2-5 slots is too many given the current power consumption, heat and space figures. Unless we are talking about the server space, but I thought you were speaking of desktop CPUs.
See my new power source... I switch it on and the whole city's power goes down! Mwahaha!
http://www.hothardware.com/news.aspx#news3284

IMO, the best way would be to have external PCI-Express links using optical cables. That way you could interface literally anything (serial bus, lower costs, etc).
Do you mean like SATA cables? Or optical fiber cables like in networking?

That is true but it is severely limited in what it can do. And the VirgeDX doesn't even have a programmable pipeline.
This FPGA doesn't use a cooler (it uses 10W at 400MHz and barely reaches 40 degrees). Btw, the Virtex4 uses 50k LCs and is used as an advanced coprocessor here:
http://www.theregister.co.uk/2006/04/21/drc_fpga_module/page2.html

It is 681M, not 800M. The limit for the 90nm process was 700M; that is why the I/O chip got thrown out of it.
Ouch! I was thinking of the Vega 2 CPU
http://www.azulsystems.com/press/032706_vega2.htm
but I said G80... Well, change that to "a lot of transistors"

I'd rather cast my vote for corona discharge cooling in two years or so.
The G80 weighs nearly a kilogram with the hybrid cooling system... Last time I plugged it into the slot it almost wiped out the weak plastic connector... Then I saw my old GeForce 1 with its small heat sink and the thoughts started, hahah!

There is one instruction which is sorely missing and I don't think they will consider late additions.

We need a SIMD MOVPS instruction which can dereference four pointers held in another XMM register and fetch four floats from four completely different memory addresses. Now that would be useful.
Yep, but it might be hard to implement due to the synchronization and burst block mode.
 
See my new power source... I switch it on and the whole city's power goes down! Mwahaha!
http://www.hothardware.com/news.aspx#news3284

And your bank account goes the same way too. :devilish:

Do you mean like SATA cables? Or optical fiber cables like in networking?

I mean an optical cable but using the PCI-Express interface, since it is serial in nature and does not require too many wires. You could even use one fiber and several different laser wavelengths. Since the PCI-Express interface and software are pretty simple to implement, it would be very popular. You could have external boxes, one video card per box with its own power supply, and just an optical link to the computer for data. No more taxing the main PSU with power-hungry components. Furthermore, I would make that optical link daisy-chainable like SCSI, so each device would have its own in/out and passthrough for other devices. Since the mainboard wouldn't need any slots, you could use the extra space on the board for more RAM and of course more PCI-Express lanes in the chipset. Now if only someone would actually consider such an idea...

This FPGA doesn't use a cooler (it uses 10W at 400MHz and barely reaches 40 degrees). Btw, the Virtex4 uses 50k LCs and is used as an advanced coprocessor here:
http://www.theregister.co.uk/2006/04/21/drc_fpga_module/page2.html

Still it is too specialized and development is expensive.

The G80 weighs nearly a kilogram with the hybrid cooling system... Last time I plugged it into the slot it almost wiped out the weak plastic connector... Then I saw my old GeForce 1 with its small heat sink and the thoughts started, hahah!

Well, I own one and I am happy with it. It is true that it doesn't pair well with cheap mainboards, cases and PSUs, but then again anyone who wants such a card will get everything else to support it properly first, like I did.

Yep, but it might be hard to implement due to the synchronization and burst block mode.

I disagree. I believe a specialized gather instruction would still be a major advantage because the CPU could see those four reads as belonging together and prioritize them properly.

The way it is now, if you issue four separate reads and combine them and one of them gets rescheduled because there is another independent read before it, the whole operation will be stalled. That could be avoided if those four reads were treated as atomic.

Gather would most often be used in various interpolation algorithms, so most of the time the values would be adjacent or even overlap. That means many internal optimizations would be possible to accomplish efficient reading and combining of the values into a single SSE register.

Unfortunately there is one problem with that -- it would become obsolete after the shift to 64 bits is completed, since it would require four 64-bit pointers in the source SSE register for dereferencing. You could always use two registers for pointers but that would be clumsy to work with.

I wonder how many cycles the CRC32 instruction would take to execute?

I also wonder, why not MD5 instead? CRC32 is still popular but a bit aged. Actually that just goes to demonstrate what I was talking about -- by the time you implement special features they get superseded.
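For reference, the announced CRC32 instruction computes CRC-32C (the Castagnoli polynomial, not the zip/png CRC-32), one dword at a time. A rough usage sketch, assuming an SSE4.2-capable compiler and Intel's _mm_crc32_* intrinsic naming:

Code:
#include <nmmintrin.h>  // SSE4.2 intrinsics
#include <stddef.h>
#include <stdint.h>

// CRC-32C over a buffer: one CRC32 instruction per dword, bytes for the tail.
uint32_t crc32c(const uint8_t* data, size_t len)
{
    uint32_t crc = ~0u;
    for (; len >= 4; data += 4, len -= 4)
        crc = _mm_crc32_u32(crc, *(const uint32_t*)data); // unaligned loads are fine on x86
    for (; len; ++data, --len)
        crc = _mm_crc32_u8(crc, *data);
    return ~crc;
}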
 
It is 681M, not 800M. The limit for the 90nm process was 700M; that is why the I/O chip got thrown out of it.

Just because Shtal claimed this factoid doesn't automatically mean it's true. (More the contrary really, but we digress. ;) )

There are probably a bunch of technical and economic reasons to have an external I/O chip, but there being such a limit is not one of them. If you can make a chip of 680M transistors there's no reason why you couldn't make one of, say, 750M. (At a yield cost, of course.)
 
Unfortunately there is one problem with that -- it would become obsolete after the shift to 64 bits is completed, since it would require four 64-bit pointers in the source SSE register for dereferencing. You could always use two registers for pointers but that would be clumsy to work with.
Using two registers would not be bad from a programming point of view. Use rsi as the base pointer and treat the four 32-bit integers in the SSE register as offsets. In the worst case you have to swap pointers in and out of rsi but that executes in parallel anyway.

The hardware implementation would be a bitch though. Just for this instruction they'd have to quadruple the number of memory load units. An alternative would be to issue the loads consecutively but then you can't issue another load for several clock cycles and it wouldn't be significantly faster than inserting every value yourself with SSE4.

Maybe two load units would be a workable compromise though, and/or sharing load units between cores...
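As a C-level sketch of that two-register form (the gather_base_ps helper is invented here and just emulates with scalar loads what one gather instruction would do): keeping a base pointer plus four 32-bit byte offsets in the XMM register also sidesteps the 64-bit pointer problem raised earlier.

Code:
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Base pointer plus four 32-bit byte offsets in an XMM register.
// One gather instruction would replace the spill and the four scalar loads.
static inline __m128 gather_base_ps(const float* base, __m128i byte_offsets)
{
    uint32_t off[4];
    _mm_storeu_si128((__m128i*)off, byte_offsets);  // spill the offsets
    const char* p = (const char*)base;
    return _mm_set_ps(*(const float*)(p + off[3]),
                      *(const float*)(p + off[2]),
                      *(const float*)(p + off[1]),
                      *(const float*)(p + off[0]));
}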
 
Using two registers would not be bad from a programming point of view. Use rsi as the base pointer and treat the four 32-bit integers in the SSE register as offsets. In the worst case you have to swap pointers in and out of rsi but that executes in parallel anyway.

The hardware implementation would be a bitch though. Just for this instruction they'd have to quadruple the number of memory load units. An alternative would be to issue the loads consecutively but then you can't issue another load for several clock cycles and it wouldn't be significantly faster than inserting every value yourself with SSE4.

Maybe two load units would be a workable compromise though, and/or sharing load units between cores...

It would save decode slots and scheduler entries from being filled up by all the separate loads and all the instructions needed for component gathering.


But:

It'd also be a pain because 4 independent memory reads means 4 independent walks through the memory subsystem.

That could mean in the worst case four times the entries needed in the load/store queue (or a load/store pipeline that is 1/4 as effective in high-use situations), four times the number of TLB accesses, four real cache ports, four scratch pad entries for every outstanding gather operation, two rename dependencies for an OoO core, a couple cycles to sync up the various loads, four times the checking needed for memory aliasing, four times the likelihood the op will stall.

Serializing the loads would reduce the need for cache ports, but would leave the rest of the hardware as complex as before.

All that would make it prohibitive to implement on a core that also needs to target general performance. A more specialized core would likely use it, or, if they really have that much silicon to spare, a specialized memory unit could be added alongside the standard units and called in when needed, probably at a latency cost.
 
four times the number of TLB accesses, four real cache ports,

Yeah, four potential TLB miss traps, 4 cache misses. With the risk that the separate loads evict each other's lines in the cache.

Some fairly nasty pathological performance degradation possible.

For all its warts x86 is actually pretty clean when it comes to memory operands (ie. only one per instruction). Which is nice because you'll only get one cache/TLB miss per instruction (discounting unaligned access).

Cheers
 
You all seem to forget that those four loads would in most cases be from adjacent or even overlapping memory locations because it would be used for interpolation.

It means that after the first load, the rest would most likely come from the L1 cache for free, or in the worst case from an adjacent cache line, which would have been prefetched by then thanks to the hardware prefetcher.

Moreover, decoding one instruction is faster than decoding eight+ of them -- four loads plus four pointer operations (which may induce more loads and arithmetic operations themselves) plus all the shuffling to put the data in the right place in the SSE register. I believe that in most cases it would be drastically faster in terms of clock cycles than the equivalent code.
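To make the access pattern concrete: for bilinear interpolation the four offsets fed into such a gather are just two adjacent pairs, one image row apart. A small sketch (row-major float image assumed; the helper name is invented for illustration):

Code:
#include <emmintrin.h>  // SSE2 intrinsics

// Byte offsets of the 2x2 texel block at (x, y) in a row-major float image.
// (x,y) and (x+1,y) are 4 bytes apart; the other pair sits one row below.
static inline __m128i bilinear_offsets(int stride, int x, int y)
{
    int o = (y * stride + x) * (int)sizeof(float); // offset of (x, y)
    int s = stride * (int)sizeof(float);           // one row, in bytes
    return _mm_set_epi32(o + s + 4, o + s, o + 4, o);
}

A gather over those four offsets touches at most two cache lines, which is exactly the adjacent/overlapping case being argued here.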
 
You all seem to forget that those four loads would in most cases be from adjacent or even overlapping memory locations because it would be used for interpolation.
Silicon can't assume that. The latency associated with checking for a worst-case memory crap-out must be applied 100% of the time.

Moreover, decoding one instruction is faster than decoding eight+ of them -- four loads plus four pointer operations (which may induce more loads and arithmetic operations themselves) plus all the shuffling to put the data in the right place in the SSE register. I believe that in most cases it would be drastically faster in terms of clock cycles than the equivalent code.

The question is: how much slower does the more complicated load hardware make the clock cycles? Memory queues are on the critical timing path for the entire core, and a change in memory behavior this drastic would require redesigning a lot of the memory hierarchy.

For a general x86 core, it may not be much of a win. A more specific core might change things, if the target workload can make use of the gather operations.
 
The question is: how much slower does the more complicated load hardware make the clock cycles?

I wouldn't change the load hardware at all. It can stay the same and in optimal cases (adjacent or overlapping loads) it will work better than separate instructions and in worst cases it won't work any worse than the separate instructions do.
 
I wouldn't change the load hardware at all. It can stay the same and in optimal cases (adjacent or overlapping loads) it will work better than separate instructions and in worst cases it won't work any worse than the separate instructions do.

If current load hardware can't load non-contiguous elements into a single SSE register, using another SSE register as the address register, then how can implementing the instruction not change load hardware?

Is this just another microcoded instruction that translates into a dozen instructions in the back end?
 
If current load hardware can't load non-contiguous elements into a single SSE register, using another SSE register as the address register, then how can implementing the instruction not change load hardware?

Is this just another microcoded instruction that translates into a dozen instructions in the back end?

Because the loads could still be issued serially just like they are issued now. In any case that should still translate into fewer uops for said instruction.

Compare this:

Code:
	mov		esi, dword ptr [pix]     ; base pointer
	mov		eax, dword ptr [ip]      ; first byte offset
	movd		xmm0, [esi + eax]        ; load float A
	mov		edx, dword ptr [ip + 4]
	movd		xmm1, [esi + edx]        ; load float B
	unpcklps	xmm0, xmm1               ; xmm0 = A B - -
	mov		eax, dword ptr [ip + 8]
	movd		xmm2, [esi + eax]        ; load float C
	mov		edx, dword ptr [ip + 12]
	movd		xmm3, [esi + edx]        ; load float D
	unpcklps	xmm2, xmm3               ; xmm2 = C D - -
	movlhps		xmm0, xmm2               ; xmm0 = A B C D (movhps has no reg-reg form)

And this:

Code:
	mov		esi, dword ptr [pix]     ; base pointer
	gmovps		xmm0, xmmword ptr [ip]   ; hypothetical gather: four floats at esi + the four offsets stored at [ip]

I would really love to have something like that.
 
I don't know anything about the internals of a typical x86 microcode engine, so I can't speak as to whether it can reduce the number of uops just by being microcoded.

Since the microcode path blocks the standard decoders from issuing, I only see the advantage in a reduced code footprint.

The first advantage may be useful, though I don't know if it's useful enough to justify an entire instruction that will take about as long to execute either way.

As a single instruction, are the various loads assumed to be atomic?

The gather operation does fiddle a fair amount with x86 instruction semantics.
 
I wouldn't change the load hardware at all. It can stay the same and in optimal cases (adjacent or overlapping loads) it will work better than separate instructions and in worst cases it won't work any worse than the separate instructions do.
How adjacent or overlapping would your loads really be? What application are you targeting with this?

One other compromise I see is that the hardware would be able to load four 32-bit values from L1 cache in parallel, but L2 cache and RAM access is still serial. This way the hardware cost might not be prohibitive and the latency of the most frequent accesses goes down a lot. I'm just not sure if L1 caches at 3+ GHz can have four read ports. But Athlons have dual-ported L1 caches if I recall correctly so it would only add one clock cycle of latency for the whole instruction. At other times it can do two loads from separate instructions.

Yeah this could work, if they see enough use for it...
 
How adjacent or overlapping would your loads really be? What application are you targeting with this?

I have repeated that numerous times -- the main use would be for interpolation. I mean adjacent as in "one right next to the other", or sometimes even overlapping (reading the same value into two lanes), and in the worst case an access to an adjacent cache line would be needed, which would be prefetched by the hardware prefetcher anyway as soon as the previous one is accessed. In any case such a gather would rarely cross a cache line, much less a page boundary. So forget cache and TLB misses.

In any case, I am not asking for an instruction which will be used to gather sparse values which are 2GB apart. Sure you could use it for that too, but I believe that the performance would not be any worse than with the code I wrote above.

One other compromise I see is that the hardware would be able to load four 32-bit values from L1 cache in parallel

Not only that, but in cases where two of the pointers are equal you could load the value once and copy the register contents, which would reduce bandwidth use.

Yeah this could work, if they see enough use for it...

At least my 3D reconstruction code would surely make great use of it.
 
I have repeated that numerous times -- the main use would be for interpolation. I mean adjacent as in "one right next to the other", or sometimes even overlapping (reading the same value into two lanes), and in the worst case an access to an adjacent cache line would be needed, which would be prefetched by the hardware prefetcher anyway as soon as the previous one is accessed. In any case such a gather would rarely cross a cache line, much less a page boundary. So forget cache and TLB misses.

In any case, I am not asking for an instruction which will be used to gather sparse values which are 2GB apart. Sure you could use it for that too, but I believe that the performance would not be any worse than with the code I wrote above.

But the worst case is still 4 TLB misses and 4 cache misses. Each TLB miss can itself invoke multiple memory transactions.

You just increased worst-case latency fourfold for anything that requires the execution pipeline to drain (ie. any serializing instruction either directly invoked by the program or by an interrupt/trap).

Also the scheduling apparatus would have to wait for all four loads to complete before doing any work on them, since CPU state is maintained at instruction boundaries (ie. you can't have a partially updated register visible to the rest of the CPU). So you also get worse instruction latency out of it (and a lot worse if just one of the four loads misses cache).

All for a very modest gain in instruction stream density on a not-at-all-common case. You would still be completely limited by the execution of the actual loads.

Cheers
 