G80: Physics inflection point

The most expensive operation in SSL isn't the bulk cipher (RC4, 3DES, etc.), but the connection setup in which the session key is negotiated. This is what requires a lot of high-precision modular arithmetic, à la Diffie-Hellman/RSA/ElGamal/etc. ECC uses Galois fields so it's slightly different (and more expensive). Accelerating that will take a huge weight off of connection setup costs, and layer-3 routers will be able to set up the SSL/TLS connection and parse the HTTP headers before handoff.
Just as an aside, MS's upcoming CNG (Cryptography Next Generation) provides lots of easy hooks for hardware acceleration of basically any crypto op. Coincidence?
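For a rough sense of where those handshake cycles go, here's a toy square-and-multiply modular exponentiation in C (my own sketch, not anything from CNG or an SSL stack). Real key exchanges do this with 1024–4096-bit integers, so every step below becomes a multi-word multiply plus a modular reduction, which is exactly the work the accelerators offload.

```c
#include <stdint.h>

/* Toy square-and-multiply modular exponentiation, the core operation of
 * Diffie-Hellman and RSA handshakes. This uses a 64-bit modulus purely for
 * illustration; real TLS key exchange works on 1024+ bit bignums, so each
 * step here is itself a multi-word multiply and reduction. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod)
{
    unsigned __int128 result = 1;
    unsigned __int128 b = base % mod;

    while (exp > 0) {
        if (exp & 1)
            result = (result * b) % mod;   /* multiply step */
        b = (b * b) % mod;                 /* square step   */
        exp >>= 1;
    }
    return (uint64_t)result;
}

/* Each side of a Diffie-Hellman exchange does a couple of these per
 * handshake, which dwarfs the per-record bulk-cipher work. */
```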
 
The most expensive operation in SSL isn't the bulk cipher (RC4, 3DES, etc.), but the connection setup in which the session key is negotiated. This is what requires a lot of high-precision modular arithmetic, à la Diffie-Hellman/RSA/ElGamal/etc. ECC uses Galois fields so it's slightly different (and more expensive).

Surely Galois field calculations are cheaper than modular arithmetic over the integers?
 
I didn't notice this thread last time, but now that it's bumped up I think you all (esp. DemoCoder) should reconsider your stance a bit.

Regarding scalar vs. vector, if you're comparing an SPE to a graphics chip for general computation, then G80 is a lot wider than an SPE. The only way G80 will run at peak throughput for scalar computations is if you're running the same program on about 5000 fragments. Cell could theoretically get away with only 32 parallel scalar computations and reach its peak. Thinking that each SPE can only do one scalar operation per clock is very disingenuous. You'd either have an idiot programmer or a workload that couldn't parallelize to a GPU anyway. It can do a MADD on 4 different pieces of data, just like a G80 cluster does a MADD on 16 pieces of data. The "scalar MADD power" of G80 is only 69% better than Cell's, not 8 times bigger.
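For what it's worth, here is the back-of-the-envelope arithmetic behind that 69% figure, assuming the commonly quoted clocks (1.35 GHz for G80's shader core, 3.2 GHz for Cell), counting a MADD as two flops, and ignoring the PPE and the special-function units; treat it as a rough sketch rather than a benchmark.

```c
#include <stdio.h>

int main(void)
{
    /* Peak MADD throughput, back of the envelope. Assumed clocks:
     * 1.35 GHz shader core for G80, 3.2 GHz for Cell; MADD = 2 flops. */
    double g80  = 128 * 1.35e9 * 2;     /* 128 scalar SPs                  */
    double cell =   8 * 4 * 3.2e9 * 2;  /* 8 SPEs x 4-wide SIMD per clock  */

    printf("G80  peak: %.1f GFLOPS\n", g80  / 1e9);   /* ~345.6 */
    printf("Cell peak: %.1f GFLOPS\n", cell / 1e9);   /* ~204.8 */
    printf("Ratio: %.2fx (~69%% higher, not 8x)\n", g80 / cell);
    return 0;
}
```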

For branching, I don't know Cell's instruction set, but if you have enough parallelism in your workload to run fast on a GPU then avoiding branching penalties should be easy on Cell. Moreover, your granularity is 4 scalar streams instead of 32.


(Incidentally, I personally think CPUs should stay away from GPU type parallelism. Workloads will generally either be massively parallel -- in which case GPUs have the advantage regardless -- or they will be hard to parallelize, and there won't be much between the extremes. CPUs should stick to making the latter as fast as possible.)
 
Regarding scalar vs. vector, if you're comparing an SPE to a graphics chip for general computation, then G80 is a lot wider than an SPE. The only way G80 will run at peak throughput for scalar computations is if you're running the same program on about 5000 fragments.

Not necessarily. If I'm operating in CUDA mode and dealing only with scratchpad RAM.

Cell could theoretically get away with only 32 parallel scalar computations and reach its peak. Thinking that each SPE can only do one scalar operation per clock is very disingenuous. You'd either have an idiot programmer or a workload that couldn't parallelize to a GPU anyway.

Well, if you want to compare theoretical peaks, then fine, but my point was, it's much easier to extract peak performance from a scalar cluster than a SIMD cluster. It will be interesting to see measured throughput figures for typical long-running SPE apps. I'm guessing SPE SIMD transistors will sit idle far more often than on the G80, even given herculean hand-tuned asm efforts from macho alpha-male coders.

(Incidentally, I personally think CPUs should stay away from GPU type parallelism. Workloads will generally either be massively parallel -- in which case GPUs have the advantage regardless -- or they will be hard to parallelize, and there won't be much between the extremes. CPUs should stick to making the latter as fast as possible.)

I disagree. There is a rich literature in computer science on parallel programming problem classes, and there is more of a continuum than an "either-or" (you're either embarrassingly parallel, or not). That's why the CRCW/EREW/CREW/ERCW PRAM models exist: the asymptotic performance of algorithms differs depending on the machine model.

There are algorithms that run well on streaming architectures, algorithms that run well on "shared nothing", algorithms that run better on "shared something", and still others that run best on a traditional serial CPU with a large cache and shared-everything. CELL is a case in point: a CPU halfway between a streaming GPU architecture and a traditional CPU because of the local scratchpad memory. CUDA on the G80 is a similar example. There are algorithms that run much better if you have concurrency primitives and the ability to resolve read/write order, so you can read back what was just written instead of waiting for all streams to finish and looping back.

A prime example is a class of 2D dynamic programming problems in which there is local thread interdependency (you can't compute result X_n until nearby threads have computed X_n-1, X_n-2, X_n-3). These problems are not completely parallelizable on a stream processor, but they can be parallelized more efficiently if local threads have a concurrency primitive and a small amount of memory for inter-thread communication.
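A minimal sketch of the wavefront pattern such problems allow (using edit distance as a stand-in example of my own choosing): cells on an anti-diagonal are mutually independent, but each diagonal needs results from the previous two, so threads have to exchange boundary values or synchronize between diagonals.

```c
#define N 1024
static int D[N + 1][N + 1];

/* Wavefront parallelization of a 2D dynamic program (edit distance as a
 * stand-in). Cells on the same anti-diagonal i+j = d are independent of one
 * another, but each diagonal depends on the previous two, so threads must
 * synchronize (or pass boundary values) between diagonals.
 * Assumes n, m <= N. The OpenMP pragma is optional. */
void edit_distance_wavefront(const char *a, int n, const char *b, int m)
{
    for (int i = 0; i <= n; i++) D[i][0] = i;
    for (int j = 0; j <= m; j++) D[0][j] = j;

    for (int d = 2; d <= n + m; d++) {            /* one anti-diagonal at a time */
        int lo = d - m > 1 ? d - m : 1;
        int hi = d - 1 < n ? d - 1 : n;
        #pragma omp parallel for                  /* cells within a diagonal run in parallel */
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            int sub = D[i - 1][j - 1] + (a[i - 1] != b[j - 1]);
            int del = D[i - 1][j] + 1;
            int ins = D[i][j - 1] + 1;
            int best = sub < del ? sub : del;
            D[i][j] = best < ins ? best : ins;
        }
    }
    /* Result: D[n][m] */
}
```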

In fact, I would go so far as to state the extreme opposite of your position. For many workloads, there is a custom hardware model which will have much better efficiency than the two extremes of GPU and CPU. If I were charged with, say, a workload absolutely dependent on very large integer multiplication, I might attempt to build a Pointer Machine or Storage Modification Machine, which Knuth, and later Schönhage, showed can multiply integers in *LINEAR* time. This beats the FFT-based algorithms which would run well on a GPU, and the Karatsuba techniques which run nicely on CPUs. And if I wanted to factorize large numbers using the Number Field Sieve, there is, again, a slight twist on the optimal architecture. Systolic arrays and super-systolic arrays do very well for some problems.
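Purely to illustrate the Karatsuba technique named here (not the pointer-machine construction), a single level of it on 64-bit operands in C: three multiplies instead of the schoolbook four. Bignum libraries apply this recursively and switch to FFT-based methods only at much larger operand sizes.

```c
#include <stdint.h>

/* One level of Karatsuba: a 64x64 -> 128-bit multiply built from three
 * half-width products instead of the schoolbook four. Real bignum code
 * recurses this and moves to FFT-based multiplication for huge operands. */
static unsigned __int128 karatsuba64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    unsigned __int128 z0 = (unsigned __int128)a_lo * b_lo;   /* low  halves */
    unsigned __int128 z2 = (unsigned __int128)a_hi * b_hi;   /* high halves */
    /* Middle term recovered from a single product of the sums. */
    unsigned __int128 z1 =
        (unsigned __int128)(a_lo + a_hi) * (b_lo + b_hi) - z0 - z2;

    return (z2 << 64) + (z1 << 32) + z0;
}
```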

Unfortunately, we can't afford to build a custom machine for every workload that comes along (but government entities that invest in HPC can and do, especially the NSA).

But what I would definitely say is that workloads aren't either "embarrassingly parallel" or "embarrassingly serial". There are many problems which do not need much shared state, but do need *a little*, or need to coordinate transactional reads/writes to a small amount of shared data according to some notion of locality. In those cases, shared-nothing and shared-everything are both suboptimal architectures.
 
I'm curious...how much public information about CUDA do we have?
It will still be under NDA for a while, sadly... :(
NVIDIA mentioned they've got 1000+ people in the program already, though, and some of the applications people already have running right now seem awesome. They talked about it a bit (read: for more than 10 minutes!) at a recent analyst conference, taking it as a key example of the future potential growth and diversification of GPUs beyond games, as well as diversification of their overall addressable markets. I think someone is already using it to run fairly large neural networks and things like that - cool stuff.


Uttar
 
Well, if you want to compare theoretical peaks, then fine, but my point was, it's much easier to extract peak performance from a scalar cluster than a SIMD cluster.
You're not getting it.

G80 does have SIMD clusters, and they're wider than Cell's. I was assuming identical instruction streams across the two processors before, but even if we go to the finest granularity that CUDA will expose, G80 must have 16 pieces of data undergoing the same scalar instruction to hit peak throughput. Cell must have only 4 pieces of data undergoing the same scalar instruction. Both use SOA to get high throughput on scalar ops. Both need many such instruction streams on many data sets to hide calculation and loading latencies.

There's no way you can say G80 will have 8x the scalar MADD ability of Cell. This will only happen if G80 is running each instruction stream in lock step on 16 scalar data streams and Cell is running a program on a single stream. That's an unfair comparison. If you can't arrange the data in SOA format and each SPE is only doing 1/4 of peak flops, then the same workload on G80 will run at 1/16 of peak flops, and thus run at less than half of Cell's speed, not 8x faster.
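As a concrete (if simplified) picture of what "arrange the data in SOA format" means, here's a plain-C sketch of my own: four unrelated scalar streams laid out side by side so each 4-wide MADD retires four scalar results. On an SPE each group of four would map to one vector multiply-add; on G80 the same kernel is written per element and the hardware batches elements itself.

```c
/* Structure-of-arrays sketch: four independent scalar streams sit side by
 * side, so a 4-wide SIMD unit performs four scalar MADDs per instruction. */
void madd_soa(const float *a, const float *b, const float *c,
              float *out, int n)          /* n assumed to be a multiple of 4 */
{
    for (int i = 0; i < n; i += 4) {
        /* These four lanes carry four unrelated scalar streams; a
         * vectorizing compiler (or hand-written intrinsics) turns each
         * group into a single 4-wide multiply-add. */
        out[i + 0] = a[i + 0] * b[i + 0] + c[i + 0];
        out[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2] + c[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3] + c[i + 3];
    }
}
```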

G80 has 70% higher scalar MADD ability than Cell, and that's with over twice the die size. G80's strengths vs. Cell lie in perfect latency hiding with minimal effort, not in raw scalar math ability.
I disagree. There is a rich literature in computer science on parallel programming problem classes, and there is more of a continuum than an "either-or" (you're either embarrassingly parallel, or not). That's why the CRCW/EREW/CREW/ERCW PRAM models exist: the asymptotic performance of algorithms differs depending on the machine model.
Perhaps, but I'm thinking more from a practical point of view. I don't think we'll get software companies writing much code that scales well to 10+ cores unless it's massively parallel. By 'scales well' I mean, say, at least 4x speedup with 10 cores as opposed to 1.

IMO, for workloads that cannot be made massively parallel, I think for the vast majority of consumer applications (even for the next decade) 2 cores is quite usable; it will be tough to get a big boost from there to 4 cores, and 8+ cores even tougher.
 
You're not getting it.

G80 does have SIMD clusters, and they're wider than Cell's. I was assuming identical instruction streams across the two processors before, but even if we go to the finest granularity that CUDA will expose, G80 must have 16 pieces of data undergoing the same scalar instruction to hit peak throughput.

Assuming you are correct in the CUDA context, it is still easier to arrange for 16 streams of data to be operated on by the same scalar instruction than to take, say, a function with 50 instructions and make sure that every unit of every SIMD MADD unit is being exercised each cycle. It is certainly a lot more *developer work* to fully vectorize such code, and a hell of a job for a compiler, compared to taking shaders and extracting efficient performance. Both programming models on GPUs, the graphics-API-oriented path and the GPGPU path, offer a restricted functional/stream-oriented paradigm that makes the job a lot easier.

I'm not quite sure how exactly you calculate the "16" figure. You have 8 TCPs; within a TCP, you have 16 SPs grouped into 2 blocks of 8 SPs (notice the NVidia diagram), each block being able to run a different thread context. That seems to suggest that the minimum coherence required is 8 SPs, not 16 (note, I'm not talking about branch coherence here). People may have measured 16-SP-wide issuing, but I was told it's really capable of 8.


Cell must have only 4 pieces of data undergoing the same scalar instruction. Both use SOA to get high throughput on scalar ops. Both need many such instruction streams on many data sets to hide calculation and loading latencies.

The nice bit about running a 4-way SIMD op on a single SP as 4 scalar ops is that you get vectorization for free, and I think it's easier and more likely that you can arrange data coherency of inputs at such small granularity than maximally pack a bunch of SIMD MADDs and keep those units working all the time. Maybe nAo's experience is different, but it seems a lot harder to me.

Sure, you can attempt the same thing on SPEs by scalarizing a bunch of parallel streams, running one on each component, but are SPEs as good at consuming streaming data as the G80? One dependent fetch would seem to throw a wrench into it, since you might have to schedule a DMA, and then what does the SPE do in the meantime? To hide the latency would seem to require careful management of the SPE scratchpad RAM so that an SPE could work on the next datum while waiting for pending DMAs.
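What that careful management usually amounts to is classic double buffering of the local store. A minimal sketch follows; dma_get()/dma_wait() are hypothetical stand-ins for the Cell SDK's MFC calls (mfc_get and the tag-status waits), since the point here is overlapping the next transfer with work on the current chunk, not the exact API.

```c
/* Double-buffering sketch for an SPE-style local store (illustrative only).
 * dma_get()/dma_wait() are placeholder names for the SDK's MFC transfer and
 * tag-wait primitives. */
#define CHUNK 4096                         /* floats per chunk (16 KB)       */

extern void dma_get(void *local, unsigned long long remote, int size, int tag);
extern void dma_wait(int tag);
extern void process(float *data, int n);   /* the actual per-chunk work      */

void stream_chunks(unsigned long long remote_base, int nchunks)
{
    static float buf[2][CHUNK];            /* two 16 KB buffers in local store */
    int cur = 0;

    /* Prime the pipeline: start fetching chunk 0. */
    dma_get(buf[cur], remote_base, sizeof buf[cur], cur);

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* Kick off the DMA for chunk i+1 before touching chunk i... */
        if (i + 1 < nchunks)
            dma_get(buf[next],
                    remote_base + (unsigned long long)(i + 1) * sizeof buf[next],
                    sizeof buf[next], next);

        /* ...then block only on the chunk we're about to use, so the next
         * transfer proceeds while we compute. */
        dma_wait(cur);
        process(buf[cur], CHUNK);

        cur = next;
    }
}
```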

Can't you at least see that the work for the programmer on CELL to extract peak performance looks like a lot more manual labor, micromanagement of resources, and tweaking than on the GPU? It isn't hard for me to create *useful* shaders on either NVidia or ATI hardware which generate near 100% utilization of the ALUs.

Perhaps, but I'm thinking more from a practical point of view. I don't think we'll get software companies writing much code that scales well to 10+ cores unless it's massively parallel. By 'scales well' I mean, say, at least 4x speedup with 10 cores as opposed to 1.

Depends on what kind of company you're talking about. If you're talking about routers, application servers, relational databases, and other service-oriented architectures, you'll see exactly that. These workloads are neither "shared nothing" nor "embarrassingly parallel"; they are in between, consisting of uber amounts of I/O, plus stateless shared-nothing work combined with concurrent shared data structures. These apps excel on a sort of lightweight SMP: much more highly threaded, but still with shared state and IPC. That's because they block on I/O (not memory reads) more often than anything else, and a bunch of threads to hide the I/O blocking is pretty cheap. They are I/O bound, not CPU bound, so all of the elaborate x86 OoOE architecture is a big expense.


Azul Systems ships systems with as many as 372 cores; their Vega 2 CPU will have 48 cores on 90nm at 800+ million transistors. They are going to sell up to 768-way SMP boxes. As I have shown before, depending on the application, you can get a speedup of up to 60x on their systems that you won't get by scaling horizontally across a bunch of cheap Linux boxes. Even though each one of their cores is pathetic in comparison to a Core 2 Duo or even an SPE, their systems will eat a Core 2 or SPE for lunch on a price/perf and perf/watt basis. They do not run "embarrassingly parallel" code. In fact, they are designed to handle contended synchronization locks between threads fairly well and support truly, 100%, concurrent garbage collection in HW. There is a lot of shared data-structure traversal going on in the Azul apps.

IMO, for workloads that cannot be made massively parallel, I think for the vast majority of consumer applications (even for the next decade) 2 cores is quite usable; it will be tough to get a big boost from there to 4 cores, and 8+ cores even tougher.

I once thought 1 GB of RAM would be insane, as would prodigious amounts of HD space. And that was mostly true, until desktop publishing, and then digi-cams and desktop video, arrived. Maybe if Microsoft Vista and Microsoft's bland and bloated vision of the future take hold, desktops won't need multicore, except perhaps for games. I'm not willing to make predictions 10 years out. I've been in the industry for two decades now, and I've been using the internet since the mid-80s, and so much has changed, so rapidly.
 
While I agree with you overall, Mintmaster (if we exclude development effort wrt CELL vs. CUDA, which is a big IF from DemoCoder's POV, I guess :)), I wanted to comment on these two things specifically:
G80 has 70% higher scalar MADD ability than Cell, and that's with over twice the die size. G80's strengths vs. Cell lie in perfect latency hiding with minimal effort, not in raw scalar math ability.
I'd say it has some other pretty damn big advantages - but the same is true of CELL, which has some obvious advantages of its own (such as the ridiculously large LS). I don't think it's fair to consider that the only big advantage of G80, though.

Things like very cheap and good approximations of a large number of special functions (better than SSE for invsqrt, IIRC!) and excellent scalar integer/bit-shift/etc. performance come to mind. Integer MULs are still generally slow, but they're full-speed up to 24x24; on CELL, things are limited to 16x16 if you want full speed. The way the parallel data cache seems to work also feels awfully practical for sharing data between threads. In fact, I'd argue that it's a fair bit better than what you can do on CELL, but that's hard to say for sure without going into implementation details.

Finally, the percentage of the die dedicated to ALUs is only going to go up in future G8x revisions. Those 64 TMUs and 24 ROPs still take up an impressive part of the chip, and other fixed-function functionality such as rasterization (for a 192 pixels/clock theoretical peak!) and compression doesn't come for free. The SPE ratio in future CELL revisions will also increase, though. But if you look at the projected next-gen CELL with 2 PPEs and 32 SPEs in 2010, and compare that to what NVIDIA will have in the 2010 timeframe, I think it's easy to see that GPUs will take a very real lead in terms of massively parallel workloads; of course, that's largely irrelevant if we just want to discuss today's architectures, and it's obvious CELL can still hold its own for now!

IMO, for workloads that cannot be made massively parallel, I think for the vast majority of consumer applications (even for the next decade) 2 cores is quite usable; it will be tough to get a big boost from there to 4 cores, and 8+ cores even tougher.
I think that's largely true for the consumer market. There are plenty of workloads where you've got massive parallelism and it isn't massively SIMD, though. Think of things like MMORPG servers or Google's clusters as two very basic examples; anything that does a LOT of very small, different tasks rather than single big ones. DemoCoder gave some very nice other examples above that I wasn't even fully aware of, such as the Vega 2 CPU.

You can parallelize tons of stuff beyond 8 cores, even for games, but it's very questionable whether developers will bother doing so, IMO. One possibility is things like AI; but even parts of that are massively parallel if you look at them from the right angle, such as A*. So, even if you've got a massively parallel architecture like a GPU at your disposal, many-core CPUs are still great at some very important things IMO; I'm just not convinced enough of them are consumer-related, though.

CPU technology basically "maturing" in the consumer space would certainly be Intel's and AMD's worst nightmare. I don't think that's coming tomorrow, but it might come eventually, unless new parallelization paradigms and methods develop - which feels moderately unlikely at this point, but we'll see. Can both companies still fund their R&D efforts, for both CPUs and fabs, if their market shrinks by 50% overnight?


Uttar
 
Nah, what you'll get is that as the number of cores increases, computing-intensive applications will shift their focus to types of processing that are more amenable to parallelization.
 
I think that, just as a large chunk of apps have moved from C/C++ to Java and .NET and automatic memory management is no longer a dirty word, you'll see at some point in the future that functional paradigms make it into the mainstream. Almost every academic programming revolution got absorbed somehow into mainstream languages, and I think eventually the time will come. Strongly typed functional paradigms are ideal for extracting maximum parallel performance as well as for building error-free concurrency (e.g. transactional memory).

That doesn't mean C/C++ will die, but it does mean that many tasks might migrate to other languages, just as games today use scripting/JIT engines for game logic (outside the gfx/rendering loops). Today, developers already write code in a pseudo-functional language without realizing it, as HLSL/GLSL are essentially languages without environmental side effects during their execution (they can't read back a mutated environment other than scratch registers).

It's no coincidence that Sweeney realized this; it will just take time for everyone else to. It was a long fight just to get ASM people to stop bashing structured programming, and then to get structured-programming people to stop bashing OO. :)
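A tiny sketch of the "no environmental side effects" point (my example, not Sweeney's): the first function below depends only on its arguments and writes nothing shared, so the loop over it can be split across any number of threads or SIMD lanes in any order; the second mutates shared state, and evaluation order suddenly matters.

```c
/* The pure, shader-like kernel reads only its arguments and writes only its
 * own output slot, so the loop below parallelizes trivially. The impure
 * variant mutates shared state, so ordering (and locking) becomes an issue. */
float scale(float x, float gain)                  /* pure */
{
    return x * gain;
}

float total;                                      /* shared mutable state */

float scale_and_accumulate(float x, float gain)   /* impure: calls race on total */
{
    total += x;
    return x * gain;
}

void apply_scale(const float *in, float *out, int n, float gain)
{
    for (int i = 0; i < n; i++)                   /* each iteration independent */
        out[i] = scale(in[i], gain);
}
```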
 
dnavas: fwiw, here's my current guess for G90. This is 100% speculation, and should not be taken as more than that. Quoting me on this randomly or "leaking" this is not "fun", no... :p
- 65nm, Q4 2007; 400mm2+
- 1.5GHz GDDR4, 384-bit Bus
- 1.5GHz Shader Core Clock
- 650MHz Core Clock
- 32 MADDs/Cluster
- 24 Interps/Cluster
- 10 Clusters

Uttar

Uttar, for comparison's sake, please can you remind me how many MADDs and interps a cluster on the G80 can do? Am I right in thinking the G80 has 6 clusters?
 
Assuming you are correct in the CUDA context, it is still easier to arrange for 16 streams of data to be operated on by the same scalar instruction than to take, say, a function with 50 instructions and make sure that every unit of every SIMD MADD unit is being exercised each cycle.
You're still not getting it here, but later on you've got the right idea:
Sure, you can attempt the same thing on SPEs by scalarizing a bunch of parallel streams, running one on each component, but are SPEs as good at consuming streaming data as the G80? One dependent fetch would seem to throw a wrench into it, since you might have to schedule a DMA, and then what does the SPE do in the meantime? To hide the latency would seem to require careful management of the SPE scratchpad RAM so that an SPE could work on the next datum while waiting for pending DMAs.

Can't you at least see that the work for the programmer on CELL to extract peak performance looks like a lot more manual labor, micromanagement of resources, and tweaking than on the GPU? It isn't hard for me to create *useful* shaders on either NVidia or ATI hardware which generate near 100% utilization of the ALUs.
Yes.

When you said G80 is 8x better than Cell at scalar MADD ability, you weren't doing a fair comparison, and that's all I object to. Cell actually has higher scalar MADD ability than G80 per unit area. You'll notice I said "G80's strengths vs. Cell lie in perfect latency hiding with minimal effort, not in raw scalar math ability." Exactly the same thing as you're pointing out in the above two paragraphs.

Assuming a workload with enough streams for G80 to do latency hiding and not be hampered by the SIMD, Cell will also have enough streams to do almost the same. The tougher part is coding it for Cell, and even though you can likely do it theoretically (interleaving parallel, identical streams is not hard), it's unlikely you'll get near the speed of G80 most of the time.

I'm not quite sure how exactly you calculate the "16" figure. You have 8 TCPs; within a TCP, you have 16 SPs grouped into 2 blocks of 8 SPs (notice the NVidia diagram), each block being able to run a different thread context. That seems to suggest that the minimum coherence required is 8 SPs, not 16 (note, I'm not talking about branch coherence here). People may have measured 16-SP-wide issuing, but I was told it's really capable of 8.
I don't know for sure, but even if it's 8-wide, it's still twice as wide as Cell. The point is that your "8x scalar MADD ability" statement is not true. If you're calling G80 scalar, then Cell is even more scalar. ;)


Now, as for the workloads, I'll have to concede to you here. I would certainly classify any high-utilization workload on a 768-way SMP system as massively parallel, but what I didn't properly consider is thread synchronization and interthread communication. I could easily see this as being very troublesome for a GPU which is built for speed on completely independent threads. I'm not convinced we'll see this sort of workload appear beyond a few niches, but as you said, it's probably a bit foolish of me to make predictions about how companies write their applications in the future.
 
Assuming a workload with enough streams for G80 to do latency hiding and not be hampered by the SIMD, Cell will also have enough streams to do almost the same. The tougher part is coding it for Cell, and even though you can likely do it theoretically (interleaving parallel, identical streams is not hard), it's unlikely you'll get near the speed of G80 most of the time.

A somewhat related question: in a 3D context, grouping scalar threads with similar expected execution characteristics is trivial, since the rasterizer groups spatially close pixels together.

How does this happen in a CUDA environment? Are there primitives that permit the programmer to control this?
 