(resurrect Aureal Wavetracing please)
If that's the only thing delivered by the new GPUs, I'd still be ecstatic.
Just as an aside, MS's upcoming CNG (Cryptography Next Generation) provides lots of easy hooks for hardware acceleration of basically any crypto op. Coincidence?

The most expensive operation in SSL isn't the bulk cipher (RC4, 3DES, etc.), but the connection setup in which the session key is negotiated. That's what requires a lot of high-precision modular arithmetic, a la Diffie-Hellman/RSA/ElGamal/etc. ECC uses Galois fields, so it's slightly different (and more expensive). Accelerating that will take a huge weight off of connection setup costs, and level-3 routers will be able to set up the SSL/TLS connection and parse the HTTP headers before handoff.
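Just to make the handshake cost concrete: the public-key step boils down to modular exponentiation over very large integers. Below is a minimal square-and-multiply sketch of that operation (a toy of my own using 32-bit words; real RSA/DH runs the same loop over 1024-2048-bit multi-word numbers, which is exactly the part dedicated hardware would offload):

[code]
#include <cstdint>
#include <cstdio>

// Square-and-multiply modular exponentiation: computes (base^exp) mod m.
// Illustrative only: 32-bit words here, whereas RSA/DH work on 1024-2048-bit
// integers, so every multiply becomes a multi-word big-number operation --
// that is the cost a handshake accelerator takes off the CPU.
static uint32_t mod_exp(uint32_t base, uint32_t exp, uint32_t m) {
    uint64_t result = 1, b = base % m;
    while (exp > 0) {
        if (exp & 1)
            result = result * b % m;   // multiply step
        b = b * b % m;                 // square step
        exp >>= 1;
    }
    return (uint32_t)result;
}

int main() {
    // Toy Diffie-Hellman exchange with a 32-bit prime; real parameters are 1024+ bits.
    const uint32_t p = 4294967291u;    // 2^32 - 5, a prime
    const uint32_t g = 2, a = 123456789u, b = 987654321u;
    uint32_t A = mod_exp(g, a, p);     // Alice's public value: g^a mod p
    uint32_t B = mod_exp(g, b, p);     // Bob's public value:   g^b mod p
    printf("shared secrets match: %d\n", mod_exp(B, a, p) == mod_exp(A, b, p));
    return 0;
}
[/code]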
Regarding scalar vs. vector: if you're comparing an SPE to a graphics chip for general computation, then G80 is a lot wider than an SPE. The only way G80 will run at peak throughput for scalar computations is if you're running the same program on about 5000 fragments.

Cell could theoretically get away with only 32 parallel scalar computations and reach its peak. Thinking that each SPE can only do one scalar operation per clock is very disingenuous. You'd either have an idiot programmer or a workload that couldn't parallelize to a GPU anyway.
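To picture what "the same program on about 5000 fragments" means on the GPU side, here's a throwaway CUDA-style sketch (modern CUDA syntax; the kernel, names and sizes are mine, not anything from NVIDIA's docs). Each thread runs one scalar program, and you only approach peak when thousands of them are in flight to fill every SP in every cluster and cover latency:

[code]
#include <cstdio>

// One scalar program per thread; the hardware packs threads onto the SIMD
// clusters. Getting near peak means launching thousands of these, not 32.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];              // one scalar MADD per element
}

int main() {
    const int n = 1 << 20;                   // ~1M elements in flight
    float *x = 0, *y = 0;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));     // contents don't matter for the sketch
    cudaMemset(y, 0, n * sizeof(float));
    // 256 threads per block, ~4096 blocks: far more independent work than the
    // chip has ALUs, which is what lets it hide memory latency for free.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(x);
    cudaFree(y);
    return 0;
}
[/code]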
(Incidentally, I personally think CPUs should stay away from GPU type parallelism. Workloads will generally either be massively parallel -- in which case GPUs have the advantage regardless -- or they will be hard to parallelize, and there won't be much between the extremes. CPUs should stick to making the latter as fast as possible.)
It will still be under NDA for a while, sadly...

I'm curious... how much public information about CUDA do we have?
You're not getting it.

Well, if you want to compare theoretical peaks, then fine, but my point was, it's much easier to extract peak performance from a scalar cluster than a SIMD cluster.
Perhaps, but I'm thinking more from a practical point of view. I don't think we'll get software companies writing much code that scales well to 10+ cores unless it's massively parallel. By 'scales well' I mean, say, at least 4x speedup with 10 cores as opposed to 1.

I disagree. There is a rich literature in computer science on parallel programming problem classes, and it's more of a continuum than an "either-or" (you're either embarrassingly parallel, or not). That's why the EREW/CREW/ERCW/CRCW PRAM models exist: the asymptotic performance of algorithms differs depending on the machine model.
You're not getting it.
G80 does have SIMD clusters, and they're wider than Cell's. I was assuming identical instruction streams across the two processors before, but even if we go to the finest granularity that CUDA will expose, G80 must have 16 pieces of data undergoing the same scalar instruction to hit peak throughput.
Cell must have only 4 pieces of data undergoing the same scalar instruction. Both use SOA to get high throughput on scalar ops. Both need many such instruction streams on many data sets to hide calculation and loading latencies.
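If it helps, this is what the SOA trick looks like at Cell's 4-wide granularity (a generic sketch in plain C-style code that just mimics the SIMD unit, not real SPU intrinsics): four independent scalar work items sit in the four lanes of one vector, so each 4-wide MADD retires four scalar MADDs. G80 plays the same game, just needing 16 items per instruction instead of 4.

[code]
#include <cstdio>

// Struct-of-arrays (SOA) layout: each lane of the 4-wide vector carries an
// independent scalar stream, so one 4-wide MADD completes 4 scalar MADDs.
// The scalar loop below stands in for the SPE's 128-bit SIMD instruction.
struct vec4 { float lane[4]; };

static vec4 madd4(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (int l = 0; l < 4; ++l)               // one instruction's worth of work on an SPE
        r.lane[l] = a.lane[l] * b.lane[l] + c.lane[l];
    return r;
}

int main() {
    const int n = 1024;                       // 1024 scalar work items...
    static vec4 a[n / 4], b[n / 4], c[n / 4]; // ...packed 4 per vector (SOA)
    for (int i = 0; i < n / 4; ++i)
        c[i] = madd4(a[i], b[i], c[i]);       // 4 scalar MADDs retired per iteration
    printf("%f\n", c[0].lane[0]);
    return 0;
}
[/code]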
Perhaps, but I'm thinking more from a practical point of view. I don't think we'll get software companies writing much code that scales well to 10+ cores unless it's massively parallel. By 'scales well' I mean, say, at least 4x speedup with 10 cores as opposed to 1.
IMO, for workloads that cannot be made massively parallel, 2 cores is quite usable for the vast majority of consumer applications (even for the next decade); it will be tough to get a big boost going from there to 4 cores, and 8+ cores will be tougher still.
I'd say it has some other pretty damn big advantages - but the same is true of CELL, which has some obvious advantages of its own (such as the ridiculously large LS). I think it isn't fair to consider it as the only big advantage of G80, though.

G80 has 70% higher scalar MADD ability than Cell, and that's with over twice the die size. G80's strengths vs. Cell lie in perfect latency hiding with minimal effort, not in raw scalar math ability.
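For what it's worth, the 70% figure drops straight out of the commonly quoted peak numbers (my arithmetic, assuming 128 scalar MADD units at 1.35GHz for G80 and 8 SPEs with 4-wide MADDs at 3.2GHz for Cell; those figures aren't from this thread):

[code]
#include <cstdio>

int main() {
    // Assumed, commonly quoted figures -- 2 flops per MADD (multiply + add).
    double g80  = 128 * 1.35e9 * 2;          // 128 scalar MADD units @ 1.35GHz
    double cell = 8 * 4 * 3.2e9 * 2;         // 8 SPEs x 4 lanes @ 3.2GHz
    printf("G80  peak MADD: %.1f GFLOPS\n", g80 / 1e9);       // ~345.6
    printf("Cell peak MADD: %.1f GFLOPS\n", cell / 1e9);      // ~204.8
    printf("ratio: %.2f (about 70%% higher)\n", g80 / cell);  // ~1.69
    return 0;
}
[/code]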
I think that's largely true for the consumer market. There are plenty of workloads where you've got massive parallelism and it isn't massively SIMD, though. Think of things like MMORPG servers or Google's clusters as two very basic examples; anything that does a LOT of very small and different tasks, rather than single big ones. DemoCoder gave some very nice other examples I wasn't even fully aware of above, such as the Vega 2 CPU.
dnavas: fwiw, here's my current guess for G90. This is 100% speculation, and should not be taken as more than that. Quoting me on this randomly or "leaking" this is not "fun", no...
- 65nm, Q4 2007; 400mm²+
- 1.5GHz GDDR4, 384-bit Bus
- 1.5GHz Shader Core Clock
- 650MHz Core Clock
- 32 MADDs/Cluster
- 24 Interps/Cluster
- 10 Clusters
Uttar
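Taking those speculative numbers purely at face value, the implied peaks work out as below (my arithmetic only: a MADD counted as 2 flops, and I'm assuming the 1.5GHz GDDR4 figure means 3.0GT/s effective, i.e. the usual doubled data rate):

[code]
#include <cstdio>

int main() {
    // All inputs are the speculative figures above; the data-rate doubling is an assumption.
    double flops = 10 * 32 * 2 * 1.5e9;        // clusters * MADDs * flops * shader clock
    double bw    = (384 / 8.0) * (1.5e9 * 2);  // bus width in bytes * effective transfer rate
    printf("implied peak MADD: %.0f GFLOPS\n", flops / 1e9);  // 960
    printf("implied bandwidth: %.0f GB/s\n",  bw / 1e9);      // 144
    return 0;
}
[/code]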
You're still not getting it here, but later on you've got the right idea:

Assuming you are correct in the CUDA context, it is still easier to arrange for 16 streams of data to be operated on by the same scalar instruction than to take, say, a function with 50 instructions and make sure that every unit of every SIMD MADD unit is being exercised each cycle.
Yes.

Sure, you can try the same thing on SPEs by scalarizing a bunch of parallel streams, running one on each component, but are SPEs as good at consuming streaming data as the G80? One dependent fetch would seem to throw a wrench into it, since you might have to schedule a DMA, and then what does the SPE do in the meantime? Hiding the latency would seem to require careful management of SPE scratchpad RAM so that an SPE could work on the next datum while waiting for pending DMAs.
Can't you at least see that the work for the programmer on CELL to extract peak performance looks like a lot more manual labor, micromanaging of resources, and tweaking than on the GPU? It isn't hard for me to create *useful* shaders on either NVidia or ATI hardware which generate near 100% utilization of the ALUs.
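The "careful management of SPE scratchpad RAM" being described is usually double buffering: compute on one local-store buffer while the DMA engine fills the other. Here's a bare-bones sketch of the pattern; start_dma(), wait_dma() and process() are placeholders of mine (on Cell they'd be MFC DMA calls plus tag waits and your actual kernel), not real SDK code:

[code]
#include <cstdio>
#include <cstring>

// Placeholder for kicking off an asynchronous transfer into local store
// (on Cell this would be an MFC DMA with a tag); here it's just a memcpy.
static void start_dma(float* local_buf, const float* src, int count) {
    memcpy(local_buf, src, count * sizeof(float));
}
static void wait_dma(int /*tag*/) { /* on Cell: block on the DMA tag group */ }

static void process(float* buf, int count) {         // placeholder compute kernel
    for (int i = 0; i < count; ++i)
        buf[i] = buf[i] * 2.0f + 1.0f;
}

int main() {
    const int CHUNK = 256, CHUNKS = 16;
    static float main_mem[CHUNK * CHUNKS];            // stands in for main memory
    float ls[2][CHUNK];                               // two local-store buffers

    start_dma(ls[0], &main_mem[0], CHUNK);            // prime the pipeline
    for (int c = 0; c < CHUNKS; ++c) {
        int cur = c & 1;
        if (c + 1 < CHUNKS)                           // fetch chunk c+1...
            start_dma(ls[cur ^ 1], &main_mem[(c + 1) * CHUNK], CHUNK);
        wait_dma(cur);                                // ...make sure chunk c has landed...
        process(ls[cur], CHUNK);                      // ...and compute while the next transfer runs
    }
    printf("processed %d chunks\n", CHUNKS);
    return 0;
}
[/code]

That's the micromanagement in question: buffer sizing, tag bookkeeping and prefetch distance are all on the programmer, whereas the GPU's thread scheduler gives you the overlap automatically.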
I don't know for sure, but even if it's 8-wide, it's still twice as wide as Cell. The point is that your "8x scalar MADD ability" statement is not true. If you're calling G80 scalar, then Cell is even more scalar.

I'm not quite sure how exactly you calculate the "16" figure. You have 8 TCPs; within a TCP, you have 16 SPs grouped into 2 blocks of 8 SPs (notice the NVidia diagram), each block being able to run a different thread context. That seems to suggest that the minimum coherence required is 8 SPs, not 16 (note, I'm not talking about branch coherence). People may have measured 16-SP-wide issuing, but I was told it's really capable of 8.
Assuming a workload with enough streams for G80 to do latency hiding and not be hampered by the SIMD, Cell will also have enough streams to do almost the same. The tougher part is coding it for Cell, and even though you can likely do it theoretically (interleaving parallel, identical streams is not hard), it's unlikely you'll get near the speed of G80 most of the time.
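On the "interleaving parallel, identical streams" point: the technique itself is simple; the catch is that on Cell you do the interleaving (and the buffering above) by hand, while G80's scheduler does it for you. A trivial illustration of mine, advancing four identical, independent streams in lock-step so that each one's dependent-op latency is covered by issuing the others:

[code]
#include <cstdio>

// Four identical, independent scalar streams interleaved by hand. While one
// stream's multiply-add is still in flight, the others issue -- the manual
// version of what the GPU's thread scheduler does automatically. On an SPE
// these four streams would also share the lanes of a single SIMD register.
int main() {
    const int STEPS = 1000;
    float s0 = 1.0f, s1 = 2.0f, s2 = 3.0f, s3 = 4.0f;
    for (int i = 0; i < STEPS; ++i) {
        s0 = s0 * 1.0001f + 0.5f;
        s1 = s1 * 1.0001f + 0.5f;
        s2 = s2 * 1.0001f + 0.5f;
        s3 = s3 * 1.0001f + 0.5f;
    }
    printf("%f %f %f %f\n", s0, s1, s2, s3);
    return 0;
}
[/code]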