Larrabee vs Cell vs GPU's? *read the first post*
rpg.314
17-Apr-2009, 19:12
The purpose of this thread is to discuss the relative weakness and strengths of lrb, GPU and cell architectures with particular reference to how it might affect future architectures' evolution and market niches. I am particularly interested in the applications where one approach delivers considerably better or worse results in comparison, ie "presence/absence of x really hurts y while doing z".
---> ATM, points like "how easy it is to code for it" are somewhat secondary (at least in the beginning) since we are assuming that the programmer is willing to step up to the plate and equally good (a subjective term, of course) tools are available for both.
---> In the interest of fairness, to cancel the advantage of cell vs gpu's (in that ppe and spe sit pretty close), I think it is fair to assume (if need be) that somebody (some time in the future) will put a tiny x86-64 core next to a giant shader array :lol: . The goal here is to consider which problems are more easily or more efficiently solved on which architecture, so I think this is ok. Comments/cribs/questions/commendations :) welcome!
{This obviously doesn't apply to LRB as all cores there are equal and the only thing stopping Intel from placing it on the motherboard socket is their marketing department. :evil:}
--->Points like "trick x works for a but y for b regarding code optimization" are welcome as they serve to illustrate the differences.
---> Since we are all armchair architects here spouting wisdom for others to follow (ok, mostly....he he), feel free to speculate on "to improve x with regards to doing y, I'll make the following changes". Along this line of thought, suggestions for improving programmer productivity are welcome too. But kindly keep the proposed surgeries a bit conservative/realistic/less invasive if possible.
{Suggestions like chip having dual ARM Cortex A9 based PPE, with 30 SPE's with 512K LS, 512 ALU's from nV, and a quad x86 core each with 4 hyperthreads, an FPGA and topped off with dedicated rasterizer/tessellator/texture units, all on same die, with optical interconnects aka "ONE CHIP TO RULE THEM ALL:twisted:" are fun to read and laugh at, but do not contribute to anybody's (and definitely not mine) understanding/knowledge.}
Potential starting points. (meant to seed the discussion, please add your own)
1) video transcoding seems to be better with cell wrt GPU's
2) GPU's can't handle the task parallel codes. For instance, building jobs on PPE and then submitting it to a queue, that kind of approach can't be taken.
3) For 3D rendering, LRB doesn't suffer from any pipeline bottlenecks as it is all done in software.
TimothyFarrar
17-Apr-2009, 22:07
How about banked vs cached memory in terms of vector scatter/gather. Clearly read only vector gather can be covered by the texture units, so I'm talking about R/W vector scatter/gather only.
A LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you actually do a scatter/gather which doesn't simply load from the same vector sized cache line, then performance suffers proportional to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.
An even better question is how many L1 accesses LRB can do per cycle per core. So how badly will one thread issuing an expensive vector scatter/gather affect execution of the other hyperthreads on the core? Perhaps very infrequent bad cases of scatter/gather will not make total ALU throughput suffer too much. Hard to tell given the extent of public info released thus far.
One can compare and contrast to current NVidia hardware where scatter/gather to random memory locations runs at full speed as long as each lane of the vector accesses a separate bank of memory. This opens up a lot more flexibility with scatter/gather compared to the LRB approach in terms of keeping up ALU throughput.
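To make the two cost models above concrete, here is a minimal sketch, not a description of either chip's real memory pipeline. It assumes a 64-byte cache line (one 16-lane vector of 4-byte words) for the LRB-style case and 16 word-wide banks for the NVidia-style case; the function names are made up for illustration, and broadcast of identical addresses is ignored.

```python
from collections import Counter

LINE_BYTES = 64   # assumed cache-line size: 16 lanes x 4 bytes
NUM_BANKS = 16    # assumed number of word-wide banks
WORD_BYTES = 4

def cache_lines_touched(byte_addresses):
    """Distinct cache lines hit by one gather (cost model: one line per clock)."""
    return len({addr // LINE_BYTES for addr in byte_addresses})

def worst_bank_conflict(byte_addresses):
    """Largest number of lanes that land in the same bank (1 means full speed)."""
    banks = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in byte_addresses)
    return max(banks.values())

contiguous = [lane * 4 for lane in range(16)]   # one vector-sized run
odd_stride = [lane * 68 for lane in range(16)]  # 17-word stride
line_stride = [lane * 64 for lane in range(16)] # 16-word stride

for name, addrs in [("contiguous", contiguous),
                    ("odd stride", odd_stride),
                    ("line stride", line_stride)]:
    print(name, cache_lines_touched(addrs), worst_bank_conflict(addrs))
```

Under these assumptions the odd-stride gather touches 16 separate lines but is conflict-free across banks, which is the case where the banked approach keeps full speed and the one-line-per-clock approach does not.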
rpg.314
18-Apr-2009, 05:59
A LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you actually do a scatter/gather which doesn't simply load from the same vector sized cache line, then performance suffers proportional to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.
An even better question is how many L1 accesses LRB can do per cycle per core. So how badly will one thread issuing an expensive vector scatter/gather affect execution of the other hyperthreads on the core? Perhaps very infrequent bad cases of scatter/gather will not make total ALU throughput suffer too much. Hard to tell given the extent of public info released thus far.
Good question. I am pretty sure that it can access only one cache line per core. AFAIK, there is a performance penalty if you read a value that is split across cachelines on CPUs. And I don't think this part of the Pentium core would have been reworked too much.
One can compare and contrast to current NVidia hardware where scatter/gather to random memory locations runs at full speed as long as each lane of the vector accesses a separate bank of memory. This opens up a lot more flexibility with scatter/gather compared to the LRB approach in terms of keeping up ALU throughput.
Banked memory is certainly better in this regard. Making a banked cache seems possible too. If this problem turns out to be significant, now or in the future, I'd expect Intel to move to a banked cache. However the cost (die size) of making a banked cache would certainly be larger than the normal cache they appear to be using right now.
rpg.314
18-Apr-2009, 10:09
To rephrase my question, why does video transcoding seem to be better with cell than with GPU's?
Is it a software maturity issue or is it an architectural issue?
Brad Grenz
18-Apr-2009, 10:29
Both, and marketing as well. I think the hardware vendors overpromised on that front and as transcoding has proven to be a tough nut to crack on GPUs, corners have been cut to create an expected performance advantage.
My opinion is that GPGPU has proven to be a huge red herring. HardOCP has been posting videos of tech demos from the Infernal Engine all week. The developers narrating are very emphatic on the point that it is far preferable to have your GPU processing graphics. Their physics engine, which is very impressive, relies on heavily threaded, CPU based processing. The results are pretty cool, but they still point out that the PS3 version is the one they like to demo because it can simulate the most objects. Even better than their PC build with 8 threads on an i7 platform. That being the case, I think there is a place for an intermediate level of hardware splitting the difference between big CPU cores and tiny shader cores. Ideally, these SPE level execution units would have high speed access to both the system and video memories.
rpg.314
18-Apr-2009, 10:48
So GPUs are relatively not very good at transcoding. The question is why? Any insights?
For physics, SPEs are better than even i7. Do you think that is because of the fast (L1 speed) and large (L2 sized) LS?
AFAICS, in game physics, it should be possible to know deterministically your data flow patterns, so there Cell should shine. But won't gpu's be just as good at physics? After all, they too can use shared mem to collaborate and cache. And they make up for its small size by having relatively thinner alu's per SM (CUDA speak).
So GPUs are relatively not very good at transcoding. The question is why? Any insights?
My knowledge of encoders goes back to the MPEG-2 era, so I'm not that familiar with more recent codecs such as H.264. However, to my understanding, the most easily workable tasks on a GPU are basically just transformation (DCT, or the H.264 thingy), motion search, and some pre-processing such as noise removal. Traditionally DCT and motion search are the two most time consuming tasks in an encoder, but with a complex standard such as H.264 it's probably no longer the case. For example, I am not sure if CABAC can be done fast on a GPU.
For physics, SPEs are better than even i7. Do you think that is because of the fast (L1 speed) and large (L2 sized) LS?
In many physics simulations, at first you bucket sort objects into grids, then an SPE focuses on a grid (which should be able to fit into the LS). The processing of a grid is basically independent of other grids, so it's quite suitable on CELL.
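The bucket-then-process pattern described above can be sketched roughly as follows. This is a toy 2D version under stated assumptions: the helper names, the cell size and the contact radius are all hypothetical, and a real engine would also test pairs across neighbouring cells.

```python
from collections import defaultdict

def bucket_by_cell(positions, cell_size):
    """Bucket object indices by the grid cell their position falls in."""
    grid = defaultdict(list)
    for index, (x, y) in enumerate(positions):
        cell = (int(x // cell_size), int(y // cell_size))
        grid[cell].append(index)
    return grid

def close_pairs_in_cell(positions, members, radius):
    """Brute-force pair test inside one cell; touches only that cell's data."""
    pairs = []
    for n, i in enumerate(members):
        for j in members[n + 1:]:
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if dx * dx + dy * dy <= radius * radius:
                pairs.append((i, j))
    return pairs

positions = [(0.1, 0.2), (0.3, 0.25), (1.5, 1.5), (1.6, 1.4), (3.2, 0.1)]
grid = bucket_by_cell(positions, cell_size=1.0)

# Each cell is an independent job: on Cell it could be handed to one SPE,
# provided the cell's objects fit in the local store.
contacts = {cell: close_pairs_in_cell(positions, members, radius=0.5)
            for cell, members in grid.items()}
print(contacts)
```

Because each job reads only its own bucket, the working set per job is known up front, which is what makes the local-store model a comfortable fit here.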
But won't gpu's be just as good at physics? After all, they too can use shared mem to collaborate and cache. And they make up for its small size by having relatively thinner alu's per SM (CUDA speak).
The shared memory on G92/GT200 is really small, only 16KB. Each SPE has 4 pages of 64KB in its LS. However, I don't think GPU is "bad" at physics, as there are many nice demos showing that GPU can do some interesting physics simulation effects.
http://ati.amd.com/products/firepro/Siggraph_2008_video_encode_final.pdf
Jawed
rpg.314
18-Apr-2009, 13:54
My knowledge of encoders goes back to the MPEG-2 era, so I'm not that familiar with more recent codecs such as H.264. However, to my understanding, the most easily workable tasks on a GPU are basically just transformation (DCT, or the H.264 thingy), motion search, and some pre-processing such as noise removal. Traditionally DCT and motion search are the two most time consuming tasks in an encoder, but with a complex standard such as H.264 it's probably no longer the case. For example, I am not sure if CABAC can be done fast on a GPU.
CABAC is particularly bad on CPU's, due to its very branchy nature. In fact the CABAC portion of x264 slowed down (http://x264dev.multimedia.cx/?p=51) by almost 10% with the new Core i7. :-/
In many physics simulations, at first you bucket sort objects into grids, then an SPE focuses on a grid (which should be able to fit into the LS). The processing of a grid is basically independent of other grids, so it's quite suitable on CELL.
The shared memory on G92/GT200 is really small, only 16KB. Each SPE has 4 pages of 64KB in its LS. However, I don't think GPU is "bad" at physics, as there are many nice demos showing that GPU can do some interesting physics simulation effects.
Hmm. 1 spe has 25.6Gflops and 256k LS, so about 100 Kflops per byte. 1 SM on nV has 21Gflops per 16Kbyte of shared memory. That's not an equal comparison of course. Code is also part of a spe's LS but still the ratio is more than an order of magnitude higher. IMHO, code size > data size would still be an anomaly in this regard (or when a bucket has very few particles)
BTW, does anyone have an estimate for the latency encountered on spe, i.e. when you dma 16 bytes (the minimum allowed, that is), in terms of clocks? On nV it is 400-600 clocks as per CUDA guide but 500-650 clocks as per Volkov's paper. It should be in the same ball park, right? Or does XDR ram act funnily in this regard? On CPU's it is typically 100-200 clock cycles if I am not mistaken. Better estimates welcome.
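Working the numbers in that comparison through, as a quick sanity check rather than a benchmark: the sketch below divides peak flops per second by bytes of on-chip store, using only the figures quoted in the post. The function name is made up, and it takes 256k and 16K to mean KiB.

```python
def flops_per_byte(peak_gflops, store_kib):
    """Peak flops per second available per byte of local/shared store."""
    return peak_gflops * 1e9 / (store_kib * 1024)

spe = flops_per_byte(25.6, 256)  # one SPE with its 256 KiB local store
sm = flops_per_byte(21.0, 16)    # one SM with its 16 KiB shared memory
ratio = sm / spe

print(f"SPE: {spe:,.0f} flops/s per byte")
print(f"SM:  {sm:,.0f} flops/s per byte")
print(f"SM has {ratio:.1f}x more flops per byte of on-chip store")
```

On these figures the SM comes out at roughly 13 times the SPE's flops per byte, which is the "more than an order of magnitude" gap the post refers to.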
rpg.314
18-Apr-2009, 13:56
@ Jawed: I have seen that presentation. What are the points you are trying to make here? And amd's transcoder is worse off/not much better compared to badaboom AFAIK.
To rephrase my question, why does video transcoding seem to be better with cell than with GPU's?
Cell has 8 and 16 bit SIMD instructions for one. For another, GPUs use parallelism not merely to hide memory latency, they use it to hide pretty much all latency in the pipeline ... because they usually work on massively parallel problems and because memory latency is orders of magnitude higher than everything else they just don't bother trying to keep latencies low for anything else. Their actual instruction latency is hideous compared to Cell. That means you are forced to have many more threads than on Cell.
They are too optimized for massively parallel workloads.
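A back-of-the-envelope way to see why long latencies force many threads: the model below is a deliberate simplification (each thread issues for a fixed number of cycles, then stalls for a fixed latency), and the cycle counts in the example are hypothetical, not measured figures for Cell or any GPU.

```python
import math

def threads_to_hide_latency(latency_cycles, busy_cycles_per_thread):
    """Threads needed so the other threads' work covers one thread's stall.

    With N threads, the other N - 1 must supply at least latency_cycles of
    work: (N - 1) * busy >= latency.
    """
    return math.ceil(latency_cycles / busy_cycles_per_thread) + 1

# Hypothetical numbers: 20 cycles of issue between stalls in both cases.
print(threads_to_hide_latency(500, 20))  # long stall, GPU-like
print(threads_to_hide_latency(6, 20))    # short stall, local-store-like
```

The thread count scales directly with the latency being hidden, so a design that hides every latency with parallelism needs far more threads in flight than one that keeps latencies short.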
h.264 encoding isn't usefully faster (for a given bit-rate and quality trade-off) on GPUs because of bottlenecks (e.g. PCI Express bus) and limited applicability of GPUs (motion estimation is about the only thing worth doing).
Latency across PCI Express is a serious problem. As far as I can tell in games it means a frame of delay in things like world-physics.
h.264 encoding isn't, in itself, all that transcoding requires so there are other opportunities for performance gains. De-coding and de-noising are reasonable candidates, though of course GPUs have their dedicated, fixed-function, decoders.
Jawed
rpg.314
18-Apr-2009, 15:10
Cell has 8 and 16 bit SIMD instructions for one. For another, GPUs use parallelism not merely to hide memory latency, they use it to hide pretty much all latency in the pipeline ... because they usually work on massively parallel problems and because memory latency is orders of magnitude higher than everything else they just don't bother trying to keep latencies low for anything else. Their actual instruction latency is hideous compared to Cell. That means you are forced to have many more threads than on Cell.
Well 8 and 16 bit ops certainly help.
They are too optimized for massively parallel workloads.
:) But then shouldn't they have more cache (aka shared memory) for massive parallelism. I mean they are too optimized for massively parallel workloads which have high arithmetic intensity too. LRB seems much better in this regard, in that it will have much larger cache to flop ratio wrt gpus. Assuming 2Ghz clock, it has 16 lanes *2 madd * 2Ghz=64Gflops per core. So perhaps not better than Cell.
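The per-core peak quoted above is just lanes x flops per lane per clock x clock. A small sketch of that arithmetic follows. The SPE line uses the well-known 4-lane, 3.2 GHz configuration, which reproduces the 25.6 Gflops figure used elsewhere in the thread. The 256 KB of L2 per LRB core is an assumption taken from Intel's public descriptions, not something stated in this thread, and the 2 GHz clock is the post's own assumption.

```python
def peak_gflops(lanes, flops_per_lane_per_clock, clock_ghz):
    """Peak single-precision Gflops for one core (a madd counts as 2 flops)."""
    return lanes * flops_per_lane_per_clock * clock_ghz

lrb_core = peak_gflops(16, 2, 2.0)  # assumed 2 GHz clock, as in the post
spe = peak_gflops(4, 2, 3.2)

# On-chip store per Gflop; the LRB L2 size per core is an assumption.
lrb_kb_per_gflop = 256 / lrb_core
spe_kb_per_gflop = 256 / spe

print(lrb_core, round(spe, 1))
print(lrb_kb_per_gflop, round(spe_kb_per_gflop, 1))
```

With those assumptions an LRB core has about 4 KB of cache per peak Gflop against about 10 KB of local store for an SPE, consistent with the "perhaps not better than Cell" remark.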
rpg.314
18-Apr-2009, 15:15
h.264 encoding isn't usefully faster (for a given bit-rate and quality trade-off) on GPUs because of bottlenecks (e.g. PCI Express bus) and limited applicability of GPUs (motion estimation is about the only thing worth doing).
What about DCT, eh?
Latency across PCI Express is a serious problem. As far as I can tell in games it means a frame of delay in things like world-physics.
It's an offline transcode, not a realtime one, so latency should be dominated by bandwidth right?
BTW, these transcoders need cuda's atomic ops. Is that a pointer why cell does better at them than gpu's? But why do they need them?
What about DCT, eh?
It's in the AMD presentation. Firmly stuck in CPU land.
It's an offline transcode, not a realtime one, so latency should be dominated by bandwidth right?
The issue is that while the CPU does X, Y and Z steps in the algorithm, the GPU doing A has to be fast enough that X, Y and Z aren't slowed down. Otherwise the CPU just waits. It seems with AMD this latency was so bad that they didn't even bother with h.264 encoding until RV770 came along.
BTW, these transcoders need cuda's atomic ops. Is that a pointer why cell does better at them than gpu's? But why do they need them?
Do these transcoders need atomic ops? I was under the impression that G80 was supported.
http://www.cyberlink.com/products/powerdirector/cuda-optimization_en_US.html
http://www.badaboomit.com/?q=node/113
I can't find any statement of supported GPUs for Super LoiLoScope.
I think G80's out-of-date Pure Video (hmm, does it have PV?) is what's causing variations in support.
Jawed
:) But then shouldn't they have more cache (aka shared memory) for massive parallelism.
Cache runs into diminishing returns for graphics relatively fast, so no.
rpg.314
18-Apr-2009, 15:40
Ohh that was this.
http://techreport.com/articles.x/16617/1
My bad. Could be just the PureVideo decoder though instead of atomic ops. My mistake.
I can only talk about my experiences with the x86 architecture.
I also have some experience with the Cell. I wrote some wavelets, and i tried to speed up the code using all the SPEs, but, it didn't end very well. The Cell is too complex at low level. I remember having serious problems with the DMAs to feed the SPEs with the data to process the blocks. Bad performance. More time wasted in the setup than in the data processing.
About the x86 architecture, i think that one of the biggest problems with the SIMD instructions is the time that you need to fill all the elements of a XMM register.
If you are lucky, and your data is aligned in memory, you can load your data with a MOVAPS instruction. If the data is not aligned, you can use the MOVUPS (slower of course), but, i did read that in the core i7, movups is as fast as movaps on aligned data.
But, there are worse cases. Sometimes you need to read 4 elements from 4 different memory addresses, and shuffle them to fit into a XMM register. This is for me the worst case, and it kills the SSE performance. You lose a lot of time doing the 4 loads, and shuffling the data, and in these cases, usually it is better to use the FPU, or a single component of a XMM register. Basically you lose all the advantages of the SIMD instructions in these cases, due to the high overhead of the data shuffling.
I also think that intel should implement a hardware 4x4 matrix transpose instruction (among other things :)).
About my recent experiences with CUDA on the GPU, well, i love the SIMT concept, but, i think that it still has the same problem with the data setup. In my perlin noise kernel, i waste half of the time filling the shared memory, and it's a very small chunk of memory.
To sum up, i think that we have a lot of ALU power today, but, the access to the memory should be more flexible today (and with TB/s :D).
rpg.314
18-Apr-2009, 17:24
I also have some experience with the Cell. I wrote some wavelets, and i tried to speed up the code using all the SPEs, but, it didn't end very well. The Cell is too complex at low level. I remember having serious problems with the DMAs to feed the SPEs with the data to process the blocks. Bad performance. More time wasted in the setup than in the data processing.
Can we have some more details please for that?
About the x86 architecture, i think that one of the biggest problems with the SIMD instructions is the time that you need to fill all the elements of a XMM register.
If you are lucky, and your data is aligned in memory, you can load your data with a MOVAPS instruction. If the data is not aligned, you can use the MOVUPS (slower of course), but, i did read that in the core i7, movups is as fast as movaps on aligned data.
But, there are worse cases. Sometimes you need to read 4 elements from 4 different memory addresses, and shuffle them to fit into a XMM register. This is for me the worst case, and it kills the SSE performance. You lose a lot of time doing the 4 loads, and shuffling the data, and in these cases, usually it is better to use the FPU, or a single component of a XMM register. Basically you lose all the advantages of the SIMD instructions in these cases, due to the high overhead of the data shuffling.
It's really a shame that SSEx isn't orthogonal. Why does intel have to design every ISA (save lrbni) like they are brain dead or something? For your problem, maybe loading the four values into an aligned float[4] array and then doing a single load would be faster :???:
I also think that intel should implement a hardware 4x4 matrix transpose instruction (among other things :)).
That would be fun. They do have an optimized 4x4 transpose macro available if you use the intrinsics.
To sum up, i think that we have a lot of ALU power today, but, the access to the memory should be more flexible today (and with TB/s :D).
That would make a lot of effort put into exploiting memory hierarchy useless. :twisted:
Personally I have always thought every architecture should have special engines doing nothing except for processing DMA lists (wasting 1/4th of your potential ALU issue slots on Larrabee is a rather poor alternative IMO).
rpg.314
18-Apr-2009, 18:08
(wasting 1/4th of your potential ALU issue slots on Larrabee is a rather poor alternative IMO).
Ok, I didn't get it.
rpg.314
18-Apr-2009, 18:15
And dedicated DMA engines, how are they likely to help with the x86 cores? They have no software lockable cache. Even if you prefetched a large amount, it will likely be thrown out quickly.
Ok, I didn't get it.
Well one of the suggestions has been to use a single one of the 4 threads for gather operations; if it gets equal running time, that's so many cycles where nothing gets issued to the vector ALUs.
And dedicated DMA engines, how are they likely to help with the x86 cores? They have no software lockable cache. Even if you prefetched a large amount, it will likely be thrown out quickly.
So? I never said it would work bolted onto Larrabee as is, I don't particularly like Larrabee as is ...
rpg.314
18-Apr-2009, 18:36
So you are suggesting that there should be a dma engine on gpu's to prefetch to shared memory?
Would be nice ... and almost free as far as transistors are concerned.
Can we have some more details please for that?
I barely remember anything, because it was year 2006/7, but i remember that with a single buffer for the DMA transfer, the SPE was idle most of the time. But, with multiple buffers and overlapping the computation on one buffer with the data transfer in other, the results were better. This is a must for any Cell app imho. I also remember problems trying to multi-thread my code because there was shared data across the threads.
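The single-buffer versus multi-buffer difference described above can be put into a rough timing model. This is only a sketch: it assumes every block has the same DMA time and the same compute time, ignores write-back, and the cycle counts are made up for illustration.

```python
def single_buffer_cycles(blocks, dma_cycles, compute_cycles):
    """One buffer: the SPE idles during every transfer."""
    return blocks * (dma_cycles + compute_cycles)

def double_buffer_cycles(blocks, dma_cycles, compute_cycles):
    """Two buffers: the next transfer overlaps the current block's compute."""
    if blocks == 0:
        return 0
    # First transfer is exposed, the last compute is exposed, and each step
    # in between costs whichever of the two overlapping activities is longer.
    return dma_cycles + (blocks - 1) * max(dma_cycles, compute_cycles) + compute_cycles

# Hypothetical cost per block: 300 cycles of DMA, 200 cycles of compute.
print(single_buffer_cycles(8, 300, 200))
print(double_buffer_cycles(8, 300, 200))
```

In this example the job is transfer-bound, so double buffering hides the compute behind the DMA rather than the other way round; the SPE is still waiting part of the time, just much less.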
It's really a shame that SSEx isn't orthogonal. Why does intel have to design every ISA (save lrbni) like they are brain dead or something? For your problem, maybe loading the four values into an aligned float[4] array and then doing a single load would be faster :???:
Yeah, i did something like that, but with the shuffle instructions & several registers.
That would be fun. They do have an optimized 4x4 transpose macro available if you use the intrinsics.
I remember using the _MM_TRANSPOSE4_PS macro, but it needed the 8 XMM registers, and sometimes, you are using some registers to store some 'previously shuffled' data :)
That would make a lot of effort put into exploiting memory hierarchy useless. :twisted:
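For anyone wondering where the register pressure comes from, here is a plain-Python emulation of a 4x4 transpose built from eight `_mm_shuffle_ps`-style operations. The shuffle immediates follow one common formulation of the macro; the exact instruction sequence differs between compiler headers, so treat this as an illustration of the data movement, not as the macro's definition. `shuffle_ps` and `transpose4` are names made up for this sketch.

```python
def shuffle_ps(a, b, imm):
    """Emulate _mm_shuffle_ps: low two lanes picked from a, high two from b."""
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]

def transpose4(r0, r1, r2, r3):
    """Transpose four 4-lane rows using four temporaries and eight shuffles."""
    t0 = shuffle_ps(r0, r1, 0x44)  # r0[0], r0[1], r1[0], r1[1]
    t2 = shuffle_ps(r0, r1, 0xEE)  # r0[2], r0[3], r1[2], r1[3]
    t1 = shuffle_ps(r2, r3, 0x44)  # r2[0], r2[1], r3[0], r3[1]
    t3 = shuffle_ps(r2, r3, 0xEE)  # r2[2], r2[3], r3[2], r3[3]
    return (shuffle_ps(t0, t1, 0x88),  # column 0
            shuffle_ps(t0, t1, 0xDD),  # column 1
            shuffle_ps(t2, t3, 0x88),  # column 2
            shuffle_ps(t2, t3, 0xDD))  # column 3

rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for row in transpose4(*rows):
    print(row)
```

Four input rows plus four temporaries are live at once, which lines up with the eight-register cost mentioned above.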
Don't shoot the messenger :)
How about banked vs cached memory in terms of vector scatter/gather. Clearly read only vector gather can be covered by the texture units, so I'm talking about R/W vector scatter/gather only.
Cache memories are banked as well so I am not sure the premise is entirely correct, though I still get what you are pointing at :)
Also gather done via texture unit doesn't sound like a good idea if you are going to re-use your data multiple times.
A LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you actually do a scatter/gather which doesn't simply load from the same vector sized cache line, then performance suffers proportional to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Unfortunately scratchpad memory based programming models don't scale so nicely.
One can compare and contrast to current NVidia hardware where scatter/gather to random memory locations runs at full speed as long as each lane of the vector accesses a separate bank of memory. This opens up a lot more flexibility with scatter/gather compared to the LRB approach in terms of keeping up ALU throughput.
It's certainly fast, but I wouldn't call gather/scatter from a minuscule memory that you have to manage on your own 'flexible' ;-)
MrGaribaldi
19-Apr-2009, 11:01
What about DCT, eh? It's in the AMD presentation. Firmly stuck in CPU land.
But why is that? They don't give any reasons in the slides, although it looks like the ME is taking "too long".
Nvidia has a nice CUDA sample for doing DCT on the GPU with 2 different kernels that are enjoyably fast of themselves, and can be further tweaked, so I don't see why you couldn't offload it to the GPU as well.
Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing some of the picture for doing x264 encoding...
rpg.314
19-Apr-2009, 11:07
Blockwise DCT should be very nicely parallelizable. Also cuda helps as it exposes the dedicated video decode hardware, at least on windows, so only the last lossless compression needs to be done on the CPU. ME and DCT can both be on GPU.
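To illustrate why blockwise DCT parallelizes so easily, here is a naive pure-Python sketch: an orthonormal 8x8 DCT-II applied to each block on its own. It is not how the CUDA or CAL samples are written; the function names and the tiny test image are made up, and image dimensions are assumed to be multiples of 8.

```python
import math

N = 8  # block size

def dct_1d(values):
    """Orthonormal DCT-II of N samples (naive O(N^2) form)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        total = sum(v * math.cos((2 * i + 1) * k * math.pi / (2 * N))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out

def dct_2d(block):
    """Separable 2D DCT: transform rows, then columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][u] for c in range(N)] for u in range(N)]

def split_blocks(image):
    """Cut an image into NxN blocks, in raster order."""
    height, width = len(image), len(image[0])
    return [[row[x:x + N] for row in image[y:y + N]]
            for y in range(0, height, N) for x in range(0, width, N)]

# 8x16 test image: left block is flat 100, right block is flat 50.
image = [[100.0 if x < 8 else 50.0 for x in range(16)] for _ in range(8)]

# No block depends on any other, so this loop maps directly onto one
# thread block (or one SPE job) per 8x8 block.
coeffs = [dct_2d(block) for block in split_blocks(image)]
print(round(coeffs[0][0][0], 6), round(coeffs[1][0][0], 6))
```

The independence only covers the transform itself; whether the stages around it in an H.264 encoder can stay on the GPU is the open question in this part of the thread.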
rpg.314
19-Apr-2009, 11:10
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Unfortunately scratchpad memory based programming models don't scale so nicely.
I'd like a detailed explanation on this please. It seems to me that you are implying that the hardware managed caches (with s/w hints like on lrb if need be) scale better than purely software managed caches like on gpu's.
But why is that? They don't give any reasons in the slides, although it looks like the ME is taking "too long".
Yeah, shame there's nothing more detailed.
Is there an NVidia presentation on the details of h.264 encoding? Anyone else done one for CUDA encoding?
Nvidia has a nice CUDA sample for doing DCT on the GPU with 2 different kernels that are enjoyably fast of themselves, and can be further tweaked, so I don't see why you couldn't offload it to the GPU as well.
There's a CAL sample for DCT too.
Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing some of the picture for doing x264 encoding...
I don't understand the h.264 encoding pipeline at all well, so I don't know how realistic ME and DCT both on the GPU is.
Jawed
TimothyFarrar
19-Apr-2009, 21:05
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Not to my knowledge, global accesses (ie uncached) are also banked addressed (just like shared memory) with CUDA. I'd argue that in terms of bandwidth limited cases, that having the extra 16x addressing capacity (16 bank addresses, vs 1 cacheline) per global access can be quite a performance benefit (well assuming the programmer can make use of it).
trinibwoy
19-Apr-2009, 21:30
Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However the segment size can be either 32, 64 or 128 bytes. So global memory access works similar to the LRB cache where there is one read per segment (cache line).
Not to my knowledge, global accesses (ie uncached) are also banked addressed (just like shared memory) with CUDA. I'd argue that in terms of bandwidth limited cases, that having the extra 16x addressing capacity (16 bank addresses, vs 1 cacheline) per global access can be quite a performance benefit (well assuming the programmer can make use of it).
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Yep, that's pretty cool.
An example of the effect of the improved memory controller in GT200:
http://dl.getdropbox.com/u/484203/Lectures/Guests/JoeStam__ConvolutionSoup.pdf
Jawed
rpg.314
20-Apr-2009, 05:19
Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However the segment size can be either 32, 64 or 128 bytes. So global memory access works similar to the LRB cache where there is one read per segment (cache line).
If 64 threads issue contiguous reads, each thread reading 16 bytes, then you have fetched 1K of data from memory. Probably Timothy is referring to the fact that that 1K will be fetched by different memory controllers and then merged together in the register file, hence the banking. Elements of this technique are already there in CPU's with multi channel memory controllers, so I suppose it is an issue of semantics.
Simon F
20-Apr-2009, 09:32
For example, I am not sure if CABAC can be done fast on a GPU.
It's damned difficult to do fast on anything :lol:
FWIW Encoding is probably easier than decoding which is ironic as the latter is surely going to be more common.
rpg.314
20-Apr-2009, 10:17
For decoding we already have dedicated hw in gpu's today. So transcoders most likely will be taking advantage of it.
Yep, that's pretty cool.
It really simplifies the code in many cases, which can give you a nice speed-up. For my biggest algorithm, I got some 30% just by removing the ungodly-complex-GF8-coalescing-code with a very straight-forward partially coalesced one.
Staying on topic, I'd say having a little bit of smarts in the way you process memory accesses is often quite useful, especially if you are not on a fixed platform. Cell suffers from not having this, but makes up for it with 6 cycles latency. It would be interesting to know how much the GT200's reorder-buffer costs in terms of latency.
rpg.314
20-Apr-2009, 13:14
I suspect it is about 50-100 cycles. In initial CUDA docs they said that 400-600 clock cycle latency should be expected. But when volkov published his benchmarks, he found more like 500-700 clocks on GT200. This is a rather crude estimate, I admit.
I suspect it is about 50-100 cycles.
That sounds waaaaay too high for me. Like an order of magnitude too high.
OK, the reorder is at base-clock, not shader clock, which I assume you meant, so that brings it down a good bit.
Still, I may be totally off here, but even 20cy is a long time. :)
rpg.314
20-Apr-2009, 13:48
The numbers there are definitely in terms of shader clocks. However, considering the logic involved, 100 cycles is indeed high. However, it may well be the case that nv wanted to minimize the area used (as it is CUDA specific) so used a smaller coalescer many times to reduce an already bloated die. Like I said, it is a crude estimate.
CouldntResist
20-Apr-2009, 13:55
Only Arm and Core will prevail. Other architectures (nvidia, amd, larrabee, itanium, cell) will be assimilated or obsoleted.
Resulting polarity on the industrial scene will be the seed of epic conflict within human civilisation, destined to last for eons. Eventually spreading over the entire galaxy, the conflict will outlast the biology based life form of humanity. Our consciousnesses, now encased in machine shells, will be occupied with the ultimate goal: complete annihilation of the opponent, before the Heat Death of Universe happens...
Only Arm and Core will prevail... Eventually spreading over entire galaxy...
Galactic Arm vs. Galactic Core? Hmm.... ;)
TimothyFarrar
20-Apr-2009, 16:04
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Oops, that's what I get for a quick post in the local Apple Store this weekend. I was referring to the ability of the GT2xx series to reduce access size. Specifically the ability for the hardware to reduce global access requests from 128 byte segments to 64 byte segments or 32 byte segments. My mistaken 16x addressing factor is really a 4x addressing factor with the reduction from a 128 byte segment to four 32 byte segments.
The 32 byte segment turns out to only be 1/2 the size of a LRB cache line. Marco's point is indeed a good one for global memory accesses.
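The 4x factor can be checked with a simplified model of segment reduction. The sketch below is an assumption-laden approximation of the behaviour discussed in this thread, not a statement of the exact hardware rules: it starts from 128-byte segments and halves a transaction while only one half is used, down to 32 bytes. It assumes naturally aligned 4-byte words, and `coalesce` is a made-up name.

```python
def coalesce(addresses, word_bytes=4, reduce_segments=True):
    """Return a list of (base, size) memory transactions for one request."""
    pending = list(addresses)
    transactions = []
    while pending:
        # Start from the 128-byte segment holding the first pending address.
        base = (pending[0] // 128) * 128
        served = [a for a in pending if base <= a < base + 128]
        size = 128
        lo, hi = min(served), max(served) + word_bytes - 1
        while reduce_segments and size > 32:
            half = size // 2
            if hi < base + half:       # only the lower half is used
                size = half
            elif lo >= base + half:    # only the upper half is used
                base, size = base + half, half
            else:
                break
        transactions.append((base, size))
        pending = [a for a in pending if a not in served]
    return transactions

contiguous = [lane * 4 for lane in range(16)]   # 16 adjacent words
scattered = [lane * 128 for lane in range(16)]  # one word per 128-byte segment

fixed = sum(size for _, size in coalesce(scattered, reduce_segments=False))
reduced = sum(size for _, size in coalesce(scattered))
print(coalesce(contiguous))
print(fixed, reduced, fixed // reduced)
```

For the fully scattered case the model moves 512 bytes instead of 2048, which is the 128-to-32 byte reduction expressed as wasted bandwidth.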
trinibwoy
20-Apr-2009, 16:43
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank. Aren't single-ported cache reads rigidly limited to pulling a single contiguous cache line at once?
Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank.
What algorithms need to do this? Why? When are these offsets arbitrary, but not random?
Jawed
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
'Banked' can actually mean a lot of things. 'Banked' to allow one or more reads per cycle (if there are no conflicts). 'Banked' because a given size of cache memory isn't synthesizable at a given frequency so you need multiple blocks and select between them on access.
In the case of NVidia I have not read the documentation so I'm not sure but I would guess that their local storage memory has as many banks as vector ALUs (requests per cycle) and each bank is independently addressable. Because it makes sense when taking into account how a shader processor works. In fact that way it would work more like a single-ported register file than like a normal cache (it may even be synthesized as such).
In fact it's not a cache, it's a multi-banked local memory.
TimothyFarrar
20-Apr-2009, 17:12
An example of the effect of the improved memory controller in GT200:
http://dl.getdropbox.com/u/484203/Lectures/Guests/JoeStam__ConvolutionSoup.pdf
Jawed
Awesome link Jawed. Thanks.
So for the convolution with G92 level hardware it looks as if it's only a little over 2.5x faster to go fully optimized CUDA shared memory vs GPGPU texture access only. Then with GT200 it's still around 2.5x faster using shared memory vs texture access; just the coalescing brings the hideous CUDA code up to the texture access only case's performance.
Wasn't a similar ~2x factor also mentioned in some DX10 vs DX11 CS slides? Best GPGPU vs CUDA speed up factor I can remember was something like ~7x for a parallel SCAN (comparing CUDA to OpenGL).
trinibwoy
20-Apr-2009, 17:45
What algorithms need to do this? Why? When are these offsets arbitrary, but not random?
Don't really understand the question. Why is randomness relevant? The hardware doesn't care whether the offsets are fixed, arbitrary or random. Lane-aware dynamic warp formation can be considered a case where the offsets are both arbitrary and random, yet it would certainly take advantage of the fact that each lane has a dedicated bank to facilitate simultaneous reads across all lanes.
'Banked' can actually mean a lot of things. 'Banked' to allow one or more reads per cycle (if there are no conflicts).
In the case of NVidia I have not read the documentation so I'm not sure but I would guess that their local storage memory has as many banks as vector ALUs (requests per cycle) and each bank is independently addressable.
Yep, that's pretty much how it's setup. But since their ALUs are double pumped there are actually 16 banks to each 8-wide ALU (reads are 16 wide).
So for the convolution with G92 level hardware it looks as if it's only a little over 2.5x faster to go fully optimized CUDA shared memory vs GPGPU texture access only. Then with GT200 it's still around 2.5x faster using shared memory vs texture access; just the coalescing brings the hideous CUDA code up to the texture access only case's performance.
It's notable that the later tweaks (128-bit memory operations) weren't applied to the texturing algorithm. Also, does shared memory actually benefit this kernel? Why not just let the texture cache do all the heavy lifting? OK, so this isn't trilinear filtering where the texture cache is really at home, but still.
Wasn't a similar ~2x factor also mentioned in some DX10 vs DX11 CS slides? Best GPGPU vs CUDA speed up factor I can remember was something like ~7x for a parallel SCAN (comparing CUDA to OpenGL).
I think the D3D11-CS speed-up comes in FFT.
Slide 54 for the Scan speed-up:
http://gpgpu.org/static/asplos2008/ASPLOS08-5-advanced-data-parallel-programming.pdf
which relies upon shared memory so that scatter doesn't get bogged down by slowness of memory. Luckily it's highly localised scatter.
Jawed
Don't really understand the question. Why is randomness relevant? The hardware doesn't care whether the offsets are fixed, arbitrary or random.
Randomness is relevant because you need an algorithm with precisely controlled offsetting to gain the benefit you're ascribing to NVidia's architecture.
Which raises the fundamental question: are you doing offsetting because the memory forces you to do so, or is this intrinsic in the algorithm (and with a wonderful stroke of luck the algorithm just happens to match the dimensions of the memory :lol: )
I've seen people use 17 and 63 as offsets in their shared memory programming purely to access the performance benefit of banked memory in NVidia. In neither case was 17 or 63 a convenient or useful tweak determined specifically by the algorithm, it was a pure hardware-specific adjustment.
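To make the 17 and 63 trick concrete, here is a small Python model of it. It assumes 16 banks of 32-bit words with word index modulo 16 selecting the bank (the bank count mentioned earlier in the thread); the function names are made up for illustration.

```python
from collections import Counter

NUM_BANKS = 16  # assumed: 16 banks, bank = word index mod 16


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Worst number of distinct words mapped to one bank (1 = conflict-free)."""
    per_bank = Counter(i % num_banks for i in set(word_indices))
    return max(per_bank.values())


def column_read(row_stride, lanes=16):
    """Word indices touched when each lane reads column 0 of its own row."""
    return [lane * row_stride for lane in range(lanes)]


print(conflict_degree(column_read(16)))  # 16: every lane lands in bank 0
print(conflict_degree(column_read(17)))  # 1: padded rows spread over all banks
print(conflict_degree(column_read(63)))  # 1: 63 is odd, so also conflict-free
```

Any odd stride works in this model because it shares no factor with 16, which fits the point that the choice is dictated by the bank count rather than by the algorithm.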
So basing an argument for the superiority of NVidia's banked shared memory over L1 fetches solely on offset-banking is not very convincing.
e.g. if an algorithm eventually consumes all the data in the tile sat in shared memory, then what we're actually talking about here is ordering of fetches. The ordering that works on NVidia may not suit Larrabee, but the converse prolly exists.
Some Larrabee factors that might affect how you implement the algorithm when faced with Larrabee:
The large per-work-item state that Larrabee affords
the mutability of "registers"<->"shared-memory" in on-die memory (no hard limits)
the shuffle instructions that allow work-items to get at data belonging to other work-items without having to go via "shared memory"
Jawed
trinibwoy
20-Apr-2009, 18:31
Randomness is relevant because you need an algorithm with precisely controlled offsetting to gain the benefit you're ascribing to NVidia's architecture.
So basing an argument for the superiority of NVidia's banked shared memory over L1 fetches solely on offset-banking is not very convincing.
It's highly convincing since all of the "problems" you see with shared memory conflicts also apply to a traditional cache. Whereas a multi-banked local memory will in a large number of random cases coalesce better than cache reads. So the cache has all of the same downsides and none of the potential upside. But like you said, at least caches are bigger.
3dilettante
20-Apr-2009, 18:36
the shuffle instructions that allow work-items to get at data belonging to other work-items without having to go via "shared memory"
This one may be somewhat more complex, depending on the situation.
Shuffles and broadcasts allow for getting data belonging to other work items, so long as we have an SOA scheme and are sharing data by shuffling between the 16 items in the same SIMD register.
If sharing for more than 16 units, but the work group is small enough that we can hold enough relevant data within the register file, then some combination of instructions can find the needed register and then access the needed elements.
Once we start spilling into main memory, arbitrary access to any work unit's data would involve the program tracking the needed index values and then calculating the needed address to load from. It may not share the name with the abstraction of shared memory, but the actions involved are similar.
It's highly convincing since all of the "problems" you see with shared memory conflicts also apply to a traditional cache.
The Larrabee implementation may not stumble blindly into thrashing the cache at 1/16th throughput. Like I asked earlier, why is the algorithm behaving "so badly"?
A large-radius stochastic kernel might be an example of such an access pattern. If you treat each work-item individually then the fetches appear random. All the work-items in a fibre are less random. If you have multiple fibres forming a tile, and the tile strides, then you're left with very little randomness.
Jawed
trinibwoy
20-Apr-2009, 18:54
The Larrabee implementation may not stumble blindly into thrashing the cache at 1/16th throughput.
I have to admit you've lost me here. You criticize shared memory based on the fact that the software has to explicitly take advantage of its structure. But then you say it's not a problem for Larrabee because the software explicitly takes advantage of its memory structure? Isn't that the same thing?
If so, I would argue that it takes a lot less fudging about with shared memory (offsets) than with cache (shuffling data around) to get what you want.
TimothyFarrar
20-Apr-2009, 19:02
Randomness is relevant because you need an algorithm with precisely controlled offsetting to gain the benefit you're ascribing to NVidia's architecture.
I'm not sure about that. If you were to scatter randomly (or semi-randomly) into your CUDA shared memory wouldn't you have a rather low number of bank conflicts on average ... if so your scatter would actually be a bit faster than if you could only hit a single cache line per access.
What algorithms need to do this? Why? When are these offsets arbitrary, but not random?
Jawed
The gradient vectors in the Perlin noise are 100% random, if you use the original Ken Perlin algorithm.
In my CUDA implementation, i have 24 random accesses to float4 vectors, plus 6 accesses to byte indexes, and the final float write result. :oops:
TimothyFarrar
20-Apr-2009, 19:17
A large-radius stochastic kernel might be an example of such an access pattern. If you treat each work-item individually then the fetches appear random.
A large radius stochastic kernel is a good example. With LRB if you were gathering 16 random pixels each vector gather, how many would hit in the same 4x4 tile (bit interleaved addressing so 4x4 tile = one cache line)? Compare to how many expected bank conflicts you'd get with CUDA. Not to mention if large radius kernel, you could create a "random" pattern which was ensured not to have bank conflicts...
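One way to put rough numbers on that comparison is a quick Monte Carlo in Python. Everything here is a simplifying assumption rather than a model of either chip: uniformly random addresses, 16 banks with every lane hitting a different word (no broadcast), 4x4 pixel tiles as cache lines, and a square sampling window of side 2*radius+1. The function names are made up for illustration.

```python
import random


def mean_bank_passes(trials=2000, lanes=16, banks=16, seed=0):
    """Average worst-case bank occupancy for uniformly random word addresses.

    The busiest bank decides how many passes a banked memory needs.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * banks
        for _ in range(lanes):
            counts[rng.randrange(banks)] += 1
        total += max(counts)
    return total / trials


def mean_lines_touched(radius, trials=2000, lanes=16, tile=4, seed=0):
    """Average number of distinct tile x tile cache lines hit by one gather
    of `lanes` random pixels inside a (2*radius+1)^2 window."""
    rng = random.Random(seed)
    side = 2 * radius + 1
    total = 0
    for _ in range(trials):
        lines = {(rng.randrange(side) // tile, rng.randrange(side) // tile)
                 for _ in range(lanes)}
        total += len(lines)
    return total / trials


print("banked passes per access:", mean_bank_passes())
print("cache lines per gather, radius 32:", mean_lines_touched(32))
print("cache lines per gather, radius 1:", mean_lines_touched(1))
```

With a large window nearly every lane lands in its own cache line, while the banked case needs only a few passes on average; with a tiny window the gather collapses to a single line. How those counts translate into cycles depends on hardware details this sketch does not model.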
I have to admit you've lost me here. You criticize shared memory based on the fact that the software has to explicitly take advantage of its structure. But then you say it's not a problem for Larrabee because the software explicitly takes advantage of its memory structure? Isn't that the same thing?
Just because I'm describing what people might do on Larrabee doesn't mean I think all this stuff is wonderful.
Shared memory isn't even an automatic win. Still it can be a huge bonus so until the abstractions are improved (if it's even possible?) people are stuck with programming to the metal.
Maybe NVidia will start opening up on their architecture, in light of increasing competition, instead of making people faff for 18 months in order to get a decent matrix multiply programmed.
If so, I would argue that it takes a lot less fudging about with shared memory (offsets) than with cache (shuffling data around) to get what you want.
I'm sure this will be qualified by developers over time once they get their hands on Larrabee.
Maybe we need a thread: "Ray Tracer: CUDA or Larrabee?"
Jawed
A large radius stochastic kernel is a good example. With LRB if you were gathering 16 random pixels each vector gather, how many would hit in the same 4x4 tile (bit interleaved addressing so 4x4 tile = one cache line)? Compare to how many expected bank conflicts you'd get with CUDA. Not to mention if large radius kernel, you could create a "random" pattern which was ensured not to have bank conflicts...
How do you make your data fit in a small local memory for a "large radius stochastic kernel"? :)
I'm not sure about that. If you were to scatter randomly (or semi-randomly) into your CUDA shared memory wouldn't you have a rather low number of bank conflicts on average ... if so your scatter would actually be a bit faster than if you could only hit a single cache line per access.
The fundamental question is: if you're allowed to optimise for NVidia, aren't you allowed to optimise for Larrabee? Why blindly take a piecemeal approach on Larrabee just because piecemeal doesn't hurt too much on NVidia?
This paper:
http://impact.crhc.illinois.edu/ftp/conference/ppopp-08-ryoo.pdf
is just over a year old. I wonder how much of that is still useful? They didn't even get to 100GFLOPs in MM on G80.
The learning curve is pretty steep with these "desktop supercomputers" :lol:
Jawed
A large radius stochastic kernel is a good example. With LRB if you were gathering 16 random pixels each vector gather, how many would hit in the same 4x4 tile (bit interleaved addressing so 4x4 tile = one cache line)? Compare to how many expected bank conflicts you'd get with CUDA. Not to mention if large radius kernel, you could create a "random" pattern which was ensured not to have bank conflicts...
In Larrabee you might hold the whole tile in L2 and slide a tile from the dataset through L2 using the scalar pipeline's cache pre-fetching instructions, in parallel with the work being done by the VPU.
The worst-case latency, at 1/16th throughput, is 16 * 10 L2 cycles = 160 cycles. How does that compare with the 500 cycles of latency that NVidia's trying to hide, each time it goes to fetch a tile from video memory into shared memory? How much work are the multiprocessors doing (erm, not doing on the kernel) while they compute tile addresses and move data into shared memory?
Jawed
TimothyFarrar
20-Apr-2009, 20:10
How do you make your data fit in a small local memory for a "large radius stochastic kernel"? :)
:sad: Yeah, a 64x64 or so sized tile for local memory isn't all that large...
:sad: Yeah, a 64x64 or so sized tile for local memory isn't all that large...
This is what I mean when I say that a data cache based approach scales better (which obviously doesn't say much about absolute performance).
3dilettante
20-Apr-2009, 20:19
In Larrabee you might hold the whole tile in L2 and slide a tile from the dataset through L2 using the scalar pipeline's cache pre-fetching instructions, in parallel with the work being done by the VPU.
The worst-case latency, at 1/16th throughput, is 16 * 10 L2 cycles = 160 cycles. How does that compare with the 500 cycles of latency that NVidia's trying to hide, each time it goes to fetch a tile from video memory into shared memory? How much work are the multiprocessors doing (erm, not doing on the kernel) while they compute tile addresses and move data into shared memory?
Well, it is cache, so in the spirit of nitpicking, the worst-case is "Surprise!! One of the other three threads just thrashed the L2 at those lines you need" and it's 16*200 cycles or more.
The best way to prevent that would be to properly pin the needed threads to the core and make sure that they don't fetch to the L2 or write out any data that might force an eviction before all filter work is done for all threads.
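For a sense of scale, the worst-case figures quoted in this exchange work out as follows. This is a back-of-envelope sketch in Python: it assumes the 16 line fetches of a gather are fully serialized with no overlap, and the constant names are made up for illustration. The cycle counts are the ones posted in the thread, not measurements.

```python
LINES_PER_GATHER = 16    # worst case: one cache line per lane
L2_HIT_CYCLES = 10       # per-line L2 figure quoted in the thread
MISS_CYCLES = 200        # per-line figure quoted for a line evicted from L2
GPU_FETCH_CYCLES = 500   # video memory latency figure quoted in the thread


def serialized_latency(lines, cycles_per_line):
    """Total cycles when each line is fetched one after another."""
    return lines * cycles_per_line


all_hits = serialized_latency(LINES_PER_GATHER, L2_HIT_CYCLES)
all_misses = serialized_latency(LINES_PER_GATHER, MISS_CYCLES)

print(all_hits)    # 160
print(all_misses)  # 3200
print(all_misses / GPU_FETCH_CYCLES)  # 6.4
```

So under these assumptions the all-hits case sits well under the 500 cycle figure, while a fully thrashed gather costs several times that figure.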
trinibwoy
20-Apr-2009, 21:17
This is what I mean when I say that a data cache based approach scales better (which obviously doesn't say much about absolute performance).
I've been wondering lately whether Nvidia will increase the amount of shared memory beyond DX11-CS requirements and try to take advantage of it for graphics rendering tasks as well. If it's there, might as well use it right? Is there anything that could potentially make use of it? Tiling, programmable AA resolve? Or is that a non-starter?
The best way to prevent that would be to properly pin the needed threads to the core and make sure that they don't fetch to the L2 or write out any data that might force an eviction before all filter work is done for all threads.
In Seiler et al there's a thread that controls this stuff as well as distributing work to the other threads. I've got no idea if a co-operative scheme amongst 4 threads is reasonable.
There's only L1 line locking, there's no way to lock L2 lines.
Jawed
3dilettante
20-Apr-2009, 22:07
Threads working on the tile can synchronize so that they work on the tile in the L2 at the same time.
Load and store traffic from these threads during this process could be set to be non-temporal to minimize L2 cache disruption.
There may be some awkwardness if results are written back non-temporally, since they'd have to be reloaded if required soon after.
If complexity is the order of the day, the data could be structured and packed so that 1 or more ways in the L2 are purposefully kept free of data, so that writeback or scratchpad work doesn't evict data.
Neither scheme works fully if another core is writing to the exact same addresses, then invalidates its cached copy, but that sounds like there would be worse problems than cache thrashing then.
I've been wondering lately whether Nvidia will increase the amount of shared memory beyond DX11-CS requirements and try to take advantage of it for graphics rendering tasks as well. If it's there, might as well use it right? Is there anything that could potentially make use of it? Tiling, programmable AA resolve? Or is that a non-starter?
If they deleted the ROPs?
Currently NVidia's L2s are located close to the memory controllers. I'm not clear on whether L2s serve both for textures and render targets though. If the L2s were moved closer to the multiprocessors then there'd be a problem in moving texels from L2 in MP1 to L1 in MP37, say. Becomes similar to the ring-bus palaver of R600.
Larrabee dodges that bullet (at least partially) by giving texturing a dedicated cache. No idea how big it is. No idea if it's multilevel. No idea if Larrabee's L2 is on the critical path for texels on their trip from memory to TUs.
The other issue is the early-Z culling. Rasterisation and ROPs effectively interact in determining whether to shade a fragment. Deleting the ROPs would require the output merge programs on all the multi-processors to keep the rasteriser's low-resolution version of Z/stencil up-to-date. That's a potentially tricky interaction twixt fixed-function rasterisation and OM programs. e.g. if there's 64 multiprocessors each trying to keep the rasteriser up-to-date...
There might be an argument for saying that ROPs can only die when setup/rasterisation also becomes fully software.
Anyway, I like the way Larrabee holds a tile of the render target in L2 until it's completed. Shared memory doesn't seem particularly different, so it would be sweet if NVidia went with a significantly expanded shared memory, no-ROPs and lots of FLOPs to fill the space taken by the ROPs.
Jawed
TimothyFarrar
20-Apr-2009, 22:58
Interesting idea. But what about MSAA and render target compression? Something tells me with the number of samples (scaled up in comparison to the GT2xx) described in that GPU 2013 (http://forums.techarena.in/web-news-trends/1027042.htm) presentation that MSAA is still in hardware for GT3xx at least...
TimothyFarrar
20-Apr-2009, 23:15
Forgot about this, but also in that NVision 08 GPU 2013 presentation,
Rough timeline of changes,
- unified shader (G80)
- double precision (GT200)
- C++
- preemption
- complete pointer support
- virtual pipeline
- adaptive workload partitioning
And by 2013,
- arbitrary dataflow
- hardware managed threading and pipelining
My guesses,
- C++ (CUDA 3.0 gets C++ like interface?)
- preemption (support for multiple CUDA kernels at the same time?)
- complete pointer support (dynamic branching in a kernel?)
- virtual pipeline (software rasterization?)
- adaptive workload partitioning (hardware sched of kernels to SIMD units?)
Larrabee dodges that bullet (at least partially) by giving texturing a dedicated cache. No idea how big it is. No idea if it's multilevel. No idea if Larrabee's L2 is on the critical path for texels on their trip from memory to TUs.
It's 32 KB per core (see Siggraph paper).
Interesting idea. But what about MSAA and render target compression?
Do you mean compression of both MSAA and colour data? It's certainly more work...
This is for Z compression, which I've picked because of the sheer stupidly high rate in NVidia:
http://v3.espacenet.com/publicationDetails/originalDocument?CC=US&NR=7382368B1&KC=B1&FT=D&date=20080603&DB=EPODOC&locale=en_GB
Hmm, quite a bit of work.
In Larrabee, with no compression required (since pixels are written only once to memory, with optional MSAA resolve, depending on whether samples are required for following pass) there's obviously less work to do. In that case it seems MSAA/stencil-shadow performance is more than adequate.
Here:
http://forum.beyond3d.com/showthread.php?t=53993
the Archmark results imply that theoretical Z-rate is achievable, 147Gzixels/s, on GTX280 (I wonder if that's meaningful, though - how much z-culling is happening? HD4870 is faster than theoretical...). That's 2.3 shader cycles per zixel. Say Z is in 8x8 tiles, that's ~147 cycles per tile. Looks unlikely.
At the same time, NVidia's theoretical rate is nearly 3x RV770's. Is that factor ever witnessed in any game? If not ...
Something tells me with the number of samples (scaled up in comparison to the GT2xx) described in that GPU 2013 (http://forums.techarena.in/web-news-trends/1027042.htm) presentation that MSAA is still in hardware for GT3xx at least...
Maybe some special instructions would speed things up, e.g. max + min of a tile in shared memory.
How many FLOPs would fit in the ROPs and Frame Buffer (what is that?) sections of the final picture here?:
http://www.techreport.com/articles.x/14934/2
Jawed
rpg.314
21-Apr-2009, 09:51
complete pointer support (dynamic branching in a kernel?)
I'd have thought that it would mean pointers to registers too. Dynamic branching is already there, isn't it?
3dilettante
21-Apr-2009, 14:39
How many FLOPs would fit in the ROPs and Frame Buffer (what is that?) sections of the final picture here?:
http://www.techreport.com/articles.x/14934/2
Looking at ROP sections alone, my eyeball says 2.5-3 clusters (without TMUs) could fit.
The framebuffer part is something of an unknown. Part of that might be the various other widgets like memory controllers, PCI-E and other miscellaneous parts.
I've seen other pictures labelling that region as containing those.
TimothyFarrar
21-Apr-2009, 15:13
Dynamic branching is already there, isn't it?
CUDA 2.1 PTX ISA 1.3 section 10.1.5 has the following,
- bra : Indirect branch via register are not implemented.
- call : Indirect call via register are not implemented.
So branching and calls are currently only supported to an immediate address (defined at compile time).
TimothyFarrar
21-Apr-2009, 15:42
In Larrabee, with no compression required (since pixels are written only once to memory, with optional MSAA resolve, depending on whether samples are required for following pass) there's obviously less work to do. In that case it seems MSAA/stencil-shadow performance is more than adequate.
If you keep everything in cache, ie pre bin every draw command intersecting the tile before the resolve, then sure I could see MSAA working. One thing I'm wondering about here is how the texture unit on LRB handles state changes (texture changes). If you have all these tiles running in parallel, each with perhaps up to 1K to 4K of state changes globally (and a smaller subset per tile), how does the texture unit manage tiles needing divergent subsets of textures?
What I'm getting at here is whether there are conditions on LRB where tiles need to go to and from global memory pre-resolve.
At the same time, NVidia's theoretical rate is nearly 3x RV770's. Is that factor ever witnessed in any game? If not ...
There are all sorts of good uses for insane Z or stencil fill. Most obvious cases are pre-z drawing pass, shadowmaps, and stencil frustums for shadowing or lighting optimizations. In the frustum drawing case I'd guess that you'd be hitting the max fill limit. Sometimes on pre-z or shadowmap drawing other constraints get in the way, such as writing out normals in pre-z (you lose double Z rate for light prepass rendering, for example), or small triangles or a small triangle count per batch making the draws not limited by fill.
Maybe some special instructions would speed things up, e.g. max + min of a tile in shared memory.
Aren't vector min and max LRB instructions? HiZ is likely easy to do with a parallel min/max reduction.
One thing I'm wondering about here is how the texture unit on LRB handles state changes (texture changes). If you have all these tiles running in parallel, each with perhaps up to 1K to 4K of state changes globally (and a smaller subset per tile), how does the texture unit manage tiles needing divergent subsets of textures?
What I'm getting at here is whether there are conditions on LRB where tiles need to go to and from global memory pre-resolve.
The only useful thing I can say here is that texturing in Larrabee seems to attract the most handwaving. With the supposed strictness of D3D11 it'll be interesting to see how Intel gains WHQL...
There are all sorts of good uses for insane Z or stencil fill. Most obvious cases are pre-z drawing pass, shadowmaps, and stencil frustums for shadowing or lighting optimizations. In the frustum drawing case I'd guess that you'd be hitting the max fill limit. Sometimes on pre-z or shadowmap drawing other constraints get in the way, such as writing out normals in pre-z (you lose double Z rate for light prepass rendering, for example), or small triangles or a small triangle count per batch making the draws not limited by fill.
But 3x seems excessive, that's the only point I'm making. Perhaps the argument is that 3x Z-rate only costs NVidia an extra 15% die space (whatever, don't know what it is) and produces a 15% performance gain (again, for argument's sake, don't know what it is). 3x might not be a bar against the fixed-function ROP but it would be a bar against the software-version - so is it really needed?
Game benchmarks answer with a resolute "no".
Larrabee, in 2010, might be a little "early" as a full software pipeline. But what is the definition of the correct time? When NVidia implements it? Or when Intel's price/performance is better? etc.
Aren't vector min and max LRB instructions? HiZ is likely easy to do with a parallel min/max reduction.
Yeah, REDUCE_MIN/MAX.
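As an illustration of the parallel min/max reduction idea for a coarse Z representation, here is a minimal Python sketch. It is not the semantics of any Larrabee instruction; the function names, the 8x8 tile size and the power-of-two restriction are assumptions made for the example.

```python
def reduce_min_max(values):
    """Pairwise (tree) reduction returning (min, max) in log2(n) steps."""
    lo = list(values)
    hi = list(values)
    n = len(lo)
    assert n and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    while n > 1:
        n //= 2
        # Each of the n "lanes" combines element i with element i + n.
        lo = [min(lo[i], lo[i + n]) for i in range(n)]
        hi = [max(hi[i], hi[i + n]) for i in range(n)]
    return lo[0], hi[0]


def hiz_tile(depth, tile=8):
    """Coarse (min, max) depth for one tile x tile block given as rows."""
    flat = [z for row in depth[:tile] for z in row[:tile]]
    return reduce_min_max(flat)


# Example: an 8x8 tile holding a simple depth ramp.
ramp = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
print(hiz_tile(ramp))
```

For 64 samples the loop runs six times, each step halving the number of live values, which is the shape a SIMD or shared-memory version would follow.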
Jawed
trinibwoy
21-Apr-2009, 23:03
Game benchmarks answer with a resolute "no".
Not sure you can draw that conclusion based on the data we have. Who's to say extremely fast Z isn't compensating for deficiencies elsewhere? Sort of giving an ALU light architecture a head start.
HD4870 is faster than theoretical...). That's 2.3 shader cycles per zixel. Say Z is in 8x8 tiles, that's ~147 cycles per tile. Looks unlikely.
Phony measurement -- on my 4890 board all numbers are correct (53 GPix out of 59 theoretical).
At the same time, NVidia's theoretical rate is nearly 3x RV770's. Is that factor ever witnessed in any game? If not ...
Does GT200 have support for hierarchical depth/stencil buffering?
Not sure you can draw that conclusion based on the data we have. Who's to say extremely fast Z isn't compensating for deficiencies elsewhere? Sort of giving an ALU light architecture a head start.
It's possible. In the context of replacing ROPs with FLOPs it could indicate that the tipping point is further away, defined by the ratio of FLOPs/mm² : ROPs/mm² (or Z/mm²).
Jawed
Phony measurement -- on my 4890 board all numbers are correct (53 GPix out of 59 theoretical).
That was my point, that Archmark is prolly not a useful test.
Does GT200 have support for hierarchical depth/stencil buffering?
Not that I'm aware of.
Jawed
ArchMark is good by its nature, but it has to rely on wacky OGL driver impl. It definitely has less overhead than GZEasy test suite, and this is important for synthetic evaluations.
wacky OGL driver impl.
Is that why it doesn't entirely work on ATI?
Jawed
Yup -- some surface formats are gone for good in there, me thinks.
There's one old "GPGPU" DIP benchmark which is also broken.
TimothyFarrar
22-Apr-2009, 16:07
Phony measurement -- on my 4890 board all numbers are correct (53 GPix out of 59 theoretical).
Does GT200 have support for hierarchical depth/stencil buffering?
It's not exactly the same between vendors but to my knowledge both ATI and NV have some coarse representation of both depth and stencil used for acceleration.
rpg.314
22-Apr-2009, 17:28
Can somebody explain to me what hierarchical z-cull is? :roll:
Also, I have never understood how a large fill rate can compensate for anything. How is it (fill rate) calculated anyway? And why should anyone bother about it?
It's not exactly the same between vendors but to my knowledge both ATI and NV have some coarse representation of both depth and stencil used for acceleration.
Well, I don't know for sure, but probably NV's solution is "single-level" coarse representation only, as we know in ATi's impl the buffer can scale from coarse to more fine grained regions in case of a miss.
Can somebody explain to me what hierarchical z-cull is? :roll:
Hyper-Z III (http://www.beyond3d.com/content/reviews/37/4) - scroll down the page for more info on the subject.
steampoweredgod
06-Dec-2011, 04:27
Sorry for bumping this thread, but I thought it was better than creating a new one as the topic relates to it. In physics related tasks how are these 3 speculated to compare: DX11 GPUs, Larrabee (or the new Intel Knights CPU) and Cell? Earlier in the thread it was mentioned memory latency was orders of magnitude higher in GPUs; how does this affect physics calculations? Considering the nextgen Knights CPU and a hypothetical Cell are likely around ~1 Tflops while DX11 GPUs are several times that, is that reflected in physics performance? Or do things like latency do away with it?
Ninjaprime
06-Dec-2011, 08:03
Considering the nextgen Knights CPU and a hypothetical Cell are likely around ~1 Tflops while DX11 GPUs are several times that...
I don't think that's accurate. I'm guessing the ~1 Tflops you're talking about with Knights Corner is from that recent claim of over 1 Tflops in DGEMM by Intel. DGEMM is double precision. That means it's much, much faster than 1 Tflops in GPU SP flops terms. I believe the target clock for KC is 1.6 GHz, with ~50 cores, which adds up to about 2.5 Tflops SP, which only AMD's top GPU could beat in raw numbers, and which probably couldn't even begin to approach KC in real world tests. For example, a Tesla C2070 (Fermi), I think, hits around ~330 Gflops in DGEMM while its theoretical peak would be 515 Gflops DP, meaning KC is around 3x faster.
I also think the hypothetical Cell rates lower than that; at least the 32 SPU Cell I've seen would only hit 800 Gflops SP, unless they plan on also increasing clock rates. Then again, it's hypothetical, so who knows. /shrug
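The peak figures in this post can be reproduced with simple arithmetic. The sketch below is Python and rests on assumptions not stated in the post: 16 single-precision SIMD lanes per Knights Corner core (the Larrabee width discussed earlier in the thread), 4 SP lanes per SPU at 3.2 GHz, and a fused multiply-add counted as 2 flops per lane per cycle. The helper name is made up for illustration.

```python
def peak_gflops(cores, clock_ghz, simd_lanes, flops_per_lane_per_cycle=2):
    """Theoretical peak: cores x clock x SIMD width x flops per lane per cycle."""
    return cores * clock_ghz * simd_lanes * flops_per_lane_per_cycle


kc_sp = peak_gflops(cores=50, clock_ghz=1.6, simd_lanes=16)    # about 2560
cell32_sp = peak_gflops(cores=32, clock_ghz=3.2, simd_lanes=4)  # about 819

# DGEMM figures quoted in the post (GFLOPS, double precision).
kc_dgemm, c2070_dgemm, c2070_peak_dp = 1000.0, 330.0, 515.0

print(round(kc_sp), round(cell32_sp))
print(round(c2070_dgemm / c2070_peak_dp, 2))  # fraction of peak reached in DGEMM
print(round(kc_dgemm / c2070_dgemm, 2))       # ratio behind the "around 3x" claim
```

Under those assumptions the numbers line up with the post: roughly 2.5 Tflops SP for a 50 core part, roughly 800 Gflops SP for a 32 SPU Cell, and about a 3x gap in DGEMM.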
rpg.314
06-Dec-2011, 16:06
Both Cell and Larrabee are dead, it doesn't matter how good they are.
steampoweredgod
06-Dec-2011, 16:15
I don't think that's accurate. I'm guessing the ~1 Tflops you're talking about with Knights Corner is from that recent claim of over 1 Tflops in DGEMM by Intel. DGEMM is double precision. That means it's much, much faster than 1 Tflops in GPU SP flops terms. I believe the target clock for KC is 1.6 GHz, with ~50 cores, which adds up to about 2.5 Tflops SP, which only AMD's top GPU could beat in raw numbers, and which probably couldn't even begin to approach KC in real world tests. For example, a Tesla C2070 (Fermi), I think, hits around ~330 Gflops in DGEMM while its theoretical peak would be 515 Gflops DP, meaning KC is around 3x faster.
I also think the hypothetical Cell rates lower than that; at least the 32 SPU Cell I've seen would only hit 800 Gflops SP, unless they plan on also increasing clock rates. Then again, it's hypothetical, so who knows. /shrug
The old dp revision of cell did 1DP per about 2 SP, not sure what other modifications can be done as the progress path seems to be in limbo. But one also has to take into account performance per watt, spes are said to consume about 1 Watt each in a modern process node.
If knights corner is say 100 watts, the fair comparison is 100spe, and I think that goes to the spes :wink:*(though I've knights is likely 200+ watt range)
I'd heard single SP Flop was in the 5+Tflop range in state of the art gpus, I also heard somewhere that around 1 fifth of that is programmable flops, so the figure truly available would be around 1Tflop SP, if it is true that 1/5th comment is true.
There's also another comment by a reputable member that latency is orders of magnitude higher in gpus(though that was a while ago), which might or might not affect physics performance.
If it significantly affected performance and if the 1/5th figure was true, something like a nextgen cell would be equal to about 10 state of the art 2012 gpus in physics related tasks. If the 1/5th figure is true but the latency is not or does not affect performance might be similar. If neither the 1/5th comment is true nor latency affects the performance might turn heavily towards the gpu depending on which architecture is generally better for physics.
Given that highly advanced physics performance could bring ubiquitous cloth simulation, hair simulation, fluid simulation, particle simulation, deformable terrain and realistic destruction, this is a very important area when it comes to videogames.
steampoweredgod
06-Dec-2011, 16:20
Both Cell and Larrabee are dead, it doesn't matter how good they are.
Maybe. Cell might still live on in future PlayStation platforms.
Besides, why did it die? What sort of agreements were in place? Can IBM use and sell SPE elements freely or not? Was it ease of programming? Are there any limitations on selling it in the console market?
For example, if there are any limitations on selling Cell architectures in the console market, it wouldn't bode well for IBM, which has contracts with multiple other companies in the console market, to have Cell architectures shown off excelling in press release after press release.
The Cell was featured as a key component in many of the greenest and most efficient supercomputer systems. Imagine the PR nightmare of having Cell at the top of the world's supercomputers while it is on only one console platform. How would that look? Certainly not nice at all from a PR standpoint, especially for their clients.
In another thread it was claimed that a developer got better physics performance out of the PS3 Cell (6 SPEs available) than out of a 4-core i7. Some unnamed next-gen console is rumored to have just 3 cores, cores that likely do not match an i7 core. If we extrapolate, and assuming that statement regarding physics is true, the five-year-old Cell would likely beat the brand-new console CPU at this task.
rpg.314
06-Dec-2011, 16:31
Maybe. Cell might still live on in future PlayStation platforms.
No.
Besides, why did it die? What sort of agreements were in place? Can IBM use and sell SPE elements freely or not? Was it ease of programming? Are there any limitations on selling it in the console market?
Nobody wants it. Period. If demand is there, then contracts can be modified.
The Cell was featured as a key component in many of the greenest and most efficient supercomputer systems. Imagine the PR nightmare of having Cell at the top of the world's supercomputers while it is on only one console platform. How would that look? Certainly not nice at all from a PR standpoint, especially for their clients.
This situation already exists.
If Knights Corner is, say, 100 watts, the fair comparison is 100 SPEs, and I think that goes to the SPEs :wink: (though I've heard Knights Corner is likely in the 200+ watt range).
Now add in the on-chip infrastructure required to feed those 100 SPEs with data, and you'll probably be incredibly lucky if you only need to cut the SPE count by 30% to fit in the same TDP :)
A big point of Larrabee/KC was the interconnect between the individual cores, and the mesh-grid kind seems to work quite well for them. Cell's ring bus quite definitely can't scale to that many SPEs.
steampoweredgod
06-Dec-2011, 16:55
Nobody wants it. Period. If demand is there, then contracts can be modified.
If there are exclusivity agreements with Sony regarding use in the console space, I doubt Sony would be interested in changing those terms.
This situation already exists.
There's a difference: that happened when Sony was expected to be the console leader by a vast margin and the Cell appeared in state-of-the-art supercomputers.
The newest record-breaking supercomputers aren't Cell based, iirc.
Now, with the market more evenly split, it wouldn't be the same to have the number 1 supercomputer be Cell based, especially at or near new console launches, if there's any exclusivity agreement.
Now add in the on-chip infrastructure required to feed those 100 SPEs with data, and you'll probably be incredibly lucky if you only need to cut the SPE count by 30% to fit in the same TDP :smile:
A big point of Larrabee/KC was the interconnect between the individual cores, and the mesh-grid kind seems to work quite well for them. Cell's ring bus quite definitely can't scale to that many SPEs.
No one's saying it has to be on the same chip. Put three 32-SPE Cells on a board; given that prior-generation Cell boards are in the greenest and most efficient supercomputer systems, these boards would likely compare favorably.
Cell's local store was said to be more scalable than Larrabee's coherent caches:
The lack of coherence between the Local Stores is probably seen as a disadvantage but once you start to scale Cell it'll turn out to be a big advantage.
Once you start adding in piles of cores coherent caches will become a major source of latency and power consumption. -ADEX, beyond3d
That was one of the larger design decisions on creating the Cell was that there was a limit to the amount of cache you can use before you hit diminishing returns. Where as not only is the sdram predictable its infinitly scaleable. -Terarrim, beyond3d
The memory wall: the processor frequency has now surpassed the speed of the DRAM and the current workaround of using multilevel caching leads to increased memory latency....
The slow main memory access on traditional x86 architectures creates a data flow bottleneck causing processor idle times. This results in much lower sustained performance than the theoretical peak of the CPU. To combat the bottleneck, state of the art processors have significant cache (L1, L2, L3), typically several megabytes on the processor chip. This uses up space that would otherwise be available to allow more transistors (and more processing power, as well as more heat). This “wasted” cache memory area is one explanation for why Moore’s law no longer translates into equivalent performance increases.-link
(http://www.simbiosys.ca/science/white_papers/IBM_eHiTS_BLW03019USEN_1.1.pdf)
rpg.314
06-Dec-2011, 17:07
If there are exclusivity agreements with Sony regarding use in the console space, I doubt Sony would be interested in changing those terms.
Are you seriously saying that even if there was demand for Cell or its derivatives, Sony would insist on exclusivity instead of getting some royalties on top of sunk expenses?
steampoweredgod
06-Dec-2011, 17:17
Are you seriously saying that even if there was demand for Cell or its derivatives, Sony would insist on exclusivity instead of getting some royalties on top of sunk expenses?
Sony invested hundreds of millions, iirc. I doubt any console competitor is going to pay that in addition to royalties.
PowerVR GPU royalties are said to range from cents to 1 dollar per chip, and Sony pays Nvidia 5 dollars per chip, iirc. At most a competitor can be expected to sell around 100M units over its 10-year lifetime, so with high Nvidia-like royalties you can expect about $500M of income over ten years. If CPU royalties are much lower, the expected income could fall under $100M over ten years. (One would presume this sum would then be divided between IBM, Toshiba and Sony.)
I don't see Sony handing over a heavy investment like Cell for pocket change, especially if it turns out to offer any serious advantage in the console space.
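The royalty estimate above is simple enough to write out. This is only a sketch of that back-of-the-envelope arithmetic; the per-chip rates and the 100M unit count are the iirc figures from the post, and the even three-way split is purely a hypothetical illustration.

```python
def royalty_income(units_sold, royalty_per_unit_usd):
    """Total royalty income in USD over a console's lifetime."""
    return units_sold * royalty_per_unit_usd

units = 100_000_000  # ~100M lifetime units, as assumed in the post

high = royalty_income(units, 5.00)  # Nvidia-like $5 per chip
low = royalty_income(units, 1.00)   # top of the quoted PowerVR range

# Hypothetical: an even split between IBM, Toshiba and Sony.
per_partner_high = high / 3
per_partner_low = low / 3

print(f"High case: ${high / 1e6:.0f}M total, ${per_partner_high / 1e6:.0f}M per partner")
print(f"Low case:  ${low / 1e6:.0f}M total, ${per_partner_low / 1e6:.0f}M per partner")
```

Any CPU royalty below $1 per chip pushes the total under $100M, which is the "pocket change" case being argued.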
Ninjaprime
07-Dec-2011, 01:13
The old DP revision of Cell did about 1 DP flop per 2 SP flops. I'm not sure what other modifications can be done, as the progress path seems to be in limbo. But one also has to take performance per watt into account: SPEs are said to consume about 1 watt each on a modern process node.
I don't know where you heard that, but I'm dubious of its truth. Current 45nm Cell chips have a TDP in the 45-50 watt range, with the SPEs taking up ~40% of the die, IIRC. That would mean the ~40% of the die that is doing most of the work is only generating ~20% of the heat.
If Knights Corner is, say, 100 watts, the fair comparison is 100 SPEs, and I think that goes to the SPEs
100 SPEs by themselves are paperweights without a 12x scaled-up EIB and 12 more PPEs to deliver the bandwidth and scheduling they need. Even assuming the 1 watt each were true, 12 PPE cores and, more importantly, a 12x scaled-up EIB would be quite the power hog. This is also saying nothing of the die you are putting it on... 22nm KC is an actual product that has an actual die; I don't know if you could even fit 100 SPEs within the limits of a normal die. The 45nm Cell is 115mm^2, so at ~40% of the die that puts the 8 SPEs at 46mm^2, or 5.75mm^2 each. So your 100 SPE chip is 575mm^2 with just the SPEs, near the limits of a die, with no memory interface, no EIB, no PPEs. Considering the EIB scales horribly, this is a major problem.
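The die-area estimate above can be reproduced in a few lines. This is just a sketch of that arithmetic; the 115mm^2 die and the ~40% SPE share are the IIRC figures from the post, not measured values.

```python
cell_die_mm2 = 115.0   # 45nm Cell die area quoted in the post
spe_fraction = 0.40    # share of the die taken by the SPEs (IIRC figure)
spes_per_cell = 8

spe_block_mm2 = cell_die_mm2 * spe_fraction       # area of all 8 SPEs
area_per_spe_mm2 = spe_block_mm2 / spes_per_cell  # area of one SPE

def spe_only_area_mm2(n_spes):
    """Area of n SPEs alone: no PPEs, no EIB, no memory interface."""
    return n_spes * area_per_spe_mm2

for n in (8, 32, 100):
    print(f"{n:>3} SPEs: {spe_only_area_mm2(n):.1f} mm^2 (SPEs only)")
```

The estimate is a lower bound by construction, since it leaves out everything except the SPEs themselves.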
I'd heard peak SP throughput was in the 5+ Tflops range in state-of-the-art GPUs. I also heard somewhere that around one fifth of that is programmable flops, so the figure truly available would be around 1 Tflops SP, if that 1/5th comment is true.
I'm not completely clear on what you're saying here. That a GPU flop is 1/5th of a normal CPU flop, maybe? That's certainly not true. Look at DGEMM benches: while utilization might be lower on a GPU, it certainly isn't 1/5th. Fermi reaches about 60% or so of its theoretical peak in DGEMM. KC is even better. KC is pretty well in line with a CPU, I think.
There's also another comment by a reputable member that latency is orders of magnitude higher in GPUs (though that was a while ago), which might or might not affect physics performance.
Doesn't really matter. Latency is relative. 10 nanoseconds versus 10 microseconds might seem like a lot on paper, but to a game with 16,666 microseconds between frames it's nothing. Want a real-world test of this? Look at Nvidia and PhysX. If it mattered, and GPU physics was hard, it wouldn't exist.
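To put the frame-time argument above into numbers, here is a small sketch that expresses a single round-trip latency as a fraction of a 60 fps frame. The 10 ns and 10 us values are the ones from the post; counting strictly serial round trips is an added illustration, not something claimed in the thread.

```python
def frame_budget_us(fps):
    """Time available per frame, in microseconds."""
    return 1_000_000 / fps

def share_of_frame(latency_us, fps=60):
    """Fraction of one frame consumed by a single round trip of this latency."""
    return latency_us / frame_budget_us(fps)

budget = frame_budget_us(60)
for label, latency_us in (("10 ns", 0.01), ("10 us", 10.0)):
    print(f"{label}: {share_of_frame(latency_us):.6%} of a {budget:.0f} us frame")

# Illustration: how many fully dependent 10 us round trips fit in one frame.
serial_round_trips = int(budget // 10.0)
print(f"Serial 10 us round trips per frame: {serial_round_trips}")
```

A single round trip is negligible either way; latency only starts to bite when many dependent round trips have to happen back to back within the same frame.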
If latency significantly affected performance and the 1/5th figure was true, something like a next-gen Cell would be equal to about 10 state-of-the-art 2012 GPUs in physics-related tasks. If the 1/5th figure is true but the latency claim is not, or latency does not affect performance, the two might be similar. If neither the 1/5th comment is true nor latency affects performance, the balance might turn heavily towards the GPU, depending on which architecture is generally better for physics.
That's a lot of qualifiers and they're all wrong. A hypothetical 45nm 32 SPU Cell, as was once proposed before Cell died off, would maybe be the equal of an RV770 GPU from 2008 in flops crunching for physics.
Ninjaprime
07-Dec-2011, 01:17
Both Cell and Larrabee are dead, it doesn't matter how good they are.
Larrabee may be dead but Knights Corner has risen from the ashes, and it's already planned to be in a 10 Petaflop supercomputer, and has actual products with actual specs.
steampoweredgod
07-Dec-2011, 02:29
I don't know where you heard that, but I'm dubious of its truth. Current 45nm Cell chips have a TDP in the 45-50 watt range, with the SPEs taking up ~40% of the die, IIRC. That would mean the ~40% of the die that is doing most of the work is only generating ~20% of the heat.
The second revision of Cell, years ago, did around 100 Gflops:
IBM Roadrunner uses the PowerXCell 8i version of the Cell processor, manufactured using 65 nm technology and enhanced SPUs that can handle double precision calculations in the 128-bit registers, reaching double precision 102 GFLOPs per chip.
100 SPEs by themselves are paperweights without a 12x scaled-up EIB and 12 more PPEs to deliver the bandwidth and scheduling they need. Even assuming the 1 watt each were true, 12 PPE cores and, more importantly, a 12x scaled-up EIB would be quite the power hog. This is also saying nothing of the die you are putting it on... 22nm KC is an actual product that has an actual die; I don't know if you could even fit 100 SPEs within the limits of a normal die. The 45nm Cell is 115mm^2, so at ~40% of the die that puts the 8 SPEs at 46mm^2, or 5.75mm^2 each. So your 100 SPE chip is 575mm^2 with just the SPEs, near the limits of a die, with no memory interface, no EIB, no PPEs. Considering the EIB scales horribly, this is a major problem.
Note I said I wasn't necessarily speaking of single-chip solutions. Three 32-SPE Cell chips should be viable and should not have extraordinary power requirements; as others commented, the memory architecture scales far better in terms of power consumption and heat generation than coherent caches.
IBM has said the EIB can be swapped for a crossbar without problems, iirc.
I'm not completely clear on what you're saying here. That a GPU flop is 1/5th of a normal CPU flop, maybe? That's certainly not true. Look at DGEMM benches: while utilization might be lower on a GPU, it certainly isn't 1/5th. Fermi reaches about 60% or so of its theoretical peak in DGEMM. KC is even better. KC is pretty well in line with a CPU, I think.
In another thread it was mentioned that programmable flops were 1/5th, the rest being fixed function, though that thread is a bit old. I think it was the "is there anything the cell can do better than a modern cpu gpu" thread.
Doesn't really matter. Latency is relative. 10 nanoseconds versus 10 microseconds might seem like a lot on paper, but to a game with 16,666 microseconds between frames it's nothing. Want a real-world test of this? Look at Nvidia and PhysX. If it mattered, and GPU physics was hard, it wouldn't exist.
Even if latency adversely affects performance, physics can still be offered as long as the remaining performance is reasonable; it wouldn't necessarily make it impossible. It would only mean that a more suitable processor would offer that much more.
A 10x slowdown, while still viable, can still mean that something 10x faster can do 10x more. If there are several orders of magnitude of difference in speed, we could hypothetically see orders of magnitude of difference in results. This can put a next-gen Cell anywhere from being comparable to a 560 to substantially exceeding the 2012 GPUs in physics, depending on how latency would affect physics calculations.
Some early investigations into PhysX performance showed that the library uses only a single thread when it runs on a CPU. This is a shocker for two reasons. First, the workload is highly parallelizable, so there's no technical reason for it not to use as many threads as possible; and second, it uses hundreds of threads when it runs on an NVIDIA GPU. So the fact that it runs single-threaded on the CPU is evidence of neglect on NVIDIA's part at the very least, and possibly malign neglect at that... But the big kicker detailed by Kanter's investigation is that PhysX on a CPU appears to exclusively use x87 floating-point instructions, instead of the newer SSE instructions...
The x87 floating-point math extensions have long been one of the ugliest legacy warts on x86. Stack-based and register-starved, x87 is hard to optimize and needs more instructions and memory accesses to accomplish the same task than comparable RISC hardware.
-link (http://arstechnica.com/gaming/news/2010/07/did-nvidia-cripple-its-cpu-gaming-physics-library-to-spite-intel.ars)
This hints that even traditional CPUs might not be that bad at physics compared to GPUs.
upnorthsox
07-Dec-2011, 03:04
I don't know where you heard that, but I'm dubious of its truth. Current 45nm Cell chips have a TDP in the 45-50 watt range, with the SPEs taking up ~40% of the die, IIRC. That would mean the ~40% of the die that is doing most of the work is only generating ~20% of the heat.
not quite
Power consumption of the 45 nm CELL processor is less than forty-percent that of the 90 nm CELL processor – now less than 20 watts.
http://www.realworldtech.com/page.cfm?ArticleID=RWT022508002434
rpg.314
07-Dec-2011, 04:00
Larrabee may be dead but Knights Corner has risen from the ashes, and it's already planned to be in a 10 Petaflop supercomputer, and has actual products with actual specs.
But we don't know much about it.
Ninjaprime
07-Dec-2011, 09:14
Second revision of cell years ago did around 100Gflops
I meant more about the 1 watt each thing, though upnorthsox's link has resolved that. I searched for around 30 min trying to find a TDP for 45nm Cell and the only one I could find was that it was "just below 50 watts." It was on RWT right there the whole time. /shrug
Three 32-SPE Cell chips should be viable and should not have extraordinary power requirements; as others commented, the memory architecture scales far better in terms of power consumption and heat generation than coherent caches.
Maybe, I'm not so sure. If an 8 SPE Cell chip at 45nm is 115mm^2, it stands to reason that a Cell chip with 4 times the SPEs would need 4x the PPE control cores, 4x the EIB, etc. While I don't think it would have to be 4x as big, I think it's fairly reasonable to expect it to be at least 3x as big, which is around ~350mm^2, 50% bigger than the original Cell at launch. Putting 3 of these on a board seems unlikely. You're talking over 1000mm^2 worth of dies on a board. Maybe a supercomputing board, but I thought this was about game physics?
IBM has said the EIB can be swapped for a crossbar without problems, iirc.
Yeah, but the original reason for using the EIB over a crossbar was that it's smaller than a crossbar and scales better, so that goes the opposite way from what you want.
In another thread it was mentioned that programmable flops were 1/5th, the rest being fixed function, though that thread is a bit old. I think it was the "is there anything the cell can do better than a modern cpu gpu" thread.
Sounds ancient, like before unified shaders ancient.
Even if latency adversely affects performance, physics can still be offered as long as the remaining performance is reasonable; it wouldn't necessarily make it impossible. It would only mean that a more suitable processor would offer that much more.
A 10x slowdown, while still viable, can still mean that something 10x faster can do 10x more. If there are several orders of magnitude of difference in speed, we could hypothetically see orders of magnitude of difference in results. This can put a next-gen Cell anywhere from being comparable to a 560 to substantially exceeding the 2012 GPUs in physics, depending on how latency would affect physics calculations.
Still doesn't matter in the context of game physics. 10x latency doesn't mean you lose 10x performance; it means you use 10x the latency-hiding tricks and tweaks. If performance was lost to latency, no consoles would use unified memory, because that's a latency beast. Most devs prefer it for its benefits, though.
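One way to make the latency-hiding point concrete is Little's law: the number of independent requests that must be in flight equals latency multiplied by issue rate. The sketch below uses made-up, purely illustrative latencies (100 ns and 1000 ns) and a hypothetical issue interval of 1 ns; none of these are measurements of any real chip.

```python
import math

def in_flight_needed(latency_ns, issue_interval_ns):
    """Little's law: independent requests needed in flight to hide the latency."""
    return math.ceil(latency_ns / issue_interval_ns)

def throughput_per_ns(latency_ns, issue_interval_ns, in_flight):
    """Requests completed per ns, capped by the unit's issue rate."""
    return min(in_flight / latency_ns, 1.0 / issue_interval_ns)

# Hypothetical numbers: a 10x latency gap at the same issue rate.
low_latency_ns, high_latency_ns, issue_ns = 100, 1000, 1

need_low = in_flight_needed(low_latency_ns, issue_ns)
need_high = in_flight_needed(high_latency_ns, issue_ns)

# With enough independent work in flight, both reach the same throughput.
hidden = throughput_per_ns(high_latency_ns, issue_ns, need_high)
# With only the low-latency amount of parallelism, the 10x gap shows up in full.
exposed = throughput_per_ns(high_latency_ns, issue_ns, need_low)

print(f"In flight needed: {need_low} vs {need_high}")
print(f"High-latency throughput, hidden: {hidden}, exposed: {exposed}")
```

So the 10x latency costs 10x the independent work in flight rather than 10x the throughput, provided the workload actually has that much parallelism to offer.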
This hints that even traditional CPUs might not be that bad at physics compared to GPUs.
This is completely true, but you have to take it in context. It means that CPUs aren't 20x as bad at PhysX as a GPU; they are only 3-4x as bad. That article is also fairly old, and PhysX has since been multithreaded and updated, IIRC.
Ninjaprime
07-Dec-2011, 09:16
But we don't know much about it.
Not everything, yet, but we know miles more about it than we did about Larrabee. It's already far more of a real product than Larrabee was at any stage. You can actually get 32-core demo boards from Intel right now; I don't think that was ever really the case with Larrabee.
steampoweredgod
07-Dec-2011, 10:46
Maybe, I'm not so sure. If an 8 SPE Cell chip at 45nm is 115mm^2, it stands to reason that a Cell chip with 4 times the SPEs would need 4x the PPE control cores, 4x the EIB, etc. While I don't think it would have to be 4x as big, I think it's fairly reasonable to expect it to be at least 3x as big, which is around ~350mm^2, 50% bigger than the original Cell at launch. Putting 3 of these on a board seems unlikely. You're talking over 1000mm^2 worth of dies on a board. Maybe a supercomputing board, but I thought this was about game physics?
The SPEs do not need heavy involvement from the PPE. From what I hear the involvement can be quite small, and they can be pretty much autonomous, so a stronger PPE can very likely deal with more SPEs. As for the EIB, isn't that a simple ring bus? While performance might suffer, it shouldn't take much space.
The old Cell servers can reach 6 Tflops and are among the greenest, most power-efficient architectures. We could compare the old boards using 45nm 9-core Cells, or upcoming 28nm tech if you like.
Multi-chip solutions can easily go into consumer boards if we're talking PC consumers.
GF110 shares the same 530mm^2 die... Multiple such chips can be put in a PC.
Yeah, but the original reason for using the EIB over a crossbar was that it's smaller than a crossbar and scales better, so that goes the opposite way from what you want.
Edit: you are correct, but simple interconnects are still said to be viable up to 64-core chips.
The old Ageia PPU, with about 125M transistors, easily had what seems like 17 cores, with what is said to be a similar design.
Sounds ancient, like before unified shaders ancient.
I think Xenos was mentioned in that thread, and that is unified, though I'd have to check.
Still doesn't matter in the context of game physics. 10x latency doesn't mean you lose 10x performance; it means you use 10x the latency-hiding tricks and tweaks. If performance was lost to latency, no consoles would use unified memory, because that's a latency beast. Most devs prefer it for its benefits, though.
If cost savings are good enough, they can outweigh performance losses. The Cell is fed by XDR for performance reasons; I'm not sure how that fares latency-wise. Considering the importance of ever larger caches in modern CPUs, it seems clear latency matters for at least some tasks; even GPUs have local memory to help them out.
This is completely true, but you have to take it in context. It means that CPUs aren't 20x as bad at PhysX as a GPU; they are only 3-4x as bad. That article is also fairly old, and PhysX has since been multithreaded and updated, IIRC.
That article is about a year old, if what it says is correct. We'd have to check that nothing horrible has been going on in more recent PhysX code. And you have to remember that PPUs, which resemble Cell (and Cell itself), are said to be more apt for physics than GPUs. PPUs in their time were claimed to be up to 200 times faster than CPUs at some physics tasks.
According to some comments here on beyond3d, the Cell outclassed a 4-core i7 in a developer's physics code, to the point that the difference was visually noticeable, causing them to use the PS3 version of the software.
If Cell and PPU-like hardware is truly a more optimal design for physics calculations, I have a hard time seeing the more optimal silicon losing at these tasks to a less optimal use of silicon.
rpg.314
07-Dec-2011, 17:11
Not everything, yet, but we know miles more about it than we did about Larrabee. It's already far more of a real product than Larrabee was at any stage. You can actually get 32-core demo boards from Intel right now; I don't think that was ever really the case with Larrabee.
No. There was a lot of technical disclosure regarding Larrabee and there has been none so far for KC. The 32 core boards are rebadged larrabee chips.