Larrabee and Intel's acquisition of Neoptica

Uhm, I think you're both thinking of two different advantages of caches: saving memory bandwidth and hiding memory latency. SoEMT will reduce the importance of caches for hiding memory latency, but may actually increase their importance in terms of saving bandwidth, as thrashing may go up. The same principles also apply to GPUs.
 
Assuming SoEMT, it's just as hoho said. In the case of a cache miss, a core can simply switch to another thread until the data for the previous thread is retrieved.
At best, the other thread will re-use the same data the first thread was using. But it can just as well evict data from the cache that the first thread is going to need in the near future, at least in a worst-case scenario.
In the general case, two threads working on different data sets will require a larger cache, not a smaller one.
A side-effect of this is that smaller caches could theoretically be used.
Again, how can a second thread increase the efficiency of your cache so that you might end up using a smaller one?
 
I guess the question is how often you're bandwidth limited vs. how often you're idling because of memory latency. For the quad-core Nehalem though, you're looking at a 192-bit DDR3 IMC... So I don't think latency is too much of a concern for once! ;)
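For reference, a quick sketch of what a 192-bit DDR3 IMC works out to in peak bandwidth (the DDR3 transfer rates below are my assumptions for illustration, not something stated in the thread):

// Peak theoretical bandwidth = bus width in bytes * transfers per second.
// The DDR3 speed grades used here are assumptions, purely for illustration.
#include <cstdio>

static double peak_gb_per_s(int bus_bits, double mega_transfers_per_s) {
    return (bus_bits / 8.0) * mega_transfers_per_s / 1000.0;   // GB/s
}

int main() {
    printf("192-bit DDR3-1066: %.1f GB/s\n", peak_gb_per_s(192, 1066.0));   // ~25.6
    printf("192-bit DDR3-1333: %.1f GB/s\n", peak_gb_per_s(192, 1333.0));   // ~32.0
}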
 
Uhm, I think you're both thinking of two different advantages of caches: saving memory bandwidth and hiding memory latency. SoEMT will reduce the importance of caches for hiding memory latency, but may actually increase their importance in terms of saving bandwidth, as thrashing may go up. The same principles also apply to GPUs.
I see your point, but SoEMT is mostly a cheap way to put idle units to some use while paying a fairly small cost for it. It's not an opportunity to constantly thrash your caches hoping that your improved ability to hide memory access latency will save your arse, not in the general case for sure.
Unless you never re-use your data, but then you don't need a cache in the first place.
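To put a toy example behind the eviction argument, here's a minimal sketch of one cache being worked by two interleaved instruction streams with disjoint data sets (the cache geometry, working-set sizes and addresses are all arbitrary assumptions of mine):

// Toy direct-mapped cache: 64 lines of 64 bytes = 4KB total (assumed sizes).
#include <cstdio>
#include <cstdint>

struct ToyCache {
    static const int kLines = 64;
    uint64_t tag[kLines];
    bool valid[kLines];
    long misses;

    ToyCache() : misses(0) { for (int i = 0; i < kLines; ++i) valid[i] = false; }

    void access(uint64_t addr) {
        uint64_t line = addr >> 6;              // 64-byte lines
        int set = (int)(line % kLines);
        if (!valid[set] || tag[set] != line) {  // miss: fill the line
            valid[set] = true;
            tag[set] = line;
            ++misses;
        }
    }
};

int main() {
    const int kWorkingSet = 3 * 1024;           // 3KB per stream: fits alone, not combined
    const int kPasses = 10;

    // A single stream re-reading its own 3KB: only the first pass misses.
    ToyCache one;
    for (int p = 0; p < kPasses; ++p)
        for (int b = 0; b < kWorkingSet; b += 64)
            one.access((uint64_t)b);

    // Two interleaved streams with disjoint 3KB sets that alias to the same
    // cache sets: each access evicts a line the other stream is about to need.
    ToyCache two;
    for (int p = 0; p < kPasses; ++p)
        for (int b = 0; b < kWorkingSet; b += 64) {
            two.access((uint64_t)b);                // "thread" A
            two.access((uint64_t)(0x100000 + b));   // "thread" B, different data
        }

    printf("one stream : %ld misses\n", one.misses);
    printf("two streams: %ld misses\n", two.misses);
}

With these numbers the single stream misses only on its first pass (48 misses), while the interleaved pair misses on every single access (960 misses): the "two threads need a larger cache" effect in miniature.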
 
Add to that: has even a 4-core Penryn come even remotely close to the amount of math the years-old R580 can do?

If Folding@Home is any indication of relative real-world workloads and potential physics performance, then it'll still be years (Larrabee, perhaps?) until Intel matches R580, much less anything newer.

Then again I'm not an expert in this area so I could be attributing far too much importance to GPU performance in FAH.

And the amount of physics used in Crysis is indeed orders of magnitude higher than in Far Cry. It's not just a simple evolutionary rise in usage. The complexity of the calculations might have only gone up slightly, but the sheer number of calculations going on in any given scene absolutely dwarfs what Far Cry used.

It can, of course, be argued that games don't need that level of physics nor that number of calculations per scene. But I would argue it goes a long way towards immersion and the all-important WOW factor. I'll be a sad buddha if the trend reverts to doing less because it isn't "needed."

Regards,
SB
 
Arun, your arguments are correct but you're making the general mistake of only looking at the high-end. The average system sold today does not come with a GeForce 8800, but it does have dual-core. In fact more than half of all systems are laptops, so 200+ watt GPUs will never become the norm. Lots of people, including occasional gamers, are even content with integrated graphics or a low-end card. Sure we'll see multi-teraflop GPUs in the not too distant future, but we'll see the average system equipped with quad-cores sooner.

So basically the average system has a potent CPU but a modest GPU. For games this means the GPU should be used only for what it's most efficient at: graphics. CPUs are more interesting for everything else. I come to the same conclusion when looking at GPGPU applications. Only with high-end GPUs do they achieve a good speedup, and even then the actual work throughput versus available GFLOPS is often laughable.

So I believe that the arrival of teraflop GPUs will have an insignificant effect on the current balance. Advanced multi-core CPUs, on the other hand, will be adopted relatively quickly, making it ever more attractive to keep things like physics on the CPU rather than offload them anywhere else. The megahertz race is over, but the multi-core race has just begun and has some catching up to do.

From an architectural point of view, GPUs have only three limited types of data access:
- Registers: very fast, but not suited for storing actual data structures.
- Texture cache: very important to reduce texture sampling bandwidth. Close to useless for other access patterns.
- RAM: high bandwidth but high latency. Compression techniques are only useful for graphics.

For CPUs this becomes:
- Registers: extremely fast and since x64 no longer a big performance limiter.
- L1 cache: very fast and practically an extension of the register set. Suited for holding actual data sets.
- L2 cache: a bit slower but can hold the major part of the working set.
- RAM: not that high bandwidth, but still tons of potential when multi-core increases the need.

So we'd have to see major changes in GPU architecture to make them more suited for non-graphics tasks, likely affecting their graphics performance. That might be ok for the high-end, but the mid- and low-end have no excess performance for anything else. The CPU, on the other hand, is already well on its way to being able to handle larger workloads, and it effectively ends up in every system anyway.
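To put rough numbers on the register/L1/L2/RAM gap listed above, a pointer-chasing sketch (the working-set sizes and iteration count are my assumptions; tune them for the CPU at hand). Each load depends on the previous one, so the time per step approximates the latency of whichever level the working set fits in:

// Dependent (pointer-chasing) loads through a random cyclic permutation.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double ns_per_load(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t(0));
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    std::vector<size_t> next(n);
    for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];                 // one big cycle through every element

    size_t cur = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) cur = next[cur];   // each load waits on the previous one
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = cur; (void)sink;        // keep the chase from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    const size_t steps = 20 * 1000 * 1000;
    printf("16KB (L1-resident): %.1f ns/load\n", ns_per_load(16u << 10, steps));
    printf("1MB  (L2-resident): %.1f ns/load\n", ns_per_load(1u << 20, steps));
    printf("64MB (mostly RAM) : %.1f ns/load\n", ns_per_load(64u << 20, steps));
}

Expect something like a nanosecond or a few while the set fits in L1 or L2, and tens to a hundred-plus nanoseconds once it spills to RAM; exact figures vary by CPU.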
 
From an architectural point of view, GPUs have only three limited types of data access:
- Registers: very fast, but not suited for storing actual data structures.
- Texture cache: very important to reduce texture sampling bandwidth. Close to useless for other access patterns.
- RAM: high bandwidth but high latency. Compression techniques are only useful for graphics.
DX10 GPUs have constant buffers and associated caches.
G80 also exposes a fast on-chip memory through CUDA.
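For illustration, a minimal CUDA sketch of those two on-chip resources: a __constant__ array for small read-only parameters and per-block __shared__ memory used as a scratchpad staged from device RAM (the kernel, sizes and coefficients are purely illustrative assumptions):

#include <cstdio>

__constant__ float coeff[2];                    // small read-only data, cached on chip

__global__ void smooth(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // fast per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];       // stage each value from device RAM once
    __syncthreads();
    if (i < n) {
        // The neighbour read below is served from shared memory, not from RAM again.
        float x = tile[threadIdx.x];
        float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n) ? tile[threadIdx.x + 1] : x;
        out[i] = coeff[0] + coeff[1] * 0.5f * (x + right);
    }
}

int main() {
    const int n = 1024;
    float host_coeff[2] = {0.0f, 1.0f};
    cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    smooth<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in);
    cudaFree(d_out);
}

The shared memory on G80 is 16KB per multiprocessor, so a 1KB tile like this is a small fraction of it; constant memory is likewise small but very cheap when all threads read the same element.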
 
Arun, your arguments are correct but you're making the general mistake of only looking at the high-end. The average system sold today does not come with a GeForce 8800, but it does have dual-core.
And they'll play old games that already use the CPU for physics anyway. Your argument only stands if this dynamic remains true going forward, which it very possibly won't IMO (more below).

In fact more than half of all systems are laptops
I don't know where you get your stats, but I suspect you should reconsider your source. Even the most optimistic estimates don't expect that to happen before 2010.

Lots of people, including occasional gamers, are even content with integrated graphics or a low-end card.
There's the gaming market, and then there's the gaming market. I could have a lot of fun playing casual games and 3-5 year old games, but that's not what I'm doing. There will ALWAYS be a market for games that run on IGPs. I say IGPs because low-end cards will die within the next 2 years, as they'll become essentially pointless: if you look at AMD's and NVIDIA's upcoming DX10 IGPs, they're practically good enough for Windows 7 and for a few years after that. All you'll see after that are incremental increases in performance & video decoding quality, imo.

But just because there will be a segment for games on what amounts to a $300 PC doesn't mean there won't be a market for games above that; and as has traditionally been the case, these two will be completely separate.

Sure we'll see multi-teraflop GPUs in the not too distant future, but we'll see the average system equipped with quad-cores sooner.
I have massive doubts that more than 2-3 cores makes sense in the 'low-end commodity PC market'. If a game aims at the low end, it should aim at that, not at some artificial segment with average performance that nobody really fits in. Anyway, that's arguable, but on to the next point now...

So basically the average system has a potent CPU but a modest GPU. For games this means the GPU should be used only for what it's most efficient at: graphics.
You assume this to remain true: once again, it will not. Rather than looking at the present, it might be a good idea to try and look at the future instead. In the 2010-2011 timeframe (32nm), dual-cores will remain widely available for the low-end of the market. These will be paired with G86-level graphics performance in the ultra-low-end, with probably a higher ALU:TEX ratio. So in that segment of the market, you'll see maybe 200GFlops on the GPU and 75GFlops on the CPU.

And indeed, I can't really imagine any circumstance where offloading the physics to the GPU makes sense there, but GFlops still aren't massively in the CPU's favor and this market would mostly play casual and old games; amusingly, given the performance of current GPUs in DX10 games, they would presumably still play DX9 games!

Now, look at another segment of the market: $120 CPU, $120 GPU, $60 Chipset. In 2010-2011, that would probably correspond to a 3GHz+ quad-core on the CPU side of things (with a higher IPC than Penryn). That's about 150GFlops, maybe. On the GPU side of things, however, you'll easily have more than 1TFlop: just take RV670, which manages 500GFlops easily on 55nm at 190mm2. It's not exactly hard to predict where things will go with 40nm and 32nm...
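For anyone following the arithmetic, a quick sketch of how such peak numbers are usually derived (the per-cycle throughput figures are my assumptions about the respective parts, not something from this thread):

// Back-of-the-envelope peak FLOPS: units * clock (GHz) * FLOPs per unit per cycle.
#include <cstdio>

static double peak_gflops(int units, double ghz, int flops_per_unit_per_cycle) {
    return units * ghz * flops_per_unit_per_cycle;
}

int main() {
    // Penryn-class quad-core at ~3GHz: one 4-wide SSE mul + one 4-wide SSE add
    // per core per cycle = 8 FLOPs per core per cycle. (Getting to ~150GFlops
    // implies wider or faster units than this.)
    printf("quad-core CPU  : ~%.0f GFLOPS\n", peak_gflops(4, 3.0, 8));       // ~96

    // RV670-class GPU: 320 stream processors at ~775MHz, one MAD (2 FLOPs)
    // per processor per cycle.
    printf("RV670-class GPU: ~%.0f GFLOPS\n", peak_gflops(320, 0.775, 2));   // ~496
}

Which is roughly where the ~100GFlops-CPU versus ~500GFlops-GPU comparison in these posts comes from.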

Only with high-end GPUs do they achieve a good speedup, and even then the actual work throughput versus available GFLOPS is often laughable.
Uhm, that's just wrong. In apps that make sense for their current architecture, and there are *plenty* of them, the efficiency in terms of either GFlops or bandwidth (whichever is the bottleneck) is perfectly fine.

The megahertz race is over, but the multi-core race has just begun and has some catching up to do.
Yes, it has a lot of catching up to do in terms of, as you kinda said yourself, politics and hype. You don't seem to realise that when predicting the future configurations of PCs (i.e. are most consumers going to go with a $300 CPU and a $150 GPU, or a $100 CPU and a $350 GPU?), what matters is what decisions the developers take. If the CPU is never the bottleneck, then why would you want more than a $100 CPU anyway?

If physics acceleration on the GPU doesn't take off, then obviously you'll want more than a $100 CPU. But if it does, then who knows - and that's why NVIDIA and ATI are so interested in it. They want to increase their ASPs at the CPU's expense, and there is no fundamental reason why they cannot succeed. It's all about their execution against Intel's.

And if what happens is that GPUs capture more out of a PC's ASPs, then you're looking at $100 CPUs being paired with $500 GPUs. Heck, as I said I'm already a pioneer in that category - slightly overclocked E4300 ($150) with a $600 GPU. The difference in GFlops between the two is kind of laughable, really, and in this case GPU Physics would clearly make sense. *That* is the dynamic that NVIDIA and AMD are trying to encourage, and that's why it's a political question, not really a technical one (although perf/watt for CPUs vs GPUs for physics also matters).

I might be right, or you might be right, but we aren't personally handling any of these companies' mid-term strategies, so I wouldn't dare claim anything with absolute certainty given that it's not even really a technical debate from my POV. I do agree that a 100GFlops CPU is just fine for very nice physics, but I do not believe that closes the debate either.

So we'd have to see major changes in GPU architecture to make them more suited for non-graphics tasks, likely affecting their graphics performance.
Those are already happening and barely affecting graphics performance, as they are mostly minor things and the major things can easily be reused for graphics. There is no fundamental reason why this will not keep happening.

P.S.: I just thought I'd point out that I do NOT consider Larrabee to be a CPU here, but that obviously it might be a very interesting target for GPGPU/Physics. A heterogeneous chip with Sandy Bridge and Larrabee cores on 32nm, if the ISA takes off, ought to be much better than a GPU for Physics and other similar workloads either way (assuming there's enough memory bandwidth). So this is another important dynamic of course, and arguably much nearer to the discussion subject of this thread... :)
 
I see your point, but SoEMT is mostly a cheap way to put idle units to some use while paying a fairly small cost for it. It's not an opportunity to constantly thrash your caches hoping that your improved ability to hide memory access latency will save your arse, not in the general case for sure.
Unless you never re-use your data, but then you don't need a cache in the first place.

Wouldn't fine-grained round-robin or a hybrid scheme like Niagara be more effective?

SoEMT is rather pessimistic about data sharing and coherent thread behavior.
It's suited to long-latency events where speculation within the same thread is mostly pointless, but it also assumes that there's a somewhat limited amount of non-speculative work in other threads, which is why it will stick with a thread for quite a stretch between events.

If the workload has massive amounts of non-speculative work available in other threads with a high likelihood that they are working in the same place, why bother running with just one thread many instructions ahead of the pack when each instruction taken increases the chance of tripping up one of the other threads through a cache invalidation?
 