22 nm Larrabee

If it is used like a register file, yes.
The reason is simple: a reg file is (normally) directly addressed; you don't need to check whether the data is in there, nor which way of a highly associative cache it sits in. Furthermore, a cache is likely physically further away than the regfile (especially the split ones for each lane of a SIMD unit) in current GPUs. And driving data over long wires costs a lot of energy.

you aren't taking into account the relative size of a cache and a register file per bit. A cache will generally have much less power per bit than a register file.
 
Both the register file and L1 cache consume a significant amount of power as a result of their nearly constant use.
Designing a cache to provide the same level of porting and bandwidth as a register file would significantly add to the cost.

it just devolves into a register file as you add ports. As far as bandwidth, it is fairly easy.

The space of register identifiers is vanishingly small compared to the address space; as such, it is much simpler to check for dependences amongst a few hundred IDs than across 2^64 addresses.

you don't need to compare the whole address space, only a hash of it. And not a big hash either.
 
it just devolves into a register file as you add ports. As far as bandwidth, it is fairly easy.
It would devolve into a register file that needs an address translation for each access and would have latency ranging from several cycles to hundreds or thousands.
Each access in a speculative memory pipeline would also go through a decent chunk of the die devoted to the load/store and speculation hardware.
Each port seems noticeably more expensive.

I would be fascinated to see the description of a memory-based register system handling an exception from a register access with a handler whose register accesses can raise exceptions.
I think there's an internet meme for that.

you don't need to compare the whole address space, only a hash of it. And not a big hash either.

A hash that uses a TLB entry handle to reduce the number of bits?
 
It would devolve into a register file that needs an address translation for each access and would have latency ranging from several cycles to hundreds or thousands.
Each access in a speculative memory pipeline would also go through a decent chunk of the die devoted to the load/store and speculation hardware.
Each port seems noticeably more expensive.

No it wouldn't. There is little difference between a register file and a cache RAM, except that the cache RAM is more compact and significantly lower power. Both can have a lot of sideband hardware, but it doesn't tend to dominate their characteristics at all.

I would be fascinated to see the description of a memory-based register system handling an exception from a register access with a handler whose register accesses can raise exceptions.
I think there's an internet meme for that.

You mean the thing that hardly ever occurs, occurs on a sidepath, and doesn't require anything to be done unless it does occur, that one? And FYI, register accesses can raise exceptions. I can give you a list of chips if you want.


A hash that uses a TLB entry handle to reduce the number of bits?

A HASH, aka X = ^A[Y+Z:Y] etc. You aren't looking for an exact match, you only need a conservative collision check: if the hashes differ, the addresses definitely differ.
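
To make that concrete, here is a minimal sketch in C of the kind of XOR-fold hash that expression describes, used as a conservative disambiguation check (the 12-bit slice width and the function names are just illustrative choices, not anything from a real design):

[code]
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Fold a 64-bit address down to a small tag by XOR-ing fixed-width slices,
 * in the spirit of X = ^A[Y+Z:Y]. The 12-bit slice width is an arbitrary
 * illustrative choice. */
static inline uint16_t addr_hash(uint64_t addr)
{
    uint16_t h = 0;
    for (int shift = 0; shift < 64; shift += 12)
        h ^= (uint16_t)((addr >> shift) & 0xFFF);
    return h;
}

/* Conservative dependence check: if the hashes differ, the addresses
 * definitely differ; if they match, assume a possible conflict and
 * fall back to the slow path. */
static inline bool may_conflict(uint64_t a, uint64_t b)
{
    return addr_hash(a) == addr_hash(b);
}

int main(void)
{
    uint64_t a = 0x00007f2a31c0ull, b = 0x00007f2a31c0ull + 4096;
    printf("hash(a)=%03x hash(b)=%03x may_conflict=%d\n",
           (unsigned)addr_hash(a), (unsigned)addr_hash(b), (int)may_conflict(a, b));
    return 0;
}
[/code]

The point being that a handful of XOR gates per entry is enough to rule out most false dependences without ever comparing full 64-bit addresses.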
 
You mean the thing that hardly ever occurs, occurs on a sidepath, and doesn't require anything to be done unless it does occur, that one? And FYI, register accesses can raise exceptions. I can give you a list of chips if you want.
Sure, I'd be interested to read up on those.

edit: As far as exceptions on register access go, does that include failing an ECC check?
 
RT generally exhibits a strong locality of memory accesses for neighboring pixels/samples. The caches should do quite okay.

Only for primary rays. And there's no point in using RT in the first place if all you are going to do is shoot primary rays. The real challenge is in ray tracing secondary and occlusion rays efficiently.

Besides, what you said just reaffirms what I said earlier. GPU caches help with horizontal reuse, not vertical. And you NEED high vertical reuse if you want to run with fewer threads at high efficiency. Shoving many low-workitem kernels into a CU/SM doesn't help, as you won't have enough threads to get high horizontal reuse either.
 
There is additional compute density to be had on GPUs as well.
Doubtful. TMUs and the rasterizer can't be removed without nuking your graphics perf. ROPs can't be removed unless you do sort middle rasterization, which is too big a change. There's not much fixed-function hardware left to remove without doing serious surgery on the GPU architecture.

nVidia at least is predicting up to 3Ghz shader clocks in the next few years on GPU parts.
Doubtful. Those clocks would need 2x more threads for same mem latency.
 
Only for primary rays. And there's no point in using RT in the first place if all you are going to do is shoot primary rays. The real challenge is in ray tracing secondary and occlusion rays efficiently.
If your secondary rays already show no locality, your resolution is too low. ;)
 
If your secondary rays already show no locality, your resolution is too low. ;)
…or your surfaces are finely detailed (like the metalish-reflective spokes in a wheel). Oh wait, that's what we really want, don't we?
 
No it wouldn't. There is little difference between a register file and a cache RAM, except that the cache RAM is more compact and significantly lower power. Both can have a lot of sideband hardware, but it doesn't tend to dominate their characteristics at all.
I think you're using the more hardware-centric definition of a register file as necessarily not being based on SRAM, or at least much more expensive SRAM? If so that doesn't apply because (as far as I can tell) GPUs frequently use L1-like SRAM for their register file as they can tolerate the inherently higher latency.

Doubtful. TMUs and the rasterizer can't be removed without nuking your graphics perf. ROPs can't be removed unless you do sort middle rasterization, which is too big a change. There's not much fixed-function hardware left to remove without doing serious surgery on the GPU architecture.
Ah, but who talked of removing them? The ALU-FF ratio will just naturally increase over time.

And there are tons of ideas that have been proposed over the years to reduce the amount of fixed-function hardware and improve programmability, although many of them might not really save much hardware because of the increased cost of data movement:
- Keep texture addressing in HW, but do texture filtering in the shader core, at least for all >8-bit and FP formats (see the sketch after this list). Arguably slightly less likely to be partial now that there's an 8-bit compressed FP format in DX11 (as opposed to FP10, which is really 32-bit); if this happens, it should be for all filtering, which is a pretty controversial step.
- Handle blending in the shader core. This is already done not only on PowerVR hardware, where it's easier because it's a TBDR, but also on Tegra, which is an IMR. Some of the collision checking means it probably doesn't save much, but it's useful. And while you're at it, non-linear color spaces and non-traditional AA techniques will become more frequent, so you might as well do MSAA resolve in the shader core as well (but properly, hi R600!)
- Do triangle setup in the shader core. Intel IGPs already do that (or did a few generations ago at least). One historical problem with that, ironically enough, was that FP32 wasn't enough for the corner cases unless you did things rather obtusely, IIRC. With FP64 becoming mainstream that's no longer a problem, although it may or may not still hurt power efficiency.
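
For the first point, as a rough illustration of what "filtering in the shader core" boils down to, here's a minimal scalar sketch in C (the texture size, clamp addressing and helper names are made up for the example; real hardware would obviously vectorize this and keep the addressing/decompression side in the TMU):

[code]
#include <math.h>
#include <stdio.h>

/* RGBA texel; c[0..3] = r, g, b, a. */
typedef struct { float c[4]; } texel_t;

/* Tiny placeholder texture so the sketch compiles stand-alone; a real
 * implementation would read from the (TMU-addressed) texture cache. */
#define TEX_W 4
#define TEX_H 4
static texel_t tex[TEX_H][TEX_W];

static texel_t fetch_texel(int x, int y)
{
    /* Clamp addressing, normally handled by the texture addressing HW. */
    x = x < 0 ? 0 : (x >= TEX_W ? TEX_W - 1 : x);
    y = y < 0 ? 0 : (y >= TEX_H ? TEX_H - 1 : y);
    return tex[y][x];
}

/* Bilinear filtering done "in the shader": four point fetches plus
 * per-channel lerps using the fractional texel coordinates. */
static texel_t bilinear_filter(float u, float v)
{
    int   x0 = (int)floorf(u), y0 = (int)floorf(v);
    float fu = u - (float)x0,  fv = v - (float)y0;

    texel_t t00 = fetch_texel(x0, y0),     t10 = fetch_texel(x0 + 1, y0);
    texel_t t01 = fetch_texel(x0, y0 + 1), t11 = fetch_texel(x0 + 1, y0 + 1);

    texel_t out;
    for (int i = 0; i < 4; i++) {
        float top = t00.c[i] * (1.0f - fu) + t10.c[i] * fu;
        float bot = t01.c[i] * (1.0f - fu) + t11.c[i] * fu;
        out.c[i]  = top * (1.0f - fv) + bot * fv;
    }
    return out;
}

int main(void)
{
    /* Fill the placeholder texture with a simple gradient and sample it. */
    for (int y = 0; y < TEX_H; y++)
        for (int x = 0; x < TEX_W; x++)
            tex[y][x] = (texel_t){ { (float)x, (float)y, 0.0f, 1.0f } };

    texel_t t = bilinear_filter(1.5f, 2.25f);
    printf("%.2f %.2f %.2f %.2f\n", t.c[0], t.c[1], t.c[2], t.c[3]);
    return 0;
}
[/code]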

Doubtful. Those clocks would need 2x more threads for same mem latency.
2x more threads for the same memory latency compared to the same architecture running the same programs at 1.5GHz, but a bit less than 2x more threads than today - as programs become longer, there will naturally be more Instruction Level Parallelism in them, which helps hide more memory latency for a given RF size. Not that the RF isn't important going forward, as indicated by the amount of effort NVIDIA put into that register file cache paper for relatively minor improvements in the grand scheme of things.
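
A back-of-the-envelope way to see the threads-versus-ILP trade-off (all numbers here are illustrative, not vendor figures): at a fixed DRAM latency in nanoseconds, doubling the clock roughly doubles the latency in cycles, and the number of threads you need scales with that latency divided by how many instructions each thread can issue between memory accesses.

[code]
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions only. */
    const double mem_latency_ns = 300.0;  /* assumed DRAM round trip           */
    const double ilp_per_thread = 8.0;    /* assumed issuable instructions per
                                             thread between memory accesses    */
    const double clocks_hz[] = { 1.5e9, 3.0e9 };

    for (int i = 0; i < 2; i++) {
        double latency_cycles = mem_latency_ns * 1e-9 * clocks_hz[i];
        /* Enough threads so some thread always has work while others wait. */
        double threads_needed = latency_cycles / ilp_per_thread;
        printf("%.1f GHz: ~%.0f cycles latency -> ~%.0f threads per core\n",
               clocks_hz[i] / 1e9, latency_cycles, threads_needed);
    }
    return 0;
}
[/code]

Double ilp_per_thread (longer programs with more independent ALU work between accesses) and the required thread count, and hence the RF pressure, halves.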
 
2x more threads for the same memory latency compared to the same architecture running the same programs at 1.5GHz, but a bit less than 2x more threads than today - as programs become longer, there will naturally be more Instruction Level Parallelism in them, which helps hide more memory latency for a given RF size.

Actually the ILP seems to be DECREASING instead of increasing. AMD is moving away from the ILP-centric (VLIW) approach because there is not enough ILP in modern shader programs. When shader programs get more complex, they also start having more complex control structures, which limits ILP.
 
Handle blending in the shader core. This is already done not only on PowerVR hardware, where it's easier because it's a TBDR, but also on Tegra, which is an IMR. Some of the collision checking means it probably doesn't save much, but it's useful. And while you're at it, non-linear color spaces and non-traditional AA techniques will become more frequent, so you might as well do MSAA resolve in the shader core as well (but properly, hi R600!)

All this plainly screams for sort middle rendering. Or a big ass dram on package.
 
…or your surfaces are finely detailed (like the metalish-reflective spokes in a wheel). Oh wait, that's what we really want, don't we?
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful. ;)
 
Actually the ILP seems to be DECREASING instead of increasing. AMD is moving away from the ILP-centric (VLIW) approach because there is not enough ILP in modern shader programs. When shader programs get more complex, they also start having more complex control structures, which limits ILP.
That's a good point, it is true that more complex control structures limit ILP, but I think you're reading too much into AMD's move away from VLIW. That move is GPGPU-centric, and there the control structures have always been more complex. And some of their limitations did not come from an inherent lack of ILP but rather from the restrictive nature of their multibanked register file (which I'm sure is extremely efficient in hardware, so I'm not saying it's a bad design decision - only that it obscures the real amount of ILP available).

Secondly, we're not talking about the same kind of ILP - AMD's VLIW needs enough independent ALU instructions, whereas hiding memory latency only requires enough ALU instructions (independent or not) between texture/memory accesses to hide the latency. In general, the number of ALU instructions between accesses to external memory MUST increase because ALU performance will increase faster than bandwidth.

And finally using ILP to hide memory latency on GPUs isn't like using ILP to improve single-threaded performance on CPUs at all. What matters is the AVERAGE amount of ILP over many threads, not the fluctuating amount of it available inside a single thread. There will often be parts of a program which are a sort of 'serial bottleneck' but those will be compensated by the parts that have a lot of ILP.

I could certainly be wrong and ILP won't increase, but I'd be surprised if it actually decreased.
 
I thought it was already discussed how power hungry that eventually is.
16 registers per thread is plenty to ensure that the vast majority of accesses to reused data are register accesses. I sincerely doubt that having any more registers would have a significant effect on power consumption.

Also, you can't force data accesses to be register accesses. What makes it even more ironic is that on a GPU you want to minimize the number of registers to maximize the number of wavefronts, and that in turn also worsens cache contention.
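
To illustrate that register/wavefront trade-off with a quick sketch (the register file size, wavefront width and scheduler cap are made-up round numbers, not any particular GPU):

[code]
#include <stdio.h>

int main(void)
{
    /* Made-up round numbers for illustration. */
    const int regfile_bytes_per_simd = 64 * 1024;  /* assumed RF size per SIMD */
    const int wavefront_width        = 64;         /* threads per wavefront    */
    const int bytes_per_reg          = 4;          /* 32-bit registers         */
    const int max_wavefronts_hw      = 10;         /* assumed scheduler limit  */

    for (int regs_per_thread = 16; regs_per_thread <= 128; regs_per_thread *= 2) {
        int fit = regfile_bytes_per_simd /
                  (regs_per_thread * bytes_per_reg * wavefront_width);
        int wavefronts = fit < max_wavefronts_hw ? fit : max_wavefronts_hw;
        printf("%3d regs/thread -> %2d wavefronts in flight\n",
               regs_per_thread, wavefronts);
    }
    return 0;
}
[/code]

More registers per thread means fewer wavefronts in flight and less latency hiding; fewer registers means more spills and more pressure on the cache - which is exactly the contention problem above.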
I remember the FMA specification first turned up years ago. AMD announced more than two years ago that Bulldozer would support FMA(4) (and before that, SSE5, proposed in 2007, also included [differently encoded 3-operand] FMA instructions). The confusion about FMA4 and FMA3 is also already 2 years old. So someone who didn't know that Intel would implement FMA(3) must have lived under a stone in a cave somewhere in the middle of nowhere for the last few years. :LOL:
Some people still doubted that Intel would introduce FMA with Haswell, because Sandy Bridge doubled the ALU width already.
For conventional GPUs it does not matter, as you can stream with higher bandwidth directly from memory, so a huge LLC is basically wasted. But look at Intel's iGPU in Sandy Bridge! It already shares the L3. Why do you think it will be any different in future versions of it (or AMD's)?
RAM bandwidth is largely proportional to computing power, both for discrete cards and for IGPs. However, developers won't create GPGPU applications for something as weak as an IGP, regardless of whether or not it has access to an L3 cache. In other words, even though a large L3 cache can have a profound effect on the performance of a CPU, you don't fix all of a GPU's problems by throwing an L3 cache at it.
And how does compilation profit from wide vector units in that case?
Again, it doesn't. And it doesn't have to. The CPU is already great at tasks of this complexity. It's the GPU that has a long way to go to become any better at ILP and TLP.
Yes, in the same 65nm process the core area (i.e. just the core including L1) grew from 19.7mm² to 31.5mm². And the complete die was 57% larger. Doesn't look like a doubling of compute density to me. :rolleyes:
Aside from supporting x64, Core 2 also widened from a 3-wide to a 4-wide instruction architecture and implemented many other features. Unfortunately these major changes make it impossible to assess the isolated cost of widening the SSE paths to 128-bit.

It turns out a much better comparison is Brisbane versus Barcelona. Together with a slew of other changes which probably don't take a lot of space each, the widening of the SSE path made the core grow by only 23%. That's only 8% of Barcelona's entire die. So 5% for doubling the throughput probably isn't a bad approximation.

Doubling it again obviously costs more in absolute terms, but the rest of the core has grown / become more powerful as well. Sandy Bridge already widened part of the execution paths. Suffice it to say that implementing AVX2 in Haswell will be relatively cheap and we can consider it to have twice the throughput at a negligible cost.

That's absolutely not the case for GPUs, unless they start trading fixed-function hardware for programmable cores...
 
All this plainly screams for sort middle rendering. Or a big ass dram on package.
Heh, all of 3D rendering screams for TSVs (true 3D packaging) or, even better, cheap silicon photonics. Not that I'd complain about a TBDR-based console, mind you.

No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful. ;)
Well ideally you'd try to automatically trace as many rays as necessary to avoid that (effectively dynamic SSAA), which will also naturally result in a moderate amount of secondary-ray coherence.
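
A minimal sketch of what "as many rays as necessary" could look like, assuming a per-pixel refinement loop (the trace_sample stub, the thresholds and the sample counts are all invented for the example):

[code]
#include <math.h>
#include <stdio.h>

/* Stand-in for the renderer's actual ray cast: returns the luminance of one
 * jittered sample inside pixel (px, py). Purely a placeholder here. */
static float trace_sample(int px, int py, float jx, float jy)
{
    return 0.5f + 0.5f * sinf((px + jx) * 0.7f) * cosf((py + jy) * 1.3f);
}

/* "Dynamic SSAA": start with a few samples per pixel and keep adding samples
 * while the spread between them stays above a noise threshold. */
static float shade_pixel(int px, int py)
{
    const int   min_samples = 4, max_samples = 64;
    const float threshold   = 0.05f;

    float sum = 0.0f, lo = 1e30f, hi = -1e30f;
    int n = 0;
    while (n < max_samples) {
        float jx = (float)((n * 2654435761u) & 1023u) / 1024.0f;  /* cheap jitter */
        float jy = (float)((n * 40503u) & 1023u) / 1024.0f;
        float s = trace_sample(px, py, jx, jy);
        sum += s;
        if (s < lo) lo = s;
        if (s > hi) hi = s;
        n++;
        if (n >= min_samples && (hi - lo) < threshold)
            break;  /* pixel is smooth enough, stop refining */
    }
    return sum / (float)n;
}

int main(void)
{
    printf("pixel (10, 20) -> %.3f\n", shade_pixel(10, 20));
    return 0;
}
[/code]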
 