22 nm Larrabee

And what should we do with microtriangles?
They still form a smooth shape, don't they?

The argument is actually a very general one. The complete absence of locality in an image means it is just a random coloring of pixels. You won't be able to recognize anything in it (recognizable things basically being certain structures, like a cube, a square or whatever). Therefore each meaningful picture will show some locality. And on the way to generating this image, your algorithm will necessarily also experience locality in the accessed data structures (describing the scene).

Just take the simple example of a raytraced image of a half transparent, half reflective sphere in some environment. Neighboring rays intersecting the sphere give rise to secondary rays. But those secondary rays still very likely intersect the same (or very close) objects. After all, the reflection will show just a (distorted) image of the environment, same as the refracted rays. So unless you have a surface made of pixel-sized small mirrors each pointing in a random direction (which would simply create noise), secondary rays will also benefit from locality in the accessed data structures.
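For what it's worth, here's a minimal sketch of that locality argument in code (the scene numbers and hit points are just made up for illustration): two neighbouring primary rays hitting a smooth sphere reflect into nearly parallel secondary rays, so their traversal will touch nearly the same scene data.

```c
/* Minimal sketch of the locality argument above (all scene numbers invented,
 * hit points only approximated): two neighbouring primary rays hitting a
 * smooth sphere reflect into almost parallel secondary rays, so their
 * traversal touches nearly the same scene data. */
#include <math.h>
#include <stdio.h>

typedef struct { double x, y, z; } vec3;

static vec3   sub(vec3 a, vec3 b)    { return (vec3){ a.x - b.x, a.y - b.y, a.z - b.z }; }
static double dot(vec3 a, vec3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }
static vec3   scale(vec3 a, double s){ return (vec3){ a.x * s, a.y * s, a.z * s }; }
static vec3   unit(vec3 a)           { return scale(a, 1.0 / sqrt(dot(a, a))); }

/* standard mirror reflection: r = d - 2 (d . n) n */
static vec3 reflect(vec3 d, vec3 n)  { return sub(d, scale(n, 2.0 * dot(d, n))); }

int main(void)
{
    vec3 center = { 0.0, 0.0, 5.0 };                          /* unit sphere at z = 5 */
    vec3 d0 = unit((vec3){ 0.010, 0.0, 1.0 });                /* two primary rays,    */
    vec3 d1 = unit((vec3){ 0.011, 0.0, 1.0 });                /* one "pixel" apart    */
    vec3 p0 = { 0.050, 0.0, 4.0 }, p1 = { 0.055, 0.0, 4.0 };  /* approx. hit points   */
    vec3 r0 = reflect(d0, unit(sub(p0, center)));
    vec3 r1 = reflect(d1, unit(sub(p1, center)));
    printf("cosine between the two secondary rays: %.6f\n", dot(unit(r0), unit(r1)));
    return 0;
}
```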
 
- Do triangle setup in the shader core. Intel IGPs already do that (or did a few generations ago at least). One historical problem with that, ironically enough, was that FP32 wasn't enough for the corner cases unless you did things rather obtusely iirc. With FP64 becoming mainstream that's no longer a problem, although it may or may not still hurt power efficiency.

The problem there isn't the math, or available precision, IMHO (special case BFT - Big Fucking Triangles - and voila!), but rather the data-flow.
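Just to put a concrete (and entirely made-up) number on the precision side of those "corner cases": evaluating the edge equation in plane form with large, sub-pixel-snapped coordinates already overflows the 24-bit FP32 mantissa in the C = ax*by - ay*bx term, so a sample lying exactly on a shared edge comes out clearly non-zero in FP32 while FP64 still gets it exact - which is where two triangles sharing that edge can start to disagree about coverage.

```c
/* Toy example (my own numbers): triangle-setup edge equation
 * E(x,y) = A*x + B*y + C with A = ay - by, B = bx - ax, C = ax*by - ay*bx.
 * With coordinates far from the origin plus sub-pixel snapping, the C term
 * loses bits in FP32, so a sample lying exactly on the edge (exact E = 0)
 * ends up clearly non-zero in FP32. */
#include <stdio.h>

static float edge_f(float ax, float ay, float bx, float by, float x, float y)
{
    float A = ay - by, B = bx - ax, C = ax * by - ay * bx;
    return A * x + B * y + C;
}

static double edge_d(double ax, double ay, double bx, double by, double x, double y)
{
    double A = ay - by, B = bx - ax, C = ax * by - ay * bx;
    return A * x + B * y + C;
}

int main(void)
{
    /* 1/16-pixel snapped vertices far from the origin */
    double ax = 16384.0625, ay = 8192.1875;
    double bx = 16390.3125, by = 8195.4375;
    double px = 16387.1875, py = 8193.8125;   /* sample exactly on the edge */

    printf("FP32: %f\n", edge_f((float)ax, (float)ay, (float)bx, (float)by, (float)px, (float)py));
    printf("FP64: %f\n", edge_d(ax, ay, bx, by, px, py));
    return 0;
}
```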
 
So, if widening AVX to 1024 bits will bring such large performance improvements at such low cost, why stop at 1024? Why not 2048 bits? Or even more?
Widening AVX without widening the execution units won't bring a large performance improvement. It mainly lowers power consumption, and would help hide latency. There's no point in anything beyond a 4:1 ratio due to diminishing returns.

Widening the execution units would increase throughput, but beyond dual 256-bit FMA units it would have a significant cost and require sacrificing scalar performance. It would also increase the instruction rate again and thus all related power consumption, and worsen the latency hiding. I doubt these compromises are worth it.

2048-bit and beyond isn't practically feasible since AVX is limited to 1024-bit. But that's a very reasonable limit, since wider vectors would also worsen branch granularity and task granularity.
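To make the 4:1 point concrete, here's a minimal sketch (AVX-1024 is hypothetical, so this just models it on top of plain AVX): one logical 1024-bit operation is simply four 256-bit operations on the existing units, so throughput stays the same and what you save is three out of four fetch/decode/issue events.

```c
/* Minimal sketch (AVX-1024 doesn't exist, so this just models the idea in
 * software): a 1024-bit logical vector is four 256-bit AVX registers, and one
 * "AVX-1024" add cracks into four 256-bit operations on the existing units.
 * Throughput is unchanged; what goes away is 3 out of 4 instruction
 * fetch/decode/issue events. */
#include <immintrin.h>

typedef struct { __m256 part[4]; } v1024;           /* 32 floats = 1024 bits */

static inline v1024 v1024_add(v1024 a, v1024 b)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)                      /* one 256-bit uop each */
        r.part[i] = _mm256_add_ps(a.part[i], b.part[i]);
    return r;
}
```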
 
Yes, everyone knows they will converge, but the argument is whether the resulting architecture will look more like a GPU or a CPU. I don't understand why you're so convinced that today's CPUs are more representative of future many-core architectures than today's GPUs.
That's easy. Workloads depend on ILP, TLP or DLP for high performance, and increasingly a combination of these. GPUs still only offer good DLP, with TLP improving but still suffering from cache contention. CPUs are great for both ILP and TLP, and are catching up really fast in DLP.
There is additional compute density to be had on GPUs as well. nVidia at least is predicting up to 3 GHz shader clocks in the next few years on GPU parts.
Which converges it toward the CPU...
What are we expecting from Haswell in terms of fixed function hardware?
Nothing has been confirmed, but my personal expectation is that Intel won't risk any radical changes yet and will just include an enhanced DX11 IGP. They'll be able to seriously experiment with having the CPU cores assist in vertex and/or geometry processing though. If Skylake features AVX-1024, then a mainstream chip would deliver ~1 TFLOP at low power consumption, so it can definitely take over the IGP's task entirely. By then things like texture sampling will likely require more programmability anyway, so there's no need for any fixed-function hardware, although of course AVX can be extended with a few more valuable instructions.
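For what that ~1 TFLOP figure is worth, here's the kind of back-of-the-envelope arithmetic behind it - core count, unit width and clock are my own assumptions, nothing announced:

```c
/* Back-of-the-envelope only, with assumed (not announced) numbers: peak SP
 * FLOPS = cores x FMA units/core x SP lanes/unit x 2 (mul+add) x clock.
 * Eight cores with dual 256-bit FMA units at 4 GHz already land at ~1 TFLOP,
 * whether or not AVX-1024 is cracked 4:1 over those 256-bit units. */
#include <stdio.h>

int main(void)
{
    const double cores = 8.0, fma_units = 2.0, sp_lanes = 8.0;  /* 256-bit units */
    const double flop_per_fma = 2.0, clock_ghz = 4.0;
    printf("peak: %.0f GFLOPS\n",
           cores * fma_units * sp_lanes * flop_per_fma * clock_ghz);
    return 0;
}
```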
 
No, I don't want each neighboring pixel/sample to show a completely different color. Then it's a noisy and completely aliased picture. That looks awful. ;)

Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic? ;)

Just let me clarify:
You'll also get this kind of thrashing around of secondary or, even worse, tertiary rays if you choose to apply some form of antialiasing in order to remove some of the noise from your rendered picture.
 
They're not targeting Llano. And CPUs run OpenCL too.
I think what you mean is that they are not specifically targeting Llano. But it is clearly compatible with Llano, and given that there's nothing fancy about Llano's CPU-GPU integration and that the architecture is the same one used in many other AMD GPUs, I'm really not sure what the benefit of that could possibly be.

Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic? ;)
I think there's a solid argument for downright faking it when it gets way too random. The human brain is unable to make anything useful out of it, but the aliasing will still annoy it, so it's a lose-lose situation. Anyway, that's a last resort (most of the time sane content and/or local AA will be enough), but I think it should be considered - physically accurate rendering for the sake of physical accuracy isn't a viable strategy.
 
It would really help if you could explain in some detail.

I can't speak for aaronspink, but I do recall some differences were mentioned in the context of some recent CPUs.
The register file uses 8T SRAM, while caches use 6T, though in the case of Atom the L1 data cache also shifted to use 8T, which has a commensurate cost in storage per transistor.
The other caches stuck with 6T.

The upshot is that the use of 8T SRAM allowed the design to run reliably at lower voltages. SRAM reliability at a given voltage level has come up as a design consideration in discussions or articles about the latest designs.

I know that caches tend to favor density while register files tend to favor high performance.
I had thought that pushing a cache to the same level of porting as a register file would make it noticeably more bloated than a register file due to the scaling of the cache's ancillary hardware, but the most recent posts on the matter indicate the RAM itself would dominate.
 
Everything that's a physically correct reflection shows a color based on the curvature of the reflective surface and the distance of the reflected object. Or do you propose to alter the reflection calculations in raytracing to make them more … dramatic (?) instead of realistic? ;)
No, of course not. The argument actually goes in another direction: as long as the whole picture does not show only (pseudo)random colors, there is inherent locality to exploit.
 
I think there's a solid argument for downright faking it when it gets way too random. The human brain is unable to make anything useful out of it but the aliasing will still annoy it so it's a lose-lose situation. Anyway that's a last recourse (most of the time sane content and/or local AA will be enough) but I think it should be considered - physically accurate rendering for the sake of physical accuracy isn't a viable strategy.

I see your (and probably Gipsel's) point, but nonetheless secondary rays do exhibit a tendency not to behave nearly as nicely as primary ones - and ray tracing with primary rays only is moot as well most of the time.


edit: And that's where the argument started, so we can now move on with the primary topic :) Surely there's a certain level of parallelism to exploit in some of the secondary rays, but equally surely it won't be as high as for primary rays. And it's not only reflections adding to cache pollution.
 
I can't speak for aaronspink
But we can let him speak. The original question was about the power consumption of a cache that was beefed up to serve the same usage profile as the register files of GPUs. So while this sentence is normally true:
A cache will generally have much less power per bit than a register file.
, it only applies to the "normal" usage scenarios and designs as found in CPUs, for instance. Aaron also writes that a cache
just devolves into a register file as you add ports.
So where should the power consumption advantage come from, if the actual memory arrays don't differ anymore?

If you add the few additional tasks a cache must be able to handle (however expensive or cheap that may be), the simple fact that the cache is very likely physically further away from the units than the register files (which are even split, so each lane of a vector ALU has its own register file placed close to the individual ALU), and that it costs energy to drive data over a distance, it necessarily follows that it would have a higher power consumption if used as a register file. What you gain is some flexibility, and the performance will decrease more gracefully if you need more register space than the register file offers.

Eventually, we may very well see something like what nvidia proposed in that paper, where you have a few registers basically within each ALU to cover the operands of only 4 or 5 instructions, backed up by a larger register file, backed up by a cache system. That way the data transfer between the levels further away from the ALU decreases, i.e. it requires fewer transfers over larger distances, lowering the power consumption.
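As a toy illustration of that hierarchy (all the per-access energy numbers below are invented; the real figures are in the nvidia paper): letting a handful of per-ALU operand registers catch the short-range reuse means only the misses pay for the longer trip to the main register file.

```c
/* Toy model of the hierarchy described above (all per-access energies are
 * invented for illustration): a tiny set of operand registers next to the ALU
 * catches short-range reuse, so only misses pay for the longer trip to the
 * main register file. */
#include <stdio.h>

#define OPERAND_REGS 4                        /* enough for ~4-5 instructions */

int main(void)
{
    const double pj_operand = 0.5, pj_main_rf = 5.0;  /* made-up pJ per access */
    int recent[OPERAND_REGS] = { -1, -1, -1, -1 };    /* FIFO of register ids  */
    int next = 0;
    int stream[] = { 3, 7, 3, 9, 7, 3, 12, 9, 3, 7 }; /* operands with reuse   */
    int n = sizeof stream / sizeof stream[0];
    double energy = 0.0;

    for (int i = 0; i < n; ++i) {
        int hit = 0;
        for (int j = 0; j < OPERAND_REGS; ++j)
            if (recent[j] == stream[i]) hit = 1;
        if (hit) {
            energy += pj_operand;                     /* read next to the ALU  */
        } else {
            energy += pj_main_rf;                     /* fetch from main RF    */
            recent[next] = stream[i];
            next = (next + 1) % OPERAND_REGS;
        }
    }
    printf("with operand regs: %.1f pJ, main RF only: %.1f pJ\n",
           energy, n * pj_main_rf);
    return 0;
}
```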
 
I see your (and probably Gipsels') point, but nonetheless do secondary rays exhibit a tendency to not behave nearly as nice as primary ones
But they do behave "nice enough". If they stop doing that, they are just creating noise, which you normally want to avoid anyway, because the picture probably starts to look awful at that point. A bit of noise can improve the perceived realism, but that bit doesn't mean everything.
 
I think what you mean is that they are not specifically targeting Llano. But it is clearly compatible with Llano, and given that there's nothing fancy about Llano's CPU-GPU integration and that the architecture is the same one used in many other AMD GPUs, I'm really not sure what the benefit of that could possibly be.

Sure, there might be no technical benefit (except for lower latency in CPU<->GPU communication, if it'll ever make a difference?), but Llano is damn cheap.
Looking at APU + motherboard prices, it's almost like a €50 computing-capable graphics card is being offered for free.

From a market standpoint, it should be a game changer, increasing software interest in supporting OpenCL/DirectCompute.
Especially if demand for Llano laptops is as high as it's been rumoured to be.
 