22 nm Larrabee

As 3D and others pointed out, trying to add wide scatter/gather hurts serial performance, so they are unlikely to be added.
Why does that have to be the case? Intel already added two load units to SNB, without compromising single-threaded scalar performance. So I don't see any reason why a gather instruction with a maximum throughput of one every four cycles can't be implemented in a straightforward way without negatively impacting anything else.

Note that these load units have been capable of handling misaligned loads which straddle a cache line boundary for ages. One line could be in L1, the other in a swapped out page (though typically they'll be close together). The above implementation would merely require extending this to keeping track of four instead of two cache lines.

And from that point forward it seems relatively simple to me to add the ability to fetch up to four 32-bit elements from each cache line. Of course each of these things will require additional transistors and/or latency, but as the transistor budget continues to increase exponentially and clock frequencies increase only at a modest pace, it seems to me this will soon enough pose little of a problem. Doubling the number of load units in SNB also wasn't free, but 32 nm made it feasible without noticeable compromises...
Tex units are non-negotiable, and in all likelihood, to make a competitive GPU you'll need more FF hardware as well.
For a competitive GPU, sure, but this was about making the CPU competitive with the IGP (and at the same time turning it into a generic high-performance computing device like Larrabee). Very different design rules apply and everything is negotiable.
 
Nick, as to your points, even if you were right about area (which I'm very skeptical about)...
What area specifically, and what makes you skeptical about it?
I think you're massively underestimating the data movement overhead and in general the power consumption penalty of all this.
Again, which things specifically, and how massive is the overhead exactly?

Gather/scatter support would "massively" reduce data movement overhead. Currently it takes 18 instructions to emulate a gather operation, involving pretty much every part of the core, and constantly moving 128 bits of data back and forth. Gather/scatter would reduce it to a single instruction which doesn't involve the ALU pipelines, and the data movement is reduced to a near minimum.
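For concreteness, here is a rough sketch (mine, not from the post, assuming AVX plus SSE4.1 intrinsics) of that emulation for an 8-element gather; a native gather instruction would collapse the whole extract/load/insert dance into one op that only involves the load pipeline.

#include <immintrin.h>

/* Emulated 8-wide gather: extract each index, do a scalar load, reassemble.
   This is roughly the kind of sequence the "18 instructions" figure refers
   to, and it keeps the ALU ports and register file busy shuffling data. */
static __m256 gather_emulated(const float *base, __m256i idx)
{
    __m128i lo = _mm256_castsi256_si128(idx);      /* indices 0..3 */
    __m128i hi = _mm256_extractf128_si256(idx, 1); /* indices 4..7 */
    return _mm256_setr_ps(
        base[_mm_extract_epi32(lo, 0)], base[_mm_extract_epi32(lo, 1)],
        base[_mm_extract_epi32(lo, 2)], base[_mm_extract_epi32(lo, 3)],
        base[_mm_extract_epi32(hi, 0)], base[_mm_extract_epi32(hi, 1)],
        base[_mm_extract_epi32(hi, 2)], base[_mm_extract_epi32(hi, 3)]);
}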

Executing AVX-1024 on existing 256-bit (and 128-bit) execution units seems like it would require negligible extra logic, but it would reduce the instruction fetch/decode/schedule overhead by up to a factor of four. So again this reduces data movement and power consumption.
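A crude way to picture this (my illustration using today's intrinsics, not an actual AVX-1024 encoding): treat a hypothetical 1024-bit register as four 256-bit parts and sequence them over the existing datapath, so one fetched/decoded/scheduled instruction accounts for four execution beats.

#include <immintrin.h>

/* Hypothetical 1024-bit vector modeled as four 256-bit parts. A real
   AVX-1024 implementation would do this sequencing in hardware; the point
   is that the front end handles one instruction for four cycles of work. */
typedef struct { __m256 part[4]; } v1024;

static v1024 v1024_add(v1024 a, v1024 b)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)   /* four beats on a 256-bit adder */
        r.part[i] = _mm256_add_ps(a.part[i], b.part[i]);
    return r;
}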

Last but not least, while I'm aware this still leaves a higher overhead than an IGP, there's potential to do things more efficiently at a higher level by having no API limitations. Some data movement is also avoided by keeping it more local.
And chip designers are more and more willing to sacrifice a LOT of area to save power consumption, both directly at the architectural level and indirectly by reducing voltages.
Why would that be a problem?
 
I think the pipeline structure gets much more complex, and it's much harder to load-balance it the way DX9 does without using off-chip memory stream-out.
What makes you think that?
Not to mention those new features like GS, Append/Consume buffers and tessellation units. They might require even more bandwidth compared to GPUs because of the lack of huge FIFOs.
You only pay the price for features that are actually in use. Everything else can be eliminated through dynamic code generation.
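To illustrate the kind of specialization meant here (my sketch, not SwiftShader's actual code): a fixed pipeline pays for every state check on every pixel, whereas a dynamically generated pipeline contains only the work the current state enables. The hypothetical routines below mimic that with alpha blending as the example feature.

#include <stddef.h>

/* General path: tests the render state per pixel (RGBA floats). */
static void shade_general(float *dst, const float *src, size_t n, int alpha_blend)
{
    for (size_t i = 0; i < n * 4; i += 4) {
        float a = src[i + 3];
        for (int c = 0; c < 4; ++c) {
            float v = src[i + c];
            if (alpha_blend)                          /* paid on every pixel */
                v = v * a + dst[i + c] * (1.0f - a);
            dst[i + c] = v;
        }
    }
}

/* What a code generator would emit when blending is off: the branch and
   the blend math simply don't exist in the generated routine. */
static void shade_no_blend(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n * 4; ++i)
        dst[i] = src[i];
}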

And why would these FIFOs have to be huge? The aim is to approach IGP performance, not the performance of big discrete GPUs. The latest CPUs have plenty of cache space to keep the data local as long as possible. It also easily adapts to any workload.
PS: I ran SwiftShader with my raytracing demo and the rendered result looks incorrect. I understand the need to optimize the texture unit, but what change affects precision?
SwiftShader passes the (relevant) WHQL tests, so it complies with the minimal precision requirements. That said, I replaced all textures with a floating-point format and the artifacts were gone. It's hard to tell what really causes the difference, but I might look into this in my spare time...
 
What makes you think that?

You only pay the price for features that are actually in use. Everything else can be eliminated through dynamic code generation.

And why would these FIFOs have to be huge? The aim is to approach IGP performance, not the performance of big discrete GPUs. The latest CPUs have plenty of cache space to keep the data local as long as possible. It also easily adapts to any workload.

Obviously you know software renderers much better than I do, so I'm just curious.

The FIFO thought crossed my mind because of this paper:
http://graphics.stanford.edu/papers/gramps-tog/gramps-tog08.pdf

SwiftShader passes the (relevant) WHQL tests, so it complies with the minimal precision requirements. That said, I replaced all textures with a floating-point format and the artifacts were gone. It's hard to tell what really causes the difference, but I might look into this in my spare time...

Good to know. I'll play with SwiftShader for a while and maybe I'll tell you what causes that.
 
And why would these FIFOs have to be huge? The aim is to approach IGP performance, not the performance of big discrete GPUs. The latest CPUs have plenty of cache space to keep the data local as long as possible. It also easily adapts to any workload.

"Approaching IGP performance" with multiple cores isn't exactly an achievement. It's a lot more like a giant step backwards in performance and power efficiency.

DK
 
Thanks to my sheer mind power I feel like we're straying farther and farther from an agreement on the matter... :LOL:
Anyway, Intel so far has no plan to use standard CPU cores as a replacement for the IGP. We're not even sure about Intel replacing the IGP with Larrabee cores.

Back to Larrabee cores, if they are ever to exist in public space: I believe there's a consensus here about what should change, even though there is disagreement on "would that be enough?".
Larrabee V2 cores:
* should support 4 hardware threads
* there should be really minor changes to the ISA (mostly a scatter/gather implementation as described in the aforementioned paper)
* minor changes to the caches (only the L1 associativity could change from 2 to 4)
That's what we have from Intel's own papers.

From the discussion going on here:
* extending registers from 16 wide to 32 wide (512 to 1024 bits) could help hide the latency of memory operations such as scatter/gather and "atomic reductions".

I think it's fair to specify which kind of system the chip would have to compete against, i.e. an AMD APU.
So the chip has to match or exceed Llano's successors. Does that look possible?
My assumption is that Llano will be bandwidth constrained, whereas Intel's TBR should do better, making up for some of the architecture's inherent inefficiency. And since Bulldozer won't allow AMD to catch up with Intel as far as IPC is concerned, Intel could go with fewer cores and have more space available for Larrabee cores (i.e. two SNB or better cores versus two BD modules may not be that suicidal as far as performance is concerned).
From what we know from previous Intel presentations, scaling for quite a few workloads was excellent up to 8 cores and pretty good up to 16 cores.
So taking into account Intel's growing process advantage, my belief is that Larrabee cores may be a relevant option for Haswell. Whether Intel does it or not is more a matter of will, as Intel may have a more important agenda than pushing Larrabee cores into their CPUs (a complete redesign of Atom/the low-end offering is likely to provide an order of magnitude higher ROI, so no matter what the Larrabee team managed or came close to achieving, in the end it could all end up in the junk).
 
"Approaching IGP performance" with multiple cores isn't exactly an achievement. It's a lot more like a giant step backwards in performance and power efficiency.
You're (still) not looking at the entire picture.

If maximum performance and low power consumption were the only things that mattered, we'd have a custom chip for every application out there. Clearly though there's value in having chips which can perform more than one task adequately. Even GPUs are continually evolving into more generic devices, sacrificing raw performance and power efficiency, because that makes them more efficient at complex workloads. The value of a homogeneous CPU capable of performing the IGP's task goes way beyond graphics. It even goes beyond the value of Larrabee, without having to compete with high-end dedicated devices.

Besides, you can't claim it's a giant step backwards without proof or strong indications that FMA, gather/scatter and AVX-1024 don't suffice to increase performance and reduce power consumption sufficiently to make it work.
 
We're not even sure about Intel replacing the IGP with Larrabee cores.
In my opinion that would be plain silly considering how close AVX can get to LRBni. The overhead of moving data between CPU and LRB cores and managing a lot more threads could easily nullify any benefit. Heck, some papers even show that Knights Ferry is only marginally faster than Nehalem. Then why not make the CPU faster and more power efficient instead? They already added AVX, and FMA and 1024-bit operations are also already specified. I really don't see the benefit of LRB cores next to the CPU cores.
So the chip has to match or exceed Llano's successors.
Not necessarily. It all depends on what the market really wants. Do people want to sacrifice CPU performance in favor of a slightly faster IGP? Or do they prefer a powerful CPU suitable for a variety of workloads?

The very existence and relative success of IGPs as a low-cost barely adequate graphics solution tells me that people primarily still look for the fastest possible CPU for their budget. Graphics is not as important as some hardcore gamers here love to think. Real gamers should get a discrete GPU anyway, not an IGP or APU. Besides, even hardcore gamers would benefit more from evolving AVX toward LRBni rather than any transistors invested into a bigger/hotter/more complex IGP.

This discussion is very similar to the one about unifying vertex and pixel shading units. Do we want cores specialized at specific tasks, or is it worth sacrificing some raw performance and power efficiency for the sake of flexibility? Clearly flexibility won over raw performance and power efficiency. Of course it was a bit of a chicken-and-egg issue (developers needed time to take advantage of this new flexibility), but the cost of unification got amortized pretty quickly and today it would be unthinkable to go back to a heterogeneous GPU.

So try to look beyond just the next couple of generations. FMA isn't even announced for any of Intel's future CPUs yet, so gather/scatter and AVX-1024 shouldn't be expected to appear on the roadmap for several years. But by that time graphics will have evolved too, so things like software rasterization and programmable texture sampling might seem less crazy to some.
 
This discussion is very similar to the one about unifying vertex and pixel shading units. Do we want cores specialized at specific tasks, or is it worth sacrificing some raw performance and power efficiency for the sake of flexibility? Clearly flexibility won over raw performance and power efficiency.
Surely unified shaders offered better raw performance.
 
Geometry will always cost more on a TBDR than an IMR. There's no way around that. So no, it can't win big with high tessellation unless it's the pixel workload that is giving it the advantage (or, naturally, if you only give the TBDR an optimization that is equally valid for the IMR).
I am not sure I follow.

The bw difference between a TBDR and an IMR boils down to geometry r/w bw before rasterization and fragment/z bw after shading.

If you can get away with dumping just the untessellated geometry to memory, then you have just performed geometry compression which an IMR can't copy. Much more so when you crank up the tessellation level. The fragment/z bw has prolly increased for an IMR where a TBDR wins anyway, but it would be a win even if it didn't.

And how about random value functions using bit masks? Procedural noise? It's an interesting paper, though.
Well, I guess just make the common case fast and the rare case correct.
 
Back to Larrabee cores, if they are ever to exist in public space: I believe there's a consensus here about what should change, even though there is disagreement on "would that be enough?".
Larrabee V2 cores:
* should support 4 hardware threads
* there should be really minor changes to the ISA (mostly a scatter/gather implementation as described in the aforementioned paper)
* minor changes to the caches (only the L1 associativity could change from 2 to 4)
That's what we have from Intel's own papers.
This is probably describing the original Larrabee.

So taking into account Intel's growing process advantage, my belief is that Larrabee cores may be a relevant option for Haswell.
I think rumors put a possible change from the CPU + IGP paradigm beyond Haswell.

One thing to note is that apparently Intel is taking a TDP haircut at 22nm for at least some of its mobile lines. Whatever CPU improvements we speculate to happen need to be evaluated in the context that they will have significantly fewer watts to fit under. Perhaps Intel will create another architectural branch; otherwise, the mobile design is going to lead the direction of the desktop one.
Also, it seems the percentage of die area devoted to the IGP may be increasing.
 
Was it changed midway? I haven't focused on it for a while; it was 32nm based on Intel's words at ISC 2010, and now I can find both versions through Google.

That's ridiculous. Another name for Knights Ferry is Aubrey Isle, the silicon for what was known as "Larrabee". It's like the rumors that claimed Oak Trail was 32nm (quite a few).

Look at these two articles:

http://www.techdelver.com/latest-an...ure-computing-intel-32-core-processorcpu.php/

The initial 32-core chip is made using the existing 32-nm process.

http://www.infoworld.com/d/hardware/intels-new-32-core-chip-its-fastest-processor-ever-571

The initial 32-core chip is made using the existing 45-nm process.

Maybe you are late delving into tech news, but being on 45nm is old, old news.
 
Surely unified shaders offered better raw performance.
R580+: 380 million transistors, 426 GFLOPS @ 650 MHz
R600 XT: 700 million transistors, 475 GFLOPS @ 743 MHz

Of course there were more changes than just the shader unification, but I'm inclined to think it's a major contributor to the lower computing density. Suddenly every core had to have all features to ensure both vertex and pixel shaders ran efficiently.
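A quick back-of-the-envelope from those numbers (my arithmetic, not from the original post): R580+ works out to roughly 426 / 380 ≈ 1.1 GFLOPS per million transistors, while R600 XT comes in at about 475 / 700 ≈ 0.7, i.e. computing density dropped by around 40% across that transition.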
 
R580+: 380 million transistors, 426 GFLOPS @ 650 MHz
R600 XT: 700 million transistors, 475 GFLOPS @ 743 MHz

Of course there were more changes than just the shader unification, but I'm inclined to think it's a major contributor to the lower computing density. Suddenly every core had to have all features to ensure both vertex and pixel shaders ran efficiently.
R600 was bloated for many reasons including DX10 functionality. Xenos and RV770 are more refined unified shader architectures.
 
R600 was bloated for many reasons including DX10 functionality. Xenos and RV770 are more refined unified shader architectures.
RV770's shader and texture units take 40% of its die space. Reducing it to R600 computing power would save about 30% of die space. This appears to closely correlate with the transistor count difference (also note its memory controller is less wide).

So there's no reason to assume they were able to significantly "refine" the unified shaders. Which brings me back to the observation that unification sacrificed raw computing density in favor of flexibility.

By the way, the move from VLIW5 to VLIW4 also decreased computing density, but again of course the increased efficiency for today's workloads should make it worth it. Anyhow, it indicates that the software hasn't stopped evolving in the last couple years, and is likely to continue to become more complex and require the hardware to sacrifice raw performance in favor of efficiency/flexibility.

If you look at the radical changes in GPU architectures in the last decade, it's really not that hard to see that by the end of this decade texture sampling could very well be performed in software. It's one of those things which is currently either a bottleneck or idle silicon, and suffers from high latency (even when the data is close and filtering is disabled). Likewise generic load/store units can be a bottleneck or twiddling thumbs. By 'unifying' them you get more bandwidth for any task, and by lowering the latency the number of strands can be reduced (better scaling behavior) and the register set can be reduced.

CPUs can get there sooner.
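To make "texture sampling in software" concrete, here is a minimal sketch (mine; a real software sampler like SwiftShader's obviously also handles formats, wrap modes, mipmaps and SIMD): a bilinear fetch done with ordinary loads and arithmetic, i.e. exactly the kind of work generic load/store units and ALUs could take over from dedicated texture hardware.

#include <math.h>

/* Single-channel texture for brevity; one float per texel. */
typedef struct { const float *texels; int width, height; } Texture;

static int clamp_int(int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); }

static float sample_bilinear(const Texture *t, float u, float v)
{
    /* texel centers at (i + 0.5) / size */
    float x = u * t->width  - 0.5f;
    float y = v * t->height - 0.5f;
    int   x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - x0, fy = y - y0;

    int xa = clamp_int(x0, t->width),  xb = clamp_int(x0 + 1, t->width);
    int ya = clamp_int(y0, t->height), yb = clamp_int(y0 + 1, t->height);

    float t00 = t->texels[ya * t->width + xa];
    float t10 = t->texels[ya * t->width + xb];
    float t01 = t->texels[yb * t->width + xa];
    float t11 = t->texels[yb * t->width + xb];

    float top = t00 + (t10 - t00) * fx;
    float bot = t01 + (t11 - t01) * fx;
    return top + (bot - top) * fy;
}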
 
SwiftShader passes the (relevant) WHQL tests, so it complies with the minimal precision requirements. That said, I replaced all textures with a floating-point format and the artifacts were gone. It's hard to tell what really causes the difference, but I might look into this in my spare time...

Hi Nick, I tested SwiftShader a bit and found that point sampling tex2D() on a 256x1 ARGB8 texture with (1.f / 256, 0) doesn't return the 1st texel, but the 0th. There seems to be a slight sampling offset of -1.0/65536.
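Ignoring the finer points of D3D9's addressing rules, the arithmetic behind that report is roughly the following (my sketch, just reproducing the numbers in the post):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const int   width = 256;
    const float u     = 1.0f / 256;       /* coordinate from the repro */
    const float bias  = -1.0f / 65536;    /* offset inferred by the poster */

    printf("expected texel:  %d\n", (int)floorf(u * width));          /* 1 */
    printf("with the offset: %d\n", (int)floorf((u + bias) * width)); /* 0 */
    return 0;
}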
 
I am not sure I follow.

The bw difference between a TBDR and an IMR boils down to geometry r/w bw before rasterization and fragment/z bw after shading.

If you can get away with dumping just the untessellated geometry to memory, then you have just performed geometry compression which an IMR can't copy. Much more so when you crank up the tessellation level. The fragment/z bw has prolly increased for an IMR where a TBDR wins anyway, but it would be a win even if it didn't.
An IMR can't copy it because the IMR reads the raw untessellated mesh just once, but for a TBDR it's three passes: one read of the raw geometry, plus a write and a read of the "compressed" version. Unless the compressed size is zero, that's higher bandwidth for the TBDR. I can't see how it could be an advantage.
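Put as a rough inequality (my formulation): if the raw geometry stream is G bytes and the binned intermediate is C bytes, the TBDR's geometry traffic is about G + 2C (one read of the raw mesh, one write plus one read of the intermediate) versus the IMR's G, so the TBDR only comes out ahead overall if its fragment/z savings exceed that extra 2C.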

What kind of lossless compression is feasible for such a system, BTW?
 
R580+: 380 million transistors, 426 GFLOPS @ 650 MHz
R600 XT: 700 million transistors, 475 GFLOPS @ 743 MHz

Of course there were more changes than just the shader unification, but I'm inclined to think it's a major contributor to the lower computing density. Suddenly every core had to have all features to ensure both vertex and pixel shaders ran efficiently.
I wasn't thinking of monster chips - more of high efficiency mobile devices.
 