22 nm Larrabee

The IMR doesn't have that copy, because it reads the raw untessellated mesh just once, whereas a TBDR touches it three times: one read of the raw geometry, plus a write and a read of the "compressed" version. Unless the compressed size is zero, that's higher bandwidth for the TBDR. I can't see how it could be an advantage.
You need to remember that RAM is not really RAM - it much prefers very large contiguous reads and writes to be efficient.
 
You're (still) not looking at the entire picture.

If maximum performance and low power consumption was the only thing that mattered, we'd have a custom chip for every application out there.

Have you ever looked at a mobile phone?

Clearly though there's value in having chips which can perform more than one task adequately.

Yes there is. However, that says nothing about the individual blocks within said chip.

Even GPUs are continually evolving into more generic devices, sacrificing raw performance and power efficiency, because that makes them more efficient at complex workloads. The value of a homogeneous CPU capable of performing the IGP's task goes way beyond graphics. It even goes beyond the value of Larrabee, without having to compete with high-end dedicated devices.

It's only valuable if it's at a reasonable power, performance and area cost. If your power is 2X worse, you're screwed. And what you've been talking about is something more along the lines of 4-8X worse. Unless the world drastically changes and doing full-screen 1080p rendering with full special effects only takes a few milliwatts, it's just not going to happen.

Besides, you can't claim it's a giant step backwards without proof or strong indications that FMA, gather/scatter and AVX-1024 don't suffice to increase performance and reduce power consumption sufficiently to make it work.

You know it's funny. I never seem to recall Intel announcing that CPU cores would get 64B vectors or scatter/gather. Maybe you can point me to that roadmap?

David
 
RV770's shader and texture units take 40% of its die space. Reducing them to R600's computing power would save about 30% of die space. This appears to closely correlate with the transistor count difference (also note its memory controller is less wide).
If you are only scaling back the SIMD portion, then RV770 would have been pretty much the same die size, with a hole of unused silicon in the middle. Unless you are also advocating cutting the memory bus, dropping other parts of the chip, and changing the layout significantly, it would have been pad-limited.
Per ATI, two of its SIMD blocks were effectively "free" because the tools improved more than expected over the problematic set they had for R600, hence the mismatch between interpolator capability and SIMD count.

By the way, the move from VLIW5 to VLIW4 also decreased computing density, but again of course the increased efficiency for today's workloads should make it worth it.
VLIW4's primary justification was that it allowed AMD to compact the ALU section further.
The upshot is that, since other (less programmable) parts of the chip did not shrink, or were even augmented, compute density went down.
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4
Within the compute section, Anandtech cites a 10% improvement in efficiency per mm2, although this could be open to interpretation since I don't think percent of efficiency per area makes sense if taken literally.

If you look at the radical changes in GPU architectures in the last decade, it's really not that hard to see that by the end of this decade texture sampling could very well be performed in software. It's one of those things which is currently either a bottleneck or idle silicon, and suffers from high latency (even when the data is close and filtering is disabled). Likewise generic load/store units can be a bottleneck or twiddling thumbs. By 'unifying' them you get more bandwidth for any task, and by lowering the latency the number of strands can be reduced (better scaling behavior) and the register set can be reduced.
The bandwidth per area of generic load/store units in desktop CPUs does not match that of the TMUs in a GPU.
The apparent benefit, if we accept your assertion that we can get more bandwidth for any task, is that we use less of the cheap SRAM.

CPUs can get there sooner.

We may see it sometime after Haswell's successor, perhaps in 2016.
 
Nick said:
Of course there were more changes than just the shader unification, but I'm inclined to think it's a major contributor to the lower computing density. Suddenly every core had to have all features to ensure both vertex and pixel shaders ran efficiently.
Given the nature of this thread, I do believe this is about the single most ironic statement that could have possibly been made.
 
VLIW4's primary justification was that it allowed AMD to compact the ALU section further.
The upshot is that, since other (less programmable) parts of the chip did not shrink, or were even augmented, compute density went down.
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4
Within the compute section, Anandtech cites a 10% improvement in efficiency per mm2, although this could be open to interpretation since I don't think percent of efficiency per area makes sense if taken literally.
I think that is a misquote by Anandtech. IIRC, one of the slides said 10% more perf/mm2.
 
Hi Nick, I tested SwiftShader a bit and found that point sampling tex2D() on a 256x1 ARGB8 texture with (1.f / 256, 0) doesn't return texel 1, but texel 0. There seems to be a slight sampling offset of -1.0/65536.
1.0 / 256 is in between texel 0 and 1, and due to round-to-nearest-even it selects texel 0. Are you getting different behavior with the reference rasterizer? Does 1.01 / 256 get you texel 1?
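To make that concrete, here's a minimal sketch of the selection rule I'm describing, assuming D3D9's convention that texel centers sit at (i + 0.5) / width (the function name is just illustrative):

#include <cmath>
#include <cstdio>

int nearestTexel(float u, int width)
{
    // std::nearbyint uses the current rounding mode, which defaults to
    // round-to-nearest-even, so the 0.5 tie rounds down to 0.
    return static_cast<int>(std::nearbyint(u * width - 0.5f));
}

int main()
{
    std::printf("%d\n", nearestTexel(1.0f / 256, 256));  // prints 0
    std::printf("%d\n", nearestTexel(1.01f / 256, 256)); // prints 1
}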
 
Have you ever looked at a mobile phone?
Yes. But here I'm talking mostly about Larrabee type technology, and how it could relate to the IGP and desktop CPU.

But anyway, like I said before, even mobile chips are rapidly evolving toward higher programmability, which costs performance and power efficiency. Fortunately, advances in process technology make these trade-offs feasible, and I don't see that stopping any time soon.
It's only valuable if it's at a reasonable power, performance and area cost. If your power is 2X worse, you're screwed.
Not really. You have to take the cost into account as well. It only takes relatively minor changes to make the CPU a lot more efficient at throughput computing. Several years from now you might basically get these features for free. If you can get 1 TFLOP out of a mainstream CPU while it remains well within its TDP, who's going to buy a coprocessor that offers the same performance at half the power consumption, with a limited programming model? I'm sure there's a market for those coprocessors, but powerful CPUs are anything but "screwed".
You know it's funny. I never seem to recall Intel announcing that CPU cores would get 64B vectors or scatter/gather. Maybe you can point me to that roadmap?
I don't see what's funny. Why does there have to be an announced roadmap? The people on this forum speculate about future architectures all the time. If you only want to talk about existing and announced products, I suggest you look for another forum. Tom's Hardware perhaps.

Personally I'd rather discuss what would happen if they did add full-fledged FMA support, gather/scatter, and 128B vectors. For what it's worth, FMA instructions are fully specified, gather/scatter is a hot topic and they have it implemented in Larrabee, and the encoding bits for extending AVX to 1024-bit are reserved. So it's not as if I'm talking about things that would require a complete overhaul. It would fit perfectly in the list of things that have been added to the SIMD instruction set...
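To be concrete, the FMA part already looks like this at the intrinsics level (the FMA3 form; this is just a sketch assuming a compiler and CPU that support it):

#include <immintrin.h>

// Fused multiply-add on eight packed floats: computes a * x + y with a
// single rounding step, in one instruction per eight elements.
__m256 axpy(__m256 a, __m256 x, __m256 y)
{
    return _mm256_fmadd_ps(a, x, y);
}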
 
1.0 / 256 is in between texel 0 and 1, and due to round-to-nearest-even it selects texel 0. Are you getting different behavior with the reference rasterizer? Does 1.01 / 256 get you texel 1?

I don't think it's round-to-nearest-even. I'm not sure, but I have a little program that does 2 types of precision tests:
Fetched texel precision,
Texture coordinate precision.

Here's my source code (6 KB), don't know if it helps:
http://www.freefilehosting.net/gpuprecisiontest

Texture coordinate rounding for the reference rasterizer and recent NVIDIA/AMD GPUs is similar to the following (assuming a texture width of 256):
[0, 1/256), [1/256, 2/256), [2/256, 3/256),...where "[": inclusive and ")": exclusive.
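In other words the hardware mapping amounts to a floor; a tiny sketch for a 256-texel width (illustrative only, ignoring addressing modes):

#include <cmath>

// Maps [k/256, (k+1)/256) to texel k, matching the intervals above.
int texelFromCoord(float u)
{
    return static_cast<int>(std::floor(u * 256.0f));
}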

Actually no, SwiftShader behaves a bit differently.

for REF:
8bit average error: 0.000000e+000
10bit average error: 0.000000e+000
16bit average error: 0.000000e+000
I8 Texcorrd offset near 0[0]: 0.000000e+000
I8 Texcorrd offset near 255/256[0]: -2.980232e-008

for NVIDIA: (pretty bad precision for INT10)
8bit average error: 0.000000e+000
10bit average error: 1.855369e-009
16bit average error: 0.000000e+000
I8 Texcorrd offset near 0[0]: 0.000000e+000
I8 Texcorrd offset near 255/256[0]: -2.980232e-008

for Swift:
8bit average error: -9.469659e-009
10bit average error: 1.676381e-008
16bit average error: 1.164153e-010
I8 Texcorrd offset near 0[0]: -1.525879e-005
I8 Texcorrd offset near 255/256[0]: -2.980232e-008

Old AMD GPUs (HD3200), which are worse in terms of texcoord rounding:
8bit average error: 0.000000e+000
10bit average error: 0.000000e+000
16bit average error: 0.000000e+000
I8 Texcorrd offset near 0[0]: -1.221001e-004
I8 Texcorrd offset near 255/256[0]: -1.221001e-004

New AMD GPUs: the best of all the hardware above, only slightly imperfect.
ATI Mobility Radeon HD 5600/5700 Series
8bit average error: 0.000000e+000
10bit average error: 0.000000e+000
16bit average error: 0.000000e+000
I8 Texcorrd offset near 0[0]: -1.192046e-007
I8 Texcorrd offset near 255/256[0]: -2.980232e-008

The reason I said there's a -1/65536 offset for Swift is the offset found by the program: -1.525879e-005.

I would like to see a software renderer like SwiftShader getting better not only in terms of speed but also matching GPU precision. I'm actually quite amazed by any attempt to make a software renderer better.
 
I would like to see a software renderer like SwiftShader getting better not only in terms of speed but also matching GPU precision. I'm actually quite amazed by any attempt to make a software renderer better.
You can change some precision settings in the SwiftShader.ini file. The easiest way to edit it is to run your application in windowed mode and go to http://localhost:8080/swiftconfig. Set 'Transcendental function precision' to IEEE, and enable 'Exact color rounding'.

Thanks for the test app! I'll see what can be done to increase accuracy even more.

Edit 1: I got the "8bit average error" down to 0.0.

Edit 2: I also got the "Texcorrd offset near 0[0]" down to 0.0.

The "16bit average error" went down to 4.547686e-013, but interestingly this result even depended on the /fp:strict and /fp:precise compiler flags for your test app! By using float instead of double variables in this test, the error was 0.0.

I also noticed that you're simply adding the differences to compute the mean error. This seems incorrect to me since a difference of X and -X would result in a 0.0 error. So you should sum the absolute values instead, or perhaps you could use the root-mean-square error.
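Something along these lines (my sketch, not your code):

#include <cmath>
#include <vector>

// Mean absolute error: opposite-signed differences no longer cancel.
double meanAbsoluteError(const std::vector<double>& diffs)
{
    double sum = 0.0;
    for (double d : diffs)
        sum += std::fabs(d);
    return sum / diffs.size();
}

// Root-mean-square error: penalizes large outliers more heavily.
double rmsError(const std::vector<double>& diffs)
{
    double sum = 0.0;
    for (double d : diffs)
        sum += d * d;
    return std::sqrt(sum / diffs.size());
}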
 
Not really. You have to take the cost into account as well. It only takes relatively minor changes to make the CPU a lot more efficient at throughput computing.

Minor change my ass. You have no idea what you're talking about. The cost in power and area is incredibly high. Look at the power consumption for Larrabee if you don't believe me, and compare it to say, Nehalem-EX.

I don't see what's funny. Why does there have to be an announced roadmap? The people on this forum speculate about future architectures all the time. If you only want to talk about existing and announced products, I suggest you look for another forum. Tom's Hardware perhaps.

I prefer informed speculation and rational thought.


Personally I'd rather discuss what would happen if they did add full-fledged FMA support, gather/scatter, and 128B vectors. For what it's worth, FMA instructions are fully specified, gather/scatter is a hot topic and they have it implemented in Larrabee, and the encoding bits for extending AVX to 1024-bit are reserved. So it's not as if I'm talking about things that would require a complete overhaul. It would fit perfectly in the list of things that have been added to the SIMD instruction set...

There's a time and place for it. The real question though, is where will scatter/gather be added. It's already in the GPU...do you really need to reinvent the wheel?

DK
 
Minor change my ass. You have no idea what you're talking about. The cost in power and area is incredibly high. Look at the power consumption for Larrabee if you don't believe me, and compare it to say, Nehalem-EX.
I don't think there's anything you can conclude about the power consumption and area cost of the things I proposed, from comparing Larrabee against Nehalem-EX. But please go ahead and prove me wrong. I'd also love to know exactly what would cause this "incredibly high" power and area cost you talk about (a breakdown into individual components would be nice).
I prefer informed speculation and rational thought.
Me too. That still doesn't mean there has to be a public roadmap for AVX with these features already on it though.
There's a time and place for it. The real question though, is where will scatter/gather be added. It's already in the GPU...do you really need to reinvent the wheel?
Despite some form of gather/scatter support, the IGP is pretty terrible at generic computing. So from Intel's perspective there's a gap between the capabilities of CPUs and high-end discrete GPUs. By adding gather/scatter to the CPU they'd fill that gap at the lowest price. So yes, they really need to "reinvent the wheel" to keep their dominant position.
 
Haswell will support gather: Haswell New Instructions (AVX2).

This is a tock earlier than I had expected. It seems Intel split the difference between time to market and full scatter/gather by adding the first half to the AVX spec.
It also promotes integer SIMD to 256 bits, but defers extending AVX beyond that.
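For what it's worth, the spec exposes it through intrinsics along these lines (my sketch; assumes an AVX2-capable compiler and CPU):

#include <immintrin.h>

__m256 gatherExample(const float* table)
{
    // Eight dword indices into 'table'.
    __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    // vgatherdps: loads table[idx[i]] for each of the eight lanes.
    // The masked form, _mm256_mask_i32gather_ps, only loads elements
    // whose mask bit is set, which is where the partial-update
    // semantics come in.
    return _mm256_i32gather_ps(table, idx, 4); // scale = 4 bytes
}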

With what latency and throughput?

This would depend on the implementation, and so could vary significantly. At least initially, it looks like it could be microcoded.



The instruction can be suspended in an intermediate state and it partially updates its related mask and destination registers.
There is a general purpose register used as the base.
AVX2 subdivides things into 128-bit lanes, with restricted data transport between them.
It also does not check for alignment.

It sounds like microcode or something like it is involved for gather.
Drawing from a general-purpose register can insert latency into the process.
There may be some variation in latency depending on the proximity of the addresses, and on whether they lie within the same 128-bit lane that AVX divides things into.
One simplification could be that a gather is split into multiple 128-bit loads for 256-bit or wider AVX, with at least one load per 128-bit lane.
Another simplification could be that the gather cannot issue multiple sub-loads at the same time, so that the load/store units would not need to arbitrate access to the same destination register on write-back for the same instruction.

Internally, there would be a general-purpose register read crossing over to the vector side. Then each 128-bit lane would run a loop of index check/load/permute/merge.
Two bottlenecks would be the L/S units and the permute resources of the core.
It would seem redundant to give the gather unit its own permute capability when there are shuffle and permute units in the AVX units proper. A microcoded solution could leverage this.
The L/S limitations would be based on port width and the number of L/S units.
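As a rough scalar model of that per-lane loop (names and structure are mine, not Intel's):

#include <cstdint>

// Models a microcoded dword gather: per 128-bit lane, each element is
// checked against the mask, loaded, merged into the destination, and
// its mask element cleared, so a fault midway leaves resumable state.
void gatherModel(uint32_t dst[8], uint32_t mask[8],
                 const uint32_t* base, const int32_t idx[8])
{
    for (int lane = 0; lane < 2; ++lane)      // two 128-bit lanes
    {
        for (int e = 0; e < 4; ++e)           // four dwords per lane
        {
            int i = lane * 4 + e;
            if (mask[i])
            {
                dst[i] = base[idx[i]];        // one sub-load
                mask[i] = 0;                  // partial mask update
            }
        }
    }
}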
 
I guess something that just cracks into 8 load micro-ops would be the simplest. With 2 load ports, it could complete in 4 clocks in the best case.

It seems that full scatter/gather won't be available before 2015.
 
There wouldn't be shared write-back to the destination register, so long as it's treated as a single contiguous element.
At least on a design like current chips, the operation would serialize and we'd be back to at least 8 cycles.
I think there could be at least one cycle prior to the memory portion, devoted to importing the base register value, and then possibly a cycle at the end for completing the merge.
 
The load/store units need to write the value back to the same register, and the uops generated by the microcode engine can't write to the same register at the same time; at least they can't on current Intel designs.

Perhaps a design like Bulldozer's could save some of those cycles, because it doesn't have 256-bit registers and winds up cracking those ops anyway. There would be two physically distinct registers able to absorb new values.
 