#1 | |
|
Senior Member
Join Date: Jun 2003
Posts: 2,074
|
Quote:
Most MT implementations aren't that complex really.
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
|
#2 |
|
Senior Member
Join Date: Jan 2008
Posts: 211
|
Just for completeness, Aaron left out the 4x larger register file. That usually isn't a huge deal, but it could require an extra pipeline stage for the register file read/write. Ironically, x86's small number of architectural registers (16) means that making the register file 4x larger is less of a concern than for RISC ISAs that have 32 registers.
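To put rough numbers on the register file point, here is a minimal sketch. It assumes four hardware threads per core (the figure behind the "4x") and counts only architectural integer registers; the function name is made up for illustration and vector registers are ignored.

```python
def register_file_entries(arch_regs: int, hw_threads: int = 4) -> int:
    """Entries needed when every hardware thread keeps a private copy
    of the architectural registers (assumption: no sharing, no renaming)."""
    return arch_regs * hw_threads


x86_entries = register_file_entries(16)   # 16 architectural registers, as stated in the post
risc_entries = register_file_entries(32)  # a 32-register RISC ISA

print("x86 :", x86_entries, "entries")
print("RISC:", risc_entries, "entries")
```

Under those assumptions the multithreaded x86 register file ends up about the size of a two-thread RISC one, which is the sense in which the small register count helps.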
|
|
|
|
|
|
#3 |
|
Member
Join Date: Jun 2005
Location: IDF France
Posts: 2,447
|
Doesn't Intel hint at pretty high power consumption?
I mean, OK, for a Larrabee-like core the consumption should be a tenth of what a Penryn would use. But the ALUs are likely to be busier, the bus can be a real power hog, etc. The only value we have is around ~300 watts, and this is from Intel's own mouth. To me it looks like perf per mm² for Larrabee could be competitive/good enough (now). Power consumption and thermal dissipation could be more of a bother. Intel based its estimate on a 24-core Larrabee running @ 1 GHz. That could be OK against current GPUs, but by the time Larrabee is released, even if Intel packs more cores together, they need to hit their frequency target (~2 GHz). I think that's basically the reason why Larrabee won't be here in early 2009; even @ 45nm, a 16/24-core Larrabee clocked high enough could have a really huge impact on the GPGPU market. Intel has a lot fewer "software" reasons to hold back the launch of Larrabee as a general-purpose accelerator than as a GPU. I'm looking forward to their upcoming presentation to learn more about these potential issues (i.e. power/frequencies).
__________________
What's trying to be a bunch of presentations |
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Jan 2008
Posts: 211
|
As Larrabee hasn't fully "taped out" (meaning, the actual design is still being tweaked and debugged), even Intel probably doesn't know the actual power/frequency numbers. They have estimates and targets, but lots of things can still go wrong with the low-level implementation. A chip's clock frequency is only as high as its slowest part allows, so Intel is quite wise not to talk actual frequency numbers quite yet...
|
|
|
|
|
|
#5 |
|
Senior Member
|
Single ended 5+ GHz signalling across 2 sockets with 10+ cm of traces is a little harder than covering a couple of cm between two BGAs ... there's always going to be a huge gap until things go optical.
|
|
|
|
|
|
#6 | |
|
Regular
|
Quote:
The whole thing is here: http://www.pcper.com/images/news/A%2...m%20NVIDIA.pdf Jawed |
|
|
|
|
|
|
#7 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
|
is that document official?
it's not even signed, no nvidia logo, nothing..
__________________
[my blog] Isn't it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too? [Douglas Adams] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
#8 |
|
Regular
|
|
|
|
|
|
|
#9 |
|
Member
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
|
Some points are valid and good to remember, it is not as if you are going to get good performance on Larrabee without good vectorization of the code...
__________________
Timothy Farrar :: blog |
|
|
|
|
|
#10 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
|
Shader-wise I expect them to do a very good job; in the end it's the simplified programming model that made modern shaders so successful.
|
|
|
|
|
#11 |
|
Regular
|
|
|
|
|
|
|
#12 |
|
Junior Member
Join Date: Aug 2003
Posts: 50
|
"NVIDIA's approach to parallel computing has already proven to scale from 8 to 240 GPU cores."
How seriously can you take Nvidia's comments when they bend terminology like so? Change it to 30 cores or 10 three processor clusters and it doesn't sound quite so impressive. The HPC people I know are very interested in Larrabee. |
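For anyone wondering how 240 "cores" turns into 30, the arithmetic can be sketched as below. The cluster and processor counts are the ones given in the post; the 8 ALU lanes per processor is my assumption about the GT200-class part being discussed.

```python
CLUSTERS = 10               # "10 three processor clusters"
PROCESSORS_PER_CLUSTER = 3
LANES_PER_PROCESSOR = 8     # assumption: 8 scalar ALU lanes per multiprocessor

processors = CLUSTERS * PROCESSORS_PER_CLUSTER    # cores in the CPU sense of the word
marketing_cores = processors * LANES_PER_PROCESSOR  # every ALU lane counted as a "core"

print("multiprocessors:", processors)
print("'GPU cores'    :", marketing_cores)
```

Counting the same way, a single CPU core with a 16-wide vector unit would be "16 cores".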
|
|
|
|
|
#13 | |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
|
Quote:
|
|
|
|
|
|
#14 |
|
Senior Member
|
They will get relatively good utilization of the compute resources with shaders (although small tris and incoherent branching will hit it like a ton of bricks just like other GPUs). How much compute resources will they end up with though? I certainly wouldn't be surprised if ATI gets near twice the FLOPs per mm2 in the end, even with a process disadvantage.
|
|
|
|
|
|
#15 | |
|
Junior Member
Join Date: Aug 2003
Posts: 50
|
Quote:
If Larrabee does more than play games it can succeed even if it isn't the fastest or most efficient 'GPU'. A 4 TFLOP GPU that you use 10% of the time or a 2 TFLOP CPU you use 50% of the time, which would you buy? I know most of you guys would say both. A killer video or killer photo* app could sell a LOT of chips, and there are applications that aren't practical today that could become mass-market. *Have a look at recent SIGGRAPH papers that are to do with imaging rather than rendering, there are some amazing things that could be done with lots of FLOPS. |
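The 10% versus 50% comparison is just peak throughput weighted by how often the chip is actually doing something for you. A quick sketch, reading "use X% of the time" as a simple time-averaged fraction (my interpretation, and the function name is invented):

```python
def delivered_tflops(peak_tflops: float, fraction_of_time_used: float) -> float:
    """Time-averaged throughput: peak rate scaled by the fraction of time in use."""
    return peak_tflops * fraction_of_time_used


gpu_delivered = delivered_tflops(4.0, 0.10)  # the 4 TFLOP GPU used 10% of the time
cpu_delivered = delivered_tflops(2.0, 0.50)  # the 2 TFLOP CPU used 50% of the time

print(f"GPU: {gpu_delivered:.2f} TFLOPS averaged")
print(f"CPU: {cpu_delivered:.2f} TFLOPS averaged")
```

On that reading the slower part does more useful work over time, which is the point of the question.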
|
|
|
|
|
|
#16 | ||
|
hardware monkey
Join Date: Mar 2007
Posts: 3,834
|
Quote:
Quote:
|
||
|
|
|
|
|
#17 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 2,701
|
Right now, the GPU FLOPs are "cheap".
Actually, x86 clusters are popular for HPC because a lot of it benefits from "cheap flops". GPU FLOPs in comparison are not just cheap but "folex" cheap. Larrabee's FLOPs are IEEE compliant and they are tied to a hardware architecture with known memory consistency and coherency, as well as a pipeline capable of supporting precise exceptions and possibly better debugging instrumentation. GPUs are still too rooted at a base architectural level to the idea that what they compute doesn't need to be taken seriously.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#18 | |
|
Senior Member
Join Date: Jun 2003
Posts: 2,074
|
Quote:
It will be interesting to see if the various companies bring the DP capabilities of their designs to rough area efficient equivalence with the SP capabilities (ie. ~1/2 DP flop for every SP flop).
|
|
|
|
|
|
#19 |
|
Regular
|
With 16 elements per hardware thread, Larrabee will be paying the lowest cost of any GPU for DB incoherency. Additionally, with an x86 thread per core that's able (amongst other things) to "re-pack" elements to minimise incoherency, DB should end up reasonably useful - though I will admit non-D3D programmers will prolly find themselves having to implement re-packing.
As for ALU performance per mm, I think it'll be a lot closer than people are expecting. Having no dedicated interpolation and/or transcendental ALUs should save a fair bit of space. And Larrabee will be way out in front on double-precision. Jawed |
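A toy model of why batch width matters for DB incoherency. It assumes a batch whose elements disagree on a branch has to run both sides under predication, that both sides cost the same, and it uses a deliberately contrived workload where the outcome flips every 16 elements. The widths 32 and 64 stand in for the batch sizes usually quoted for current NVIDIA and ATI parts; none of this is measured data.

```python
def branch_passes(outcomes, width):
    """Count batch passes for one if/else over per-element branch outcomes.

    A coherent batch runs one side (1 pass); a mixed batch runs both
    sides under predication (2 passes).
    """
    passes = 0
    for start in range(0, len(outcomes), width):
        batch = outcomes[start:start + width]
        passes += 1 if len(set(batch)) == 1 else 2
    return passes


# Hypothetical workload: the branch outcome flips every 16 elements.
outcomes = [(i // 16) % 2 == 0 for i in range(256)]

lane_slots = {}
for width in (16, 32, 64):
    lane_slots[width] = branch_passes(outcomes, width) * width
    print(f"width {width:2d}: {lane_slots[width]} lane-slots for {len(outcomes)} elements")
```

With this pattern the 16-wide batches stay coherent while the wider ones spend half their lane-slots on masked-off work. Real shaders won't be this tidy, and re-packing would change the picture again.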
|
|
|
|
|
#20 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,516
|
Quote:
When running with 16 elements per hardware thread, remember that Larrabee switches batches every clock. Achieving that kind of scheduling throughput with a dynamic algorithm (i.e. not a predefined sequence) may not be possible. |
|
|
|
|
|
|
#21 |
|
Member
Join Date: Nov 2006
Posts: 128
|
From reading the paper, it sounds like Larrabee puts more of a burden on the programmer for extracting efficiency than GPUs do (especially in OGL/D3D but also CUDA). They've got smart people working hard to handle this for D3D and OGL apps, but if you go outside those or other libraries you take it on yourself -- and going outside these libraries is what Intel has been selling to developers.
The two big ones to me are:

(a) Hiding memory latency. They don't have lots of hardware-managed threads available per core to automatically hide latency. As far as I can tell from the paper, their renderer makes a "best effort" for general memory accesses, using prefetching and organizing the algorithms and data structures to maximize cache hits. They do have more cache than GPUs, but far less per "strand" than current CPUs do, and when they miss they're quickly going to start stalling. For texturing, they do SW context switching (between "fibers") on top of the HW threads, using a very simplistic scheduler -- just a ring of fibers per HW thread, blindly switch to the next after issuing a texture instruction. The SW is responsible for deciding when to switch and for actually performing the switch (consuming issue slots, etc.).

(b) Manually managing the predication stack. They have a vector predicate register, but it sounds like it's up to SW to push and pop it manually. They'll probably have some special support for this, but it looks like at least some of the burden is on SW to deal with this on every potentially-divergent branch/sync.

Neither of those sounds like much fun to me, or particularly easy. Both will affect efficiency and delivered "useful" FLOPs.

As Aaron pointed out, architecturally Larrabee and G80+ have much in common. The biggest difference, it seems to me, is going to be the programming model. CUDA hides the SIMD-ness of the hardware -- it's just an implementation detail. Nvidia can build wider or narrower cores (including non-SIMD) and the same code would run fine and scale pretty well. You need to be aware of SIMD to extract the maximum performance, but my experience so far is that you get pretty decent performance even if you don't think about it. Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors. If you ignore SIMD, you run on the scalar x86 units only, and get none of the benefit of all that throughput. This is why I think Intel's noise about Larrabee being x86 is pointless -- running pure x86 code on it is a waste, and SSE code won't even run (AFAIK) and would waste 3/4 of the throughput even if it did.

I think my preference is clear, though there's still a lot of unknowns about Larrabee that could change things. But regardless of which is better, the contrasts are fascinating.

EDIT: Oops, forgot one point I wanted to make. For the reasons above, I'd want to use the CUDA programming model even when programming on Larrabee, and have a compiler+runtime map it to Larrabee vector instructions. Which shouldn't actually be too hard, hopefully someone implements this so I don't have to.

Last edited by armchair_architect; 14-Aug-2008 at 05:32. Reason: added the punchline |
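To make point (b) concrete, here is a rough sketch of what software-managed predication looks like for one if/else. Everything in it is hypothetical: the 16-lane width matches the batches discussed in this thread, but the `MaskStack` class and `masked_assign` helper are invented for illustration and are not Larrabee instructions or any real API.

```python
WIDTH = 16  # assumption: 16 lanes per vector


class MaskStack:
    """Software-managed predicate stack: the program pushes and pops masks itself."""

    def __init__(self, width=WIDTH):
        self.active = [True] * width   # current vector predicate
        self._saved = []               # enclosing masks, innermost last

    def push_if(self, cond):
        # Enter the "then" side: save the outer mask, narrow the active one.
        self._saved.append(self.active)
        self.active = [a and c for a, c in zip(self.active, cond)]

    def flip_else(self, cond):
        # Enter the "else" side: lanes active outside the branch that failed cond.
        outer = self._saved[-1]
        self.active = [o and not c for o, c in zip(outer, cond)]

    def pop(self):
        # Leave the branch: restore the enclosing mask.
        self.active = self._saved.pop()


def masked_assign(dest, values, mask):
    """Write values into dest only on lanes where the mask is set."""
    return [v if m else d for d, v, m in zip(dest, values, mask)]


x = list(range(WIDTH))
cond = [v % 2 == 0 for v in x]        # per-lane branch condition
result = [0] * WIDTH

masks = MaskStack()
masks.push_if(cond)
result = masked_assign(result, [v * 10 for v in x], masks.active)  # "then" side
masks.flip_else(cond)
result = masked_assign(result, [-v for v in x], masks.active)      # "else" side
masks.pop()

print(result)
```

Both sides of the branch get executed for the whole vector; the mask only decides which lanes keep the result. Every push, flip and pop here is an instruction the program has to issue itself, which is the burden being described.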
|
|
|
|
|
#22 | |||
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
|
Quote:
For all we know it might even have been implemented as an operation that can be scheduled on the U pipe, so that in the same cycle an instruction that operates on the vec registers can be scheduled as well.
Quote:
We should also not forget DX11 compute shaders!
Quote:
|||
|
|
|
|
|
#23 | |
|
Regular
|
Quote:
Jawed |
|
|
|
|
|
|
#24 | |
|
Senior Member
|
Quote:
(-) non portable
(-) compiler specific
(-) reduced to writing assembly in 2008. From a company which supposedly has the best compilers in the world. WTF
Why did they not push for a library (standardized one on x86, at least, with alternative implementations avl on PPC etc.)? Too smitten by IPP revenues? I hope they do better with AVX. Which is why I love CUDA in this regard. They can go from 8 wide to 32 wide SIMD without anyone breaking a sweat. And its programming model is such that further scaling SIMD width is much less painful. Intel, however, can win a lot of points if it were to release a CUDA compiler for nvidia. They will be able to co-opt all of nvidia's efforts at developer evangelism into a huge codebase running just fine on larrabee. |
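The "8 wide to 32 wide without breaking a sweat" argument can be sketched like this: the kernel is written per element, and a runtime decides how elements are grouped into hardware batches. The `launch` function is a made-up stand-in for such a runtime, not the CUDA API, and the saxpy-style kernel is only an example.

```python
def launch(kernel, n, simd_width):
    """Hypothetical runtime: evaluate kernel(i) for i in range(n),
    grouping elements into batches of simd_width lanes."""
    out = [None] * n
    batches = 0
    for start in range(0, n, simd_width):
        # Lanes past the end of the data are simply masked off.
        for i in range(start, min(start + simd_width, n)):
            out[i] = kernel(i)
        batches += 1
    return out, batches


xs = [float(i) for i in range(100)]
ys = [1.0] * 100


def saxpy_element(i):
    # Per-element code: no vector width appears anywhere in the kernel.
    return 2.0 * xs[i] + ys[i]


results = {w: launch(saxpy_element, len(xs), w) for w in (8, 16, 32)}
for w, (out, batches) in results.items():
    print(f"width {w:2d}: {batches} batches, last value {out[-1]}")
```

The kernel source is identical for every width; only the batch count changes. Intrinsics code with the width written into it has no equivalent of that.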
|
|
|
|
|
|
#25 | ||
|
Regular
|
Quote:
Quote:
Fibres switch only when latency is incurred.

"One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads [16 elements] on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing."

So "every clock" switching only occurs around a texture instruction. As far as the hardware thread is concerned, all of its extant fibres' states exist concurrently in memory (cache) and register file. The switch from one fibre to another is no more costly than in a GPU where a register starts to be used for the first time.

As far as I can tell the compiler provides directives to the processor for it to perform a hardware thread switch (I assume these are just normal stream-paradigm clause identifiers).

"Finally, Larrabee supports four threads of execution, with separate register sets per thread. Switching threads covers cases where the compiler is unable to schedule code without stalls. [which I assume to mean the evaluation of predicates and choosing the resulting JMP destination, amongst other things] Switching threads also covers part of the latency to load from the L2 cache to the L1 cache, for those cases when data cannot be prefetched in the L1 cache in advance."

Doing fibre re-packing to optimise DB needs a combination of predicate evaluation and shuffling of state. I assume that state is manipulated in L1/L2 cache, i.e. this is programmed as a scatter from VPU register file into competing pools then consumed one pool at a time. Clearly if the shader program has a very short DB clause then it's going to be slower to re-pack than to use predication. But with nested DB I presume it won't take many levels before it becomes better to re-pack. The issue, then is how much cache is consumed in pooling state as registers are pulled out of the VPU (sort of creating an F-buffer). Memory latency then becomes an issue too. The algorithm that determines the count of fibres allocated per thread presumably assesses the incoherency/nesting of DB if re-packing is desired.

Jawed |
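The fibre scheme quoted from the paper (circular queue, switch after each texture read) can be mimicked with generators. This is only a sketch of the scheduling pattern: the names are invented, nothing here models real latency, and the fibre-count formula assumes every fibre does the same amount of ALU work between texture reads.

```python
import math
from collections import deque


def fiber(name, texture_reads, log):
    """One fibre: issue a texture read, switch away, then consume the result."""
    for n in range(texture_reads):
        log.append(f"{name}: issue tex {n}")
        yield                          # co-operative switch right after the texture read
        log.append(f"{name}: use tex {n}")


def run_ring(fibers):
    """Run fibres in a circular queue until all have finished; return the switch count."""
    ring = deque(fibers)
    switches = 0
    while ring:
        current = ring.popleft()
        try:
            next(current)              # run until the next texture read
            ring.append(current)       # back of the queue
            switches += 1
        except StopIteration:
            pass                       # fibre finished, drop it from the ring
    return switches


def fibers_needed(latency_clocks, clocks_between_reads):
    """Fibres per thread so that the other fibres' work covers one texture latency."""
    return math.ceil(latency_clocks / clocks_between_reads) + 1


log = []
switches = run_ring([fiber(f"fibre{i}", 2, log) for i in range(3)])
print("\n".join(log))
print("switches:", switches)
print("fibres for 200 clocks latency, 50 clocks of work each:", fibers_needed(200, 50))
```

The 200 and 50 clock figures are illustrative only. Note that the scheduler never checks whether a result is ready; it relies on the fibre count being high enough, which matches the "blindly switch" description earlier in the thread.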
||
|
|
|
|