Old 07-Aug-2008, 06:00   #1
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,074
Default

Quote:
Originally Posted by ShaidarHaran View Post
Hardly. If you architect a core from the ground up with these features in mind, it is simple, as they are complementary. Slapping them onto a P54C is an entirely different matter.
You do realize that almost ALL MT architectures were slapped on, don't you? It doesn't require much: for 4 threads, 2 extra internal bits of register ID, a couple of extra bits around the TLBs and memory queues, and you are done.

Most MT implementations aren't that complex really.
__________________
Aaron Spink
speaking for myself inc.
Old 07-Aug-2008, 14:12   #2
ArchitectureProfessor
Senior Member
 
Join Date: Jan 2008
Posts: 211
Default

Quote:
Originally Posted by aaronspink View Post
You do realize that almost ALL MT architectures were slapped on, don't you? It doesn't require much: for 4 threads, 2 extra internal bits of register ID, a couple of extra bits around the TLBs and memory queues, and you are done.
Just for completeness, Aaron left out the 4x larger register file. That usually isn't a huge deal, but it could require an extra pipeline stage for the register file read/write. Ironically, x86's small number of architectural registers (16) means that making the register file 4x larger is less of a concern than for RISC ISAs that have 32 registers.
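To put rough numbers on the exchange above, here is a minimal sketch of the arithmetic. It counts architectural integer registers only and ignores register width, port count and any renaming, so treat it as an illustration; the function name is made up for this example.

```python
def smt_cost(threads, arch_regs):
    """Return (extra register-ID bits, total register file entries).

    Simplified model: every hardware thread gets a full private copy of
    the architectural register set, selected by a thread-ID prefix.
    """
    extra_id_bits = (threads - 1).bit_length()  # bits needed to name a thread
    total_entries = threads * arch_regs
    return extra_id_bits, total_entries


# 4 threads on a 16-register ISA (x86-64) vs. a 32-register RISC ISA
print(smt_cost(4, 16))
print(smt_cost(4, 32))
```

Under these assumptions 4 threads cost 2 ID bits either way, but the 32-register ISA ends up with twice as many register file entries to read and write.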
Old 08-Aug-2008, 10:31   #3
liolio
Member
 
Join Date: Jun 2005
Location: IDF France
Posts: 2,447
Default

Doesn't Intel hint at pretty high power consumption?
I mean, OK, for a Larrabee-like core the consumption should be a tenth of what a Penryn would use.

But the ALUs are likely to be busier, the bus can be pretty much a power hog, etc.

The only value we have is around 300 watts, and this is from Intel's own mouth.

To me it looks like perf per mm² for Larrabee could be competitive/good enough (now).
Power consumption and thermal dissipation could be more of a problem.

Intel based its estimation on a 24-core Larrabee running @ 1 GHz. It could be OK against current GPUs, but by the time Larrabee is released, even if Intel packs more cores together, they need to hit their frequency figures (~2 GHz).

I think it's basically the reason why Larrabee won't be here in early 2009; even @ 45nm, a 16/24-core Larrabee clocked high enough could have a really huge impact on the GPGPU market.
Intel has a lot fewer "software" reasons to hold back on the launch of Larrabee as a general-purpose accelerator than as a GPU.

I'm looking forward to their upcoming presentation to learn more about these potential issues (i.e. power/frequencies).
Old 08-Aug-2008, 15:13   #4
ArchitectureProfessor
Senior Member
 
Join Date: Jan 2008
Posts: 211
Default

Quote:
Originally Posted by liolio View Post
I'm looking forward to their upcoming presentation to learn more about these potential issues (i.e. power/frequencies).
As Larrabee hasn't fully "taped out" (meaning, the actual design is still being tweaked and debugged), even Intel probably doesn't know the actual power/frequency numbers. They have estimates and targets, but lots of things can still go wrong with the low-level implementation. A chip's clock is only as fast as its slowest part, so Intel is quite wise not to talk actual frequency numbers quite yet...
Old 08-Aug-2008, 20:09   #5
MfA
Senior Member
 
Join Date: Feb 2002
Posts: 4,277
Default

Single ended 5+ GHz signalling across 2 sockets with 10+ cm of traces is a little harder than covering a couple of cm between two BGAs ... there's always going to be a huge gap until things go optical.
Old 13-Aug-2008, 19:28   #6
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default NVIDIA tackles the Larrabee question - FUD alert

Quote:
Intel claims the X86 instruction set makes parallel computing easier to accomplish
And then I got bored.

The whole thing is here:

http://www.pcper.com/images/news/A%2...m%20NVIDIA.pdf

Jawed
Old 13-Aug-2008, 20:21   #7
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

is that document official?
it's not even signed, no nvidia logo, nothing..
__________________
[my blog]
Isn't it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too? [Douglas Adams]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way
Old 13-Aug-2008, 20:28   #8
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default

Sigh, I forgot to link to the article:

http://www.pcper.com/comments.php?nid=6008

Jawed
Old 13-Aug-2008, 20:50   #9
TimothyFarrar
Member
 
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
Default

Some points are valid and good to remember: it is not as if you are going to get good performance on Larrabee without good vectorization of the code...
__________________
Timothy Farrar :: blog
Old 13-Aug-2008, 20:52   #10
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

Shader-wise I expect them to do a very good job; in the end it's the simplified programming model that made modern shaders so successful.
Old 13-Aug-2008, 21:55   #11
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default

Quote:
Originally Posted by nAo View Post
Shaders wise I expect them to do a very good job [...]
And since it looks like Larrabee will be "serial-scalar done right" it'll prolly work quite well

Jawed
Old 13-Aug-2008, 22:07   #12
glw
Junior Member
 
Join Date: Aug 2003
Posts: 50
Default

"NVIDIA's approach to parallel computing has already proven to scale from 8 to 240 GPU cores."

How seriously can you take Nvidia's comments when they bend terminology like so? Change it to 30 cores or 10 three processor clusters and it doesn't sound quite so impressive.

The HPC people I know are very interested in Larrabee.
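For anyone wanting to check the counting, here is a small sketch of how the same chip yields 240, 30 or 10 depending on what you call a "core". It assumes the GT200 layout of 10 clusters, 3 multiprocessors per cluster and 8 scalar ALUs per multiprocessor; the variable and function names are just for illustration.

```python
# GT200-style hierarchy (assumed layout for this sketch)
CLUSTERS = 10          # texture/processor clusters on the chip
SM_PER_CLUSTER = 3     # multiprocessors per cluster
ALU_PER_SM = 8         # scalar ALU lanes per multiprocessor


def count_cores(clusters, sm_per_cluster, alu_per_sm):
    """Return the 'core' count under three different definitions."""
    multiprocessors = clusters * sm_per_cluster
    return {
        "scalar ALU lanes": multiprocessors * alu_per_sm,
        "multiprocessors": multiprocessors,
        "clusters": clusters,
    }


for name, n in count_cores(CLUSTERS, SM_PER_CLUSTER, ALU_PER_SM).items():
    print(f"{name}: {n}")
```

Only the multiprocessor has its own instruction issue, which is why 30 is arguably the figure to compare against a CPU-style core count.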
Old 13-Aug-2008, 22:09   #13
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

Quote:
Originally Posted by glw View Post
"NVIDIA's approach to parallel computing has already proven to scale from 8 to 240 GPU cores."

How seriously can you take Nvidia's comments when they bend terminology like so? Change it to 30 cores or 10 three processor clusters and it doesn't sound quite so impressive.
I was just waiting for a post from Aaron about this matter
Old 13-Aug-2008, 22:09   #14
MfA
Senior Member
 
Join Date: Feb 2002
Posts: 4,277
Default

They will get relatively good utilization of the compute resources with shaders (although small tris and incoherent branching will hit it like a ton of bricks, just like other GPUs). How many compute resources will they end up with, though? I certainly wouldn't be surprised if ATI gets near twice the FLOPs per mm² in the end, even with a process disadvantage.
Old 13-Aug-2008, 22:27   #15
glw
Junior Member
 
Join Date: Aug 2003
Posts: 50
Default

Quote:
Originally Posted by MfA View Post
I certainly wouldn't be surprised if ATI gets near twice the FLOPs per mm² in the end, even with a process disadvantage.
I think that's quite probable, but it's delivered performance not peak that matters.

If Larrabee does more than play games it can succeed even if it isn't the fastest or most efficient 'GPU'. A 4 TFLOP GPU that you use 10% of the time or a 2 TFLOP CPU you use 50% of the time, which would you buy? I know most of you guys would say both.

A killer video or killer photo* app could sell a LOT of chips, and there are applications that aren't practical today that could become mass-market.

*Have a look at recent SIGGRAPH papers that are to do with imaging rather than rendering, there are some amazing things that could be done with lots of FLOPS.
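The 4 TFLOP vs 2 TFLOP question above comes down to simple arithmetic. This sketch reads "use X% of the time" as the fraction of time the chip is doing useful work at peak rate, which is a simplifying assumption; the function name is made up.

```python
def delivered_tflops(peak_tflops, busy_fraction):
    """Average useful throughput if the chip runs at peak only part of the time."""
    return peak_tflops * busy_fraction


gpu = delivered_tflops(peak_tflops=4.0, busy_fraction=0.10)
cpu = delivered_tflops(peak_tflops=2.0, busy_fraction=0.50)
print(f"4 TFLOP part used 10% of the time: {gpu:.1f} TFLOPs delivered")
print(f"2 TFLOP part used 50% of the time: {cpu:.1f} TFLOPs delivered")
```

On those figures the chip with half the peak delivers 2.5 times the useful work.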
Old 13-Aug-2008, 22:35   #16
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,834
Default

Quote:
Originally Posted by glw View Post
I think that's quite probable, but it's delivered performance not peak that matters.

If Larrabee does more than play games it can succeed even if it isn't the fastest or most efficient 'GPU'. A 4 TFLOP GPU that you use 10% of the time or a 2 TFLOP CPU you use 50% of the time, which would you buy? I know most of you guys would say both.
That's a tough one... Peak performance wins benchmarks, and benchmark wins sell hardware. I guess we'd have to get into specific cases to really answer that question.

Quote:
Originally Posted by glw View Post
A killer video or killer photo* app could sell a LOT of chips, and there are applications that aren't practical today that could become mass-market.

*Have a look at recent SIGGRAPH papers that are to do with imaging rather than rendering, there are some amazing things that could be done with lots of FLOPS.
I've been espousing this view about GPGPU for a while now.
Old 13-Aug-2008, 22:38   #17
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 2,701
Default

Right now, the GPU FLOPs are "cheap".
Actually, x86 clusters are popular for HPC because a lot of it benefits from "cheap flops".
GPU FLOPs in comparison are not just cheap but "folex" cheap.

Larrabee's FLOPs are IEEE compliant and they are tied to a hardware architecture with known memory consistency and coherency, as well as a pipeline capable of supporting precise exceptions and possibly better debugging instrumentation.

GPUs are still too rooted at a base architectural level to the idea that what they compute doesn't need to be taken seriously.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 13-Aug-2008, 23:27   #18
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,074
Default

Quote:
Originally Posted by 3dilettante View Post
GPUs are still too rooted at a base architectural level to the idea that what they compute doesn't need to be taken seriously.
very much true, I wouldn't want to rely on the current GPUs for anything life critical or computationally critical like nuclear simulation.

It will be interesting to see if the various companies bring the DP capabilities of their designs to rough area efficient equivalence with the SP capabilities (ie. ~1/2 DP flop for every SP flop).
Old 13-Aug-2008, 23:00   #19
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default

With 16 elements per hardware thread, Larrabee will be paying the lowest cost of any GPU for DB incoherency. Additionally, with an x86 thread per core that's able (amongst other things) to "re-pack" elements to minimise incoherency, DB should end up reasonably useful - though I will admit non-D3D programmers will prolly find themselves having to implement re-packing.

As for ALU performance per mm², I think it'll be a lot closer than people are expecting. Having no dedicated interpolation and/or transcendental ALUs should save a fair bit of space. And Larrabee will be way out in front on double-precision.
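A toy model of why batch width matters for DB incoherency. It assumes each element takes the branch independently with probability p, which is pessimistic for real pixels (neighbours tend to branch the same way), so only the trend is meaningful. The widths compared are 16 (Larrabee qquad), 32 (NVIDIA warp) and 64 (ATI wavefront); the function name is hypothetical.

```python
def coherent_fraction(p_taken, width):
    """Probability that all `width` elements of a batch go the same way.

    Assumes independent branch outcomes per element (a simplification).
    A coherent batch executes one side of the branch; an incoherent
    batch has to execute both sides under predication.
    """
    return p_taken ** width + (1.0 - p_taken) ** width


for width in (16, 32, 64):
    frac = coherent_fraction(0.99, width)
    print(f"width {width:2d}: {frac:.3f} of batches stay coherent")
```

Even with 99% of elements agreeing, the wider the batch, the more often at least one element disagrees and drags the whole batch through both paths.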

Jawed
Old 14-Aug-2008, 01:32   #20
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,516
Default

Quote:
Originally Posted by Jawed View Post
With 16 elements per hardware thread, Larrabee will be paying the lowest cost of any GPU for DB incoherency. Additionally, with an x86 thread per core that's able (amongst other things) to "re-pack" elements to minimise incoherency, DB should end up reasonably useful - though I will admit non-D3D programmers will prolly find themselves having to implement re-packing.
There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task.

When running with 16 elements per hardware thread, remember that Larrabee switches batches every clock. Achieving that kind of scheduling throughput with a dynamic algorithm (i.e. not a predefined sequence) may not be possible.
Old 14-Aug-2008, 05:28   #21
armchair_architect
Member
 
Join Date: Nov 2006
Posts: 128
Default

From reading the paper, it sounds like Larrabee puts more of a burden on the programmer for extracting efficiency than GPUs do (especially in OGL/D3D but also CUDA). They've got smart people working hard to handle this for D3D and OGL apps, but if you go outside those or other libraries you take it on yourself -- and going outside these libraries is what Intel has been selling to developers.

The two big ones to me are:

(a) Hiding memory latency. They don't have lots of hardware-managed threads available per core to automatically hide latency. As far as I can tell from the paper, their renderer makes a "best effort" for general memory accesses, using prefetching and organizing the algorithms and data structures to maximize cache hits. They do have more cache than GPUs, but far less per "strand" than current CPUs do, and when they miss they're quickly going to start stalling. For texturing, they do SW context switching (between "fibers") on top of the HW threads, using a very simplistic scheduler -- just a ring of fibers per HW thread, blindly switch to the next after issuing a texture instruction. The SW is responsible for deciding when to switch and for actually performing the switch (consuming issue slots, etc.).

(b) Manually managing the predication stack. They have a vector predicate register, but it sounds like it's up to SW to push and pop it manually. They'll probably have some special support for this, but it looks like at least some of the burden is on SW to deal with this on every potentially-divergent branch/sync.

Neither of those sounds like much fun to me, or particularly easy. Both will affect efficiency and delivered "useful" FLOPs.
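To illustrate point (b), here is a minimal sketch of what "software pushes and pops the predicate" could look like for a single if/else. It emulates vector lanes with Python lists; the class, method names and the example kernel are all hypothetical and say nothing about Larrabee's actual instructions.

```python
class MaskStack:
    """Software-managed predicate stack for one vector of lanes."""

    def __init__(self, width):
        self.active = [True] * width   # current vector predicate
        self._saved = []               # outer predicates, pushed by software

    def push_if(self, cond):
        self._saved.append(self.active)
        self.active = [a and c for a, c in zip(self.active, cond)]

    def flip_else(self, cond):
        outer = self._saved[-1]
        self.active = [o and not c for o, c in zip(outer, cond)]

    def pop(self):
        self.active = self._saved.pop()


def masked_store(dst, src, mask):
    """Write src into dst only in lanes where the predicate is set."""
    return [s if m else d for d, s, m in zip(dst, src, mask)]


def kernel(xs):
    """Per lane: y = -x if x < 0 else 2 * x, done with predication."""
    masks = MaskStack(len(xs))
    out = [0] * len(xs)
    cond = [x < 0 for x in xs]
    masks.push_if(cond)                                   # enter 'if'
    out = masked_store(out, [-x for x in xs], masks.active)
    masks.flip_else(cond)                                 # enter 'else'
    out = masked_store(out, [2 * x for x in xs], masks.active)
    masks.pop()                                           # reconverge
    return out


print(kernel([-1, 2, -3, 4]))
```

Every potentially divergent branch needs this push / flip / pop bookkeeping, and both sides are computed for all lanes; that is the burden being described.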

As Aaron pointed out, architecturally Larrabee and G80+ have much in common. The biggest difference, it seems to me, is going to be the programming model. CUDA hides the SIMD-ness of the hardware -- it's just an implementation detail. Nvidia can build wider or narrower cores (including non-SIMD) and the same code would run fine and scale pretty well. You need to be aware of SIMD to extract the maximum performance, but my experience so far is that you get pretty decent performance even if you don't think about it.

Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors. If you ignore SIMD, you run on the scalar x86 units only, and get none of the benefit of all that throughput. This is why I think Intel's noise about Larrabee being x86 is pointless -- running pure x86 code on it is a waste, and SSE code won't even run (AFAIK) and would waste 3/4 of the throughput even if it did.
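A small sketch of the contrast drawn in the two paragraphs above. The kernel is written per element, CUDA-style, and a stand-in "runtime" maps it onto whatever width the hardware has; Python lists stand in for vector lanes, and all names are invented for the example. The last function reproduces the "waste 3/4 of the throughput" figure for 4-wide code on a 16-wide unit.

```python
def saxpy_element(a, x, y):
    """Per-element kernel: no SIMD width appears anywhere in it."""
    return a * x + y


def run_spmd(kernel, a, xs, ys, simd_width):
    """Stand-in runtime: carve the work into batches of simd_width lanes."""
    out = []
    for start in range(0, len(xs), simd_width):
        lanes = range(start, min(start + simd_width, len(xs)))
        out.extend(kernel(a, xs[i], ys[i]) for i in lanes)
    return out


def lane_utilization(code_width, hw_width):
    """Fraction of hardware lanes used by code with a baked-in width."""
    return min(code_width, hw_width) / hw_width


xs = list(range(40))
ys = [1] * 40
# Same source, same answer, whatever width the hardware happens to have
assert run_spmd(saxpy_element, 2, xs, ys, 8) == run_spmd(saxpy_element, 2, xs, ys, 32)

print(lane_utilization(code_width=4, hw_width=16))
```

The per-element version leaves the width as a runtime parameter, whereas code written against 4-wide registers carries the 4 in every load, store and shuffle.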

I think my preference is clear, though there's still a lot of unknowns about Larrabee that could change things. But regardless of which is better, the contrasts are fascinating.

EDIT: Oops, forgot one point I wanted to make. For the reasons above, I'd want to use the CUDA programming model even when programming on Larrabee, and have a compiler+runtime map it to Larrabee vector instructions. Which shouldn't actually be too hard, hopefully someone implements this so I don't have to.

Last edited by armchair_architect; 14-Aug-2008 at 05:32. Reason: added the punchline
Old 14-Aug-2008, 07:26   #22
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
Default

Quote:
Originally Posted by armchair_architect View Post
The SW is responsible for deciding when to switch and for actually performing the switch (consuming issue slots, etc.).
This operation is likely to be so rare that I wouldn't worry about it consuming slots.
For all we know it might even have been implemented as an operation that can be scheduled on the U pipe, so that in the same cycle an instruction that operates on the vec registers can be scheduled as well.

Quote:
Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors. If you ignore SIMD, you run on the scalar x86 units only, and get none of the benefit of all that throughput. This is why I think Intel's noise about Larrabee being x86 is pointless -- running pure x86 code on it is a waste, and SSE code won't even run (AFAIK) and would waste 3/4 of the throughput even if it did.
Isn't it a bit far-fetched to say that the programming model you describe here will be the only one available? AFAIK Intel is part of the group working on OpenCL, and they also have Ct, which I hope will see the light of day at some point in the future.
We should also not forget DX11 compute shaders!

Quote:
EDIT: Oops, forgot one point I wanted to make. For the reasons above, I'd want to use the CUDA programming model even when programming on Larrabee, and have a compiler+runtime map it to Larrabee vector instructions. Which shouldn't actually be too hard, hopefully someone implements this so I don't have to.
Exactly my thought, I'm no compilers expert but having CUDA or something CUDA-like sounds quite feasible.
Old 14-Aug-2008, 12:44   #23
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default

Quote:
Originally Posted by armchair_architect View Post
CUDA hides the SIMD-ness of the hardware [...]
Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself.
But doesn't this distinction become moot as soon as you start programming the memory system for maximum performance? Now you're trying to maximise cache hits, maximise scatter/gather bandwidth or traverse sparse data sets with maximum coherence. All those tasks become intimately tied into the SIMD width and corresponding path sizes/latencies within the memory hierarchy.

Jawed
Old 14-Aug-2008, 14:26   #24
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 2,369
Default

Quote:
Originally Posted by armchair_architect View Post

Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors.
This is what drives me mad about the present state of SSE. Why the hell is the only way to code for it to go to compiler intrinsics?

(-) non portable
(-) compiler specific
(-) reduced to writing assembly in 2008. From a company which supposedly has the best compilers in the world. WTF

Why did they not push for a library (a standardized one on x86, at least, with alternative implementations available on PPC etc.)? Too smitten by IPP revenues? I hope they do better with AVX. Why the hell should I rewrite every fucking bit of my most optimized code inside out just because they added new instructions?
Which is why I love CUDA in this regard. They can go from 8-wide to 32-wide SIMD without anyone breaking a sweat. And its programming model is such that further scaling of SIMD width is much less painful.

Intel, however, could win a lot of points if it were to release a CUDA compiler for nvidia. They would be able to co-opt all of nvidia's efforts at developer evangelism into a huge codebase running just fine on Larrabee.
__________________
The views presented here are my own and do not represent my present or past employers' views in any way.
My blog
Eigen : simd done right
Old 14-Aug-2008, 12:18   #25
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,262
Default

Quote:
Originally Posted by Mintmaster View Post
There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task.
I think it's reasonable to assume the x86 part of each core (split across 4 hardware threads) is close to being fully dedicated to "managing" the execution of the 3D pipeline upon the VPU and all associated housekeeping.

Quote:
When running with 16 elements per hardware thread, remember that Larrabee switches batches every clock. Achieving that kind of scheduling throughput with a dynamic algorithm (i.e. not a predefined sequence) may not be possible.
I'm not sure how you're defining a batch here, because there is no "per-clock" switching.

Fibres switch only when latency is incurred. "One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads [16 elements] on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing."

So "every clock" switching only occurs around a texture instruction.
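The circular fibre queue quoted above is easy to model. This is a sketch under made-up numbers (200 clocks of texture latency, 50 clocks of shader work between texture reads); the paper only says "hundreds of clocks", and the function names are hypothetical. Switch overhead is ignored.

```python
import math


def fibers_needed(texture_latency, clocks_between_reads):
    """Smallest ring size for which a fibre's texture result is ready
    by the time control comes back round to it."""
    return math.ceil(texture_latency / clocks_between_reads) + 1


def stall_clocks(num_fibers, texture_latency, clocks_between_reads, reads_per_fiber):
    """Simulate one hardware thread cycling through its fibres in order."""
    clock = 0
    stalls = 0
    ready_at = [0] * num_fibers          # when each fibre's texture data arrives
    for _ in range(reads_per_fiber):
        for f in range(num_fibers):      # circular queue order, no real scheduling
            if ready_at[f] > clock:      # result not back yet: the thread stalls
                stalls += ready_at[f] - clock
                clock = ready_at[f]
            clock += clocks_between_reads            # shader work, then texture read
            ready_at[f] = clock + texture_latency    # switch to the next fibre
    return stalls


n = fibers_needed(texture_latency=200, clocks_between_reads=50)
print(n, stall_clocks(n, 200, 50, reads_per_fiber=10))
print(n - 1, stall_clocks(n - 1, 200, 50, reads_per_fiber=10))
```

With one fibre fewer than `fibers_needed` reports, the thread starts stalling on every trip round the ring, which is why the fibre count has to be matched to the shader's work per texture read.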

As far as the hardware thread is concerned, all of its extant fibres' states exist concurrently in memory (cache) and register file. The switch from one fibre to another is no more costly than in a GPU where a register starts to be used for the first time.

As far as I can tell the compiler provides directives to the processor for it to perform a hardware thread switch (I assume these are just normal stream-paradigm clause identifiers). "Finally, Larrabee supports four threads of execution, with separate register sets per thread. Switching threads covers cases where the compiler is unable to schedule code without stalls. [which I assume to mean the evaluation of predicates and choosing the resulting JMP destination, amongst other things] Switching threads also covers part of the latency to load from the L2 cache to the L1 cache, for those cases when data cannot be prefetched in the L1 cache in advance."

Doing fibre re-packing to optimise DB needs a combination of predicate evaluation and shuffling of state. I assume that state is manipulated in L1/L2 cache, i.e. this is programmed as a scatter from VPU register file into competing pools then consumed one pool at a time.

Clearly if the shader program has a very short DB clause then it's going to be slower to re-pack than to use predication. But with nested DB I presume it won't take many levels before it becomes better to re-pack. The issue, then, is how much cache is consumed in pooling state as registers are pulled out of the VPU (sort of creating an F-buffer). Memory latency then becomes an issue too. The algorithm that determines the count of fibres allocated per thread presumably assesses the incoherency/nesting of DB if re-packing is desired.
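A back-of-envelope model of the re-pack versus predication trade-off described above. All clock costs are invented for illustration, including the per-batch re-pack overhead, and cache pressure and memory latency are ignored entirely; the function names are hypothetical.

```python
import math


def predicated_cost(takes_then, width, then_clocks, else_clocks):
    """Each batch pays for every side of the branch that any element needs."""
    total = 0
    for start in range(0, len(takes_then), width):
        batch = takes_then[start:start + width]
        if any(batch):
            total += then_clocks
        if not all(batch):
            total += else_clocks
    return total


def repacked_cost(takes_then, width, then_clocks, else_clocks, repack_clocks):
    """Scatter elements into two pools, then run each pool in full batches."""
    n_then = sum(takes_then)
    n_else = len(takes_then) - n_then
    batches_in = math.ceil(len(takes_then) / width)
    return (batches_in * repack_clocks
            + math.ceil(n_then / width) * then_clocks
            + math.ceil(n_else / width) * else_clocks)


# Worst case for predication: outcomes alternate element by element
pattern = [i % 2 == 0 for i in range(64)]

# Long clauses: re-packing pays for itself
print(predicated_cost(pattern, 16, 100, 100), repacked_cost(pattern, 16, 100, 100, 20))
# Very short clauses: the re-pack overhead dominates
print(predicated_cost(pattern, 16, 5, 5), repacked_cost(pattern, 16, 5, 5, 20))
```

Under these assumed costs the crossover sits where the clause length is comparable to the re-pack overhead, which matches the intuition that short clauses favour predication and long or nested ones favour re-packing.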

Jawed


Tags
graphics, intel
