Old 17-Sep-2010, 11:25   #751
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by trinibwoy View Post
I thought the grief was from maintaining triangle order in a distributed environment? That challenge remains whether you're doing rasterization in fixed-function units or software.
Precisely my point. Intel's software rasterisation obviates the triangle ordering problem with its tiled approach. The struggle NVidia had is analogous to the struggle Intel had in distributing work across the cores.

I think the problem lies elsewhere. e.g. $2 billion yearly TAM, say, for performance/enthusiast discrete just isn't worth chasing in comparison with server/cloud/HPC.

Also life's simpler for Intel if it doesn't have to write drivers for D3D. There was always the question hanging over the architecture of how long it would take Intel to get a game's performance right, with worrying statements that months after game release would be required. (AMD doesn't seem to have much of a different attitude, though.)
__________________
Can it play WoW?
Old 17-Sep-2010, 13:02   #752
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by Jawed View Post
Precisely my point. Intel's software rasterisation obviates the triangle ordering problem with its tiled approach.
Off hand, I can't see how it obviates the need. You just moved the serialization point from rasterization to spatial binning. Scaling spatial binning across cores while maintaining triangle order isn't exactly easy.

Quote:
I think the problem lies elsewhere. e.g. $2 billion yearly TAM, say, for performance/enthusiast discrete just isn't worth chasing in comparison with server/cloud/HPC.
And by the looks of it, they have got it almost right for SNB.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 17-Sep-2010, 14:17   #753
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by rpg.314 View Post
Off hand, I can't see how it obviates the need. You just moved the serialization point from rasterization to spatial binning. Scaling spatial binning across cores while maintaining triangle order isn't exactly easy.
The serialisation is actually per tile-pixel (or more granular, e.g. per tile quad), and local to a single core since tiles in rasterisation (stages post setup until back-end) don't span cores.
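
A minimal sketch of the tiled idea described above, assuming a hypothetical 64-pixel tile and bounding-box binning (neither detail comes from Intel's materials). Each bin records triangles in submission order, so whichever core picks up a tile can respect API order without coordinating with other cores. The binning loop here is single-threaded; it does not address how binning itself would be scaled across cores.

```python
from collections import defaultdict

TILE = 64  # hypothetical tile edge in pixels


def bin_triangles(triangles, tile=TILE):
    """Append each triangle's submission index to every tile its bounding box touches."""
    bins = defaultdict(list)
    for order, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        tx0, tx1 = int(min(xs)) // tile, int(max(xs)) // tile
        ty0, ty1 = int(min(ys)) // tile, int(max(ys)) // tile
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                # Appending in submission order keeps each bin ordered.
                bins[(tx, ty)].append(order)
    return dict(bins)


tris = [
    [(10, 10), (50, 10), (10, 50)],  # inside tile (0, 0)
    [(60, 60), (70, 60), (60, 70)],  # bounding box straddles four tiles
    [(5, 5), (20, 5), (5, 20)],      # tile (0, 0) again, submitted last
]
bins = bin_triangles(tris)
```

With this input, the bin for tile (0, 0) holds all three triangles in the order they were submitted, while the other three tiles only see the second triangle.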
__________________
Can it play WoW?
Old 17-Sep-2010, 15:09   #754
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

Is that assuming the implementation placed the tessellation stage in the front-end and not the back?

It would have been an interesting exercise to see what numbers Larrabee could have pulled in Heaven, its applicability to current workloads aside and assuming that the software renderer had been functionally coded to DX11 spec.

This latest Intel statement is far more down on Larrabee graphics than I've seen thus far, and is a noticeable drop from a position I have already perceived as being rather lukewarm. I suppose Tim Sweeney will need to wait a little longer for his software rendering dream to come true.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 17-Sep-2010, 15:41   #755
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by 3dilettante View Post
Is that assuming the implementation placed the tessellation stage in the front-end and not the back?
I don't understand what you're suggesting.

Tessellation was very much an open question; I don't remember any of Intel's materials covering it.
__________________
Can it play WoW?
Old 17-Sep-2010, 16:05   #756
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

The option existed to run the tessellation stages either in the front-end or back-end.
Whether there was ever an implementation of it for Larrabee is something I do not know, but Intel did discuss the possibility.

If a primitive is allocated to a bin and the back-end is responsible for performing tessellation, the generated triangles on one core could cross the bin's tile boundaries.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 17-Sep-2010, 16:17   #757
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877

Logarithmic shadow maps: now even more of a pipe dream! Oh well.

Anyway... if you look at communications, the vast vast majority of software-centric architectures still do problematic algorithms like Turbo Coding and Viterbi in hardware blocks. But there are exceptions that do those very efficiently in software - the trick is their architecture is incredibly unusual and very different from a traditional processor, even though it could afaict rightfully be called Turing Complete (as long as you look at a large enough piece of it rather than just a subsystem).

The basic problem with graphics is that the number of blocks that would benefit from such exotic architectures is actually very small, and their data flow is very complex (rasterisation being the poster child). And going down that route would create a lot of complexity at the compiler for more normal shading workloads, so overall it just doesn't make any sense and the best approach remains fixed-function.

The one thing Larrabee did provide above and beyond any current desktop GPU architecture is scalar/MIMD, and interestingly on-core rather than as a separate on-chip block. I'm honestly unsure whether there is much benefit to on-core SIMD+MIMD in either graphics or GPGPU compared to separate SIMD and MIMD cores, but a frequent problem of the latter in 80s/90s architectures is the lack of bandwidth between the scalar and the vector part. With the power consumption of data communication even on-chip increasing to dramatic levels, there might be something to be said for on-core integration of the two not (just?) from a software level but from a hardware level. Some sort of close coupling at least would make sense.

Of course, ideally we'd all go pure MIMD. Rys, can I haz Series6? (and please don't break my heart and tell me it's SIMD now)
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 17-Sep-2010, 16:27   #758
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by 3dilettante View Post
The option existed to run the tessellation stages either in the front-end or back-end.
Can't remember seeing that

Quote:
If a primitive is allocated to a bin and the back-end is responsible for performing tessellation, the generated triangles on one core could cross the bin's tile boundaries.
Tessellation consumes patches. I don't think patches would be screen-space binned.

The patches should be able to run in parallel through VS/HS to generate input to TS and DS. Ordering of triangles coming out of DS should be keyed by Patch ID, I presume (TS generating sub-patch triangle ID).

Is there a serialisation I'm missing?

Anyway, screen-space tiling for binning of triangles involved in tessellation (input or output) would be done post-GS.
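
A minimal sketch of the ordering idea above, assuming each output triangle is tagged with a (Patch ID, sub-patch triangle ID) pair. The `tessellate` stand-in, the tessellation factors and the completion order are all hypothetical; the point is only that a lexicographic sort on that pair recovers a deterministic order regardless of which patch finished first.

```python
def tessellate(patch_id, factor):
    """Stand-in for TS/DS: emit `factor` triangles tagged (patch_id, tri_id)."""
    return [(patch_id, tri_id) for tri_id in range(factor)]


# Patches run in parallel, so they may complete in any order.
completion_order = [2, 0, 1]          # hypothetical
factors = {0: 2, 1: 1, 2: 3}          # hypothetical triangles per patch

emitted = [tri for p in completion_order for tri in tessellate(p, factors[p])]

# Tuples sort lexicographically: first by Patch ID, then by sub-patch triangle ID.
ordered = sorted(emitted)
```

Here `emitted` starts with patch 2's triangles, while `ordered` starts with patch 0's, which is the order the patches were submitted in.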
__________________
Can it play WoW?
Old 17-Sep-2010, 16:55   #759
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

Quote:
Originally Posted by Jawed View Post
Can't remember seeing that
Tom Forsyth's SIGGRAPH 2008 presentation touted the flexibility in assigning stages either to the front or back-end.
Included in that set are GS and tessellation.

Quote:
Tessellation consumes patches. I don't think patches would be screen-space binned.
Wouldn't this mean it occurs in the front end? If it's not in a bin, the back end would not be able to grab it.

Quote:
The patches should be able to run in parallel through VS/HS to generate input to TS and DS. Ordering of triangles coming out of DS should be keyed by Patch ID, I presume (TS generating sub-patch triangle ID).

Is there a serialisation I'm missing?
VS is listed as a front-end capability. It was not clear to me that VS is one of the stages that could be put in either front or back.

Quote:
Anyway, screen-space tiling for binning of triangles involved in tessellation (input or output) would be done post-GS.
GS could be either front or back as well.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 17-Sep-2010, 17:22   #760
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by 3dilettante View Post
Tom Forsyth's SIGGRAPH 2008 presentation touted the flexibility in assigning stages either to the front or back-end.
Included in that set are GS and tessellation.
Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage). In my interpretation, DS would be split between front-end and back-end:
  • Front-end DS would generate screen-space coordinates for the purposes of binning.
  • Back-end DS would generate all the other attributes of each vertex.
The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-scheduling).
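
A rough sketch of the split-DS interpretation above, under several assumptions that are not from any Intel material: the position function, the attribute set, the 64-pixel tile, and binning per vertex rather than per triangle are all hypothetical simplifications. The front-end evaluates only the position and stores compact domain coordinates in the bin; the back-end evaluates the remaining attributes when the bin is picked up.

```python
TILE = 64  # hypothetical tile edge in pixels


def ds_position(u, v):
    """Front-end half: only the screen-space position needed for binning."""
    return (u * 128.0, v * 128.0)  # hypothetical domain-to-pixel mapping


def ds_attributes(u, v):
    """Back-end half: the remaining per-vertex attributes."""
    return {"uv": (u, v), "normal": (0.0, 0.0, 1.0)}


def front_end(domain_points):
    """Bin by position, but store only the domain coordinates in each bin."""
    bins = {}
    for u, v in domain_points:
        x, y = ds_position(u, v)
        bins.setdefault((int(x) // TILE, int(y) // TILE), []).append((u, v))
    return bins


def back_end(bin_contents):
    """Finish domain shading for one bin after it has been picked up."""
    return [{"pos": ds_position(u, v), **ds_attributes(u, v)} for u, v in bin_contents]


bins = front_end([(0.1, 0.1), (0.9, 0.1), (0.2, 0.3)])
shaded = back_end(bins[(0, 0)])
```

Only two floats per vertex travel through the bin in this sketch, which is one way to picture the reduced global-memory storage; the cost is that the position is evaluated twice.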

Quote:
Wouldn't this mean it occurs in the front end? If it's not in a bin, the back end would not be able to grab it.
Precisely. But tessellation doesn't have to be complete for binning to start (position attribute of each vertex is mandatory for binning). Which leads me to suggest what I posted above.

Quote:
GS could be either front or back as well.
GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think. i.e. run GS across lots of cores as they do binning, rather than on a few cores while creating bins.

Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?

---

By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually descriptive of lack of "setup->rasterisation->pixel shading->output merger". To be honest I think this is very likely the correct interpretation.

I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but process would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute, and the higher that rises the more competitive Intel becomes.
__________________
Can it play WoW?
Old 17-Sep-2010, 18:29   #761
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

Quote:
Originally Posted by Jawed View Post
Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage).
That seems reasonable. The description appears to be at a somewhat higher level, where the particular implementation details of the DX11 pipeline would not be material.

Quote:
The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-scheduling).
The reduced amount of data passed between the front and back ends would lead to bandwidth savings. Each core could potentially try to read from the same set of attributes, but this should be forwarded as needed within the cache hierarchy relatively quickly, and there is hopefully no write traffic to those locations in this phase.

Quote:
GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think. i.e. run GS across lots of cores as they do binning, rather than on a few cores while creating bins.

Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?
There may be ordering and atomicity constraints for GS. Perhaps if the scheduler can determine that there is no interaction between invocations, they can be allowed to persist over the non-deterministic delay between binning and bin pickup.

Quote:
By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually descriptive of lack of "setup->rasterisation->pixel shading->output merger". To be honest I think this is very likely the correct interpretation.
This appears possible. Intel claimed earlier that the rasterizer part of the pipeline wasn't the hard part.

Quote:
I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but process would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute, and the higher that rises the more competitive Intel becomes.
Larrabee may have been near the top end of what is possible for a PCIe graphics accelerator, at about 2/3 of the die. The rest of the die had IO, controllers, UVD, texture blocks, and miscellaneous logic.
A good amount of the uncore would need to scale as well, otherwise the compute portion would be strangled.
x86 penalty aside, the decision to use full cores for that 2/3 of the die was also a contributing factor to the size and power concerns.
There can be programmable processing units either way, but past a certain number of fully-fledged CPU cores the utility of having even more would have been reduced. There was a lot of front-end and support silicon for the amount of vector resources one got per core.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 19-Sep-2010, 07:23   #762
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,762

Quote:
Originally Posted by Arun View Post
Of course, ideally we'd all go pure MIMD. Rys, can I haz Series6? (and please don't break my heart and tell me it's SIMD now)
OT: I have the feeling you'll be tracking that pipe dream from generation to generation. Personally I'd prefer anything to have a "HE-IMD" (HE= highest efficiency) and you might as well kill the D at the end and you're "home" LOL.

Stupid acronyms aside, he won't be telling you but my old sniffing nose tells me that one of the aces of S6 has gotten Intel to sit up when they first saw it. Wrong thread, bad timing and I urgently need some coffee
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 23-Sep-2010, 22:12   #763
Harison
Member
 
Join Date: Mar 2010
Posts: 195

Quote:
Originally Posted by liolio View Post
I hope Charlie will have the chance to re-encode the vids, I've a tough time understanding their talk
There is a full transcript now:

http://www.semiaccurate.com/forums/s...ead.php?t=3361

On topic, it's a bit sad Intel pushed Larrabee back indefinitely; it would be nice to have a 3rd player (or even 2nd, NV's future is a bit unclear atm in the mass GPU business), even though in the beginning they would be behind the competition. But since GPUs are more and more programmable, IMO it's just a matter of time till they release it, even if it will be Larrabee's 10th incarnation.

P.S. I can almost see Tim Sweeney crying somewhere
Old 25-Sep-2010, 18:22   #764
Kaotik
yes, i'm drunk
 
Join Date: Apr 2003
Posts: 4,801

Apparently Knights Ferry, despite being marketed a bit differently, is the first iteration of Larrabee as it is, so there might be a chance for future development to enter gfx markets too?
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Old 25-Sep-2010, 19:35   #765
Harison
Member
 
Join Date: Mar 2010
Posts: 195

Quote:
Originally Posted by Kaotik View Post
Apparently Knights Ferry, despite being marketed a bit differently, is the first iteration of Larrabee as it is, so there might be a chance for future development to enter gfx markets too?
It should be, and maybe it won't take that long, because after servers they are implementing this tech in laptops too, after that it should be specialized desktops, and after that released drivers for game rendering too, in the beginning probably Fusion-like. The technology is already there; the question is how long the driver team will take to make the chips usable for games. Even with the initial cancellation of Larrabee, I have no doubt Intel keeps working in this direction.

http://www.pcworld.com/businesscente..._32_cores.html

Quote:
The company will merge the CPU cores and vector units into a single unit as chip development continues, Skaugen said.


The Knights architecture is the biggest server architecture shift since Intel launched Xeon chips, Skaugen said. The chip includes elements of the Larrabee chip, which was characterized as a highly parallel, multicore x86 processor designed for graphics and high-performance computing. However, Intel last week said it had cancelled Larrabee for the short term, but said elements of the chip would first be used in server processors, and later in laptops.


The new architecture could also fend off competition from Nvidia's Tesla and Advanced Micro Devices' FireStream graphics processors, which pack hundreds of computing cores to boost application performance. The graphics processors are faster at executing certain specialized applications. The second fastest supercomputer in the world, Nebulae in China, combines CPUs with GPUs to boost application performance.
http://www.intel.com/pressroom/archi...100531comp.htm
Quote:
"The CERN openlab team was able to migrate a complex C++ parallel benchmark to the Intel MIC software development platform in just a few days," said Sverre Jarp, CTO of CERN openlab. "The familiar hardware programming model allowed us to get the software running much faster than expected."
Old 10-Mar-2011, 18:26   #766
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Didn't feel like making a new thread for this

Quote:
Oregon Computer Architecture Team Award, Intel Corp, 2009, for successfully driving the ISA/architecture and implementation of LRB'3 general-purpose scatter/gather instructions, the first clean scatter/gather architecture for Intel.
http://sites.google.com/site/ykchen/honors

So cleaner than LRB1. What might be considered ugly about LRB1's scatter/gather? It seemed very nice to me. O(1) time gather for any datum in L1, unlike LRB1 which had O(n)
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 10-Mar-2011, 18:34   #767
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

I'm not sure how to interpret "LRB'3".
LRB's or LRB3?

edit:
"Clean" may mean done in hardware without microcode or software with hardware assist.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 10-Mar-2011, 18:37   #768
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by 3dilettante View Post
I'm not sure how to interpret "LRB'3".
LRB's or LRB3?

edit:
"Clean" may mean done in hardware without microcode or software with hardware assist.
IIRC, Charlie has often referred to the second coming of LRB as LRB3.0. Apparently LRB1.0 was uncovered and canned and LRB2.0 went with it. The third version was supposed to be the game changer. We'll see.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 10-Mar-2011, 19:04   #769
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071

The award was for 2009, while the cancellation of the first LRB took place in 2010.
I find the time frame interesting, since it would imply significant progress in implementing the design by the time the shuffle was reported.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 10-Mar-2011, 19:24   #770
glw
Junior Member
 
Join Date: Aug 2003
Posts: 64

That is one of the authors of a paper I linked to back in 2008.
http://forum.beyond3d.com/showpost.p...2&postcount=31

"Atomic Vector Operations on Chip Multiprocessors"
http://doi.acm.org/10.1145/1394608.1382154
Old 10-Mar-2011, 21:39   #771
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221

Quote:
Originally Posted by rpg.314 View Post
O(1) time gather for any datum in L1, unlike LRB1 which had O(n)
Generally impossible without a generic n-ported cache, but you might be able to implement an n-banked cache (like local data in GPUs) cheaply to get it down to O(log n) for the general case of cache hits (or even better with some buffering of accesses) and O(1) for any access which hits all the banks. Being able to service cache accesses that fast can interact in annoying ways with the coherency though (up to n times the traffic).
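
A toy model of the banked-cache argument above, assuming word-interleaved banks that each serve one distinct word per cycle and that lanes reading the same word are merged for free. Both are simplifying assumptions, and the function name and parameters are hypothetical. Under that model the gather takes as many cycles as the most heavily loaded bank.

```python
from collections import Counter


def gather_cycles(addresses, n_banks, word_bytes=4):
    """Cycles to service one gather when each bank serves one distinct word per cycle."""
    # Lanes that read the same word are merged (assumed broadcast).
    words = {addr // word_bytes for addr in addresses}
    per_bank = Counter(word % n_banks for word in words)
    # The busiest bank determines the total number of cycles.
    return max(per_bank.values(), default=0)


n = 16
unit_stride = [4 * i for i in range(n)]      # one word in each bank
same_bank = [4 * n * i for i in range(n)]    # every word maps to bank 0

best = gather_cycles(unit_stride, n_banks=n)
worst = gather_cycles(same_bank, n_banks=n)
```

The unit-stride pattern touches every bank once and completes in a single cycle, while the stride-of-n pattern lands every lane in bank 0 and degrades to n cycles; typical access patterns fall somewhere in between.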
__________________
Cinematic is the new streamlined.
