Either I'm totally delusional, or this is the inspiration for Durango's design.

dobwal

So I'm on the MS Research site looking for relevant information that may apply to Durango and I run across a paper from the late 70s on a CPU architecture. Normally I would have looked past it because it's old as dirt. But it's the name of the CPU that stands out and piques my interest.

The CPU is called Dorado. It's one of the successors to the Alto, a computer designed by Xerox; I'm sure a lot of you know the background of the Alto and Xerox during the early and late 70s.

Here's some background on Dorado.

http://mirrorservice.org/sites/www....rado_A_High-Performance_Personal_Computer.pdf

The next thing that piques my interest is its design.

The Dorado provides the hardware base for the current generation of system research within Xerox PARC. It supports a variety of high-level language environments and high-bandwidth devices...The microarchitecture allows all the device controllers to share the full power of the processor, rather than having independent access to the memory. As a result controllers can be small and yet the I/O interface provided to programs can be powerful. This concept of processor sharing is fundamental to the Dorado...Device controllers are first-class citizens, serviced on demand from the processor via the virtual memory system.

As I'm reading, I am reminded of a vgleaks diagram

http://www.vgleaks.com/wp-content/uploads/2013/02/memory_system.jpg

The diagram shows the GPU is fed with vertices and commands over the I/O fabric. I wonder why commands from the CPU would be fed in this manner instead of directly through the memory controller. When reading up on Dorado, this really stuck out.

The Dorado is optimized for the execution of languages that are compiled into a stream of byte codes; this execution is called emulation... An instruction fetch unit (IFU) in the Dorado fetches bytes from such a stream, decodes them as instructions and operands, and provides the necessary control and data information to the processor...When a device acquires the processor (that is, the processor is running at the requested priority level and executing the microcode for that task), the device will presumably receive service from its microcode. Eventually the microcode will block, thus relinquishing the processor to lower priority tasks until it next requires service.

I don't presume that the whole Durango design works in such a fashion, but the data streaming from Kinect may use this scheme, as the diagram seems to allow such a setup.

Dorado and Durango both feature low-latency RAM, and in the Dorado paper it is the memory, not the actual processor, that is considered the heart of the system.

Dorado's two senior architects were Chuck Thacker and Butler Lampson, both of whom now work for Microsoft Research; the Cambridge lab contributed a lot of research that went into Kinect.

I'm betting that I am probably wrong about the relationship between Durango and Dorado, but hey, it's all in fun. The paper is really interesting in and of itself due to Xerox's level of advancement during this time.
 
The names being similar is coincidental (or so I wager anyway), and a system design from the 70s will not be applicable to modern computing designs.

Also, you can be sure that the GPU reads command lists via DMA from main RAM rather than being hand-fed from anywhere else. Doing it any other way would be inefficient; large burst reads and writes are fast in modern computers, puttering around with small bits and pieces is slow. Also, tying up a CPU core just to feed the GPU would be even more inefficient, if it's even at all possible to directly write from the CPU into the GPU's registers...
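To illustrate what I mean, here's a rough Python sketch of the usual ring-buffer arrangement; the class, packet format, and sizes are made up for illustration, it's just the general idea, not Durango's actual command processor:

```python
# Illustrative only: the CPU appends command packets to a ring buffer in main
# RAM and just bumps a write pointer; the GPU's command processor fetches
# packets in large bursts on its own (DMA), so no CPU core sits there
# hand-feeding GPU registers.

class CommandRing:
    def __init__(self, size):
        self.buffer = [None] * size   # stands in for a region of main RAM
        self.size = size
        self.write_ptr = 0            # advanced by the CPU
        self.read_ptr = 0             # advanced by the GPU's fetch engine

    def cpu_submit(self, packet):
        """CPU side: write one packet and advance the write pointer."""
        next_ptr = (self.write_ptr + 1) % self.size
        if next_ptr == self.read_ptr:
            raise RuntimeError("ring full; the CPU would wait here")
        self.buffer[self.write_ptr] = packet
        self.write_ptr = next_ptr

    def gpu_fetch_burst(self, max_packets=8):
        """GPU side: drain up to max_packets in one burst read."""
        batch = []
        while self.read_ptr != self.write_ptr and len(batch) < max_packets:
            batch.append(self.buffer[self.read_ptr])
            self.read_ptr = (self.read_ptr + 1) % self.size
        return batch

ring = CommandRing(16)
ring.cpu_submit(("DRAW", 3000))       # hypothetical packets
ring.cpu_submit(("DISPATCH", 64))
print(ring.gpu_fetch_burst())         # the GPU pulls both in one burst
```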
 
The design for Durango comes from AMD. There's nothing there they haven't done before in other devices (eDRAM, SOC, APU).
 
The design for Durango comes from AMD. There's nothing there they haven't done before in other devices (eDRAM, SOC, APU).

The GPU core comes from AMD, but there is nothing that says the high-speed I/O, the DMEs, or the eSRAM is strictly an AMD contribution. In fact, there is plenty of evidence that MS is the leading contributor behind those components' inclusion in Durango.

MS is investing in hardware design. Chuck Thacker was chief architect for the Alto and Dorado at Xerox and has been spearheading MS's push into hardware design. He set up the Cambridge unit, which contributed heavily to Kinect. He's now in Silicon Valley leading MS's hardware research arm. He is one of MS's leaders on their BEE3 project using FPGAs.

http://research.microsoft.com/apps/pubs/default.aspx?id=80369

Thacker also helped found DEC's Systems Research Center; DEC produced the Alpha processor, and it also produced a multimedia video adapter for the Alpha workstation.

http://research.microsoft.com/en-us/um/people/bahl/papers/dtj-a/j300_dtj.pdf
Jvideo was a three-board, single-slot TURBO channel adapter capable of supporting JPEG compression and decompression, video scaling, video rendering, and audio compression and decompression—all at realtime rates. Two JPEG codec chips provided simultaneous compression and decompression of video streams. A custom application-specific integrated circuit (ASIC) incorporated the bus interface with a direct memory access (DMA) controller, filtering, scaling, and Digital’s proprietary video rendering logic. Jvideo’s software consisted of a device driver, an audio/video library, and applications.

It's very similar to Durango's setup, with JPEG decompression facilitated by DMA over an I/O bus.

Microsoft has its own hardware patents that would apply to Durango's i/o path.

High speed i/o data system
http://appft1.uspto.gov/netacgi/nph...high+speed"&RS=(AN/Microsoft+AND+"high+speed")
In embodiments, the device may be implemented as any one or combination of a fixed or mobile device, in any form of a consumer, computer, portable, user, communication, phone, navigation, television, appliance, gaming, media playback, and/or electronic device. The device may also be associated with a user (i.e., a person) and/or an entity that operates the device such that a device describes logical devices that include users, software, firmware, hardware, and/or a combination of devices.

MS has been using FPGAs through its BEE3 project for hardware design.

The DMEs' compression logic:
http://appft1.uspto.gov/netacgi/nph...Microsoft+AND+"field+programmable+gate+array")
A bit-map based data compression/decompression method for the architecture may be implemented to increase memory capacity and bandwidth available in the accelerator system. Training data may be compressed by conventional compression software and stored in the memories associated with the acceleration device. The FPGA may then read and decompress the data before performing computations. Implementing compression and decompression techniques with the FPGA may increase the virtual bandwidth from a DDR to a PE by 2-4 times the virtual bandwidth for uncompressed data.
Possible inclusion of microcoded instructions into Kinect data and other data that goes over the high-bandwidth I/O path.

PRE-PROCESSING OF IMAGE DATA FOR ENHANCED COMPRESSION
http://appft1.uspto.gov/netacgi/nph.../Microsoft+and+jpeg&RS=(AN/Microsoft+AND+jpeg)
[0003] One problem with current implementations of bulk compressors is that they are limited in the amount of compression that they can perform. A frequent restriction for bulk compression in a remote presentation session is a restriction on the amount of available time with which to perform the compression. In a remote presentation session, it is generally desirable to reduce the amount of time between when a user at the client provides input and when that user is displayed graphical output corresponding to that input being performed. Given this restriction on time, it is generally advantageous for a remote presentation session bulk compressor to compress data well while still performing that compression in a limited amount of time.

[0004] The present invention offers improved data compression. In embodiments of the present invention, the amount of compression performed under the constraints of available processing resources and/or time is improved. In embodiments, data to be compressed is evaluated and portions thereof are classified with "hints," or techniques for compressing that portion of data--meta-data generated from the data, or by the process which assembled the data, that describes a characteristic about the compressibility of the source data. For example, a given input data may be classified in three separate portions, such that one portion is to be compressed normally, one portion is to be literally copied or transferred to an output stream rather than being compressed (or compressed further), and one portion is recognized as a match of another portion, and is to be encoded in the output stream as a reference to the first match.

[0011] FIG. 4 depicts an example architecture that combines a hint generator with the data compressor of FIG. 3.

Hint generator 400 may determine a plurality of hints for the data--a hint may cover only a portion of the data. For instance, hint generator 400 may determine, based on the contents of a first portion of the data, to compress the first portion of the data with a first technique, and also determine, based on the contents of a second portion of the data, to encode the second portion of the data with a second technique. These first and second portions may then be encoded with the first technique and the second technique, respectively.

Hint generator 400 may also determine that a portion of the data should not be compressed by compressor 350 and store a hint that indicates this. Hint generator 400 may determine that a portion of the data should not be compressed by compressor 350 such as where the data has already been compressed, or it is determined that an amount of computing resources required to compress the portion of the data outweighs an associated compression benefit. The portion of the data may have already been compressed such as where the portion of the data is image data in a compressed format, such as JPEG.
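To illustrate the idea in the patent text above, here's a rough sketch of how a hint generator might classify portions of input; the chunk size, the JPEG check, and the back-reference handling are my own illustrative guesses, not the patent's actual implementation:

```python
# Illustrative sketch of hint-based compression: each portion of the input is
# tagged with how it should be handled - compress it, copy it through untouched
# (e.g. already-JPEG data), or emit a reference to an identical earlier portion.

import zlib

CHUNK = 4096  # illustrative portion size

def generate_hints(data: bytes):
    seen = {}                       # portion contents -> offset of first copy
    hints = []
    for i in range(0, len(data), CHUNK):
        portion = data[i:i + CHUNK]
        if portion.startswith(b"\xff\xd8\xff"):        # JPEG SOI marker
            hints.append(("copy", i))                   # already compressed
        elif portion in seen:
            hints.append(("match", i, seen[portion]))   # reference earlier data
        else:
            seen[portion] = i
            hints.append(("compress", i))
    return hints

def encode(data: bytes):
    out = []
    for hint in generate_hints(data):
        kind, offset = hint[0], hint[1]
        portion = data[offset:offset + CHUNK]
        if kind == "copy":
            out.append((kind, portion))                 # pass through untouched
        elif kind == "match":
            out.append((kind, hint[2]))                 # just the back-reference
        else:
            out.append((kind, zlib.compress(portion)))  # compress normally
    return out
```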

MS has been using FPGAs to facilitate Kinect's incorporation into Xbox hardware.

http://research.microsoft.com/apps/video/default.aspx?id=157648
http://research.microsoft.com/apps/pubs/default.aspx?id=170804

This may help explain why Kinect is so data-intensive:

http://research.microsoft.com/pubs/155378/ismar2011.pdf

Thacker may very well have dusted off an old design and used it as a basis for Durango's I/O design. You don't find many architectures similar to Dorado that have a high-speed I/O bus with 2X the bandwidth, 8X less latency, and a data path 16X as wide as the path that the processor and everything other than the display interface sits on. A wide and fast path to the GPU will probably serve Kinect 2.0 very well. Thacker claimed that the 2002 Tablet PC he helped MS produce was influenced by his experience with the Dynabook, a tablet proposed in 1972 by Alan Kay while working at PARC.

Dorado and Durango share a 256-bit high-speed I/O bus. There are differences: the only things that interface with the high-speed I/O bus on Dorado are the memory interface, the display interface, and the interface with the slower bus. On Durango, the only things that interface with the high-speed I/O bus are the memory interface, the display interface, and the GPU.

And I doubt AMD would find a relationship tenable where MS commissions a design from AMD and then gets to patent a bunch of inventions created by AMD.
 
I think you're starting to sound a bit like jeff riby, misterxmedia....

There's nothing to indicate that someone at MS thought "Ah, let's resurrect the Dorado arch from 30 years ago and use it for the next Xbox".
It seems perfectly plausible that the system design was thought up by MS and AMD engineers over the past 3 years, and since that is the simplest explanation (and leaves nothing unexplained) it's likely to be correct.
 
The GPU core comes from AMD, but there is nothing that says the high-speed I/O, the DMEs, or the eSRAM is strictly an AMD contribution.
I never said it was. But there's nothing about the design that AMD can't do, so the architecture is AMD's even if some non-AMD employee made the design for a specific bus or sub-processor. It's certainly not a non-AMD design that AMD were told to make. It's, again, x86 (Jaguar) cores coupled to GCN CUs. Believing anything else (a brand new CPU design) is going completely against the flow of evidence at this point.
 
Wow that's some interesting patent investigation. I'm not sure all the connections you are making are really there, but this is interesting:

A bit-map based data compression/decompression method for the architecture may be implemented to increase memory capacity and bandwidth available in the accelerator system. Training data may be compressed by conventional compression software and stored in the memories associated with the acceleration device. The FPGA may then read and decompress the data before performing computations. Implementing compression and decompression techniques with the FPGA may increase the virtual bandwidth from a DDR to a PE by 2-4 times the virtual bandwidth for uncompressed data.

If you can effectively create 2 - 4 times the bandwidth with compression, that is a serious big win on two fronts if it can be made to be automatic.
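Just as a back-of-the-envelope illustration of what that would mean (the link bandwidth and ratio below are made-up numbers, not confirmed Durango figures):

```python
# If data sits in memory compressed and is decompressed after the transfer,
# the same physical link moves more useful bytes per second.
physical_bw_gbs = 68.0       # hypothetical DDR3 link, GB/s
compression_ratio = 3.0      # 3:1, inside the patent's quoted 2-4x range

effective_bw_gbs = physical_bw_gbs * compression_ratio
print(f"{effective_bw_gbs:.0f} GB/s of uncompressed data delivered")  # 204 GB/s
```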
 
If you can effectively create 2 - 4 times the bandwidth with compression, that is a serious big win on two fronts if it can be made to be automatic.
Compressed bitmaps have been used for ages. DXTC, anyone? Furthermore, data compression only relieves bandwidth. It doesn't change the architecture from PC so even if MS are doing something new, Durango's still fundamentally a PC in operation.
 
Compression is a multi-edged sword; if you compress a chunk of data and then only need a section of it, is the bandwidth you save going to be worthwhile when you essentially need to fetch and decompress more data than you really need, especially when spanning block boundaries? Will you be able to save much at all, or will you have to throw away detail to keep the compression ratio up with small block sizes?

Also, not all data compresses well, and (de)compression hardware introduces additional latency. Furthermore, one big benefit of compression, taking up less storage space, can't really be realized because different data compresses to different sizes. Example: you allocate room for a piece of data and compress it to 50% of the original. What do you do with the remainder, let it go completely to waste, or free it up again and put something else there? OK, so you do that and then change your data and recompress it; now suddenly it only compresses to 75% and won't fit in the memory space you've got... Clearly a bad idea.
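A trivial sketch of that storage problem, with made-up sizes:

```python
# A block is compressed once, the freed space is handed out elsewhere, and a
# later rewrite of the same block compresses less well and no longer fits.
slot_size = 4096                   # space kept after the first compression
first_pass = int(8192 * 0.50)      # 8 KiB block compressed to 50% -> 4096 bytes
assert first_pass <= slot_size     # fits; the other 4 KiB was reused

second_pass = int(8192 * 0.75)     # data changed, now only compresses to 75%
print(second_pass <= slot_size)    # False: the recompressed block won't fit
```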
 
I never said it was. But there's nothing about the design that AMD can't do, so the architecture is AMD's even if some non-AMD employee made the design for a specific bus or sub-processor. It's certainly not a non-AMD design that AMD were told to make. It's, again, x86 (Jaguar) cores coupled to GCN CUs. Believing anything else (a brand new CPU design) is going completely against the flow of evidence at this point.

What are you talking about? I'm not trying to contradict the general consensus of what constitutes Durango's design. Have you even bothered to read the post? Durango is an MS designation, and that designation may have come from the influence of the memory and I/O setup. It's practically the only thing that's been leaked that differentiates Durango from a standard AMD APU design.

It's not the CPU in the Dorado design I am referencing, it's the I/O-memory configuration, and that Durango may share a similar design philosophy. We've come to accept that Durango's DMEs, I/O bus, and eSRAM are all there simply to help alleviate the inclusion of DDR3, which was necessitated by the BOM cost of Kinect and the amount of RAM chosen by MS. Have any of us thought that it's the other way around? That the inclusion of the eSRAM, I/O bus, and DMEs forced MS to use DDR3?

There are several MS patents that reference LZ and JPEG compression together, and their use is not limited to saving bandwidth or expanding memory. Their use has also been mentioned in the servicing of peripherals. MS has released quite a few patents on I/O design and servicing peripherals that cover what we have seen in Durango. Durango may be explicitly designed to robustly handle peripherals and other devices.

We are talking about Durango handling data from a bunch of devices such as controllers, microphones, Kinect, headset displays, and a main display all simultaneously. That type of setup may saturate a typical I/O design, which would lead to a laggy experience. Durango's design may help eliminate the limitations faced by Kinect and move it beyond its flaws and rather simplistic implementation as a game-console peripheral. MS may have sacrificed CUs not because of the cost of including Kinect. Rather, MS may have sacrificed GPU performance in an effort to ensure that Kinect works well, along with the other peripherals/devices/features that may be part of, or work with, the Durango ecosystem.
 
I didn't quite follow your argument because some of the hardware you mention and the technical stuff is somewhat out of my reach, but it is fun and it involves many things I like to know and learn about (that's why we are here on Beyond3D, after all).

I think you might have a point there, and it would be interesting to know the hardware design decisions behind the new consoles' specifications, if someone publishes a book on the matter, like they did when certain consoles were launched in the past.
 
Thank you dobwal. You raise interesting points. The primary point being that there were certain experiences that MS was focused on providing, and these considerations drove the technology more so than BOM.
 
So I'm on the MS Research site looking for relevant information that may apply to Durango and I run across a paper from the late 70s on a CPU architecture. Normally I would have looked past it because it's old as dirt. But it's the name of the CPU that stands out and piques my interest.

The CPU is called Dorado.
The name is the strongest similarity between the two architectures, unfortunately.

Dorado is a single core device with a processor that made provisions for running high-level languages through emulation. I don't see a strong resemblance in that CPU to the modern one.

The diagram shows the GPU is fed with vertices and commands over the I/O fabric. I wonder why commands from the CPU would be fed in this manner instead of directly through the memory controller. When reading up on Dorado, this really stuck out.
Memory can still be involved, the memory subsystem is an intermediate block. The primary memory path with the biggest numbers is the CU cache hierarchy, which is separate from the hooks used by the command processor. The bandwidth needs are comparatively modest, and miscellaneous hardware hangs off of a lower-bandwidth hub, which can perform some accesses to the L2. This is good enough for the hardware and the hub allows for quicker modification of the design without rebuilding the CU array and L2 interface, which is why the DME and controllers hang off of it.
 
The name is the strongest similarity between the two architectures, unfortunately.

Dorado is a single core device with a processor that made provisions for running high-level languages through emulation. I don't see a strong resemblance in that CPU to the modern one.


Memory can still be involved, the memory subsystem is an intermediate block. The primary memory path with the biggest numbers is the CU cache hierarchy, which is separate from the hooks used by the command processor. The bandwidth needs are comparatively modest, and miscellaneous hardware hangs off of a lower-bandwidth hub, which can perform some accesses to the L2. This is good enough for the hardware and the hub allows for quicker modification of the design without rebuilding the CU array and L2 interface, which is why the DME and controllers hang off of it.

It was never Dorado's CPU itself that stuck out; it was the design philosophy of the I/O and memory system.

Dorado was designed to service peripherals. It had high-bandwidth I/O capability, and exposing the processor to peripheral devices as intimately and seamlessly as possible was the main goal. Xerox didn't build computers to sell to general consumers; they were built for its own engineers as tools for other products. Dorado had a low-latency, high-throughput off-chip cache to service non-sequential references and emulation, while using main storage RAM to service sequential references and I/O data. Dorado sported a virtual memory system which allowed I/O devices to deal only in virtual addresses.

High-level-language emulation was considered the lowest-priority task on Dorado. Emulation would only run in the absence of other tasks.
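As a toy illustration of the scheme the paper describes (a paraphrase only, not the actual microarchitecture), it works something like this:

```python
# Toy model: each device has a microcode task with a fixed priority, a device
# "wakes" its task when it needs service, and the byte-code emulator is simply
# the lowest-priority task that runs whenever nothing else wants the processor.

class Task:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.wakeup = False          # set by the device when it wants service

def run_one_slot(tasks):
    """Give the processor to the highest-priority task that requests it."""
    ready = [t for t in tasks if t.wakeup]
    if ready:
        task = max(ready, key=lambda t: t.priority)
        task.wakeup = False          # task 'blocks' after servicing its device
    else:
        task = emulator              # nothing pending: run byte-code emulation
    print("running:", task.name)

emulator = Task("emulator", priority=0)
disk = Task("disk microcode", priority=8)
display = Task("display microcode", priority=12)

tasks = [disk, display]
display.wakeup = True
run_one_slot(tasks)                  # -> display microcode
run_one_slot(tasks)                  # -> emulator (no device pending)
```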
 
It was never Dorado's CPU itself that stuck out; it was the design philosophy of the I/O and memory system.

Dorado was designed to service peripherals. It had high-bandwidth I/O capability, and exposing the processor to peripheral devices as intimately and seamlessly as possible was the main goal.
Durango doesn't follow this methodology. Dorado made it so the outside controllers could negotiate control over the CPU in order to populate memory or work the cache. The offload engines and DMA-enabled IO in current systems do their utmost to operate without bothering the CPU.

Except for the display controller, the IO devices in Dorado had to go through the CPU to get to memory. This allowed for highly simplified IO hardware, since they would reserve the CPU and have it run the microcode for the device.

Dorado had a low-latency, high-throughput off-chip cache to service non-sequential references and emulation, while using main storage RAM to service sequential references and I/O data. Dorado sported a virtual memory system which allowed I/O devices to deal only in virtual addresses.
It had a cache. The off-chip nature of it was an artifact of the manufacturing capabilities at the time. It sports a mapping function for virtual memory, which is something it shares with Durango and the many other systems with an MMU. The more immediate inspiration for Durango would be the x86 and APU systems from more recent years.
 
I think you're starting to sound a bit like jeff riby, misterxmedia....

There's nothing to indicate that someone at MS thought "Ah, let's resurrect the Dorado arch from 30 years ago and use it for the next Xbox".
It seems perfectly plausible that the system design was thought up by MS and AMD engineers over the past 3 years, and since that is the simplest explanation (and leaves nothing unexplained) it's likely to be correct.

There's nothing to indicate that someone at Nintendo thought "Ah, let's resurrect the arch from 10 years ago and use it for the next Wii U"... :LOL: Sorry, this is off topic but I had to share.
 
Durango doesn't follow this methodology. Dorado made it so the outside controllers could negotiate control over the CPU in order to populate memory or work the cache. The offload engines and DMA-enabled IO in current systems do their utmost to operate without bothering the CPU.

Yes, Durango employs multiple DMA engines, but Durango reflects a more complicated and contemporary design, as it sports multiple processors and numerous memory pools. Dorado had a wholly separate instruction fetch unit that could prefetch and decode instructions and operands for the processor. It could move data to and from the cache. It didn't employ DMA engines, as it didn't need them. Cache hits were around 99%, with cache misses resolved within 9 cycles.

Except for the display controller, the IO devices in Dorado had to go through the CPU to get to memory. This allowed for highly simplified IO hardware, since they would reserve the CPU and have it run the microcode for the device.

Only slow (low-bandwidth) I/O devices needed to go through the CPU; fast I/O devices communicated directly with main system RAM.
 