N5 to be PowerPC based

The Blue Gene/L processor has a peak of 2.8 GFlop/s @ 700 MHz
using a 128-bit double-precision SIMD FPU, 4 flops per cycle for
DP FP-MADD by the looks of it. If both cores of the chip are
used (one is usually used for handling MPI) it peaks at 5.6 GFlop/s.
Obviously with SP it could hit 5.6 and 11.2 GFlop/s respectively;
whether such a mode is supported is not clear, though I expect
it is not.

Not bad for 700MHz. :)
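
Just to make the arithmetic explicit (assuming the double FPU really does retire one dual-pipe DP FMADD per cycle, i.e. 4 flops/cycle/core), a quick C sanity check:

#include <stdio.h>

/* Back-of-the-envelope peak for the quoted BG/L figures.
 * Assumption: 2 FMA pipes x 2 flops per FMA = 4 flops/cycle/core. */
int main(void)
{
    double clock_hz       = 700e6;  /* 700 MHz                        */
    double flops_per_core = 4.0;    /* DP FP-MADD across the SIMD FPU */
    double per_core = clock_hz * flops_per_core;  /* 2.8 GFlop/s      */
    double per_chip = 2.0 * per_core;             /* both cores: 5.6 GFlop/s */
    printf("per core: %.1f GFlop/s, per chip: %.1f GFlop/s\n",
           per_core / 1e9, per_chip / 1e9);
    return 0;
}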

Only the L2 and L3 caches are coherent, so some software
management of memory is required; the second core is generally
used for this, managing the scratchpad and allocated blocks of
the 4MB L3 eDRAM.
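
Purely to illustrate what that software management tends to look like in practice, here is a minimal sketch; the l1_flush_range / l1_invalidate_range calls are hypothetical placeholders, not real Blue Gene/L kernel functions:

#include <stddef.h>

/* Hypothetical hooks into the non-coherent L1; placeholders only. */
extern void l1_flush_range(const void *addr, size_t len);
extern void l1_invalidate_range(const void *addr, size_t len);

/* Core 0 publishes a buffer by pushing its dirty L1 lines out to the
 * coherent L2/L3 (eDRAM) where the other core can see them.          */
void publish_buffer(const void *buf, size_t len)
{
    l1_flush_range(buf, len);
}

/* Core 1 drops any stale L1 copy before reading the buffer. */
void acquire_buffer(const void *buf, size_t len)
{
    l1_invalidate_range(buf, len);
}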

Not sure about the process; at a guess I'd say 130nm, though
90nm is just about possible, if not that likely.

2 cores per chip.
2 chips per card.
16 cards per board.
32 boards per rack.
64 racks for the full system.
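
Multiplying that hierarchy out (and taking the 5.6 GFlop/s per chip above at face value):

#include <stdio.h>

/* Totals implied by the counts listed above. */
int main(void)
{
    int chips = 2 * 16 * 32 * 64;  /* chips/card x cards/board x boards/rack x racks */
    int cores = 2 * chips;         /* 2 cores per chip                               */
    double peak_tflops = chips * 5.6e9 / 1e12;  /* 5.6 GFlop/s per chip              */
    printf("%d chips, %d cores, ~%.0f TFlop/s peak\n", chips, cores, peak_tflops);
    return 0;
}

which comes out to 65,536 chips, 131,072 cores and roughly 367 TFlop/s for the full 64-rack system.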
 
Damn, I was right once again.

Stop gloating :LOL:

You can gloat all you want when Sony releases the spec for PS3; for the time being I can hope that Sony is actually trying to put 1+ TFLOPS into PS3.

From the looks of things MS and Nintendo are trying to do the same. MS will probably go with this Blue Gene offspring too. Exciting times next year, when they're going to release the specs.

Games are getting ridiculously bad nowadays, so the least they can do is give me good graphics and sound.
 
glw said:
The Blue Gene/L processor has a peak of 2.8 GFlop/s @ 700 MHz
using a 128-bit double-precision SIMD FPU, 4 flops per cycle for
DP FP-MADD by the looks of it. If both cores of the chip are
used (one is usually used for handling MPI) it peaks at 5.6 GFlop/s.
Obviously with SP it could hit 5.6 and 11.2 GFlop/s respectively;
whether such a mode is supported is not clear, though I expect
it is not.

Not bad for 700MHz. :)

Only the L2 and L3 caches are coherent, so some software
management of memory is required; the second core is generally
used for this, managing the scratchpad and allocated blocks of
the 4MB L3 eDRAM.

Not sure about the process; at a guess I'd say 130nm, though
90nm is just about possible, if not that likely.

2 cores per chip.
2 chips per card.
16 cards per board.
32 boards per rack.
64 racks for the full system.

Thanks for giving more details: so the L3 is eDRAM and the 5.6 GFLOPS figure is for each node.

If each node is doing 5.6 GFLOPS, their peak with 512 nodes would be 2.867 TFLOPS, so 2 TFLOPS must be a sort of benchmarked figure and not a theoretical peak.
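
For reference, the arithmetic behind that figure (counting both cores of every chip toward the peak):

512 nodes x 5.6 GFlop/s per node = 2,867.2 GFlop/s ≈ 2.87 TFlop/s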

I am glad you joined this thread and Beyond3D: you seem to have studied BlueGene/L quite a bit, and I am always happy to learn new things and be corrected if I make mistakes.

I have not studied BlueGene/L as much as you have ( I used to focus more on the /P variant [SMT version] ), but I am starting to catch up.

This will be interesting, coming from Suzuoki's patent, for those who compare BlueGene/L to STI CELL (btw, can the Worker PowerPC core access RAM using the DMAC, or does the other PowerPC core have to handle that and I/O for it all the time?):

http://makeashorterlink.com/?F27B42C86

[0088] As discussed above, all of the multiple APUs of a PE can independently access data in the shared DRAM. As a result, a first APU could be operating upon particular data in its local storage at a time during which a second APU requests these data. If the data were provided to the second APU at that time from the shared DRAM, the data could be invalid because of the first APU's ongoing processing which could change the data's value. If the second processor received the data from the shared DRAM at that time, therefore, the second processor could generate an erroneous result. For example, the data could be a specific value for a global variable. If the first processor changed that value during its processing, the second processor would receive an outdated value. A scheme is necessary, therefore, to synchronize the APUs' reading and writing of data from and to memory locations within the shared DRAM. This scheme must prevent the reading of data from a memory location upon which another APU currently is operating in its local storage and, therefore, which are not current, and the writing of data into a memory location storing current data.

[0089] To overcome these problems, for each addressable memory location of the DRAM, an additional segment of memory is allocated in the DRAM for storing status information relating to the data stored in the memory location. This status information includes a full/empty (F/E) bit, the identification of an APU (APU ID) requesting data from the memory location and the address of the APU's local storage (LS address) to which the requested data should be read. An addressable memory location of the DRAM can be of any size. In a preferred embodiment, this size is 1024 bits.

[0090] The setting of the F/E bit to 1 indicates that the data stored in the associated memory location are current. The setting of the F/E bit to 0, on the other hand, indicates that the data stored in the associated memory location are not current. If an APU requests the data when this bit is set to 0, the APU is prevented from immediately reading the data. In this case, an APU ID identifying the APU requesting the data, and an LS address identifying the memory location within the local storage of this APU to which the data are to be read when the data become current, are entered into the additional memory segment.

[0091] An additional memory segment also is allocated for each memory location within the local storage of the APUs. This additional memory segment stores one bit, designated the "busy bit." The busy bit is used to reserve the associated LS memory location for the storage of specific data to be retrieved from the DRAM. If the busy bit is set to 1 for a particular memory location in local storage, the APU can use this memory location only for the writing of these specific data. On the other hand, if the busy bit is set to 0 for a particular memory location in local storage, the APU can use this memory location for the writing of any data.

[...]

[0110] The scheme described above for the synchronized reading and writing of data from and to the shared DRAM also can be used for eliminating the computational resources normally dedicated by a processor for reading data from, and writing data to, external devices. This input/output (I/O) function could be performed by a PU. However, using a modification of this synchronization scheme, an APU running an appropriate program can perform this function. For example, using this scheme, a PU receiving an interrupt request for the transmission of data from an I/O interface initiated by an external device can delegate the handling of this request to this APU. The APU then issues a synchronize write command to the I/O interface. This interface in turn signals the external device that data now can be written into the DRAM. The APU next issues a synchronize read command to the DRAM to set the DRAM's relevant memory space into a blocking state. The APU also sets to 1 the busy bits for the memory locations of the APU's local storage needed to receive the data. In the blocking state, the additional memory segments associated with the DRAM's relevant memory space contain the APU's ID and the address of the relevant memory locations of the APU's local storage. The external device next issues a synchronize write command to write the data directly to the DRAM's relevant memory space. Since this memory space is in the blocking state, the data are immediately read out of this space into the memory locations of the APU's local storage identified in the additional memory segments. The busy bits for these memory locations then are set to 0. When the external device completes writing of the data, the APU issues a signal to the PU that the transmission is complete.
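
To make the F/E-bit handshake in [0088]-[0091] above a bit more concrete, here is a rough C sketch of how I read it; the struct layout and function names are invented for illustration, and in the real design this bookkeeping would live in the DMAC/memory controller rather than in software:

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128  /* one 1024-bit addressable DRAM location ([0089]) */

struct dram_line {
    uint8_t  data[LINE_BYTES];
    uint8_t  full;             /* F/E bit: 1 = data current, 0 = not current */
    uint8_t  waiting_apu_id;   /* APU parked on this line while F/E == 0     */
    uint32_t waiting_ls_addr;  /* where in that APU's LS to deliver the data */
};

/* Synchronized read: returns 1 if the data were current and copied into the
 * requesting APU's local storage, 0 if the request was parked instead.      */
int sync_read(struct dram_line *line, uint8_t apu_id,
              uint8_t *local_store, uint32_t ls_addr)
{
    if (line->full) {
        memcpy(local_store + ls_addr, line->data, LINE_BYTES);
        return 1;
    }
    line->waiting_apu_id  = apu_id;   /* remember who asked and where the  */
    line->waiting_ls_addr = ls_addr;  /* data should land once they arrive */
    return 0;
}

/* Synchronized write: makes the line current again; a parked request would
 * be completed here by forwarding the data to the recorded LS address.     */
void sync_write(struct dram_line *line, const uint8_t *src)
{
    memcpy(line->data, src, LINE_BYTES);
    line->full = 1;
}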

Apparently, the APUs can use the DMAC to access DRAM ( outside their Local Storage... something EE's VUs cannot do ) and can also perform some sort of I/O with external devices without the PU doing it for them.
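
The delegated I/O path in [0110] rides on the same blocking-state mechanism; roughly, from the APU's point of view (every function below is a hypothetical placeholder standing in for a DMAC command, not a real CELL API):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholders for the DMAC commands described in [0110]. */
extern void sync_write_io_interface(uint32_t dram_addr, size_t len);
extern void sync_read_blocking(uint32_t dram_addr, uint8_t apu_id, uint32_t ls_addr);
extern void set_ls_busy_bits(uint8_t apu_id, uint32_t ls_addr, size_t len);
extern void signal_pu_done(uint8_t apu_id);

/* An APU, delegated an I/O interrupt by the PU, stages a device-to-LS
 * transfer without the PU touching the data path.                     */
void apu_stage_external_write(uint8_t apu_id, uint32_t dram_addr,
                              uint32_t ls_addr, size_t len)
{
    sync_write_io_interface(dram_addr, len);        /* device may now write to DRAM */
    sync_read_blocking(dram_addr, apu_id, ls_addr); /* put that DRAM space into the
                                                       blocking state with our ID   */
    set_ls_busy_bits(apu_id, ls_addr, len);         /* reserve the LS destination   */
    /* The device's synchronized write lands in DRAM and is forwarded straight
       into this APU's local storage; the busy bits are then cleared.           */
    signal_pu_done(apu_id);                         /* report completion to the PU */
}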

Also, bonus quote:

[0120] Software cells are processed directly by the APUs from the APU's local storage. The APUs do not directly operate on any data or programs in the DRAM. Data and programs in the DRAM are read into the APU's local storage before the APU processes these data and programs. The APU's local storage, therefore, includes a program counter, stack and other software elements for executing these programs. The PU controls the APUs by issuing direct memory access (DMA) commands to the DMAC.

[...]



[0127] As noted, the PUs treat the APUs as independent processors, not co-processors. To control processing by the APUs, therefore, the PU uses commands analogous to remote procedure calls. These commands are designated "APU Remote Procedure Calls" (ARPCs). A PU implements an ARPC by issuing a series of DMA commands to the DMAC. The DMAC loads the APU program and its associated stack frame into the local storage of an APU. The PU then issues an initial kick to the APU to execute the APU Program.
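
Per [0120] and [0127], an ARPC from the PU side boils down to a couple of DMA commands followed by a kick. A minimal sketch, with made-up function names standing in for those DMAC commands:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the DMA commands the PU issues to the DMAC. */
extern void dmac_copy_to_ls(int apu, uint32_t ls_addr, const void *src, size_t len);
extern void dmac_kick(int apu, uint32_t entry_point);

/* "APU Remote Procedure Call": the PU loads the APU program and its stack
 * frame into the APU's local storage via DMA, then kicks the APU to run it. */
void arpc_launch(int apu,
                 const void *program, size_t prog_len, uint32_t prog_ls_addr,
                 const void *frame, size_t frame_len, uint32_t frame_ls_addr)
{
    dmac_copy_to_ls(apu, prog_ls_addr, program, prog_len);  /* code into LS  */
    dmac_copy_to_ls(apu, frame_ls_addr, frame, frame_len);  /* stack frame   */
    dmac_kick(apu, prog_ls_addr);                           /* initial kick  */
}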
 