Load/Store/Address dilemma?

Subcutanious

Newcomer
Why do GPUs/CPUs need:

1-Register Files

2-Operand Collectors

3-Load/Store Units

4-Address Generation Unit

5-LDS

Some of these units seem to have overlapping functions. Is there a way to clearly distinguish between them?

Thanks in advance.
 
1-Register Files
They are needed to store the temporary results of calculations.
2-Operand Collectors
These read values from the registers and feed them to the execution units at the right pace.
3-Load/Store Units
They read/write data from RAM.
4-Address Generation Unit
The memory addresses used by the software are virtual. They need to be translated to physical addresses by the AGU.
5-LDS
Local data stores are caches that hold data accessible by multiple threads; accessing them is faster than accessing RAM.
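Maybe a rough sketch helps: the C below maps a single statement onto those units in comments. The array and function names are made up, and the mapping is schematic rather than any particular GPU's or CPU's pipeline.
Code:
/* A rough mapping of one C statement, c[i] = a[i] + b[i], onto the units above. */
int a[256], b[256], c[256];

void add_one(int i)
{
    int x = a[i];   /* AGU computes the address of a[i]; the load/store unit reads it          */
    int y = b[i];   /* same for b[i]; both values land in the register file                    */
    int z = x + y;  /* the operand collector feeds x and y from registers to the ALU           */
    c[i] = z;       /* the result sits in a register until the load/store unit writes it back  */
}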
 
Why do GPUs/CPUs need:
1-Register Files
This is where the data are stored before and after processing in the ALUs.
2-Operand Collectors
In computer terminology, an instruction is the math operation itself (like add or multiply), while an operand specifies the data (values) that it operates on.

2 x 3 = ?
x (multiply) is the instruction.
2 and 3 are the operands.

So the operand collectors gather the operands and feed them to the ALUs.
3-Load/Store Units
Responsible for tracking and processing memory addresses in order to access memory (read = load, write = store).

4-Address Generation Unit
Not sure; I think this is a purely CPU function. As stated by Nick, the CPU maps all types of memory (HD, RAM, cache, etc.) into a single continuous virtual address space. This is done on a per-process basis, and it hides data fragmentation and improves the efficiency of fetching.

However, once the CPU is ready to store data, the virtual addresses must be converted into physical ones, and this is where the AGU proves useful.

The AGU is aided by the TLB (translation lookaside buffer), a cache in which the CPU stores recent virtual-to-physical address translations for the convenience of the AGU.

Do AGUs do the page walk?

5-LDS
Local on-chip caches. These are used in conjunction with the texture units and ROPs, and they are also used in GPGPU programs. They facilitate fetching data in advance (like CPU caches) and moving data around the chip without the need to repeatedly access the slow RAM.


However, I find myself curious about the order in which these units work in a GPU. I guess they work like this (excluding the intermediate steps of processing instructions in the ALUs):

1-Load/Store Units
2-Operand Collectors
3-Register Files
 
Wouldn't it be more correct to say the Load/Store units move data to/from the L1 and L2 caches rather than to/from RAM?
Right, the load/store units handle SRAM cache access while the memory controller (either on-chip or in the northbridge) handles DRAM access.
I thought DTLB did that. Doesn't it?
Yes, my bad. The AGUs are responsible for the various addressing modes, computing the actual virtual address. I guess the TLB is considered part of the load/store units.
 
The AGUs produce the needed virtual addresses for a memory operation.

The TLB caches the most recently used page table entries that would map those virtual addresses to physical ones. The page tables are allocated and maintained by whatever is managing the virtual memory system. Because the AGUs do not produce actual load or store values, we wouldn't expect them to play a role in building the page tables (other than the role they'd play in any other memory access).
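If a software analogy helps (the structure names, sizes, and direct-mapped layout below are invented for illustration; real TLBs and page walkers are dedicated hardware, not C code), the division of labour looks roughly like this: the AGU's job ends at producing the virtual address, and the TLB lookup and page walk happen on the memory-access side.
Code:
/* Toy model of virtual-to-physical translation; all names and sizes are illustrative. */
#include <stdint.h>

#define PAGE_SHIFT  12                    /* 4 KiB pages */
#define TLB_ENTRIES 64

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Slow path: walk the page tables that the OS maintains. Stubbed out here. */
static uint64_t walk_page_tables(uint64_t vpn)
{
    return vpn;                           /* pretend identity mapping */
}

/* vaddr is what the AGU produced for the load/store. */
uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (!e->valid || e->vpn != vpn) {     /* TLB miss: do the page walk, cache the result */
        e->vpn   = vpn;
        e->pfn   = walk_page_tables(vpn);
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | offset;
}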

edit: beaten
 
So the AGU is responsible for looking up DTLB and converting virtual to physical address. If so, then where are the lea (and cousins) dispatched?
 
The AGU produces the virtual address. The actual TLB checks are handled when the memory operation it produced the address for goes through the load/store unit and a memory access is initiated.

As for LEA, in AMD's x86 designs the later cores split LEA into a string of ADDs.
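One way to see the LEA point in C terms (illustration only; the function names are mine): LEA is just the address arithmetic with no memory access, while a load does the same arithmetic and then actually touches memory through the load/store unit and TLB.
Code:
#include <stdint.h>
#include <stddef.h>

/* lea eax, [ebx + 4*ecx + 8]  --  pure arithmetic, no memory access;        */
/* a core that splits LEA could do this as a shift plus two adds.            */
uintptr_t lea_like(uintptr_t base, uintptr_t index)
{
    return base + (index << 2) + 8;
}

/* mov eax, [ebx + 4*ecx + 8]  --  the same arithmetic, followed by a load,  */
/* which is where the load/store unit and the TLB get involved.              */
int32_t load_like(const int32_t *base, size_t index)
{
    return base[index + 2];               /* 4*index + 8 bytes past base */
}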
 
For simpler addressing modes, the address can come mostly from registers.

x86 has more complex addressing.
I don't know enough to do more than rattle off some of the additional arguments AGUs take for that architecture.

Sourcing from the www.chip-architect.com article on K8:
address = base + (index << scale) + displacement + segment
Scale is hard-coded, segment is a register (usually assumed to be 0), displacement comes from decode, and the index and base come from the register file.
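Plugging made-up numbers into that formula (segment assumed 0, as above; the function name is just for illustration):
Code:
#include <stdint.h>

/* address = base + (index << scale) + displacement + segment                  */
/* e.g. for add eax, [ebx + 4*ecx + 8] with ebx = 0x1000 and ecx = 5:          */
/*   0x1000 + (5 << 2) + 8 + 0 = 0x1000 + 0x14 + 0x8 = 0x101C                  */
uint64_t k8_style_address(uint64_t base, uint64_t index, unsigned scale_log2,
                          int64_t displacement, uint64_t segment)
{
    return base + (index << scale_log2) + displacement + segment;
}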
 
:oops:
Wow, x86 IS badly fucked up.
 
While I agree that x86 is in general fucked up, I actually think the addressing modes are one thing x86 has gotten mostly right (well, except for the segments and the instruction encoding). From the hardware design point of view, it basically just adds a bunch of numbers together with not much else happening (no memory indirections, no pre/post-increments, no register shift amounts or other messy stuff like that), meaning it's easy to make a very fast address generation unit by putting together a single regular adder and a few carry-save adders. From the compiler design point of view, the x86 addressing modes are relatively easy to map to higher-level-language constructs, meaning it's easy for the compiler to take full advantage of them. A C code line like e.g.
Code:
a += b[c+2];
easily maps to an x86 assembly instruction like
Code:
add eax, [ebx + 4*ecx + 8]
which is basically as complex as x86 addressing modes get.
 