Using 0-8 SPUs, no change in run time
ymanton
30-Oct-2007, 18:25
I wrote a pretty simple program that counts all the bits set to 1 in a range of bytes. This program takes the same amount of time (give or take a few ms) to run regardless of whether I use the PPU exclusively to count or split the load among 1..8 SPUs. I've confirmed that the correct number of SPUs is actually running.
Regardless of how efficiently I'm DMAing data and how I'm doing the counting on the SPUs, I can't imagine that running the program with 0..8 SPUs and 32MB of data would always result in 12.xx secs per run; there should be some variation. I've tried it with smaller data sizes, larger data sizes, different DMA transfer sizes, etc.
Does anyone know what might be going on? I'm doing this on a Blade, not a PS3, if it matters.
how do you split the load? how do you report you are done? and, finally, with this simple a task, are you sure you're not simply bandwidth limited?
Albuquerque
30-Oct-2007, 20:05
I was wondering the same on the bandwidth part -- it sounds like you simply aren't CPU limited in any circumstance.
ymanton
30-Oct-2007, 20:17
Hi, thanks for taking the time.
It may very well be that I'm memory bound. I tried alternately removing the DMAs and the counting: the version with just DMAs took about as long as the regular program (several seconds each), while the version with just counting took a few ms.
The load is split evenly. I've tried splitting the data into contiguous regions per SPU and having the data interleaved into DMA sized blocks; I'm using the latter at the moment.
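The two layouts described above can be sketched as offset calculations. This is only an illustration: the helper names are hypothetical, and the interleaved form assumes the stride equals the DMA size times the number of SPUs, as in the default STRIDE of the config.h posted later in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of the k-th DMA block owned by SPU `spu`
 * when each SPU gets one contiguous region of `size_per_spu` bytes. */
size_t contiguous_offset(size_t spu, size_t k, size_t size_per_spu, size_t dma_size)
{
    assert(dma_size > 0 && (k + 1) * dma_size <= size_per_spu);
    return spu * size_per_spu + k * dma_size;
}

/* Hypothetical helper: byte offset of the k-th DMA block owned by SPU `spu`
 * when blocks are dealt out round-robin, so consecutive blocks of one SPU
 * are dma_size * num_spus bytes apart. */
size_t interleaved_offset(size_t spu, size_t k, size_t num_spus, size_t dma_size)
{
    assert(dma_size > 0 && spu < num_spus);
    return (k * num_spus + spu) * dma_size;
}
```

With 2 SPUs and 1024-byte blocks, SPU 1 reads offsets 1024, 3072, 5120, ... in the interleaved layout, while in the contiguous layout it starts at its own region and walks forward one block at a time.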
Each SPU is passed the starting address, total size, dma size, and stride through args and mailboxes, then they initiate the DMAs and do the counting. The DMAs are overlapped, so the SPU will initiate one DMA while working on the previous one. When they're done they write the count to a mailbox and exit.
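The overlap described above is the usual double-buffering pattern. Below is a portable sketch of its control flow, not the SPU code itself: `fetch_block` is a hypothetical stand-in that uses a synchronous memcpy where a real SPU would queue an asynchronous MFC get, and BLOCK is an assumed transfer size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 256  /* stand-in for the DMA transfer size (assumption) */

/* Stand-in for an asynchronous DMA get; a real SPU would queue an MFC command here. */
static void fetch_block(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, BLOCK);
}

/* Count the set bits in one byte by clearing the lowest set bit until none remain. */
static unsigned popcount8(unsigned char b)
{
    unsigned c = 0;
    while (b) { b &= (unsigned char)(b - 1); ++c; }
    return c;
}

/* Double-buffered count over `size` bytes; `size` must be a multiple of BLOCK. */
unsigned long count_bits_double_buffered(const unsigned char *data, size_t size)
{
    static unsigned char buf[2][BLOCK];
    unsigned long count = 0;
    size_t nblocks = size / BLOCK;
    size_t i, j;

    assert(size % BLOCK == 0);
    if (nblocks == 0)
        return 0;

    fetch_block(buf[0], data);                 /* prime buffer 0 */
    for (i = 0; i < nblocks; ++i) {
        if (i + 1 < nblocks)                   /* start the next transfer, never past the end */
            fetch_block(buf[(i + 1) & 1], data + (i + 1) * BLOCK);
        /* a real SPU would wait here for the transfer into buffer i & 1 */
        for (j = 0; j < BLOCK; ++j)
            count += popcount8(buf[i & 1][j]);
    }
    return count;
}
```

The guard on `i + 1 < nblocks` matters on real hardware too: without it the last iteration requests a block beyond the range the SPU was given.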
After kicking off all the SPUs, the PPU does any extra counting necessary (whatever is left over after dividing among the SPUs and adjusting for alignment), then spins on the SPU mailboxes, adds the results together, joins the SPU threads, and returns.
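The leftover work the PPU picks up can be made concrete with a small sketch of the partition arithmetic. The struct and function names here are hypothetical, and the sketch assumes the alignment and DMA size are powers of two so that masking works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct split {
    size_t head;     /* bytes before the first aligned address, counted by the PPU */
    size_t per_spu;  /* bytes handed to each SPU, a multiple of dma_size */
    size_t tail;     /* bytes left after the SPU regions, counted by the PPU */
};

/* Hypothetical helper: divide `size` bytes starting at `addr` among `num_spus` SPUs. */
struct split split_work(uintptr_t addr, size_t size, size_t align,
                        size_t dma_size, size_t num_spus)
{
    struct split s;
    size_t body;

    /* align and dma_size must be powers of two for the masks below to work */
    assert(align && (align & (align - 1)) == 0);
    assert(dma_size && (dma_size & (dma_size - 1)) == 0);
    assert(num_spus > 0);

    s.head = (align - (addr & (align - 1))) & (align - 1);
    if (s.head > size)
        s.head = size;
    body = size - s.head;
    s.per_spu = (body / num_spus) & ~(dma_size - 1);  /* round down to whole DMA blocks */
    s.tail = body - s.per_spu * num_spus;
    return s;
}
```

For example, 10000 bytes starting 16 bytes past a 128-byte boundary, split between 2 SPUs with 1024-byte DMAs, gives a 112-byte head, 4096 bytes per SPU, and a 1696-byte tail. Small inputs can end up entirely on the PPU, since `per_spu` rounds down to zero.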
I guess there's not much I can do with such a computationally simple application, so I'll leave it at that and move on.
Twelve seconds seems an awfully long time to pass 32MB through Cell... I don't think your bottleneck is memory bandwidth or compute cycles, but a bug in your code/design.
What's the count returned by the SPUs and the PPU? They should all be roughly the same if the data is random (or alternating 1's and 0's) and the load evenly distributed.
ymanton
31-Oct-2007, 05:11
Hi, don't worry about the 12 seconds, that included generating random data on the PPU beforehand and doing the count with a dumb loop so I could check that I was getting the same result when counting on the SPUs. Without the diagnostic stuff, for a 250MB data set and 20 runs I'm getting about 4 seconds total.
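As a rough sanity check on those figures, the effective throughput can be worked out directly. The helper below is a hypothetical one-liner, and it treats "250MB" as 250 units of whatever megabyte the measurement used.

```c
#include <assert.h>

/* Effective throughput in MB/s for `runs` passes over `megabytes` of data
 * completed in `seconds` of wall-clock time. */
double effective_mb_per_s(double megabytes, unsigned runs, double seconds)
{
    assert(seconds > 0.0);
    return megabytes * runs / seconds;
}
```

Calling `effective_mb_per_s(250.0, 20, 4.0)` gives 1250 MB/s, or about 1.25 GB/s. That is well under the 25.6 GB/s peak usually quoted for Cell's XDR interface, though the 4 seconds presumably also covers thread creation and other overhead, so it is only a lower bound on what the DMAs achieved.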
I've tried various things to squeeze more out of it: forcing the data out to memory with dcbst, in case it was sitting in the cache and throttling the DMAs; and pulling the first DMA block into cache with dcbt, to possibly reduce the latency of the first DMA that the first SPU has to wait for, while letting the second SPU fetch its first block from memory. None of that made much of a difference.
It's my understanding that the EIB is a ring, and that data hops around it, so I was trying to see if I could use just the two SPUs adjacent to the MIC, but I didn't see anything in libspe2 that exposed the placement of SPUs on the EIB.
Ah well, pretty trivial program really, but thanks for the help anyway, it was a nice way to familiarize myself with things.
Vitaly Vidmirov
31-Oct-2007, 10:58
1 SPU can't fill the xdr bandwidth. You need at least 2.
ymanton
31-Oct-2007, 14:41
Yeah, I noticed that. After I removed the diagnostic stuff, which was dominating everything else, the times got better as I went from 0 to 1 to 2 SPUs, and pretty much stayed there or went up slightly after that.
bonniemathew
10-Dec-2007, 13:46
hey ymanton,
Could you please paste the code of the program you've written to calculate the bandwidth?
Thanks in advance,
Bonnie
Since we've had this discussion on the board recently, you could use this test program to fill the XDR bandwidth to check whether or not there is any difference in performance depending on which combination of SPEs you use when you use 3 SPUs (testing for affinity).
ymanton
11-Dec-2007, 19:45
Here you go; it shouldn't be too hard to follow. Most of it is configurable at compile time, though you can pass the work size on the command line. The actual bit counting stuff might not be useful to you if you want to measure bandwidth.
"config.h"
#ifndef config_h
#define config_h
#ifndef WORK_SIZE
#define WORK_SIZE (64 * 1024 * 1024)
#endif
#ifndef ALIGNED_MEM
#define ALIGNED_MEM 1
#endif
#ifndef PPE_MEM_ALIGN
#define PPE_MEM_ALIGN 128
#endif
#ifndef SPE_MEM_ALIGN
#define SPE_MEM_ALIGN 16
#endif
#ifndef NUM_SPES
#define NUM_SPES 2
#endif
#ifndef DMA_SIZE
#define DMA_SIZE 1024
#endif
#ifndef STRIDE
#define STRIDE (DMA_SIZE * NUM_SPES)
#endif
#endif
"countBits.c"
#include <stdio.h>
#include <stdlib.h>
#ifdef VERIFY
#include <time.h>
#endif
#include "config.h"

extern unsigned int countBits(char *bits, unsigned int size_in_bytes);

/* Reference implementation: test every bit of every byte. */
unsigned int countBits_safe(char *bits, unsigned int size_in_bytes)
{
    unsigned int i, j;
    unsigned int c = 0;

    for (i = 0; i < size_in_bytes; ++i)
    {
        for (j = 0; j < 8; ++j)
            c += (bits[i] >> j) & 1;
    }
    return c;
}

int main(int argc, char **argv)
{
    unsigned int size_in_bytes;
    char *bits = NULL;
    unsigned int count;
#ifdef VERIFY
    unsigned int i, actual_count;
#endif

    if (argc != 2)
        size_in_bytes = WORK_SIZE;
    else
        size_in_bytes = (unsigned int)strtoul(argv[1], NULL, 10);

    /* config.h always defines ALIGNED_MEM, so test its value, not its existence */
#if ALIGNED_MEM
    /* posix_memalign takes a void** and returns nonzero on failure */
    if (posix_memalign((void**)&bits, PPE_MEM_ALIGN, size_in_bytes) != 0)
        bits = NULL;
#else
    bits = malloc(size_in_bytes);
#endif
    if (bits == NULL)
    {
        fprintf(stderr, "Fail: could not allocate %u bytes\n", size_in_bytes);
        return 1;
    }

#ifdef VERIFY
    srand(time(NULL));
    for (i = 0; i < size_in_bytes; ++i)
        bits[i] = rand();
    actual_count = countBits_safe(bits, size_in_bytes);
#endif

    count = countBits(bits, size_in_bytes);
    free(bits);

#ifdef VERIFY
    if (count != actual_count)
    {
        fprintf(stderr, "Fail: actual bit count = %u, calculated bit count = %u\n", actual_count, count);
        return 1;
    }
#endif
    return 0;
}
"ppu_countBits.c"
#include <pthread.h>
#include <inttypes.h>
#include <libspe2.h>
#include "config.h"

struct SPE_ARGS
{
    spe_context_ptr_t ctx;
    char *bits;
    unsigned int size_in_bytes;
};

extern spe_program_handle_t spe_countBits;

/* Scalar count used for the unaligned head and the leftover tail. */
unsigned int countBits_ppu(char *bits, unsigned int size_in_bytes)
{
    unsigned int c = 0;
    unsigned int i;

    for (i = 0; i < size_in_bytes; ++i)
    {
        c += (bits[i] & 1) +
             ((bits[i] >> 1) & 1) +
             ((bits[i] >> 2) & 1) +
             ((bits[i] >> 3) & 1) +
             ((bits[i] >> 4) & 1) +
             ((bits[i] >> 5) & 1) +
             ((bits[i] >> 6) & 1) +
             ((bits[i] >> 7) & 1);
    }
    return c;
}

/* Thread body: run one SPE context, passing the start address as argp and the size as envp. */
void* spe_run(void *arg)
{
    struct SPE_ARGS *spe_args = (struct SPE_ARGS*)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    int rval;

    do
    {
        /* go through uintptr_t so the integer-to-pointer cast is well defined */
        rval = spe_context_run(spe_args->ctx, &entry, 0, spe_args->bits, (void*)(uintptr_t)spe_args->size_in_bytes, NULL);
    }
    while (rval > 0);
    return NULL;
}

unsigned int countBits(char *bits, unsigned int size_in_bytes)
{
    pthread_t spe_thread[NUM_SPES];
    struct SPE_ARGS spe_args[NUM_SPES];
    spe_gang_context_ptr_t gang_ctx = NULL;
    unsigned int count = 0;
    unsigned int i;
    unsigned int extra_h;
    unsigned int size_per_spe;
    unsigned int extra_t;
    char *bits_spe;

    /* extra_h: bytes before the first aligned address (head, counted on the PPU) */
    extra_h = (PPE_MEM_ALIGN - ((uintptr_t)bits & (PPE_MEM_ALIGN - 1))) & (PPE_MEM_ALIGN - 1);
    if (extra_h > size_in_bytes) extra_h = size_in_bytes;
    size_per_spe = (size_in_bytes - extra_h) / NUM_SPES;
    extra_t = (size_in_bytes - extra_h) % NUM_SPES;
    bits_spe = bits + extra_h;

    /* round each SPE's share down to whole DMA blocks; the remainder joins the tail */
    extra_t += (size_per_spe & (DMA_SIZE - 1)) * NUM_SPES;
    size_per_spe &= ~(DMA_SIZE - 1);

    if (size_per_spe != 0)
    {
        gang_ctx = spe_gang_context_create(0);

        /* first context: ask for the SPE closest to main memory */
        spe_args[0].bits = bits_spe;
        spe_args[0].size_in_bytes = size_per_spe;
        spe_args[0].ctx = spe_context_create_affinity(SPE_AFFINITY_MEMORY, NULL, gang_ctx);
        spe_program_load(spe_args[0].ctx, &spe_countBits);
        pthread_create(&spe_thread[0], 0, &spe_run, (void*)&spe_args[0]);
#ifdef INTERLEAVED
        bits_spe += DMA_SIZE;
#else
        bits_spe += size_per_spe;
#endif

        /* remaining contexts: each one is a neighbour of the previous */
        for (i = 1; i < NUM_SPES; ++i)
        {
            spe_args[i].bits = bits_spe;
            spe_args[i].size_in_bytes = size_per_spe;
            spe_args[i].ctx = spe_context_create_affinity(0, spe_args[i - 1].ctx, gang_ctx);
            spe_program_load(spe_args[i].ctx, &spe_countBits);
            pthread_create(&spe_thread[i], 0, &spe_run, (void*)&spe_args[i]);
#ifdef INTERLEAVED
            bits_spe += DMA_SIZE;
#else
            bits_spe += size_per_spe;
#endif
        }
    }

    /* the PPU counts the head and tail while the SPEs work */
    count += countBits_ppu(bits, extra_h) + countBits_ppu(bits + extra_h + size_per_spe * NUM_SPES, extra_t);

    if (size_per_spe != 0)
    {
        for (i = 0; i < NUM_SPES; ++i)
        {
            unsigned int spe_count;

            /* spin until the SPE has posted its result */
            while (spe_out_mbox_status(spe_args[i].ctx) == 0);
            spe_out_mbox_read(spe_args[i].ctx, &spe_count, 1);
            count += spe_count;
            pthread_join(spe_thread[i], NULL);
            spe_context_destroy(spe_args[i].ctx);
        }
        spe_gang_context_destroy(gang_ctx);
    }
    return count;
}
"spu_countBits.c"
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include "config.h"

#if (DMA_SIZE % 256 != 0)
#error "DMA size must be a multiple of 256--see unrolled loop"
#endif
#if (DMA_SIZE > 16384)
#error "DMA size must not exceed 16 KB, the largest single MFC transfer"
#endif

static char buffer[DMA_SIZE * 2] __attribute__((aligned(SPE_MEM_ALIGN)));

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned int bits_ea = argp;
    unsigned int size_in_bytes = envp;
    char *bits[2] = {buffer, buffer + DMA_SIZE};
    unsigned int count = 0;
    unsigned int tag[2] = {0, 1};
    unsigned int tagmask[2] = {1, 2};
    unsigned int bytes_done, i, j;
    vector unsigned char *bits_v;

    /* prime the first buffer */
    spu_mfcdma32(bits[0], bits_ea, DMA_SIZE, tag[0], MFC_GET_CMD);
    for (i = 0, bytes_done = 0; bytes_done < size_in_bytes; ++i, bytes_done += DMA_SIZE)
    {
        /* prefetch the next block into the other buffer, but never past the
           end of this SPE's share (the unguarded version fetched one block too many) */
        if (bytes_done + DMA_SIZE < size_in_bytes)
        {
#ifdef INTERLEAVED
            bits_ea += STRIDE;
#else
            bits_ea += DMA_SIZE;
#endif
            spu_mfcdma32(bits[(i + 1) & 1], bits_ea, DMA_SIZE, tag[(i + 1) & 1], MFC_GET_CMD);
        }

        /* wait for the current block to arrive */
        spu_writech(MFC_WrTagMask, tagmask[i & 1]);
        while (!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE));

#if !defined(NO_COUNT) || defined(VERIFY)
        bits_v = (vector unsigned char*)bits[i & 1];
        /* each iteration handles 16 vectors = 256 bytes */
        for (j = 0; j < DMA_SIZE / 16; j += 16)
        {
            vector unsigned char b_0 = bits_v[j+0];
            vector unsigned char b_1 = bits_v[j+1];
            vector unsigned char b_2 = bits_v[j+2];
            vector unsigned char b_3 = bits_v[j+3];
            vector unsigned char b_4 = bits_v[j+4];
            vector unsigned char b_5 = bits_v[j+5];
            vector unsigned char b_6 = bits_v[j+6];
            vector unsigned char b_7 = bits_v[j+7];
            vector unsigned char b_8 = bits_v[j+8];
            vector unsigned char b_9 = bits_v[j+9];
            vector unsigned char b_a = bits_v[j+10];
            vector unsigned char b_b = bits_v[j+11];
            vector unsigned char b_c = bits_v[j+12];
            vector unsigned char b_d = bits_v[j+13];
            vector unsigned char b_e = bits_v[j+14];
            vector unsigned char b_f = bits_v[j+15];

            /* per-byte population counts */
            vector unsigned char c_0 = spu_cntb(b_0);
            vector unsigned char c_1 = spu_cntb(b_1);
            vector unsigned char c_2 = spu_cntb(b_2);
            vector unsigned char c_3 = spu_cntb(b_3);
            vector unsigned char c_4 = spu_cntb(b_4);
            vector unsigned char c_5 = spu_cntb(b_5);
            vector unsigned char c_6 = spu_cntb(b_6);
            vector unsigned char c_7 = spu_cntb(b_7);
            vector unsigned char c_8 = spu_cntb(b_8);
            vector unsigned char c_9 = spu_cntb(b_9);
            vector unsigned char c_a = spu_cntb(b_a);
            vector unsigned char c_b = spu_cntb(b_b);
            vector unsigned char c_c = spu_cntb(b_c);
            vector unsigned char c_d = spu_cntb(b_d);
            vector unsigned char c_e = spu_cntb(b_e);
            vector unsigned char c_f = spu_cntb(b_f);

            /* sum groups of four bytes into halfwords, then add pairwise */
            vector unsigned short s_01 = spu_sumb(c_0, c_1);
            vector unsigned short s_23 = spu_sumb(c_2, c_3);
            vector unsigned short s_45 = spu_sumb(c_4, c_5);
            vector unsigned short s_67 = spu_sumb(c_6, c_7);
            vector unsigned short s_89 = spu_sumb(c_8, c_9);
            vector unsigned short s_ab = spu_sumb(c_a, c_b);
            vector unsigned short s_cd = spu_sumb(c_c, c_d);
            vector unsigned short s_ef = spu_sumb(c_e, c_f);
            vector unsigned short s_0123 = spu_add(s_01, s_23);
            vector unsigned short s_4567 = spu_add(s_45, s_67);
            vector unsigned short s_89ab = spu_add(s_89, s_ab);
            vector unsigned short s_cdef = spu_add(s_cd, s_ef);
            vector unsigned short s_01234567 = spu_add(s_0123, s_4567);
            vector unsigned short s_89abcdef = spu_add(s_89ab, s_cdef);
            vector unsigned short s_all = spu_add(s_01234567, s_89abcdef);

            /* horizontal sum of the eight halfword lanes */
            unsigned short s0 = spu_extract(s_all, 0);
            unsigned short s1 = spu_extract(s_all, 1);
            unsigned short s2 = spu_extract(s_all, 2);
            unsigned short s3 = spu_extract(s_all, 3);
            unsigned short s4 = spu_extract(s_all, 4);
            unsigned short s5 = spu_extract(s_all, 5);
            unsigned short s6 = spu_extract(s_all, 6);
            unsigned short s7 = spu_extract(s_all, 7);
            unsigned short s01 = s0 + s1;
            unsigned short s23 = s2 + s3;
            unsigned short s45 = s4 + s5;
            unsigned short s67 = s6 + s7;
            unsigned short s0123 = s01 + s23;
            unsigned short s4567 = s45 + s67;
            unsigned short total = s0123 + s4567;
            count += total;
        }
#endif /*!defined(NO_COUNT) || defined(VERIFY)*/
    }

    /* report the result to the PPU */
    spu_write_out_mbox(count);
    return 0;
}
"Makefile"
ifdef XLC
PPU_CC ?= ppuxlc
SPU_CC ?= spuxlc
PPU_CFLAGS += -q32 -O5 -qarch=cbeppu -qtune=cbeppu
SPU_CFLAGS += -O5 -qarch=cbespu -qtune=cbespu
else
PPU_CC ?= ppu-gcc
SPU_CC ?= spu-gcc
PPU_CFLAGS += -m32 -Wall -O3
SPU_CFLAGS += -Wall -O3
endif

PPU_EMBEDSPU ?= ppu-embedspu
PPU_EMBEDSPU_FLAGS += -m32

.PHONY: clean

ppe_countBits: countBits.c ppu_countBits.c spe_countBits.o
	${PPU_CC} ${PPU_CFLAGS} -o ppe_countBits countBits.c ppu_countBits.c spe_countBits.o -lpthread -lspe2

spe_countBits.o: spe_countBits
	${PPU_EMBEDSPU} ${PPU_EMBEDSPU_FLAGS} spe_countBits spe_countBits spe_countBits.o

spe_countBits: spu_countBits.c
	${SPU_CC} ${SPU_CFLAGS} -o spe_countBits spu_countBits.c

clean:
	rm -f ppe_countBits spe_countBits.o spe_countBits
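For readers without an SPU toolchain, the unrolled loop body in spu_countBits.c can be modelled in portable C. This is a scalar sketch of what the reduction computes for one 256-byte group, not the SPU code: `cntb` stands in for one lane of spu_cntb, and the way bytes are assigned to the eight 16-bit lanes is illustrative rather than the hardware's exact layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-byte population count, modelling what spu_cntb does in each of its 16 lanes. */
static uint8_t cntb(uint8_t b)
{
    b = (uint8_t)(b - ((b >> 1) & 0x55));           /* 2-bit partial counts */
    b = (uint8_t)((b & 0x33) + ((b >> 2) & 0x33));  /* 4-bit partial counts */
    return (uint8_t)((b + (b >> 4)) & 0x0F);        /* final count, 0..8 */
}

/* Count the set bits in one 256-byte group: byte counts first, then
 * accumulation into eight 16-bit lanes, then a horizontal sum. */
unsigned count_group_256(const uint8_t *p)
{
    uint16_t lanes[8] = {0};  /* models the eight halfword lanes of s_all */
    unsigned total = 0;
    size_t v, i;

    for (v = 0; v < 16; ++v)        /* 16 vectors of 16 bytes each */
        for (i = 0; i < 16; ++i)    /* every 4 bytes feed one lane */
            lanes[(v & 1) * 4 + i / 4] += cntb(p[v * 16 + i]);

    for (i = 0; i < 8; ++i)         /* horizontal sum, like the spu_extract chain */
        total += lanes[i];
    return total;
}
```

Each lane receives at most 8 vectors x 4 bytes x 8 bits = 256, and the group total is at most 2048, so 16-bit arithmetic cannot overflow within one group. That is presumably why the loop is unrolled in 256-byte steps and why DMA_SIZE must be a multiple of 256.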
ymanton
11-Dec-2007, 22:06
Since we've had this discussion on the board recently, you could use this test program to fill the XDR bandwidth to check whether or not there is any difference in performance depending on which combination of SPEs you use when you use 3 SPUs (testing for affinity).
First, libspe2 has limited affinity options; all you can ask for is the SPE closest to main memory, and one of the two neighbours of an SPE you already have. It doesn't expose the geometry of the ring explicitly. So to do what you asked, I requested the SPE closest to memory and then created 7 neighbours in a chain. With that I have some idea of which contexts map to which physical SPEs, but I can't say it's 100% guaranteed.
Having said that, I tried with SPEs 0,1,7 (the closest three) vs 3,4,5 (the farthest three) and it didn't make a noticeable difference.
If anyone is interested in those modifications to the previous code I posted, let me know.