How are the vector masks generated? Is it always handled by the hardware based on branches encountered in the code? I guess I'm missing the part where my scalar program would get strung across the vector unit for 16 data items at a time. It's pretty clear what happens in CUDA with the warp/block construct. But does LRB also have a similar setup where it automatically breaks down your data set into strand groups and maps them across the VPU? I thought those were just concepts, and that it was up to the developer to structure the data and code appropriately.
My brief look at the instruction set gave me the impression that it will probably be done 'manually'.
As in, you can generate a mask and then repack your strands based on that mask, but you'd have to specifically inject instructions to do that yourself. But perhaps I missed something very significant there.
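To make the 'manual' approach a bit more concrete, here is a rough sketch in plain C++ that emulates the three steps I mean: a vector compare that produces a 16-bit mask, an operation predicated on that mask, and a repack of the active strands. This is only a scalar emulation under my own assumptions. The names (`compare_lt`, `masked_add`, `compact`) and the 16-lane `float` layout are made up for illustration and are not actual LRB instructions or intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 16 lanes, matching the "16 data items at a time" mentioned above.
constexpr std::size_t kLanes = 16;
using Vec  = std::array<float, kLanes>;
using Mask = std::uint16_t;  // one bit per lane (hypothetical representation)

// Step 1: generate a mask explicitly with a compare.
// Bit i is set when a[i] < b[i].
Mask compare_lt(const Vec& a, const Vec& b) {
    Mask m = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (a[i] < b[i]) m |= static_cast<Mask>(1u << i);
    }
    return m;
}

// Step 2: a predicated operation. Lanes whose mask bit is set receive
// a[i] + b[i]; all other lanes keep the value already in dst.
Vec masked_add(const Vec& dst, const Vec& a, const Vec& b, Mask m) {
    Vec r = dst;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if ((m >> i) & 1u) r[i] = a[i] + b[i];
    }
    return r;
}

// Step 3: repack the strands. Active lanes are copied to the front of
// out in lane order; the return value is the number of active lanes.
std::size_t compact(const Vec& src, Mask m, Vec& out) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if ((m >> i) & 1u) out[n++] = src[i];
    }
    assert(n <= kLanes);
    return n;
}
```

The point of the sketch is that every step is something the program (or a compiler) has to emit explicitly. Nothing here happens automatically when a branch is encountered, which is the difference from the warp model in CUDA as I understand it.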