Anybody know how to optimize this code with OpenCL or SIMD instructions?

uint64_t sqrtx, x;

for (uint64_t i = 37; i

>(((SIMD)))
shiggy

Explain your reasoning.

bumping for input.

do your own homework

It's not homework.

what's it for then?

uint64_t sqrtx, x;

for (uint64_t i = 37; i

>i + 6
It's not the same code m8

You can crunch the if statements:
if (!(x % (i + NUM))) return ...

Class.

Actually, since you're dealing with magic numbers (0,4,6,10,12,16,22,24), I'd just put those in an array - then you'd have something like:

int specialNums[8] = {0,4,6,10,12,16,22,24};

for(uint64_t i = 37; i
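The loop above is cut off in the thread, so this is only a sketch of how the offsets-array idea might look when written out. It assumes the loop advances i by 30 per pass (the eight offsets from 37 cover exactly the residues coprime to 30, so that step seems likely but is not confirmed by the visible code), that all factors below 37 were already removed, and that 0 means "nothing found". The name wheel_factor is made up for illustration.

```c
#include <stdint.h>

/* Offsets quoted in the thread. Starting at 37 they give
 * 37, 41, 43, 47, 49, 53, 59, 61: one candidate per residue coprime to 30. */
static const uint64_t special_nums[8] = {0, 4, 6, 10, 12, 16, 22, 24};

/* Hypothetical helper: returns the first candidate >= 37 that divides x,
 * or 0 if none is found before i passes sqrtx.
 * ASSUMPTION: i steps by 30; the original loop header is truncated. */
uint64_t wheel_factor(uint64_t x, uint64_t sqrtx)
{
    for (uint64_t i = 37; i <= sqrtx; i += 30) {
        for (int k = 0; k < 8; k++) {
            uint64_t candidate = i + special_nums[k];
            if (x % candidate == 0)
                return candidate;
        }
    }
    return 0;
}
```

This mostly tidies the source. A compiler will probably unroll the inner loop, so don't expect it to be faster than eight hand-written if statements.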

What the fuck is the point of this code?

It's part of a number factoring function. The selected code takes up the majority of CPU time and I need to speed it up using vector processing or multi-processing. The problem is that I have little experience with such tasks.

All if checks can be removed. Speed is 20 ns then.

Too bad you never read Hacker's Delight.

Now back to 9gag

>All if checks can be removed
how?

That'd be pointless. The only way to optimize that is to get rid of the excessive testing.
Branch misprediction is killing your code.
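One way to read "get rid of the excessive testing" is to fold the eight tests of each pass into a single flag and branch only once per pass. A rough sketch, assuming the same offsets as above and a step of 30 (the original loop is truncated, so the step is a guess); scan_one_branch is a made-up name. Whether this helps at all has to be measured, since the 64-bit modulo operations are likely the expensive part.

```c
#include <stdint.h>

/* Offsets from the thread; the step of 30 between passes is an ASSUMPTION. */
static const uint64_t OFFSETS[8] = {0, 4, 6, 10, 12, 16, 22, 24};

/* Hypothetical helper: same result as testing each candidate in turn,
 * but with one conditional branch per pass instead of eight. */
uint64_t scan_one_branch(uint64_t x, uint64_t sqrtx)
{
    for (uint64_t i = 37; i <= sqrtx; i += 30) {
        int hit = 0;

        /* Accumulate all eight divisibility results without branching. */
        for (int k = 0; k < 8; k++)
            hit |= (x % (i + OFFSETS[k]) == 0);

        if (hit) {
            /* Rare path: work out which candidate divided x. */
            for (int k = 0; k < 8; k++)
                if (x % (i + OFFSETS[k]) == 0)
                    return i + OFFSETS[k];
        }
    }
    return 0; /* no divisor found up to sqrtx */
}
```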

What is this code supposed to do?
Can you provide some inputs/outputs please?

Okay... (I explained what it does in )

You can set it up with the following values (takes about 6 seconds to do 83 million loops on my machine) to get the return value 2502845209:


uint64_t x = 8700000089193112463;

uint64_t sqrtx = 2949576255;

The return value is the next lowest prime factor of the input number.
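The posts don't show how sqrtx is obtained. If it is meant to be the integer square root of x, it can be computed without floating-point rounding worries using the classic bit-by-bit method. A minimal sketch; isqrt64 is a hypothetical name, not something from the original code.

```c
#include <stdint.h>

/* Hypothetical helper: floor(sqrt(n)) using only integer arithmetic
 * (digit-by-digit method, one result bit per iteration). */
uint64_t isqrt64(uint64_t n)
{
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62; /* highest power of 4 in 64 bits */

    /* Start at the highest power of 4 that does not exceed n. */
    while (bit > n)
        bit >>= 2;

    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}
```

For the x given above this yields 2949576255, which matches the sqrtx value in the post.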

No, it can easily be sped up by using multi-processing (splitting the task across CPUs). But I want to know if there are any vector operations or if OpenCL would be of any use. It looks like there are a few things I can do before resorting to multi-processing at least. Seems like nobody here has much experience with vectors, GPU processing, or multi-threaded applications.

Yeah, just rewrite it without a loop

Not possible, it has to loop over 100 million times in some cases.

movapd xmm0, XMMWORD PTR A
movddup xmm2, QWORD PTR B
mulpd xmm2, xmm0
movddup xmm1, QWORD PTR B+8
shufpd xmm0, xmm0, 1
mulpd xmm1, xmm0
addsubpd xmm2, xmm1
movapd XMMWORD PTR C, xmm2

Well, how many tasks can be executed simultaneously on your GPU device?

One has 16 execution units, the other is supposed to have 384 CUDA cores. My CPU is quad core with HT in each core (SIMD operations in each).

GPU seems too slow though due to the overheads unless I change the way things run.

Can you give me any references as to how to understand this? How do I use it?
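For anyone trying to read that assembly: it multiplies two complex numbers stored as pairs of doubles (A and B in, C out) using SSE3. Below is a sketch of the same thing in C with intrinsics, plus a scalar fallback showing the arithmetic. The function name complex_mul is made up, and unaligned loads are used where the assembly uses aligned movapd. Note that it works on doubles; SSE has no packed 64-bit integer division or modulo, so it does not map directly onto the x % i tests in the factoring loop.

```c
#include <stdint.h>
#ifdef __SSE3__
#include <pmmintrin.h> /* SSE3: _mm_loaddup_pd, _mm_addsub_pd */
#endif

/* Hypothetical helper: c = a * b for complex numbers stored as {re, im}. */
void complex_mul(const double a[2], const double b[2], double c[2])
{
#ifdef __SSE3__
    __m128d va = _mm_loadu_pd(a);                       /* [a_re, a_im]           (movapd)   */
    __m128d t0 = _mm_mul_pd(_mm_loaddup_pd(&b[0]), va); /* [b_re*a_re, b_re*a_im] (movddup, mulpd) */
    __m128d sw = _mm_shuffle_pd(va, va, 1);             /* [a_im, a_re]           (shufpd)   */
    __m128d t1 = _mm_mul_pd(_mm_loaddup_pd(&b[1]), sw); /* [b_im*a_im, b_im*a_re] (movddup, mulpd) */
    _mm_storeu_pd(c, _mm_addsub_pd(t0, t1));            /* low: subtract, high: add (addsubpd) */
#else
    /* Scalar equivalent of the vector code above. */
    c[0] = a[0] * b[0] - a[1] * b[1];
    c[1] = a[1] * b[0] + a[0] * b[1];
#endif
}
```

For example, a = {1, 2} and b = {3, 4} gives c = {-5, 10}, i.e. (1 + 2i)(3 + 4i) = -5 + 10i.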

I am a noob and this is a horrible solution, but I was able to gain a speedup by spawning pthreads with intervals of the i value.
Although lower intervals will finish first, I secured the priority with a Queue structure.
Do that with OpenCL and you should gain an average speedup.
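A rough sketch of the pthread approach described here, under several assumptions: the loop steps by 30 with the offsets quoted earlier in the thread, four threads are used, and joining the threads in order of their range stands in for the queue (the lowest range that found something wins). All names are hypothetical, and there is no early cancellation, so threads scanning higher ranges keep running after a lower one has already found a factor.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4 /* ASSUMPTION: one thread per physical core */

static const uint64_t WHEEL[8] = {0, 4, 6, 10, 12, 16, 22, 24};

typedef struct {
    uint64_t x;     /* number being factored             */
    uint64_t begin; /* first i value of this range       */
    uint64_t end;   /* one past the last i value         */
    uint64_t found; /* first divisor found, 0 if none    */
} job_t;

/* Thread body: scan one contiguous range of i values. */
static void *scan_range(void *arg)
{
    job_t *job = arg;
    job->found = 0;
    for (uint64_t i = job->begin; i < job->end; i += 30)
        for (int k = 0; k < 8; k++)
            if (job->x % (i + WHEEL[k]) == 0) {
                job->found = i + WHEEL[k];
                return NULL;
            }
    return NULL;
}

/* Hypothetical driver: split [37, sqrtx] into NUM_THREADS ranges. */
uint64_t parallel_factor(uint64_t x, uint64_t sqrtx)
{
    pthread_t threads[NUM_THREADS];
    int started[NUM_THREADS];
    job_t jobs[NUM_THREADS];

    /* Number of 30-wide blocks to scan, rounded up per thread. */
    uint64_t blocks = sqrtx < 37 ? 0 : (sqrtx - 37) / 30 + 1;
    uint64_t per = (blocks + NUM_THREADS - 1) / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        uint64_t b = (uint64_t)t * per;
        uint64_t e = b + per;
        if (b > blocks) b = blocks;
        if (e > blocks) e = blocks;
        jobs[t].x = x;
        jobs[t].begin = 37 + 30 * b;
        jobs[t].end = 37 + 30 * e;
        started[t] = pthread_create(&threads[t], NULL, scan_range, &jobs[t]) == 0;
        if (!started[t])
            scan_range(&jobs[t]); /* fall back to scanning inline */
    }

    /* Join in range order so the lowest range with a hit takes priority. */
    uint64_t result = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (started[t])
            pthread_join(threads[t], NULL);
        if (result == 0 && jobs[t].found != 0)
            result = jobs[t].found;
    }
    return result;
}
```

Depending on the toolchain this may need to be built with -pthread.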

However, optimizing this loop with OpenCL/SIMD is a really weird task.
If you are able to, just switch to another algorithm for your factoring function.