So I assume you played around with the basic SSE instructions from the last post and are now familiar with the very basics of Intel''s SSE Instruction Set. In this second part of the HowTo we are going to solve the problem of normalizing a vector as fast as we can. This problem should be ideal for anyone to learn SSE and if you read carefully, you should be able to implement other more complicated operations in SSE as well. However, as in the last post, I will focus only on the implementation of the algorithm itself for didactical reasons.
Getting Started...So first of all, we''ll try to implement an algorithm that''ll solve our problems in C++. This is always a good starting point, because it helps us getting familiar with the problem and also helps as a reference when we''re going to implement the SSE-version. And like mentioned earlier, it is always good to have a non-SSE version of your programs at hand, in case your program has to be executed on an older machine.
Basically what we want to do is normalizing a vector. That means, you want to calculate the coordinates of a vector that points in the exact same direction as the given vector, but has a length of exactly 1. To achieve this, we''ll have to divide every coordinate of the given vector by the length of the given vector. The length is calculated as the square root of the sum of the squared coordinates. For a 3-D vector that means:
Code:
v3=(x,y,z);
|v3| = sqrt(x2+y2+z2)
v3_normalized = v3/|v3|
in c++ this would look like the following:
Code:
struct cVector
{
float x,y,z;
};
void main()
{
cVector vec1;
vec1.x=0.5;
vec1.y=1.5;
vec1.z=-3.141;
//First calculate the length:
float len = sqrt((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z));
//Now divide each coordinate by the length:
vec1.x/=len;
vec1.y/=len;
vec1.z/=len;
cout << vec1.x << " " << vec1.y << " " << vec1.z << ''\n'';
}
The lines that we''ll focus on are the ones for length calculation and for the coordinate division. So what do we have here? There are 3 muls and 3 adds in the length calculation, plus a really slow sqrt(), which we''ll handle in detail later on. Then there are 3 divs for normalizing the vector.
As you can imagine, it''ll be easy to wrap up the 3 muls and 3 divs into one SIMD call each.
The one thing that gets us into trouble is the addition. To handle this problem, we need to introduce another powerful concept of SSE: The shuffle!
Mixing registers: ShufflingThis is probably the hardest command for a SSE-adept to learn, so listen carefully:
The main problem about our register structure atm is, that we can only set off ''equal'' coordinates against each other. That means, we can add the x-coord of one vector with the x-coord of another, but we can not add the x-coord with the y-coord or the z-coord. unfortunately, that is exactly what we need here: we want to add the three coordinates of a single vector upon each other, so how''re we gonna do this?
The trivial method would be to reload the vector data from memory but store it in a different order, like maybe store the x coordinate in the second block of the register and the y value in the first, or something like that. This would not be a good idea, not at all! Be aware that fetching the data from RAM into the SSE registers is even slower than fetching it to the general purpose registers of the CPU, so that a double fetch from RAM could result in an even slower algorithm, than one that isn''t using SSE at all! However, there is a way of swapping the different elements of a SSE register, although it''s not that easy...
The command we''re talking about is the SHUFPS (Shuffle Packed Single). This command expects two SSE-registers and a one byte Hex-string as operands. The first two elements (Bits 0-63) of the Destination register (if you recall, that''ld be the first operand) will be overwritten by any two elements of the Destination register, while the last two elements (Bits 64-127) of the Destination register will be overwritten by any two elements of the source register. The elements that are actually copied are specified using the 1 Byte Hex-String. Before we proceed, I''ll write one example of how the SHUFPS looks like:
Code:
shufps xmm0, xmm1, 0x4e
Obviously, we want to shuffle to XMM0. That means, all elements contained in XMM0 might be overwritten, while XMM1 is only read but not changed. Now for the third parameter. First of all, let''s decode the Hex-Value 4E back to binary (i assume you already know how to do this, because if you don''t, you probably shouldn''t play around in assembler

).
Code:
[4E]_16 = [0100 1110]_2
Notice that you can split up the 8Bit Hex Value into four 2Bit values: 01, 00, 11, 10
Do you see something? We have now four values of which each is able to adress a space of four numbers. This damn Hex-string is actually telling us which elements to copy! However, there is one thing you''ll have to keep in mind: Computers are always reading the least significant bit first, so you''ll have to read from the left to the right (this is for technical reasons, so you shouldn''t bother about it; just keep it in mind when you''re using this command).
Now here''s what our shuffle does:
Code:
shufps xmm0, xmm1, 0x4e:
First element of XMM0 will be set to element 10 (the third element) of XMM0
2nd element of XMM0 will be set to element 11 (the fourth element) of XMM0
3rd element of XMM0 will be set to element 00 (the first element) of XMM1
4th element of XMM0 will be set to element 01 (the second element) of XMM1
You didn''t understand at all? Don''t worry, here is some code for you:
Code:
void main()
{
cVector vec1;
vec1.x=0.5;
vec1.y=1.5;
vec1.z=-3.141;
vec1.w=2;
__asm {
movups xmm0, vec1
movaps xmm1, xmm0
mulps xmm1, xmm1
shufps xmm0, xmm1, 0x4e
movups vec1, xmm0
}
cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << ''\n'';
}
First of all, we store vec1 in XMM0 and (vec1)2 in XMM1. Then we do the exact same shuffle as described above. Try to guess which values will be put out at the end of the console, before you execute it for the first time. If you were wrong, don''t worry, this is the only thing about SSE that is really complicated. Before you proceed I''d strongly recommend that you play around with the shuffling until you feel that you really understood it. This concept is simply too important to be skipped and gives you the ability to create very powerful algorithms. Try to experiment with the different parameters and Hex-values and try to visualize in your mind, how your data is stored and shuffled in memory. As soon as you feel that you''re comfy with the SHUFPS you may proceed...
Applied ShufflingNow we are ready to rebuild our normalizing function using what we''ve learned so far.
If you want, try to implement this by yourself, before you read on. You should have all the knowledge to write a basic SSE version of the program before we move on (DIVPS divides, MULPS multiplies) and it will be a great exercise to learn how all of the different commands fit together.
Code:
void main()
{
cVector vec1;
vec1.x=0.5;
vec1.y=1.5;
vec1.z=-3.141;
vec1.w=0;
cVector vec2;
__asm {
movups xmm0, vec1
mulps xmm0, xmm0 ;Calculate squares
movaps xmm1, xmm0
shufps xmm0, xmm1, 0x4e ;Shuffle #1
addps xmm0, xmm1 ;Add #1
movaps xmm1, xmm0
shufps xmm1, xmm1, 0x11 ;Shuffle #2
addps xmm0, xmm1 ;Add #2
movups vec2, xmm0
}
float len = sqrt(vec2.x);
vec1.x/=len;
vec1.y/=len;
vec1.z/=len;
cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << ''\n'';
}
Notice that you now need a fourth coordinate in the vector to make the algorithm work! Why does it need to be zero?
Be sure you understand how the two shuffle/adds work! Why do we need 2 shuffles? Is it wise to have a register where all four elements contain the same value?
This algorithm returns us a vector structure to vec2 where all four floats are set to |vec1|2. This algorithm is pretty fast, but there''s still the nasty sqrt-call when calculating the length. In the next chapter we''ll finally get rid of this...
The thing with the Square-rootsYou will find a chapter about this in almost any book on computer graphic algorithms. There are numerous ways to perform a fast square-root calculation. The sqrt() from math.h which we used so far is based on an iterative algorithm which is pretty much the standard for square root calculation. Since there is no square root function in assembler, this algorithm is calculating an iterative approximation to the square root until a given accuracy is reached. Unfortunately, this approach is not very fast. Even worse, many C++ compilers offer only one implementation of sqrt() for double precision, so you''re calculating with 64Bit precision although you can only save 32Bit when working with floats... A disaster in terms of performance!
The main idea behind all of the so called fast-sqrt-algorithms is basically to cheat around the whole operation: First of all, you''re not working at full precision, fast sqrt always means sacrificing precision to performance. While this would be irresponsible in scientific applications, it is very common in computer graphics where the precision of the fast algorithms is still enough. And last but not least you''re using one dirty little trick in these fast-sqrts: Lookup-tables! That means you''re actually calculating some typical values using the sqrt() from math.h while loading the program, and then calculate any square root you need from the values you have in this table at runtime. Now, building a lookup table isn''t that easy and it involves quite some math... But fortunately there is a lookup table for fast square root calculation implemented in SSE, so we''ll just have to use it

The command we''re looking for is RSQRTPS (reciproce square root of packed single). This function takes two register as arguments, calculates the reciproce of the square root (that is actually: 1/sqrt()) from the CPU-integrated lookup table of each element of the source register and saves it to the destination register.
All you''ll have to do now is take the reciproce of this calculation and you have the sqrt. But since we are going to divide by the length anyway, we don''t even need to do that! Isn''t that nice?
If you want to know more about this function, look it up in Intel''s manual. If you do so, take a look at the special return values for invalid inputs (e.g. if you try to do a sqrt(-1)).
Putting it all together: Our final Normalize-function:
Code:
struct cVector
{
float x,y,z,w;
};
void main()
{
cVector vec1;
vec1.x=0.5;
vec1.y=1.5;
vec1.z=-3.141;
vec1.w=0;
__asm {
movups xmm0, vec1
movaps xmm2, xmm0
mulps xmm0, xmm0
movaps xmm1, xmm0
shufps xmm0, xmm1, 0x4e
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm1, 0x11
addps xmm0, xmm1
rsqrtps xmm0, xmm0
mulps xmm2, xmm0
movups vec1, xmm2
}
cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << ''\n'';
}
Now this is it. There are no comments in there, but I hope you can still understand what''s going on.
Try to compile the program above and compare the output to the pure-C++ version. You notice the differences? That''s what I meant when I said sacrificing accuracy...

Now, this post has grown really huge and in fact I covered a lot more than I originally intended, but I hope you enjoyed this tut and that you were able to follow it properly. I''d love to hear your feedback on the forums.
Before I leave you with this code to play, here are some things that you could try after finishing the tutorial to improve your skills:
First of all: Build the above into a function. If you know how the stack pointer works, you may do some cool tweaks there too (if not, search the Intel manual for infos on how the EIP-register works

).
Second: Read about the RSQRTSS function in the Intel Manual. It is pretty much the same like the RSQRTPS, except for the fact that it is only calculating the sqrt of one value rather than four. Maybe you''ll find an efficient way to replace the above code with one that uses RSQRTSS. Try running some benchmarks with it, to find out which of the two implementations is faster.
Third: Explore the other SSE-functions and the Chapter in Volume 3 of the Intel manual. You can do some really crazy stuff with SSE, if you know how to use it right. If you know how caching works, you shouldn''t miss the PREFETCHxx command family which will improve your programs even further.
Last but not least: Just play around! Try to use what you''ve learned today to implement other algorithms. Start simple, maybe with elementary vector maths, and then slowly move to more complex subjects like colors or matrix-math. SSE can be a lot of fun and is relatively simple to handle within C. Try to make benchmarks to compare the pure C++ implementation to your SSE code, you''d be surprised what you can achieve by making the right optimizations.
And of course, if you''ve any questions about this, feel free to come back and ask, because that is what 3D Buzz is here for.
Again, I really hope you enjoyed this article and learned something from it. Thanks for your patience while reading it and see you on the forums,
ComicSansMS
post sciptum: thanks to all of you!
@halma: you''re absolutely right about the title, it''s really better the way it is now...