HowTo: Inline Assembly & SSE: Vector normaliz...

jijo 2009-09-17

involving inline assembly and SSE

The goal of this small HowTo is to present a very fast algorithm for normalizing vectors. The whole thing started back in this thread, where BlackHwk4 was asking for the "fastest way to normalize".
Disclaimer: The code examples below are far from perfect. In particular, I've skipped almost all environment saving (saving registers to the stack before overwriting them, saving/restoring the FPU state before using SSE, etc.) to keep the code readable and easy to understand for a complete beginner. If you use one of the snippets in a bigger program, you should take care of these things before shipping it.

So let's get started:

The very basics: Inline Assembler in 5 minutes
ISO/IEC 14882, the C++ standard, specifies an asm declaration that allows the use of inline assembler in your C++ programs. However, the standard leaves the meaning of this declaration up to the implementation, so it depends on your compiler how the instructions are interpreted. Please refer to the manual of your compiler if you want to know how to use assembler.
In this post, I'll assume that you're using Visual Studio; see the bottom of this post for a link to the MSDN article about inline assembler in Visual C++.
Instead of the ISO asm declaration, Visual C++ supports the __asm keyword, which allows you to declare whole blocks of assembler code inside your C++ programs:

Code:

int main()
{
    __asm {
        ; Insert assembler code here
    }
}

Notice that everything inside this block follows the rules of assembler and not the rules of C++ anymore. In particular, there's no need to finish each line with a ';'. Instead, ';' introduces a one-line comment.

Now, how does this Assembler thing work?
Probably the most important things in assembler are the registers. Think of a register as a small block of memory on the CPU where you can store data. To a programmer, it's pretty much the same as a variable: it's just some place where you can put the data you'll be working with. However, a register doesn't have a type (like int or bool for C variables), it's all just ones and zeros. The registers we'll be working with are the general purpose registers EAX, EBX, ECX and EDX. Each of them is 32 bits wide and can thus store a full int value, which is 32 bits in VS.NET (use the sizeof operator to find out how many bytes the various data types in C require on your compiler). So, let's go and play with the registers:

Code:

#include <iostream>
using namespace std;

int main()
{
    unsigned int i = 0;
    __asm {
        mov eax, i      ; MOVe the value of i to the EAX register: eax = i
        add eax, 12     ; ADD 12 to EAX: eax = eax + 12
        mov i, eax      ; MOVe the result back to i: i = eax
    }
    cout << i;
}

You'll notice that the output of the program is 12. Be aware of the assembler notation used here, which puts the destination of the operation first and the source last. This can be a little tricky at first, but you'll get used to it soon. I'll give you one more program to get comfortable with the look and feel of assembler. After that, I'd suggest that you write some small assembler programs yourself, just to get familiar with it before you read on. You may want to take a look at the Instruction Set Reference from Intel (link at the bottom) to get an overview of the available instructions.

Code:

#include <iostream>
using namespace std;

int main()
{
    unsigned int i = 5;
    __asm {
        mov eax, i          ; MOVe i to EAX
    _LABEL1:                ; This is a jump label
        dec eax             ; DECrement EAX by 1
        jz _PROCEED         ; If the result of the last operation was zero, jump to _PROCEED, else go on with the next line
        jmp _LABEL1         ; JuMP to the line labeled _LABEL1
    _PROCEED:
        xor eax, 0xFFFFFFFF
        ; Notice the use of a hexadecimal number above
        mov i, eax
    }
    cout << "The biggest number one can possibly represent with 32 Bits is " << i;
}

Try to compile the program above and find out what it does. If you're not sure what one of the instructions is doing, look it up in the Intel manual. After you've figured out how it works, try playing around with it. Try to place some instructions between _LABEL1 and the jz _PROCEED and take a look at how it reacts. Try to change the unsigned int i to a signed int. What is different now, and why?
Take your time to play around with the instructions and try to build your very own small program; you'll need as much experience as you can get to understand the next chapter.
As soon as you feel comfortable with what you've seen so far, proceed with the next chapter.

The not-so-basics: SIMD in 5 minutes
First of all: Read the article on Ars Technica (link at the bottom); it is a great article on what SIMD is and how it works. However, for those of you who do not want to dig through 6 pages of highly technical geekspeak, here's a short explanation of what SIMD is all about.
When you are processing data in a usual C++ or assembler program, you're manipulating only one thing at a time. If you want to increase two variables by 1, you would write something like this:

Code:

int i = 12;
int j = 255;
i++;
j++;

Notice that you need two instructions, although you're performing the same operation (a simple increment) on both variables. This approach is also referred to as SISD (Single Instruction, Single Data): the CPU basically calculates only one value at a time. If you want to calculate two values, you need two statements and thus twice the time a single calculation would take.

Now imagine 3D graphics. A world full of vectors and matrices, all multidimensional data that needs to be processed as quickly as possible. If you want to add two 3D-vectors in SISD, you would do something like this:

Code:

vec1.x = vec1.x + vec2.x;
vec1.y = vec1.y + vec2.y;
vec1.z = vec1.z + vec2.z;

Three almost identical lines of code. Now imagine, just for one second: What if you could write something like this:

Code:

vec1.AddVector(vec2);

Now, you could of course write a function AddVector that would execute the exact same three lines mentioned above, and it wouldn't help anybody. But what if your CPU had a built-in AddVector that would automatically add all three coordinates at the very same time? In fact, that's what SIMD (Single Instruction, Multiple Data) is all about.

Advancing: SIMD and Assembler
Believe it or not, but since the introduction of Intel's MMX, SIMD is supported by most CPUs out there. While MMX was just for integer values, the introduction of the Streaming SIMD Extensions (SSE for short) brought SIMD to floats and made it a very powerful tool for fast multimedia programming. With SSE you're able to process four 32-bit floats at the same time, which is especially helpful when working with 3D vectors (consisting of x, y, z and w coordinates), 3D matrices (basically a 4x4 array), high color graphics (R, G, B and A... you should get the point), etc.

Now, how can we get our hands on it? Basically, the trickiest hurdle for a programmer is the compiler. VS 6 supports SSE with the latest service packs installed (unless you're using the standard edition; sorry guys, no SSE support there), and VS.NET should compile SSE out of the box. For any other compiler: Check your compiler's manual! There are ways to get SSE running on almost every compiler, you just need to figure out how.
But there's another obstacle: Not all CPUs support SSE! Actually, any processor after the Pentium 3/Athlon 4 should support the basic instruction set. To be absolutely sure, you should use the CPUID instruction that is implemented on all Intel compatible CPUs since the Pentium 1 (in a real world app, you would most likely write two implementations of your code, one in pure C for non-SSE CPUs and another one in SSE assembler, and then do a CPUID on each startup to decide whether to call the C or the SSE function):

Code:

#include <iostream>
using namespace std;

int main()
{
    unsigned int cpeinfo;
    unsigned int cpsse3;
    __asm {
        mov eax, 01h        ; 01h is the parameter for the CPUID instruction below
        cpuid
        mov cpeinfo, edx    ; Get the info stored by CPUID
        mov cpsse3, ecx     ; Info about SSE3 is stored in ECX
    }
    cout << "1 - Instruction set is supported by CPU\n";
    cout << "0 - Instruction set not supported\n";
    cout << "--------------------------------\n";
    cout << "MMX:  " << ((cpeinfo >> 23) & 0x1) << "\tSSE:  " << ((cpeinfo >> 25) & 0x1) << "\tSSE2: " << ((cpeinfo >> 26) & 0x1) << "\n";
    cout << "SSE3: " << (cpsse3 & 0x1);
}
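
The bit tests in the listing above, together with the idea of picking a C or an SSE code path at startup, can be sketched in portable C++. This is only a minimal sketch: the names CpuFeatures, decodeFeatures, CodePath and choosePath are made up for illustration, and the function expects the EDX and ECX values that CPUID returned for EAX = 1 to be passed in by the caller.

```cpp
#include <cstdint>

// Feature flags decoded from the register values that CPUID returns for EAX = 1.
// (Hypothetical helper type, not part of any library.)
struct CpuFeatures {
    bool mmx;
    bool sse;
    bool sse2;
    bool sse3;
};

// Returns true if bit number `bit` (0 = least significant) is set in `reg`.
inline bool bitSet(std::uint32_t reg, unsigned bit) {
    return ((reg >> bit) & 1u) != 0;
}

// edx and ecx are the values CPUID left in EDX and ECX.
// Bit positions are the ones used in the listing above.
inline CpuFeatures decodeFeatures(std::uint32_t edx, std::uint32_t ecx) {
    CpuFeatures f;
    f.mmx  = bitSet(edx, 23);
    f.sse  = bitSet(edx, 25);
    f.sse2 = bitSet(edx, 26);
    f.sse3 = bitSet(ecx, 0);
    return f;
}

// Which implementation a program would call after the startup check.
enum class CodePath { PlainCpp, Sse };

inline CodePath choosePath(const CpuFeatures& f) {
    return f.sse ? CodePath::Sse : CodePath::PlainCpp;
}
```

In a real program you would call decodeFeatures(cpeinfo, cpsse3) once at startup and store the chosen path, for example in a function pointer.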

For more information on how CPUID works, check out Intel's manual. To execute the sample programs in this tutorial, you only need SSE1 (SSE2 is needed if you want to work with double precision (64-bit) floats). Now, to make sure your compiler understands SSE, try to compile the following:

Code:

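#include <iostream>
using namespace std;

struct cVector
{
    float x, y, z;
};

int main()
{
    cVector vec1;
    vec1.x = 0.5f;
    vec1.y = 1.5f;
    vec1.z = -3.141f;
    __asm {
        ; Careful: MOVUPS always moves 16 bytes, 4 more than the three floats in vec1
        movups xmm1, vec1
        mulps xmm1, xmm1
        movups vec1, xmm1
    }
    cout << vec1.x << " " << vec1.y << " " << vec1.z << '\n';
}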

If it worked correctly you should read something like this: 0.25 2.25 9.86588
Now, what the hell just happened there?
Obviously, the program squared the vec1.x, vec1.y and vec1.z. How did it do that?

First of all, let's talk about registers again. You remember the 32-bit EAX from above? In SSE you have something similar, only that your registers are now 128 bits wide and do not necessarily contain only one value. The special SSE registers are called XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 (some trivia between the lines: these 8 registers are exactly sufficient for a very elegant multiplication of 4x4 matrices). Each of them can hold up to four single precision (32-bit) floats at a time, where the first float occupies bits 0-31 of the register, the second bits 32-63, the third bits 64-95 and the last bits 96-127. This is what we call a 'Packed Single'.
With MOVUPS (Move Unaligned Packed Single) we are copying our whole vector structure into the XMM1 register (we could've used any of the other XMM0-XMM7 equally). Keep in mind that MOVUPS always moves 16 bytes, so with our three-float structure it also touches the 4 bytes that follow vec1 in memory. We use the unaligned move because we do not know whether vec1's address is aligned to a 16-byte boundary in memory. If you do know for sure that your data is aligned, use MOVAPS, it's a lot faster.
In the next line, we're using the MULPS instruction (Multiply Packed Single), and this is where all of the work is done. It basically multiplies each of the four floats by itself and writes the result back to XMM1. Then we copy the whole structure back to vec1.
Make sure you understand this example, because once you do, you know pretty much everything there is to know about SSE. There are a lot of tricks when working with such parallelism, but for the very basics of it, that's just it.
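
If the register layout is hard to picture, a plain C++ model of the packed operations may help. The sketch below is not SSE code: it only mimics, lane by lane, what MULPS and ADDPS do to the four floats of a register. The type name Xmm and the function names are chosen for illustration.

```cpp
#include <array>
#include <cstddef>

// One XMM register modelled as four 32-bit float lanes.
// Lane 0 corresponds to bits 0-31, lane 3 to bits 96-127.
typedef std::array<float, 4> Xmm;

// Model of "mulps dst, src": every lane of dst is multiplied by the same lane of src.
inline Xmm mulps(Xmm dst, const Xmm& src) {
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] *= src[i];
    }
    return dst;
}

// Model of "addps dst, src": every lane of dst gets the same lane of src added.
inline Xmm addps(Xmm dst, const Xmm& src) {
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] += src[i];
    }
    return dst;
}
```

The real instructions handle all four lanes in a single step; the loops here exist only to make the pairing of lanes visible.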

Try to play around a bit with the program, here are some suggestions: Add a fourth value w to the vec1 structure and see what happens. Try to multiply two different vectors with each other. Try to add two vectors using the ADDPS command.
And some things you should try out if you're new to assembler: You're not in your high-level C++ world any more, and the rules down here are somewhat different. Try changing the type of the vector's components from float to double. You won't get any compiler error! What happens and why? What is the danger behind this new freedom? Also try to leave one of the floats uninitialized. Why is there still a valid number in it after execution?

Now, you have some stuff to play around with. If you're interested, take a look at Intel's manual for more SSE instructions (basically all that involve the XMM registers).

I'm taking a little break now and will be posting the last part of this tut later this evening...
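
One detail from the MOVUPS/MOVAPS discussion that is easy to check from C++ is the alignment itself. The following is a small sketch assuming a C++11 compiler (alignas did not exist in the compilers mentioned in this HowTo). AlignedVector and isAligned16 are illustrative names.

```cpp
#include <cstdint>

// A four-float vector whose address is guaranteed to be a multiple of 16,
// which is the alignment MOVAPS requires for its memory operand.
struct alignas(16) AlignedVector {
    float x, y, z, w;
};

// Returns true if p points to an address that is a multiple of 16.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A structure declared like this can be moved with MOVAPS. For data whose alignment you cannot control, MOVUPS remains the safe choice.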

Coming up next... Vector normalizing!

Literature:

(1)
http://msdn.microsoft.com/library/en...29_.topics.asp
Microsoft's documentation on inline assembler.

(2) http://www.intel.com/design/pentium4.../index_new.htm
The official manuals for the Pentium 4 processor. For assembler programming, the Instruction Set Reference (Software Developer's Manual Volumes 2A & 2B) is very useful. Chapter 12 of the System Programming Guide (Software Developer's Manual Volume 3) treats more advanced problems when using SSE instructions.

(3)
http:///articles/paedia/cpu/simd.ars
This is a great article on SIMD in general on Ars Technica. Definitely worth reading!

04-18-2005, 02:04 PM

halma

Programmer

Join Date: Jan 2004

Location: South Pole

Posts: 1,538

Awesome!

Thanks for all the information. I never knew the CPU could do all that, and using SSE and assembler is a lot easier than I'd thought!

I'm going to change the thread name just a bit, and make it a sticky; I hope you don't mind.

04-18-2005, 02:24 PM

KhaoticMind

God wannabe

Join Date: Oct 2003

Location: My little corner of the world

Posts: 1,486

truly awesome stuff!
I had no idea that MMX and SSE were supposed to do that as well

halma: I don't think that assembly is "that" easy. For starters, the way you write the programs is totally different from what you are used to, not to mention that you have limits on the number of registers, their size and all that, and that you have to know the available commands.

Hope to have some time in the future to dig into it a bit more

__________________
"Things are like they are because thats how they are suposed to be"

04-18-2005, 02:27 PM

benwatt

Shiny Happy Person

Join Date: Apr 2002

Location: Aberdeenshire, Scotland

Posts: 2,382

Good explanation - I wish I'd found such a good description of SIMD when I started coding a bit in this area back in the day. I know this thread will be of use to quite a few people out there. Keep it up

04-18-2005, 02:33 PM

BlackHwk4

Registered User

Join Date: Jul 2003

Location: California

Posts: 99

Very cool. Thanks for the tutorial. If anyone is interested in learning assembly the following links might help get you started.

http://www.csn./~darkstar/assembler/
http://www./zone5/cat792/
http://www./~smit/asm01001.htm

04-18-2005, 03:26 PM

ComicSansMS

Most Horrible Font Ever

Join Date: Jun 2003

Location: Trier, Germany

Posts: 687

So I assume you played around with the basic SSE instructions from the last post and are now familiar with the very basics of Intel's SSE instruction set. In this second part of the HowTo we are going to solve the problem of normalizing a vector as fast as we can. This problem is ideal for learning SSE, and if you read carefully, you should be able to implement other, more complicated operations in SSE as well. However, as in the last post, I will focus only on the implementation of the algorithm itself for didactic reasons.

Getting Started...
So first of all, we'll try to implement an algorithm that solves our problem in C++. This is always a good starting point, because it helps us get familiar with the problem and also serves as a reference when we implement the SSE version. And as mentioned earlier, it is always good to have a non-SSE version of your programs at hand, in case your program has to run on an older machine.

Basically, what we want to do is normalize a vector. That means you want to calculate the coordinates of a vector that points in the exact same direction as the given vector, but has a length of exactly 1. To achieve this, we'll have to divide every coordinate of the given vector by the length of the given vector. The length is calculated as the square root of the sum of the squared coordinates. For a 3D vector that means:

Code:

v3 = (x, y, z)
|v3| = sqrt(x^2 + y^2 + z^2)
v3_normalized = v3 / |v3|

In C++ this would look like the following:

Code:

#include <cmath>
#include <iostream>
using namespace std;

struct cVector
{
    float x, y, z;
};

int main()
{
    cVector vec1;
    vec1.x = 0.5f;
    vec1.y = 1.5f;
    vec1.z = -3.141f;
    // First calculate the length:
    float len = sqrt((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z));
    // Now divide each coordinate by the length:
    vec1.x /= len;
    vec1.y /= len;
    vec1.z /= len;
    cout << vec1.x << " " << vec1.y << " " << vec1.z << '\n';
}

The lines that we'll focus on are the ones for the length calculation and for the coordinate division. So what do we have here? There are 3 muls and 2 adds in the length calculation, plus a really slow sqrt(), which we'll handle in detail later on. Then there are 3 divs for normalizing the vector.
As you can imagine, it'll be easy to wrap up the 3 muls and the 3 divs into one SIMD instruction each.
The one thing that gets us into trouble is the addition. To handle this problem, we need to introduce another powerful concept of SSE: The shuffle!

Mixing registers: Shuffling
This is probably the hardest instruction for an SSE beginner to learn, so listen carefully:
The main problem with our register layout at the moment is that we can only combine 'equal' coordinates with each other. That means we can add the x-coord of one vector to the x-coord of another, but we cannot add the x-coord to the y-coord or the z-coord. Unfortunately, that is exactly what we need here: we want to add up the three coordinates of a single vector, so how are we going to do this?
The trivial method would be to reload the vector data from memory but store it in a different order, like maybe store the x coordinate in the second block of the register and the y value in the first, or something like that. This would not be a good idea, not at all! Be aware that fetching the data from RAM into the SSE registers is even slower than fetching it into the general purpose registers of the CPU, so a double fetch from RAM could result in an algorithm that is even slower than one that isn't using SSE at all! However, there is a way of swapping the different elements of an SSE register, although it's not that easy...

The instruction we're talking about is SHUFPS (Shuffle Packed Single). It expects two SSE registers and a one-byte value (usually written in hex) as operands. The first two elements (bits 0-63) of the destination register (if you recall, that'd be the first operand) will be overwritten by any two elements of the destination register, while the last two elements (bits 64-127) of the destination register will be overwritten by any two elements of the source register. Which elements are actually copied is specified by the one-byte value. Before we proceed, here's an example of what a SHUFPS looks like:

Code:

shufps xmm0, xmm1, 0x4e

Obviously, we want to shuffle to XMM0. That means all elements contained in XMM0 might be overwritten, while XMM1 is only read but not changed. Now for the third parameter. First of all, let's decode the hex value 4E back to binary (I assume you already know how to do this, because if you don't, you probably shouldn't play around in assembler):

Code:

[4E]_16 = [0100 1110]_2

Notice that you can split up the 8-bit value into four 2-bit values: 01, 00, 11, 10
Do you see something? We now have four values, each of which can address one of four elements. This damn hex value is actually telling us which elements to copy! However, there is one thing you'll have to keep in mind: the lowest two bits belong to the first element, so you'll have to read the four values from right to left (just keep this in mind when you're using this instruction).
Now here's what our shuffle does:

Code:

shufps xmm0, xmm1, 0x4e:
    1st element of XMM0 will be set to element 10 (the third element) of XMM0
    2nd element of XMM0 will be set to element 11 (the fourth element) of XMM0
    3rd element of XMM0 will be set to element 00 (the first element) of XMM1
    4th element of XMM0 will be set to element 01 (the second element) of XMM1

You didn't understand at all? Don't worry, here is some code for you:

Code:

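#include <iostream>
using namespace std;

// cVector needs a fourth member w from here on
struct cVector
{
    float x, y, z, w;
};

int main()
{
    cVector vec1;
    vec1.x = 0.5f;
    vec1.y = 1.5f;
    vec1.z = -3.141f;
    vec1.w = 2;
    __asm {
        movups xmm0, vec1
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        shufps xmm0, xmm1, 0x4e
        movups vec1, xmm0
    }
    cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << '\n';
}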

First of all, we store vec1 in XMM0 and the element-wise squares of vec1 in XMM1. Then we do the exact same shuffle as described above. Try to guess which values will be printed to the console before you execute it for the first time. If you were wrong, don't worry, this is the only thing about SSE that is really complicated. Before you proceed I'd strongly recommend that you play around with the shuffling until you feel that you have really understood it. This concept is simply too important to be skipped and gives you the ability to create very powerful algorithms. Try to experiment with the different parameters and hex values and try to visualize in your mind how your data is stored and shuffled in the registers. As soon as you feel that you're comfy with SHUFPS you may proceed...
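
For checking your guesses while experimenting, a small software model of the shuffle can be handy. The sketch below is plain C++, not SSE: it follows the selection rules described above (lowest two bits first, lower two result elements from the destination, upper two from the source). The names Xmm and shufps are chosen for illustration only.

```cpp
#include <array>
#include <cassert>

// One XMM register modelled as four float lanes, lane 0 = bits 0-31.
typedef std::array<float, 4> Xmm;

// Model of "shufps dst, src, imm": returns the new content of dst.
inline Xmm shufps(const Xmm& dst, const Xmm& src, unsigned imm) {
    assert(imm <= 0xFFu);           // the selector is a single byte
    Xmm r;
    r[0] = dst[(imm >> 0) & 3u];    // bits 0-1 pick an element of dst
    r[1] = dst[(imm >> 2) & 3u];    // bits 2-3 pick an element of dst
    r[2] = src[(imm >> 4) & 3u];    // bits 4-5 pick an element of src
    r[3] = src[(imm >> 6) & 3u];    // bits 6-7 pick an element of src
    return r;
}
```

For example, with dst = (1, 2, 3, 4) and src = (10, 20, 30, 40), the selector 0x4E yields (3, 4, 10, 20).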

Applied Shuffling
Now we are ready to rebuild our normalizing function using what we've learned so far.
If you want, try to implement this by yourself before you read on. You should have all the knowledge to write a basic SSE version of the program (DIVPS divides, MULPS multiplies), and it will be a great exercise to learn how all of the different instructions fit together.

Code:

#include <cmath>
#include <iostream>
using namespace std;

struct cVector
{
    float x, y, z, w;
};

int main()
{
    cVector vec1;
    vec1.x = 0.5f;
    vec1.y = 1.5f;
    vec1.z = -3.141f;
    vec1.w = 0;
    cVector vec2;
    __asm {
        movups xmm0, vec1
        mulps xmm0, xmm0            ; Calculate squares
        movaps xmm1, xmm0
        shufps xmm0, xmm1, 0x4e     ; Shuffle #1
        addps xmm0, xmm1            ; Add #1
        movaps xmm1, xmm0
        shufps xmm1, xmm1, 0x11     ; Shuffle #2
        addps xmm0, xmm1            ; Add #2
        movups vec2, xmm0
    }
    float len = sqrt(vec2.x);
    vec1.x /= len;
    vec1.y /= len;
    vec1.z /= len;
    cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << '\n';
}

Notice that you now need a fourth coordinate in the vector to make the algorithm work! Why does it need to be zero?
Be sure you understand how the two shuffle/adds work! Why do we need 2 shuffles? Is it wise to have a register where all four elements contain the same value?
This algorithm returns a vector structure in vec2 where all four floats are set to |vec1|^2. This algorithm is pretty fast, but there's still the nasty sqrt call when calculating the length. In the next chapter we'll finally get rid of this...
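
The same shuffle/add sequence can also be written with the compiler intrinsics from xmmintrin.h instead of an __asm block, which makes it usable on compilers without Visual C++ style inline assembler. This is a sketch of that alternative, not the method used in this HowTo; the function name sumOfSquaresAllLanes and the HOWTO_HAVE_SSE macro are made up, and a plain C++ fallback is included for builds without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HOWTO_HAVE_SSE 1
#endif

// Reads four floats (x, y, z, w) from v and writes x*x + y*y + z*z + w*w
// into all four floats of out.
inline void sumOfSquaresAllLanes(const float* v, float* out) {
#ifdef HOWTO_HAVE_SSE
    __m128 sq = _mm_loadu_ps(v);                // movups
    sq = _mm_mul_ps(sq, sq);                    // mulps: (x2, y2, z2, w2)
    __m128 a = _mm_shuffle_ps(sq, sq, 0x4E);    // shuffle #1: (z2, w2, x2, y2)
    a = _mm_add_ps(a, sq);                      // add #1: (x2+z2, y2+w2, x2+z2, y2+w2)
    __m128 b = _mm_shuffle_ps(a, a, 0x11);      // shuffle #2: (y2+w2, x2+z2, y2+w2, x2+z2)
    a = _mm_add_ps(a, b);                       // add #2: every lane holds the full sum
    _mm_storeu_ps(out, a);                      // movups
#else
    // Fallback without SSE: compute the sum once and copy it to all lanes.
    const float sum = v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + v[3] * v[3];
    for (int i = 0; i < 4; ++i) {
        out[i] = sum;
    }
#endif
}
```

The comments trace the register content after each step, with x2 standing for x squared and so on.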

The thing with the Square-roots
You will find a chapter about this in almost any book on computer graphics algorithms. There are numerous ways to perform a fast square root calculation. The sqrt() from math.h which we used so far computes the root to full precision. x86 CPUs do have square root instructions for this (FSQRT on the FPU, SQRTSS and SQRTPS in SSE), but a full-precision root is among the slowest arithmetic operations the CPU offers. Even worse, many C++ compilers offer only one implementation of sqrt() for double precision, so you're calculating with 64-bit precision although you can only store 32 bits when working with floats... A disaster in terms of performance!

The main idea behind all of the so-called fast sqrt algorithms is basically to cheat your way around the whole operation. First of all, you're not working at full precision: fast sqrt always means sacrificing precision for performance. While this would be irresponsible in scientific applications, it is very common in computer graphics, where the precision of the fast algorithms is still enough. And last but not least, you're using one dirty little trick in these fast sqrts: Lookup tables! That means you're actually calculating some typical values using the sqrt() from math.h while loading the program, and then calculate any square root you need from the values in this table at runtime. Now, building a lookup table isn't that easy and involves quite some math... But fortunately there is a lookup table for fast square root calculation implemented in SSE, so we'll just have to use it.

The instruction we're looking for is RSQRTPS (Reciprocal Square Root of Packed Single). It takes two registers as operands, computes an approximation of the reciprocal square root (that is, 1/sqrt(x)) of each element of the source register using the CPU's built-in lookup table, and saves the results to the destination register.
All you'd have to do now is take the reciprocal of this result and you have the sqrt. But since we are going to divide by the length anyway, we don't even need to do that: we simply multiply by the reciprocal instead. Isn't that nice?

If you want to know more about this instruction, look it up in Intel's manual. If you do so, take a look at the special return values for invalid inputs (e.g. if you try to do a sqrt(-1)).

Putting it all together: Our final Normalize-function:

Code:

#include <iostream>
using namespace std;

struct cVector
{
    float x, y, z, w;
};

int main()
{
    cVector vec1;
    vec1.x = 0.5f;
    vec1.y = 1.5f;
    vec1.z = -3.141f;
    vec1.w = 0;
    __asm {
        movups xmm0, vec1
        movaps xmm2, xmm0
        mulps xmm0, xmm0
        movaps xmm1, xmm0
        shufps xmm0, xmm1, 0x4e
        addps xmm0, xmm1
        movaps xmm1, xmm0
        shufps xmm1, xmm1, 0x11
        addps xmm0, xmm1
        rsqrtps xmm0, xmm0
        mulps xmm2, xmm0
        movups vec1, xmm2
    }
    cout << vec1.x << " " << vec1.y << " " << vec1.z << " " << vec1.w << '\n';
}

Now this is it. There are no comments in there, but I hope you can still understand what's going on.
Try to compile the program above and compare the output to the pure C++ version. Do you notice the differences? That's what I meant when I said sacrificing accuracy...
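
To put a number on that difference, the comparison can be sketched with intrinsics from xmmintrin.h, which map to the same instructions as the assembler listing above. This is an illustrative sketch, not the code of this HowTo: Vec4, normalizeExact, normalizeFast, maxAbsDifference and HOWTO_HAVE_SSE are made-up names, and without SSE the fast version simply falls back to the exact one.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HOWTO_HAVE_SSE 1
#endif

typedef std::array<float, 4> Vec4;  // x, y, z, w (w is expected to be 0)

// Reference version: full-precision sqrt and divisions.
inline Vec4 normalizeExact(Vec4 v) {
    const float len = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + v[3] * v[3]);
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] /= len;
    }
    return v;
}

// Fast version: same steps as the assembler listing, using RSQRTPS.
inline Vec4 normalizeFast(Vec4 v) {
#ifdef HOWTO_HAVE_SSE
    const __m128 vec = _mm_loadu_ps(v.data());                  // movups
    const __m128 sq = _mm_mul_ps(vec, vec);                     // squares
    __m128 sum = _mm_add_ps(_mm_shuffle_ps(sq, sq, 0x4E), sq);  // shuffle/add #1
    sum = _mm_add_ps(sum, _mm_shuffle_ps(sum, sum, 0x11));      // shuffle/add #2
    const __m128 inv = _mm_rsqrt_ps(sum);                       // approx. 1/length
    _mm_storeu_ps(v.data(), _mm_mul_ps(vec, inv));              // scale and store
    return v;
#else
    return normalizeExact(v);  // no SSE available in this build
#endif
}

// Largest absolute difference between corresponding elements of a and b.
inline float maxAbsDifference(const Vec4& a, const Vec4& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Calling maxAbsDifference(normalizeExact(v), normalizeFast(v)) for a few vectors shows how much accuracy the approximation costs on your machine.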

Now, this post has grown really huge and in fact I covered a lot more than I originally intended, but I hope you enjoyed this tut and that you were able to follow it properly. I'd love to hear your feedback on the forums.

Before I leave you with this code to play, here are some things that you could try after finishing the tutorial to improve your skills:
First of all: Build the above into a function. If you know how the stack pointer works, you may do some cool tweaks there too (if not, search the Intel manual for info on how the ESP register works).
Second: Read about the RSQRTSS instruction in the Intel manual. It is pretty much the same as RSQRTPS, except that it calculates the reciprocal square root of only one value rather than four. Maybe you'll find an efficient way to replace the above code with one that uses RSQRTSS. Try running some benchmarks to find out which of the two implementations is faster.
Third: Explore the other SSE instructions and the chapter in Volume 3 of the Intel manual. You can do some really crazy stuff with SSE if you know how to use it right. If you know how caching works, you shouldn't miss the PREFETCHxx instruction family, which will improve your programs even further.

Last but not least: Just play around! Try to use what you've learned today to implement other algorithms. Start simple, maybe with elementary vector maths, and then slowly move to more complex subjects like colors or matrix math. SSE can be a lot of fun and is relatively simple to handle within C. Try to make benchmarks to compare the pure C++ implementation to your SSE code; you'd be surprised what you can achieve by making the right optimizations.

And of course, if you've any questions about this, feel free to come back and ask, because that is what 3D Buzz is here for.

Again, I really hope you enjoyed this article and learned something from it. Thanks for your patience while reading it and see you on the forums,

ComicSansMS

post scriptum: thanks to all of you!
@halma: you're absolutely right about the title, it's really better the way it is now...

Last edited by ComicSansMS; 04-18-2005 at 03:43 PM.