Archive for the ‘Graphics’ Category

Seam Carving

Monday, August 27th, 2007

carved_pagoda.jpg I recently watched this interesting video by Shai Avidan and Ariel Shamir from MERL. They developed an extremely cool image resizing technique called seam carving. They explain all about it in this article.

One of the coolest things about their idea is that is it easy to implement. I implemented a semi-optimized version of their algorithm in a couple of hours of coding. All the images in this post were generated by my code.

The following image of a pagoda will assist me in demonstrating the ideas to follow:

pagoda.jpg

First lets discuss the two image resizing methods currently used, namely scaling and cropping. Scaling consists of uniformly resizing the image. If for example we were to reduce an image 100 pixels wide to an image 50 pixels wide, we could simply average every two neighbouring pixels (a generalization of this technique can be applied even when the image’s new width does not divide the image’s original width, and of course to resizing both the width and the height of the image).

Scaling is efficient and, in some sense, you do not lose a lot of information from the original image. There are two big problems with scaling however:

  • If the new dimensions of the image are not proportionate to the original dimensions, the scaled image is distorted.
  • The method does not take into account the contents of the image at all (I will clarify this point shortly).

An example of the pagoda image scaled:

scaled_pagoda.jpg

A second widely used form of image resizing is cropping. Cropping consists of simply selecting a sub image of the desired proportions from the original image – i.e. removing the borders of the image. An important question now arises: how do we select which box of the image to keep (or equivalently, how do we select which borders to remove)?

Lets say we assign a weight to each of the pixels of the image, representing its importance. Then selecting the cropping box amounts to simply selecting the box containing the pixels whose sum of weights is highest. We will refer to the weights of the pixels as the energy of the pixels. So optimal cropping would select the box with the highest energy.

Now, all we are left to do is assign the weights to the pixels. Following Avidan and Shamir, we can use simple gradient magnitude as our energy function. Gradient magnitude simply means calculating the gradient at each pixel (regarding the image as a function from R2 to R) and taking its magnitude at each pixel. Here is the result of applying gradient magnitude on the pagoda image:

grad_mag.JPG

The perceptive reader would have noticed that a color image is usually a function from R2 to R3 and not to R (as there are usually three independent color channels). There are several ways to calculate the gradient magnitude in this case. The one I used in my code is to calculate the magnitudes of the three gradients separately and average the results. Note that averaging before calculating the gradients is a bad idea as a lot of information is lost in the process!

Note that the gradient magnitude of a pixel constitutes some measure of how likely that pixel is to be a border pixel (and thus more important).

Using all of this, an optimal cropping of the pagoda image is shown here:

croped_pagoda.jpg

The cropping is optimal in the sense that this is the sub-image of the given dimensions containing the most energy.

Note that all the water from the bottom of the image, and the flowers from the top are missing, as well as the left side of the white house on the left of the pagoda.

Now, unlike the scaling method, the cropping technique just presented takes into account the contents of the image. But what if the image contained two pagodas, one at each side of the image? Cropping would only be able to select one of them.

So what we are looking for is some way to automatically remove the uninteresting parts in the middle of the image, and bring the interesting bits closer together.

A naive solution, similar to regular cropping, would be to remove the columns of the image with the least amount of energy not necessarily from the border. This technique causes severe artifacts.

Avidan’s and Shamir’s seam carving technique is an improvement of this. They define a vertical seam as an 8-connected path from top-to-bottom of the image containing one pixel in each row. 8-connectedness means that if pixel (x,y) is in the vertical seam, then exactly one of the pixels (x+1, y+1), (x, y+1) or (x-1, y) is in the vertical seam as well (unless of course y = the height of the image). A horizontal seam is defined similarly.

Now instead of deleting the column (row) with the least energy, they suggest deleting the vertical (horizontal) seam with the least energy. The seam with the least energy can be easily found using dynamic programming.

The first seam to be removed in the pagoda image is depicted below:

seam.JPG

The process of reducing an image’s size via repeated deletions of the least important seams is called seam carving. The result of applying some seam carving to the pagoda image is shown here:

carved_pagoda.jpg

Note that although some artifacts are present in the image, it is the best among all the reduced images (at least – a mon avis).

A problem with seam carving, compared to scaling, is efficiency. I have some algorithmic ideas that I think might substantially reduce computation costs. I will let you know when I will have time to test them.

Another example is shown. The original image:

pagodas_small.jpg

Optimal cropping:

pagodas_small_cropped.jpg

Scaling:

pagodas_small_scaled.jpg

And finally, seam carving:

pagodas_small_carved.jpg

And just to convince you that the method doesn’t only work on pagodas:

hopper_landscape.jpg

And the carved version:

hopper_landscape_carved.jpg

SIMD (Fire)Works!

Sunday, August 12th, 2007

When I started yaniv.leviathanonline.com, I declared it to be a site about “Mathematics and Hacking”. As of now, most of its content is Mathematical. I thought it was time for a little change…

Those of you with no interest in assembly or low-level programming are welcome to skip this post entirely, although expanding your horizons is never a bad idea!

A few years ago I gave a week-long x86 assembly course for some fellow programmers. As I wanted them to learn how to use the Pentium’s SIMD extensions, I asked them to program a simple fireworks demo and try to apply some filters to it using MMX. In this article I will detail my solution to the SIMD fireworks exercise.

The following paragraph is intended for those of you without assembly knowledge. All of you assembly masters are welcome to skip to the fun stuff below!

Some References 

Well, for starters I recommend reading the ”The Art of Assembly Language Programming” (AoA) which covers a lot of the basics. And yes – Assembly Programming isan Art (maybe more so than programming in other languages). Next, you should also read Intel’s manuals: Volume 1: Basic Architecture, Volume 2A: Instruction Set Reference, A-M, Volume 2B: Instruction Set Reference, N-Z, Volume 3A: System Programming Guide, Volume 3B: System Programming Guide and Architectures Optimization Reference Manual. Yes, I know its a lot of reading but they are really interesting and you don’t have to read the whole thing in order (although I recommend it – so many interesting facts lye in those pages). Finally, if you are going to develop in assembly (e.g. assemble the source included in this article) I recommend using NASM, as it is the simplest and cleanest assembler (although some people prefer MASM, which is kind of similar, but a bit more clumsy). If you are going to do some 16-bit DOS programming, you will find Ralf Brown’s Interrupt List (RBIL) a useful tool.

Throughout this article I will assume basic assembly knowledge (the first chapters of AoA will more than cover everything).

The Buzzwords

SIMD stands for Single Instruction Multiple Data.  It is clear that most algorithms require repeating the same operation over and over again (each time on different sets of data). From this observation stemmed the idea of a vector computer or SIMD. We will shortly see that indeed our fireworks program will repeat the same operations many times and so we will be using the Pentium’s SIMD capabilities for that purpose.

MMX was Intel’s first attempt at bringing SIMD capabilities to the Pentium. MMX does not stand for anything, but it has been offered several meanings since its appearance (such as Multimedia Extensions). MMX adds 8 new registers (mm0,…,mm7) to the CPU. These are 64-bit registers. The main advantage of these registers (compared to the x86 general purpose registers) is that the MMX registers can function as vector registers. This means that each of these registers can represent one 64-bit integer value, two 32-bit integer values, four 16-bit integer values or eight 8-bit integer values. For example, the command:

psubusb mm0, mm1

is essentially equivalent to 8 byte subtractions (it stands for Packed SUBtract Unsigned Saturated Byte).

Note that in order to be backward compatible with existing OS’s (mainly Windows) Intel aliased the MMX registers to the FPU registers. This means that when accessing one of them the contents of the other one are altered as well. This allowed the OS’s to keep using their old context switching mechanisms.

SSE stands for Streaming SIMD Extensions. This was Intel’s second attempt at bringing SIMD to the PC (in response to AMD’s 3DNow!). SSE added many instructions to the instruction set as well as 16 128-bit vector registers (xmm0,…,xmm15). Unlike the 8 MMX registers, the SSE registers can contain four 32-bit floating point values. As these registers are not aliased to any existing registers (and so require special care by the OS) they are disabled by default and are only turned on when the OS explicitly commands it.

Mode 13h

The easiest to program DOS graphics mode is probably mode 13h, which we shall therefore use. In mode 13h, the screen is represented as a 320*200 byte buffer (located in address 0xA000). Each byte in the buffer is an index to a 256 bytes long palette. When mode 13h is enabled, simply writing bytes memory in the range 0xA000-0x19A00 (0x19A0 = 0xA000 + 0xFA00, 0xFA00 = 320*200) will draw pixels on the screen. You should not draw directly to this buffer though, as when the video card is drawing the screen, it scans through this buffer, thereby stalling your write commands. What you should do, is allocate an off-screen buffer and render to this buffer and once in a frame spill the contents of this buffer as quickly as possible to the memory block starting at 0xA000. We will get to this later.

Entering mode 13h is very easy, and is done by calling BIOS function 0×10, as such:

mov ax, 0x13
int 0x10

Calling BIOS functions is done through raising an interrupt. The parameters are passed in the registers (for more information see RBIL).

The default video mode for DOS is mode 3h. It is the mode of choice for displaying text. So upon program exit we should restore the video mode to mode 3h, as such:

mov ax, 3
int 0x10

The Palette 

Now, in order to render properly colored fireworks, we need to have a decent color palette. In mode 13h the palette is set by outputting the palette index number to port 0x3c8 and then outputting 3 color values (for the red, green and blue components) to port 0x3c9. Note that if the entire palette is to be set in one go, only index 0 needs to be outputted to port 0x3c8, as it is incremented automatically after three writes to port 0x3c9. Each of the 3 color components is represented by a 7 bit value, it is a number in the range 0-63, where a value of 0 means that the component is not present at all in the color, and a value of 63 means that the component is present with full opacity in the color. So the color pure green will be represented by the tuple (0,63,0). The palette we shall use will consist of a simple linear interpolation from black to red to white. Index 0 will be total black, index 128 will be total red and index 255 will be total white.

A simple program that sets such a palette and then draws it to the screen can be found here (its source is here) Our fireworks program will use most of this code. Its output is:

palette.JPG

The Basic Fireworks Program

Enough with the introductions – lets get to business!

The design for the fireworks program is very simple. We will manage a buffer of particles, each having the following qualities: x, y, x_speed, y_speed and color (although this version uses a constant white color for all particles, which looks better). Each particle’s position will be updated according to its speed, which will in turn be updated according to gravitation.

The full source of the program can be found here. The program itself can be found here. It begins with some defines (you can play with the number of particles by changing the NUM_PARTS define). I then allocate a buffer big enough to contain a copy of the entire screen (we will draw to this buffer, so as to avoid the stalling inherent in writing to the video memory itself). We access the buffer though the es register.

We then enter mode 13h, set the palette and zero the buffer.

We then use a little very useful trick. We call the RDTSC instruction, which reads the CPU’s 64-bit time-stamp register (this time-stamp register is zeroed upon CPU initialization – i.e. when you turn on your computer – and is incremented after every instruction! What do you say, is there a chance of overflow?) into registers EDX:EAX. As NASM does not support the RDTSC instruction I coded it as:

db 0x0f, 0x31

This trick is very useful, as when writing C++ programs for the PC, it can be used to generate very accurate time-stamps very efficiently. The following code will do the trick in Visual Studio:


int RDTSC()
{
     __asm _emit 0x0f;
     __asm _emit 0x31;
}

Lets get back to the fireworks program. We use this trick to generate a random seed for the random number generator (RNG). This RNG will later be used to initialize the particles.

Next, is the init_particle function, which as its name implies, initializes a particle’s properties. In order to make the particles look like fireworks I initialize the position of bunches of particles to the same value (using the global_x and global_y variables). The init_particle function accepts a single argument (a pointer to a particle) passed through the bp register.

The MAIN function updates the particles (and calls init_particle when needed) and then draws the particles to our off-screen buffer.

After all of the particles are drawn, the MAIN function calls two functions which apply filters on our buffer, before displaying it on the screen (by copying our video buffer to address 0xA000).

The Filters 

These filters are very important (and are implemented using MMX!). There are three: mmx_shade, mmx_blur and mmx_blur_right.

The particles are drawn by the MAIN function with a solid white color (i.e. an index of 255). Instead of wiping the screen after each frame, we call the mmx_shade function which decreases the value of all the pixels on the screen by 1. Its code is very simple:


mmx_shade:
mov cx, 320*200/8
xor di, di
movq mm1, [sub_mask]
.lop:
 movq mm0, [es:di]
 psubusb mm0, mm1
 movq [es:di], mm0
 add di, 8
 loop .lop
ret

Where,

sub_mask: dd 0x01010101, 0x01010101

Note how simple this is! First of all we gain a big speed factor as we are processing 8 pixels in each iteration. Even more importantly we use the extremely convenient saturated subtraction instruction. Saturated subtraction means that if the result of the subtraction is negative, then instead of getting the value mod 256, we get 0!

Running the fireworks program with only the mmx_shade filter on, results in this:

shade_only.JPG

Not very impressive yet, but it does look like fireworks! (at least when animated…)

In order to make our program much nicer, we use two more blurring filters. The mmx_blur function averages the value of all pixels with that of their 4 immediate neighbours (giving twice more weight to the value already in the current pixel). Its code is:

mmx_blur:
mov cx, (320*200-330*2)/8
mov di, 328
.lop:
 movq mm0, [es:di]
 movq mm1, [es:di+1]
 movq mm2, [es:di-1]
 movq mm3, mm0
 movq mm4, [es:di-320]
 movq mm5, [es:di+320]
 pavgb mm0, mm1
 pavgb mm3, mm2
 pavgb mm4, mm5
 pavgb mm3, mm4
 pavgb mm0, mm3
 movq [es:di], mm0
 add di, 8
 loop .lop
ret

Notice that we skip the first and last lines (plus several extra bytes) in order to keep the function as simple as possible. Those of you with sharper eyes will observe a small issue with this function. In order to properly perform these averages, we need to use a third buffer. As we are scanning through our buffer, we use the results of previous blurs as inputs to the following ones, instead of using the original contents of the buffer. This is not very noticeable, and as it saves us the need to allocate a third buffer it is worth it.

Running the program with the mmx_blur filter applied, as well as the mmx_shade filter, results in this:

mmx_blur.JPG

Much nicer! Finally we have the mmx_blur_right function, which is a slight variation of mmx_blur, that gives the fireworks comet like appearance. Its code is self-explanatory:


mmx_blur_right:
mov cx, (320*200-330*2)/8
mov di, 328
.lop:
 movq mm0, [es:di]
 movq mm1, [es:di+1]
 movq mm2, [es:di+320]
 movq mm3, [es:di+321]
 pavgb mm0, mm1
 pavgb mm3, mm2
 pavgb mm0, mm3
 movq [es:di], mm0
 add di, 8
 loop .lop
ret

The MAIN program switches between the two blur functions once in a while. The result of applying the mmx_blur_right function is:

mmx_blur_right.JPG

Exercises to the Reader

  1. Change the three filter functions so that they use the 16-byte SSE registers, thus gaining a factor of 2 on these functions’ running time. This is very easy!
  2. Add another offscreen buffer and correctly implement the blurring.
  3. Make the program shorter/cleaner/nicer. Play with the parameters. Change the palette. Enjoy!

I hope that you now have a decent understanding of SIMD and how to use it when programming the Pentium!

PS – in case you want to assemble the source use the following command:

nasmw.exe fireworks.asm -o fireworks.com

Understanding Soccer

Sunday, June 3rd, 2007

soccer_ball.gifIn 1998 (I can’t believe it was so long ago!) I was working at MATE (Media Access Technologies), an Israeli start-up at the time, specialising in face recognition and video searching. It was then that I came to know one of the coolest tricks in image recognition - the Hough Transform (pronounced “huf transform“).

In this article, I will demonstrate the algorithm with the following example:

Say you want to write a program that watches soccer. The input to your program is a sequence of frames from the game. The output should be the locations of all the players at all times. Given a frame of soccer, you first need to locate the players in the frame (i.e. you should know that a certain player occupies a certain set of pixels). Next, you need to know what location in the world is represented by those pixels. In order to do that, you need to figure out where the camera is looking at. In this article I will focus on this part of the problem (understanding the view point and direction). I will not deal with the (easier) part of determining what pixels are occupied by players.

Consider the following image of Ronaldinho:

nike-ronaldinho.jpg

I manually transformed it to this:

nike-ronaldinho_small.jpg

(the portion of the image covered by grass). I think it is clear how this step can be automatized.

Next I applied an automatic algorithm that got me this:

bw.gif

(the same image converted to black and white, for all the pixels that were bright enough). My algorithm simply turned all the pixels that are brighter than the color (80,80,20) (Hough Thresh 1) on a per channel basis to white, and all the other pixels to black. Note that my threshold color requires a high value of the red component, which is not present in the green grass, in order to turn the grass itself to black.

Lets return to our original problem – determining the point and direction of view (POV). We have big clues in the picture for determining the POV, namely the lines on the ground. Note that from the lines on the ground in the black and white picture, it is easy to see that Ronaldinho is standing in the center of the field. Can we somehow detect them?

Lets formulate the following abstact problem:

You are given a matrix M over Z2 (i.e. a matrix of bits). The matrix represents an image. Pixel Pij of the image is white if Mij is 0 and black if Mijis 1. The image contains straight lines. The lines are not perfect (i.e. they may overlap, they may not be continuous and there is a lot of noise). Your goal is to develop an efficient algorithm that receives the matrix and outputs the line equations.

Please take some time to think about this before continuing to read. If you spend enough time thinking about it, you will probably reach the conclusion that this is a really though problem.

A solution to this problem is the Hough Transform. The Hough Transform consists of noticing a duality of the Euclidean plane. We Consider a the straight line equation:

y = m * x + b

In this equation m and b are constants and x and y are variables. But what happens if we make x and y constants and m and b variables?

Well, the set of m’s and b’s that satisfy the equation represents all the lines that pass through our point (x, y)!

It is easy to see that for each point in the x-y plane we get a line in the b-m plane, namely, the line:

b = -x * m + y

Now each line in the x-y plane is matched with a point in the b-m plane! The duality depicted:

Duality

So the algorithm basically says:

  • Run through the original matrix.
  • For each pixel (x,y) that is white draw the line that matches it in the dual plane.
  • Then count the number of line intersections in the dual plane.
  • All points at which there is an intersection of more than T lines (these correspond to lines in the x-y plane on which we have more than T points in our original image) are considered valid lines.

To better illustrate the algorithm, I jotted down a few lines of python. The results follow (I must say that I was really surprised that it worked perfectly the first time around! All the threshold values I picked were somehow correct on the first go ;-) ).

Note that there is an implementation problem with using the b-m plane - Precision issues. For example, a near vertical line has huge values of m. In order to solve this, I used another duality (which is completely equivalent): the duality of the x-y plane and the a-c plane, where a is the angle of the line with the x axis (i.e. m = tan(a)), and c is b*cos(a).

The Hough map (i.e. the result of the Hough Transform) my script produced is:

hough_org.gif

Since it is kind of hard to distinguish the different color values, I generated a more human-friendly version (all the values are multiplied by 15):

hough.gif

And finally, the result (after converting back to the x-y plane):

res.gif

And the same result overlayed on the original image:

merged.gif

A perfect match! 

This is a big subject, and I only covered a small part of it. I will update this article in the near future with a more detailed analysis. If you have specific questions please post them as comments and I promise I will answer!

Meanwhile, let me just point out that the same method can be used to detect other shapes (e.g. circles). The generalization is trivial.

I will finish by showing the results of running the script on another image. Note that in this case, the threshold values are a little off. The results are nevertheless impressive:

a_small.jpg

The black and white image generated was:

a_bw.gif

The Hough map:

a_hough.GIF

The result:

a_res.GIF

And, finally, the overlayed result:

a_merged.GIF

To be updated…