A new search algorithm by pixels selection and adjustment for block motion estimation
Motion estimation (ME) is the critical part of video coding. Good ME shorten the coding bit stream by a factor of two or more. However, it also takes a lot of time during the whole coding process. One algorithm, which finds good motion vector quickly, is demanded. Based on the concept of random select some of the pixels to speed up MAD calculation, some methods are developed and compared with current fast algorithms such as Three Step Search (TSS) and Four Step Search (FSS).

Background
The concept of pixel selection can be found in mid-term report, where the relation of number of selected pixels and bit stream size is considered. The clip "steven.yuv" are used as sample. Also it has been found that generally random pixel selection has little difference to regular selection. However, some cases do exist to make random selection much better in coding bit stream size.

Adjustment after ME by pixels selection
After Examining the differences between the motion vectors estimated with full 256 pixels search and 16 pixels selection search, we found that they are distributed as a cross: Most differences are either within 5 steps in x-direction OR y-direction, or only 1 step in both of the two direction. With adjustments, the motion vector can be moved to a better place. In other words, the bit stream can be shortened because better motion estimation with adjustments. To adjust the motion vector, a cross-shaped mask was developed, shown in Figure one.

Figure 1: Mask of neighborhoods for adjustment

After the motion estimation by pixel selection, one block is selected, called potential block. Put the center of the mask (marked with '0' ) at the potential block. The blocks masked with number ( 'T' means '10' ) are among the candidates being considered for MAD distortion, depending on the total number of blocks for adjustment. Those with smaller number have higher priority. For example, only 13 blocks are considered for MAD distortion, then the blocks marked with 0-4 are chosen. MADs are calculated and the one with least distortion is selected for motion compensation. (or potential block for further adjustments)

Three-step ME -- fast scan with coarse and find adjustments
First, the whole blocks are scanned with MADa (We call MAD calculated by n pixels MADn. Here n can be 1~256. MAD256 is the full pixel MAD), a potential block is selected out. Here a is a small number for fast scan. Then put the mask center at the potential block to choose its neighbor blocks for MADb calculation. Here b is larger than a (in fact, better let b*b=a*256) and the number of blocks chosen t1 is optional. A new potential block is selected with least distortion. Last, repeat step 2 with MAD256 and select the block for motion compensation. The number of neighbor blocks chosen t2 is not necessarily the same as t1.

Theoretical calculation cost
If only the MAD calculation dominates (this assumption is proven to be good in the mid-term report), the calculation cost in one block (normalized as two pixel comparison) should be: 31^2*a+t1*b+t2*256. In unit of full MAD search, it is (31^2*a+t1*b+t2*256)/256 MAD/block. If the block is at the boarder of one frame, it should be smaller.

A full search with MAD16 costs 31^31=961MAD/block. TSS costs (9+8+8+8)=33MAD/block. FSS costs 17MAD/block and 50MAD/block in very bad case.

Implementation methods
Method1: This is developed to achieve very good bit rate quality. The algorithm is modified a little: Not all the blocks are scanned in step one, but all of those in center--within 12 steps from the center. This modification is to decrease the time in scan. a=16 and b=80; t1=41 and t2=21. The cost is [(12*2+1)^2*16+41*80+21*256]/256=72MAD/block.

Method2: It is developed to speed up ME further. In step one, not only the blocks far from the center are ignored, some others are also skipped considering their position. Only those have distances that can be divided by 3 in both x-direction and y-direction are chosen. Totally there are (25/3)^2=81 blocks are chosen. Other numbers are a=16 and b=64; t1=49 and t2=13. The cost is [81*16+49*64+13*256]/256=29MAD/block.

Method3: Similar to Method2, but t1=24. The cost is 23MAD/block.

Motion prediction
Motion vectors at the corresponded position of the previous frame can be referenced for motion prediction. Because of this, the fast scan step can be removed. However, if the reference motion vector is bad, no good motion vector can be expected. Therefore, the motion vectors in the first frame of a sequence should be carefully selected.

Implementation methods with motion prediction
Method4: With motion prediction, this method is expected as the best one. For the first frame, 25^2 blocks are scanned with MAD16 (a=16). b=64 and t1=21 for step two and t2=17 for step three. The cost is 56 MAD in this frame. For other frames, step one is removed and motion vector in corresponded position in the previous frame are selected as current potential motion vector; b=64 and t1=49 for step two and t2=5 for step three. The cost is 17 MAD/block.

Method5: It is to compare the effect of our new method with TSS under the condition of motion prediction. The motion vectors of the first frame are searched by the same method mentioned in method4. As to other frames, same potential motion vectors are selected from previous frame. TSS is used from step size equals to 2 to 1. The cost is 16 MAD/block.

Results:
Quantizer is chosen to be 8 for all the following coding process. There are three samples: 'akiyo.yuv', 'mobile.yuv' and 'steven.yuv'. each o f those are 10-frame CIF clip. Theoretical cost MAD/block, real time spent in ME, total MADa, MADb, MAD256 and total cost, cost of MAD/block, bit size of the nine frames and normalized size are listed in tables. For the method with motion estimation, first frame is separated from others for better understanding.

Akiyo.yuv
 
  Full search Three SS Four SS Method1 Method2 Method3
Theoretical MAD/block
961
33
17~50
72
29
23
Real time cost in ME
3170
271
509
945
488
444
MAD16
0
0
0
2050488
246312
246312
MAD64 or MAD80
0
0
0
136459
162159
83642
MAD256
3152115
109116
56450
70798
43691
43694
Total cost in MAD
3152115
109116
56450
241597
145808
79999
MAD/block
884
31
15.8
68
41
22.4
Bit frame size
17902
17902
18833
17908
17910
17910
Normalized size
100%
100%
105.2%
100%
100%
100%
 
   
Method4
Method5
Frame
First
Other eight
Total
Other
Total
Theoretical MAD/block
60
17
21.7
16
20.9
Real time cost in ME
39
129
168
73
112
MAD16
227832
0
227832
0
227832
MAD64 or MAD80
7873
144084
151957
0
7873
MAD256
6365
15203
21568
50083
56448
Total cost in MAD
22573
51224
74121
50083
72656
MAD/block
57
16.2
20.8
15.8
20.4
Bit frame size
6679
11347
18026
11215
17894
Normalized size  
100.7%
100%
 

Mobile.yuv
 
  Full search Three SS Four SS Method1 Method2 Method3
Theoretical MAD/block
961
33
17~50
72
29
23
Real time cost in ME
23457
1328
1067
2579
488
444
MAD16
0
0
0
2050488
246312
246312
MAD64 or MAD80
0
0
0
134061
162159
83642
MAD256
3152115
109161
57911
69439
43691
43694
Total cost in MAD
3152115
109161
57911
231110
145808
79999
MAD/block
884
31
16.2
64.8
41
22.4
Bit frame size
1124247
1146346
1143609
1132475
1205603
1230794
Normalized size
100%
102.0%
101.7%
100.7%
107.2%
109.5%
 
   
Method4
Method5
Frame
First
Other eight
Total
Other
Total
Theoretical MAD/block
60
17
21.7
16
20.9
Real time cost in ME
96
282
378
334
430
MAD16
227832
0
227832
0
227832
MAD64 or MAD80
7645
142292
149937
0
7645
MAD256
6232
14927
21159
50160
56392
Total cost in MAD
22381
50502
72883
50160
72541
MAD/block
57
15.9
20.4
15.8
20.4
Bit frame size
118064
1008064
1126128
1119601
1237665
Normalized size  
100.2%
110.1%
 

Steven.yuv
 
  Full search Three SS Four SS Method1 Method2 Method3
Theoretical MAD/block
961
33
17~50
72
29
23
Real time cost in ME
30502
1595
1532
2975
1512
1318
MAD16
0
0
0
2050488
246312
246312
MAD64 or MAD80
0
0
0
136994
164010
84457
MAD256
3152115
110277
79887
71214
44063
44131
Total cost in MAD
3152115
110277
79887
233618
100460
80640
MAD/block
884
31
22.4
65.5
28.2
22.6
Bit frame size
568326
616772
844312
577952
647299
658789
Normalized size
100%
108.5%
148.6%
101.7%
113.9%
115.9%
 
   
Method4
Method5
Frame
First
Other eight
Total
Other
Total
Theoretical MAD/block
60
17
21.7
16
20.9
Time used
114
313
427
370
484
MAD in 16
227832
0
227832
0
227832
MAD in 64 
7884
145339
153283
0
7884
MAD in 256
6411
15309
21720
50822
57233
Total MAD
22622
51644
74280
50822
73444
MAD/block
57
16.3
20.8
16.0
20.6
size
62038
523186
585224
604749
666787
Normalized size  
103.0%
117.3%
 
Bit stream performance and coding time cost are also ploted out.

Method1 is proven to be a very good method in ME. Compared to the full search algorithm, it saved more than 90% time. However, the bit stream is only about 1% longer. When the speed is not a critical problem, it is a recommended method. Method4 is another good one because its speed. For long sequence, it is not slower than TSS or FSS and the bit stream is shorter. Compared with method4 and method5, the adjustment by pixel selection is better than the adjustment from TSS.

Option I: the neighbor mask
Another mask is made as shown in Figure two. Some evaluation is made (result will not be shown here but can be obtained by running the source code). It is found this mask is not better than the one we mentioned before.

 
Figure two: Another neighbor mask
 

Option II: The rate between the number of MAD64 and MAD256
One MAD256 costs about the same time as four MAD64. With motion prediction (and ME for the first frame mentioned before), there are several options to choose to keep the MAD/block the same. For example, we can use no MAD64 but 17 instead and the MAD cost is same. The following is a comparison of bit stream among the methods with same cost in other frames. Except method5, others are related to the cross-shaped mask in Figure one.
 
 
Method5
     
Method4
s teven # MAD by 64 pixels
0 (step 3)
0
16
32
48
# MAD by all pixels
16 (step 3)
16
12
8
4
MAD in 64 
7884
7884
59231
106582
153283
MAD in 256
57233
57805
45672
33529
21720
Total MAD
73444
72168
74720
74414
74280
MAD/block
20.6
20.2
21.0
20.9
20.8
Bit stream size
666787
609013
596060
586886
585224
Percent 
117.3%
115.3%
104.9%
103.3%
103.0%
M Bit stream size
1237665
1135993
1133209
1130357
1126128
Percent 
110.1%
101.0
100.8%
100.5%
100.2%
A Bit stream size
17941
18035
17894
17902
18026
Percent 
102.2%
100.7%
100%
100%
100.7%
'M' represents the clip mobile.yuv and 'A' represents the clip akiyo.yuv

From the first two columns, cross-shaped mask is better than TSS for adjustment, especially for simple clips. Also we found more MAD64 do help to shorten the bit stream. It proves pixel selection is a very good algorithm in adjustment.

Option III: random pixel selection
As stated in mid-term report, the pixels are selection by a pattern. If many pixels are selected, the coding results may have little relation with the pattern. However, if the number of pixels selection is small, the pattern may affect the bit stream size. Here three patterns of 16 pixels are shown in Figure 3:

Figure 3: Three pattern for pixel selection
The left pattern is the one used above and also appeared in the mid-term report. The center one looks more unbalanced. is used for comparison. Both of them are constructed from computer program. The right pattern is made by hand and looks more balanced.
 
Left pattern
Center pattern
Right pattern
Total bit size
Total bit size
Total bit size
percent
akiyo.yuv
18026
18066
18180
101.6%
mobile.yuv
1126128
1134329
1120830
99.7%*
Steven.yuv
585224
592520
584403
101.6%
*although the bit size made from full search motion vectors is not necessarily be the shortest, it is still a surprise to see the bit stream size made from the right pattern are shorter. Further investigation may needed in the following research.

Through the result, we concluded that more balanced pixel distribution pattern expects shorter coding bit stream size. However, to find out the best pattern for many more samples needs more experiments and analysis.

Conclusion:
A new algorithm for ME is developed. Some implementations are developed and compared with existing algorithms. This new algorithm really shows its advantages in time saving and bit rate quality obtaining. Furthermore, it is flexible. Through this concept, different methods can be developed for different sequence cases and different requirements.

Appendix:
source code
syncfile: gray scale, color scale