Content-Based Synchronization Using
the Local Invariant Feature for Robust Watermarking
Hae-Yeoun Lee1, Jong-Tae Kim1, Heung-Kyu Lee1, and Young-Ho Suh2
1 Department of EECS, Korea Advanced Institute of Science and Technology, Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea
2 Digital Content Research Division, Electronics and Telecommunications Research Institute, Gajeong-dong, Yuseong-gu, Daejeon, Republic of Korea
Abstract. This paper addresses the problem of content-based synchronization for robust watermarking. Synchronization is the process of identifying the locations where the signature, the copyright information, is embedded and detected, so it is crucial for the robustness of the watermarking system. In this paper, we review representative content-based approaches and propose a new synchronization method based on the scale invariant feature transform. In content-based synchronization approaches, it is important to extract features that remain robust under image distortions, and we expect that considering local image characteristics is helpful for this. The scale invariant feature transform takes these characteristics into account and is invariant to noise, spatial filtering, geometric distortions, and illumination changes of the image. Through experiments, we compare the proposed method with representative content-based approaches and show its appropriateness for robust watermarking.
1 Introduction
The rapid growth of network and computing technology has opened the ubiquitous era, and digital multimedia is now widely used and accessed everywhere. However, digital multimedia can be illegally copied, manipulated, and reproduced without any protection. Digital watermarking is an efficient technique to prove ownership by inserting copyright information into the content itself.
Since Cox et al. [1] proposed a novel watermarking strategy using the spread-spectrum technique, there has been much research inspired by methods of image coding and compression. These works are robust to image noise and spatial filtering, but are severely vulnerable to geometric distortions.
In order to counter geometric distortions, synchronization, the process of identifying the locations in the content for watermark embedding and detection, is required. Synchronization techniques can be classified into four categories: (1) the use of periodic sequences, (2) the use of templates, (3) the use of invariant transforms, and (4) the use of the media content.
Kutter [2] proposed a robust synchronization approach using a periodic sequence. The signature is embedded multiple times in the image at different spatial locations. The peak pattern corresponding to the locations of the embedded signature is obtained with the auto-correlation function and used to undo the geometric distortions that the watermarked image has undergone. Pereira and Pun [3] described a template-based approach that inserts templates into the media content. Accurate and efficient recovery of geometric transformations is possible by detecting these templates and estimating the distortions. Lin and Cox [4] introduced an approach that exploits the invariance of the Fourier transform to cyclic translation through a log-polar mapping, known as the Fourier-Mellin transform. This transform is mathematically well defined and invariant to rotation, scaling, and translation of the image. However, the severe fidelity loss incurred when inverting the log-polar mapping makes this approach difficult to implement in practice. The last category is based on the media content, and our approach belongs to it. Details of content-based approaches are explained in section 2.
In this paper, we review previous content-based synchronization approaches and propose a new content-based synchronization method based on the scale invariant feature transform. In content-based synchronization approaches, feature extraction, or analysis, is important for the robustness of the watermarking system, and we expect that considering local image characteristics helps to extract features robustly even under image distortions. The scale invariant feature transform may be one such solution based on local image characteristics and is invariant to rotation, scaling, translation, and illumination changes of the image. Therefore, we adopt and modify this transform for the watermarking purpose. In the experiments, we compare the performance of the proposed method with that of other content-based approaches by applying various attacks such as lossy compression, spatial filtering, and geometric distortions. The results show the appropriateness of the proposed method for robust watermarking.
In the following section, we describe previous content-based synchronization approaches. Section 3 proposes a new synchronization method based on the scale invariant feature transform. Experimental results and discussion are given in section 4, and section 5 concludes the paper.
2 Previous Content-Based Synchronization Approaches
In content-based synchronization approaches, the media content itself serves as a reference that is invariant to geometric transformations, so referring to the content can solve the synchronization problem; that is, the location of the signature is tied not to image coordinates but to image semantics. If we fail to detect the exact location where the signature is embedded, it is impossible or very difficult to retrieve the signature correctly, and the performance of the watermarking system ultimately decreases.
Therefore, extracting this location, called the patch, for watermark embedding and detection is very important and must be designed carefully. In this section, we review representative content-based synchronization approaches for computing the patch.
2.1 Bas et al.’s Approach
Bas et al. [5] proposed a feature-based synchronization approach based on salient feature points and their Delaunay tessellation. In order to form patches for watermark embedding and detection, they first extract feature points by applying the Harris corner detector, which uses differential features of the image. The set of extracted feature points is then decomposed into a set of disjoint triangles by Delaunay tessellation. If the sets of feature points extracted from the original image and from a distorted image are identical, the Delaunay tessellation is an efficient way to divide the image and is invariant to spatial filtering and geometric distortions. Each triangle is used as a patch, and the signature is embedded into the patch by a classical additive watermarking method in the spatial domain.
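A minimal sketch of this construction, using off-the-shelf components (the Harris corner measure from scikit-image and SciPy's Delaunay triangulation) rather than the authors' exact implementation; the minimum corner spacing is a hypothetical parameter:

```python
import numpy as np
from scipy.spatial import Delaunay
from skimage.feature import corner_harris, corner_peaks

def harris_delaunay_patches(image, min_distance=20):
    """Sketch of a Bas et al.-style patch construction:
    Harris corners decomposed into disjoint triangles."""
    # Harris corner response; corner_peaks enforces a minimum spacing
    # so the points are roughly homogeneously distributed.
    response = corner_harris(image)
    points = corner_peaks(response, min_distance=min_distance)  # (row, col) pairs

    # Delaunay tessellation of the feature points; each triangle
    # is a candidate patch for embedding and detection.
    tri = Delaunay(points)
    return points, points[tri.simplices]   # triangles as (3, 2) vertex arrays
```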
A drawback of this method is that the Harris corner detector is sensitive to image modifications: the set of extracted feature points differs even after small image modifications, and the Delaunay tessellation of these points then differs severely from that of the original image. It is therefore difficult to extract the triangles, i.e. the patches, robustly, and the robustness of the watermarking system is eventually reduced. Furthermore, geometric manipulations that change the relative positions of feature points or remove feature points, for example aspect-ratio changes and cropping of the image, may result in a different tessellation, and the patches no longer correspond. Fig. 1 shows how the tessellation differs under image modifications.
Fig. 1. Feature points and their Delaunay tessellation: (a) the original image, (b) the Gaussian-blurred image, and (c) the cropped image
2.2 Nikolaidis and Pitas’ Approach
Nikolaidis and Pitas [6] described an image segmentation-based synchronization approach. In general, image segmentation is a useful tool in image processing, and segmented regions are expected to be invariant to image noise and spatial filtering. Moreover, each region is affected by geometric manipulations in the same way as the whole image.
In order to extract the patch, they apply an adaptive k-means clustering technique and retrieve the several largest regions. These regions are then fitted with ellipsoids, and their bounding rectangles are used as the patches to embed or detect the signature.
Problems with this method are that image segmentation depends on the image content, objects, textures, and so on, and that it is severely sensitive to image modifications that remove image parts, for example cropping and translation of the image. Fig. 2 shows the original image and its segmented regions for the Baboon and Airplane images. For convenience, only the boundaries of the segmented regions are shown. In the Baboon image, we can easily select the largest and most useful region for the patch, pointed to by an arrow, but in the Airplane image it is difficult to select a region for the patch.
In our experiments, for the analysis of the segmentation-based synchronization approach, we first adopt an adaptive k-means clustering technique to segment the image, compute the center of gravity (centroid) of each segmented region whose size is above a predefined threshold to obtain robust feature points, and then decompose these points into triangles, the patches, by Delaunay tessellation. The results with this method are described in section 4.
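The following sketch illustrates this variant under simplifying assumptions: plain k-means on pixel intensity stands in for the adaptive k-means clustering, and the number of clusters and the minimum region size are hypothetical values:

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import Delaunay
from sklearn.cluster import KMeans

def segmentation_patches(image, n_clusters=4, min_region_size=500):
    """Sketch: segment by intensity clustering, take centroids of
    large connected regions as feature points, triangulate them."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        image.reshape(-1, 1)).reshape(image.shape)

    centroids = []
    for c in range(n_clusters):
        # Split each cluster into connected regions and keep the large ones.
        regions, n_regions = ndimage.label(labels == c)
        for r in range(1, n_regions + 1):
            mask = regions == r
            if mask.sum() >= min_region_size:
                centroids.append(ndimage.center_of_mass(mask))

    centroids = np.array(centroids)
    tri = Delaunay(centroids)            # triangles serve as patches
    return centroids, centroids[tri.simplices]
```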
Fig. 2. (a) The original image and (b) its segmented image in Baboon and Airplane images
2.3 Tang and Hang’s Approach
Tang and Hang [7] introduced a synchronization approach using an intensity-based feature extractor and image normalization. In general, the objects in a normalized image are invariant to small image modifications, and this approach builds on that fact. In order to extract feature points, they use a method called Mexican hat wavelet scale interaction. It determines feature points by identifying intensity changes in the image and is robust to spatial distortions. Then, disks of fixed radius R centered at each extracted feature point are normalized so that they are invariant to rotation, translation, and partially to spatial filtering of the image. They use these normalized disks as patches for watermark embedding and detection.
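A rough sketch of this feature extractor, approximating the Mexican hat wavelet response with a Laplacian of Gaussian at two scales; the scales, the weighting factor, and the peak spacing are assumptions of this illustration, not Tang and Hang's parameters:

```python
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max

def scale_interaction_features(image, s1=2.0, s2=4.0, gamma=1.0, min_distance=20):
    """Sketch of Mexican hat wavelet scale interaction:
    feature points are peaks of the response difference at two scales."""
    # The Mexican hat wavelet is approximated by the negated
    # Laplacian of Gaussian at each scale.
    m1 = -ndimage.gaussian_laplace(image.astype(float), sigma=s1)
    m2 = -ndimage.gaussian_laplace(image.astype(float), sigma=s2)
    response = np.abs(m1 - gamma * m2)
    # Local maxima of the interaction response give the feature points.
    return peak_local_max(response, min_distance=min_distance)
```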
Fig. 3. The normalized disks: (a) the original image, (b) the blurred image, (c) the 10º rotated image, and (d) the 1.2× scaled image
However, the radius of the disks is fixed, and the image normalization technique is sensitive to the image content used for normalization, so this approach shows severe weakness against scaling. In fact, it is not easy to determine the radius of the disks efficiently. Fig. 3 shows the shape of the normalized disks under image distortions. The normalized disks are extracted robustly under spatial filtering and rotation of the image, but there are problems under scaling: the normalized disk from the scaled image does not match that from the original image (see Fig. 3d).
For the analysis of this approach, we first compute the normalized disks and obtain six affine-transformation parameters. These parameters are used to form the normalized rectangle, the patch, for watermark embedding and detection. The results are shown in section 4.
3 Proposed Synchronization Approach
In object recognition and image retrieval applications, affine-invariant features have recently been studied intensively. Lowe [8] proposed the scale invariant feature transform, which is based on local maxima and minima of the scale-space. Mikolajczyk and Schmid [9] suggested an affine-invariant interest point extractor that considers local textures, and Tuytelaars and Van Gool [10] described a local image descriptor that extracts interest points and searches their nearby edges, or contours, for affine-invariant regions. These affine-invariant features are highly distinctive and are matched with high probability under a large class of image distortions, for example viewpoint changes, illumination changes, partial visibility, and image noise.
In content-based synchronization approaches, the extraction of the patch is very important for the robustness of the watermarking system, and we expect that considering local image characteristics helps to extract the patch robustly. In this section, we propose a new synchronization method based on the scale invariant feature transform.
Fig. 4. The scale-space built with the difference-of-Gaussian (DoG) function and the closest neighbors of a pixel (filled in black) in the same scale, the scale above, and the scale below
3.1 Scale Invariant Feature Transform
The scale invariant feature transform, referred to as the SIFT descriptor, was proposed by Lowe [8] and has been shown to be invariant to image rotation, scaling, translation, partially to illumination changes, and to projective transforms. This descriptor extracts feature points by considering local image characteristics and describes the properties of each feature point, such as its location, scale, and orientation. The basic idea of the SIFT descriptor is to detect feature points efficiently through a staged filtering approach that identifies stable points in the scale-space.
The SIFT descriptor extracts local feature points in the following steps: (1) select candidate feature points by searching for peaks in the scale-space of a difference-of-Gaussian (DoG) function, (2) localize feature points using measures of their stability, (3) assign orientations based on local image properties, and (4) compute feature descriptors that represent local shape distortion and illumination changes.
In order to extract candidate locations for feature points, the scale-space is first built with a difference-of-Gaussian function, and all local maxima and minima in the scale-space are found by checking the eight closest neighbors in the same scale and the nine neighbors in the scale above and below. These locations are invariant to scale changes of the image (see Fig. 4).
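A minimal sketch of this step; the sigma values are illustrative, and the octave structure (repeated downsampling) of the full SIFT scale-space is omitted:

```python
import numpy as np
from scipy import ndimage

def dog_extrema(image, sigmas=(1.6, 2.26, 3.2, 4.52)):
    """Sketch: build a difference-of-Gaussian stack and keep pixels that
    are larger (or smaller) than all 26 scale-space neighbours."""
    img = image.astype(float)
    gaussians = [ndimage.gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])

    candidates = []
    for k in range(1, dog.shape[0] - 1):              # interior scales only
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                cube = dog[k-1:k+2, y-1:y+2, x-1:x+2]  # 3x3x3 neighbourhood
                v = dog[k, y, x]
                if v == cube.max() or v == cube.min():
                    candidates.append((y, x, sigmas[k]))
    return candidates
```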
After candidate locations have been found, a detailed fit to the nearby data is performed for location, edge response, and peak magnitude. Candidate points that have low contrast or are poorly localized are then removed by measuring the stability of each feature point at its location and scale using a 2 by 2 Hessian matrix H as follows.
\[
\mathrm{Stability} = \frac{(D_{xx} + D_{yy})^2}{D_{xx}D_{yy} - D_{xy}^2} < \frac{(r+1)^2}{r},
\qquad \text{where } \mathbf{H} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}. \tag{1}
\]
The value r is the ratio between the largest and smallest eigenvalues and is used to control the stability. In their experiments, they use a value of r = 10.
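A small sketch of the edge-response test of Eq. (1), using finite differences on the DoG image at the candidate's scale; the finite-difference scheme is an assumption of this illustration:

```python
def is_stable(dog, y, x, r=10.0):
    """Sketch of the edge-response test of Eq. (1): reject candidates whose
    principal curvature ratio exceeds r (the paper uses r = 10)."""
    # Second derivatives of the DoG image by finite differences.
    dxx = dog[y, x+1] - 2*dog[y, x] + dog[y, x-1]
    dyy = dog[y+1, x] - 2*dog[y, x] + dog[y-1, x]
    dxy = (dog[y+1, x+1] - dog[y+1, x-1] - dog[y-1, x+1] + dog[y-1, x-1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy**2
    if det <= 0:                       # curvatures of opposite sign: reject
        return False
    return trace**2 / det < (r + 1)**2 / r
```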
To achieve invariance to image rotation, a consistent orientation is assigned to each feature point based on local image properties, and the descriptor is expressed relative to this orientation. In order to assign an orientation, the gradient magnitude m and orientation θ are computed from pixel differences as follows.
\[
m(x,y) = \sqrt{\big(L(x+1,y)-L(x-1,y)\big)^2 + \big(L(x,y+1)-L(x,y-1)\big)^2},
\qquad
\theta(x,y) = \tan^{-1}\frac{L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)}. \tag{2}
\]
where L is the Gaussian-smoothed image at the closest scale to that at which the feature point was found. A histogram of orientations is formed from the gradient orientations at all sample points within a circular window around the feature point. Peaks in this histogram correspond to the dominant directions of the feature point.
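A sketch of Eq. (2) and the orientation histogram; the window radius and the number of histogram bins are illustrative choices:

```python
import numpy as np

def dominant_orientation(L, y, x, radius=8, n_bins=36):
    """Sketch of Eq. (2): gradient magnitude and orientation around a
    feature point, and the peak of the orientation histogram."""
    hist = np.zeros(n_bins)
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            if (j - y)**2 + (i - x)**2 > radius**2:
                continue                      # circular window only
            dx = L[j, i + 1] - L[j, i - 1]
            dy = L[j + 1, i] - L[j - 1, i]
            m = np.hypot(dx, dy)              # gradient magnitude
            theta = np.arctan2(dy, dx)        # gradient orientation
            b = int((theta + np.pi) / (2 * np.pi) * n_bins) % n_bins
            hist[b] += m                      # magnitude-weighted histogram
    # Return the centre angle of the strongest bin (dominant direction).
    return (hist.argmax() + 0.5) / n_bins * 2 * np.pi - np.pi
```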
For invariance to illumination conditions, 8 orientation planes are defined, the gradient magnitude m and orientation θ are smoothed with a Gaussian filter, and samples are taken over a 4 by 4 grid of locations with 8 orientation planes. This feature vector of 4x4x8 elements is normalized by dividing by the square root of the sum of squared components to reduce the effect of illumination changes.
Local feature points obtained through the SIFT descriptor are invariant to rotation, scaling, translation, and partially to illumination changes of the image.
3.2 Modification for the Watermarking Purpose
The number and distribution of local feature points from the SIFT descriptor depend on the image content and textures. Moreover, the SIFT descriptor was originally devised for image-matching applications, so it extracts many feature points densely distributed over the whole image. In order to use this local invariant descriptor for watermarking, we adjust the number, distribution, and scale of the feature points and, guided by experiments, remove points that have a low probability of being detected (matched) under image distortions. Finally, the patch for watermark embedding and detection is formed from this descriptor.
The SIFT descriptor represents the properties of each feature point, such as its location (t1, t2), scale (s), and orientation (θ). Therefore, for watermark embedding and detection, we can form a patch that is invariant to rotation, scaling, and translation of the image by the following affine transformation. Through this transformation, we can convert the rectangular signature into the shape of a patch in the image, or vice versa.
\[
\begin{pmatrix} x_n \\ y_n \end{pmatrix}
= s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x_o \\ y_o \end{pmatrix}
+ \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}. \tag{3}
\]
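A minimal sketch of Eq. (3); the function name and argument order are ours, not part of the original method, and the inverse mapping used at detection time is simply the inverse of this transform:

```python
import numpy as np

def signature_to_patch(xo, yo, s, theta, t1, t2):
    """Sketch of Eq. (3): map a point (xo, yo) of the rectangular signature
    into image coordinates (xn, yn) of the patch defined by a feature point
    with location (t1, t2), scale s, and orientation theta."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    xn, yn = s * (R @ np.array([xo, yo])) + np.array([t1, t2])
    return xn, yn
```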
Images from natural scenes contain many noise factors that affect feature extraction; we decrease the interference of noise by applying Gaussian filtering before feature extraction.
In order to control the distribution of local feature points, we apply the circular neighborhood constraint used by Bas et al. [5]. The neighborhood size D depends on the image dimensions and on a constant r as follows.
\[
D = \frac{\mathrm{width} + \mathrm{height}}{r}. \tag{4}
\]
Here, width and height are the width and height of the image, respectively, and r is a constant that controls the neighborhood size; we set r = 24, similar to Bas et al. However, the value of the difference-of-Gaussian function is used here to measure the strength of each feature point. The neighborhood size must be chosen carefully: if it is too small, the patches overlap severely, and if it is too large, there are not enough patches.
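A sketch of this constraint, assuming each feature point comes with a strength value taken from the difference-of-Gaussian response:

```python
def enforce_neighbourhood(points, strengths, width, height, r=24):
    """Sketch of the circular neighbourhood constraint of Eq. (4): visit
    points from strongest to weakest and reject any point closer than D
    to an already-accepted one (r = 24, as in Bas et al.)."""
    D = (width + height) / r
    kept = []
    for (y, x), s in sorted(zip(points, strengths), key=lambda p: -p[1]):
        if all((y - ky)**2 + (x - kx)**2 >= D**2 for ky, kx in kept):
            kept.append((y, x))
    return kept
```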
The SIFT descriptor considers image content, and hence the shape of the patch is rotated and scaled depending on the image content (see Fig. 7d). To embed and detect the signature in the patch, interpolation is required to transform the rectangular signature to match the shape of the patch, or vice versa. In order to minimize the distortion of the signature caused by interpolation, the size of the patch must be set close to that of the rectangular signature. To adjust the size of the local features, we divide the scale of the feature points into ranges and apply experimentally determined magnification factors, under the assumption that the size of the watermarked images will not be changed excessively.
The scale of a feature point from the SIFT descriptor is also related to the scale factor of the Gaussian function in the scale-space. In our analysis, feature points with a small scale have a low probability of being detected because they easily disappear when the image content is modified. Feature points with a large scale also have a low probability of being detected in distorted images because their locations easily move; moreover, large patches overlap other patches and degrade the perceptual quality of the image when the signature is inserted. Therefore, we remove feature points whose scale is below 2 or above 10; these values were determined experimentally.
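A one-line sketch of this filtering step, assuming feature points are represented as (y, x, scale) triples:

```python
def filter_by_scale(features, s_min=2.0, s_max=10.0):
    """Sketch: discard feature points whose characteristic scale lies
    outside the experimentally chosen range (below 2 or above 10)."""
    return [(y, x, s) for (y, x, s) in features if s_min <= s <= s_max]
```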
Fig. 5 shows a patch from the proposed method for watermark embedding and detection. For convenience, only one patch is shown. The patch is formed robustly even under spatial filtering, rotation, and scaling of the image.
4 Experiment Results and Discussions
In this section, we will compare the performance of the proposed method with that of three representative content-based synchronization approaches described in section 2.
Method 1 is the approach proposed by Bas et al. [5], method 2 is a segmentation-based approach similar to that of Nikolaidis and Pitas [6], and method 3 is the approach described by Tang and Hang [7]. For all methods, we applied the circular neighborhood constraint of Bas et al. [5] to obtain a homogeneous distribution of feature points.
For experiments, we used five 512 by 512 pixel images: Lena, Baboon, Pepper, Airplane, and Lake images widely used in image processing applications (see Fig. 6).
Each image is distorted by applying spatial filtering attacks such as mean filtering, median filtering, Gaussian noise, and JPEG compression, and geometric distortions such as rotation, scaling, translation, and cropping of the image.
Fig. 5. The affine-invariant patch from the scale invariant feature transform: (a) the original image, (b) the blurred image, (c) the 10° rotated image, and (d) the 1.2× scaled image (the arrow represents the scale and orientation of the feature point)
The robustness of the patch is measured by matching the patches from the original image with those from attacked images. If the pixel difference between the location of a patch from the original image and that of a patch from an attacked image is less than two pixels, we consider it a correctly matched patch. Such small misalignments can be compensated by searching a few pixels around the originally found patch location when we retrieve the embedded signature, to prove ownership, from the watermarked patch. In particular, if images are attacked by geometric distortions, we transform the coordinates of the patches from the attacked images into the coordinates of the original image by computing the inverse transform.
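A sketch of this matching criterion; `inverse_map` stands for the known inverse geometric transform (identity for non-geometric attacks) and is a hypothetical helper:

```python
import numpy as np

def count_matches(original_patches, attacked_patches, inverse_map=None, tol=2.0):
    """Sketch of the robustness measure: a patch from the attacked image
    matches if, after mapping back to original coordinates, its location
    differs from some original patch by less than `tol` pixels."""
    matched = 0
    for p in attacked_patches:
        q = inverse_map(p) if inverse_map is not None else np.asarray(p, float)
        d = [np.linalg.norm(q - np.asarray(o, float)) for o in original_patches]
        if d and min(d) < tol:
            matched += 1
    return matched
```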
Table 1 shows the experimental results. The "Original image" row gives the number of patches extracted from the original image. The other rows give the number of matching patches between the original image and each attacked image. The root mean square (RMS) error of the pixel difference, i.e. the misalignment of the matching patches, is given in parentheses (in pixels). Each entry in the table is the average over the five images.
Fig. 6. (a) the Lena image, (b) the Baboon image, (c) the Pepper image, (d) the Airplane image, and (e) the Lake image
Table 1. The number of matching patches between the original image and attacked images. RMS errors of the pixel difference are given in parentheses
Attack                  Method 1       Method 2       Method 3       Proposed method
Original image          45.5           57             31             43.6
Mean 3×3                17.8 (0.412)   14.8 (0.569)   30.2 (0.234)   25.6 (0.664)
Median 3×3              19.2 (0.379)   12.4 (0.664)   21.8 (0.491)   25.4 (0.664)
80% JPEG compression    25.6 (0.294)   21.4 (0.442)   26.2 (0.557)   33.0 (0.457)
Gaussian noise          16.6 (0.369)   16.2 (0.546)   22.4 (0.658)   21 (0.748)
Rotation (10°)          11.4 (0.441)   0 (0.000)      5.6 (0.683)    15.8 (0.661)
Scaling (1.2×)          12.8 (0.502)   13 (0.673)     0 (0.000)      5.2 (1.325)
Translation (30×30)     24 (0.000)     0 (0.000)      7.6 (0.443)    31.2 (0.311)
Cropping (1/4)          27.6 (0.000)   0.6 (0.067)    8.2 (0.387)    31.6 (0.143)
Figure 7 shows the shape of the patches obtained by each content-based synchronization approach on the Pepper image. The patches from method 1 and method 2 are triangles, and the patches from method 3 and our method are rectangles. The background image of method 2 (Fig. 7b) shows the boundaries of the segmented regions. The background image of method 3 (Fig. 7c) is the residual image between two images scaled by different factors in the Mexican hat wavelet scale interaction.
Fig. 7. The shape of patches from each content-based synchronization approach: (a) Method 1, (b) Method 2, (c) Method 3, and (d) Proposed method
As mentioned in section 2, method 1 is a synchronization approach based on feature points from the Harris corner detector and their Delaunay tessellation. The Harris corner detector is considerably sensitive to small image modifications, and the triangles of the Delaunay tessellation built from these feature points no longer correspond. Therefore, the results for method 1 show that the number of matching patches decreases under attacks. In particular, under cropping and translation, although the intensity of the image is not modified, the matching patches, the triangles from the Delaunay tessellation, are severely different. The results with this method were nevertheless acceptable overall for watermarking purposes. In general, a watermarking system can prove the ownership of content if it can correctly retrieve the embedded signature from at least one patch.
Method 2 is a synchronization approach using image segmentation and Delaunay tessellation. Although this approach shows high performance under image scaling, its performance under the other attacks was poorer than that of the other approaches. Moreover, when images were cropped, rotated, or translated, the centers of gravity of the segmented regions easily moved to other locations, and hence it was very difficult to form the patch robustly.
Method 3 is based on an intensity-based feature extractor called Mexican hat wavelet scale interaction and on image normalization. As explained in section 2, the objects in the normalized image are invariant to small image modifications and to rotation. The results show that this method is considerably more robust to spatial filtering attacks than the other approaches. However, its performance under geometric attacks is relatively low, and in particular, when images are scaled, it fails to extract matching patches.
The overall performance of the proposed method is satisfactory. Our method extracts the patch more robustly than method 1 under spatial filtering, translation, and cropping attacks because the SIFT descriptor considers only local image characteristics and is invariant to illumination changes of the image. Our method is also more robust than method 3 under geometric distortions. We expect that a watermarking system using the proposed synchronization method will be resilient to spatial filtering and geometric distortions of the image.
However, our method shows relatively low performance under scaling. This is still acceptable for watermarking, because the watermarking system can prove the ownership of content if it can correctly retrieve the embedded signature from at least one patch; if we can retrieve the signature from more patches, the reliability and robustness of the watermarking system increase. Our ongoing research focuses on increasing the robustness under scaling, and we expect to obtain higher performance.
5 Conclusion and Future Works
Synchronization is the process of identifying the locations at which the signature is embedded and detected, and it is crucial for the robustness of the watermarking system. The content-based synchronization approach is one of the solutions, and our ongoing research focuses on this approach. In this paper, we reviewed representative content-based synchronization approaches and proposed a new synchronization approach based on the scale invariant feature transform, the SIFT descriptor. We expect that considering local image characteristics helps to extract features robustly. The scale invariant feature transform considers these local image properties and is invariant to rotation, scaling, translation, and illumination changes of the image. We modified this descriptor for the watermarking purpose.
In the experiments, we applied various image distortions, or attacks, such as spatial filtering and geometric distortions, and compared the performance of our approach with that of representative content-based synchronization approaches. The results support that considering local invariant features is helpful in designing a robust watermarking system and that our synchronization approach is an efficient way to solve the synchronization problem. Our future research focuses on increasing the robustness against geometric attacks and on applying the patch for watermark embedding and detection. We believe that a watermarking approach using this local invariant feature will also be robust.
References
1. I.J. Cox, J. Kilian, T. Shamoon: Secure spread spectrum watermarking for multimedia. IEEE Trans. on Image Processing, Vol. 6 (1997) 1673-1678
2. M. Kutter: Watermarking resisting to translation, rotation and scaling. Proc. of SPIE, Vol. 3528 (1998) 423-431
3. S. Pereira, T. Pun: Robust template matching for affine resistant image watermark. IEEE Trans. on Image Processing, Vol. 9 (2000) 1123-1129
4. C. Lin, I.J. Cox: Rotation, scale and translation resilient watermarking for images. IEEE Trans. on Image Processing, Vol. 10 (2001) 767-782
5. P. Bas, J-M. Chassery, B. Macq: Geometrically invariant watermarking using feature points. IEEE Trans. on Image Processing, Vol. 11 (2002) 1014-1028
6. A. Nikolaidis, I. Pitas: Region-based image watermarking. IEEE Trans. on Image Processing, Vol. 10 (2001) 1726-1740
7. C.W. Tang, H-M. Hang: A feature-based robust digital image watermarking scheme. IEEE Trans. on Signal Processing, Vol. 51 (2003) 950-959
8. D.G. Lowe: Object recognition from local scale-invariant features. Proc. of International Conference on Computer Vision, (1999) 1150-1157
9. K. Mikolajczyk, C. Schmid: An affine invariant interest point detector. Proc. of European Conference on Computer Vision, (2002) 128-142
10. T. Tuytelaars, L.V. Gool: Wide baseline stereo matching based on local affinely invariant regions. Proc. of British Machine Vision Conference, (2000) 412-422