Video Segmentation Using Hidden Markov Model with Multimodal Features

P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 401–409, 2004.

© Springer-Verlag Berlin Heidelberg 2004


Tae Meon Bae, Sung Ho Jin, and Yong Man Ro

IVY Lab., Information and Communications University (ICU),
119, Munjiro, Yuseong-gu, Daejeon, 305-714, Korea
{heartles, wh966, yro}@icu.ac.kr

Abstract. In this paper, a video segmentation algorithm based on a Hidden Markov Model (HMM) classifier with multimodal features is proposed. By using an HMM classifier with both audio and visual features, erroneous shot boundary detection and over-segmentation are avoided compared with conventional algorithms. The experimental results show that the proposed method is effective in shot detection.

1 Introduction

With the rapid increase in multimedia consumption, automatic or semi-automatic video authoring and retrieval systems are indispensable for effectively manipulating large amounts of multimedia data. Because video segmentation is a basic operation in authoring and retrieving video content, it is important to detect precise shot boundaries and to segment a video into semantically homogeneous units for high-level video processing.

In conventional video segmentation algorithms, most effort has focused on detecting camera shot changes, using visual cues such as luminance and color histogram differences between neighboring frames [1-3]. However, these conventional methods have the following problems.

First, the visual feature difference between neighboring frames is too small to detect gradual shot changes. To resolve this problem, an approach that calculates the difference between every n-th frame was proposed [4-6]; skipping frames amplifies the difference value enough for a threshold to detect the shot change. However, noise due to motion or brightness variation is also amplified, so the video becomes over-segmented, and the position of a gradual shot boundary is not exact in this approach.
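As a sketch only (not from the paper; the function and array names are ours, and per-frame feature vectors such as color histograms are assumed to be extracted already), the every-K-th-frame difference can be computed as:

```python
import numpy as np

def k_frame_differences(features: np.ndarray, k: int = 10) -> np.ndarray:
    """L1 difference between the features of frame n and frame n+k.

    features: (num_frames, feature_dim) array, e.g. per-frame color histograms.
    Returns an array of length num_frames - k.
    """
    return np.abs(features[k:] - features[:-k]).sum(axis=1)

# A gradual transition yields tiny neighboring-frame differences (k=1),
# but the k-step difference accumulates the change and can cross a threshold.
hists = np.random.rand(200, 64)        # stand-in for per-frame histograms
d1 = k_frame_differences(hists, k=1)   # neighboring-frame difference
dk = k_frame_differences(hists, k=10)  # amplified every-10th-frame difference
```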

Second, an empirically obtained threshold value is generally used to detect shot boundaries. But the value of the feature difference between frames depends on the characteristics of the video content, so shot detection with a constant threshold value can miss shot boundaries or over-segment the video. John S. Boreczky and Lynn D. Wilcox [7] suggested an HMM (Hidden Markov Model) based approach to avoid the problems arising from a fixed threshold value. In their work, camera effects are described as states of the HMM, and the frames that belong to a 'shot' state are grouped into a video segment. The luminance difference and motion features of the frames within a segment are modeled by one state. The detection of each effect mainly depends on the output probability of the state belonging to that effect and on the state transition probabilities, which represent the statistical occurrence of camera effects.

Other HMM-based approaches focused on modeling semantic scenes [8], where every shot has its own semantics. These approaches can be applied after video segmentation to detect or merge shots by semantics.

In this paper, we propose a new HMM-based framework in which camera effects and video segments each have their own HMMs, so that the probabilistic characteristics of the effects can be represented in the temporal direction. Furthermore, we employ the method of [4] to emphasize the temporal characteristics of abrupt and gradual shot changes, and audio features are used together with visual features to enhance semantic video segmentation.

The multimodal features in the proposed method employ the Scalable Color, Homogeneous Texture, Camera Motion, and Audio Spectrum Envelope descriptors from the MPEG-7 audio/visual feature set, which provides audio and visual descriptors representing the characteristics of multimedia content [9-13]. These descriptors make it possible to separate video content manipulation from feature extraction. Namely, a video authoring or retrieval tool can use metadata containing pre-extracted feature sets without accessing the raw video data in the video content database.

2 Video Segmentation Using HMM Classifier

Figure 1 shows the procedure of the proposed video segmentation, in which a Hidden Markov Model is adopted. First, MPEG-7 audio and visual features are extracted from the video content. Using these features, the HMM classifies video frames into non-transition, abrupt transition, and gradual transition regions; shot boundary detection can thus be considered the detection of abrupt or gradual transition regions. After that, falsely detected shot boundaries are rejected by a shot transition verification process. The audio feature serves as a clue for video segmentation as well: it improves the semantic coherence of the segmentation, under the assumption that portions with similar audio characteristics share similar semantics.


Fig. 1. Proposed video segmentation process

2.1 HMM Based Video Segmentation

The objective of the proposed HMM classifier is to detect meaningful shot boundaries. The visual feature difference between every K-th frame shows a specific waveform in the temporal domain, as shown in Fig. 2: the shape of an abrupt shot change signal is roughly rectangular, while the shape of a gradual change is roughly triangular. We trained the HMMs to recognize these waveforms.

Using HMMs, shot boundaries can be modeled by a set of features and specific waveforms, as in a speech recognition system [14]. For an abrupt shot boundary, the HMM detects a rectangular waveform of length K. The HMM can also detect gradual boundaries of arbitrary length by using the Viterbi algorithm, which has dynamic time warping characteristics [14].
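As an illustration only (this is not the authors' implementation), a Gaussian HMM from the open-source hmmlearn package can be trained on such difference waveforms and can decode a frame sequence with the Viterbi algorithm; the paper's classifier actually uses separate HMMs joined by start and collector states, which this sketch does not model:

```python
import numpy as np
from hmmlearn import hmm

# Assumed: X holds per-frame multimodal feature vectors as in Eq. (2),
# and `lengths` gives the length of each training clip.
X = np.random.rand(3000, 5)            # stand-in training features
lengths = [1000, 1000, 1000]

# One ergodic Gaussian HMM; the state count here is illustrative only.
model = hmm.GaussianHMM(n_components=9, covariance_type="diag", n_iter=50)
model.fit(X, lengths)

# Viterbi decoding labels every frame with a state; states are then mapped
# to non-transition / abrupt-transition / gradual-transition regions.
states = model.predict(X[:1000])
```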

Fig. 2. The difference feature values of the Scalable Color descriptor between every 10th frame for an example video: (a) abrupt shot, (b) gradual shot

Figure 3 shows the structure of the HMM classifier. Three HMMs are designed to classify video segments into non-transition, abrupt transition, and gradual transition regions. Here, 'S' is the starting state and 'C' is the collector state. An ergodic state transition model is used in each HMM.


The Edge Histogram descriptor captures spatial similarity between images and detects camera breaks that cannot be detected by Scalable Color. But Edge Histogram is space variant, so we also use the Homogeneous Texture feature, which is space invariant [15]. To differentiate abrupt cuts from gradual changes, the Scalable Color difference between neighboring frames is also used as feature data.

Fig. 3. HMM classifier for shot boundary detection

The AudioSpectrumEnvelope descriptor has 10 coefficients, and among them we use the value of the lowest frequency band. Neighboring non-transition regions are assumed to be semantically similar if an audio signal persists through the transition region (shot boundary region). Figure 4 shows the extracted audio feature and the camera cuts.

As seen, the first 3 shots are semantically similar even though they are separated by camera cuts. The audio frame length is 18.75 milliseconds and the hop size is 6.25 milliseconds. To synchronize with the video features, we averaged the audio features according to the video frame rate.
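A minimal sketch of this synchronization, assuming the stated 6.25 ms hop (160 audio frames per second) and an assumed video rate of 30 fps (names are ours):

```python
import numpy as np

def sync_audio_to_video(ase: np.ndarray, video_fps: float = 30.0,
                        audio_fps: float = 1000.0 / 6.25) -> np.ndarray:
    """Average per-audio-frame ASE values over each video frame interval."""
    per_video = int(round(audio_fps / video_fps))  # audio frames per video frame
    n_video = len(ase) // per_video
    return ase[:n_video * per_video].reshape(n_video, per_video).mean(axis=1)

ase = np.random.rand(160)                 # one second of lowest-band ASE values
ase_per_frame = sync_audio_to_video(ase)  # ~30 averaged values, one per frame
```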

The feature vector is composed of DT_K(n), DS_1(n), DE_K(n), DS_K(n), and ASE(n), where DS_1(n) is the L1 norm of the difference between the n-th and (n+1)-th frames of the Scalable Color descriptor; DT_K(n), DE_K(n), and DS_K(n) are the L1 norms of the differences between the n-th and (n+K)-th frames of the Homogeneous Texture, Edge Histogram, and Scalable Color descriptors, respectively; and ASE(n) is the averaged spectrum value of the AudioSpectrumEnvelope feature. To normalize the means and variances of the features, DT_K(n), DS_1(n), DE_K(n), DS_K(n), and ASE(n) are normalized as follows:

$$ f_{\mathrm{norm}}(n) = \frac{f(n)}{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} f^{2}(n)}} \qquad (1) $$

where f(n) is the n-th frame feature, f_norm(n) is the normalized feature, and N is the number of frames in the video clip. Therefore, the input feature vector for the HMM can be written as

$$ F_{\mathrm{norm}}(n) = \left( DT_{K}^{\mathrm{norm}}(n),\; DS_{1}^{\mathrm{norm}}(n),\; DS_{K}^{\mathrm{norm}}(n),\; DE_{K}^{\mathrm{norm}}(n),\; ASE^{\mathrm{norm}}(n) \right) \qquad (2) $$
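A sketch of Eqs. (1) and (2) as we read them (the raw difference sequences are assumed to be precomputed; names are ours):

```python
import numpy as np

def rms_normalize(f: np.ndarray) -> np.ndarray:
    """Eq. (1): divide each frame's feature by the clip-level RMS value."""
    return f / np.sqrt(np.mean(f ** 2))

def build_feature_vectors(dt_k, ds_1, ds_k, de_k, ase) -> np.ndarray:
    """Eq. (2): stack the five normalized features into per-frame vectors."""
    cols = [rms_normalize(np.asarray(x)) for x in (dt_k, ds_1, ds_k, de_k, ase)]
    return np.stack(cols, axis=1)  # shape: (num_frames, 5)
```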

Fig. 4. The envelope of the audio spectrum (0–88.338 Hz) with camera cuts marked, for the "science eye" video clip (horizontal axis: frame number)

2.2 Verification of Detected Shot Transition

After detecting abrupt and gradual shot transition regions, erroneously detected shots are rejected by the verification process proposed in this paper. We observed two main causes of erroneous shot transition detection.

First, true shot transition regions do not contain camera motion, so a detected transition region is rejected when camera motion is found within it. Equation (3) defines a flag indicating rejection by camera motion detection:

$$ reject\_by\_motion(T) = \begin{cases} \text{true}, & \text{if } \sum_{i=0}^{13} cm[i] > 0 \\ \text{false}, & \text{otherwise} \end{cases} \qquad (3) $$

where cm = camera_motion_fractionalPresence, the vector of fractional-presence values of the MPEG-7 Camera Motion descriptor computed over the transition region T.

If this flag is true, the transition region T is re-labeled as a non-transition region.
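A sketch of this verification step under our reading of Eq. (3), assuming the 14 fractional-presence values of the MPEG-7 Camera Motion descriptor have been computed for the detected region (names are ours):

```python
import numpy as np

def reject_by_motion(fractional_presence: np.ndarray) -> bool:
    """Eq. (3): reject a detected transition region if any of the 14
    directional camera-motion fractional-presence values is non-zero."""
    assert fractional_presence.shape == (14,)
    return bool(np.any(fractional_presence > 0))

cm = np.zeros(14)
cm[0] = 0.2                  # e.g. slight pan detected inside the region
print(reject_by_motion(cm))  # True -> re-label as non-transition
```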

Verification by camera motion could instead be incorporated into the HMM by using camera motion as an HMM input feature, but there are several problems with this approach. First, camera motion is a shot-level feature, so the start and end frames must be known to calculate it; without knowing the shot boundary, camera motion estimation yields erroneous values. Second, using camera motion as an HMM input feature requires extracting it frame by frame, which is a very time-consuming task.

3 Experiment

In the experiments, we tested the performance and robustness of the proposed method. The performance was compared with a threshold-based method, and robustness was tested by applying the proposed method to videos of diverse genres. The effectiveness of the audio feature was also examined.

3.1 Experimental Condition

In the experiments, we used the MPEG-7 video database [16]. We tested three different video genres: news, golf, and drama. Table 1 shows the number of shot boundaries in the video set used in the experiments. The performance of abrupt and gradual shot boundary detection was tested with this video set.

The HMMs were trained using 10 minutes of video content. Mono-channel, 16 kHz audio was used to extract the audio features. Six states for the non-transition HMM, one state for the abrupt-transition HMM, and two states for the gradual-transition HMM were obtained by iterative performance evaluation during training. The frame difference K was set to 10, determined empirically.
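For concreteness, the stated configuration could be set up as follows with the hmmlearn package (illustrative only; the start and collector states of Fig. 3 are not modeled):

```python
from hmmlearn import hmm

# State counts from the paper's training result: 6 non-transition states,
# 1 abrupt-transition state, and 2 gradual-transition states.
models = {
    "non_transition": hmm.GaussianHMM(n_components=6, covariance_type="diag"),
    "abrupt":         hmm.GaussianHMM(n_components=1, covariance_type="diag"),
    "gradual":        hmm.GaussianHMM(n_components=2, covariance_type="diag"),
}
K = 10  # frame difference step, determined empirically in the paper
```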

3.2 Experimental Results

Tables 2, 3, and 4 show the results of shot boundary detection by the proposed method.


Table 1. Number of shot boundaries in the tested MPEG-7 video set

| Test video | Item number | Content | Number of frames | Abrupt | Gradual |
|---|---|---|---|---|---|
| Golf.mpg | V17 NO. 26 | Golf | 17909 | 31 | 17 |
| Jornaldanoite1.mpg | V1 NO. 14 | News | 33752 | 184 | 36 |
| Pepa_y_Pepe.mpg | V7 NO. 21 | Drama | 16321 | 20 | 1 |

Because the news clip contains several different genres, experimentation with this clip is a good way to test the robustness of the segmentation algorithm; the news clip used for the experiment contains soccer, marathon, dance, and interview scenes.

As seen in Table 2, few abrupt shot boundaries were missed. Most of the missed abrupt cuts were due to an aliasing effect of adjacent gradual shots: when an abrupt shot change occurs near a gradual one, the two regions are likely to overlap and converge into one shot. The missed gradual shots in Table 3 were related to the power of the signal. In the news clip, there was a burst of 5 missed dissolves between frames 32670 and 33003, where the video frames neighboring the dissolve regions are similar.

Table 2. Results of abrupt shot detection

| Contents | Original | Correct | Missed |
|---|---|---|---|
| News | 184 | 177 | 7 |
| Golf | 31 | 31 | 0 |
| Drama | 20 | 18 | 2 |

Table 3. Results of gradual shot detection

| Contents | Original | Correct | Missed |
|---|---|---|---|
| News | 36 | 31 | 5 |
| Golf | 17 | 15 | 2 |
| Drama | 1 | 1 | 0 |

Tables 4 and 5 show the effects of the audio feature. In terms of raw correctness of shot boundary detection, the visual-only HMM could be better than the multimodal HMM. However, in the news clip, as seen in Tables 4 and 5, the 6 additional "missed" gradual shot boundaries were semantically merged by the audio feature, which means the audio feature can merge semantically homogeneous shots. For example, an anchor's speech can continue across camera cuts; such shots are semantically linked, and the audio feature combines them into the same scene. The one additional missed shot in the drama clip is also due to audio similarity with the neighboring shot. The HMM classifier without the audio feature detects camera breaks and cuts more correctly, but fails to merge shot fragments into meaningful shots.

We compared the proposed method with a threshold-based algorithm that shows good results in detecting gradual shot boundaries [6]. In this algorithm, the color histogram difference between every 20th frame is calculated, and a shot boundary is declared when the ratio of the difference exceeds a predefined threshold value chosen to be sensitive enough to detect gradual shot boundaries. Tables 4 and 6 allow the proposed method and the threshold-based method to be compared.


Table 5. Results of shot detection without the audio feature

| Contents | Original | Correct | Missed | False | Recall | Precision |
|---|---|---|---|---|---|---|
| News | 220 | 214 | 6 | 17 | 0.973 | 0.926 |
| Golf | 48 | 46 | 2 | 22 | 0.958 | 0.676 |
| Drama | 21 | 20 | 1 | 5 | 0.952 | 0.800 |

Table 6. Results of shot detection by color histogram thresholding

| Contents | Original | Correct | Missed | False | Recall | Precision |
|---|---|---|---|---|---|---|
| News | 220 | 205 | 15 | 22 | 0.932 | 0.903 |
| Golf | 48 | 46 | 2 | 50 | 0.900 | 0.479 |
| Drama | 21 | 21 | 0 | 27 | 1.000 | 0.438 |

Table 7 shows the performance of the verification process, which removed 50 to 60% of the falsely detected shots. Camera motion and short-term disturbances were the dominant causes of false detection; some of the remaining falsely detected shots were due to failures of camera motion estimation.

Table 7. Number of false shots related with the verification process

| Contents | False before verification | False after verification |
|---|---|---|
| News | 30 | 17 |
| Golf | 40 | 22 |
| Drama | 4 | 4 |

4 Conclusions and Future Works

In this paper, a video segmentation algorithm with multimodal features was proposed. By using both audio and visual features, erroneous shot boundary detection and over-segmentation were avoided compared with conventional algorithms. The experimental results show that the proposed method is effective in shot detection. The results also showed that shots could be merged into more semantically meaningful scenes. To increase the merging efficiency, shot-level features could be considered in future work.

Acknowledgement. This research was supported in part by the "SmarTV" project of the Electronics and Telecommunications Research Institute.

References

1. Wang, H., Divakaran, A., Vetro, A., Chang, S.F., Sun, H.: Survey of compressed-domain features used in audio-visual indexing and analysis, J. Vis. Commun. Image R. 14 (2003) 150-183

2. Gargi, U., Kasturi, R., Strayer, S.H.: Performance Characterization of Video-Shot-Change Detection Methods, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 10 (2000)

3. Huang, J., Liu, Z., Wang, Y.: Integration of audio and visual information for content-based video segmentation, in Proc. IEEE Int. Conf. Image Processing (ICIP98), vol. 3, Chicago, IL, (1998) 526-530

4. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video, Multimedia Systems (1993) 10-28

5. Li, Y., Kuo, C.C.J.: Video Content Analysis Using Multimodal Information, Kluwer Academic Publishers (2003)

6. Shim, S.H., et al.: Real-time Shot Boundary Detection for Digital Video Camera using the MPEG-7 Descriptor, SPIE Electronic Imaging, Vol. 4666 (2002) 161-171

7. Boreczky, J.S., Wilcox, L.D.: A hidden Markov model framework for video segmentation using audio and image features, in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), Vol. 6, Seattle, WA, May (1998) 3741-3744

8. Wolf, W.: Hidden Markov Model Parsing of Video Programs, in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-97), Vol. 4, April (1997) 2609-2611

9. ISO/IEC JTC1/SC29/WG11 N4980, MPEG-7 Overview, July (2002)

10. ISO/IEC 15938-4, Information Technology – Multimedia Content Description Interface – Part 4: Audio, (2001)

11. ISO/IEC 15938-3, Information Technology – Multimedia Content Description Interface – Part 3: Visual, July (2002)

12. ISO/IEC JTC1/SC29/WG11 MPEG02/N4582, MPEG-7 Visual part of eXperimentation Model Version 13.0, March (2002)

13. Manjunath, B.S., Salembier, P., Sikora, T.: Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, Ltd. (2002)

14. Huang, X., Acero, A., Hon, H.W.: Spoken Language processing, Prentice Hall, (2001) 377-409

15. Ro, Y.M., et al.: MPEG-7 Homogeneous Texture Descriptor, ETRI Journal, Vol. 23, No. 2, June (2001)

16. ISO/IEC JTC1/SC29/WG11/N2467, Description of MPEG-7 Content Set, Oct. (1998)
