

National Science Council, Executive Yuan — Research Project Final Report

An Automatic Bird Species Recognition System Based on Syllable Features — Final Report (Condensed Version)

Project Type: Individual
Project Number: NSC 95-2221-E-216-024-
Project Period: August 1, 2006 to July 31, 2007
Institution: Department of Computer Science and Information Engineering, Chung Hua University
Principal Investigator: 周智勳
Project Participants: Master's students (part-time assistants) 許永慶 and 陳彥良
Disclosure: This project involves patents or other intellectual property rights and may be made publicly available one year after completion.

October 31, 2007


National Science Council Funded Research Project — ■ Final Report □ Interim Progress Report

An Automatic Bird Species Recognition System Based on Syllable Features

Project Type: ■ Individual Project □ Integrated Project
Project Number: NSC 95-2221-E-216-024-
Project Period: August 1, 2006 to July 31, 2007
Principal Investigator: 周智勳
Co-Investigator: (none)
Project Participants: 許永慶, 陳彥良

Report Type (as required by the approved funding list): ■ Condensed Report □ Full Report

Required attachments included with this report:
□ Report on an overseas business trip or study visit
□ Report on a business trip or study visit to Mainland China
□ Report on attendance at an international academic conference, with the presented paper
□ Foreign partner's research report for an international cooperative project

Disclosure: Except for industry-academia cooperation projects, industrial technology upgrading and personnel training projects, monitored projects, and the cases indicated below, the report may be made publicly available immediately.
■ The project involves patents or other intellectual property rights and may be made publicly available after ■ one year □ two years.

Institution: Department of Computer Science and Information Engineering, Chung Hua University

October 31, 2007


Chinese Abstract:

This project uses the principal frequency sequence of birdsong as a syllable feature and builds HMM syllable-type models from these sequences as the basis for bird species recognition. In the training phase, the song signal is converted into a frequency-domain feature, the principal frequency sequence, and the continuous signal is segmented into syllables according to spectral energy, with the syllable serving as the recognition unit. Syllables belonging to the same cluster are treated as one syllable type, and an HMM syllable-type model is built for each type. In the testing phase, a single observation sequence is formed for every syllable type contained in the test sample; these sequences are compared against the HMM parameters in the database to obtain similarity scores, and a scoring scheme then determines the bird species.

Keywords: bird species recognition, birdsong, hidden Markov model, principal frequency sequence, syllable

Abstract

In this project, a bird species recognition system based on bird sounds is proposed. In this system, the birdsong of a bird species is segmented into many syllables, from which several principal frequency sequences are obtained. Using the statistics of the principal frequency sequences, all the syllables are clustered with the fuzzy C-means clustering method, so that each syllable group can be modeled by a hidden Markov model (HMM) characterizing the features of the song of that bird species. Using the Viterbi algorithm, recognition is achieved by finding the template bird species whose HMMs most probably match the frequency sequences of the test birdsong. Experimental results show that the proposed system achieves a recognition rate of over 78% for 420 bird species.

Keywords: Bird species recognition, birdsong, HMM, principal frequency sequence, syllable


1. Introduction

There are many studies of human speaker recognition, and some of them have been applied to bird species recognition. However, the vocalization of any particular bird species is as diverse as that of human beings, and specific sound features are required for bird species recognition. Birds produce two types of vocalization: birdsong and birdcall. Birdsong, which is complicated, varied, and pleasant to listen to, is usually produced by a male bird to declare his territory and attract a mate. Birdcall, on the other hand, is monotonous, brief, repetitive, fixed, and common to both sexes, and is used to contact or alert companions.

Birdsongs are typically divided into four hierarchical levels: note, syllable, phrase, and song [1], of which the syllable plays an especially important role in bird species recognition. The DTW algorithm was used in one study to recognize the syllables of two bird species [2].

The authors in [3] found that many bird sounds have clear harmonic spectrum structures, and they used them to classify bird syllables into four classes. A template-based technique combining time delay neural networks was proposed to automatically recognize the syllables of 16 bird species [4]. In [5] syllables were used to deal with the overlapping problem of the sound waveforms of multiple birds, and their frequencies and amplitudes were used to form the feature vectors for recognizing 14 bird species. Instead of extracting features syllable by syllable, the histogram based on consecutive syllables was used to reveal the temporal structure of the birdsong [6]. Combination of syllables with other features can be found in [7-9].

In this study, the syllables of a birdsong were extracted and used to construct principal frequency sequences for HMM modeling in order to characterize a bird species. Using the Viterbi algorithm, a test birdsong was recognized by finding the template bird species with the largest number of most-probable HMM matches. Following this introduction, the developed automatic bird species recognition system is described in Section 2, Section 3 presents simulation results examining the system performance, and Section 4 concludes the report.

2. The proposed recognition system

The block diagram of the proposed system, containing the training part and the recognition part, is shown in Fig. 1. After establishing the syllable HMMs for the template bird species, recognition is achieved by comparing the matching degrees of the test birdsong against the HMMs of all the template birdsongs. The details are described in the following subsections.

2.1. The extraction of principal frequencies

Birdsongs are non-stationary signals requiring short-time analysis. In this study, a rectangular window of size N = 512 samples was applied.

After segmentation, the FFT was applied to each frame:

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1,   (1)

which in polar coordinate form becomes

X[k] = |X[k]| \, e^{j\theta[k]},   (2)

where k is the frequency index, called the frequency bin, and |X[k]| and \theta[k] denote the spectrum magnitude and spectrum phase of bin k.

Applying the short-time Fourier transform to each frame yields a magnitude spectrum and a phase spectrum. The magnitude spectrum of each frame forms a narrow time-frequency strip, and aligning all the strips of a birdsong forms the time-frequency spectrogram of the birdsong signal, in which the grey levels reflect the strengths of the frequency components. Each trajectory in the spectrogram represents a syllable pattern.

To extract the principal frequency of a frame, say frame m, the frequency bin with the greatest grey level (magnitude) is retained and denoted bin_m:

bin_m = \arg\max_{0 \le k \le N/2 - 1} |X[k]|.   (3)

This retained bin, called the principal frequency of frame m, represents the spectral peak of the frame. The principal frequency and its corresponding magnitude form the feature vector of the frame:

f_m = \big[\, bin_m, \; |X[bin_m]| \,\big]^{T}.   (4)
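To make Eqs. (1)-(4) concrete, the following Python sketch frames a birdsong signal with a 512-sample rectangular window, computes the FFT magnitude of each frame, and keeps the strongest bin together with its magnitude as the frame feature vector f_m. This is a minimal illustration, not the authors' implementation; the function name `extract_frame_features` and the hop size (one quarter of the frame, matching the three-quarters overlap mentioned in Section 3) are assumptions of this sketch.

```python
import numpy as np

def extract_frame_features(x, frame_size=512, hop=128):
    """Return rows (bin_m, |X[bin_m]|) for every frame m; cf. Eqs. (1)-(4).

    A rectangular window is assumed; hop=128 corresponds to the
    three-quarters frame overlap used in the experiments.
    """
    n_frames = 1 + max(0, (len(x) - frame_size) // hop)
    feats = np.zeros((n_frames, 2))
    for m in range(n_frames):
        frame = x[m * hop: m * hop + frame_size]
        mag = np.abs(np.fft.fft(frame, frame_size))   # |X[k]| of Eqs. (1)-(2)
        half = mag[: frame_size // 2]                 # bins 0 .. N/2 - 1 (0..255)
        bin_m = int(np.argmax(half))                  # Eq. (3): spectral peak
        feats[m] = (bin_m, half[bin_m])               # Eq. (4): feature vector f_m
    return feats
```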

2.2. The principal frequency sequences

Obtaining the principal frequency sequence of a syllable requires segmenting the syllables from the spectrogram, as described in the following steps (a code sketch follows the list):

1. Find the feature vectors of all frames of the birdsong signal:

   f_m = \big[\, bin_m, \; |X[bin_m]| \,\big]^{T}, \quad m = 1, 2, \ldots, M.   (5)

2. Initialize the syllable index j, j=1.

3. From the feature vectors of all frames, find the frame t at which the maximum magnitude occurs,

   t = \arg\max_{1 \le m \le M} |X[bin_m]|,   (6)

   and set the amplitude (in dB) of syllable j as

   A_j = 20 \log_{10} |X[bin_t]|.   (7)

4. Starting from frame t, move backward and forward until frames h_j and t_j are reached such that both 20 \log_{10} |X[bin_{h_j}]| and 20 \log_{10} |X[bin_{t_j}]| are smaller than (A_j − α); h_j and t_j are called the head frame and tail frame of syllable j. In this study the parameter α was set to 25 dB.

5. Record B_{s_j} = \big( bin_{h_j}, bin_{h_j+1}, \ldots, bin_{t_j-1}, bin_{t_j} \big) as the principal frequency sequence of syllable j.

6. Update the frame feature vectors of the birdsong by

   f_m = [\, 0, \; 0 \,]^{T}, \quad m = h_j, h_j+1, \ldots, t_j-1, t_j.   (8)

7. Let j = j + 1.

8. Repeat Step 3 to Step 7 until A_j < A_1 − α.
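A minimal sketch of Steps 1-8 follows, assuming the frame features come from the `extract_frame_features` sketch above; the function name and the handling of boundary frames are illustrative choices, and α defaults to the 25 dB value given in Step 4.

```python
import numpy as np

def segment_syllables(feats, alpha_db=25.0):
    """Extract principal frequency sequences B_sj from frame features (Steps 1-8).

    feats : array of shape (M, 2) with rows (bin_m, |X[bin_m]|).
    Returns a list of integer arrays, one principal frequency sequence per syllable.
    """
    mags = feats[:, 1].copy()
    bins = feats[:, 0].astype(int)
    eps = 1e-12                                  # avoid log10(0) after zeroing frames
    a1_db = 20.0 * np.log10(mags.max() + eps)    # amplitude A_1 of the loudest syllable
    syllables = []
    while True:
        t = int(np.argmax(mags))                 # Eq. (6): frame of maximum magnitude
        a_j = 20.0 * np.log10(mags[t] + eps)     # Eq. (7): amplitude A_j of syllable j
        if a_j < a1_db - alpha_db:               # Step 8: stop when A_j < A_1 - alpha
            break
        thresh = a_j - alpha_db
        h = t
        while h > 0 and 20.0 * np.log10(mags[h - 1] + eps) >= thresh:
            h -= 1                               # Step 4: move backward to head frame h_j
        e = t
        while e < len(mags) - 1 and 20.0 * np.log10(mags[e + 1] + eps) >= thresh:
            e += 1                               # Step 4: move forward to tail frame t_j
        syllables.append(bins[h:e + 1].copy())   # Step 5: record B_sj
        mags[h:e + 1] = 0.0                      # Step 6: zero the frames of syllable j
    return syllables
```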

2.3. Syllable clustering

A birdsong is usually composed of many syllables but contains only a few syllable patterns because of syllable repetition. It is therefore more practical to build an HMM for a group of similar syllables than for each individual syllable. In this study, the fuzzy c-means (FCM) clustering method was applied to cluster the syllables. Assume J syllables are obtained from the song of a bird species and denote the corresponding J principal frequency sequences by B_{s_j}, j = 1, 2, \ldots, J; then the statistic vector of the j-th syllable is computed as

v_{s_j} = \big( E[(B_{s_j})^1], \; E[(B_{s_j})^2], \; E[(B_{s_j})^3] \big)^{T},   (9)

where E[(B_{s_j})^k] denotes the k-th moment of the principal frequencies of syllable j and is calculated by

E[(B_{s_j})^k] = \frac{1}{t_j - h_j + 1} \sum_{m=h_j}^{t_j} (bin_m)^k.   (10)

The FCM algorithm was applied to cluster the statistic vectors of a birdsong when the variance of the statistic vectors was greater than a predefined threshold value.

In the clustering process, the optimal cluster number c_opt was determined with the WB index proposed in [10]. The principal frequency sequences of each syllable cluster were then used to establish an HMM.
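The sketch below illustrates the statistic vector of Eqs. (9)-(10) and the clustering step with a compact from-scratch fuzzy c-means; the report's WB-index selection of the optimal cluster number c_opt [10] is not reproduced here, and all function names and the fuzzifier m = 2 are assumptions of this sketch.

```python
import numpy as np

def syllable_moments(B_sj, k_max=3):
    """Statistic vector v_sj of Eq. (9): first k_max moments of the
    principal frequencies of one syllable, computed as in Eq. (10)."""
    b = np.asarray(B_sj, dtype=float)
    return np.array([np.mean(b ** k) for k in range(1, k_max + 1)])

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Compact fuzzy c-means: X has shape (J, d); returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                       # random fuzzy partition
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # weighted cluster centers
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-9
        U_new = dist ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)                           # standard FCM membership update
        if np.max(np.abs(U_new - U)) < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Usage: cluster the syllables of one template birdsong.
# V = np.array([syllable_moments(s) for s in segment_syllables(feats)])
# centers, U = fuzzy_c_means(V, c=4)        # c would be chosen with the WB index [10]
# labels = U.argmax(axis=0)                 # hard cluster label of each syllable
```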

2.4. Construct HMMs for the bird species

In this study, a 3-state ergodic (fully connected) HMM, as shown in Fig. 2, was utilized. The state-transition probability matrix is denoted by A = [a_{ij}], in which the a_{ij}, i, j = 1, 2, 3, are nonnegative. Since the training data are the principal frequency sequences of a syllable group, the principal frequencies form the possible observations of each state in the training process. The possible principal frequencies (frequency bins) range from 0 to 255, so the set of possible observations of each state is V = \{0, 1, 2, \ldots, 255\}. Training an HMM therefore amounts to determining the initial state probabilities \pi_i, i = 1, 2, 3, the state-transition probability matrix A, and the state observation probabilities B = \{ b_i(v_k) \}, v_k = 0, 1, 2, \ldots, 255, i = 1, 2, 3, from the principal frequency sequences (observation sequences) of a syllable group.

Assume there are k principal frequency sequences in a syllable group, represented by B_s = \{ B_{s_1}, B_{s_2}, \ldots, B_{s_k} \}; then B_s forms the observation sequence set for training an HMM. To train the HMM parameters from B_s, a widely used expectation-maximization (EM) algorithm, the Baum-Welch algorithm, was applied [11]. After the training phase, each template bird species was modeled by several HMMs. To recognize a test birdsong, its principal frequency sequences were extracted, and the sequences belonging to the same syllable cluster were linked into a single observation sequence, to which the Viterbi algorithm was applied to find the most probable HMM λ (bird species) generating that sequence [11]. A test birdsong usually contains several syllable clusters, resulting in several matching degrees of the most probable HMMs; these matching degrees were used for species recognition, as demonstrated in the next section.
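As a sketch of the training and decoding steps described above, the code below assumes the third-party hmmlearn package, whose CategoricalHMM fits a discrete-observation HMM with Baum-Welch (EM) and decodes with the Viterbi algorithm; the 3-state topology and the 256-symbol alphabet come from the text, while the helper names and hyperparameters are illustrative.

```python
import numpy as np
from hmmlearn import hmm   # assumed dependency; CategoricalHMM = discrete-observation HMM

def train_cluster_hmm(sequences, n_states=3, n_symbols=256):
    """Fit one HMM per syllable cluster with the Baum-Welch algorithm, cf. [11].

    sequences : list of 1-D integer arrays (principal frequency sequences,
                symbols 0..255) belonging to one syllable cluster.
    """
    X = np.concatenate(sequences).reshape(-1, 1)      # stacked observation symbols
    lengths = [len(s) for s in sequences]             # lengths of the individual sequences
    model = hmm.CategoricalHMM(n_components=n_states, n_features=n_symbols,
                               n_iter=50, random_state=0)
    model.fit(X, lengths)                             # EM / Baum-Welch training
    return model

def matching_degree(model, sequence):
    """Viterbi log-probability of a linked observation sequence under one HMM."""
    logprob, _ = model.decode(np.asarray(sequence).reshape(-1, 1), algorithm="viterbi")
    return logprob
```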


3. Experimental results

The bird species vocalization database was obtained from a commercial CD [12], which contains both the birdsongs and the birdcalls of 420 bird species, making it a much larger database than those used in previous studies. The vocalization signals are sampled at 44.1 kHz with 16-bit resolution in mono PCM format. In the experiments, the frame size was set to 512 samples with three-quarters frame overlap. For each experiment, two-thirds of the birdsongs were randomly selected for training and the remainder for testing. The recognition rate RR was defined as the number of species recognized correctly divided by the total number of species.
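A minimal sketch of the evaluation protocol just described (random two-thirds/one-third split per species and the species-level recognition rate); the dictionary layout and the `recognize` callback, which stands in for the whole HMM pipeline, are assumptions of this sketch.

```python
import numpy as np

def recognition_rate(species_songs, recognize, seed=0):
    """One evaluation round: two-thirds of each species' songs for training,
    the rest for testing; RR = correctly recognized species / total species.

    species_songs : dict mapping species id -> list of birdsong signals.
    recognize     : callable(train_split, test_song) -> predicted species id.
    """
    rng = np.random.default_rng(seed)
    train, test = {}, {}
    for sp, songs in species_songs.items():
        order = rng.permutation(len(songs))
        cut = max(1, (2 * len(songs)) // 3)
        train[sp] = [songs[i] for i in order[:cut]]
        test[sp] = [songs[i] for i in order[cut:]]
    correct = sum(1 for sp, songs in test.items()
                  if songs and recognize(train, songs[0]) == sp)
    return correct / len(species_songs)
```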

A. Experiments with different scales of HMMs

In the recognition phase, a test birdsong usually has several matching degrees to the most probable HMMs, so a rating method is required for recognition. Let syllable k of the test birdsong be denoted test_k and syllable j of template birdsong i be denoted temp_{ij}. The matching degree of test_k with respect to the HMM of temp_{ij} is represented by m(test_k, temp_{ij}), k = 1, 2, \ldots, s, where s is the number of syllables of the test birdsong. In the matching process, for each test syllable test_k, the most likely template syllable is found by

\arg\max_{i,j} \; m(test_k, temp_{ij}),   (11)

and the number of times species i appears in (11) over all k, k = 1, 2, \ldots, s, is the number of votes for bird species i. The species with the largest number of votes is identified as the bird species of the test birdsong.
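A sketch of the voting rule of Eq. (11): each syllable cluster of the test birdsong votes for the template species whose syllable HMM yields the largest matching degree, and the species with the most votes wins. The container layout is an assumption, and `matching_degree` refers to the hedged Viterbi sketch in Section 2.4.

```python
from collections import Counter

def recognize_species(test_sequences, species_hmms):
    """Vote-based species recognition, cf. Eq. (11).

    test_sequences : list of linked principal frequency sequences, one per
                     syllable cluster of the test birdsong.
    species_hmms   : dict mapping species id -> list of trained syllable HMMs.
    """
    votes = Counter()
    for seq in test_sequences:
        best_species, best_score = None, float("-inf")
        for species, models in species_hmms.items():
            for model in models:                        # candidate template syllable HMMs
                score = matching_degree(model, seq)     # Viterbi log-probability
                if score > best_score:
                    best_species, best_score = species, score
        votes[best_species] += 1                        # one vote per test syllable cluster
    return votes.most_common(1)[0][0]                   # species with the most votes
```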

Each experiment was performed 30 times, and the performance indices included the maximum recognition rate (Max), minimum recognition rate (Min), average recognition rate (Avg) and variance of the recognition rate (Var). The experimental results using three scales of HMMs are given in Table 1, which shows that the system achieved average RRs of about 78%. The 3-state HMM performed slightly better than the other two scales, so the 3-state HMM was used in the following experiments.

B. Experiments with different feature dimensions

As stated in Eq. (9), the first three moments of the principal frequencies in B_{s_j} were calculated to construct the three-dimensional feature vector v_{s_j} of syllable j. In this experiment, the first one, three, five and seven moments were computed to form feature vectors of different dimensions. Each case was run 10 times; the comparison is shown in Table 2. The 3-dimensional case exhibited superior results for the first three performance indices and a comparable result for the last, which supports the use of a 3-dimensional feature vector.

4. Conclusions

When recognizing bird species by their songs, characteristics such as the spectral bandwidth, energy concentration, and rapid spectral variation make the recognition process distinct from that of the human voice. In this study, principal frequency sequences were extracted for species-level HMM modeling. Using the Viterbi algorithm, recognition was achieved by finding the most probable HMMs for the test birdsong. The proposed recognition system achieved an average RR of 78.3%.

References

[1] C.K. Catchpole and P.J.B. Slater, Bird Song: Biological Themes and Variations, Cambridge University Press, 1995.

[2] S.E. Anderson, A.S. Dave and D. Margoliash, "Template-based automatic recognition of birdsong syllables from continuous recordings," Journ. Acoust. Soc. Amer., vol. 100, no. 2, pp. 1209-1219, 1996.

[3] A. Härmä and P. Somervuo, "Classification of the harmonic structure in bird vocalization," in Proc. IEEE Intern. Conf. Acoust., Speech, Signal Proc., vol. 5, pp. V-701-4, 2004.

[4] S.A. Selouani, et al., "Automatic birdsong recognition based on autoregressive time-delay neural networks," in Proc. ICSC Congr. Comput. Intellig. Methods Appli., pp. 1-6, 2005.

[5] A. Härmä, "Automatic identification of bird species based on sinusoidal modeling of syllables," in Proc. IEEE Intern. Conf. Acoust., Speech, Signal Proc., vol. 5, pp. 545-548, 2003.

[6] P. Somervuo and A. Härmä, "Bird song recognition based on syllable pair histograms," in Proc. IEEE Intern. Conf. Acoust., Speech, Signal Proc., vol. 5, pp. V-825-8, 2004.

[7] A.L. McIlraith and H.C. Card, "Bird song identification using artificial neural networks and statistical analysis," in Proc. Canadian Conf. Electr. Comp. Engin., vol. 1, pp. 63-66, 1997.

[8] A.L. McIlraith and H.C. Card, "A comparison of backpropagation and statistical classifiers for bird identification," in Proc. IEEE Intern. Conf. Neural Networks, vol. 1, pp. 100-104, 1997.

[9] A.L. McIlraith and H.C. Card, "Birdsong recognition using backpropagation and multivariate statistics," IEEE Trans. Signal Proc., vol. 45, no. 11, pp. 2740-2748, 1997.

[10] J.H. Tan, On Cluster Validity for Fuzzy Clustering, Master's thesis, Department of Applied Mathematics, Chung Yuan Christian University, Taiwan, R.O.C., 2000.

[11] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[12] T. Kabaya and M. Matsuda, The Songs & Calls of 420 Birds in Japan, SHOGAKUKAN Inc., Tokyo, 2001.

Figure 2. The applied 3-state ergodic HMM.

Table 1. Experimental results (RR, %) with three scales of HMMs.

Indices   3-state   4-state   5-state
Max       80.6      81.1      81.8
Min       74.9      74.4      72.7
Avg       78.2      78.0      77.6
Var       2.1       1.0       2.8

Table 2. Experimental results (RR, %) with different feature dimensions.

RR (%)    1-D    3-D    5-D    7-D
Max       79.2   80.6   79.2   77.6
Min       70.9   75.6   74.9   74.7
Avg       75.6   78.3   76.8   75.9
Var       5.7    2.5    2.1    1.3

Figure 1. Block diagram of the proposed system.


Project self-evaluation:

1. Consistency of the research with the original proposal and achievement of the expected goals:

The work expected in the proposal was to complete an automatic birdsong recognition system that, based on the vocalization mechanism and timbre characteristics of birds, extracts suitable features from birdsong signals as the acoustic signature of the signal and uses this signature for species recognition. The planned work items were:

1. Pre-processing of birdsong signals, including the short-time Fourier transform.
2. An algorithm for extracting syllables from birdsong signals.
3. An algorithm for extracting timbre features from birdsong syllables.
4. Clustering of the feature vectors.
5. Building an HMM for each syllable cluster of each bird species.
6. Recognizing different bird species with the established HMMs.

According to the report above, this project is fully consistent with the results expected in the proposal.

2. Academic or applied value of the results, suitability for journal publication or patent application, main findings, and other related value:

The results of this project have been presented at the Second International Conference on Innovative Computing, Information and Control (September 5-7, 2007, Kumamoto City International Center, Kumamoto, Japan, http://www.ijicic.org/icicic2007.htm) and are currently being extended into a full-length paper for submission to an SCI journal.
