# Singing Voice Separation in Music

1. Vocal detection
2. Vocal separation

## Principle of Vocal Detection

For a stereo karaoke-style recording, let ${\displaystyle m_{t}}$ be the accompaniment channel and let the vocal channel be ${\displaystyle v_{t}=s_{t}+m'_{t}}$, where ${\displaystyle s_{t}}$ is the singing voice and ${\displaystyle m'_{t}}$ is a near-copy of the accompaniment. For each frame, estimate a gain ${\displaystyle a_{t}}$ and a delay ${\displaystyle b_{t}}$ that best cancel the accompaniment:

${\displaystyle \min _{a_{t},b_{t}}E\left\{\left|v_{t}-a_{t}\,m_{t+b_{t}}\right|^{2}\right\}=\min _{a_{t},b_{t}}E\left\{\left|s_{t}+m'_{t}-a_{t}\,m_{t+b_{t}}\right|^{2}\right\},\quad -B\leqslant b_{t}\leqslant B}$

A frame whose residual energy remains large after this cancellation is likely to contain vocals.
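A minimal sketch of this per-frame estimation, assuming the two channels have already been split into frames; the function name `align_frame` and the exhaustive lag search are illustrative, not the cited system's implementation:

```python
import numpy as np

def align_frame(x, m, B=4):
    """Estimate gain a and integer lag b (|b| <= B) minimizing
    E{|x - a * m_shifted|^2}, where x is a frame from the vocal channel
    and m the corresponding frame from the accompaniment channel.
    Returns (a, b, residual_energy); the residual approximates the vocals."""
    best = (0.0, 0, np.sum(x ** 2))
    for b in range(-B, B + 1):
        m_shift = np.roll(m, b)              # delay-compensated accompaniment
        denom = np.dot(m_shift, m_shift)
        if denom == 0:
            continue
        a = np.dot(x, m_shift) / denom       # least-squares gain for this lag
        e = np.sum((x - a * m_shift) ** 2)   # residual energy
        if e < best[2]:
            best = (a, b, e)
    return best
```

With a pure accompaniment frame (no vocals), the residual drops to nearly zero; with vocals present, it stays large, which is the detection cue.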

## Principle of Vocal Separation

• Compute the N-point short-time Fourier transform (STFT) of the song to obtain the spectrogram ${\displaystyle X=Me^{jP}}$, where ${\displaystyle M\in \mathbb {R} ^{f\times t}}$ is the magnitude and ${\displaystyle P\in \mathbb {R} ^{f\times t}}$ is the phase.
• Following RPCA, decompose the matrix M into a low-rank matrix L and a sparse matrix S. The rationale: instrument sounds are steadier than the singing voice, and the accompaniment usually repeats the same musical structure, so it can be treated as a low-rank signal. The singing voice varies far more, making it relatively high-rank and sparse in both the time domain and the frequency domain. The resulting matrix S therefore consists mostly of vocals, while L consists mostly of background music.

${\displaystyle \min _{M=L+S}\|L\|_{*}+\lambda \|S\|_{1}}$

${\displaystyle \|\cdot \|_{*}}$: nuclear norm
${\displaystyle \|S\|_{1}}$: ${\displaystyle \ell _{1}}$ norm
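This decomposition can be computed with the inexact augmented Lagrange multiplier (ALM) method via singular value thresholding. The NumPy sketch below follows the standard IALM recipe; the default ${\displaystyle \lambda =1/{\sqrt {\max(m,n)}}}$, the dual initialization, and the step parameters are common choices from the RPCA literature, not details stated in this document:

```python
import numpy as np

def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M ~ L + S, solving min ||L||_* + lam * ||S||_1  s.t.  M = L + S
    by inexact augmented Lagrange multipliers."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))          # common default weight
    norm_M = np.linalg.norm(M, 'fro')
    spec = np.linalg.norm(M, 2)                 # largest singular value
    Y = M / max(spec, np.abs(M).max() / lam)    # dual variable init
    mu, mu_bar, rho = 1.25 / spec, 1.25 / spec * 1e7, 1.5
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-update: singular value thresholding of (M - S + Y/mu)
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-update: elementwise soft-thresholding (shrinkage)
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        R = M - L - S                           # constraint residual
        Y = Y + mu * R
        mu = min(rho * mu, mu_bar)
        if np.linalg.norm(R, 'fro') / norm_M < tol:
            break
    return L, S
```

Applied to the magnitude spectrogram, `rpca(M)` yields the low-rank accompaniment estimate L and the sparse vocal estimate S.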

• Apply the inverse STFT (ISTFT) to L and S, reattaching the mixture phase P, to recover the time-domain accompaniment and vocal signals.
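Putting the steps together, a minimal pipeline might look like the following, using SciPy's `stft`/`istft`. The function name `separate` and the pluggable `decompose` callback (standing in for an RPCA solver that returns the low-rank and sparse magnitude matrices) are hypothetical:

```python
import numpy as np
from scipy.signal import stft, istft

def separate(x, fs, decompose, nperseg=1024):
    """Split a mono mixture x into (accompaniment, vocal) estimates:
    STFT -> decompose the magnitude into low-rank L and sparse S ->
    reattach the mixture phase -> ISTFT."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    Mmag, P = np.abs(X), np.angle(X)   # magnitude M and phase P of X = M e^{jP}
    L, S = decompose(Mmag)             # e.g. an RPCA solver
    # Re-synthesize both components with the original mixture phase
    _, music = istft(L * np.exp(1j * P), fs=fs, nperseg=nperseg)
    _, vocal = istft(S * np.exp(1j * P), fs=fs, nperseg=nperseg)
    return music, vocal
```

Reusing the mixture phase for both components is the usual shortcut, since RPCA only models magnitudes; a binary or soft time-frequency mask can be applied to S before resynthesis to reduce artifacts.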

## References

1. P. S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), pp. 57-60, 2012.
2. H. M. Yu, W. H. Tsai, and H. M. Wang, "A query-by-singing system for retrieving karaoke music," IEEE Trans. Multimedia, vol. 10, pp. 1626-1637, 2008.
3. M. Rocamora and P. Herrera, "Comparing audio descriptors for singing voice detection in music audio files," in Proc. 11th Brazilian Symposium on Computer Music, São Paulo, Brazil, 2007.