# 語言模型

## 模型类型

### 单元语法（unigram）

a 0.1
world 0.2
likes 0.05
we 0.05
share 0.3
... ...
${\displaystyle \sum _{\text{term in doc}}P({\text{term}})=1\,}$

${\displaystyle P({\text{query}})=\prod _{\text{term in query}}P({\text{term}})}$

a 0.1 0.3
world 0.2 0.1
likes 0.05 0.03
we 0.05 0.02
share 0.3 0.2
... ... ...

### n-元语法

${\displaystyle P(w_{1},\ldots ,w_{m})=\prod _{i=1}^{m}P(w_{i}\mid w_{1},\ldots ,w_{i-1})\approx \prod _{i=1}^{m}P(w_{i}\mid w_{i-(n-1)},\ldots ,w_{i-1})}$

${\displaystyle P(w_{i}\mid w_{i-(n-1)},\ldots ,w_{i-1})={\frac {\mathrm {count} (w_{i-(n-1)},\ldots ,w_{i-1},w_{i})}{\mathrm {count} (w_{i-(n-1)},\ldots ,w_{i-1})}}}$

#### 例子

{\displaystyle {\begin{aligned}&P({\text{I, saw, the, red, house}})\\\approx {}&P({\text{I}}\mid \langle s\rangle )P({\text{saw}}\mid {\text{I}})P({\text{the}}\mid {\text{saw}})P({\text{red}}\mid {\text{the}})P({\text{house}}\mid {\text{red}})P(\langle /s\rangle \mid {\text{house}})\end{aligned}}}

{\displaystyle {\begin{aligned}&P({\text{I, saw, the, red, house}})\\\approx {}&P({\text{I}}\mid \langle s\rangle ,\langle s\rangle )P({\text{saw}}\mid \langle s\rangle ,I)P({\text{the}}\mid {\text{I, saw}})P({\text{red}}\mid {\text{saw, the}})P({\text{house}}\mid {\text{the, red}})P(\langle /s\rangle \mid {\text{red, house}})\end{aligned}}}

### 指数型

${\displaystyle P(w_{m}|w_{1},\ldots ,w_{m-1})={\frac {1}{Z(w_{1},\ldots ,w_{m-1})}}\exp(a^{T}f(w_{1},\ldots ,w_{m}))}$

## 参考资料

1. ^ Jurafsky, Dan; Martin, James H. N-gram Language Models. Speech and Language Processing 3rd. 2021 [24 May 2022]. （原始内容存档于22 May 2022）.
2. ^ Rosenfeld, Ronald. Two decades of statistical language modeling: Where do we go from here?. Proceedings of the IEEE. 2000, 88 (8).
3. ^ 王亚珅，黄河燕著,短文本表示建模及应用,北京理工大学出版社,2021.05,第24頁
4. ^ Kuhn, Roland, and Renato De Mori. "A cache-based natural language model for speech recognition." IEEE transactions on pattern analysis and machine intelligence 12.6 (1990): 570-583.
5. ^ Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation页面存档备份，存于互联网档案馆）." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.
6. ^ Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation页面存档备份，存于互联网档案馆）." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.
7. ^ Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014.
8. ^ Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval, pages 237–240. Cambridge University Press, 2009
9. ^ Buttcher, Clarke, and Cormack. Information Retrieval: Implementing and Evaluating Search Engines. pg. 289–291. MIT Press.
10. ^ Craig Trim, What is Language Modeling?页面存档备份，存于互联网档案馆）, April 26th, 2013.