Stop words

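In information retrieval, stop words are words that are automatically filtered out before or after the processing of natural language data (text), in order to save storage space and improve search efficiency. Stop words should not be confused with security passwords. Stop words are typically entered by hand rather than generated automatically, and once compiled they form a stop list. However, there is no single definitive stop list that suits every tool, and some tools deliberately avoid using stop words so that they can support phrase search.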
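The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any particular tool: the stop list, the tokenizer, and the function names `tokenize` and `remove_stop_words` are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are larger and tool-specific.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The cat is on the mat")
print(remove_stop_words(tokens))  # ['cat', 'mat']
```

Storing the stop list as a `frozenset` keeps each membership test at constant average cost, which matters when the filter runs over every token of a large collection.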

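Any class of words can be chosen as stop words for a given purpose. Broadly, stop words fall into two groups. The first consists of the function words of human languages, which are extremely common and carry little meaning of their own compared with other words, such as 'the', 'is', 'at', 'which', and 'on'. For search engines, however, removing them causes problems when the phrase being searched for contains function words, especially in names such as The Who, The The, or Take That. The second group consists of lexical words, such as 'want', that are used very widely. A search engine cannot guarantee truly relevant results for such words: they do little to narrow the search and they reduce search efficiency, so they are usually removed from the query to improve search performance.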
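The problem with names made up of function words can be shown with a small sketch. The stop list below is a hypothetical one that happens to contain 'who' and 'that', and `query_terms_with_fallback` is only one possible mitigation, chosen here for illustration rather than taken from any specific search engine.

```python
# Hypothetical stop list for illustration; it includes "who" and "that" so the
# effect on the example names is visible.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "who", "that"})


def query_terms(query, stop_words=STOP_WORDS):
    """Split a query on whitespace and drop every stop word."""
    return [word for word in query.lower().split() if word not in stop_words]


def query_terms_with_fallback(query, stop_words=STOP_WORDS):
    """Drop stop words, but keep the original terms if nothing would survive."""
    words = query.lower().split()
    kept = [word for word in words if word not in stop_words]
    return kept or words


for name in ["The Who", "The The", "Take That"]:
    print(name, "->", query_terms(name), "/", query_terms_with_fallback(name))
# The Who -> [] / ['the', 'who']
# The The -> [] / ['the', 'the']
# Take That -> ['take'] / ['take']
```

With plain filtering, two of the three names produce an empty query and the third loses half of its terms. The fallback rescues only the fully empty cases, which is why some tools skip stop word removal altogether when they need to support phrase search.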

Hans Peter Luhn, one of the pioneers of information retrieval, is credited with coining the phrase; he applied the concept in his own research and promoted its use [1].

See also

Function word

References

[1] Luhn, H. P. Keyword-in-Context Index for Technical Literature (KWIC Index). American Documentation (Yorktown Heights, NY: International Business Machines Corp.). 1959, 11 (4): 288–295. doi:10.1002/asi.5090110403.

External links

List of English Stop Words (PHP array, CSV) (archived copy at the Internet Archive)
Full-Text Stopwords in MySQL (archived copy at the Internet Archive)
English Stop Words (CSV) (archived copy at the Internet Archive)
Hindi Stop Words
German Stop Words (archived copy at the Internet Archive), German Stop Words and phrases, another list of German stop words
Polish Stop Words (archived copy at the Internet Archive)

Notes

Stack Overflow: "One of our major performance optimizations for the 'related questions' query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It's shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster."
