-
In natural language processing, Chinese processing technology lags well behind that for Western languages, and many Western-language methods cannot be applied to Chinese directly, because Chinese text must first go through word segmentation. Chinese word segmentation is the foundation of other Chinese information processing, and search engines are only one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all require word segmentation as well.
The need for word segmentation may hold back some research, but it also creates opportunities for some companies: any foreign computer processing technology that wants to enter the Chinese market must first solve the problem of Chinese word segmentation.
Segmentation accuracy is very important for search engines, but a segmenter that is too slow is unusable for them no matter how accurate it is. Search engines have to process hundreds of millions of web pages, so slow segmentation would seriously delay how quickly their content is updated. For search engines, therefore, both the accuracy and the speed of word segmentation must meet high standards. Tsinghua University, Peking University, Harbin Institute of Technology, the Chinese Academy of Sciences, Beijing Language Institute, Shanxi University, Northeastern University, IBM Research, Microsoft Research China, and others all have their own research teams, whereas apart from Hylanda (海量科技, "Massive Technology") there are almost no commercial companies that truly specialize in Chinese word segmentation research.
Most of the technology developed by research institutions cannot be turned into products quickly, and the resources of a single specialized company are limited after all, so Chinese word segmentation technology apparently still has a long way to go before it can better serve more products.
-
Chinese word segmentation means splitting a sequence of Chinese characters into individual words. In other words, segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain rules. In English, spaces serve as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs can be separated simply by explicit delimiters; words have no formal delimiter. English also has the problem of delimiting phrases, but at the word level Chinese is far more complex and difficult than English.
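The difference described above can be seen in a minimal sketch. The helper name `whitespace_tokenize` and both sample sentences are illustrative assumptions, not part of the original answer:

```python
def whitespace_tokenize(text):
    """Split on whitespace, which is enough to find word boundaries in English."""
    return text.split()


english = "I love natural language processing"
chinese = "我爱自然语言处理"  # same meaning, written without any spaces

# English comes back as separate words.
print(whitespace_tokenize(english))

# Chinese comes back as a single unsegmented chunk, so a real segmenter is needed.
print(whitespace_tokenize(chinese))
```

Splitting on spaces recovers the five English words, while the Chinese sentence stays in one piece because nothing in the text marks where one word ends and the next begins.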
-
Chinese word segmentation is a basic step in Chinese text processing and a basic module of Chinese human-computer natural language interaction. Unlike English, Chinese sentences have no word boundaries, so Chinese natural language processing usually has to begin with word segmentation, and the quality of segmentation directly affects downstream modules such as part-of-speech tagging and syntactic parsing. Of course, word segmentation is only a tool, and different scenarios place different requirements on it.
In human-computer natural language interaction, a mature Chinese word segmentation algorithm leads to better natural language processing results and helps computers understand complex Chinese. While building its Chinese natural language dialogue system, Emotibot has continuously optimized segmentation with the help of linguistics and trained a set of models that segment well, laying a foundation for machines to better understand Chinese. Here the Emotibot Intelligent Natural Language and Deep Learning Group summarizes its years of experience with Chinese word segmentation schemes, the problems of current segmenters, and the factors and related resources that Chinese word segmentation needs to take into account.
According to their implementation principles and characteristics, Chinese word segmentation methods fall mainly into the following two categories:
1. Dictionary-based word segmentation algorithms, also known as string-matching word segmentation algorithms. The algorithm matches the string to be segmented against the entries of a "sufficiently large" dictionary according to some strategy; if an entry is found, the match succeeds and a word is identified. Common dictionary-based word segmentation algorithms include the following:
The forward maximum matching method, the reverse maximum matching method, the bidirectional matching method, and so on. Dictionary-based algorithms are the most widely used and the fastest. Researchers have long been optimizing the string-matching approach, for example through the maximum length setting, the way strings are stored and looked up, and the organization of the vocabulary, such as trie index trees and hash indexes.
2. Statistics-based machine learning algorithms. Commonly used algorithms of this kind currently include HMM, CRF, SVM, and deep learning; for example, the Stanford and HanLP word segmentation tools are based on the CRF algorithm. Taking CRF as an example, the basic idea is to label each Chinese character. The model considers not only word frequency but also context, and it learns well, so it is effective at recognizing ambiguous words and out-of-vocabulary words. Nianwen Xue first proposed labeling each character in the paper "Combining Classifiers for Chinese Word Segmentation", where a classifier trained with machine learning performs the segmentation, and elaborated the character-tagging approach in "Chinese Word Segmentation as Character Tagging".
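To make the character-labeling idea in the second category concrete, here is a minimal sketch of how a segmented sentence maps to per-character labels and back. It assumes the commonly used four-tag scheme B (begin), M (middle), E (end), S (single-character word), which is not necessarily the label set used in the cited papers. The function names and the sample sentence are hypothetical, and no model is trained here; a CRF or similar classifier would be what predicts the tags.

```python
def to_tags(words):
    """Turn a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def from_tags(chars, tags):
    """Rebuild words from characters and their (predicted) tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # these two tags close a word
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


gold = ["我", "爱", "自然语言", "处理"]  # hypothetical training example
tags = to_tags(gold)
print(tags)
print(from_tags("".join(gold), tags))
```

Under this framing, segmentation becomes a sequence labeling problem: training data is converted with something like `to_tags`, the model predicts one tag per character, and a step like `from_tags` turns the predictions back into words.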
Common segmenters combine machine learning algorithms with dictionaries, which improves segmentation accuracy on the one hand and domain adaptability on the other.
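The dictionary-based matching methods from the first category can be sketched as follows. This is a simplified illustration, not any particular tool's implementation: the dictionary is a plain Python set (a hash lookup), `max_len` is the maximum length setting mentioned above, unknown single characters are emitted as one-character words, and the tie-breaking rule in `bidirectional_match` (fewer words, then fewer single-character words, then the reverse result) is one common heuristic assumed here. The toy dictionary and the sample sentence are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


def bidirectional_match(text, dictionary, max_len=4):
    """Run both directions and keep the result that looks better."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(words):
        # Fewer words first, then fewer single-character words.
        return (len(words), sum(1 for w in words if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


toy_dict = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"
print(forward_max_match(sentence, toy_dict))
print(backward_max_match(sentence, toy_dict))
print(bidirectional_match(sentence, toy_dict))
```

On this sentence the forward scan greedily takes "研究生" and leaves "命" stranded, while the reverse scan finds "研究 / 生命 / 的 / 起源". This is the kind of ambiguity that pure matching cannot always resolve and that statistical models help with.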
-
1. A good dictionary is important. Whatever segmentation method you use, a good dictionary is essential: the older the dictionary and the newer the text, the messier the segmentation becomes. The key questions are how to build a good dictionary and how to discover new words quickly.
2. The algorithm follows the requirements. It is advisable to choose different algorithms for different needs. For example, something like the autocomplete in the search box at the top of Zhihu cares mainly about speed and relevance to the user's interests (it first looks for content related to your account that you may be interested in), and the word segmentation algorithm is secondary. For long texts, as in full-text search, I think accuracy matters more, and you should choose an algorithm like CRF.
What is word segmentation technology in SEO? When a search engine builds its index database, it splits the text on each page and stores the resulting words in the index one by one. This splitting is called word segmentation, and the search engine technology that performs it is called word segmentation technology.