Language processing applications such as machine translation, language analysis, language understanding, and information retrieval must identify the words in a text before the text can be processed. A Chinese sentence contains no delimiters, such as spaces, to separate words. A typical word segmentation system therefore finds the possible word compositions of a sentence by matching it against a lexicon, which produces word segmentation ambiguities. Most Chinese word segmentation systems concentrate on resolving these ambiguities rather than on identifying unknown words, which make up 3% to 5% of all the words in an article. Because lexicon matching alone cannot recover words that are missing from the lexicon, unknown word identification is an important issue for a word segmentation algorithm. High-frequency keywords are easier to extract and identify offline, while low-frequency keywords must be extracted on the fly using morphological rules, morphemes, and word collocations.
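
To illustrate how lexicon matching gives rise to segmentation ambiguity, the following minimal sketch enumerates every way a sentence can be split into lexicon words. It is not the system's actual algorithm: the function name `all_segmentations`, the `max_len` limit, and the six-entry toy lexicon are assumptions made only for this example.

```python
from typing import List, Set


def all_segmentations(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Return every split of `sentence` into words found in `lexicon`.

    Hypothetical helper for illustration; `max_len` caps the candidate
    word length so only short prefixes are tried at each position.
    """
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results: List[List[str]] = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if word in lexicon:
            # Recurse on the remainder and prepend the matched word.
            for rest in all_segmentations(sentence[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (assumed for this example, not taken from the real lexicon).
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源", "生"}

candidates = all_segmentations("研究生命起源", TOY_LEXICON)
for words in candidates:
    print(" / ".join(words))
```

With this toy lexicon the sentence has three competing segmentations, including 研究 / 生命 / 起源 and 研究生 / 命 / 起源, which is the kind of ambiguity a segmentation system has to resolve. A word that is absent from the lexicon would yield no complete segmentation at all, which is why unknown words need separate handling.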

Our system is a Chinese word segmentation system with unknown word identification and part-of-speech (POS) tagging. It contains a 100,000-entry lexicon with POS tags, word frequencies, POS tag bigram information, etc. The word segmentation process is based on the lexicon and on morphological rules for quantifier words and reduplicated words. POS tagging is applied to both known and unknown words.
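
As an example of what a morphological rule for reduplicated words might look like, the sketch below checks a candidate string against three common Chinese reduplication patterns (AA, AABB, ABAB). The function name `reduplication_pattern` and the choice of patterns are assumptions for illustration; the rules actually used by the system are not described here and may differ.

```python
from typing import Optional


def reduplication_pattern(word: str) -> Optional[str]:
    """Return the reduplication pattern of `word`, or None if it has none.

    Hypothetical rule set covering only AA, AABB, and ABAB forms.
    """
    n = len(word)
    if n == 2 and word[0] == word[1]:
        return "AA"  # e.g. 看看
    if n == 4:
        a, b, c, d = word
        if a == b and c == d and a != c:
            return "AABB"  # e.g. 高高興興
        if a == c and b == d and a != b:
            return "ABAB"  # e.g. 研究研究
    return None


for candidate in ["看看", "高高興興", "研究研究", "研究"]:
    print(candidate, reduplication_pattern(candidate))
```

A rule of this kind lets a segmenter accept a form such as 高高興興 as a single word even when only its base form is listed in the lexicon.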


Our word segmentation system was ranked first in the traditional Chinese word segmentation evaluation at the First International Chinese Word Segmentation Bakeoff, held by ACL SIGHAN. It is the first word segmentation system with out-of-vocabulary word identification and syntactic category prediction capabilities.


 A simplified version of the word segmentation server is available to the public at


Wei-Yun Ma, Huan-Hsing Liu, Yu-Fang Tsai, Chia-Hung Tai, Ming-Hong Bai, Jia-Zen Fan,
Yu-Ming Hsieh

