 "Academia Sinica Balanced Corpus of Modern Chinese", simplified as Sinica Corpus, is the first Balanced Modern Chinese Corpus with part-of-speech tagging. The preliminary version of Sinica Corpus was developed on a small-scale and opened to the academic community in 1994 with the major purpose of obtaining feedback. The present corpus (Sinica 3.0), which was completed in 1997, has 5 million words. The new version (Sinica 5.0) will target 10 million words and is expected to be completed before 2006.

In addition to data collection and data cleaning, the construction of a balanced Chinese corpus raises three further concerns: 1) balancing and classifying the collected data, 2) Chinese word segmentation, and 3) the design of the POS tagset (Chen 1994).

1. Data extraction and classification for a Balanced Corpus
Topical distribution of the Sinica corpus:

2. Issues of Chinese word segmentation:
The word segmentation standard for Chinese information processing issued by the Central Standards Bureau was adopted as the guideline for segmenting words in the Sinica corpus.

3. The part-of-speech tagging system and its interpretation:
Starting from the tagset of 178 syntactic categories in the CKIP lexicon (CKIP 1993), the Sinica Corpus applies a reduced tagset of 46 tags (43 tags plus 3 features).

4. Part-of-speech analysis: Technical Report no. 93-05. This technical report includes a detailed POS analysis and the corresponding argument structures.
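
To make the segmentation and tagging described in items 2 and 3 concrete, the following is a minimal sketch of reading segmented, POS-tagged text and counting tag frequencies. It assumes a hypothetical plain-text layout in which each segmented word is written as word(TAG) and words are separated by whitespace; this layout, the function names, and the tag labels in the sample are illustrative assumptions, not the corpus's documented format.

```python
import re
from collections import Counter

# Assumed layout (hypothetical): whitespace-separated tokens, each written
# as word(TAG). The real corpus distribution format may differ.
TOKEN_RE = re.compile(r"(\S+?)\(([^()\s]+)\)")


def parse_tagged_line(line):
    """Return a list of (word, tag) pairs found in one tagged line."""
    return TOKEN_RE.findall(line)


def tag_frequencies(lines):
    """Count how often each tag occurs across an iterable of tagged lines."""
    counts = Counter()
    for line in lines:
        for _word, tag in parse_tagged_line(line):
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample only; the tag labels are placeholders.
    sample = "我(Nh) 喜歡(VK) 語料庫(Na)"
    for word, tag in parse_tagged_line(sample):
        print(word, tag)
    print(tag_frequencies([sample]))
```

Keeping the word and its tag together as a pair makes it straightforward to build word lists, tag counts, or word-by-tag tables from the same pass over the text.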

  • The Sinica corpus, a Balanced Corpus of Modern Chinese with 10 million words:

    • 10 million words collected, primarily since 1996.

    • Texts in the corpus are being collected from different areas and classified according to five criteria: genre, style, mode, topic, and source.

    • Every text is segmented into words, and each word is tagged with its part of speech.

    • The Sinica Corpus web-interface is designed for statistical comparison according to users' specification of topics, genres, etc.

    • The web-interface address for Sinica Corpus:
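
As a rough illustration of the classification and statistical comparison described in the list above, the sketch below records the five criteria (genre, style, mode, topic, source) for each text and computes each category's share of the total word count. The class name, function names, and sample values are hypothetical and do not describe the actual web interface or the corpus's real category labels.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """Metadata for one text, using the five criteria named above."""
    genre: str
    style: str
    mode: str
    topic: str
    source: str
    word_count: int


def words_by(texts, criterion):
    """Sum word counts per value of one criterion, e.g. 'topic'."""
    totals = Counter()
    for text in texts:
        totals[getattr(text, criterion)] += text.word_count
    return totals


def proportions(totals):
    """Convert absolute totals into each category's share of the whole."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: value / grand_total for key, value in totals.items()}


if __name__ == "__main__":
    # Made-up sample records; category labels are placeholders.
    texts = [
        TextMeta("report", "narrative", "written", "society", "newspaper", 300),
        TextMeta("essay", "expository", "written", "science", "magazine", 100),
    ]
    print(proportions(words_by(texts, "topic")))
```

Storing the criteria as separate fields lets the same records be regrouped by any one of them, which is the kind of comparison by topic or genre that the interface is described as supporting.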

