Corpora for Written Chinese
By: Max • November 16, 2009
Corpora for Written Chinese:
An Investigation into Their Availability
Abstract
This report investigates the availability of corpora for the Chinese language. The first part is a brief introduction to the history and development of Chinese corpora. The second part introduces the current situation of corpora for Mandarin Chinese, including a list of the corpora that exist today. The third part is a deeper investigation into three chosen corpora, covering their purposes, contents, makers, availability, formatting, annotation, etc. The fourth part describes what kinds of corpora are in the making, and the report concludes with the kinds of corpora that still need to be built in the future.
Key words: corpus, Mandarin Chinese, availability, People's Daily corpus, HSK corpus, LIVAC corpus
1. Introduction: Chinese Corpora and Their History
When talking about corpora, one readily thinks of the Brown corpus, the LOB corpus, the London-Lund corpus, etc., most of which are English corpora. Among corpora of "other languages", one may know the Swedish SUC corpus or the RWC Japanese corpus, but what about Chinese corpora? Little investigation seems to have been made into corpora of this language in Western countries. How has corpus linguistics developed in China? Is it well developed? Are there many Chinese corpora available? Which is the largest and most famous one? And what are the details of these Chinese corpora? This assignment will try to answer these questions.
The first Chinese corpus might be the one called "Applied Glossary of Modern Chinese", which was created in the 1920s and is not machine-readable. Chen Heqin, the maker of this corpus, collected about five hundred thousand Chinese words in his work, aiming to use it in designing primary-school Chinese language textbooks (Feng Zhiwei, 2002, p. 3-4). Machine-readable corpora have been built in China since 1979, and corpus work has since become a significant research field in linguistic studies (Feng Zhiwei, 2002, p. 5). Corpus linguistics in China is not only widely applied in lexicography, language teaching and machine translation, but is also a major subject of study in colleges and various academic institutions (Journal of Chinese Language and Computing, 11(2) 125).
The existing corpora in China can be divided into three types: Chinese corpora, English corpora, and parallel corpora. This report will focus on corpora for Mandarin Chinese.
2. Corpora for Mandarin Chinese: Current Situation
There are at least 12 relatively large Chinese corpora in existence today, whose basic information can be found in Table 1. Among them, the "National Balanced Corpus" is the largest tagged corpus. In Part 3, I will investigate more closely the "People's Daily Corpus" (a corpus for common use), the "LIVAC Corpus" (a corpus for comparative studies), and the "HSK Open Corpus" (a corpus for educational use).
Table 1: Existing Corpora for Mandarin Chinese
Name | Maker | Year | Size (million)
Corpus for Contemporary Literature | Wuhan University | 1979 | 5.27
Corpus for Modern Chinese | Beihang University | 1983 | 20
Chinese Corpus for Middle-School Textbook | Beijing Normal University | 1983 | 1.068
Word Frequency Counting Corpus | Beijing Language & Culture University | 1983 | 1.82
HSK Open Corpus | Beijing Language & Culture University | Open corpus | 10
National Balanced Corpus | China Language Committee | unfinished | 70
Corpus of People's Daily | Peking University | 1998 | 27
China News Corpus | Shanxi University | 1988 | 2.5
Untagged Corpus | Shanghai Normal University | Open corpus | 30
Corpus of Writer's Digest | Shanghai Normal University | Open corpus | 1
LIVAC (Linguistic Variety in Chinese Communities) | City University of Hong Kong | 2005 | 15
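
The figures in Table 1 can also be compared programmatically. The following is a minimal Python sketch, not part of the original investigation: the rows are copied from Table 1, while the names CORPORA, largest and total_size are hypothetical helpers introduced here for illustration. The table gives sizes only in "million" without naming the unit (words or characters), so the code treats them as plain numbers.

```python
# Rows copied from Table 1: (name, maker, year or status, size in millions).
# The unit behind "million" is not stated in the table, so sizes are kept
# as unit-less numbers here.
CORPORA = [
    ("Corpus for Contemporary Literature", "Wuhan University", "1979", 5.27),
    ("Corpus for Modern Chinese", "Beihang University", "1983", 20.0),
    ("Chinese Corpus for Middle-School Textbook", "Beijing Normal University", "1983", 1.068),
    ("Word Frequency Counting Corpus", "Beijing Language & Culture University", "1983", 1.82),
    ("HSK Open Corpus", "Beijing Language & Culture University", "Open corpus", 10.0),
    ("National Balanced Corpus", "China Language Committee", "unfinished", 70.0),
    ("Corpus of People's Daily", "Peking University", "1998", 27.0),
    ("China News Corpus", "Shanxi University", "1988", 2.5),
    ("Untagged Corpus", "Shanghai Normal University", "Open corpus", 30.0),
    ("Corpus of Writer's Digest", "Shanghai Normal University", "Open corpus", 1.0),
    ("LIVAC (Linguistic Variety in Chinese Communities)", "City University of Hong Kong", "2005", 15.0),
]


def largest(corpora):
    """Return the row with the greatest size (hypothetical helper)."""
    return max(corpora, key=lambda row: row[3])


def total_size(corpora):
    """Return the sum of all listed sizes, in millions (hypothetical helper)."""
    return sum(row[3] for row in corpora)


if __name__ == "__main__":
    name, maker, year, size = largest(CORPORA)
    print(f"Largest listed corpus: {name} ({maker}), {size} million")
    print(f"Combined size of listed corpora: {total_size(CORPORA):.3f} million")
```

Because Table 1 does not record which corpora are tagged, the sketch can only identify the largest corpus overall. It cannot confirm which is the largest tagged one.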