Documentation for chnvlookup (Chinese Vocabulary Lookup), version 1.2

Copyright (C) 1998,1999 David Hiebeler
Dept of Mathematics and Statistics
University of Maine
Orono, ME 04469-5752
http://www.math.umaine.edu/faculty/hiebeler

File: cedictlookupdoc.txt, version 1.2

History:
June 2005: Version 1.2
    Just a renumbering of the same code, to go with the updated cedictlookup.el version 1.2 so all pieces will have the same version number
June 2005: Version 1.1.01
    Renamed from cedictlookup.doc to cedictlookupdoc.txt, and updated my address info above
June 1999: Version 1.1
August 1998: Version 1.0

This documents the chnvlookup Perl script, which does word lookups (by Chinese, pinyin, or English) from one or more vocabulary files in CEDICT format (see "http://www.mandarintools.com/cedict.html" for information about CEDICT).

QUICK SUMMARY
=============

This is for people who hate to read directions. Say you have CEDICT in the file /usr/local/lib/cedict.gb, and your own vocabulary file (in CEDICT format) in /home/base/chinese/mywords.gb. Then run "chnvlookup -vf /usr/local/lib/cedict.gb:/home/base/chinese/mywords.gb"

When it prompts you to "Enter word:", type in a word in Chinese, pinyin (as in CEDICT format, but without the square brackets around it), or English, and chnvlookup will look through the vocab files for the words you typed, and spit out the Chinese, pinyin, and English.

But if you don't read the directions, you will only be able to do exact matching, which is fairly limiting, so you may have to force yourself to plow through some of this documentation...

RUNNING UNDER EMACS
===================

If you use emacs, and don't have it already, you may want to get chnvlookup.el, some emacs-lisp code which makes it convenient to call chnvlookup from within emacs while you are reading Chinese documents. This can be downloaded from under the "Chinese resources" section of my web page; the URL is at the top of this document.
There is information inside of that file about how to use it.

COMMAND LINE ARGUMENTS
======================

There are several basic command-line arguments possible:

-v : Enable verbose mode, which prints a little extra info at startup and when parsing user input. Mainly used for debugging.

-emacs : This is used when running the script from under emacs; it makes chnvlookup modify its output slightly, e.g. not prompt the user to enter a word to look up.

-i : Do case-insensitive matching when looking up words by English (note that all pinyin is assumed to be lower case, according to the CEDICT format specification). This is the default behavior.

+i : Do case-sensitive English matching.

-vd path : "path" specifies the Vocabulary Directory where the vocabulary files may be found. This path is prepended to the vocabulary filenames when they are opened. If your vocabulary files are in different directories, don't set "-vd"; instead, give the pathnames as part of the vocabulary filenames passed to "-vf" (although you could use "-vd" to specify a common root, e.g. "-vd /home/hiebeler/chinese", and then "-vf subdir1/vocabfile1:subdir2/vocabfile2").

-vf file1:file2:...:fileN : Specify Vocabulary Files to use. This is a colon-separated list of filenames.

Next, there are two arguments which are used to specify the Match Mode and the Anchor Mode, which will be discussed below.

-mm x : Set the Match Mode to "x", where "x" is one of the following letters:
    e: Exact
    l: Longer
    s: Shorter

-am y : Set the Anchor Mode to "y", where "y" is one of the following letters:
    s: Start
    e: End
    n: None

Finally, there is a performance-related argument:

-fastexact x : Set the fastExact mode. This makes some parts of the vocabulary lookup process much faster, at the expense of using more memory. My informal test showed that enabling fastExact matches for Chinese only increased memory usage by about 30%, and enabling it for both Chinese and pinyin increased memory usage by about 75%.
The argument "x" should either be the character "0" (zero), indicating that you don't want to do fastExact matching, or else it should contain one or both of the characters "c" and "p", optionally followed by a "+". If you include a "c", it means enable fastExact matches for Chinese. If you include a "p", it means enable fastExact matches for pinyin. If you add a "+" at the end, it means that after a fastExact search finds a match, it will still continue on to do Longer or Shorter matches according to the Match Mode. (If Match Mode is Exact, nothing further will happen).

Note that by default, the fastExact mode is set to "c", i.e. fastExact matches are only done in Chinese, and if a match is found, further non-exact matches are not searched for.

Also note that you can abbreviate "-fastexact" as just "-fe".

MATCH MODE AND ANCHOR MODE
==========================

When you look up a word (or phrase), there are several ways you could do the search. The Match Mode specifies how you want to search. Possible Match Modes are:

Exact : Perform exact matches only. If you do a pinyin search for "shi1", you will only be shown entries in the vocabulary file(s) whose pinyin fields are exactly "[shi1]" (there may be more than one, corresponding to different Chinese characters with that pronunciation, or you may even have more than one entry for the same character among your vocab files, or even within a single vocab file). The behavior of Exact Match Mode is not affected by Anchor Mode, so if Match Mode is set to "Exact", Anchor Mode is ignored.

Longer : Find matches which contain the specified string, i.e. find entries in your vocabulary file(s) which are as long as or longer than the word(s) you entered. The behavior of this Match Mode is affected by Anchor Mode, as discussed below.

Shorter : Find matches which are contained within the specified string, i.e. find entries in your vocabulary file(s) which are as short as or shorter than the word(s) you entered.
The behavior of this Match Mode is affected by Anchor Mode, as discussed below.

Anchor Mode specifies how one string should be located within another, longer string. It affects the behavior of searches when using a Match Mode of either "Longer" or "Shorter". Possible Anchor Modes are:

Start : The shorter string must be at the beginning of the longer string.

End : The shorter string must be at the end of the longer string.

None : The shorter string can be anywhere within the longer string.

I realize the theory of it all sounds pretty vague and confusing. I hope several examples will clarify everything:

Examples with Match Mode = Longer:

If you use Match Mode = Longer, with Anchor Mode = Start, and do a pinyin search for "mei3", you will find all entries in your vocab files whose pinyin fields BEGIN with "[mei3", e.g. you will see "mei3", "mei3 ge2", "mei3 guo2", "mei3 shu4", "mei3 guo2 hang2 kong1 gong1 si1", etc.

If you use Match Mode = Longer, with Anchor Mode = End, and do a pinyin search for "mei3", you will find all entries in your vocab files whose pinyin fields END with "mei3]", e.g. you will see "mei3", "you1 mei3", "la1 mei3", "zan4 mei3", etc.

If you use Match Mode = Longer, with Anchor Mode = None, and do a pinyin search for "mei3", you will see all of the above matches (entries which matched at the beginning or the end), as well as other entries such as "bei3 mei3 zhou1", "nan2 mei3 zhou1", etc. which have a "mei3" anywhere in the middle.

Examples with Match Mode = Shorter:

If you use Match Mode = Shorter, with Anchor Mode = Start, and do a pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will find all entries in your vocab files which match part of the beginning of the phrase, e.g. you will see "zhong1", "zhong1 hua2", "zhong1 hua2 ren2 min2", and "zhong1 hua2 ren2 min2 gong4 he2 guo2" (assuming they are all in your vocabulary database).
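As a rough illustration only (chnvlookup itself is a Perl script, and its internals are not reproduced here), the following Python sketch models how Match Mode and Anchor Mode could combine for a pinyin search. It assumes matching is done on whole space-separated syllables, and the function names contains_at and matches are hypothetical, chosen just for this example.

```python
def contains_at(longer, shorter, anchor):
    """Return True if the token list `shorter` occurs inside `longer`
    at the position the anchor mode requires (s=Start, e=End, n=None)."""
    n, m = len(longer), len(shorter)
    if m > n:
        return False
    if anchor == "s":
        return longer[:m] == shorter
    if anchor == "e":
        return longer[n - m:] == shorter
    if anchor == "n":
        return any(longer[i:i + m] == shorter for i in range(n - m + 1))
    raise ValueError(f"unknown anchor mode: {anchor!r}")


def matches(query, entry, match_mode="e", anchor_mode="s"):
    """Decide whether a vocabulary entry's pinyin matches the query.

    match_mode: "e" (Exact), "l" (Longer), "s" (Shorter).
    Assumption: both strings are split into syllables on whitespace.
    """
    q, e = query.split(), entry.split()
    if match_mode == "e":
        return q == e                          # Anchor Mode is ignored
    if match_mode == "l":
        return contains_at(e, q, anchor_mode)  # entry contains the query
    if match_mode == "s":
        return contains_at(q, e, anchor_mode)  # query contains the entry
    raise ValueError(f"unknown match mode: {match_mode!r}")


entries = ["mei3", "mei3 guo2", "you1 mei3", "bei3 mei3 zhou1"]
longer_start = [e for e in entries if matches("mei3", e, "l", "s")]
longer_end = [e for e in entries if matches("mei3", e, "l", "e")]
print(longer_start)  # ['mei3', 'mei3 guo2']
print(longer_end)    # ['mei3', 'you1 mei3']
```

In this model, Longer and Shorter are the same containment test with the roles of query and entry swapped, which is why a single Anchor Mode setting can serve both.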
If you use Match Mode = Shorter, with Anchor Mode = End, and do a pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will find all entries in your vocab files which match part of the end of the phrase, e.g. you will see "guo2" and "gong4 he2 guo2".

If you use Match Mode = Shorter, with Anchor Mode = None, and do a pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will find all entries in your vocab files which are contained within the phrase you are searching on. This usually means you will see every single-character entry such as "zhong1", "hua2", "ren2", etc., as well as multi-character phrases within your search string, such as "ren2 min2", "gong4 he2 guo2", "zhong1 hua2", etc. Be careful using these settings with long phrases, as you may get a ton of output!

The above examples were for pinyin searches, but the behavior is exactly the same for Chinese searches, and essentially the same for English searches. For English searches, since each entry in a CEDICT file may have multiple English definitions (e.g. "/first definition/another def/yet another definition"), Anchor Mode anchors within each definition. Thus with Match Mode = Longer and Anchor Mode = End, if you searched for "up", you would find matches whose English fields were "/jump up/", "/disperse/break up/", "/to blow up/to explode/", and so on -- anything where at least one of the English definitions ended in the word "up". If you want to search for entries which have English words which end in "up" (but not necessarily "up" as a separate word), see the section below on Special Flags.

LOOKING UP WORDS
================

chnvlookup will prompt you for a word to look up. Just type in a word or phrase, in Chinese, pinyin, or English. chnvlookup guesses what you typed in, by looking at the first word. If the high bit is set on the first character, it assumes you typed in Chinese.
Otherwise, if the first word ends with a digit from 1-5, it assumes it is a pinyin tone, and thus assumes you typed in pinyin. (Actually, if the first word ends with a "0", it also assumes pinyin; see discussion below.) If none of the above rules match, it assumes you typed English.

FAST-EXACT LOOKUPS
==================

If the Match Mode is not Exact (i.e. it is Longer or Shorter), chnvlookup can be a little slow, depending on your computer and how big your vocabulary list is, since it needs to do a pattern match for every entry in the vocabulary list. To alleviate this problem, Fast Exact matches are done in some circumstances.

In the current release, Fast Exact matches are only possible when looking up Chinese or pinyin words, not English. (English Fast Exact searches may be available in a future release, although I suspect it will use a huge amount of memory).

When Fast Exact matches are enabled for the language you are searching on (by default they are enabled only for Chinese searches), then when you do a non-exact vocabulary lookup of a word, exact matches are found first via the Fast Exact lookup, which should be very fast. If the FastExact mode did not have a "+" at the end, the search will stop if any matches were found. If the FastExact mode did have a "+" at the end, the search will go on and list any other (non-exact) matches that are found. This usually means that you see the most precise information very quickly, and then see some additional information after some delay. In my opinion, it's better than suffering through the delay to get everything, although I would prefer it if the precise match were moved down to the end again after the non-exact matches are found, so I could see it more easily.

When Match Mode is Exact, Fast Exact matches are done by default when looking up Chinese or pinyin words (if you enabled Fast Exact searches for those input languages), and a "+" at the end of the FastExact mode has no effect.
If you didn't enable Fast Exact matches for the language you are searching on, a regular slower search is done, even if the Match Mode is Exact.

Because enabling Fast Exact matches uses more memory, you may want to disable it if your computer doesn't have much memory available. You can do this via the "-fastexact 0" command-line argument.

By default, the FastExact mode is "c". You can specify it explicitly on the command line by using the "-fastexact" command-line argument (which can also be abbreviated as "-fe").

SPECIAL FLAGS WHEN DOING LOOKUPS
================================

There are also a few special things you can type on the input line when looking up words, to change the behavior of chnvlookup. These special flags must appear at the beginning of the line, and all begin with a hyphen '-'.

Permanently changing Modes:

First, if you decide to change the Match Mode or Anchor Mode, you can just specify them as you would on the command line, e.g. if you type "-mm l -am s" (without the quotes), it will set Match Mode to Longer, and Anchor Mode to Start. You can set just one of them if you like; you don't need to set both. When you set the Modes in this way, they retain their new value, until you change them again. This lets you change the Modes without restarting chnvlookup, since it usually takes a relatively long time to read in the vocabulary files.

Note that you cannot change the FastExact mode in this way, because the program needs to know whether FastExact matches will be allowed when it first starts up, so that when it's reading in the vocabulary files, it can build extra hash table(s) for fast searching.

Temporarily changing Modes:

If you just want to temporarily change the mode for a single search, you can specify the modes to use, followed by the search you want to do, on that same line.
For example, if you ran chnvlookup with Match Mode = Longer, but suddenly decide you want to do an exact search for "hai3", you can enter the following at the prompt (without quotes): "-mm e hai3". This tells chnvlookup to temporarily set the Match Mode to Exact only for this search, and then put it back the way it was (i.e. Longer, in this example).

Query:

If you enter "-q" on a line by itself, chnvlookup will tell you the current Match Mode and Anchor Mode. It will also do this if you just hit return without typing anything.

License:

If you enter "-license" on a line by itself, some licensing information is displayed.

Doing "wildcard" matches:

What if you want to do a pinyin search but don't quite remember it, e.g. you forgot whether the word you want is pronounced "bu4 jin1" or "bu4 jing1"? You can do a "Wildcard Pinyin" match, by entering the following input (without quotes): "-wp bu4 jin". The "-wp" flag tells chnvlookup to do wildcard matches on each pinyin sound. In this case, you would find the "bu4 jin1" you were looking for (as well as the unrelated word "bu4 jing3").

Wildcard Pinyin matching also does wildcard matching at the beginnings of words, e.g. "-wp ing1" will match "ping1", "qing1", "jing1", "ting1", "bing1", "ding1", etc. Note if you put a tone on the end of a pinyin sound, that pinyin will not do wildcard matching at its end, only at its beginning. Thus, in a wildcard pinyin match, "jin" will match "jin" as well as "jing", but "jin4" will not match "jing4".

You can also do Wildcard English matching, by using the "-we" flag. E.g. entering "-we air" will match "chair", "airport", "dairy farm", "repair shop", "fairy tale", and so on. Be careful with this -- don't do wildcard English matches on very short strings, unless you like to see an awful lot of output go by on your screen.
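The Wildcard Pinyin rules described above (a wildcard at the start of every pinyin sound, and a wildcard at the end only when no tone digit was typed) can be modelled with one regular expression per syllable. The Python sketch below is only an illustration under stated assumptions, not the code chnvlookup actually uses: the function names are hypothetical, and it assumes the query must line up syllable-for-syllable with the whole pinyin field.

```python
import re


def wildcard_pinyin_regex(query):
    """Build a regex from a wildcard-pinyin query such as "bu4 jin".

    Each syllable may have extra characters in front of it; a syllable
    typed without a tone digit may also have extra characters (including
    a tone) after it.
    """
    parts = []
    for syllable in query.split():
        has_tone = syllable[-1].isdigit()
        pattern = r"\S*" + re.escape(syllable)  # wildcard at the beginning
        if not has_tone:
            pattern += r"\S*"                   # wildcard at the end
        parts.append(pattern)
    return re.compile(" ".join(parts))


def wildcard_pinyin_matches(query, pinyin_field):
    """True if the whole pinyin field (no brackets) matches the query."""
    return wildcard_pinyin_regex(query).fullmatch(pinyin_field) is not None


print(wildcard_pinyin_matches("bu4 jin", "bu4 jin1"))   # True
print(wildcard_pinyin_matches("bu4 jin", "bu4 jing3"))  # True
print(wildcard_pinyin_matches("jin4", "jing4"))         # False
```

Using \S* rather than a letters-only class keeps the sketch from making assumptions about which characters a syllable may contain, while still preventing a wildcard from spanning two syllables.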
IF YOU ONLY FORGOT THE TONE
===========================

Sometimes Wildcard Pinyin matching is overkill -- sometimes you just can't remember the tone of one character. E.g. say you wanted to search for "cha2 hu" where you forgot which tone "hu" is. If you use Wildcard Pinyin matching on "cha2 hu", you would get things like "cha2 chu3", "cha2 hun2", etc. (ok, I admit there are no such words as far as I know; I couldn't think of a good Real Example), and you don't want to see stuff like that.

To handle this case, a pinyin tone of "0" means you forgot the tone. So you could enter "cha2 hu0", and it would search for "cha2 hu" where the "hu" is any tone, but it wouldn't do the more liberal Wildcard Pinyin matching and pick up things like "cha2 hun2". You can put "0" tones on as many pinyin words as you like.

You could even use "0" tones with Wildcard Pinyin matching. This actually would accomplish something; e.g. with Wildcard Pinyin matching, "jin" could match "jing1", "jin3", "jin4", etc., whereas "jin0" would match "jin3", "jin4", etc. but not "jing1". (Remember the note above: when doing Wildcard Pinyin matching, if you put a tone on the end of a pinyin sound, that pinyin will not do wildcard matching at its end, only at its beginning).

Bug notice: this "0-tone" (or "forgotten-tone") pinyin lookup currently does NOT work with Match Mode = Shorter. Hopefully it will work in the next release.

TO SEND ME FEEDBACK
===================

As of the time I began this project (July 1998), I am fairly new to Perl, and even newer to writing multilingual vocabulary-lookup software. If you find problems, or have suggestions, contact me via the e-mail address on my home page, http://www.math.umaine.edu/faculty/hiebeler