Documentation for chnvlookup (Chinese Vocabulary Lookup), version 1.2

Copyright (C) 1998,1999  David Hiebeler
                         Dept of Mathematics and Statistics
                         University of Maine
                         Orono, ME 04469-5752
                         http://www.math.umaine.edu/faculty/hiebeler


File: cedictlookupdoc.txt, version 1.2
      History:
      June 2005: Version 1.2
         Just a renumbering of the same code, to go with the updated
         cedictlookup.el version 1.2 so all pieces will have the same
         version number
      June 2005: Version 1.1.01
         Renamed from cedictlookup.doc to cedictlookupdoc.txt, and updated
         my address info above
      June 1999: Version 1.1
      August 1998: Version 1.0


This documents the chnvlookup Perl script, which does word lookups
(by Chinese, pinyin, or English) from one or more vocabulary files in
CEDICT format (see "http://www.mandarintools.com/cedict.html"
for information about CEDICT).


QUICK SUMMARY
=============

This is for people who hate to read directions.  Say you have
CEDICT in the file /usr/local/lib/cedict.gb, and your own vocabulary
file (in CEDICT format) in /home/base/chinese/mywords.gb.  Then run
"chnvlookup -vf /usr/local/lib/cedict.gb:/home/base/chinese/mywords.gb".
When it prompts you to "Enter word:", type in a word in Chinese,
pinyin (as in CEDICT format, but without the square brackets around
it), or English, and chnvlookup will look through the vocab files for
the words you typed, and spit out the Chinese, pinyin, and English.

But if you don't read the directions, you will only be able to do
exact matching, which is fairly limiting, so you may have to force
yourself to plow through some of this documentation...


RUNNING UNDER EMACS
===================

If you use emacs, you may want to get chnvlookup.el (if you don't
already have it), some emacs-lisp code which makes it convenient to call
chnvlookup from within emacs while you are reading Chinese documents.
This can be downloaded from under the "Chinese resources" section of
my web page; the URL is at the top of this document.  There is information
inside of that file about how to use it.


COMMAND LINE ARGUMENTS
======================

There are several basic command-line arguments possible:

-v : Enable verbose mode, which prints a little extra info at startup
     and when parsing user input.  Mainly used for debugging.

-emacs : This is used when running the script from under emacs; it makes
     chnvlookup modify its output slightly, e.g. not prompt the user to
     enter a word to look up.

-i : Do case-insensitive matching when looking up words by English
     (note that all pinyin is assumed to be lower case, according to the
     CEDICT format specification).  This is the default behavior.

+i : Do case-sensitive English matching.

-vd path : "path" specifies the Vocabulary Directory where the
     vocabulary files may be found.  This path is prepended to the
     vocabulary filenames when they are opened.  If your vocabulary
     files are in different directories, don't set "-vd"; instead, give
     the full pathnames in the "-vf" list (although you could use "-vd"
     to specify a common root, e.g. "-vd /home/hiebeler/chinese", and
     then "-vf subdir1/vocabfile1:subdir2/vocabfile2").

-vf file1:file2:...:fileN : Specify Vocabulary Files to use.  This is
     a colon-separated list of filenames.
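
As a rough illustration of how "-vd" and "-vf" combine (sketched in
Python, not taken from the chnvlookup Perl source), the function below
splits the colon-separated list and prepends the directory.  The function
name and the choice to join with a single "/" are assumptions made for
the example.

```python
def resolve_vocab_files(vf, vd=None):
    """Split a colon-separated -vf value and prepend the -vd path, if any."""
    names = [name for name in vf.split(":") if name]
    if vd is None:
        # No -vd given: the names are used exactly as written.
        return names
    # Assumption: directory and filename are joined with one "/".
    return [vd.rstrip("/") + "/" + name for name in names]


if __name__ == "__main__":
    for path in resolve_vocab_files(
        "subdir1/vocabfile1:subdir2/vocabfile2", "/home/hiebeler/chinese"
    ):
        print(path)
```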

Next, there are two arguments which are used to specify the Match Mode
and the Anchor Mode, which will be discussed below.

-mm x : Set the Match Mode to "x", where "x" is one of the following letters:
        e: Exact
        l: Longer
        s: Shorter
-am y : Set the Anchor Mode to "y", where "y" is one of the following letters:
        s: Start
        e: End
        n: None

Finally, there is a performance-related argument:

-fastexact x: Set the fastExact mode.  This makes some parts of the vocabulary
    lookup process much faster, at the expense of using more memory.
    In my informal test, enabling fastExact matches for Chinese alone
    increased memory usage by about 30%, and enabling it for both Chinese
    and pinyin increased memory usage by about 75%.

    The argument "x" should either be the character "0" (zero), indicating
    that you don't want to do fastExact matching, or else it should contain
    one or both of the characters "c" and "p", optionally followed by a "+".
    If you include a "c", it means enable fastExact matches for Chinese.
    If you include a "p", it means enable fastExact matches for pinyin.
    If you add a "+" at the end, it means that after a fastExact search
    finds a match, it will still continue on to do Longer or Shorter matches
    according to the Match Mode.  (If Match Mode is Exact, nothing further
    will happen.)

    Note that by default, the fastExact mode is set to "c", i.e. fastExact
    matches are only done in Chinese, and if a match is found, further
    non-exact matches are not searched for.

    Also note that you can abbreviate "-fastexact" as just "-fe".
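
To make the format of the "-fastexact" value concrete, here is a small
Python sketch of how such a value could be decoded.  It is only an
illustration of the rules described above, not the script's actual
parsing code; the function name and the returned tuple are assumptions.

```python
def parse_fastexact(arg):
    """Decode a -fastexact value into (chinese, pinyin, keep_going).

    "0" disables fastExact; otherwise the value holds "c" and/or "p",
    optionally followed by "+".
    """
    if arg == "0":
        return (False, False, False)
    keep_going = arg.endswith("+")
    letters = arg[:-1] if keep_going else arg
    # Anything other than a non-empty mix of "c" and "p" is rejected here.
    if not letters or set(letters) - {"c", "p"}:
        raise ValueError("bad -fastexact value: %r" % arg)
    return ("c" in letters, "p" in letters, keep_going)


if __name__ == "__main__":
    for value in ("0", "c", "cp", "cp+"):
        print(value, parse_fastexact(value))
```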

MATCH MODE AND ANCHOR MODE
==========================

When you look up a word (or phrase), there are several ways you could
do the search.  The Match Mode specifies how you want to search.

Possible Match Modes are:
Exact : Perform exact matches only.  If you do a pinyin search for
     "shi1", you will only be shown entries in the vocabulary file(s)
     whose pinyin fields are exactly "[shi1]" (there may be more than
     one, corresponding to different Chinese characters with that
     pronunciation, or you may even have more than one entry for the
     same character among your vocab files, or even within a single
     vocab file).  The behavior of Exact Match Mode is not affected by
     Anchor Mode, so if Match Mode is set to "Exact", Anchor Mode is ignored.

Longer : Find matches which contain the specified string, i.e. find entries
     in your vocabulary file(s) which are as long as or longer than the
     word(s) you entered.  The behavior of this Match Mode is affected by
     Anchor Mode, as discussed below.

Shorter : Find matches which are contained within the specified string,
     i.e. find entries in your vocabulary file(s) which are no longer than
     the word(s) you entered.  The behavior of this Match Mode is affected
     by Anchor Mode, as discussed below.


Anchor Mode specifies how one string should be located within another,
longer string.  It affects the behavior of searches when using a Match
Mode of either "Longer" or "Shorter".

Possible Anchor Modes are:
Start : The shorter string must be at the beginning of the longer
     string.

End : The shorter string must be at the end of the longer string.

None : The shorter string can be anywhere within the longer string.


I realize the theory of it all sounds pretty vague and confusing.  I
hope several examples will clarify everything:


Examples with Match Mode = Longer:

If you use Match Mode = Longer, with Anchor Mode = Start, and do a
pinyin search for "mei3", you will find all entries in your vocab
files whose pinyin fields BEGIN with "[mei3", e.g. you will see
"mei3", "mei3 ge2", "mei3 guo2", "mei3 shu4",
"mei3 guo2 hang2 kong1 gong1 si1", etc.

If you use Match Mode = Longer, with Anchor Mode = End, and do a
pinyin search for "mei3", you will find all entries in your vocab
files whose pinyin fields END with "mei3]", e.g. you will see "mei3",
"you1 mei3", "la1 mei3", "zan4 mei3", etc.

If you use Match Mode = Longer, with Anchor Mode = None, and do a
pinyin search for "mei3", you will see all of the above matches
(entries which matched at the beginning or the end), as well as other
entries such as "bei3 mei3 zhou1", "nan2 mei3 zhou1", etc. which have
a "mei3" anywhere in the middle.


Examples with Match Mode = Shorter:

If you use Match Mode = Shorter, with Anchor Mode = Start, and do a
pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will
find all entries in your vocab files which match part of the beginning
of the phrase, e.g. you will see "zhong1", "zhong1 hua2", "zhong1 hua2
ren2 min2", and "zhong1 hua2 ren2 min2 gong4 he2 guo2" (assuming they
are all in your vocabulary database).

If you use Match Mode = Shorter, with Anchor Mode = End, and do a
pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will
find all entries in your vocab files which match part of the end of
the phrase, e.g. you will see "guo2" and "gong4 he2 guo2".

If you use Match Mode = Shorter, with Anchor Mode = None, and do a
pinyin search for "zhong1 hua2 ren2 min2 gong4 he2 guo2", you will
find all entries in your vocab files which are contained within the
phrase you are searching on.  This usually means you will see every
single-character entry such as "zhong1", "hua2", "ren2", etc., as well
as multi-character phrases within your search string, such as "ren2
min2", "gong4 he2 guo2", "zhong1 hua2", etc.  Be careful using these
settings with long phrases, as you may get a ton of output!
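
The pinyin examples above can be summarized in a short Python sketch.
It is not the chnvlookup implementation (which is written in Perl); it
assumes that matching is done on whole pinyin syllables, and the function
name is made up for the illustration.  The mode letters are the ones
accepted by "-mm" and "-am".

```python
def matches(query, entry, match_mode, anchor_mode):
    """Return True if a vocabulary entry matches the query.

    query and entry are space-separated pinyin strings;
    match_mode is "e", "l" or "s"; anchor_mode is "s", "e" or "n".
    """
    q, e = query.split(), entry.split()
    if match_mode == "e":
        return q == e              # Anchor Mode is ignored for Exact
    # Longer: the query sits inside the entry.
    # Shorter: the entry sits inside the query.
    short, long_ = (q, e) if match_mode == "l" else (e, q)
    n = len(short)
    if n == 0 or n > len(long_):
        return False
    if anchor_mode == "s":
        return long_[:n] == short
    if anchor_mode == "e":
        return long_[-n:] == short
    # Anchor Mode None: any contiguous run of syllables will do.
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))


if __name__ == "__main__":
    vocab = ["mei3", "mei3 guo2", "you1 mei3", "bei3 mei3 zhou1"]
    print([v for v in vocab if matches("mei3", v, "l", "e")])
```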


The above examples were for pinyin searches, but the behavior is
exactly the same for Chinese searches, and essentially the same for
English searches.  For English searches, since each entry in a CEDICT
file may have multiple English definitions (e.g.
"/first definition/another def/yet another definition"), Anchor Mode
anchors within each definition.  Thus with Match Mode = Longer and 
Anchor Mode = End, if you searched for "up", you would find matches whose
English fields were "/jump up/", "/disperse/break up/",
"/to blow up/to explode/", and so on -- anything where at least one of
the English definitions ended in the word "up".  If you want to search
for entries which have English words which end in "up" (but not
necessarily "up" as a separate word), see the section below on Special Flags.
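
The per-definition anchoring for English searches with Match Mode =
Longer might look roughly like the following Python sketch.  It is an
illustration only: the function name is invented, and the use of regular
expression word boundaries to keep "up" a separate word is an assumption
about how the matching could be done, not a description of the Perl code.

```python
import re


def english_longer_match(query, english_field, anchor_mode, ignore_case=True):
    """Check a CEDICT English field such as "/jump up/" against a query.

    Each "/"-separated definition is tested on its own, so the anchor
    applies within a definition rather than to the whole field.
    """
    definitions = [d for d in english_field.split("/") if d]
    word = re.escape(query)
    if anchor_mode == "s":
        pattern = r"^" + word + r"\b"
    elif anchor_mode == "e":
        pattern = r"\b" + word + r"$"
    else:
        pattern = r"\b" + word + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return any(re.search(pattern, d, flags) for d in definitions)


if __name__ == "__main__":
    for field in ("/jump up/", "/disperse/break up/", "/cup/"):
        print(field, english_longer_match("up", field, "e"))
```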


LOOKING UP WORDS
================

chnvlookup will prompt you for a word to look up.  Just type in a word
or phrase, in Chinese, pinyin, or English.  chnvlookup guesses what
you typed in, by looking at the first word.  If the high bit is set on
the first character, it assumes you typed in Chinese.  Otherwise, if
the first word ends with a digit from 1-5, it assumes it is a pinyin
tone, and thus assumes you typed in pinyin.  (Actually, if the first
word ends with a "0", it also assumes pinyin; see discussion below.)
If none of the above rules match, it assumes you typed English.
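
The guessing rules above can be written down as a small Python sketch.
This is an illustration, not the script's own code; it assumes the input
arrives as raw bytes (for example GB-encoded text), and the function name
and return values are made up for the example.

```python
def guess_input_kind(raw):
    """Guess whether a typed line is Chinese, pinyin, or English.

    raw is the input line as bytes.
    """
    words = raw.split()
    if not words:
        return "empty"
    first = words[0]
    if first[0] & 0x80:
        # High bit set on the first character: treat as Chinese.
        return "chinese"
    if chr(first[-1]) in "012345":
        # A trailing tone digit (1-5, or 0 for a forgotten tone).
        return "pinyin"
    return "english"


if __name__ == "__main__":
    for line in (b"mei3 guo2", b"cha2 hu0", b"airport"):
        print(line, guess_input_kind(line))
```

One consequence of these rules is that an English word ending in a digit
from 0 to 5 would be taken for pinyin.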


FAST-EXACT LOOKUPS
==================

If the Match Mode is not Exact (i.e. it is Longer or Shorter),
chnvlookup can be a little slow, depending on your computer and how
big your vocabulary list is, since it needs to do a pattern match for
every entry in the vocabulary list.

To alleviate this problem, Fast Exact matches are done in some
circumstances.  In the current release, Fast Exact matches are only
possible when looking up Chinese or pinyin words, not English.  (English
Fast Exact searches may be available in a future release, although I
suspect it will use a huge amount of memory).

When Fast Exact matches are enabled for the language you are searching
on (by default they are enabled only for Chinese searches), then when
you do a non-exact vocabulary lookup of a word, exact matches are
found first via the Fast Exact lookup, which should be very fast.  If
the FastExact mode did not have a "+" at the end, the search will stop
if any matches were found.  If the FastExact mode did have a "+" at
the end, the search will go on and list any other (non-exact) matches
that are found.  This usually means that you see the most precise
information very quickly, and then see some additional information
after some delay.  In my opinion, it's better than suffering through
the delay to get everything, although I would prefer it if the precise
match were moved down to the end again after the non-exact matches are
found, so I could see it more easily.

When Match Mode is Exact, Fast Exact matches are done by default when
looking up Chinese or pinyin words (if you enabled Fast Exact searches
for those input languages), and the "+" at the end of the FastExact
mode has no effect.  If you didn't enable Fast Exact matches for the
language you are searching on, a regular slower search is done, even if
the Match Mode is Exact.
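
The idea behind Fast Exact lookups can be sketched in Python as a hash
table built once at load time, consulted before the slower scan over
every entry.  This is a simplified illustration under assumed data
structures (entries as (Chinese, pinyin, English) tuples, a caller-supplied
slow matching function); it is not the Perl code used by chnvlookup.

```python
from collections import defaultdict


def build_index(entries):
    """Map each pinyin field to the entries carrying it (costs extra memory)."""
    index = defaultdict(list)
    for entry in entries:
        index[entry[1]].append(entry)
    return index


def lookup(query, entries, index, slow_match, keep_going):
    """Return exact hits first; scan everything only when required."""
    exact = list(index.get(query, []))
    if exact and not keep_going:
        return exact                  # no "+": stop after the fast hits
    # With "+" (or with no fast hit), fall back to the full scan.
    rest = [e for e in entries if e not in exact and slow_match(query, e)]
    return exact + rest


if __name__ == "__main__":
    entries = [("美", "mei3", "/beautiful/"),
               ("美国", "mei3 guo2", "/United States/"),
               ("每", "mei3", "/each/")]
    starts_with = lambda q, e: e[1].split()[:len(q.split())] == q.split()
    index = build_index(entries)
    print(lookup("mei3", entries, index, starts_with, keep_going=False))
    print(lookup("mei3", entries, index, starts_with, keep_going=True))
```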

Because enabling Fast Exact matches uses more memory, you may want to
disable it if your computer doesn't have much memory available.  You
can do this via the "-fastexact 0" command-line argument.

By default, the FastExact mode is "c".  You can specify it explicitly
on the command line by using the "-fastexact" command-line argument
(which can also be abbreviated as "-fe").


SPECIAL FLAGS WHEN DOING LOOKUPS
================================

There are also a few special things you can type on the input line
when looking up words, to change the behavior of chnvlookup.  These
special flags must appear at the beginning of the line, and all begin
with a hyphen '-'.

Permanently changing Modes:
First, if you decide to change the Match Mode or Anchor Mode, you can
just specify them as you would on the command line, e.g. if you type
"-mm l -am s" (without the quotes), it will set Match Mode to Longer,
and Anchor Mode to Start.  You can set just one of them if you like;
you don't need to set both.  When you set the Modes in this way, they
retain their new value, until you change them again.  This lets you
change the Modes without restarting chnvlookup, since it usually takes
a relatively long time to read in the vocabulary files.  Note that you
cannot change the FastExact mode in this way, because the program
needs to know whether FastExact matches will be allowed when it first
starts up, so that when it's reading in the vocabulary files, it can
build extra hash table(s) for fast searching.

Temporarily changing Modes:
If you just want to temporarily change the mode for a single search,
you can specify the modes to use, followed by the search you want to
do, on that same line.  For example, if you ran chnvlookup with Match
Mode = Longer, but suddenly decide you want to do an exact search for
"hai3", you can enter the following at the prompt (without quotes):
"-mm e hai3".  This tells chnvlookup to temporarily set the Match Mode
to Exact only for this search, and then put it back the way it was
(i.e. Longer, in this example).
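
The difference between a permanent and a temporary mode change comes
down to whether anything follows the flags on the input line.  A minimal
Python sketch of that rule follows; the function name and the dictionary
of modes are assumptions for the illustration, and no validation of the
mode letters is attempted.

```python
def apply_mode_flags(line, modes):
    """Split leading -mm/-am flags from an input line.

    Returns (modes_for_this_search, query, modes_to_keep_afterwards).
    """
    tokens = line.split()
    new = dict(modes)
    i = 0
    while i + 1 < len(tokens) and tokens[i] in ("-mm", "-am"):
        key = "match" if tokens[i] == "-mm" else "anchor"
        new[key] = tokens[i + 1]
        i += 2
    query = " ".join(tokens[i:])
    # Flags followed by a search apply to that search only;
    # flags on a line by themselves stay in effect.
    keep = modes if query else new
    return new, query, keep


if __name__ == "__main__":
    current = {"match": "l", "anchor": "n"}
    print(apply_mode_flags("-mm e hai3", current))
    print(apply_mode_flags("-mm l -am s", current))
```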

Query:
If you enter "-q" on a line by itself, chnvlookup will tell you the
current Match Mode and Anchor Mode.  It will also do this if you just
hit return without typing anything.

License:
If you enter "-license" on a line by itself, some licensing
information is displayed.

Doing "wildcard" matches:
What if you want to do a pinyin search but don't quite remember it,
e.g. you forgot whether the word you want is pronounced "bu4 jin1" or
"bu4 jing1"?  You can do a "Wildcard Pinyin" match, by entering the
following input (without quotes): "-wp bu4 jin".  The "-wp" flag tells
chnvlookup to do wildcard matches on each pinyin sound.  In this case,
you would find the "bu4 jin1" you were looking for (as well as the
unrelated word "bu4 jing3").  Wildcard Pinyin matching also does
wildcard matching at the beginnings of words, e.g. "-wp ing1" will
match "ping1", "qing1", "jing1", "ting1", "bing1", "ding1", etc.
Note if you put a tone on the end of a pinyin sound, that pinyin will
not do wildcard matching at its end, only at its beginning.  Thus, in
a wildcard pinyin match, "jin" will match "jin" as well as "jing", but
"jin4" will not match "jing4".

You can also do Wildcard English matching, by using the "-we" flag.
E.g. entering "-we air" will match "chair", "airport", "dairy farm",
"repair shop", "fairy tale", and so on.  Be careful with this -- don't
do wildcard English matches on very short strings, unless you like to
see an awful lot of output go by on your screen.
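
One way to picture the Wildcard Pinyin rules described above is as a
translation of the query into a regular expression.  The Python sketch
below is only an illustration of those rules, not the script's own
method; it assumes the pattern is compared against a whole pinyin field
and that tones are the digits 1 to 5.

```python
import re


def wildcard_pinyin_regex(query):
    """Build a regex for a "-wp" style query such as "bu4 jin"."""
    parts = []
    for syllable in query.split():
        if syllable[-1].isdigit():
            # Tone given: wildcard at the beginning only.
            parts.append("[a-z]*" + re.escape(syllable))
        else:
            # No tone: wildcard at both ends, followed by any tone.
            parts.append("[a-z]*" + re.escape(syllable) + "[a-z]*[1-5]")
    return re.compile("^" + " ".join(parts) + "$")


if __name__ == "__main__":
    pattern = wildcard_pinyin_regex("bu4 jin")
    for field in ("bu4 jin1", "bu4 jing3", "bu4 qin1"):
        print(field, bool(pattern.match(field)))
```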


IF YOU ONLY FORGOT THE TONE
===========================

Sometimes Wildcard Pinyin matching is overkill -- sometimes you just
can't remember the tone of one character.  E.g. say you wanted
to search for "cha2 hu" where you forgot which tone "hu" is, but if
you use Wildcard Pinyin matching on "cha2 hu" you would get things
like "cha2 chu3", "cha2 hun2", etc. (ok, I admit there are no such
words as far as I know; I couldn't think of a good Real Example) and
you don't want to see stuff like that.  To handle this case, a pinyin
tone of "0" means you forgot the tone.  So you could enter "cha2 hu0",
and it would search for "cha2 hu" where the "hu" is any tone, but it
wouldn't do the more liberal Wildcard Pinyin matching and pick up
things like "cha2 hun2".  You can put "0" tones on as many pinyin
words as you like.  You could even use "0" tones with Wildcard Pinyin
matching.  This actually would accomplish something; e.g. with
Wildcard Pinyin matching, "jin" could match "jing1", "jin3", "jin4",
etc., whereas "jin0" would match "jin3", "jin4", etc. but not "jing1".
(Remember the note above: when doing Wildcard Pinyin matching, if you
put a tone on the end of a pinyin sound, that pinyin will not do
wildcard matching at its end, only at its beginning.)
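
The "0" tone (without Wildcard Pinyin matching) can be pictured as
replacing the "0" with "any tone digit".  The Python sketch below shows
that idea; it is an illustration rather than the script's actual code,
and it assumes tones are the digits 1 to 5 and that the pattern is
compared against a whole pinyin field.

```python
import re


def forgotten_tone_regex(query):
    """Build a regex in which a trailing "0" stands for any tone."""
    parts = []
    for syllable in query.split():
        if syllable.endswith("0"):
            # Forgotten tone: same letters, any tone digit.
            parts.append(re.escape(syllable[:-1]) + "[1-5]")
        else:
            parts.append(re.escape(syllable))
    return re.compile("^" + " ".join(parts) + "$")


if __name__ == "__main__":
    pattern = forgotten_tone_regex("cha2 hu0")
    for field in ("cha2 hu2", "cha2 hun2", "cha2 chu3"):
        print(field, bool(pattern.match(field)))
```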

Bug notice: this "0-tone" (or "forgotten-tone") pinyin lookup
currently does NOT work with Match Mode = Shorter.  Hopefully it will
work in the next release.


TO SEND ME FEEDBACK
===================

As of the time I began this project (July 1998), I am fairly new to
Perl, and even newer to writing multilingual vocabulary-lookup
software.  If you find problems, or have suggestions, contact me via
the e-mail address on my home page, 
http://www.math.umaine.edu/faculty/hiebeler