LAMPWords Dictionary Databases ============================== This file describes how to produce dictionary databases for LAMPWords, a word-finding/anagramming program for Palm handhelds. It also details the format of the databases for the benefit of those developers wanting to customize LAMPWords. Setting Up ---------- To create custom dictionary databases for use with LAMPWords, follow these steps to install and set up the tools required. NOTE: These instructions are for Unix-type systems, and particularly for Linux. Creating dictionaries under Windows probably isn't too different, however you need to have a C compiler as well as perl, and also possibly Python before proceeding. 0) Make sure you've obtained all of the tools mentioned in the README. Ideally you should make a directory for LAMPWords database development and put everything you need in there (including the files from this package). I'll be referring to individual file names for the most part. NOTE: The reason I don't include the packages is because their distribution restrictions are unclear or unspecified. 1) First, you must patch dawg.c, dawg.h, and dawg_words.c from the dawg package so they will generate the extra information LAMPWords needs. The diff files dawg.c.diff, dawg.h.diff, and dawg_words.c.diff are included for this purpose. 2) Rename or copy the prc.pl script to pdb.pl, which tells the script to generate PDB (data) files instead of PRC (program) files. Be sure to make the script executable and check that the first line points to your copy of perl. 3) Compile the patched dawg.c and dawg_words.c (as separate programs) from the dawg package. I'll assume the executables are called dawg and dawg_words, respectively. 4) Compile dawgconv.c from this package (executable name: dawgconv). 5) Make the tag.py script executable and check that the first line points to your copy of Python. (If you don't have Python you can skip this step as this script isn't crucial.) Generating New Databases ------------------------ Here are the steps required to actually create a new dictionary database: 0) You need at least one list of words in plain text format, one word per line. Make sure the file has no duplicate words in it! I'll call this file mylist-raw.txt in the instructions below but of course it can be named whatever you like. NOTE: LAMPWords supports words of at most 15 characters in length at this time. 1) Create a DAWG Description file (called mylist.dwd in these instructions) for your dictionary. A full description of the file format is in the next section, but if you just have a single-category dictionary and don't want to muck around you can simply copy the example file twl.dwd and use it as a template. You'll need to edit lines 1-5 and line 8. Leave the other lines alone for now. 2) Each word in your word list must be tagged with a category number. The Python script tag.py is provided for this purpose. When you run tag.py you'll be asked for a set of input files. For a simple one-category dictionary you'll just have one, mylist-raw.txt. Enter this name, press ENTER again to end the list. Next you'll be asked for a list of category definitions. See below for how to use these for multi-category dictionaries. For a one-category dictionary, simply enter the number 0, then press ENTER again to end the list. Lastly, you'll be asked to enter an output file name. I'll call this mylist.txt in the examples below. If you don't have Python you can tag the file yourself. For a single-category list all you have to do is put a 0 immediately after each and every word in the file; there are a multitude of ways to accomplish this. Keep the original (untagged) file around in case you want to use it to generate multi-category lists later. For multi-category files, see the next section for further instructions. If you will be doing a lot of work with multi-category dictionaries you may want to obtain and install Python (http://www.python.org/) because tag.py is a lot more valuable if you are working with multi-category dictionaries. 3) Build the initial DAWG file (called mylist.dwg here) with this command: ./dawg mylist.txt mylist 4) Test the DAWG file by performing a few searches using the dawg_words program. This is of course optional, but it's best to verify everything went well with the generation. Here's a typical search command: ./dawg_words -d mylist.dwg -r SATINE? NOTE: If you discover that some or all words are missing their last letter, it means they were not tagged properly in step 2. 5) Convert the DAWG file into PDB records with this command: ./dawgconv mylist.dwg mylist.dwd Notice that here your DAWG Description File (created in step 1) is used to help perform the conversion. 6) Generate the PDB file with this command: ./pdb -t data -c psLD -n LAMPWordsDB *rec > LWDB.pdb You can call the output .pdb file whatever you like, but the parameters must be given exactly as shown. 7) Install the PDB file into your Palm. Note that you'll wipe out any existing dictionary, as LAMPWords currently only supports one dictionary database at a time. 8) Run LAMPWords and verify it works. Ideally, first put LAMPWords in Anagramming mode and then run the same searches you used with the dawg_words program in step 3. The results should of course be the same (keeping in mind that dawg_words ignores categories). But make sure to try several searches in each mode to be completely sure everything works. DAWG Description (DWD) Files & Multi-Category Dictionaries ---------------------------------------------------------- In addition to the actual words, LAMPWords needs a bit of extra information to generate a dictionary. Some of this information is for the user's benefit but some of it tells LAMPWords how to search for and display words. The most important information LAMPWords needs is the category information. LAMPWords supports the ability to divide lists into categories and furthermore define categories which include words from other categories. The reason all this category stuff is present is to compensate for the fact that Scrabble(r) organizations around the world are not united in the lexicons they use. However, categories might also be useful for multi-language dictionaries, multi-topic dictionaries, etc. LAMPWords supports up to 8 categories, numbered 0 thru 7. Every word is assigned a category number. Category numbers are assigned ("tagged") by placing the number immediately after each word in the text word list (no spaces), before it is made into a DAWG. The script tag.py can be used to help place the category numbers next to each word. To use tag.py you need Python, and you need to prepare your input word lists in a certain way. tag.py works asking for a list of input files, then asking for file lists to define the categories. A word must be in ALL of the files in a category's file list (and ONLY those files) to be considered part of that category. If a file list has just one file then the word must be in ONLY that file to be in the category. To use words from ANY of a list, you have to merge the lists together into a new file (eliminating duplicates) and then (possibly) pull them out of the original files. This is not too hard to do with standard Unix tools, so it's not supported by tag.py. tag.py will guide you through the process of entering the file names and category lists. It shouldn't be too hard to use if you understand everything else in this section. Categories are named, and the names are shown to the user in the category selection list, available in the upper-right corner of the LAMPWords main screen. Categories can also be given symbols so after searches involving multiple categories (see next paragraph) it is easy to tell which category each word came from. Each category also has what I call "include information". This tells LAMPWords which categories should be searched when a category is selected. The include information is simply a matrix of bits. Each row of the matrix gives the include information for one category, and the bits within each row tell LAMPWords the actual categories to include. For example, suppose the first row is 10110000. This is the include information for category 0, and it tells LAMPWords to search categories 0, 2, and 3, but not 1, 4, 5, 6, or 7. Note that you MUST set a category's own bit if there are words in that category! In other words, you must tell the category to "include itself" in the search. The only time you might not want to do this is if you have a category was composed entirely of other categories (e.g. an "All" category that searches every category in the dictionary). I know this sounds a little bizzare, but doing it this way keeps the category checking code as simple (and thus as fast) as possible. In addition to the category information, LAMPWords requires the character set to be used in the dictionary. LAMPWords supports up to 63 characters in a dictionary. The character set is just a line of acceptable characters. Normally this will be ABCDEFGHIJKLMNOPQRSTUVWXYZ. Note that acceptable characters currently CANNOT include the question mark, period, or space. Also, characters are limited to UPPER CASE (though this restriction does not apply to accented characters). NOTE: The utilities provided don't currently support character sets other than the one shown above. It would only take minor modifications to get them to do so, but I haven't had the time or the need yet. With that, we come to the format of the DAWG Description File (DWD File for short). This file normally shares the name of the main DAWG file, except that the extension is .dwd instead of .dwg. You can name it anything you like, however. The DWD File is a fixed-format text file, with each line having a specific meaning. The lines are as follows: Line 1 The short name of the dictionary. This is limited to 10 characters or less. Normally it should be the name or initials of the word source followed by a version number. The name is used by LAMPWords solely to determine if the dictionary database has changed: LAMPWords stores the name of the last dictionary and checks it each time it runs. If the dictionary has changed, LAMPWords will blow out any old results listing as it may show now up as gibberish. This name may also be used later in a list if support for multiple dictionaries is added. Lines 2-5 Dictionary information. This will be shown if the user selects the Dictionary Information option from the menu. You can put whatever you like on these lines, but the lines are intended for this information, in order: Title (long), version/date, list author/source, additional info. Each line is limited to 40 characters in length. Line 6 Character set. A string of characters with no spaces or any extra garbage. Note that the ASCII code of the character in the text file must correspond to the appropriate character code on the Palm. This means that what you see on your monitor might be different than what the Palm will show. Escape codes are not supported on this line. You can use up to 63 characters and may not include space, period, question mark, or unaccented lower case letters. Character #0 is reserved, so the first character on the line is character #1. NOTE: The provided tools work only with upper-case English characters! Line 7 Category symbols. Categories can have symbols so that the user can tell which category a word can come from if the search involves more than one category. You can put up to 8 characters on this line, one for each category. The first is for category 0, the second character for category 1, etc. An underscore (_) will be translated into a space, but you can also use real spaces. A space essentially means the category has no symbol. Lines 8-15 Category names. One name per line. Each name can be up to 10 characters long. The first line names category 0, the second line names category 1, etc. Leave lines blank if the categories are unused. Lines 16-23 Category include information. This is explained above. The first line is for category 0, the second for category 1, etc. Each line should be exactly 8 characters long. Use zeros wherever there are unused categories (rows and columns). At this point you should have enough information to create your own dictionary databases. The last section describes the actual format of the dictionary databases, and is intended for LAMPWords developers. If you need help or clarification on something, please feel free to contact me. The contact information is at the end of this file. The DAWG Format --------------- DAWG stands for Directed Acyclic Word Graph. Using a DAWG to represent a large word list is very economical in terms of both size and speed. (Those who used the prerelease version of LAMPWords, which used a straight ASCII list, can testify to this - the smallest dictionary was just under 1MB and even no-blank anagram searches took many seconds.) In a DAWG, each vertex on the graph is actually a list of letters, and each of the letters in the list has at most one edge associated with it. I call each of the letters in a list a "node". Searches begin at a fixed starting vertex. If a letter in the list matches a letter in the search pattern (how this is determined depends on the search type), its edge is traversed and the letters in the next vertex's list are then checked. DAWGs are fairly easy to search but are quite a pain to create. The hard part is finding all the matching patterns once all the words have been processed and "merging" duplicate branches. Thankfully, existing DAWG-creation tools already do a good job of this. I highly recommend reading the detailed description of DAWGs at http//www.wutka.com/dawg.html as this is was the main reference I used when implementing DAWGs in LAMPWords. I use a format that is basically the same as the format suggested in the file. However, I use only 6 bits for the character, 21 bits for the child node number, and add 3 bits for category information, with the 2 bits for flags staying the same. Thus the node size is still 4-bytes but the information inside is different. This is discussed further below. Before we go on, a reminder: The Palm's 68000-based processor is a big-endian device. Intel-based PCs are litle-endian, so some conversion may be needed. For example, the dawgconv program explicitly writes out the data in big-endian order because it is probably being run on a PC. The PDB file is organized very simply. Record 0 is the header information, which is basically just all of the information from the DWD File. The structure of record 0 is this (reproduced from the LAMPWords source): typedef struct { /* Magic string to identify the database. Used so we can tell the user to install a new dictionary if an old one is present. */ Char magic[5]; /* The database's name. Used so we can tell when the user switches to a new dictionary. Note this isn't displayed anywhere at this time. */ Char name[11]; /* Descriptive info, shown to the user on demand. */ Char title[41]; Char desc[41]; Char author[41]; Char extra[41]; /* Character translation info. */ Char ch[MAXCHARS]; UInt8 numchars; /* Category information. */ Char catsym[MAXCATEGORIES]; Char catname[MAXCATEGORIES][11]; Char catinclude[MAXCATEGORIES][MAXCATEGORIES]; UInt8 numcategories; /* Total number of nodes in the database. */ UInt32 numnodes; } DBHeader; About the only field that needs elaboration is the catinclude field. This is a matrix stored in row-major order. Each row contains the include information for one category; this is explained above. For each entry, "True" is represented by an ASCII 001, and "False" is an ASCII 000. I could have used one bit for each entry but that seemed liked too much trouble just to save a few bytes in the header. The remaining records contain the DAWG. Because Palm database records are limited to just under 64KB in size, each record in the DAWG may have at most 8092 nodes. This makes a record 32KB (plus overhead from the file format) in size. I could have used a larger number (like 15000) but I wanted to stick to powers of 2 and also leave a large safety margin. Node 0 of record 1 (the first DAWG record) is reserved. Searches begin at node 1 of the same record. Only the last record may have less than 8092 nodes; the others must have exactly this number. This makes the math easier when searching. DAWG record numbers must start at 1 and increment by 1 until the last record. LAMPWords will access the number of records when it starts to perform memory allocation and check the validity of the dictionary database. Nodes have the following format: typedef struct { UInt8 endoflist : 1; UInt8 endofword : 1; Char letter : 6; /* To display, use the header's translation info. */ UInt8 category : 3; /* Only defined when endofword is true. */ UInt32 child : 21; /* 0 indicates no child. */ } DBNode; The gcc cross-compiler for the Palm stores these bitfields in the order shown, i.e. the leftmost (most significant) bit of the 4-byte word is the end of list flag, and the child number is stored on the right (least significant) side of the 4-byte word. The order was chosen because it is fairly easy for the dawgconv program, which is usually run on a little-endian system, to encode. The letters in the dictionary are encoded; display is accomplished via a lookup table in the header record. This allows different character sets to be used. The character set is specified in the DWD File described earlier. Contact Information ------------------- The author of LAMPWords is Paul Sidorsky. He can be reached at: paulsid@shaw.ca You should also visit the LAMPWords web site for updates, tips, etc. The web site address is: http://members.shaw.ca/lampwords/ If you put together a dictionary that you think might be useful to others and has a freely-redistributable word list, please send it to me and I'll be happy to put it on the LAMPWords web site. -- End of file: lwdb.txt --