The heart of every sort program is a procedure that compares two strings of text and decides which one is "bigger", i.e. which one comes later in a dictionary. This task is not trivial in Czech, where many letters are modified with accents and other diacritical marks.
Diacritics do not affect the sort weight in the first pass, with four exceptions: ČŘŠŽ. Before two Czech strings are compared, they must be translated to a reduced alphabet that consists only of the digits 0123456789 and the uppercase letters ABCČDEFGHIJKLMNOPQRŘSŠTUVWXYZŽ.
As an example we will compare four Czech words:
sije (he or she sows), šije (he or she sews),
šíje (isthmus) and Šíje (proper name).
So let's strip off the diacritics (except for the four letters
mentioned above) and convert the strings to uppercase:
Strings | Pass 1 translation | Order |
---|---|---|
sije | SIJE | 1. |
šije | ŠIJE | 2.? |
šíje | ŠIJE | 2.? |
Šíje | ŠIJE | 2.? |
Only when the translated strings are equal is an additional pass needed, in which the sort weights of letters with diacritics come into play.
In this second pass the accents are not removed, but the comparison is still case-insensitive:
Strings | Pass 1 translation | Pass 2 translation | Order |
---|---|---|---|
sije | SIJE | SIJE | 1. |
šije | ŠIJE | ŠIJE | 2. |
šíje | ŠIJE | ŠÍJE | 3.? |
Šíje | ŠIJE | ŠÍJE | 3.? |
If the strings are still equal after the second translation, letter case must be taken into account in pass three:
Strings | Pass 1 translation | Pass 2 translation | Pass 3 translation | Order |
---|---|---|---|---|
sije | SIJE | SIJE | sije | 1. |
šije | ŠIJE | ŠIJE | šije | 2. |
šíje | ŠIJE | ŠÍJE | šíje | 3. |
Šíje | ŠIJE | ŠÍJE | Šíje | 4. |
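The three passes shown in the tables can be illustrated with a short Python sketch. It is not SortKit's generated code: it works on Unicode strings rather than code page tables, and the names sort_key and strip_accents are made up for this illustration. In pass 2, variants of the same letter are ordered by Unicode code point, which is an assumption and not necessarily the order the standard requires.

```python
import unicodedata

# Accented letters that keep their own weight in pass 1.
KEEP = set("ČŘŠŽ")
# Reduced alphabet from the text above, in collating order.
ALPHABET = "0123456789ABCČDEFGHIJKLMNOPQRŘSŠTUVWXYZŽ"


def strip_accents(ch):
    """Return ch without combining marks, e.g. 'Í' -> 'I'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def sort_key(text):
    """Build a (pass 1, pass 2, pass 3) key; tuples compare pass by pass."""
    text = unicodedata.normalize("NFC", text)
    upper = text.upper()
    reduced = "".join(c if c in KEEP else strip_accents(c) for c in upper)
    # Pass 1: position in the reduced alphabet. Characters outside it
    # are skipped (a simplification made for this sketch).
    pass1 = [ALPHABET.index(c) for c in reduced if c in ALPHABET]
    # Pass 2: accents kept, case ignored. Assumption: code point order.
    pass2 = [ord(c) for c in upper]
    # Pass 3: lowercase sorts before uppercase (False < True).
    pass3 = [c.isupper() for c in text]
    return (pass1, pass2, pass3)


words = ["Šíje", "šíje", "šije", "sije"]
print(sorted(words, key=sort_key))  # ['sije', 'šije', 'šíje', 'Šíje']
```

Because Python compares tuples element by element, pass 2 is consulted only when the pass 1 keys are equal, and pass 3 only when both earlier passes tie.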
Sort weight tables alone do not meet all the requirements of the correct Czech sorting order, which is specified in the national standard ČSN 01 0181. The character pairs ch, Ch and CH (but not cH) are treated as a digraph, i.e. as a single letter. Unlike in English, where ch sorts between cg and ci, the Czech digraph ch sorts between h and i.
Words in the compared strings may be separated by multiple white spaces. The sorting program should reduce such runs to a single space before comparing. Leading spaces should also be omitted when comparing strings.
The collating order of foreign letters in the Czech alphabet, such as the Greek letter β (beta), depends on their phonetic equivalent.
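The handling of ch can be sketched in Python as follows. This is only an illustration of the rule described above, limited to pass 1. The names ORDER, elements and pass1_key are hypothetical, and characters outside the reduced alphabet (including accented vowels) are simply skipped to keep the example short.

```python
# Reduced alphabet with the CH digraph placed between H and I.
ORDER = list("0123456789ABCČDEFGH") + ["CH"] + list("IJKLMNOPQRŘSŠTUVWXYZŽ")
WEIGHT = {element: index for index, element in enumerate(ORDER)}


def elements(text):
    """Split text into collating elements, treating ch, Ch and CH as one."""
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in ("ch", "Ch", "CH"):  # cH is deliberately excluded
            out.append("CH")
            i += 2
        else:
            out.append(text[i].upper())
            i += 1
    return out


def pass1_key(text):
    """Pass 1 weights; elements without a weight are skipped in this sketch."""
    return [WEIGHT[e] for e in elements(text) if e in WEIGHT]


words = ["ihned", "chata", "hrad", "cihla", "cesta"]
print(sorted(words, key=pass1_key))
# ['cesta', 'cihla', 'hrad', 'chata', 'ihned']
```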
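A minimal Python sketch of this whitespace rule, assuming that tabs and other white space count the same as spaces (the text does not say). The function name normalize_spaces is made up for this illustration. Trailing spaces are not mentioned in the text, so the sketch leaves a single trailing space in place.

```python
import re


def normalize_spaces(text):
    """Collapse runs of white space to one space and drop leading spaces."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.lstrip(" ")


print(normalize_spaces("  Nové   Město nad  Metují"))  # Nové Město nad Metují
```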
The situation is further complicated by the fact that at least four different Czech code pages are in use in the world of personal computers. Creating such tables by hand is tedious work, which is why SortKit was written.
SortKit works like a special compiler: it reads collating definitions from a source file and writes a three-pass sort weight table in the syntax of the selected programming language (Assembler, C or Pascal). It generates not only the three translation tables but also a complete, ready-to-compile bubble sort program. You may want to copy and paste only the SortWeightTable definition or the Compare function into your own sorting program.
The required target language is selected with the parameters /Asm /Bin /C /Nasm /Pas (multiple selections are possible). The /Binary variant generates only a table of 3*256=768 bytes (no source code). The /NASM variant is a 32-bit Windows console application.
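To give an idea of how a compare function built on three weight tables can drive a bubble sort, here is a rough Python sketch. It is not the code SortKit generates. The tables produced by make_tables are hypothetical and cover only ASCII letters and digits, and the names compare and bubble_sort are chosen for this illustration.

```python
def make_tables():
    """Hypothetical weights for ASCII letters and digits only."""
    pass1, pass2, pass3 = [0] * 256, [0] * 256, [0] * 256
    for weight, ch in enumerate("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", start=1):
        for code in (ord(ch), ord(ch.lower())):
            pass1[code] = weight   # no accents in ASCII, so passes 1
            pass2[code] = weight   # and 2 are identical here
        pass3[ord(ch)] = 2         # uppercase sorts after ...
        pass3[ord(ch.lower())] = 1 # ... lowercase (digits end up as 1)
    return pass1, pass2, pass3


def compare(a, b, tables):
    """Return -1, 0 or 1; a later pass is used only if earlier ones tie."""
    for table in tables:
        key_a = [table[byte] for byte in a]
        key_b = [table[byte] for byte in b]
        if key_a != key_b:
            return -1 if key_a < key_b else 1
    return 0


def bubble_sort(items, tables):
    """Plain bubble sort using compare(); returns a new list."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if compare(items[i], items[i + 1], tables) > 0:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break
    return items


print(bubble_sort([b"beta", b"Alpha", b"alpha", b"10"], make_tables()))
# [b'10', b'alpha', b'Alpha', b'beta']
```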
Parameter | Language | Extension | Syntax |
---|---|---|---|
/A | Assembler | .ASM | Turbo Assembler 2 |
/N | Netwide Assembler | .NAS | NASMW 0.98 |
/C | C | .C | Microsoft C/C++ 7.0 |
/P | Pascal | .PAS | Turbo Pascal 5 |
/B | - | .BIN | - |
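A 768-byte table such as the one produced by /B could be loaded along these lines. The sketch assumes that the file holds the three 256-byte tables one after another in pass order. That layout is a guess based on the 3*256 size, not something this manual states. The function names are hypothetical.

```python
TABLE_SIZE = 256
PASSES = 3


def split_tables(blob):
    """Split a 768-byte blob into three 256-byte weight tables."""
    if len(blob) != PASSES * TABLE_SIZE:
        raise ValueError("expected %d bytes, got %d" % (PASSES * TABLE_SIZE, len(blob)))
    # Assumption: the tables are stored back to back in pass order.
    return [blob[i * TABLE_SIZE:(i + 1) * TABLE_SIZE] for i in range(PASSES)]


def load_tables(path):
    """Read a .BIN file from disk and split it into per-pass tables."""
    with open(path, "rb") as handle:
        return split_tables(handle.read())


# Demonstration with synthetic data instead of a real .BIN file.
demo = bytes([1]) * 256 + bytes([2]) * 256 + bytes([3]) * 256
pass1, pass2, pass3 = split_tables(demo)
print(pass1[65], pass2[65], pass3[65])  # 1 2 3
```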
The next four parameters specify which code page SortKit should use for the Czech comments it writes into the source of the bubble sort program. This has nothing to do with the characters defined in the sort weights definition file.
Parameter | Code page |
---|---|
/ISO | ISO-8859-2 |
/KAM | CP895 Kamenických |
/LAT | CP852 PC Latin2 |
/WIN | CP1250 Windows CE |
Each generated source contains compilation hints in its header comment. Here is an example of how to create a sort utility in Pascal for the OEM code page Kamenických with the "factory default" sequences in VAHY1KAM.DSW:
SORTKIT.COM /Pascal /Kam VAHY1KAM.DSW
TPC.EXE VAHY1KAM.PAS
VAHY1KAM.EXE < VZOREK.TXT | MORE
Characters with the same sort weight must be specified on the same row. A character may be defined in decimal form (0..255), in hexadecimal form (0x00..0xFF or 00h..FFh), or as a single-quoted or double-quoted string.
Definitions on the same line are separated by a comma ","; an ellipsis ".." between two definitions denotes a range. An unquoted semicolon starts a remark. Example of a valid definition line: "()",0x5B..0x5D,123..125 ; parenthesis, braces, slashes
The first row specifies characters with sort weight zero, which is treated in a special way: these characters are ignored when sort keys are compared.
The recommended extension for a definition file is DSW. The file must declare sort weights for all three passes. Each pass specification starts with a line that has an asterisk "*" in the first column. Use the included VAHY1*.DSW files as a starting point.
Here are examples of typical SortKit messages:
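The syntax of a definition line can be approximated by a small Python parser. This is only a sketch of the rules as described here, not SortKit's actual grammar. The function names are hypothetical, ranges are supported only between numeric endpoints, and quoted strings are converted with an assumed code page (cp1250 by default).

```python
def parse_number(token):
    """Accept decimal (123), 0x-prefixed (0x5B) and h-suffixed (5Bh) forms."""
    token = token.strip()
    if token.lower().startswith("0x"):
        return int(token[2:], 16)
    if token.lower().endswith("h"):
        return int(token[:-1], 16)
    return int(token)


def split_items(line):
    """Split on unquoted commas and stop at an unquoted ';' remark."""
    items, current, quote = [], "", None
    for ch in line:
        if quote:                      # inside a quoted string
            current += ch
            if ch == quote:
                quote = None
        elif ch in "\"'":              # opening quote
            quote = ch
            current += ch
        elif ch == ";":                # the remark runs to end of line
            break
        elif ch == ",":
            items.append(current.strip())
            current = ""
        else:
            current += ch
    items.append(current.strip())
    return [item for item in items if item]


def parse_line(line, encoding="cp1250"):
    """Return the list of character codes defined on one line."""
    codes = []
    for item in split_items(line):
        if item[0] in "\"'":           # quoted string: one code per byte
            codes.extend(item[1:-1].encode(encoding))
        elif ".." in item:             # numeric range, inclusive
            low, high = item.split("..")
            codes.extend(range(parse_number(low), parse_number(high) + 1))
        else:
            codes.append(parse_number(item))
    return codes


print(parse_line('"()",0x5B..0x5D,123..125 ; parenthesis, braces, slashes'))
# [40, 41, 91, 92, 93, 123, 124, 125]
```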
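The effect of weight zero can be shown with a tiny Python sketch. The table built here is hypothetical (ASCII letters only, everything else weight zero) and the function names are made up for this illustration.

```python
def make_table():
    """Hypothetical one-pass table: letters get weights, the rest stay 0."""
    table = [0] * 256
    for weight, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ", start=1):
        table[ord(ch)] = weight
        table[ord(ch.lower())] = weight
    return table


def sort_key(data, table):
    """Translate bytes to weights, dropping characters with weight zero."""
    return [table[byte] for byte in data if table[byte] != 0]


table = make_table()
# The hyphen has weight zero, so both spellings produce the same key.
print(sort_key(b"co-op", table) == sort_key(b"coop", table))  # True
```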
**Error** BADFILE.DSW(8) sloupec 6: chyba syntaxe.
Syntax error on line 8, column 6.
*Warning* BADFILE.DSW(14) sloupec 10: vícenásobná definice znaku 17
Character 17 on line 14, column 10 was already defined in the same pass.
*Warning* BADFILE.DSW(17) Tab.2: nedefinovaná váha znaku 252..255
Characters 252 to 255 were not defined in pass 2.
VAHY1ASC.DSW neobsahuje žádné syntaktické chyby.
No errors found.
VAHY1LAT.ASM byl úspěšně vygenerován.
Source was generated.
SortKit is freeware. Please distribute only the original package downloaded from vit$oft freeware which should contain the following files:
File | Description |
---|---|
SORTKIT.COM | sort weight compiler |
SORTKITC.HTM | Czech manual |
SORTKIT.HTM | This English manual |
VITSOFT.CSS | Stylesheet |
VAHY1ISO.DSW | Source sort weights in ISO-8859-2 |
VAHY1KAM.DSW | Source sort weights in CP895 Kamenických |
VAHY1LAT.DSW | Source sort weights in CP852 DOS Latin 2 |
VAHY1WIN.DSW | Source sort weights in CP1250 Windows |
VZOREK.TXT | Sample of unsorted text in codepage Kamenických |
VZOREKW.TXT | Sample of unsorted text in codepage CP1250 |
CPP.ZIP | C source modification for C++ by Petr Soucek |
FILE_ID.DIZ | BBS distribution identifier |
See the Czech manual.