SortKit

Czech sorting

The heart of every sort program is a procedure that compares two strings of text and decides which one is "bigger", i.e. which one comes later in a dictionary. This task is not trivial in Czech, where many letters are modified with accents and other diacritical marks.

In the first pass, diacritics do not affect the sort weight, with four exceptions: Č, Ř, Š and Ž. Before two Czech strings are compared, they must be translated to a reduced alphabet that consists only of the digits 0123456789 and the uppercase letters ABCČDEFGHIJKLMNOPQRŘSŠTUVWXYZŽ.

As an example we will compare four Czech words:
sije (he or she sows), šije (he or she sews), šíje (isthmus) and Šíje (proper name). First we strip the diacritics (except for the four letters mentioned above) and convert the strings to uppercase:

 String  Pass 1 translation  Order
 sije    SIJE                1.
 šije    ŠIJE                2.?
 šíje    ŠIJE                2.?
 Šíje    ŠIJE                2.?

Only when the translated strings are equal must an additional pass be made. In that pass, letters with diacritics are weighted in this order:

  1. no diacritics
  2. dot above
  3. acute
  4. circumflex
  5. breve
  6. caron
  7. diaeresis
  8. double acute
  9. ring above
  10. ogonek
  11. stroke
  12. cedilla
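The ranking above can be sketched in Python as a lookup keyed by Unicode combining marks. This is only an illustration, not SortKit's own table: the names RANK_BY_MARK, STROKE_LETTERS and diacritic_rank are hypothetical, and the stroke letters are listed by hand because Unicode gives them no canonical decomposition.

```python
import unicodedata

# Rank numbers follow the list above (1 = no diacritics ... 12 = cedilla).
RANK_BY_MARK = {
    "\u0307": 2,   # dot above
    "\u0301": 3,   # acute
    "\u0302": 4,   # circumflex
    "\u0306": 5,   # breve
    "\u030C": 6,   # caron
    "\u0308": 7,   # diaeresis
    "\u030B": 8,   # double acute
    "\u030A": 9,   # ring above
    "\u0328": 10,  # ogonek
    "\u0327": 12,  # cedilla
}

# Assumption: a small hand-picked set; stroke letters do not decompose.
STROKE_LETTERS = set("ŁłĐđØø")


def diacritic_rank(ch):
    """Return the rank (1..12) of the diacritic carried by one character."""
    if ch in STROKE_LETTERS:
        return 11
    # NFD splits an accented letter into base letter + combining marks.
    for mark in unicodedata.normalize("NFD", ch)[1:]:
        if mark in RANK_BY_MARK:
            return RANK_BY_MARK[mark]
    return 1
```

For example, diacritic_rank("í") gives the acute rank and diacritic_rank("š") the caron rank, so within otherwise equal strings í sorts before š-style caron letters.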

In the second pass accents are not removed, but the comparison is still case-insensitive:

 String  Pass 1 translation  Pass 2 translation  Order
 sije    SIJE                SIJE                1.
 šije    ŠIJE                ŠIJE                2.
 šíje    ŠIJE                ŠÍJE                3.?
 Šíje    ŠIJE                ŠÍJE                3.?

If the strings remain equal after the second translation, the case of the letters must be taken into account in pass three:

 String  Pass 1 translation  Pass 2 translation  Pass 3 translation  Order
 sije    SIJE                SIJE                sije                1.
 šije    ŠIJE                ŠIJE                šije                2.
 šíje    ŠIJE                ŠÍJE                šíje                3.
 Šíje    ŠIJE                ŠÍJE                Šíje                4.
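The three passes can be sketched in Python as a tuple of three keys that is compared element by element. This is a minimal illustration working on Unicode strings rather than on SortKit's 256-entry code page tables. The names ALPHABET, MARK_RANK and sort_key are hypothetical, only the three marks used in Czech are ranked, lowercase is assumed to sort before uppercase (as in the table above), and characters outside the reduced alphabet simply get weight 0.

```python
import unicodedata

# Reduced alphabet of pass 1; the position in the string is the weight.
ALPHABET = "0123456789ABCČDEFGHIJKLMNOPQRŘSŠTUVWXYZŽ"
WEIGHT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
KEEP_CARON = set("ČŘŠŽ")  # letters that keep their diacritic in pass 1

# Assumption: only the marks used in Czech (acute, caron, ring above).
MARK_RANK = {"\u0301": 3, "\u030C": 6, "\u030A": 9}


def base_letter(ch):
    """Pass 1: uppercase and strip diacritics, except for Č, Ř, Š, Ž."""
    up = unicodedata.normalize("NFC", ch.upper())
    if up in KEEP_CARON:
        return up
    return unicodedata.normalize("NFD", up)[0]


def mark_rank(ch):
    """Pass 2: rank of the diacritic, 1 when there is none."""
    for mark in unicodedata.normalize("NFD", ch)[1:]:
        if mark in MARK_RANK:
            return MARK_RANK[mark]
    return 1


def sort_key(s):
    """Return (pass 1, pass 2, pass 3) keys; tuples compare pass by pass."""
    s = unicodedata.normalize("NFC", s)
    p1 = tuple(WEIGHT.get(base_letter(ch), 0) for ch in s)
    p2 = tuple(mark_rank(ch) for ch in s)
    p3 = tuple(1 if ch.isupper() else 0 for ch in s)  # lowercase first
    return (p1, p2, p3)


words = ["Šíje", "šíje", "šije", "sije"]
print(sorted(words, key=sort_key))
```

Because Python compares tuples left to right, the second key is consulted only when the first keys are equal, and the third only when the first two are equal, which mirrors the pass-by-pass rule described above.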

Sort weight tables alone do not satisfy all requirements of the correct Czech sorting order, which is specified in the national standard ČSN 01 0181. The character pairs ch, Ch and CH (but not cH) are treated as a digraph, i.e. as a single letter. Unlike in English, where ch sorts between cg and ci, the Czech digraph ch sorts between h and i.
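One way to handle the ch rule is to split each word into collating elements before looking up weights. The Python sketch below is an assumption about how this could be done, not SortKit's generated code; ORDER, elements and key are hypothetical names, and the sketch accepts only unaccented letters plus Č, Ř, Š, Ž.

```python
# Pass 1 letter order with the digraph CH placed between H and I.
ORDER = ("A B C Č D E F G H CH I J K L M N O P Q R Ř S Š "
         "T U V W X Y Z Ž").split()
WEIGHT = {letter: i for i, letter in enumerate(ORDER)}


def elements(word):
    """Split a word into collating elements, merging ch, Ch and CH."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in ("ch", "Ch", "CH"):  # "cH" is not a digraph
            out.append("CH")
            i += 2
        else:
            out.append(word[i].upper())
            i += 1
    return out


def key(word):
    """Pass 1 key; raises KeyError for letters missing from ORDER."""
    return [WEIGHT[e] for e in elements(word)]


print(sorted(["chata", "cihla", "hrad", "ikona", "cesta"], key=key))
```

With this key, chata lands after hrad and before ikona, while cihla stays among the words beginning with c.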

Words in the compared strings may be separated by multiple white-space characters. The sorting program should reduce such runs to a single space before comparing. Leading spaces should also be ignored when comparing strings.
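The white-space rule can be sketched in a few lines of Python. The function name normalize_spaces is hypothetical; the sketch drops leading white space and collapses every remaining run to one space, and it leaves a single trailing space in place because the text above does not say how trailing spaces are treated.

```python
import re


def normalize_spaces(s):
    """Drop leading white space and collapse runs to a single space."""
    return re.sub(r"\s+", " ", s.lstrip())


print(normalize_spaces("   Karel    Čapek"))
```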

The collating order of foreign letters in the Czech alphabet, such as the Greek letter β (beta), depends on their phonetic equivalent.

The situation is further complicated by the fact that at least four different Czech code pages are in use in the world of personal computers. Creating such tables by hand is tedious work, which is why SortKit was written.

SortKit utility

SortKit works like a special compiler: it reads collating definitions from a source file and writes a three-pass sort weight table in the syntax of the selected programming language (Assembler, C or Pascal). It generates not only the three translation tables but also a complete, ready-to-compile bubble sort program. You may want to copy and paste only the SortWeightTable definition or the Compare function into your own sorting program.

The target language is selected with the parameters /Asm /Bin /C /Nasm /Pas (several may be given at once). The /Binary variant generates only the 3*256=768 byte table (no source code). The /NASM variant produces a 32-bit Windows console application.

 Parameter  Language           Extension  Syntax
 /A         Assembler          .ASM       Turbo Assembler 2
 /N         Netwide Assembler  .NAS       NASMW 0.98
 /C         C                  .C         Microsoft C/C++ 7.0
 /P         Pascal             .PAS       Turbo Pascal 5
 /B         -                  .BIN       -
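A program written in another language could consume the 768-byte binary table roughly as sketched below in Python. The layout is an assumption: the sketch supposes that the three 256-byte tables are stored one after another in pass order and are indexed by character code, which this manual does not state explicitly. The names split_tables and translate_all are hypothetical, and the demo data is synthetic, not real SortKit output.

```python
def split_tables(data):
    """Split 768 bytes into three 256-byte tables (assumed pass order)."""
    if len(data) != 3 * 256:
        raise ValueError("expected exactly 768 bytes")
    return [data[i * 256:(i + 1) * 256] for i in range(3)]


def translate_all(raw, tables):
    """Translate a byte string through each table, giving one key per pass."""
    return tuple(raw.translate(t) for t in tables)


# Synthetic stand-in for a .BIN file: passes 1 and 2 fold ASCII letters
# to uppercase, pass 3 is the identity. Real weights would differ.
identity = bytes(range(256))
upper = bytes(b - 32 if 97 <= b <= 122 else b for b in range(256))
tables = split_tables(upper + upper + identity)

print(translate_all(b"Sije", tables))
```

The resulting tuple can be used directly as a sort key, since tuples of byte strings compare pass by pass.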

The next four parameters specify which code page SortKit uses for the Czech comments it writes into the bubble sort program source. This has nothing to do with the characters defined in the sort weight definition file.

 Parameter  Code page
 /ISO       ISO-8859-2
 /KAM       CP895 Kamenických
 /LAT       CP852 PC Latin2
 /WIN       CP1250 Windows CE

Each generated source contains compilation hints in its header comment. Here is an example of how to create a sort utility in Pascal for the OEM code page Kamenických with the "factory default" sequences from VAHY1KAM.DSW:

  1. Convert weight definitions to a Pascal source:
    SORTKIT.COM /Pascal /Kam VAHY1KAM.DSW
  2. Compile using Turbo Pascal:
    TPC.EXE VAHY1KAM.PAS
  3. Check the sorting function on a sample text VZOREK.TXT:
    VAHY1KAM.EXE < VZOREK.TXT | MORE

Syntax of sort weight definition file

Characters with the same sort weight must be specified on the same row. A character may be given in decimal form (0..255), in hexadecimal form (0x00..0xFF or 00h..FFh), or as a single-quoted or double-quoted string.

Definitions on the same line are separated by a comma "," or by an ellipsis "..". An unquoted semicolon starts a remark. Example of a valid definition line: "()",0x5B..0x5D,123..125 ; parenthesis, braces, slashes
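To make the line syntax concrete, here is a small Python sketch of a parser for one definition line. It is not SortKit's parser. It rests on two assumptions: that ".." between two values denotes the inclusive range between them (as the example line suggests), and that characters in quoted strings map to their codes via ord(), whereas a real DSW file is a byte stream in one of the Czech code pages. TOKEN and parse_line are hypothetical names.

```python
import re

# One token: quoted string, hex (0x5B or 5Bh), decimal, separator or remark.
TOKEN = re.compile(r'''
    \s*(?:
        "(?P<dq>[^"]*)"
      | '(?P<sq>[^']*)'
      | (?P<hex0>0[xX][0-9A-Fa-f]+)
      | (?P<hexh>[0-9A-Fa-f]+[hH])
      | (?P<dec>[0-9]+)
      | (?P<sep>\.\.|,)
      | (?P<rem>;.*)
    )''', re.VERBOSE)


def parse_line(line):
    """Return the list of character codes defined on one line."""
    codes, range_open, pos = [], False, 0
    line = line.strip()
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise ValueError(f"syntax error at column {pos + 1}")
        pos = m.end()
        kind, text = m.lastgroup, m.group(m.lastgroup)
        if kind == "rem":          # unquoted semicolon: rest is a remark
            break
        if kind == "sep":
            range_open = (text == "..")
            continue
        if kind in ("dq", "sq"):
            values = [ord(c) for c in text]
        elif kind == "hex0":
            values = [int(text, 16)]
        elif kind == "hexh":
            values = [int(text[:-1], 16)]
        else:
            values = [int(text)]
        if range_open and codes and values:
            # Fill in the codes between the previous value and this one.
            values = list(range(codes[-1] + 1, values[0] + 1)) + values[1:]
        range_open = False
        codes.extend(values)
    if any(not 0 <= c <= 255 for c in codes):
        raise ValueError("character code out of range 0..255")
    return codes


print(parse_line('"()",0x5B..0x5D,123..125 ; parenthesis, braces, slashes'))
```

Under these assumptions the example line yields the codes of the two parentheses followed by the ranges 91 to 93 and 123 to 125.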

The first row specifies characters with sort weight zero which is treated in a special way: those characters are not taken into account when comparing sort keys.

The recommended extension for a definition file is DSW. The file must declare sort weights for all three passes. Each pass specification starts with a line that has an asterisk "*" in the first column. Use the included VAHY1*.DSW files as a starting point.

SortKit reports

Here are examples of typical SortKit messages:
 **Error** BADFILE.DSW(8) sloupec 6: chyba syntaxe.
    Syntax error on line 8, column 6.
 *Warning* BADFILE.DSW(14) sloupec 10: vícenásobná definice znaku 17
    Character 17 on line 14, column 10 was already defined in the same pass.
 *Warning* BADFILE.DSW(17) Tab.2: nedefinovaná váha znaku 252..255
    Characters 252 to 255 were not defined in pass 2.
 VAHY1ASC.DSW neobsahuje žádné syntaktické chyby.
    No errors found.
 VAHY1LAT.ASM byl úspěšně vygenerován.
    Source was generated.

License agreement

SortKit is freeware. Please distribute only the original package downloaded from vit$oft freeware which should contain the following files:

 SORTKIT.COM   sort weight compiler
 SORTKITC.HTM  Czech manual
 SORTKIT.HTM   This English manual
 VITSOFT.CSS   Stylesheet
 VAHY1ISO.DSW  Source sort weights in ISO-8859-2
 VAHY1KAM.DSW  Source sort weights in CP895 Kamenických
 VAHY1LAT.DSW  Source sort weights in CP852 DOS Latin 2
 VAHY1WIN.DSW  Source sort weights in CP1250 Windows
 VZOREK.TXT    Sample of unsorted text in codepage Kamenických
 VZOREKW.TXT   Sample of unsorted text in codepage CP1250
 CPP.ZIP       C source modification for C++ by Petr Soucek
 FILE_ID.DIZ   BBS distribution identifier

Links

See the Czech manual.