EuroConvertor
About EuroConvertor
Supported encodings
Command-line syntax
  Encoding labels
  Encoding special names
  Encoding autodetection
  Encoding endianness
  Encoding BOM
  Input encoding modifiers
  Output encoding modifiers
Input errors
Output errors
Recapitulation

About

EuroConvertor ver.20190125 changes the encoding of characters in text files.

The 16-bit DOS version and the 32-bit Windows version of this program are bundled in a single executable file, euroconv.exe. When supplied with all required parameters, it works in batch scripts as a console application without human intervention. When it is run without parameters, for instance from Explorer, EuroConvertor opens a graphical window where the files and their encodings can be selected interactively from menus. Controls and their focus can be switched with the mouse or with the keys Tab, Shift-Tab and Space. Focus can also be moved with accelerator keys (Alt together with the underlined character).

EuroConvertor can be used in batch scripts in DOS and MS Windows, or interactively in Windows and in Unix-like systems (with the help of Wine).

EuroConvertor is available free of charge. It is written in EuroAssembler; its source can be reviewed online and downloaded together with its assembler from the EuroAssembler site.
The compiled binary file, together with this manual, is available for download as euroconv.zip from the vit$oft freeware site.

Encodings

The Unicode chart assigns an ordinal number (codepoint) to almost every character used by humans: letters, digits, symbols, semigraphic boxes, emojis, pictograms, ideograms. An encoding is the relation between a character's appearance (glyph) and its assigned value.

The first 128 Unicode codepoints were adopted from ASCII (American Standard Code for Information Interchange) and they are identical in all supported encodings.

An alternative to ASCII is EBCDIC (Extended Binary Coded Decimal Interchange Code), introduced by IBM on its mainframe systems. EBCDIC is not supported by EuroConvertor.

Each ASCII character is encoded in 7 bits, so the upper half of an 8-bit table is left vacant. Those upper 128 positions are often used for letters with diacritical marks, non-Latin alphabets, semigraphic and other symbols. Many European languages can make do with a limit of 256 possible characters (one byte each); such encodings are called OEM (Original Equipment Manufacturer) and ANSI (American National Standards Institute) code pages.

EuroConvertor supports the following 8-bit code pages:

IBM437, Mazovia, IBM737, IBM775, IBM850, IBM851, IBM852, IBM853, IBM855, IBM856, IBM857, IBM858, IBM859, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM867, IBM868, IBM869, IBM874, KOI8-R, KOI8-E, KOI8-T, KOI8-F, KOI8-CS, Kamenicky, IBM912, IBM1006, KOI8-RU, KOI8-U, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac-Roman, Mac-Arabic, Mac-Hebrew, Mac-Greek, Mac-Cyrillic, Mac-Romanian, Mac-Ukrainian, Mac-Thai, Mac-CE, Mac-Icelandic, Mac-Inuit, Mac-Turkish, Mac-Croatian, Mac-Gaelic, Mac-Celtic, Mac-Latin, NextStep, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-12, ISO-8859-14, ISO-8859-15, ISO-8859-16.

Each character in ASCII, OEM and ANSI encodings occupies exactly 1 byte.
In the Unicode encoding UTF-32, every character occupies 4 bytes.
In UTF-16, most characters occupy 2 bytes (if their codepoint can be stored as a 16-bit number). Other characters are encoded using two 16-bit numbers, a so-called surrogate pair.
UTF-8, the most frequently used Unicode encoding, uses a variable character size of 1 to 4 bytes.
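
The size rules above can be checked with a short Python sketch. It uses Python's own built-in codecs (cp852 stands in for IBM852 here), not EuroConvertor itself, and the helper names byte_length and surrogate_pair are made up for this illustration.

```python
# Byte sizes of single characters in several encodings,
# measured with Python's built-in codecs (not with EuroConvertor).
SAMPLES = ["A", "\u010d", "\u20ac", "\U0001F600"]  # A, c with caron, euro sign, emoji
ENCODINGS = ["cp852", "utf-8", "utf-16-le", "utf-32-le"]


def byte_length(ch, encoding):
    """Return the encoded size in bytes, or None if ch is not encodable."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


def surrogate_pair(codepoint):
    """Split a codepoint above 0xFFFF into its two UTF-16 surrogates."""
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for ch in SAMPLES:
    sizes = {enc: byte_length(ch, enc) for enc in ENCODINGS}
    print(f"U+{ord(ch):04X}", sizes)
```

A result of None means that the character does not exist in that code page at all, which is the situation counted later as an output error.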

Combining characters (diacritic marks together with a plain letter) are not supported by EuroConvertor.

Syntax

EuroConvertor expects four space-separated arguments on its command line, in a fixed order:

  1. input encoding specification
  2. output encoding specification
  3. input file name
  4. output file name

Example:
euroconv.exe IBM852 UTF-8 input.txt output.txt
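
When the conversion is driven from a script, the same four arguments can be assembled programmatically. The following Python sketch only builds the argument list in the documented order; the helper name build_command is hypothetical, and it assumes that euroconv.exe is reachable on the search path.

```python
import subprocess


def build_command(in_enc, out_enc, src, dst, exe="euroconv.exe"):
    """Return the argument list in the fixed order the manual describes."""
    return [exe, in_enc, out_enc, src, dst]


def convert(in_enc, out_enc, src, dst):
    """Run the converter and return its errorlevel (not called here)."""
    return subprocess.run(build_command(in_enc, out_enc, src, dst)).returncode


print(" ".join(build_command("IBM852", "UTF-8", "input.txt", "output.txt")))
```

Because the arguments are space-separated, file names containing spaces may need extra care; how EuroConvertor treats quoted names is not covered by this sketch.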

A fifth argument, if present, makes EuroConvertor wait for a key press when the conversion has finished. In the Windows version this keeps the console with the final summary from disappearing as soon as the conversion is over.

Encoding labels

The syntax of an encoding specification (the first two command-line arguments) is relaxed. The name is case insensitive, hyphens may be omitted, and often a mere decimal number is sufficient to identify the encoding. Besides the encoding names listed above, EuroConvertor also accepts their aliases and the code page identifiers assigned by Microsoft; for instance, the number 65001 was assigned to the encoding UTF-8.
To select Windows-1252 we could also use win-1252, CP1252 or just 1252.
Another example: the encoding ISO-8859-10 (Latin 6), used in Nordic countries, could also be specified as 8859-10, CP28600, 28600, IBM919 or 919.
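
The idea of relaxed labels can be illustrated with a small lookup in Python. This is only a sketch of the principle, not EuroConvertor's actual matching algorithm: the table holds just the aliases mentioned in this section, and the names ALIASES, normalize and resolve are hypothetical.

```python
# Hypothetical alias table limited to the labels quoted in this manual.
ALIASES = {
    "windows1252": "Windows-1252",
    "win1252": "Windows-1252",
    "cp1252": "Windows-1252",
    "1252": "Windows-1252",
    "utf8": "UTF-8",
    "65001": "UTF-8",
    "iso885910": "ISO-8859-10",
    "885910": "ISO-8859-10",
    "cp28600": "ISO-8859-10",
    "28600": "ISO-8859-10",
    "ibm919": "ISO-8859-10",
    "919": "ISO-8859-10",
}


def normalize(label):
    """Make a label case insensitive and drop its hyphens."""
    return label.lower().replace("-", "")


def resolve(label):
    """Return the canonical encoding name, or None for an unknown label."""
    return ALIASES.get(normalize(label))


print(resolve("win-1252"), resolve("CP28600"), resolve("65001"))
```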

Encoding special names

When the word enc is used as the first command-line argument, EuroConvertor displays the list of all supported encodings:
euroconv enc

The word OEM or ANSI may be used instead of an explicit encoding name. EuroConvertor will then use the code page that the current user has selected in the Regional Settings of Windows.

Encoding autodetection

The word auto may be used instead of an explicit input encoding name. EuroConvertor will analyze the frequencies of letters and guess the encoding of the input file. Only the first 1 MB is analyzed (in the DOS version, only the first 48 KB of input text).

Autodetection works best on files with plain text only. It will probably fail when the text is too short or when it contains symbols, non-Latin letters, semigraphics or machine code.

Encoding endianness

Endianness specifies how numbers bigger than 255 are stored in computer memory.
Little endian stores the least significant byte first. It is used on PCs with Intel x86 processors.
Big endian architecture stores the most significant byte first. It is declared as the default in Unicode when not otherwise specified.

Endianness is meaningful only in UTF-16 and UTF-32 encodings. Their names may be followed by LE or BE, optionally separated by a hyphen, by a slash or by nothing: UTF-16/LE, utf32be etc.
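
The two byte orders are easy to see in Python. The sketch below packs the codepoint 0x0161 (the letter s with caron) as a 16-bit number in both orders and compares the result with Python's own UTF-16 codecs; it illustrates the storage layout only and does not involve EuroConvertor.

```python
import struct

CODEPOINT = 0x0161  # LATIN SMALL LETTER S WITH CARON

# "<H" packs an unsigned 16-bit number least significant byte first,
# ">H" packs it most significant byte first.
little = struct.pack("<H", CODEPOINT)
big = struct.pack(">H", CODEPOINT)

print("UTF-16LE:", little.hex(" "))
print("UTF-16BE:", big.hex(" "))
print("same as codec:", chr(CODEPOINT).encode("utf-16-le") == little)
```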

Encoding BOM

The BOM (Byte Order Mark) is a special character that occurs at the very beginning of a Unicode-encoded text file and specifies its endianness.

EuroConvertor respects the BOM in UTF-16 and UTF-32 texts unless their endianness was explicitly specified in the first argument with the suffix /LE or /BE. When the input text has neither a suffix nor a BOM, the endianness is autodetected.

The BOM character itself is always excluded from the conversion of the input text. It is written to the output text only if the requested output encoding is a UTF encoding and its presence in the output file was explicitly requested with the /BOM modifier.
Example: euroconv utf-8 utf-16le-bom input.txt output.txt
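
How a BOM identifies the encoding can be sketched in Python with the constants from the standard codecs module. The function split_bom is a hypothetical helper for this illustration; it is not a description of EuroConvertor's internals.

```python
import codecs

# UTF-32 marks are tested first because the UTF-32LE mark (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def split_bom(data):
    """Return (encoding name or None, data with the BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


print(split_bom(codecs.BOM_UTF16_BE + "A".encode("utf-16-be")))
```

Note that a UTF-16LE text which starts with a BOM followed by a NUL character is indistinguishable from UTF-32LE by this test alone.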

Input encoding modifiers

EuroConvertor can detect HTML entities in input text and convert them to their corresponding characters. This happens when the modifier /HTML is present in the input encoding specification.
The similar modifier /HTM converts only those entities whose value is above 127, i.e. it passes the entity-encoded ASCII characters &, <, >, " (that is, &amp;, &lt;, &gt;, &quot;) without change.
By default, all HTML entities are ignored (they pass through the conversion unchanged).
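
The difference between the two input modifiers can be mimicked in Python with the standard html module. This is a rough sketch of the described behaviour, assuming semicolon-terminated entities; the function names decode_all and decode_non_ascii are hypothetical, and Python's entity table may not match EuroConvertor's exactly.

```python
import html
import re

# Named, decimal and hexadecimal entities terminated by a semicolon.
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_all(text):
    """Rough analogue of /HTML: convert every entity."""
    return html.unescape(text)


def decode_non_ascii(text):
    """Rough analogue of /HTM: convert only entities above codepoint 127."""
    def replace(match):
        decoded = html.unescape(match.group(0))
        if all(ord(ch) > 127 for ch in decoded):
            return decoded
        return match.group(0)  # ASCII or unknown entity stays as written
    return ENTITY.sub(replace, text)


sample = "&lt;b&gt; &eacute; &#269;"
print(decode_all(sample))
print(decode_non_ascii(sample))
```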

Output encoding modifiers

When the modifier /HTML is present in the output encoding specification, all characters that are not legal in the output encoding are converted to HTML hexadecimal entities.

The output encoding modifier /QM specifies that illegal characters are replaced with a question mark (?).

When the output encoding modifier /IGN is used, illegal characters are ignored (omitted from the output text).

By default, or if the modifier /TRANSL is explicitly appended to the output encoding, characters that are not available in the output encoding are replaced with their transliteration to ASCII.
Diacritics are removed from Latin letters, letters from non-Latin alphabets are transliterated to phonetically similar Latin letter(s), and graphic symbols are transliterated to visually similar ASCII symbols.
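
The four strategies can be approximated with Python's codec error handlers. The sketch below is an analogy only: the handler names hexentity and stripmarks are made up here, the exact form of the hexadecimal entity (digit case, padding) is an assumption, and the transliteration covers just the removal of diacritics, not non-Latin alphabets or graphic symbols.

```python
import codecs
import unicodedata


def _hex_entities(err):
    """Replace unencodable characters with hexadecimal HTML entities."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"&#x{ord(ch):X};" for ch in bad), err.end


def _strip_marks(err):
    """Replace unencodable letters with their base letter, else '?'."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    decomposed = unicodedata.normalize("NFKD", bad)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return plain.encode("ascii", "replace").decode("ascii"), err.end


codecs.register_error("hexentity", _hex_entities)
codecs.register_error("stripmarks", _strip_marks)

# Modifier names come from the manual; the handlers are Python stand-ins.
HANDLERS = {"/HTML": "hexentity", "/QM": "replace",
            "/IGN": "ignore", "/TRANSL": "stripmarks"}


def encode(text, encoding, modifier="/TRANSL"):
    return text.encode(encoding, errors=HANDLERS[modifier])


for modifier in HANDLERS:
    print(modifier, encode("Dvo\u0159\u00e1k", "latin-1", modifier))
```

In the sample, the letter r with caron is missing from ISO-8859-1 while the letter a with acute is present, so only the former is affected by the chosen modifier.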

Input errors

Each occurrence of an illegal character in the input file increases the input error counter. Illegal characters are those which are not defined in the declared encoding.

If the input text is specified as ASCII encoded, every byte with a value between 128 and 255 is illegal.

In 8-bit OEM and ANSI encodings, all byte values are usually used. Exceptions are rare; for instance, the South European encoding ISO-8859-3 declares the byte values 165, 174, 190, 195, 208, 227, 240 as undefined.
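
Undefined byte values of an 8-bit code page can be listed with Python's own codec tables, as in the sketch below. The helper names are hypothetical, and Python's tables are used only as a stand-in: for other code pages they may differ in details from the tables built into EuroConvertor.

```python
def undefined_bytes(encoding):
    """List the byte values that have no character in a single-byte encoding."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


def count_input_errors(data, encoding):
    """Count the bytes of data that are undefined in the given encoding."""
    bad = set(undefined_bytes(encoding))
    return sum(1 for byte in data if byte in bad)


print(undefined_bytes("iso8859_3"))
print(count_input_errors(b"abc\xa5\xf0", "iso8859_3"))
```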

The Unicode standard defines as legal only codepoints from the BMP (Basic Multilingual Plane) in the ranges 0..0xD7FF and 0xE000..0xFFFF, and codepoints from the Supplementary Planes in the range 0x10000..0x10FFFF.
Incomplete characters (odd file size) and a missing surrogate in UTF-16 are treated as input errors. The invalid UTF-8 sequences detected as input errors are those described in Wikipedia (missing continuation bytes, overlong sequences, illegal values).
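
The same classes of errors are rejected by Python's strict decoders, which makes them convenient for experiments. The following sketch checks the legal codepoint ranges and a few malformed byte sequences; the function names are hypothetical, and Python reports only the first error instead of counting all of them as EuroConvertor does.

```python
def is_legal_codepoint(cp):
    """True for 0..0xD7FF and 0xE000..0x10FFFF (surrogates excluded)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def is_valid(data, encoding):
    """True if data decodes without error in Python's strict mode."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


checks = {
    "overlong UTF-8": (b"\xc0\xaf", "utf-8"),
    "missing continuation byte": (b"\xe2\x82", "utf-8"),
    "odd size in UTF-16": (b"A", "utf-16-le"),
    "lone surrogate in UTF-16": (b"\x3d\xd8A\x00", "utf-16-le"),
    "well-formed euro sign": (b"\xe2\x82\xac", "utf-8"),
}
for label, (data, encoding) in checks.items():
    print(label, is_valid(data, encoding))
```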

Output errors

The counter of output errors increases with each character which is not encodable in the output encoding. EuroConvertor can treat nonencodable characters in four ways, selected with the output encoding modifier /IGN, /QM, /HTML or /TRANSL, see above.

Recapitulation

Characters from supplementary Unicode planes (Asian ideograms, emojis etc.) are not defined in OEM and ANSI encodings. Text containing such characters can only be converted successfully from one UTF encoding to another.

When the conversion is finished, EuroConvertor writes a final recapitulation to standard output. For both input and output it displays the file name, the file size, the (autodetected) encoding and the number of errors.

The size of the converted file is limited to 2 GB.

When both input and output errors are zero, the conversion is reversible. This means that the output file can be converted back to its original encoding and the result will be identical to the original input file.
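
Reversibility can be demonstrated with a round trip through Python's codecs. The sketch below is an analogy that assumes Python's cp852 and latin-1 tables; the helper round_trip_ok is hypothetical and treats any decoding or encoding error as a non-reversible conversion.

```python
def round_trip_ok(data, src_encoding, dst_encoding):
    """True if data survives src -> dst -> src conversion unchanged."""
    try:
        converted = data.decode(src_encoding).encode(dst_encoding)
        restored = converted.decode(dst_encoding).encode(src_encoding)
    except UnicodeError:
        return False  # an input or output error occurred on the way
    return restored == data


original = "k\u016f\u0148".encode("cp852")  # Czech word for "horse"
print(round_trip_ok(original, "cp852", "utf-8"))    # every letter is encodable
print(round_trip_ok(original, "cp852", "latin-1"))  # two letters are missing
```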

Errorlevel returned by EuroConvertor:
0 ... terminated successfully, no input or output errors were detected.
2 ... file was converted, but some characters were not encodable. Conversion is not reversible.
4 ... read or write file error.
8 ... syntax error, wrong arguments.
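
A calling script can translate the returned errorlevel into a message. The table in this Python sketch repeats the four values listed above; the function describe_errorlevel is hypothetical, and any other value is reported as unexpected because this manual does not say whether the values can be combined.

```python
# Errorlevel values as listed in this manual.
ERRORLEVELS = {
    0: "terminated successfully, no input or output errors",
    2: "converted, but some characters were not encodable",
    4: "read or write file error",
    8: "syntax error, wrong arguments",
}


def describe_errorlevel(code):
    """Return a readable message for an errorlevel returned by euroconv.exe."""
    return ERRORLEVELS.get(code, f"unexpected errorlevel {code}")


for code in (0, 2, 4, 8):
    print(code, describe_errorlevel(code))
```

In practice the code would come from the process exit status, for example from the returncode attribute of the result of subprocess.run in Python.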

