EuroConvertor ver.20190125 changes the encoding of characters in text files.
Both the 16bit DOS version and the 32bit Windows version of this program are bundled
in one executable file, euroconv.exe. When supplied with all requested parameters,
it works in batch scripts as a console application without human intervention.
When it is run without parameters, for instance from Explorer,
EuroConvertor launches a graphical window where the files and their encodings
can be selected interactively from menus.
Controls and their focus can be switched by mouse or by the keys
Tab, Shift-Tab and Space.
Focus can also be acquired by accelerator keys
(Alt together with the character emphasised by underline).
EuroConvertor can be used in batch scripts in DOS and MS Windows, or interactively in Windows and in Unix-like systems (with the help of wine).
EuroConvertor is available free of charge. It is written in EuroAssembler;
its source can be reviewed online
and downloaded together with its assembler from the EuroAssembler site.
The compiled binary file together with this manual is available
for download at the vit$oft freeware site as euroconv.zip.
Unicode Chart assigns an ordinal number (codepoint) to almost every character used by humans: letters, digits, symbols, semigraphic boxes, emojis, pictograms, ideograms. Encoding is the relation between the character's appearance (glyph) and its assigned value.
The first 128 Unicode codepoints were adopted from ASCII (American Standard Code for Information Interchange) and they are identical in all supported encodings.
An alternative to ASCII is EBCDIC (Extended Binary Coded Decimal Interchange Code), introduced by IBM on its mainframe systems. EBCDIC is not supported by EuroConvertor.
Each ASCII character is encoded in 7 bits, so the upper half of the 8bit table is vacant. Those upper 128 positions are often used for letters with diacritic signs, non-Latin alphabets, semigraphic and other symbols. Many European languages are satisfied with the limit of 256 possible characters (one byte for each); such encodings are called OEM (Original Equipment Manufacturer) and ANSI (American National Standards Institute) code pages.
EuroConvertor supports the following 8bit code pages:
IBM437, Mazovia, IBM737, IBM775, IBM850, IBM851, IBM852, IBM853, IBM855, IBM856, IBM857, IBM858, IBM859, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM867, IBM868, IBM869, IBM874, KOI8-R, KOI8-E, KOI8-T, KOI8-F, KOI8-CS, Kamenicky, IBM912, IBM1006, KOI8-RU, KOI8-U, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac-Roman, Mac-Arabic, Mac-Hebrew, Mac-Greek, Mac-Cyrillic, Mac-Romanian, Mac-Ukrainian, Mac-Thai, Mac-CE, Mac-Icelandic, Mac-Inuit, Mac-Turkish, Mac-Croatian, Mac-Gaelic, Mac-Celtic, Mac-Latin, NextStep, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-12, ISO-8859-14, ISO-8859-15, ISO-8859-16.
Each character in ASCII, OEM and ANSI encodings occupies exactly 1 byte.
In the Unicode encoding UTF-32 every character occupies 4 bytes.
In UTF-16 most characters occupy 2 bytes (if their codepoint can be saved as a 16bit number).
Other characters are encoded using two 16bit numbers, so-called surrogates.
The most frequently used Unicode encoding, UTF-8, uses a variable character size of 1 to 4 bytes.
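The size differences described above can be checked with a few lines of Python. This is only an illustration using Python's own codecs, not EuroConvertor; the helper name encoded_sizes is made up for this sketch.

```python
# Byte length of one character in the three Unicode encodings.
# The "-le" codec variants are used so that Python does not prepend a BOM.

def encoded_sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for a single character."""
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),
            len(ch.encode("utf-32-le")))

for ch in ["A", "č", "€", "😀"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The emoji U+1F600 lies outside the 16bit range, so in UTF-16 it needs a surrogate pair (4 bytes).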
Combining characters (diacritic marks together with a plain letter) are not supported by EuroConvertor.
EuroConvertor expects four space-separated arguments on its command line in fixed order: input encoding, output encoding, input file name and output file name.
Example:
euroconv.exe IBM852 UTF-8 input.txt output.txt
The undocumented fifth argument, if present, makes EuroConvertor wait for a key press when the conversion has finished. In the window version this prevents the console with the final summary from disappearing as soon as the conversion is over.
The syntax of the encoding specification (first two command-line arguments) is relaxed.
The name is case insensitive, hyphens may be omitted, and often
a mere decimal number is sufficient to identify the encoding.
Besides the encoding names listed above, EuroConvertor also accepts their aliases
and the code page identifiers assigned by Microsoft; for instance, the number 65001
was assigned to the encoding UTF-8.
To select Windows-1252 we could also use win-1252, CP1252 or just 1252.
Another example: the encoding ISO-8859-10 (Latin 6), used in Nordic countries,
could also be specified as 8859-10, CP28600, 28600, IBM919 or 919.
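One way such relaxed matching could work is to fold case and drop hyphens before looking the name up. The sketch below is hypothetical and is not taken from EuroConvertor's source; the names normalize, ALIASES and resolve are invented here, and the table holds only the spellings quoted in this manual.

```python
def normalize(name):
    """Fold case and drop hyphens, as the relaxed syntax permits."""
    return name.upper().replace("-", "")

# Hypothetical alias table, limited to the spellings quoted in the manual.
ALIASES = {
    "Windows-1252": ["Windows-1252", "win-1252", "CP1252", "1252"],
    "ISO-8859-10": ["ISO-8859-10", "8859-10", "CP28600", "28600",
                    "IBM919", "919"],
    "UTF-8": ["UTF-8", "65001"],
}

# Map every normalized spelling to its canonical encoding name.
LOOKUP = {normalize(alias): canonical
          for canonical, names in ALIASES.items()
          for alias in names}

def resolve(name):
    """Return the canonical encoding name, or None if it is not in the table."""
    return LOOKUP.get(normalize(name))

print(resolve("win1252"), resolve("cp28600"), resolve("utf8"))
```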
When the word enc is used as the first command-line argument,
EuroConvertor displays the list of all supported encodings:
euroconv enc
The word OEM or ANSI may be used instead of an explicit
encoding name. EuroConvertor will then use the code page which the current user has selected
in the Regional Settings of their Windows.
The word auto may be used instead of an explicit input
encoding name. EuroConvertor will analyze the frequencies of letters
and guess the encoding of the input file. Only the first 1 MB is analysed
(in the DOS version only the first 48 KB of input text).
Autodetection works best on files with plain text only; it will probably fail when the text is too short or if it contains symbols, non-Latin letters, semigraphics or machine code.
Endianness specifies how numbers bigger than 255 are stored in computer memory.
Little endian stores the least significant byte first.
It is used on PCs with Intel x86 processors.
Big endian architecture stores the most significant byte first.
It is declared as the default in Unicode when not otherwise specified.
Endianness is meaningful only in UTF-16 and UTF-32 encodings;
their names may be appended with LE or BE,
optionally separated by a hyphen, by a slash or by nothing:
UTF-16/LE, utf32be etc.
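The effect of byte order is easiest to see on one character. The following Python sketch (an illustration with Python's codecs, not EuroConvertor output; the helper name hex_bytes is invented) encodes the euro sign U+20AC in both byte orders.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of text as a hexadecimal string."""
    return text.encode(encoding).hex()

euro = "\u20ac"  # EURO SIGN, codepoint 0x20AC

# Little endian: least significant byte (0xAC) comes first.
print("UTF-16LE", hex_bytes(euro, "utf-16-le"))
# Big endian: most significant byte (0x20) comes first.
print("UTF-16BE", hex_bytes(euro, "utf-16-be"))
print("UTF-32LE", hex_bytes(euro, "utf-32-le"))
print("UTF-32BE", hex_bytes(euro, "utf-32-be"))
```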
BOM, alias Byte Order Mark, is a special character which occurs at the very beginning of a Unicode-encoded text file and specifies its endianness.
EuroConvertor respects the BOM in UTF-16 and UTF-32 texts if their
endianness wasn't explicitly specified in the first argument
by the suffix /LE or /BE.
When there is no suffix and no BOM in the input text, endianness
will be autodetected.
The BOM character itself in the input text is always skipped from conversion.
It will be written to the output text if the requested output encoding
is UTF and if its presence in the output file was explicitly requested with
the /BOM modifier.
Example:
euroconv utf-8 utf-16le-bom input.txt output.txt
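For readers who want to inspect a file themselves, a BOM can be recognised by comparing the first bytes with the constants in Python's codecs module. This is a minimal sketch, not EuroConvertor's detection logic; the function name split_bom is invented here. The UTF-32 marks are tested first because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```python
import codecs

# Longer marks first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def split_bom(data):
    """Return (encoding name or None, data with the BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

print(split_bom(b"\xff\xfeA\x00"))
```

A UTF-16LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32LE by this simple test; that ambiguity is inherent in the byte patterns.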
EuroConvertor can detect
HTML entities in the input text and convert them to their corresponding characters.
This happens when the modifier /HTML is present in the input encoding
specification.
The similar modifier /HTM will convert only those entities
whose value is above 127, i.e. it will pass the entitized ASCII characters &, <, >, "
(that is &amp;, &lt;, &gt;, &quot;) without change.
By default all HTML entities are ignored (they pass the conversion without change).
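The difference between the two input modifiers can be imitated in Python with the standard html module. The sketch below is an analogue under stated assumptions, not EuroConvertor's code: the function decode_entities and its keep_ascii flag are invented, with keep_ascii=False standing in for /HTML and keep_ascii=True for /HTM.

```python
import html
import re

# Matches decimal, hexadecimal and named entities terminated by a semicolon.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def decode_entities(text, keep_ascii=False):
    """Replace HTML entities with characters.

    With keep_ascii=True, entities that decode to ASCII characters
    (such as &amp; or &lt;) are left untouched.
    """
    def repl(match):
        entity = match.group(0)
        decoded = html.unescape(entity)
        if keep_ascii and all(ord(c) < 128 for c in decoded):
            return entity
        return decoded
    return ENTITY.sub(repl, text)

sample = "Dvo&#x159;&aacute;k &amp; spol."
print(decode_entities(sample))
print(decode_entities(sample, keep_ascii=True))
```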
When the modifier /HTML is present in the output encoding specification,
all characters which are not legal in the output encoding will be converted
to HTML hexadecimal entities.
The output encoding modifier /QM specifies that illegal characters
will be replaced with a question mark ?.
When the output encoding modifier /IGN is used, illegal characters
will be ignored (omitted from the output text).
By default, or if the modifier /TRANSL is explicitly appended to the output encoding,
characters which are not available in the output encoding are
replaced with their transliteration to ASCII.
Diacritics are removed from Latin letters, letters from non-Latin alphabets
are transliterated to phonetically similar Latin letter(s), and graphic symbols
are transliterated to visually similar ASCII symbols.
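The four strategies have rough counterparts in Python, shown here only to make the differences concrete. This is not EuroConvertor's implementation: the helpers to_hex_entities and strip_diacritics are invented for this sketch, the exact entity spelling EuroConvertor emits is an assumption, and removing combining marks covers just the diacritics part of /TRANSL, not non-Latin alphabets or symbols.

```python
import unicodedata

def to_hex_entities(text, encoding):
    """Like /HTML: unencodable characters become hexadecimal entities."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")
    return "".join(out)

def strip_diacritics(text):
    """Partial analogue of /TRANSL: decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

name = "Dvořák"
print(name.encode("ascii", errors="ignore"))   # like /IGN
print(name.encode("ascii", errors="replace"))  # like /QM
print(to_hex_entities(name, "ascii"))          # like /HTML
print(strip_diacritics(name))                  # like /TRANSL (diacritics only)
```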
Occurrence of an illegal character in the input file increases the input error counter. Illegal characters are those which are not defined in the proclaimed encoding.
If the input text is specified as ASCII encoded, every byte with value between 128 and 255 is illegal.
In OEM and ANSI 8bit encodings all byte values are usually exploited. Exceptions are rare, for instance the South European encoding ISO-8859-3 declares byte values 165, 174, 190, 195, 208, 227, 240 as undefined.
The Unicode standard defines as legal only codepoints from the BMP
(Basic Multilingual Plane) in the ranges 0..0xD7FF and 0xE000..0xFFFF,
and codepoints from Supplementary Planes in the range 0x10000..0x10FFFF.
Incomplete characters (odd file size) and a missing surrogate in UTF-16
are treated as input errors. Invalid UTF-8 sequences detected as input errors
are described in Wikipedia
(missing continuation bytes, overlong sequences, illegal values).
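The legal codepoint ranges translate directly into a range test, and invalid UTF-8 can be counted with Python's strict decoder. Both functions below are illustrative sketches with invented names; Python may split one malformed sequence into several errors, so its counts need not match EuroConvertor's input error counter.

```python
def is_legal_codepoint(cp):
    """True for codepoints in the ranges listed above (surrogates excluded)."""
    return (0 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFF
            or 0x10000 <= cp <= 0x10FFFF)

def count_utf8_errors(data):
    """Count decoding failures reported by Python's strict UTF-8 decoder."""
    errors = 0
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # the rest decodes cleanly
        except UnicodeDecodeError as exc:
            errors += 1
            pos += exc.end             # resume after the offending bytes
    return errors

print(is_legal_codepoint(0xD800), is_legal_codepoint(0x1F600))
print(count_utf8_errors(b"A\xffB"), count_utf8_errors(b"\xe2\x82"))
```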
The counter of output errors increases with each character which is not encodable
in the output encoding.
There are four methods by which EuroConvertor treats nonencodable characters;
they are selected with the output encoding modifier /IGN, /QM, /HTML or /TRANSL,
see above.
Characters from supplementary Unicode planes (Asian ideograms, emojis etc.) are not defined in OEM and ANSI encodings. Text containing such characters can only be successfully converted from one UTF encoding to another.
When the conversion is finished, EuroConvertor writes a final recapitulation to the standard output console. For both input and output it displays the file name, file size, (autodetected) encoding and the number of errors.
Size of converted file is limited to 2 GB.
When both input and output errors are zero, the conversion is reversible. This means that the output file can be converted back to its original encoding and the result will be identical to the original input file.
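The idea of reversibility can be tried out with Python's codecs as a stand-in for EuroConvertor. The function is_reversible is a hypothetical helper written for this sketch; it treats any decoding or encoding failure as an error that makes the conversion irreversible.

```python
def is_reversible(data, src, dst):
    """Convert bytes from src to dst and back; True if the result is identical."""
    try:
        converted = data.decode(src).encode(dst)
    except UnicodeError:
        # An input or output error occurred, so the round trip is not possible.
        return False
    return converted.decode(dst).encode(src) == data

original = "Dvořák".encode("cp852")
print(is_reversible(original, "cp852", "utf-8"))                  # all characters encodable
print(is_reversible("Dvořák".encode("utf-8"), "utf-8", "cp1252")) # ř is missing in Windows-1252
```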
Errorlevel returned by EuroConvertor:
0 ... terminated successfully, no input or output errors were detected.
2 ... file was converted, but some characters were not encodable. Conversion is not reversible.
4 ... read or write file error.
8 ... syntax error, wrong arguments.
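A calling script can branch on these values. The sketch below shows one possible Python wrapper; the function names convert and describe are invented, the path to euroconv.exe is an assumption, and only the four errorlevels listed above are interpreted.

```python
import subprocess

# Meanings copied from the errorlevel list in this manual.
ERRORLEVELS = {
    0: "terminated successfully",
    2: "converted, but some characters were not encodable",
    4: "read or write file error",
    8: "syntax error, wrong arguments",
}

def describe(code):
    """Translate an errorlevel into a message."""
    return ERRORLEVELS.get(code, f"undocumented errorlevel {code}")

def convert(src_enc, dst_enc, src_file, dst_file, exe="euroconv.exe"):
    """Run EuroConvertor with its four arguments and return the errorlevel.

    Assumes that exe can be found on the PATH (or is a full path).
    """
    result = subprocess.run([exe, src_enc, dst_enc, src_file, dst_file])
    return result.returncode

# Example call (requires euroconv.exe to be installed):
#   print(describe(convert("IBM852", "UTF-8", "input.txt", "output.txt")))
print(describe(2))
```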