XML uses Unicode as its character set, and so most XML tools use the UTF-8 encoding to cover all the possible characters. The non-XML world, on the other hand, uses various other charsets[2], and in fact neither man nor Texinfo supports UTF-8 very well. So db2x_manxml and db2x_texixml have to transcode their output.
“Transcoding” can be separated into three components:
UTF-8 is converted to another “native” charset, such as ISO-8859-1. Many tools exist for this purpose.
docbook2X uses iconv here.
Certain Unicode characters, such as dashes and directional quotes, are escaped with special Texinfo-specific or roff-specific markup.
This part can be problematic, because there is no official mapping from Unicode to these markup-level escapes. Even if a certain character has a markup-level escape, that does not necessarily mean it should be escaped! Texinfo and roff implementations often do not have much native charset support and would use ASCII approximations for the escaped character even if that character exists in the native charset. And if the document is primarily in a non-English language, it becomes cumbersome to escape all the non-ASCII characters (for example, é in French texts).
utf8trans, a program included in docbook2X, converts some of these characters to markup-level escapes. “Character maps” for both roff and Texinfo are included in docbook2X under charmaps/. db2x_manxml and db2x_texixml will apply these character mappings automatically.
Other Unicode characters are approximated using character sequences in the native charset. This part is clearly domain-specific: it depends on how the characters to be approximated are used in the document, the language, user preference, etc.
You can make custom character maps for utf8trans to do this, if your approximations are on a character-by-character basis and not context-dependent.
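The three components above can be illustrated with a minimal Python sketch. It is not how docbook2X is implemented (docbook2X uses utf8trans and iconv); the function name transcode and both tables are hypothetical and are not the character maps shipped under charmaps/. The sketch applies the character-by-character maps before the charset conversion, on the assumption that characters missing from the native charset would otherwise be lost.

```python
# Hypothetical roff markup-level escapes (not the charmaps/ tables).
ROFF_ESCAPES = {
    "\u2014": r"\(em",  # em dash
    "\u2013": r"\(en",  # en dash
    "\u201c": r"\(lq",  # left double quotation mark
    "\u201d": r"\(rq",  # right double quotation mark
}

# Hypothetical approximations; real choices depend on the document,
# the language and user preference.
APPROXIMATIONS = {
    "\u2026": "...",   # horizontal ellipsis
    "\u2122": "(TM)",  # trade mark sign
    "\u2192": "->",    # rightwards arrow
}


def transcode(text: str, native: str = "iso-8859-1") -> bytes:
    """Escape, approximate, then convert to the native charset."""
    # Character-by-character substitution, with no context dependence.
    table = str.maketrans({**APPROXIMATIONS, **ROFF_ESCAPES})
    substituted = text.translate(table)
    # Charset conversion; anything still unrepresentable becomes "?".
    return substituted.encode(native, errors="replace")


if __name__ == "__main__":
    print(transcode("caf\u00e9 \u2014 \u201cquoted\u201d\u2026"))
```

Note that é is left alone here because it exists in ISO-8859-1, which matches the point made above: having an escape available does not mean a character should be escaped.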
[2] “charset” is used very loosely here to mean any set of byte sequences used to represent characters. Other specifications typically do not make such fine distinctions between encoding and character set as the Unicode and XML standards do. In this section the term refers specifically to non-Unicode charsets.