##Adobe File Version: 1.000
#=======================================================================
#   FTP file name:  README.TXT
#
#   Contents:       Background information on Unicode mapping tables
#                   for Mac OS text encodings
#
#   Copyright:      (c) 1995-1999 by Apple Computer, Inc., all rights
#                   reserved.
#
#   Contact:        charsets@apple.com
#
#   Changes:
#
#       b02  1999-Sep-22    Update information on Cyrillic. Update
#                           contact e-mail address.
#       n07  1998-Feb-05    Rewrite to provide additional information
#                           relevant to using the accompanying mapping
#                           tables, and to delete some extraneous
#                           information. Delete Bulgarian (no special
#                           encoding, uses standard Cyrillic), add
#                           Farsi, Devanagari, Gurmukhi, Gujarati,
#                           Celtic, Gaelic, Inuit, Tibetan.
#       n04  1995-Nov-15    Update info for Hebrew and Thai
#       n03  1995-Apr-15    First version (after fixing some typos).
#
##################

0. Preliminaries
----------------

For maximum interchangeability, this file and the accompanying Mac OS
mapping tables use only ASCII characters. They are intended to be
displayed in a monospaced font.

Apple, the Apple logo, Mac, and Macintosh are trademarks of Apple
Computer, Inc., registered in the United States and other countries.
QuickDraw and TrueType are trademarks of Apple Computer, Inc. Unicode
is a trademark of Unicode Inc. PostScript is a trademark of Adobe
Systems Inc., which may be registered in certain jurisdictions. IBM is
a registered trademark of International Business Machines Corporation.
ITC Zapf Dingbats is a registered trademark of the International
Typeface Corporation.

For the sake of brevity, throughout this document and the accompanying
tables, "Macintosh" can be used to refer to Macintosh computers and
"Unicode" can be used to refer to the Unicode standard.

Apple Computer, Inc. ("Apple") makes no warranty or representation,
either express or implied, with respect to this document and the
accompanying tables, their quality, accuracy, or fitness for a
particular purpose.
In no event will Apple be liable for direct, indirect, special,
incidental, or consequential damages resulting from any defect or
inaccuracy in this document or the accompanying tables.

1. Introduction
---------------

This document summarizes some Unicode mapping considerations that are
relevant for the accompanying mapping tables. It also provides an
overview of Mac OS encodings.

These mapping tables and character lists are subject to change. The
latest tables should be available from the following:

<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>
<ftp://dev.apple.com/devworld/Technical_Documentation/Misc._Standards/>

2. Round-trip fidelity and overview of mapping techniques
---------------------------------------------------------

For a particular set of national and international standards, Unicode
provides round-trip fidelity: Text in one of those encodings can be
mapped to Unicode and back again, yielding the original characters.
Characters which are distinct in one of these source standards have a
distinct counterpart in Unicode. Note that this counterpart might not
be a single Unicode character; as is pointed out in "The Unicode
Standard, Version 2.0" (page 2-10), "sometimes a single code value in
another standard corresponds to a sequence of code values in the
Unicode Standard, or vice versa."

However, Unicode does not attempt to provide round-trip fidelity for
most vendor standards. Nevertheless, Apple and other platform vendors
may need to provide such round-trip fidelity for their current
encodings (this can be important in file systems, for example). In
order to do this, Apple makes use of some Unicode characters in the
corporate-use zone (the upper end of the private use area).

Corporate-zone characters must be used with care. Indiscriminate use
of such characters can result in text which is not easily interchanged
with other systems, since these characters have no standard meaning
outside a particular platform.
The mappings provided here are intended to minimize the use of private
use characters, or to use them in such a way that basic text content
will not be lost if the corporate-zone characters are dropped when
text is transferred to another system.

The tables provided here have three goals, in the following order of
importance:

1. Provide 100% round-trip mapping from a Mac OS encoding to Unicode
   and back (even if the mappings here are converted to maximal
   decompositions, see below).

2. Map characters in a Mac OS encoding into the Unicode characters
   that best represent the interpretation and usage of the Mac OS
   characters.

3. When mapping text in a Mac OS encoding to Unicode using the tables,
   the resulting Unicode text should be as interchangeable as
   possible.

To satisfy these goals, the mappings use a variety of techniques.

First we attempt to achieve round-trip mappings using any standard
Unicode feature at our disposal, without resorting to corporate-zone
characters. This can include the following techniques:

- Use of all Unicode characters defined in Unicode 2.1, including
  compatibility characters.

- Mapping a single character in a Mac OS encoding to a sequence of
  standard Unicode characters, or vice versa. This requires grouping
  characters into appropriate chunks for lookup before mapping them
  (this mainly applies to sequences of Unicode characters).

- Using Unicode direction overrides to force direction attributes
  when mapping to Unicode. This requires resolution of Unicode
  character direction, and use of this information, when mapping from
  Unicode back to certain Mac OS encodings.

The requirements imposed on Unicode handling are necessary for other,
non-transcoding operations in a full Unicode implementation anyway, so
requiring them for transcoding should not impose much of a burden.

Next, if round-trip fidelity cannot be achieved using the above
techniques, we attempt to use corporate-zone characters only as
"transcoding hints" (more on this below).
These are combined with one or more standard Unicode characters to
mark them as special for transcoding, but have no other function and
can be deleted with no loss of basic text content (only of round-trip
fidelity).

Finally, if a character in a Mac OS encoding is unrelated to any
Unicode character or sequence, we may map it to a single
corporate-zone Unicode code point.

These techniques are described in more detail in the following
sections.

Some clients of these tables may have a different set of goals. For
example, some clients may prefer to avoid compatibility characters,
perhaps sacrificing round-trip fidelity if necessary. In most cases it
is fairly easy to construct other types of mappings from the mappings
given here. In particular, the mappings here have been designed so
that if they are converted to maximal decomposition mappings (by
recursive application of the canonical decompositions in the Unicode
database), the resulting mappings will still provide 100% round-trip
fidelity.

There is one more round-trip issue that should be mentioned. If a
Unicode character or sequence can be mapped at all into a particular
Mac encoding, then the reverse mapping back to Unicode should yield
the original Unicode character or sequence (except for possible
differences in direction overrides or other Unicode characters in the
"Other, Format" category). The tables here also provide this. For a
related issue, see the next section.

3. Mapping tolerance: Strict and loose
--------------------------------------

In many character sets, a single character may have multiple
semantics, either by explicit definition, ambiguous definition, or
established usage. For example, the JIS character 0x2142, or 0x8161 in
Shift-JIS, is specified in the JIS X0208 standard to have two
meanings: "double vertical line" and "parallel". Each of these
meanings corresponds to a different Unicode character: 0x2016 DOUBLE
VERTICAL LINE and 0x2225 PARALLEL TO.
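The relationship in this example can be sketched in Python. The two
dictionaries are hypothetical and built only for illustration; they
are not taken from the accompanying tables, and the choice of 0x2016
as the single target when mapping to Unicode is an assumption made for
the sketch.

```python
# Hypothetical tables for illustration only; not from the accompanying files.

# Shift-JIS -> Unicode: one source character, so only one target can be chosen.
# Assumption for this sketch: DOUBLE VERTICAL LINE is the chosen target.
SJIS_TO_UNICODE = {0x8161: 0x2016}

# Unicode -> Shift-JIS: both meanings share the same Shift-JIS character.
UNICODE_TO_SJIS = {
    0x2016: 0x8161,  # DOUBLE VERTICAL LINE
    0x2225: 0x8161,  # PARALLEL TO
}


def round_trips(code_point):
    """Return True if Unicode -> Shift-JIS -> Unicode yields code_point again."""
    sjis = UNICODE_TO_SJIS[code_point]
    return SJIS_TO_UNICODE[sjis] == code_point
```

With these tables, round_trips(0x2016) holds but round_trips(0x2225)
does not, because 0x2225 maps to 0x8161, which maps back to 0x2016.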
When mapping from Unicode to Shift-JIS, it is normally desirable to
map both of these Unicode characters to the single Shift-JIS
character. However, when mapping the Shift-JIS character to Unicode,
we can choose only one of the possible Unicode characters.

For two encodings X and Y, we can define a set of "strict" mappings
from one to the other as follows: If text in X can be mapped to Y
using the strict mappings from X to Y, then the resulting text can be
mapped back using the strict mappings from Y to X to end up with the
original text from X. Similarly, if text in Y can be mapped to X using
the strict mappings from Y to X, then the resulting text can be mapped
back using the strict mappings from X to Y to end up with the original
text from Y.

There may be several characters in one encoding that all map to a
single character in another encoding, but only one of these mappings
can be strict; the others are "loose".

The mappings given in the accompanying tables are strict mappings.
However, the Mac OS Text Encoding Converter also supports loose
mappings and fallback mappings. Some of the accompanying tables
provide suggestions about possible loose mappings.

4. Mapping a Mac encoding character to a Unicode sequence or vice versa
-----------------------------------------------------------------------

In some cases, a character in a Mac OS encoding maps to a sequence of
Unicode characters. For example, the Mac OS Japanese encoding includes
a character for the circled CJK ideograph "big". Although Unicode
encodes other circl...