comparison lisp/unicode.el @ 3767:6b2ef948e140

[xemacs-hg @ 2006-12-29 18:09:38 by aidan]

etc/ChangeLog addition:

2006-12-21  Aidan Kehoe  <kehoea@parhasard.net>

  * unicode/unicode-consortium/8859-7.TXT:
  Update the mapping to the 2003 version of ISO 8859-7.

lisp/ChangeLog addition:

2006-12-21  Aidan Kehoe  <kehoea@parhasard.net>

  * mule/cyrillic.el:
  * mule/cyrillic.el (iso-8859-5):
  * mule/cyrillic.el (cyrillic-koi8-r-encode-table):
  Add syntax, case support for Cyrillic; make some parentheses more
  Lispy.
  * mule/european.el:
  Content moved to latin.el, file deleted.
  * mule/general-late.el:
  If Unicode tables are to be loaded at dump time, do it here, not in
  loadup.el.
  * mule/greek.el:
  Add syntax, case support for Greek.
  * mule/latin.el:
  Move the content of european.el here. Change the case table
  mappings to use hexadecimal codes, to make cross reference to the
  standards easier. In all cases, take character syntax from similar
  characters in Latin-1, rather than deciding separately what syntax
  they should take. Add (incomplete) support for case with Turkish.
  Remove description of the character sets used from the language
  environments' doc strings, since now that we create variant
  language environments on the fly, such descriptions will often be
  inaccurate. Set the native-coding-system language info property
  while setting the other coding-system properties of the language.
  * mule/misc-lang.el (ipa):
  Remove the language environment. The International Phonetic
  _Alphabet_ is not a language, it's inane to have a corresponding
  language environment in XEmacs.
  * mule/mule-cmds.el (create-variant-language-environment):
  Also modify the coding-priority when creating a new language
  environment; document that.
  * mule/mule-cmds.el (get-language-environment-from-locale):
  Recognise that the 'native-coding-system language-info property can
  be a list, interpret it correctly when it is one.

2006-12-21  Aidan Kehoe  <kehoea@parhasard.net>

  * coding.el (coding-system-category):
  Use the new 'unicode-type property for finding what sort of Unicode
  coding system subtype a coding system is, instead of the
  overshadowed 'type property.
  * dumped-lisp.el (preloaded-file-list):
  mule/european.el has been removed.
  * loadup.el (really-early-error-handler):
  Unicode tables loaded at dump time are now in mule/general-late.el.
  * simple.el (count-lines):
  Add some backslashes to parentheses in docstrings to help
  fontification along.
  * simple.el (what-cursor-position):
  Wrap a line to fit in 80 characters.
  * unicode.el:
  Use the 'unicode-type property, not 'type, for setting the Unicode
  coding-system subtype.

src/ChangeLog addition:

2006-12-21  Aidan Kehoe  <kehoea@parhasard.net>

  * file-coding.c:
  Update the make-coding-system docstring to reflect unicode-type.
  * general-slots.h:
  New symbol, unicode-type, since 'type was being overridden when
  accessing a coding system's Unicode subtype.
  * intl-win32.c:
  Backslash a few parentheses, to help fontification along.
  * intl-win32.c (complex_vars_of_intl_win32):
  Use the 'unicode-type symbol, not 'type, when creating the
  Microsoft Unicode coding system.
  * unicode.c (unicode_putprop):
  * unicode.c (unicode_getprop):
  * unicode.c (unicode_print):
  Using 'type as the property name when working out what Unicode
  subtype a given coding system is was broken, since there's a
  general coding system property called 'type. Change the former to
  use 'unicode-type instead.
author aidan
date Fri, 29 Dec 2006 18:09:51 +0000
parents 4c8ad140bcec
children aa28d959af41
3766:a3dcf9d17a40 3767:6b2ef948e140
146 146
147 (make-coding-system 147 (make-coding-system
148 'utf-16 'unicode 148 'utf-16 'unicode
149 "UTF-16" 149 "UTF-16"
150 '(mnemonic "UTF-16" 150 '(mnemonic "UTF-16"
151 documentation 151 documentation
152 "UTF-16 Unicode encoding -- the standard (almost-) fixed-width 152 "UTF-16 Unicode encoding -- the standard (almost-) fixed-width
153 two-byte encoding, with surrogates. It will be fixed-width if all 153 two-byte encoding, with surrogates. It will be fixed-width if all
154 characters are in the BMP (Basic Multilingual Plane -- first 65536 154 characters are in the BMP (Basic Multilingual Plane -- first 65536
155 codepoints). Cannot represent characters with codepoints above 155 codepoints). Cannot represent characters with codepoints above
156 0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee 156 0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee
157 never to encode any characters outside this range -- all the rest are 157 never to encode any characters outside this range -- all the rest are
158 for private, corporate or internal use." 158 for private, corporate or internal use."
159 type utf-16)) 159 unicode-type utf-16))
160 160
161 (define-coding-system-alias 'utf-16-be 'utf-16) 161 (define-coding-system-alias 'utf-16-be 'utf-16)
162 162
163 (make-coding-system 163 (make-coding-system
164 'utf-16-bom 'unicode 164 'utf-16-bom 'unicode
165 "UTF-16 w/BOM" 165 "UTF-16 w/BOM"
166 '(mnemonic "UTF16-BOM" 166 '(mnemonic "UTF16-BOM"
167 documentation 167 documentation
168 "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning. 168 "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
169 The BOM is Unicode character U+FEFF -- i.e. the first two bytes are 169 The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
170 0xFE and 0xFF, respectively, or reversed in a little-endian 170 0xFE and 0xFF, respectively, or reversed in a little-endian
171 representation. It has been sanctioned by the Unicode Consortium for 171 representation. It has been sanctioned by the Unicode Consortium for
172 use at the beginning of a Unicode stream as a marker of the byte order 172 use at the beginning of a Unicode stream as a marker of the byte order
182 -- neither byte sequence is at all likely in any other standard 182 -- neither byte sequence is at all likely in any other standard
183 encoding, particularly at the beginning of a stream 183 encoding, particularly at the beginning of a stream
184 184
185 This coding system will insert a BOM at the beginning of a stream when 185 This coding system will insert a BOM at the beginning of a stream when
186 writing and strip it off when reading." 186 writing and strip it off when reading."
187 type utf-16 187 unicode-type utf-16
188 need-bom t)) 188 need-bom t))
189 189
190 (make-coding-system 190 (make-coding-system
191 'utf-16-little-endian 'unicode 191 'utf-16-little-endian 'unicode
192 "UTF-16 Little Endian" 192 "UTF-16 Little Endian"
193 '(mnemonic "UTF16-LE" 193 '(mnemonic "UTF16-LE"
194 documentation 194 documentation
195 "Little-endian version of UTF-16 Unicode encoding. 195 "Little-endian version of UTF-16 Unicode encoding.
196 See `utf-16' coding system." 196 See `utf-16' coding system."
197 type utf-16 197 unicode-type utf-16
198 little-endian t)) 198 little-endian t))
199 199
200 (define-coding-system-alias 'utf-16-le 'utf-16-little-endian) 200 (define-coding-system-alias 'utf-16-le 'utf-16-little-endian)
201 201
202 (make-coding-system 202 (make-coding-system
205 '(mnemonic "MSW-Unicode" 205 '(mnemonic "MSW-Unicode"
206 documentation 206 documentation
207 "Little-endian version of UTF-16 Unicode encoding, with byte order mark. 207 "Little-endian version of UTF-16 Unicode encoding, with byte order mark.
208 Standard encoding for representing Unicode under MS Windows. See 208 Standard encoding for representing Unicode under MS Windows. See
209 `utf-16-bom' coding system." 209 `utf-16-bom' coding system."
210 type utf-16 210 unicode-type utf-16
211 little-endian t 211 little-endian t
212 need-bom t)) 212 need-bom t))
213 213
214 (make-coding-system 214 (make-coding-system
215 'ucs-4 'unicode 215 'ucs-4 'unicode
216 "UCS-4" 216 "UCS-4"
217 '(mnemonic "UCS4" 217 '(mnemonic "UCS4"
218 documentation 218 documentation
219 "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding." 219 "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
220 type ucs-4)) 220 unicode-type ucs-4))
221 221
222 (make-coding-system 222 (make-coding-system
223 'ucs-4-little-endian 'unicode 223 'ucs-4-little-endian 'unicode
224 "UCS-4 Little Endian" 224 "UCS-4 Little Endian"
225 '(mnemonic "UCS4-LE" 225 '(mnemonic "UCS4-LE"
226 documentation 226 documentation
227 ;; #### I don't think this is permitted by ISO 10646, only Unicode. 227 ;; #### I don't think this is permitted by ISO 10646, only Unicode.
228 ;; Call it UTF-32 instead? 228 ;; Call it UTF-32 instead?
229 "Little-endian version of UCS-4 Unicode encoding. See `ucs-4' coding system." 229 "Little-endian version of UCS-4 Unicode encoding. See `ucs-4' coding system."
230 type ucs-4 230 unicode-type ucs-4
231 little-endian t)) 231 little-endian t))
232 232
233 (make-coding-system 233 (make-coding-system
234 'utf-8 'unicode 234 'utf-8 'unicode
235 "UTF-8" 235 "UTF-8"
236 '(mnemonic "UTF8" 236 '(mnemonic "UTF8"
237 documentation 237 documentation "
238 "UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding 238 UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
239 sharing the following principles with the Mule-internal encoding: 239 sharing the following principles with the Mule-internal encoding:
240 240
241 -- All ASCII characters (codepoints 0 through 127) are represented 241 -- All ASCII characters (codepoints 0 through 127) are represented
242 by themselves (i.e. using one byte, with the same value as the 242 by themselves (i.e. using one byte, with the same value as the
243 ASCII codepoint), and these bytes are disjoint from bytes 243 ASCII codepoint), and these bytes are disjoint from bytes
254 character are disjoint, so moving backwards is easy. 254 character are disjoint, so moving backwards is easy.
255 255
256 -- Given only the leading byte, you know how many following bytes 256 -- Given only the leading byte, you know how many following bytes
257 are present. 257 are present.
258 " 258 "
259 type utf-8)) 259 unicode-type utf-8))
260 260
261 (make-coding-system 261 (make-coding-system
262 'utf-8-bom 'unicode 262 'utf-8-bom 'unicode
263 "UTF-8 w/BOM" 263 "UTF-8 w/BOM"
264 '(mnemonic "MSW-UTF8" 264 '(mnemonic "MSW-UTF8"
265 documentation 265 documentation
266 "UTF-8 Unicode encoding, with byte order mark. 266 "UTF-8 Unicode encoding, with byte order mark.
267 Standard encoding for representing UTF-8 under MS Windows." 267 Standard encoding for representing UTF-8 under MS Windows."
268 type utf-8 268 unicode-type utf-8
269 little-endian t 269 little-endian t
270 need-bom t)) 270 need-bom t))
271 271
272 (defun decode-char (quote-ucs code &optional restriction) 272 (defun decode-char (quote-ucs code &optional restriction)
273 "FSF compatibility--return Mule character with Unicode codepoint CODE. 273 "FSF compatibility--return Mule character with Unicode codepoint CODE.
342 ; represented in Quoted-Printable and in Q as-is, with no further 342 ; represented in Quoted-Printable and in Q as-is, with no further
343 ; encoding. 343 ; encoding.
344 344
345 ; For more information, see Appendix A.1 of The Unicode Standard 2.0, or 345 ; For more information, see Appendix A.1 of The Unicode Standard 2.0, or
346 ; wherever it is in v3.0." 346 ; wherever it is in v3.0."
347 ; type utf-7)) 347 ; unicode-type utf-7))
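
The UTF-16 docstrings in this file mention surrogates for codepoints outside the BMP and a byte order mark of 0xFE 0xFF (or reversed for little-endian). As a rough illustration of those two points only, here is a minimal Emacs Lisp sketch. The names `sketch-utf-16-code-units' and `sketch-utf-16-bom-byte-order' are hypothetical helpers invented for this note; they are not part of unicode.el and do not reflect how XEmacs implements its coding systems internally.

```elisp
;; Illustrative sketch only; both functions are hypothetical helpers,
;; not part of unicode.el or of the XEmacs coding-system machinery.

(defun sketch-utf-16-code-units (code)
  "Return the list of 16-bit UTF-16 code units for Unicode codepoint CODE."
  (cond
   ;; UTF-16 cannot represent anything above U+10FFFF.
   ((or (< code 0) (> code #x10FFFF))
    (error "Codepoint outside the UTF-16 range: %d" code))
   ;; U+D800..U+DFFF are reserved for surrogates, not characters.
   ((and (>= code #xD800) (<= code #xDFFF))
    (error "Codepoint is a surrogate: %d" code))
   ;; BMP codepoints fit in a single 16-bit unit.
   ((< code #x10000)
    (list code))
   (t
    ;; Subtract 0x10000, then split the remaining 20 bits in two halves.
    (let ((offset (- code #x10000)))
      (list (+ #xD800 (ash offset -10))           ; high surrogate
            (+ #xDC00 (logand offset #x3FF))))))) ; low surrogate

(defun sketch-utf-16-bom-byte-order (byte0 byte1)
  "Return `big-endian', `little-endian' or nil given the first two bytes."
  (cond ((and (= byte0 #xFE) (= byte1 #xFF)) 'big-endian)
        ((and (= byte0 #xFF) (= byte1 #xFE)) 'little-endian)
        (t nil)))
```

For example, (sketch-utf-16-code-units #x1F600) evaluates to the two code units #xD83D and #xDE00, while a BMP codepoint such as #x41 comes back as a single unit, which is why the docstring calls UTF-16 "(almost-) fixed-width".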