comparison lisp/unicode.el @ 4834:b3ea9c582280

Use new cygwin_conv_path API with Cygwin 1.7 for converting names between Win32 and POSIX, UTF-8-aware, with attendant changes elsewhere
author Ben Wing <ben@xemacs.org>
date Tue, 12 Jan 2010 01:38:04 -0600
parents 980575c76541
children c0934cef10c6
4833:4dd2389173fc 4834:b3ea9c582280
331 331
332 A fixed-width four-byte encoding, characters less than #x10FFFF are not 332 A fixed-width four-byte encoding, characters less than #x10FFFF are not
333 supported. " 333 supported. "
334 unicode-type ucs-4 little-endian t)) 334 unicode-type ucs-4 little-endian t))
335 335
336 (make-coding-system 336 ;; Now defined in unicode.c.
337 'utf-8 'unicode 337
338 "UTF-8" 338 ;;(make-coding-system
339 '(mnemonic "UTF8" 339 ;; 'utf-8 'unicode
340 documentation " 340 ;; "UTF-8"
341 UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding 341 ;; '(mnemonic "UTF8"
342 sharing the following principles with the Mule-internal encoding: 342 ;; documentation "..."
343 343 ;; unicode-type utf-8))
344 -- All ASCII characters (codepoints 0 through 127) are represented
345 by themselves (i.e. using one byte, with the same value as the
346 ASCII codepoint), and these bytes are disjoint from bytes
347 representing non-ASCII characters.
348
349 This means that any 8-bit clean application can safely process
350 UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
351 '/' byte is always a slash character, never the second byte of
352 some other character, as with Big5, so a pathname encoded in
353 UTF-8 can safely be split up into components and reassembled
354 again using standard ASCII processes).
355
356 -- Leading bytes and non-leading bytes in the encoding of a
357 character are disjoint, so moving backwards is easy.
358
359 -- Given only the leading byte, you know how many following bytes
360 are present.
361 "
362 unicode-type utf-8))
363 344
364 (make-coding-system 345 (make-coding-system
365 'utf-8-bom 'unicode 346 'utf-8-bom 'unicode
366 "UTF-8 w/BOM" 347 "UTF-8 w/BOM"
367 '(mnemonic "MSW-UTF8" 348 '(mnemonic "MSW-UTF8"