Next: Disabling Multibyte, Up: Non-ASCII Characters [Contents][Index]
Emacs buffers and strings support a large repertoire of characters from many different scripts, allowing users to type and display text in almost any known written language.
To support this multitude of characters and scripts, Emacs closely
follows the Unicode Standard. The Unicode Standard assigns a
unique number, called a codepoint, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
codespace, is 0..#x10FFFF
(in hexadecimal notation),
inclusive. Emacs extends this range with codepoints in the range
#x110000..#x3FFFFF
, which it uses for representing characters
that are not unified with Unicode and raw 8-bit bytes that
cannot be interpreted as characters. Thus, a character codepoint in
Emacs is a 22-bit integer.
To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint15. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte.
Outside Emacs, characters can be represented in many different encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts between these external encodings and its internal representation, as appropriate, when it reads text into a buffer or a string, or when it writes text to a disk file or passes it to some other process.
Occasionally, Emacs needs to hold and manipulate encoded text or binary non-text data in its buffers or strings. For example, when Emacs visits a file, it first reads the file’s text verbatim into a buffer, and only then converts it to the internal representation. Before the conversion, the buffer holds encoded text.
Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes. We call buffers and strings
that hold encoded text unibyte buffers and strings, because
Emacs treats them as a sequence of individual bytes. Usually, Emacs
displays unibyte buffers and strings as octal codes such as
\237
. We recommend that you never use unibyte buffers and
strings except for manipulating encoded text or binary non-text data.
In a buffer, the buffer-local value of the variable
enable-multibyte-characters
specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.
This variable specifies the current buffer’s text representation.
If it is non-nil
, the buffer contains multibyte text; otherwise,
it contains unibyte encoded text or binary non-text data.
You cannot set this variable directly; instead, use the function
set-buffer-multibyte
to change a buffer’s representation.
Buffer positions are measured in character units. This function
returns the byte-position corresponding to buffer position
position in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If position is out of
range, the value is nil
.
Return the buffer position, in character units, corresponding to given
byte-position in the current buffer. If byte-position is
out of range, the value is nil
. In a multibyte buffer, an
arbitrary value of byte-position can be not at character
boundary, but inside a multibyte sequence representing a single
character; in this case, this function returns the buffer position of
the character whose multibyte sequence includes byte-position.
In other words, the value does not change for all byte positions that
belong to the same character.
Return t
if string is a multibyte string, nil
otherwise. This function also returns nil
if string is
some object other than a string.
This function returns the number of bytes in string.
If string is a multibyte string, this can be greater than
(length string)
.
This function concatenates all its argument bytes and makes the result a unibyte string.
This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode.
Next: Disabling Multibyte, Up: Non-ASCII Characters [Contents][Index]