0001: @c -*- coding: utf-8 -*-
0002: @c This is part of the Emacs manual.
0003: @c Copyright (C) 1997, 1999-2019 Free Software Foundation, Inc.
0004: @c See file emacs.texi for copying conditions.
0005: @node International
0006: @chapter International Character Set Support
0007: @c This node is referenced in the tutorial.  When renaming or deleting
0008: @c it, the tutorial needs to be adjusted.  (TUTORIAL.de)
0009: @cindex international scripts
0010: @cindex multibyte characters
0011: @cindex encoding of characters
0012: 
0013: @cindex Han
0014: @cindex Hindi
0015: @cindex Hangul
0016:   Emacs supports a wide variety of international character sets,
0017: including European and Vietnamese variants of the Latin alphabet, as
0018: well as Arabic scripts, Brahmic scripts (for languages such as
0019: Bengali, Hindi, and Thai), Cyrillic, Ethiopic, Georgian, Greek, Han
0020: (for Chinese and Japanese), Hangul (for Korean), Hebrew and IPA@.
0021: Emacs also supports various encodings of these characters that are used by
0022: other internationalized software, such as word processors and mailers.
0023: 
0024:   Emacs allows editing text with international characters by supporting
0025: all the related activities:
0026: 
0027: @itemize @bullet
0028: @item
0029: You can visit files with non-@acronym{ASCII} characters, save non-@acronym{ASCII} text, and
0030: pass non-@acronym{ASCII} text between Emacs and programs it invokes (such as
0031: compilers, spell-checkers, and mailers).  Setting your language
0032: environment (@pxref{Language Environments}) takes care of setting up the
0033: coding systems and other options for a specific language or culture.
0034: Alternatively, you can specify how Emacs should encode or decode text
0035: for each command; see @ref{Text Coding}.
0036: 
0037: @item
0038: You can display non-@acronym{ASCII} characters encoded by the various
0039: scripts.  This works by using appropriate fonts on graphics displays
0040: (@pxref{Defining Fontsets}), and by sending special codes to text
0041: displays (@pxref{Terminal Coding}).  If some characters are displayed
0042: incorrectly, refer to @ref{Undisplayable Characters}, which describes
0043: possible problems and explains how to solve them.
0044: 
0045: @item
0046: Characters from scripts whose natural ordering of text is from right
0047: to left are reordered for display (@pxref{Bidirectional Editing}).
0048: These scripts include Arabic, Hebrew, Syriac, Thaana, and a few
0049: others.
0050: 
0051: @item
0052: You can insert non-@acronym{ASCII} characters or search for them.  To do that,
0053: you can specify an input method (@pxref{Select Input Method}) suitable
0054: for your language, or use the default input method set up when you choose
0055: your language environment.  If
0056: your keyboard can produce non-@acronym{ASCII} characters, you can select an
0057: appropriate keyboard coding system (@pxref{Terminal Coding}), and Emacs
will accept those characters.  Latin-1 characters can also be input by
using the @kbd{C-x 8} prefix; see @ref{Unibyte Mode}.
0060: 
0061: With the X Window System, your locale should be set to an appropriate
0062: value to make sure Emacs interprets keyboard input correctly; see
0063: @ref{Language Environments, locales}.
0064: @end itemize
0065: 
0066:   The rest of this chapter describes these issues in detail.
0067: 
0068: @menu
0069: * International Chars::     Basic concepts of multibyte characters.
0070: * Language Environments::   Setting things up for the language you use.
0071: * Input Methods::           Entering text characters not on your keyboard.
0072: * Select Input Method::     Specifying your choice of input methods.
0073: * Coding Systems::          Character set conversion when you read and
0074:                               write files, and so on.
0075: * Recognize Coding::        How Emacs figures out which conversion to use.
0076: * Specify Coding::          Specifying a file's coding system explicitly.
0077: * Output Coding::           Choosing coding systems for output.
0078: * Text Coding::             Choosing conversion to use for file text.
0079: * Communication Coding::    Coding systems for interprocess communication.
0080: * File Name Coding::        Coding systems for file @emph{names}.
0081: * Terminal Coding::         Specifying coding systems for converting
0082:                               terminal input and output.
0083: * Fontsets::                Fontsets are collections of fonts
0084:                               that cover the whole spectrum of characters.
0085: * Defining Fontsets::       Defining a new fontset.
0086: * Modifying Fontsets::      Modifying an existing fontset.
0087: * Undisplayable Characters:: When characters don't display.
0088: * Unibyte Mode::            You can pick one European character set
0089:                               to use without multibyte characters.
0090: * Charsets::                How Emacs groups its internal character codes.
0091: * Bidirectional Editing::   Support for right-to-left scripts.
0092: @end menu
0093: 
0094: @node International Chars
0095: @section Introduction to International Character Sets
0096: 
0097:   The users of international character sets and scripts have
0098: established many more-or-less standard coding systems for storing
0099: files.  These coding systems are typically @dfn{multibyte}, meaning
0100: that sequences of two or more bytes are used to represent individual
0101: non-@acronym{ASCII} characters.
0102: 
0103: @cindex Unicode
0104:   Internally, Emacs uses its own multibyte character encoding, which
0105: is a superset of the @dfn{Unicode} standard.  This internal encoding
0106: allows characters from almost every known script to be intermixed in a
0107: single buffer or string.  Emacs translates between the multibyte
0108: character encoding and various other coding systems when reading and
0109: writing files, and when exchanging data with subprocesses.
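
  As a rough illustration of this translation, the following sketch
encodes a one-character string into UTF-8 bytes and decodes the bytes
back.  The helper name @code{my-utf8-bytes} is hypothetical and exists
only for this example; @code{encode-coding-string} and
@code{decode-coding-string} are standard Lisp functions.

@lisp
(defun my-utf8-bytes (string)
  "Return the list of bytes that encode STRING in UTF-8."
  (string-to-list (encode-coding-string string 'utf-8)))

;; U+00EA (e with circumflex) occupies two bytes in UTF-8.
(my-utf8-bytes (string #xEA))
;; => (195 170)

;; Decoding the same two bytes gives back the original character.
(equal (decode-coding-string (unibyte-string #xC3 #xAA) 'utf-8)
       (string #xEA))
;; => t
@end lisp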
0110: 
0111: @kindex C-h h
0112: @findex view-hello-file
0113: @cindex undisplayable characters
0114: @cindex @samp{?} in display
0115:   The command @kbd{C-h h} (@code{view-hello-file}) displays the file
0116: @file{etc/HELLO}, which illustrates various scripts by showing
0117: how to say ``hello'' in many languages.  If some characters can't be
0118: displayed on your terminal, they appear as @samp{?} or as hollow boxes
0119: (@pxref{Undisplayable Characters}).
0120: 
0121:   Keyboards, even in the countries where these character sets are
0122: used, generally don't have keys for all the characters in them.  You
0123: can insert characters that your keyboard does not support, using
0124: @kbd{C-x 8 @key{RET}} (@code{insert-char}).  @xref{Inserting Text}.
0125: Shorthands are available for some common characters; for example, you
0126: can insert a left single quotation mark @t{‘} by typing @kbd{C-x 8
0127: [}, or in Electric Quote mode, usually by simply typing @kbd{`}.
0128: @xref{Quotation Marks}.  Emacs also supports
0129: various @dfn{input methods}, typically one for each script or
0130: language, which make it easier to type characters in the script.
0131: @xref{Input Methods}.
0132: 
0133: @kindex C-x RET
0134:   The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
0135: to multibyte characters, coding systems, and input methods.
0136: 
0137: @kindex C-x =@r{, and international characters}
0138: @findex what-cursor-position@r{, and international characters}
0139:   The command @kbd{C-x =} (@code{what-cursor-position}) shows
0140: information about the character at point.  In addition to the
0141: character position, which was described in @ref{Position Info}, this
0142: command displays how the character is encoded.  For instance, it
0143: displays the following line in the echo area for the character
0144: @samp{c}:
0145: 
0146: @smallexample
0147: Char: c (99, #o143, #x63) point=28062 of 36168 (78%) column=53
0148: @end smallexample
0149: 
0150:   The four values after @samp{Char:} describe the character that
0151: follows point, first by showing it and then by giving its character
0152: code in decimal, octal and hex.  For a non-@acronym{ASCII} multibyte
0153: character, these are followed by @samp{file} and the character's
0154: representation, in hex, in the buffer's coding system, if that coding
0155: system encodes the character safely and with a single byte
0156: (@pxref{Coding Systems}).  If the character's encoding is longer than
0157: one byte, Emacs shows @samp{file ...}.
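
  The first part of that message can be approximated in Lisp.  This is
only a sketch of the formatting, not the code that
@code{what-cursor-position} itself uses, and the function name
@code{my-char-summary} is made up for the example.

@lisp
(defun my-char-summary (char)
  "Format CHAR with its code in decimal, octal and hex."
  (format "Char: %c (%d, #o%o, #x%x)" char char char char))

(my-char-summary ?c)
;; => "Char: c (99, #o143, #x63)"
@end lisp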
0158: 
0159: @cindex eight-bit character set
0160: @cindex raw bytes
0161:   On rare occasions, Emacs encounters @dfn{raw bytes}: single bytes
0162: whose values are in the range 128 (0200 octal) through 255 (0377
octal), which Emacs cannot interpret as part of a known encoding of
some non-@acronym{ASCII} character.  Such raw bytes are treated as if they
0165: belonged to a special character set @code{eight-bit}; Emacs displays
0166: them as escaped octal codes (this can be customized; @pxref{Display
0167: Custom}).  In this case, @kbd{C-x =} shows @samp{raw-byte} instead of
0168: @samp{file}.  In addition, @kbd{C-x =} shows the character codes of
0169: raw bytes as if they were in the range @code{#x3FFF80..#x3FFFFF},
0170: which is where Emacs maps them to distinguish them from Unicode
0171: characters in the range @code{#x0080..#x00FF}.
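
  The mapping of raw bytes can be observed from Lisp.  The sketch
below assumes that @code{unibyte-char-to-multibyte} is an acceptable
way to obtain the internal code of a raw byte; the wrapper
@code{my-raw-byte-char} is a hypothetical name.

@lisp
(defun my-raw-byte-char (byte)
  "Return the character code Emacs uses internally for raw BYTE."
  (unibyte-char-to-multibyte byte))

(format "#x%X" (my-raw-byte-char #x80))
;; => "#x3FFF80"
(format "#x%X" (my-raw-byte-char #xFF))
;; => "#x3FFFFF"

;; The resulting characters belong to the eight-bit character set.
(char-charset (my-raw-byte-char #xFF))
;; => eight-bit
@end lisp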
0172: 
0173: @cindex character set of character at point
0174: @cindex font of character at point
0175: @cindex text properties at point
0176: @cindex face at point
0177:   With a prefix argument (@kbd{C-u C-x =}), this command displays a
0178: detailed description of the character in a window:
0179: 
0180: @itemize @bullet
0181: @item
0182: The character set name, and the codes that identify the character
0183: within that character set; @acronym{ASCII} characters are identified
0184: as belonging to the @code{ascii} character set.
0185: 
0186: @item
0187: The character's script, syntax and categories.
0188: 
0189: @item
0190: What keys to type to input the character in the current input method
0191: (if it supports the character).
0192: 
0193: @item
0194: The character's encodings, both internally in the buffer, and externally
0195: if you were to save the file.
0196: 
0197: @item
0198: If you are running Emacs on a graphical display, the font name and
0199: glyph code for the character.  If you are running Emacs on a text
0200: terminal, the code(s) sent to the terminal.
0201: 
0202: @item
0203: The character's text properties (@pxref{Text Properties,,,
0204: elisp, the Emacs Lisp Reference Manual}), including any non-default
0205: faces used to display the character, and any overlays containing it
0206: (@pxref{Overlays,,, elisp, the same manual}).
0207: @end itemize
0208: 
0209:   Here's an example, with some lines folded to fit into this manual:
0210: 
0211: @smallexample
0212:              position: 1 of 1 (0%), column: 0
0213:             character: ê (displayed as ê) (codepoint 234, #o352, #xea)
0214:     preferred charset: unicode (Unicode (ISO10646))
0215: code point in charset: 0xEA
0216:                script: latin
0217:                syntax: w        which means: word
0218:              category: .:Base, L:Left-to-right (strong), c:Chinese,
0219:                        j:Japanese, l:Latin, v:Viet
0220:              to input: type "C-x 8 RET ea" or
0221:                        "C-x 8 RET LATIN SMALL LETTER E WITH CIRCUMFLEX"
0222:           buffer code: #xC3 #xAA
0223:             file code: #xC3 #xAA (encoded by coding system utf-8-unix)
0224:               display: by this font (glyph code)
0225:     xft:-PfEd-DejaVu Sans Mono-normal-normal-
0226:         normal-*-15-*-*-*-m-0-iso10646-1 (#xAC)
0227: 
0228: Character code properties: customize what to show
0229:   name: LATIN SMALL LETTER E WITH CIRCUMFLEX
0230:   old-name: LATIN SMALL LETTER E CIRCUMFLEX
0231:   general-category: Ll (Letter, Lowercase)
0232:   decomposition: (101 770) ('e' '^')
0233: @end smallexample
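
  Some of the values shown in this description are also available to
Lisp programs.  As a small sketch, the following expressions query the
same character, code point @code{#xEA}, using
@code{get-char-code-property} and @code{char-script-table}:

@lisp
(get-char-code-property #xEA 'name)
;; => "LATIN SMALL LETTER E WITH CIRCUMFLEX"

(get-char-code-property #xEA 'general-category)
;; => Ll

(get-char-code-property #xEA 'decomposition)
;; => (101 770)

(aref char-script-table #xEA)
;; => latin
@end lisp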
0234: 
0235: @node Language Environments
0236: @section Language Environments
0237: @cindex language environments
0238: 
  All supported character sets are available in Emacs buffers whenever
0240: multibyte characters are enabled; there is no need to select a
0241: particular language in order to display its characters.
0242: However, it is important to select a @dfn{language
0243: environment} in order to set various defaults.  Roughly speaking, the
0244: language environment represents a choice of preferred script rather
0245: than a choice of language.
0246: 
0247:   The language environment controls which coding systems to recognize
0248: when reading text (@pxref{Recognize Coding}).  This applies to files,
0249: incoming mail, and any other text you read into Emacs.  It may also
0250: specify the default coding system to use when you create a file.  Each
0251: language environment also specifies a default input method.
0252: 
0253: @findex set-language-environment
0254: @vindex current-language-environment
0255:   To select a language environment, customize
0256: @code{current-language-environment} or use the command @kbd{M-x
0257: set-language-environment}.  It makes no difference which buffer is
0258: current when you use this command, because the effects apply globally
0259: to the Emacs session.  See the variable @code{language-info-alist} for
0260: the list of supported language environments, and use the command
0261: @kbd{C-h L @var{lang-env} @key{RET}} (@code{describe-language-environment})
0262: for more information about the language environment @var{lang-env}.
0263: Supported language environments include:
0264: 
0265: @c @cindex entries below are split between portions of the list to
0266: @c make them more accurate, i.e., land on the line that mentions the
0267: @c language.  However, makeinfo 4.x doesn't fill inside @quotation
0268: @c lines that follow a @cindex entry and whose text has no whitespace.
0269: @c To work around, we group the language environments together, so
0270: @c that the blank that separates them triggers refill.
0271: @quotation
0272: @cindex ASCII (language environment)
0273: @cindex Arabic
0274: ASCII, Arabic,
0275: @cindex Belarusian
0276: @cindex Bengali
0277: Belarusian, Bengali,
0278: @cindex Brazilian Portuguese
0279: @cindex Bulgarian
0280: Brazilian Portuguese, Bulgarian,
0281: @cindex Burmese
0282: @cindex Cham
0283: Burmese, Cham,
0284: @cindex Chinese
0285: Chinese-BIG5, Chinese-CNS, Chinese-EUC-TW, Chinese-GB,
0286: Chinese-GB18030, Chinese-GBK,
0287: @cindex Croatian
0288: @cindex Cyrillic
0289: Croatian, Cyrillic-ALT, Cyrillic-ISO, Cyrillic-KOI8,
0290: @cindex Czech
0291: @cindex Devanagari
0292: Czech, Devanagari,
0293: @cindex Dutch
0294: @cindex English
0295: Dutch, English,
0296: @cindex Esperanto
0297: @cindex Ethiopic
0298: Esperanto, Ethiopic,
0299: @cindex French
0300: @cindex Georgian
0301: French, Georgian,
0302: @cindex German
0303: @cindex Greek
0304: @cindex Gujarati
0305: German, Greek, Gujarati,
0306: @cindex Hebrew
0307: @cindex IPA
0308: Hebrew, IPA,
0309: @cindex Italian
0310: Italian,
0311: @cindex Japanese
0312: @cindex Kannada
0313: Japanese, Kannada,
0314: @cindex Khmer
0315: @cindex Korean
0316: @cindex Lao
0317: Khmer, Korean, Lao,
0318: @cindex Latin
0319: Latin-1, Latin-2, Latin-3, Latin-4, Latin-5, Latin-6, Latin-7,
0320: Latin-8, Latin-9,
0321: @cindex Latvian
0322: @cindex Lithuanian
0323: Latvian, Lithuanian,
0324: @cindex Malayalam
0325: @cindex Oriya
0326: Malayalam, Oriya,
0327: @cindex Persian
0328: @cindex Polish
0329: Persian, Polish,
0330: @cindex Punjabi
0331: @cindex Romanian
0332: Punjabi, Romanian,
0333: @cindex Russian
0334: @cindex Sinhala
0335: Russian, Sinhala,
0336: @cindex Slovak
0337: @cindex Slovenian
0338: @cindex Spanish
0339: Slovak, Slovenian, Spanish,
0340: @cindex Swedish
0341: @cindex TaiViet
0342: Swedish, TaiViet,
0343: @cindex Tajik
0344: @cindex Tamil
0345: Tajik, Tamil,
0346: @cindex Telugu
0347: @cindex Thai
0348: Telugu, Thai,
0349: @cindex Tibetan
0350: @cindex Turkish
0351: Tibetan, Turkish,
0352: @cindex UTF-8
0353: @cindex Ukrainian
0354: UTF-8, Ukrainian,
0355: @cindex Vietnamese
0356: @cindex Welsh
0357: Vietnamese, Welsh,
0358: @cindex Windows-1255
0359: and Windows-1255.
0360: @end quotation
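
  The names available in your own Emacs session may differ from this
list.  As a sketch, and assuming that each element of
@code{language-info-alist} begins with the name of a language
environment, you can collect the names like this (the function name
is hypothetical):

@lisp
(defun my-language-environment-names ()
  "Return the sorted names of all known language environments."
  (sort (mapcar #'car language-info-alist) #'string<))

;; Check whether a particular environment is known.
(member "UTF-8" (my-language-environment-names))
@end lisp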
0361: 
0362:   To display the script(s) used by your language environment on a
0363: graphical display, you need to have suitable fonts.
0364: @xref{Fontsets}, for more details about setting up your fonts.
0365: 
0366: @findex set-locale-environment
0367: @vindex locale-language-names
0368: @vindex locale-charset-language-names
0369: @cindex locales
0370:   Some operating systems let you specify the character-set locale you
0371: are using by setting the locale environment variables @env{LC_ALL},
0372: @env{LC_CTYPE}, or @env{LANG}.  (If more than one of these is
0373: set, the first one that is nonempty specifies your locale for this
0374: purpose.)  During startup, Emacs looks up your character-set locale's
0375: name in the system locale alias table, matches its canonical name
0376: against entries in the value of the variables
0377: @code{locale-charset-language-names} and @code{locale-language-names}
0378: (the former overrides the latter),
0379: and selects the corresponding language environment if a match is found.
0380: It also adjusts the display
0381: table and terminal coding system, the locale coding system, the
0382: preferred coding system as needed for the locale, and---last but not
0383: least---the way Emacs decodes non-@acronym{ASCII} characters sent by your keyboard.
0384: 
0385: @c This seems unlikely, doesn't it?
0386:   If you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
0387: environment variables while running Emacs (by using @kbd{M-x setenv}),
0388: you may want to invoke the @code{set-locale-environment}
0389: command afterwards to readjust the language environment from the new
0390: locale.
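
  The same two steps can be sketched in Lisp.  The function name
@code{my-use-locale} and the locale name in the usage line are only
assumptions for illustration.  Note that if @env{LC_ALL} or
@env{LC_CTYPE} is set and nonempty, it takes precedence over
@env{LANG}, as described above.

@lisp
(defun my-use-locale (locale)
  "Set LANG to LOCALE and readjust the language environment.
Return the name of the resulting language environment."
  (setenv "LANG" locale)
  (set-locale-environment)
  current-language-environment)

;; Usage; which environment results depends on the locale tables.
(my-use-locale "de_DE.UTF-8")
@end lisp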
0391: 
0392: @vindex locale-preferred-coding-systems
0393:   The @code{set-locale-environment} function normally uses the preferred
0394: coding system established by the language environment to decode system
0395: messages.  But if your locale matches an entry in the variable
0396: @code{locale-preferred-coding-systems}, Emacs uses the corresponding
0397: coding system instead.  For example, if the locale @samp{ja_JP.PCK}
0398: matches @code{japanese-shift-jis} in
0399: @code{locale-preferred-coding-systems}, Emacs uses that encoding even
0400: though it might normally use @code{utf-8}.
0401: 
0402:   You can override the language environment chosen at startup with
0403: explicit use of the command @code{set-language-environment}, or with
0404: customization of @code{current-language-environment} in your init
0405: file.
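
  For instance, a minimal init file fragment that selects one
language environment regardless of the locale might look like this
(the choice of @samp{UTF-8} is only an example):

@lisp
;; Select the language environment explicitly.
(set-language-environment "UTF-8")

current-language-environment
;; => "UTF-8"
@end lisp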
0406: 
0407: @kindex C-h L
0408: @findex describe-language-environment
0409:   To display information about the effects of a certain language
0410: environment @var{lang-env}, use the command @kbd{C-h L @var{lang-env}
0411: @key{RET}} (@code{describe-language-environment}).  This tells you
0412: which languages this language environment is useful for, and lists the
0413: character sets, coding systems, and input methods that go with it.  It
0414: also shows some sample text to illustrate scripts used in this
0415: language environment.  If you give an empty input for @var{lang-env},
0416: this command describes the chosen language environment.
0417: 
0418: @vindex set-language-environment-hook
0419:   You can customize any language environment with the normal hook
0420: @code{set-language-environment-hook}.  The command
0421: @code{set-language-environment} runs that hook after setting up the new
0422: language environment.  The hook functions can test for a specific
0423: language environment by checking the variable
0424: @code{current-language-environment}.  This hook is where you should
0425: put non-default settings for specific language environments, such as
0426: coding systems for keyboard input and terminal output, the default
0427: input method, etc.
0428: 
0429: @vindex exit-language-environment-hook
0430:   Before it starts to set up the new language environment,
0431: @code{set-language-environment} first runs the hook
0432: @code{exit-language-environment-hook}.  This hook is useful for undoing
0433: customizations that were made with @code{set-language-environment-hook}.
0434: For instance, if you set up a special key binding in a specific language
0435: environment using @code{set-language-environment-hook}, you should set
0436: up @code{exit-language-environment-hook} to restore the normal binding
0437: for that key.
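
  The following sketch shows one way to pair the two hooks.  All
names beginning with @code{my-} are hypothetical, and the choice of
the @key{F9} key and of the Greek language environment is arbitrary.
The sketch relies on @code{current-language-environment} still naming
the old environment while @code{exit-language-environment-hook} runs.

@lisp
(defvar my-saved-f9-binding nil
  "Global binding of F9 before the Greek setup changed it.")

(defun my-greek-enter ()
  "Bind F9 to `toggle-input-method' in the Greek environment."
  (when (equal current-language-environment "Greek")
    (setq my-saved-f9-binding (lookup-key global-map [f9]))
    (global-set-key [f9] #'toggle-input-method)))

(defun my-greek-exit ()
  "Restore the previous F9 binding when leaving Greek."
  (when (equal current-language-environment "Greek")
    (global-set-key [f9] my-saved-f9-binding)))

(add-hook 'set-language-environment-hook #'my-greek-enter)
(add-hook 'exit-language-environment-hook #'my-greek-exit)
@end lisp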
0438: 
0439: @node Input Methods
0440: @section Input Methods
0441: 
0442: @cindex input methods
0443:   An @dfn{input method} is a kind of character conversion designed
0444: specifically for interactive input.  In Emacs, typically each language
0445: has its own input method; sometimes several languages that use the same
0446: characters can share one input method.  A few languages support several
0447: input methods.
0448: 
0449:   The simplest kind of input method works by mapping @acronym{ASCII} letters
0450: into another alphabet; this allows you to use one other alphabet
0451: instead of @acronym{ASCII}.  The Greek and Russian input methods
0452: work this way.
0453: 
0454:   A more powerful technique is composition: converting sequences of
0455: characters into one letter.  Many European input methods use composition
0456: to produce a single non-@acronym{ASCII} letter from a sequence that consists of a
0457: letter followed by accent characters (or vice versa).  For example, some
0458: methods convert the sequence @kbd{o ^} into a single accented letter.
0459: These input methods have no special commands of their own; all they do
0460: is compose sequences of printing characters.
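
  To make mapping and composition concrete, here is a toy input
method defined with the Quail library.  It is a sketch only: the
method name @samp{my-toy}, its mode line title, and its three rules are
invented for this example and are not part of Emacs.

@lisp
(require 'quail)

;; Register a hypothetical input method named "my-toy".
(quail-define-package
 "my-toy" "UTF-8" "T>" t
 "Toy input method: maps a and b to Greek, composes o ^.")

(quail-define-rules
 ("a" ?\u03B1)    ; mapping: a gives GREEK SMALL LETTER ALPHA
 ("b" ?\u03B2)    ; mapping: b gives GREEK SMALL LETTER BETA
 ("o^" ?\u00F4))  ; composition: o ^ gives o with circumflex
@end lisp

@noindent
After evaluating these forms, the method can be selected by name
like any other input method.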
0461: 
0462:   The input methods for syllabic scripts typically use mapping followed
0463: by composition.  The input methods for Thai and Korean work this way.
0464: First, letters are mapped into symbols for particular sounds or tone
0465: marks; then, sequences of these that make up a whole syllable are
0466: mapped into one syllable sign.
0467: 
0468:   Chinese and Japanese require more complex methods.  In Chinese input
0469: methods, first you enter the phonetic spelling of a Chinese word (in
0470: input method @code{chinese-py}, among others), or a sequence of
0471: portions of the character (input methods @code{chinese-4corner} and
0472: @code{chinese-sw}, and others).  One input sequence typically
0473: corresponds to many possible Chinese characters.  You select the one
0474: you mean using keys such as @kbd{C-f}, @kbd{C-b}, @kbd{C-n},
0475: @kbd{C-p} (or the arrow keys), and digits, which have special meanings
0476: in this situation.
0477: 
0478:   The possible characters are conceptually arranged in several rows,
0479: with each row holding up to 10 alternatives.  Normally, Emacs displays
0480: just one row at a time, in the echo area; @code{(@var{i}/@var{j})}
0481: appears at the beginning, to indicate that this is the @var{i}th row
0482: out of a total of @var{j} rows.  Type @kbd{C-n} or @kbd{C-p} to
0483: display the next row or the previous row.
0484: 
  Type @kbd{C-f} and @kbd{C-b} to move forward and backward among
0486: the alternatives in the current row.  As you do this, Emacs highlights
0487: the current alternative with a special color; type @kbd{C-@key{SPC}}
0488: to select the current alternative and use it as input.  The
0489: alternatives in the row are also numbered; the number appears before
0490: the alternative.  Typing a number selects the associated alternative
0491: of the current row and uses it as input.
0492: 
0493:   @key{TAB} in these Chinese input methods displays a buffer showing
0494: all the possible characters at once; then clicking @kbd{mouse-2} on
0495: one of them selects that alternative.  The keys @kbd{C-f}, @kbd{C-b},
0496: @kbd{C-n}, @kbd{C-p}, and digits continue to work as usual, but they
0497: do the highlighting in the buffer showing the possible characters,
0498: rather than in the echo area.
0499: 
0500:   In Japanese input methods, first you input a whole word using
0501: phonetic spelling; then, after the word is in the buffer, Emacs
0502: converts it into one or more characters using a large dictionary.  One
0503: phonetic spelling corresponds to a number of different Japanese words;
0504: to select one of them, use @kbd{C-n} and @kbd{C-p} to cycle through
0505: the alternatives.
0506: 
0507:   Sometimes it is useful to cut off input method processing so that the
0508: characters you have just entered will not combine with subsequent
0509: characters.  For example, in input method @code{latin-1-postfix}, the
0510: sequence @kbd{o ^} combines to form an @samp{o} with an accent.  What if
0511: you want to enter them as separate characters?
0512: 
0513:   One way is to type the accent twice; this is a special feature for
0514: entering the separate letter and accent.  For example, @kbd{o ^ ^} gives
0515: you the two characters @samp{o^}.  Another way is to type another letter
0516: after the @kbd{o}---something that won't combine with that---and
0517: immediately delete it.  For example, you could type @kbd{o o @key{DEL}
0518: ^} to get separate @samp{o} and @samp{^}.  Another method, more
0519: general but not quite as easy to type, is to use @kbd{C-\ C-\} between
0520: two characters to stop them from combining.  This is the command
0521: @kbd{C-\} (@code{toggle-input-method}) used twice.
0522: @ifnottex
0523: @xref{Select Input Method}.
0524: @end ifnottex
0525: 
0526: @cindex incremental search, input method interference
0527:   @kbd{C-\ C-\} is especially useful inside an incremental search,
0528: because it stops waiting for more characters to combine, and starts
0529: searching for what you have already entered.
0530: 
0531:   To find out how to input the character after point using the current
0532: input method, type @kbd{C-u C-x =}.  @xref{Position Info}.
0533: 
0534: @c TODO: document complex-only/default/t of
0535: @c @code{input-method-verbose-flag}
0536: @vindex input-method-verbose-flag
0537: @vindex input-method-highlight-flag
0538:   The variables @code{input-method-highlight-flag} and
0539: @code{input-method-verbose-flag} control how input methods explain
0540: what is happening.  If @code{input-method-highlight-flag} is
0541: non-@code{nil}, the partial sequence is highlighted in the buffer (for
0542: most input methods---some disable this feature).  If
0543: @code{input-method-verbose-flag} is non-@code{nil}, the list of
0544: possible characters to type next is displayed in the echo area (but
0545: not when you are in the minibuffer).
0546: 
0547: @vindex quail-activate-hook
0548: @findex quail-translation-keymap
0549:   You can modify how an input method works by making your changes in a
0550: function that you add to the hook variable @code{quail-activate-hook}.
0551: @xref{Hooks}.  For example, you can redefine some of the input
0552: method's keys by defining key bindings in the keymap returned by the
0553: function @code{quail-translation-keymap}, using @code{define-key}.
0554: @xref{Init Rebinding}.
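
  A minimal sketch of such a hook function follows.  The function
name and the choice of @kbd{C-j} are assumptions made for the example;
@code{quail-select-current} is the Quail command that accepts the
currently selected translation.

@lisp
(require 'quail)

(defun my-quail-extra-keys ()
  "Let C-j accept the current translation (illustrative choice)."
  (define-key (quail-translation-keymap) (kbd "C-j")
    #'quail-select-current))

(add-hook 'quail-activate-hook #'my-quail-extra-keys)
@end lisp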
0555: 
  Another way to type characters that are not on your keyboard is to
use @kbd{C-x 8 @key{RET}} (@code{insert-char}), which inserts a single
character based on its Unicode name or code-point; see @ref{Inserting
Text}.
0560: 
0561: @node Select Input Method
0562: @section Selecting an Input Method
0563: 
0564: @table @kbd
0565: @item C-\
0566: Enable or disable use of the selected input method (@code{toggle-input-method}).
0567: 
0568: @item C-x @key{RET} C-\ @var{method} @key{RET}
0569: Select a new input method for the current buffer (@code{set-input-method}).
0570: 
0571: @item C-h I @var{method} @key{RET}
0572: @itemx C-h C-\ @var{method} @key{RET}
0573: @findex describe-input-method
0574: @kindex C-h I
0575: @kindex C-h C-\
0576: Describe the input method @var{method} (@code{describe-input-method}).
0577: By default, it describes the current input method (if any).  This
0578: description should give you the full details of how to use any
0579: particular input method.
0580: 
0581: @item M-x list-input-methods
0582: Display a list of all the supported input methods.
0583: @end table
0584: 
0585: @findex set-input-method
0586: @vindex current-input-method
0587: @kindex C-x RET C-\
0588:   To choose an input method for the current buffer, use @kbd{C-x
0589: @key{RET} C-\} (@code{set-input-method}).  This command reads the
0590: input method name from the minibuffer; the name normally starts with the
0591: language environment that it is meant to be used with.  The variable
0592: @code{current-input-method} records which input method is selected.
0593: 
0594: @findex toggle-input-method
0595: @kindex C-\
0596:   Input methods use various sequences of @acronym{ASCII} characters to
0597: stand for non-@acronym{ASCII} characters.  Sometimes it is useful to
0598: turn off the input method temporarily.  To do this, type @kbd{C-\}
0599: (@code{toggle-input-method}).  To reenable the input method, type
0600: @kbd{C-\} again.
0601: 
0602:   If you type @kbd{C-\} and you have not yet selected an input method,
0603: it prompts you to specify one.  This has the same effect as using
0604: @kbd{C-x @key{RET} C-\} to specify an input method.
0605: 
0606:   When invoked with a numeric argument, as in @kbd{C-u C-\},
0607: @code{toggle-input-method} always prompts you for an input method,
0608: suggesting the most recently selected one as the default.
0609: 
0610: @vindex default-input-method
0611:   Selecting a language environment specifies a default input method for
0612: use in various buffers.  When you have a default input method, you can
0613: select it in the current buffer by typing @kbd{C-\}.  The variable
0614: @code{default-input-method} specifies the default input method
0615: (@code{nil} means there is none).
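
  Both variables can be examined from Lisp.  The sketch below
activates an input method in a temporary buffer and returns the value
of @code{current-input-method} there; the function name is
hypothetical, and @samp{latin-1-postfix} is just a convenient method
to try.

@lisp
(defun my-input-method-in-temp-buffer (method)
  "Activate METHOD in a scratch buffer.
Return the value of `current-input-method' in that buffer."
  (with-temp-buffer
    (set-input-method method)
    current-input-method))

(my-input-method-in-temp-buffer "latin-1-postfix")
;; => "latin-1-postfix"
@end lisp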
0616: 
0617:   In some language environments, which support several different input
0618: methods, you might want to use an input method different from the
0619: default chosen by @code{set-language-environment}.  You can instruct
0620: Emacs to select a different default input method for a certain
0621: language environment, if you wish, by using
0622: @code{set-language-environment-hook} (@pxref{Language Environments,
0623: set-language-environment-hook}).  For example:
0624: 
0625: @lisp
0626: (defun my-chinese-setup ()
0627:   "Set up my private Chinese environment."
0628:   (if (equal current-language-environment "Chinese-GB")
0629:       (setq default-input-method "chinese-tonepy")))
0630: (add-hook 'set-language-environment-hook 'my-chinese-setup)
0631: @end lisp
0632: 
0633: @noindent
0634: This sets the default input method to be @code{chinese-tonepy}
0635: whenever you choose a Chinese-GB language environment.
0636: 
0637: You can instruct Emacs to activate a certain input method
0638: automatically.  For example:
0639: 
0640: @lisp
0641: (add-hook 'text-mode-hook
0642:   (lambda () (set-input-method "german-prefix")))
0643: @end lisp
0644: 
0645: @noindent
0646: This automatically activates the input method @code{german-prefix} in
0647: Text mode.
0648: 
0649: @findex quail-set-keyboard-layout
0650:   Some input methods for alphabetic scripts work by (in effect)
0651: remapping the keyboard to emulate various keyboard layouts commonly used
0652: for those scripts.  How to do this remapping properly depends on your
0653: actual keyboard layout.  To specify which layout your keyboard has, use
0654: the command @kbd{M-x quail-set-keyboard-layout}.
0655: 
0656: @findex quail-show-key
0657:   You can use the command @kbd{M-x quail-show-key} to show what key (or
0658: key sequence) to type in order to input the character following point,
0659: using the selected keyboard layout.  The command @kbd{C-u C-x =} also
0660: shows that information, in addition to other information about the
0661: character.
0662: 
0663: @findex list-input-methods
0664:   @kbd{M-x list-input-methods} displays a list of all the supported
0665: input methods.  The list gives information about each input method,
0666: including the string that stands for it in the mode line.
0667: 
0668: @node Coding Systems
0669: @section Coding Systems
0670: @cindex coding systems
0671: 
  Users of various languages have established many more-or-less
standard coding systems for representing text in those languages.
Emacs does not use these coding systems internally; instead, it
converts from various coding systems to
0675: its own system when reading data, and converts the internal coding
0676: system to other coding systems when writing data.  Conversion is
0677: possible in reading or writing files, in sending or receiving from the
0678: terminal, and in exchanging data with subprocesses.
0679: 
0680:   Emacs assigns a name to each coding system.  Most coding systems are
0681: used for one language, and the name of the coding system starts with
0682: the language name.  Some coding systems are used for several
0683: languages; their names usually start with @samp{iso}.  There are also
0684: special coding systems, such as @code{no-conversion}, @code{raw-text},
0685: and @code{emacs-internal}.
0686: 
0687: @cindex international files from DOS/Windows systems
0688:   A special class of coding systems, collectively known as
0689: @dfn{codepages}, is designed to support text encoded by MS-Windows and
0690: MS-DOS software.  The names of these coding systems are
0691: @code{cp@var{nnnn}}, where @var{nnnn} is a 3- or 4-digit number of the
0692: codepage.  You can use these encodings just like any other coding
0693: system; for example, to visit a file encoded in codepage 850, type
0694: @kbd{C-x @key{RET} c cp850 @key{RET} C-x C-f @var{filename}
0695: @key{RET}}.
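
  As a minimal sketch of what decoding with a codepage means, you can
evaluate the following expression with @kbd{M-:}.  It relies on the
fact that the byte with octal value 202 stands for @samp{@'e} in
codepage 850:

@lisp
(decode-coding-string "\202" 'cp850)
     @result{} "é"
@end lisp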
0696: 
0697:   In addition to converting various representations of non-@acronym{ASCII}
0698: characters, a coding system can perform end-of-line conversion.  Emacs
0699: handles three different conventions for how to separate lines in a file:
0700: newline (Unix), carriage return followed by linefeed (DOS), and just
0701: carriage return (Mac).
0702: 
0703: @table @kbd
0704: @item C-h C @var{coding} @key{RET}
0705: Describe coding system @var{coding} (@code{describe-coding-system}).
0706: 
0707: @item C-h C @key{RET}
0708: Describe the coding systems currently in use (@code{describe-coding-system}).
0709: 
0710: @item M-x list-coding-systems
0711: Display a list of all the supported coding systems.
0712: @end table
0713: 
0714: @kindex C-h C
0715: @findex describe-coding-system
0716:   The command @kbd{C-h C} (@code{describe-coding-system}) displays
0717: information about particular coding systems, including the end-of-line
0718: conversion specified by those coding systems.  You can specify a coding
0719: system name as the argument; alternatively, with an empty argument, it
0720: describes the coding systems currently selected for various purposes,
0721: both in the current buffer and as the defaults, and the priority list
0722: for recognizing coding systems (@pxref{Recognize Coding}).
0723: 
0724: @findex list-coding-systems
0725:   To display a list of all the supported coding systems, type @kbd{M-x
0726: list-coding-systems}.  The list gives information about each coding
0727: system, including the letter that stands for it in the mode line
0728: (@pxref{Mode Line}).
0729: 
0730: @cindex end-of-line conversion
0731: @cindex line endings
0732: @cindex MS-DOS end-of-line conversion
0733: @cindex Macintosh end-of-line conversion
0734:   Each of the coding systems that appear in this list---except for
0735: @code{no-conversion}, which means no conversion of any kind---specifies
0736: how and whether to convert printing characters, but leaves the choice of
0737: end-of-line conversion to be decided based on the contents of each file.
0738: For example, if the file appears to use the sequence carriage return
0739: and linefeed to separate lines, DOS end-of-line conversion will be used.
0740: 
0741:   Each of the listed coding systems has three variants, which specify
0742: exactly what to do for end-of-line conversion:
0743: 
0744: @table @code
0745: @item @dots{}-unix
0746: Don't do any end-of-line conversion; assume the file uses
0747: newline to separate lines.  (This is the convention normally used
0748: on Unix and GNU systems, and macOS.)
0749: 
0750: @item @dots{}-dos
0751: Assume the file uses carriage return followed by linefeed to separate
0752: lines, and do the appropriate conversion.  (This is the convention
0753: normally used on Microsoft systems.@footnote{It is also specified for
0754: MIME @samp{text/*} bodies and in other network transport contexts.  It
0755: is different from the SGML reference syntax record-start/record-end
0756: format, which Emacs doesn't support directly.})
0757: 
0758: @item @dots{}-mac
0759: Assume the file uses carriage return to separate lines, and do the
0760: appropriate conversion.  (This was the convention used in Classic Mac
0761: OS.)
0762: @end table
0763: 
0764:   These variant coding systems are omitted from the
0765: @code{list-coding-systems} display for brevity, since they are entirely
0766: predictable.  For example, the coding system @code{iso-latin-1} has
0767: variants @code{iso-latin-1-unix}, @code{iso-latin-1-dos} and
0768: @code{iso-latin-1-mac}.
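
  If you want to inspect these variants from Lisp, the following
sketch may help.  It uses the functions @code{coding-system-eol-type}
and @code{coding-system-change-eol-conversion}; evaluate each
expression with @kbd{M-:}.

@lisp
;; A variant with a fixed end-of-line type yields an integer:
;; 0 for Unix, 1 for DOS, 2 for Mac.
(coding-system-eol-type 'iso-8859-1-dos)
     @result{} 1

;; A coding system that leaves the choice open yields a vector
;; holding its three variants instead.
(vectorp (coding-system-eol-type 'iso-8859-1))
     @result{} t

;; Derive the DOS variant of a coding system.
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
@end lisp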
0769: 
0770: @cindex @code{undecided}, coding system
0771:   The coding systems @code{unix}, @code{dos}, and @code{mac} are
0772: aliases for @code{undecided-unix}, @code{undecided-dos}, and
0773: @code{undecided-mac}, respectively.  These coding systems specify only
0774: the end-of-line conversion, and leave the character code conversion to
0775: be deduced from the text itself.
0776: 
0777: @cindex @code{raw-text}, coding system
0778:   The coding system @code{raw-text} is good for a file which is mainly
0779: @acronym{ASCII} text, but may contain byte values above 127 that are
0780: not meant to encode non-@acronym{ASCII} characters.  With
0781: @code{raw-text}, Emacs copies those byte values unchanged, and sets
0782: @code{enable-multibyte-characters} to @code{nil} in the current buffer
0783: so that they will be interpreted properly.  @code{raw-text} handles
0784: end-of-line conversion in the usual way, based on the data
0785: encountered, and has the usual three variants to specify the kind of
0786: end-of-line conversion to use.
0787: 
0788: @cindex @code{no-conversion}, coding system
0789:   In contrast, the coding system @code{no-conversion} specifies no
0790: character code conversion at all---none for non-@acronym{ASCII} byte values and
0791: none for end of line.  This is useful for reading or writing binary
0792: files, tar files, and other files that must be examined verbatim.  It,
0793: too, sets @code{enable-multibyte-characters} to @code{nil}.
0794: 
0795:   The easiest way to edit a file with no conversion of any kind is with
0796: the @kbd{M-x find-file-literally} command.  This uses
0797: @code{no-conversion}, and also suppresses other Emacs features that
0798: might convert the file contents before you see them.  @xref{Visiting}.
0799: 
0800: @cindex @code{emacs-internal}, coding system
0801:   The coding system @code{emacs-internal} (or @code{utf-8-emacs},
0802: which is equivalent) means that the file contains non-@acronym{ASCII}
0803: characters stored with the internal Emacs encoding.  This coding
0804: system handles end-of-line conversion based on the data encountered,
0805: and has the usual three variants to specify the kind of end-of-line
0806: conversion.
0807: 
0808: @node Recognize Coding
0809: @section Recognizing Coding Systems
0810: 
0811:   Whenever Emacs reads a given piece of text, it tries to recognize
0812: which coding system to use.  This applies to files being read, output
0813: from subprocesses, text from X selections, etc.  Emacs can select the
0814: right coding system automatically most of the time---once you have
0815: specified your preferences.
0816: 
0817:   Some coding systems can be recognized or distinguished by which byte
0818: sequences appear in the data.  However, there are coding systems that
0819: cannot be distinguished, not even potentially.  For example, there is no
0820: way to distinguish between Latin-1 and Latin-2; they use the same byte
0821: values with different meanings.
0822: 
0823:   Emacs handles this situation by means of a priority list of coding
0824: systems.  Whenever Emacs reads a file, if you do not specify the coding
0825: system to use, Emacs checks the data against each coding system,
0826: starting with the first in priority and working down the list, until it
0827: finds a coding system that fits the data.  Then it converts the file
0828: contents assuming that they are represented in this coding system.
0829: 
0830:   The priority list of coding systems depends on the selected language
0831: environment (@pxref{Language Environments}).  For example, if you use
0832: French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use
0833: Czech, you probably want Latin-2 to be preferred.  This is one of the
0834: reasons to specify a language environment.
0835: 
0836: @findex prefer-coding-system
0837:   However, you can alter the coding system priority list in detail
0838: with the command @kbd{M-x prefer-coding-system}.  This command reads
0839: the name of a coding system from the minibuffer, and adds it to the
0840: front of the priority list, so that it is preferred to all others.  If
0841: you use this command several times, each use adds one element to the
0842: front of the priority list.
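
  You can also call @code{prefer-coding-system} from your init file.
As a sketch, assuming you mostly read UTF-8 text but occasionally
Latin-2 text, the order of the calls matters because each call puts
its argument at the front:

@lisp
(prefer-coding-system 'iso-8859-2)
(prefer-coding-system 'utf-8)   ; called last, so tried first
@end lisp

@noindent
Afterwards, @kbd{C-h C @key{RET}} shows the resulting priority list.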
0843: 
0844:   If you use a coding system that specifies the end-of-line conversion
0845: type, such as @code{iso-8859-1-dos}, what this means is that Emacs
0846: should attempt to recognize @code{iso-8859-1} with priority, and should
0847: use DOS end-of-line conversion when it does recognize @code{iso-8859-1}.
0848: 
0849: @vindex file-coding-system-alist
0850:   Sometimes a file name indicates which coding system to use for the
0851: file.  The variable @code{file-coding-system-alist} specifies this
0852: correspondence.  There is a special function
0853: @code{modify-coding-system-alist} for adding elements to this list.  For
0854: example, to read and write all @samp{.txt} files using the coding system
0855: @code{chinese-iso-8bit}, you can execute this Lisp expression:
0856: 
0857: @smallexample
0858: (modify-coding-system-alist 'file "\\.txt\\'" 'chinese-iso-8bit)
0859: @end smallexample
0860: 
0861: @noindent
0862: The first argument should be @code{file}, the second argument should be
0863: a regular expression that determines which files this applies to, and
0864: the third argument says which coding system to use for these files.
0865: 
0866: @vindex inhibit-eol-conversion
0867: @cindex DOS-style end-of-line display
0868:   Emacs recognizes which kind of end-of-line conversion to use based on
0869: the contents of the file: if it sees only carriage returns, or only
0870: carriage return followed by linefeed sequences, then it chooses the
0871: end-of-line conversion accordingly.  You can inhibit the automatic use
0872: of end-of-line conversion by setting the variable
0873: @code{inhibit-eol-conversion} to non-@code{nil}.  If you do that,
0874: DOS-style files will be displayed with the @samp{^M} characters
0875: visible in the buffer; some people prefer this to the more subtle
0876: @samp{(DOS)} end-of-line type indication near the left edge of the
0877: mode line (@pxref{Mode Line, eol-mnemonic}).
0878: 
0879: @vindex inhibit-iso-escape-detection
0880: @cindex escape sequences in files
0881:   By default, the automatic detection of the coding system is sensitive to
0882: escape sequences.  If Emacs sees a sequence of characters that begin
0883: with an escape character, and the sequence is valid as an ISO-2022
0884: code, that tells Emacs to use one of the ISO-2022 encodings to decode
0885: the file.
0886: 
  However, there may be cases where you want to read escape sequences
in a file as is.  In such a case, you can set the variable
0889: @code{inhibit-iso-escape-detection} to non-@code{nil}.  Then the code
0890: detection ignores any escape sequences, and never uses an ISO-2022
0891: encoding.  The result is that all escape sequences become visible in
0892: the buffer.
0893: 
0894:   The default value of @code{inhibit-iso-escape-detection} is
0895: @code{nil}.  We recommend that you not change it permanently, only for
0896: one specific operation.  That's because some Emacs Lisp source files
0897: in the Emacs distribution contain non-@acronym{ASCII} characters encoded in the
0898: coding system @code{iso-2022-7bit}, and they won't be
0899: decoded correctly when you visit those files if you suppress the
0900: escape sequence detection.
0901: @c I count a grand total of 3 such files, so is the above really true?
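
  One way to limit the change to a single operation is to bind the
variable with @code{let} around that operation.  The following is
only a sketch; the command name @code{my-find-file-keep-escapes} is
hypothetical, and the binding has no effect on a file that is already
being visited in some buffer:

@lisp
(defun my-find-file-keep-escapes (file)
  "Visit FILE without treating escape sequences as ISO-2022 codes."
  (interactive "fFind file (escape sequences as is): ")
  ;; The binding lasts only while `find-file' runs.
  (let ((inhibit-iso-escape-detection t))
    (find-file file)))
@end lisp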
0902: 
0903: @vindex auto-coding-alist
0904: @vindex auto-coding-regexp-alist
0905:   The variables @code{auto-coding-alist} and
0906: @code{auto-coding-regexp-alist} are
0907: the strongest way to specify the coding system for certain patterns of
0908: file names, or for files containing certain patterns, respectively.
0909: These variables even override @samp{-*-coding:-*-} tags in the file
0910: itself (@pxref{Specify Coding}).  For example, Emacs
0911: uses @code{auto-coding-alist} for tar and archive files, to prevent it
0912: from being confused by a @samp{-*-coding:-*-} tag in a member of the
0913: archive and thinking it applies to the archive file as a whole.
0914: @ignore
0915: @c This describes old-style BABYL files, which are no longer relevant.
0916: Likewise, Emacs uses @code{auto-coding-regexp-alist} to ensure that
0917: RMAIL files, whose names in general don't match any particular
0918: pattern, are decoded correctly.
0919: @end ignore
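
  Each element of @code{auto-coding-alist} has the form
@code{(@var{regexp} . @var{coding-system})}.  As a sketch, assuming
you keep binary data in files whose names end in @samp{.dat} (a
hypothetical convention), you could make Emacs read them verbatim
regardless of any tag they might contain:

@lisp
(add-to-list 'auto-coding-alist '("\\.dat\\'" . no-conversion))
@end lisp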
0920: 
0921: @vindex auto-coding-functions
0922:   Another way to specify a coding system is with the variable
@code{auto-coding-functions}.  For example, one of the built-in
functions in @code{auto-coding-functions} detects the encoding for
XML files.
0925: Unlike the previous two, this variable does not override any
0926: @samp{-*-coding:-*-} tag.
0927: 
0928: @node Specify Coding
0929: @section Specifying a File's Coding System
0930: 
0931:   If Emacs recognizes the encoding of a file incorrectly, you can
0932: reread the file using the correct coding system with @kbd{C-x
0933: @key{RET} r} (@code{revert-buffer-with-coding-system}).  This command
0934: prompts for the coding system to use.  To see what coding system Emacs
0935: actually used to decode the file, look at the coding system mnemonic
0936: letter near the left edge of the mode line (@pxref{Mode Line}), or
0937: type @kbd{C-h C} (@code{describe-coding-system}).
0938: 
0939: @vindex coding
0940:   You can specify the coding system for a particular file in the file
0941: itself, using the @w{@samp{-*-@dots{}-*-}} construct at the beginning,
0942: or a local variables list at the end (@pxref{File Variables}).  You do
0943: this by defining a value for the ``variable'' named @code{coding}.
0944: Emacs does not really have a variable @code{coding}; instead of
0945: setting a variable, this uses the specified coding system for the
0946: file.  For example, @w{@samp{-*-mode: C; coding: latin-1; -*-}} specifies
0947: use of the Latin-1 coding system, as well as C mode.  When you specify
0948: the coding explicitly in the file, that overrides
0949: @code{file-coding-system-alist}.
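
  For illustration, here is how the two forms might look in an Emacs
Lisp file, where @samp{;;} starts a comment; the choice of
@code{utf-8} is just an example.  The first form goes at the
beginning of the file:

@example
;; -*- coding: utf-8 -*-
@end example

@noindent
The second form goes at the end of the file:

@example
;; Local Variables:
;; coding: utf-8
;; End:
@end example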
0950: 
0951: @node Output Coding
0952: @section Choosing Coding Systems for Output
0953: 
0954: @vindex buffer-file-coding-system
0955:   Once Emacs has chosen a coding system for a buffer, it stores that
0956: coding system in @code{buffer-file-coding-system}.  That makes it the
0957: default for operations that write from this buffer into a file, such
0958: as @code{save-buffer} and @code{write-region}.  You can specify a
0959: different coding system for further file output from the buffer using
0960: @code{set-buffer-file-coding-system} (@pxref{Text Coding}).
0961: 
0962:   You can insert any character Emacs supports into any Emacs buffer,
0963: but most coding systems can only handle a subset of these characters.
0964: Therefore, it's possible that the characters you insert cannot be
0965: encoded with the coding system that will be used to save the buffer.
0966: For example, you could visit a text file in Polish, encoded in
0967: @code{iso-8859-2}, and add some Russian words to it.  When you save
0968: that buffer, Emacs cannot use the current value of
0969: @code{buffer-file-coding-system}, because the characters you added
0970: cannot be encoded by that coding system.
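
  To find the offending characters before saving, you can use the
function @code{unencodable-char-position}.  The command below is a
minimal sketch; its name, @code{my-goto-unencodable-char}, is
hypothetical:

@lisp
(defun my-goto-unencodable-char ()
  "Go to the first char `buffer-file-coding-system' cannot encode."
  (interactive)
  (let ((pos (unencodable-char-position
              (point-min) (point-max) buffer-file-coding-system)))
    (if pos
        (goto-char pos)
      (message "Every character can be encoded"))))
@end lisp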
0971: 
0972:   When that happens, Emacs tries the most-preferred coding system (set
0973: by @kbd{M-x prefer-coding-system} or @kbd{M-x
0974: set-language-environment}).  If that coding system can safely encode
0975: all of the characters in the buffer, Emacs uses it, and stores its
0976: value in @code{buffer-file-coding-system}.  Otherwise, Emacs displays
0977: a list of coding systems suitable for encoding the buffer's contents,
0978: and asks you to choose one of those coding systems.
0979: 
0980:   If you insert the unsuitable characters in a mail message, Emacs
0981: behaves a bit differently.  It additionally checks whether the
0982: @c What determines this?
0983: most-preferred coding system is recommended for use in MIME messages;
0984: if not, it informs you of this fact and prompts you for another coding
0985: system.  This is so you won't inadvertently send a message encoded in
0986: a way that your recipient's mail software will have difficulty
0987: decoding.  (You can still use an unsuitable coding system if you enter
0988: its name at the prompt.)
0989: 
0990: @c It seems that select-message-coding-system does this.
@c Both sendmail.el and smtpmail.el call it; i.e., smtpmail.el still
0992: @c obeys sendmail-coding-system.
0993: @vindex sendmail-coding-system
0994:   When you send a mail message (@pxref{Sending Mail}),
0995: Emacs has four different ways to determine the coding system to use
0996: for encoding the message text.  It first tries the buffer's own value of
0997: @code{buffer-file-coding-system}, if that is non-@code{nil}.
0998: Otherwise, it uses the value of @code{sendmail-coding-system}, if that
is non-@code{nil}.  Thirdly, it uses the value of
@code{default-sendmail-coding-system}.
If all three of these values are @code{nil}, Emacs encodes outgoing
1002: mail using the default coding system for new files (i.e., the
1003: default value of @code{buffer-file-coding-system}), which is
1004: controlled by your choice of language environment.
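
  For example, assuming you want UTF-8 as the fallback for outgoing
mail, you could put the following in your init file.  Per the order
described above, it takes effect only when the message buffer has no
value of its own for @code{buffer-file-coding-system}:

@lisp
(setq sendmail-coding-system 'utf-8)
@end lisp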
1005: 
1006: @node Text Coding
1007: @section Specifying a Coding System for File Text
1008: 
1009:   In cases where Emacs does not automatically choose the right coding
1010: system for a file's contents, you can use these commands to specify
1011: one:
1012: 
1013: @table @kbd
1014: @item C-x @key{RET} f @var{coding} @key{RET}
1015: Use coding system @var{coding} to save or revisit the file in
1016: the current buffer (@code{set-buffer-file-coding-system}).
1017: 
1018: @item C-x @key{RET} c @var{coding} @key{RET}
1019: Specify coding system @var{coding} for the immediately following
1020: command (@code{universal-coding-system-argument}).
1021: 
1022: @item C-x @key{RET} r @var{coding} @key{RET}
1023: Revisit the current file using the coding system @var{coding}
1024: (@code{revert-buffer-with-coding-system}).
1025: 
1026: @item M-x recode-region @key{RET} @var{right} @key{RET} @var{wrong} @key{RET}
1027: Convert a region that was decoded using coding system @var{wrong},
1028: decoding it using coding system @var{right} instead.
1029: @end table
1030: 
1031: @kindex C-x RET f
1032: @findex set-buffer-file-coding-system
1033:   The command @kbd{C-x @key{RET} f}
1034: (@code{set-buffer-file-coding-system}) sets the file coding system for
1035: the current buffer (i.e., the coding system to use when saving or
1036: reverting the file).  You specify which coding system using the
1037: minibuffer.  You can also invoke this command by clicking with
1038: @kbd{mouse-3} on the coding system indicator in the mode line
1039: (@pxref{Mode Line}).
1040: 
1041:   If you specify a coding system that cannot handle all the characters
1042: in the buffer, Emacs will warn you about the troublesome characters,
1043: and ask you to choose another coding system, when you try to save the
1044: buffer (@pxref{Output Coding}).
1045: 
1046: @cindex specify end-of-line conversion
1047:   You can also use this command to specify the end-of-line conversion
1048: (@pxref{Coding Systems, end-of-line conversion}) for encoding the
1049: current buffer.  For example, @kbd{C-x @key{RET} f dos @key{RET}} will
1050: cause Emacs to save the current buffer's text with DOS-style
1051: carriage return followed by linefeed line endings.
1052: 
1053: @kindex C-x RET c
1054: @findex universal-coding-system-argument
1055:   Another way to specify the coding system for a file is when you visit
1056: the file.  First use the command @kbd{C-x @key{RET} c}
1057: (@code{universal-coding-system-argument}); this command uses the
1058: minibuffer to read a coding system name.  After you exit the minibuffer,
1059: the specified coding system is used for @emph{the immediately following
1060: command}.
1061: 
1062:   So if the immediately following command is @kbd{C-x C-f}, for example,
1063: it reads the file using that coding system (and records the coding
1064: system for when you later save the file).  Or if the immediately following
1065: command is @kbd{C-x C-w}, it writes the file using that coding system.
1066: When you specify the coding system for saving in this way, instead
1067: of with @kbd{C-x @key{RET} f}, there is no warning if the buffer
1068: contains characters that the coding system cannot handle.
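
  In Lisp code, a similar effect can be had by binding the variables
@code{coding-system-for-read} and @code{coding-system-for-write}
around the file operation.  This is only a sketch, and the file names
are hypothetical:

@lisp
;; Read: decode the file as codepage 850 while visiting it.
(let ((coding-system-for-read 'cp850))
  (find-file "~/notes.txt"))

;; Write: encode the current buffer as UTF-8 with DOS line endings.
(let ((coding-system-for-write 'utf-8-dos))
  (write-file "~/notes-utf8.txt"))
@end lisp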
1069: 
1070:   Other file commands affected by a specified coding system include
1071: @kbd{C-x i} and @kbd{C-x C-v}, as well as the other-window variants
1072: of @kbd{C-x C-f}.  @kbd{C-x @key{RET} c} also affects commands that
1073: start subprocesses, including @kbd{M-x shell} (@pxref{Shell}).  If the
1074: immediately following command does not use the coding system, then
1075: @kbd{C-x @key{RET} c} ultimately has no effect.
1076: 
1077:   An easy way to visit a file with no conversion is with the @kbd{M-x
1078: find-file-literally} command.  @xref{Visiting}.
1079: 
1080:   The default value of the variable @code{buffer-file-coding-system}
1081: specifies the choice of coding system to use when you create a new file.
1082: It applies when you find a new file, and when you create a buffer and
1083: then save it in a file.  Selecting a language environment typically sets
1084: this variable to a good choice of default coding system for that language
1085: environment.
1086: 
1087: @kindex C-x RET r
1088: @findex revert-buffer-with-coding-system
1089:   If you visit a file with a wrong coding system, you can correct this
1090: with @kbd{C-x @key{RET} r} (@code{revert-buffer-with-coding-system}).
1091: This visits the current file again, using a coding system you specify.
1092: 
1093: @findex recode-region
1094:   If a piece of text has already been inserted into a buffer using the
1095: wrong coding system, you can redo the decoding of it using @kbd{M-x
1096: recode-region}.  This prompts you for the proper coding system, then
1097: for the wrong coding system that was actually used, and does the
1098: conversion.  It first encodes the region using the wrong coding system,
1099: then decodes it again using the proper coding system.
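
  The same two steps can be tried on a string.  In this sketch, the
UTF-8 bytes of @samp{@'e} (octal 303 251) are first decoded wrongly
as Latin-1, and the mistake is then undone:

@lisp
;; The wrong decoding produces two characters instead of one.
(decode-coding-string "\303\251" 'iso-8859-1)
     @result{} "Ã©"

;; Encode with the wrong coding system to recover the bytes,
;; then decode them with the right one.
(decode-coding-string
 (encode-coding-string "Ã©" 'iso-8859-1)
 'utf-8)
     @result{} "é"
@end lisp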
1100: 
1101: @node Communication Coding
1102: @section Coding Systems for Interprocess Communication
1103: 
1104:   This section explains how to specify coding systems for use
1105: in communication with other processes.
1106: 
1107: @table @kbd
1108: @item C-x @key{RET} x @var{coding} @key{RET}
1109: Use coding system @var{coding} for transferring selections to and from
1110: other graphical applications (@code{set-selection-coding-system}).
1111: 
1112: @item C-x @key{RET} X @var{coding} @key{RET}
1113: Use coding system @var{coding} for transferring @emph{one}
1114: selection---the next one---to or from another graphical application
1115: (@code{set-next-selection-coding-system}).
1116: 
1117: @item C-x @key{RET} p @var{input-coding} @key{RET} @var{output-coding} @key{RET}
1118: Use coding systems @var{input-coding} and @var{output-coding} for
1119: subprocess input and output in the current buffer
1120: (@code{set-buffer-process-coding-system}).
1121: @end table
1122: 
1123: @kindex C-x RET x
1124: @kindex C-x RET X
1125: @findex set-selection-coding-system
1126: @findex set-next-selection-coding-system
1127:   The command @kbd{C-x @key{RET} x} (@code{set-selection-coding-system})
1128: specifies the coding system for sending selected text to other windowing
1129: applications, and for receiving the text of selections made in other
1130: applications.  This command applies to all subsequent selections, until
1131: you override it by using the command again.  The command @kbd{C-x
1132: @key{RET} X} (@code{set-next-selection-coding-system}) specifies the
1133: coding system for the next selection made in Emacs or read by Emacs.
1134: 
1135: @vindex x-select-request-type
1136:   The variable @code{x-select-request-type} specifies the data type to
1137: request from the X Window System for receiving text selections from
1138: other applications.  If the value is @code{nil} (the default), Emacs
tries @code{UTF8_STRING} and @code{COMPOUND_TEXT}, in this order, and
uses various heuristics to choose the more appropriate of the two
results; if neither of these succeeds, Emacs falls back on @code{STRING}.
1142: If the value of @code{x-select-request-type} is one of the symbols
1143: @code{COMPOUND_TEXT}, @code{UTF8_STRING}, @code{STRING}, or
1144: @code{TEXT}, Emacs uses only that request type.  If the value is a
1145: list of some of these symbols, Emacs tries only the request types in
1146: the list, in order, until one of them succeeds, or until the list is
1147: exhausted.
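
  For instance, assuming you want Emacs to request only
@code{UTF8_STRING} and then @code{STRING}, skipping
@code{COMPOUND_TEXT}, a sketch for your init file is:

@lisp
(setq x-select-request-type '(UTF8_STRING STRING))
@end lisp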
1148: 
1149: @kindex C-x RET p
1150: @findex set-buffer-process-coding-system
1151:   The command @kbd{C-x @key{RET} p} (@code{set-buffer-process-coding-system})
1152: specifies the coding system for input and output to a subprocess.  This
1153: command applies to the current buffer; normally, each subprocess has its
1154: own buffer, and thus you can use this command to specify translation to
1155: and from a particular subprocess by giving the command in the
1156: corresponding buffer.
1157: 
1158:   You can also use @kbd{C-x @key{RET} c}
1159: (@code{universal-coding-system-argument}) just before the command that
1160: runs or starts a subprocess, to specify the coding system for
1161: communicating with that subprocess.  @xref{Text Coding}.
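
  To make such a choice permanent for a particular program, you can
call @code{modify-coding-system-alist} with @code{process} as its
first argument.  In this sketch, the program name @samp{aspell} is
only an assumed example; the second argument is a regular expression
matched against the program name, and the cons cell gives the coding
systems for decoding the program's output and for encoding its input:

@lisp
(modify-coding-system-alist 'process "aspell" '(utf-8 . utf-8))
@end lisp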
1162: 
1163:   The default for translation of process input and output depends on the
1164: current language environment.
1165: 
1166: @vindex locale-coding-system
1167: @cindex decoding non-@acronym{ASCII} keyboard input on X
1168:   The variable @code{locale-coding-system} specifies a coding system
1169: to use when encoding and decoding system strings such as system error
1170: messages and @code{format-time-string} formats and time stamps.  That
1171: coding system is also used for decoding non-@acronym{ASCII} keyboard
1172: input on the X Window System and for encoding text sent to the
1173: standard output and error streams when in batch mode.  You should
1174: choose a coding system that is compatible
1175: with the underlying system's text representation, which is normally
1176: specified by one of the environment variables @env{LC_ALL},
1177: @env{LC_CTYPE}, and @env{LANG}.  (The first one, in the order
1178: specified above, whose value is nonempty is the one that determines
1179: the text representation.)
1180: 
1181: @node File Name Coding
1182: @section Coding Systems for File Names
1183: 
1184: @table @kbd
1185: @item C-x @key{RET} F @var{coding} @key{RET}
1186: Use coding system @var{coding} for encoding and decoding file
1187: names (@code{set-file-name-coding-system}).
1188: @end table
1189: 
1190: @findex set-file-name-coding-system
1191: @kindex C-x RET F
1192: @cindex file names with non-@acronym{ASCII} characters
1193:   The command @kbd{C-x @key{RET} F} (@code{set-file-name-coding-system})
1194: specifies a coding system to use for encoding file @emph{names}.  It
1195: has no effect on reading and writing the @emph{contents} of files.
1196: 
1197: @vindex file-name-coding-system
1198:   In fact, all this command does is set the value of the variable
1199: @code{file-name-coding-system}.  If you set the variable to a coding
1200: system name (as a Lisp symbol or a string), Emacs encodes file names
1201: using that coding system for all file operations.  This makes it
1202: possible to use non-@acronym{ASCII} characters in file names---or, at
1203: least, those non-@acronym{ASCII} characters that the specified coding
1204: system can encode.
1205: 
1206:   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
1207: default coding system determined by the selected language environment,
1208: and stored in the @code{default-file-name-coding-system} variable.
1209: @c FIXME?  Is this correct?  What is the "default language environment"?
1210: In the default language environment, non-@acronym{ASCII} characters in
1211: file names are not encoded specially; they appear in the file system
1212: using the internal Emacs representation.
1213: 
1214: @cindex file-name encoding, MS-Windows
1215: @vindex w32-unicode-filenames
1216:   When Emacs runs on MS-Windows versions that are descendants of the
1217: NT family (Windows 2000, XP, and all the later versions), the value of
1218: @code{file-name-coding-system} is largely ignored, as Emacs by default
1219: uses APIs that allow passing Unicode file names directly.  By
1220: contrast, on Windows 9X, file names are encoded using
1221: @code{file-name-coding-system}, which should be set to the codepage
1222: (@pxref{Coding Systems, codepage}) pertinent for the current system
1223: locale.  The value of the variable @code{w32-unicode-filenames}
1224: controls whether Emacs uses the Unicode APIs when it calls OS
1225: functions that accept file names.  This variable is set by the startup
1226: code to @code{nil} on Windows 9X, and to @code{t} on newer versions of
1227: MS-Windows.
1228: 
1229:   @strong{Warning:} if you change @code{file-name-coding-system} (or the
1230: language environment) in the middle of an Emacs session, problems can
1231: result if you have already visited files whose names were encoded using
1232: the earlier coding system and cannot be encoded (or are encoded
1233: differently) under the new coding system.  If you try to save one of
1234: these buffers under the visited file name, saving may use the wrong file
1235: name, or it may encounter an error.  If such a problem happens, use @kbd{C-x
1236: C-w} to specify a new file name for that buffer.
1237: 
1238: @findex recode-file-name
1239:   If a mistake occurs when encoding a file name, use the command
1240: @kbd{M-x recode-file-name} to change the file name's coding
1241: system.  This prompts for an existing file name, its old coding
1242: system, and the coding system to which you wish to convert.
1243: 
1244: @node Terminal Coding
1245: @section Coding Systems for Terminal I/O
1246: 
1247: @table @kbd
1248: @item C-x @key{RET} t @var{coding} @key{RET}
1249: Use coding system @var{coding} for terminal output
1250: (@code{set-terminal-coding-system}).
1251: 
1252: @item C-x @key{RET} k @var{coding} @key{RET}
1253: Use coding system @var{coding} for keyboard input
1254: (@code{set-keyboard-coding-system}).
1255: @end table
1256: 
1257: @kindex C-x RET t
1258: @findex set-terminal-coding-system
1259:   The command @kbd{C-x @key{RET} t} (@code{set-terminal-coding-system})
specifies the coding system for terminal output.  If you specify a
coding system for terminal output, all characters output to the
terminal are translated into that coding system.
1263: 
1264:   This feature is useful for certain character-only terminals built to
1265: support specific languages or character sets---for example, European
1266: terminals that support one of the ISO Latin character sets.  You need to
1267: specify the terminal coding system when using multibyte text, so that
1268: Emacs knows which characters the terminal can actually handle.
1269: 
1270:   By default, output to the terminal is not translated at all, unless
1271: Emacs can deduce the proper coding system from your terminal type or
1272: your locale specification (@pxref{Language Environments}).
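
  As a minimal sketch of the command described above, the following
init-file fragment sets the terminal output coding system from Lisp
and then reports the value in effect.  The choice of @code{utf-8} is
only an assumption for illustration; use whatever encoding your
terminal really supports.

@lisp
;; Assumption: the text terminal understands UTF-8.
(unless (display-graphic-p)
  (set-terminal-coding-system 'utf-8))

;; Report the coding system now used for terminal output.
(message "Terminal coding system: %s" (terminal-coding-system))
@end lisp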
1273: 
1274: @kindex C-x RET k
1275: @findex set-keyboard-coding-system
1276: @vindex keyboard-coding-system
1277:   The command @kbd{C-x @key{RET} k} (@code{set-keyboard-coding-system}),
1278: or the variable @code{keyboard-coding-system}, specifies the coding
1279: system for keyboard input.  Character-code translation of keyboard
1280: input is useful for terminals with keys that send non-@acronym{ASCII}
1281: graphic characters---for example, some terminals designed for ISO
1282: Latin-1 or subsets of it.
1283: 
1284:   By default, keyboard input is translated based on your system locale
setting.  If your terminal does not really support the encoding
implied by your locale (for example, if you find that it inserts a
non-@acronym{ASCII} character when you type @kbd{M-i}), you will need to set
@code{keyboard-coding-system} to @code{nil} to turn off encoding.
1289: You can do this by putting
1290: 
1291: @lisp
1292: (set-keyboard-coding-system nil)
1293: @end lisp
1294: 
1295: @noindent
1296: in your init file.
1297: 
1298:   There is a similarity between using a coding system translation for
1299: keyboard input, and using an input method: both define sequences of
1300: keyboard input that translate into single characters.  However, input
1301: methods are designed to be convenient for interactive use by humans, and
1302: the sequences that are translated are typically sequences of @acronym{ASCII}
1303: printing characters.  Coding systems typically translate sequences of
1304: non-graphic characters.
1305: 
1306: @node Fontsets
1307: @section Fontsets
1308: @cindex fontsets
1309: 
1310:   A font typically defines shapes for a single alphabet or script.
1311: Therefore, displaying the entire range of scripts that Emacs supports
1312: requires a collection of many fonts.  In Emacs, such a collection is
1313: called a @dfn{fontset}.  A fontset is defined by a list of font specifications,
1314: each assigned to handle a range of character codes, and may fall back
1315: on another fontset for characters that are not covered by the fonts
1316: it specifies.
1317: 
1318: @cindex fonts for various scripts
1319: @cindex Intlfonts package, installation
1320:   Each fontset has a name, like a font.  However, while fonts are
1321: stored in the system and the available font names are defined by the
1322: system, fontsets are defined within Emacs itself.  Once you have
1323: defined a fontset, you can use it within Emacs by specifying its name,
1324: anywhere that you could use a single font.  Of course, Emacs fontsets
1325: can use only the fonts that the system supports.  If some characters
1326: appear on the screen as empty boxes or hex codes, this means that the
1327: fontset in use for them has no font for those characters.  In this
1328: case, or if the characters are shown, but not as well as you would
1329: like, you may need to install extra fonts.  Your operating system may
1330: have optional fonts that you can install; or you can install the GNU
1331: Intlfonts package, which includes fonts for most supported
1332: scripts.@footnote{If you run Emacs on X, you may need to inform the X
1333: server about the location of the newly installed fonts with commands
1334: such as:
1335: @c FIXME?  I feel like this may be out of date.
1336: @c E.g., the intlfonts tarfile is ~ 10 years old.
1337: 
1338: @example
1339:  xset fp+ /usr/local/share/emacs/fonts
1340:  xset fp rehash
1341: @end example
1342: }
1343: 
1344:   Emacs creates three fontsets automatically: the @dfn{standard
1345: fontset}, the @dfn{startup fontset} and the @dfn{default fontset}.
1346: @c FIXME?  The doc of *standard*-fontset-spec says:
1347: @c "You have the biggest chance to display international characters
1348: @c with correct glyphs by using the *standard* fontset." (my emphasis)
1349: @c See https://lists.gnu.org/r/emacs-devel/2012-04/msg00430.html
The default fontset is the one most likely to have fonts for a wide
variety of non-@acronym{ASCII} characters.  It is the default fallback
for the other two fontsets, and is also used as the fallback when you
set a default font rather than a fontset.
1353: However, it does not specify font family names, so results can be
1354: somewhat random if you use it directly.  You can specify a particular
1355: fontset by starting Emacs with the @samp{-fn} option.  For example,
1356: 
1357: @example
1358: emacs -fn fontset-standard
1359: @end example
1360: 
1361: @noindent
1362: You can also specify a fontset with the @samp{Font} resource (@pxref{X
1363: Resources}).
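
  The same choice can probably also be made from your init file.  The
following sketch assumes that the @code{font} frame parameter accepts
a fontset name in the same way that @samp{-fn} does; treat it as an
illustration rather than the recommended method.

@lisp
;; Ask newly created frames to use the standard fontset.
(add-to-list 'default-frame-alist
             '(font . "fontset-standard"))
@end lisp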
1364: 
1365:   If no fontset is specified for use, then Emacs uses an
1366: @acronym{ASCII} font, with @samp{fontset-default} as a fallback for
1367: characters the font does not cover.  The standard fontset is only used if
1368: explicitly requested, despite its name.
1369: 
1370: @findex describe-fontset
1371:   To show the information about a specific fontset, use the
1372: @w{@kbd{M-x describe-fontset}} command.  It prompts for a fontset
1373: name, defaulting to the one used by the current frame, and then
1374: displays all the subranges of characters and the fonts assigned to
1375: them in that fontset.
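
  A rough Lisp equivalent is sketched below.  It assumes a graphical
display, because fontsets are not used on text terminals, and it uses
@samp{fontset-default} merely as an example name.

@lisp
(when (display-graphic-p)
  ;; Show the names of all fontsets currently defined.
  (message "Fontsets: %S" (fontset-list))
  ;; Describe one of them in a help buffer.
  (describe-fontset "fontset-default"))
@end lisp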
1376: 
1377:   A fontset does not necessarily specify a font for every character
1378: code.  If a fontset specifies no font for a certain character, or if
1379: it specifies a font that does not exist on your system, then it cannot
1380: display that character properly.  It will display that character as a
1381: hex code or thin space or an empty box instead.  (@xref{Text Display, ,
1382: glyphless characters}, for details.)
1383: 
1384: @node Defining Fontsets
1385: @section Defining Fontsets
1386: 
1387: @vindex standard-fontset-spec
1388: @vindex w32-standard-fontset-spec
1389: @vindex ns-standard-fontset-spec
1390: @cindex standard fontset
  When running on X, Emacs creates a standard fontset automatically
according to the value of @code{standard-fontset-spec}.  This
fontset's name is
1393: 
1394: @example
1395: -*-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard
1396: @end example
1397: 
1398: @noindent
1399: or just @samp{fontset-standard} for short.
1400: 
1401:   On GNUstep and macOS, the standard fontset is created using the value of
@code{ns-standard-fontset-spec}, and on MS-Windows it is
1403: created using the value of @code{w32-standard-fontset-spec}.
1404: 
1405: @c FIXME?  How does one access these, or do anything with them?
1406: @c Does it matter?
1407:   Bold, italic, and bold-italic variants of the standard fontset are
1408: created automatically.  Their names have @samp{bold} instead of
1409: @samp{medium}, or @samp{i} instead of @samp{r}, or both.
1410: 
1411: @cindex startup fontset
1412:   Emacs generates a fontset automatically, based on any default
1413: @acronym{ASCII} font that you specify with the @samp{Font} resource or
1414: the @samp{-fn} argument, or the default font that Emacs found when it
1415: started.  This is the @dfn{startup fontset} and its name is
1416: @code{fontset-startup}.  Emacs generates this fontset by replacing the
1417: @var{charset_registry} field with @samp{fontset}, and replacing the
1418: @var{charset_encoding} field with @samp{startup}, then using the
1419: resulting string to specify a fontset.
1420: 
1421:   For instance, if you start Emacs with a font of this form,
1422: 
1423: @c FIXME?  I think this is a little misleading, because you cannot (?)
1424: @c actually specify a font with wildcards, it has to be a complete spec.
1425: @c Also, an X font specification of this form hasn't (?) been
1426: @c mentioned before now, and is somewhat obsolete these days.
1427: @c People are more likely to use a form like
1428: @c emacs -fn "DejaVu Sans Mono-12"
1429: @c How does any of this apply in that case?
1430: @example
1431: emacs -fn "*courier-medium-r-normal--14-140-*-iso8859-1"
1432: @end example
1433: 
1434: @noindent
1435: Emacs generates the following fontset and uses it for the initial X
1436: window frame:
1437: 
1438: @example
1439: -*-courier-medium-r-normal-*-14-140-*-*-*-*-fontset-startup
1440: @end example
1441: 
1442:   The startup fontset will use the font that you specify, or a variant
1443: with a different registry and encoding, for all the characters that
are supported by that font, and fall back on @samp{fontset-default} for
1445: other characters.
1446: 
1447:   With the X resource @samp{Emacs.Font}, you can specify a fontset name
1448: just like an actual font name.  But be careful not to specify a fontset
1449: name in a wildcard resource like @samp{Emacs*Font}---that wildcard
1450: specification matches various other resources, such as for menus, and
1451: @c FIXME is this still true?
1452: menus cannot handle fontsets.  @xref{X Resources}.
1453: 
1454:   You can specify additional fontsets using X resources named
1455: @samp{Fontset-@var{n}}, where @var{n} is an integer starting from 0.
1456: The resource value should have this form:
1457: 
1458: @smallexample
1459: @var{fontpattern}, @r{[}@var{charset}:@var{font}@r{]@dots{}}
1460: @end smallexample
1461: 
1462: @noindent
1463: where @var{fontpattern} should have the form of a standard X font name
1464: (see the previous fontset-startup example), except for the last two
1465: fields.  They should have the form @samp{fontset-@var{alias}}.
1466: 
  Each fontset has two names, one long and one short.  The long name
is @var{fontpattern}.  The short name is @samp{fontset-@var{alias}},
the last two fields of the long name (e.g., @samp{fontset-startup} for
the fontset automatically created at startup).  You can refer to the
fontset by either name.
1472: 
1473:   The construct @samp{@var{charset}:@var{font}} specifies which font to
1474: use (in this fontset) for one particular character set.  Here,
1475: @var{charset} is the name of a character set, and @var{font} is the
1476: font to use for that character set.  You can use this construct any
1477: number of times in defining one fontset.
1478: 
1479:   For the other character sets, Emacs chooses a font based on
1480: @var{fontpattern}.  It replaces @samp{fontset-@var{alias}} with values
1481: that describe the character set.  For the @acronym{ASCII} character font,
1482: @samp{fontset-@var{alias}} is replaced with @samp{ISO8859-1}.
1483: 
1484:   In addition, when several consecutive fields are wildcards, Emacs
1485: collapses them into a single wildcard.  This is to prevent use of
1486: auto-scaled fonts.  Fonts made by scaling larger fonts are not usable
1487: for editing, and scaling a smaller font is also not useful, because it is
1488: better to use the smaller font in its own size, which is what Emacs
1489: does.
1490: 
1491:   Thus if @var{fontpattern} is this,
1492: 
1493: @example
1494: -*-fixed-medium-r-normal-*-24-*-*-*-*-*-fontset-24
1495: @end example
1496: 
1497: @noindent
1498: the font specification for @acronym{ASCII} characters would be this:
1499: 
1500: @example
1501: -*-fixed-medium-r-normal-*-24-*-ISO8859-1
1502: @end example
1503: 
1504: @noindent
1505: and the font specification for Chinese GB2312 characters would be this:
1506: 
1507: @example
1508: -*-fixed-medium-r-normal-*-24-*-gb2312*-*
1509: @end example
1510: 
1511:   You may not have any Chinese font matching the above font
1512: specification.  Most X distributions include only Chinese fonts that
1513: have @samp{song ti} or @samp{fangsong ti} in the @var{family} field.  In
1514: such a case, @samp{Fontset-@var{n}} can be specified as:
1515: 
1516: @smallexample
1517: Emacs.Fontset-0: -*-fixed-medium-r-normal-*-24-*-*-*-*-*-fontset-24,\
1518:         chinese-gb2312:-*-*-medium-r-normal-*-24-*-gb2312*-*
1519: @end smallexample
1520: 
1521: @noindent
1522: Then, the font specifications for all but Chinese GB2312 characters have
1523: @samp{fixed} in the @var{family} field, and the font specification for
Chinese GB2312 characters has a wildcard @samp{*} in the @var{family}
1525: field.
1526: 
1527: @findex create-fontset-from-fontset-spec
1528:   The function that processes the fontset resource value to create the
1529: fontset is called @code{create-fontset-from-fontset-spec}.  You can also
1530: call this function explicitly to create a fontset.
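
  For instance, the resource value shown above could be passed to the
function directly.  This is a sketch only: it assumes a graphical
display, and it assumes that fonts matching these patterns are
installed on your system.

@lisp
(when (display-graphic-p)
  ;; The argument has the same form as a Fontset-N resource value.
  (create-fontset-from-fontset-spec
   (concat
    "-*-fixed-medium-r-normal-*-24-*-*-*-*-*-fontset-24,"
    "chinese-gb2312:-*-*-medium-r-normal-*-24-*-gb2312*-*")))
@end lisp

@noindent
Afterwards the new fontset can be referred to by its short name,
@samp{fontset-24}.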
1531: 
1532:   @xref{Fonts}, for more information about font naming.
1533: 
1534: @node Modifying Fontsets
1535: @section Modifying Fontsets
1536: @cindex fontsets, modifying
1537: @findex set-fontset-font
1538: 
1539:   Fontsets do not always have to be created from scratch.  If only
1540: minor changes are required it may be easier to modify an existing
1541: fontset.  Modifying @samp{fontset-default} will also affect other
1542: fontsets that use it as a fallback, so can be an effective way of
1543: fixing problems with the fonts that Emacs chooses for a particular
1544: script.
1545: 
1546: Fontsets can be modified using the function @code{set-fontset-font},
1547: specifying a character, a charset, a script, or a range of characters
1548: to modify the font for, and a font specification for the font to be
1549: used.  Some examples are:
1550: 
1551: @example
1552: ;; Use Liberation Mono for latin-3 charset.
1553: (set-fontset-font "fontset-default" 'iso-8859-3
1554:                   "Liberation Mono")
1555: 
1556: ;; Prefer a big5 font for han characters.
1557: (set-fontset-font "fontset-default"
1558:                   'han (font-spec :registry "big5")
1559:                   nil 'prepend)
1560: 
1561: ;; Use DejaVu Sans Mono as a fallback in fontset-startup
1562: ;; before resorting to fontset-default.
1563: (set-fontset-font "fontset-startup" nil "DejaVu Sans Mono"
1564:                   nil 'append)
1565: 
1566: ;; Use MyPrivateFont for the Unicode private use area.
(set-fontset-font "fontset-default" '(#xe000 . #xf8ff)
                  "MyPrivateFont")
1570: @end example
1571: 
1572: @cindex ignore font
1573: @cindex fonts, how to ignore
1574: @vindex face-ignored-fonts
1575:   Some fonts installed on your system might be broken, or produce
1576: unpleasant results for characters for which they are used, and you may
1577: wish to instruct Emacs to completely ignore them while searching for a
1578: suitable font required to display a character.  You can do that by
1579: adding the offending fonts to the value of the variable
1580: @code{face-ignored-fonts}, which is a list.  Here's an example to put
1581: in your @file{~/.emacs}:
1582: 
1583: @example
1584: (add-to-list 'face-ignored-fonts "Some Bad Font")
1585: @end example
1586: 
1587: @node Undisplayable Characters
1588: @section Undisplayable Characters
1589: 
1590:   There may be some non-@acronym{ASCII} characters that your
1591: terminal cannot display.  Most text terminals support just a single
1592: character set (use the variable @code{default-terminal-coding-system}
to tell Emacs which one; @pxref{Terminal Coding}); characters that
1594: can't be encoded in that coding system are displayed as @samp{?} by
1595: default.
1596: 
1597:   Graphical displays can display a broader range of characters, but
1598: you may not have fonts installed for all of them; characters that have
1599: no font appear as a hollow box.
1600: 
1601:   If you use Latin-1 characters but your terminal can't display
1602: Latin-1, you can arrange to display mnemonic @acronym{ASCII} sequences
1603: instead, e.g., @samp{"o} for o-umlaut.  Load the library
1604: @file{iso-ascii} to do this.
1605: 
1606: @vindex latin1-display
1607:   If your terminal can display Latin-1, you can display characters
1608: from other European character sets using a mixture of equivalent
1609: Latin-1 characters and @acronym{ASCII} mnemonics.  Customize the variable
1610: @code{latin1-display} to enable this.  The mnemonic @acronym{ASCII}
1611: sequences mostly correspond to those of the prefix input methods.
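
  In an init file, the two approaches described in this section might
be set up roughly as follows.  This is a sketch, and you would use
only one of the two forms, depending on what your terminal can
display.  It assumes that the @code{latin1-display} option can be
enabled with @code{customize-set-variable}.

@lisp
;; Case 1: the terminal cannot show Latin-1, so display
;; ASCII mnemonics instead.
(load-library "iso-ascii")

;; Case 2: the terminal can show Latin-1, so approximate other
;; European character sets with Latin-1 and ASCII mnemonics.
(customize-set-variable 'latin1-display t)
@end lisp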
1612: 
1613: @node Unibyte Mode
1614: @section Unibyte Editing Mode
1615: 
1616: @cindex European character sets
1617: @cindex accented characters
1618: @cindex ISO Latin character sets
1619: @cindex Unibyte operation
1620:   The ISO 8859 Latin-@var{n} character sets define character codes in
1621: the range 0240 to 0377 octal (160 to 255 decimal) to handle the
1622: accented letters and punctuation needed by various European languages
1623: (and some non-European ones).  Note that Emacs considers bytes with
1624: codes in this range as raw bytes, not as characters, even in a unibyte
1625: buffer, i.e., if you disable multibyte characters.  However, Emacs can
1626: still handle these character codes as if they belonged to @emph{one}
1627: of the single-byte character sets at a time.  To specify @emph{which}
1628: of these codes to use, invoke @kbd{M-x set-language-environment} and
1629: specify a suitable language environment such as @samp{Latin-@var{n}}.
1630: @xref{Disabling Multibyte, , Disabling Multibyte Characters, elisp,
1631: GNU Emacs Lisp Reference Manual}.
1632: 
1633: @vindex unibyte-display-via-language-environment
1634:   Emacs can also display bytes in the range 160 to 255 as readable
1635: characters, provided the terminal or font in use supports them.  This
1636: works automatically.  On a graphical display, Emacs can also display
1637: single-byte characters through fontsets, in effect by displaying the
1638: equivalent multibyte characters according to the current language
1639: environment.  To request this, set the variable
1640: @code{unibyte-display-via-language-environment} to a non-@code{nil}
1641: value.  Note that setting this only affects how these bytes are
1642: displayed, but does not change the fundamental fact that Emacs treats
1643: them as raw bytes, not as characters.
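
  A minimal init-file sketch of this setup follows.  The choice of the
@samp{Latin-1} language environment is an assumption made for the
example; note that selecting a language environment also changes other
defaults, such as the preferred coding systems.

@lisp
;; Assumption: the bytes 160 to 255 in your unibyte buffers
;; are meant to be read as Latin-1.
(set-language-environment "Latin-1")

;; On graphical displays, show those bytes through fontsets.
(setq unibyte-display-via-language-environment t)
@end lisp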
1644: 
1645: @cindex @code{iso-ascii} library
1646:   If your terminal does not support display of the Latin-1 character
1647: set, Emacs can display these characters as @acronym{ASCII} sequences which at
1648: least give you a clear idea of what the characters are.  To do this,
1649: load the library @code{iso-ascii}.  Similar libraries for other
1650: Latin-@var{n} character sets could be implemented, but have not been
1651: so far.
1652: 
1653: @findex standard-display-8bit
1654: @cindex 8-bit display
1655:   Normally non-ISO-8859 characters (decimal codes between 128 and 159
1656: inclusive) are displayed as octal escapes.  You can change this for
1657: non-standard extended versions of ISO-8859 character sets by using the
1658: function @code{standard-display-8bit} in the @code{disp-table} library.
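
  For example, the following sketch asks Emacs to display the codes
128 through 159 literally instead of as octal escapes.  Whether the
result is readable depends on your terminal or font, which this
example simply assumes.

@lisp
(require 'disp-table)
;; Display raw bytes 128 through 159 as themselves.
(standard-display-8bit 128 159)
@end lisp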
1659: 
  There are three ways to input single-byte non-@acronym{ASCII}
characters:
1662: 
1663: @itemize @bullet
1664: @cindex 8-bit input
1665: @item
1666: You can use an input method for the selected language environment.
1667: @xref{Input Methods}.  When you use an input method in a unibyte
1668: buffer, the non-@acronym{ASCII} character you specify with it is
1669: converted to unibyte.
1670: 
1671: @item
1672: If your keyboard can generate character codes 128 (decimal) and up,
1673: representing non-@acronym{ASCII} characters, you can type those
1674: character codes directly.
1675: 
1676: On a graphical display, you should not need to do anything special to
1677: use these keys; they should simply work.  On a text terminal, you
1678: should use the command @kbd{M-x set-keyboard-coding-system} or
1679: customize the variable @code{keyboard-coding-system} to specify which
1680: coding system your keyboard uses (@pxref{Terminal Coding}).  Enabling
1681: this feature will probably require you to use @key{ESC} to type Meta
1682: characters; however, on a console terminal or a terminal emulator such
1683: as @code{xterm}, you can arrange for Meta to be converted to @key{ESC}
1684: and still be able to type 8-bit characters present directly on the
1685: keyboard or using @key{Compose} or @key{AltGr} keys.  @xref{User Input}.
1686: 
1687: @cindex @code{iso-transl} library
1688: @cindex compose character
1689: @cindex dead character
1690: @item
1691: You can use the key @kbd{C-x 8} as a compose-character prefix for
1692: entry of non-@acronym{ASCII} Latin-1 and a few other printing
1693: characters.  @kbd{C-x 8} is good for insertion (in the minibuffer as
1694: well as other buffers), for searching, and in any other context where
1695: a key sequence is allowed.
1696: 
1697: @kbd{C-x 8} works by loading the @code{iso-transl} library.  Once that
1698: library is loaded, the @key{Alt} modifier key, if the keyboard has
1699: one, serves the same purpose as @kbd{C-x 8}: use @key{Alt} together
1700: with an accent character to modify the following letter.  In addition,
1701: if the keyboard has keys for the Latin-1 dead accent characters,
1702: they too are defined to compose with the following character, once
1703: @code{iso-transl} is loaded.
1704: 
1705: Use @kbd{C-x 8 C-h} to list all the available @kbd{C-x 8} translations.
1706: @end itemize
1707: 
1708: @node Charsets
1709: @section Charsets
1710: @cindex charsets
1711: 
1712:   In Emacs, @dfn{charset} is short for ``character set''.  Emacs
1713: supports most popular charsets (such as @code{ascii},
1714: @code{iso-8859-1}, @code{cp1250}, @code{big5}, and @code{unicode}), in
1715: addition to some charsets of its own (such as @code{emacs},
1716: @code{unicode-bmp}, and @code{eight-bit}).  All supported characters
1717: belong to one or more charsets.
1718: 
1719:   Emacs normally does the right thing with respect to charsets, so
1720: that you don't have to worry about them.  However, it is sometimes
1721: helpful to know some of the underlying details about charsets.
1722: 
1723:   One example is font selection (@pxref{Fonts}).  Each language
1724: environment (@pxref{Language Environments}) defines a priority
1725: list for the various charsets.  When searching for a font, Emacs
1726: initially attempts to find one that can display the highest-priority
1727: charsets.  For instance, in the Japanese language environment, the
1728: charset @code{japanese-jisx0208} has the highest priority, so Emacs
1729: tries to use a font whose @code{registry} property is
1730: @samp{JISX0208.1983-0}.
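
  To inspect the priority list that is currently in effect, you can
evaluate forms like the ones in this sketch, for example with
@kbd{M-:}.  The values you see depend on your language environment.

@lisp
;; The complete list of charsets, highest priority first.
(charset-priority-list)

;; Only the charset that has the highest priority.
(charset-priority-list t)
@end lisp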
1731: 
1732: @findex list-charset-chars
1733: @cindex characters in a certain charset
1734: @findex describe-character-set
1735:   There are two commands that can be used to obtain information about
1736: charsets.  The command @kbd{M-x list-charset-chars} prompts for a
1737: charset name, and displays all the characters in that character set.
1738: The command @kbd{M-x describe-character-set} prompts for a charset
1739: name, and displays information about that charset, including its
1740: internal representation within Emacs.
1741: 
1742: @findex list-character-sets
1743:   @kbd{M-x list-character-sets} displays a list of all supported
1744: charsets.  The list gives the names of charsets and additional
information to identify each charset; for more details, see the
1746: @url{https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf,
1747: ISO International Register of Coded Character Sets to be Used with
1748: Escape Sequences (ISO-IR)} maintained by
1749: the @url{https://www.itscj.ipsj.or.jp/itscj_english/,
1750: Information Processing Society of Japan/Information Technology
1751: Standards Commission of Japan (IPSJ/ITSCJ)}.  In this list,
1752: charsets are divided into two categories: @dfn{normal charsets} are
1753: listed first, followed by @dfn{supplementary charsets}.  A
1754: supplementary charset is one that is used to define another charset
1755: (as a parent or a subset), or to provide backward-compatibility for
1756: older Emacs versions.
1757: 
1758:   To find out which charset a character in the buffer belongs to, put
1759: point before it and type @kbd{C-u C-x =} (@pxref{International
1760: Chars}).
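
  The same question can be asked from Lisp.  The following sketch uses
the @acronym{ASCII} letter @samp{a} as its example character; for
other characters the answer can depend on the current charset
priorities.

@lisp
;; The charset of the character a: the symbol ascii.
(char-charset ?a)

;; The code point of that character within the ascii charset: 97.
(encode-char ?a 'ascii)
@end lisp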
1761: 
1762: @node Bidirectional Editing
1763: @section Bidirectional Editing
1764: @cindex bidirectional editing
1765: @cindex right-to-left text
1766: 
1767:   Emacs supports editing text written in scripts, such as Arabic,
1768: Farsi, and Hebrew, whose natural ordering of horizontal text for
1769: display is from right to left.  However, digits and Latin text
1770: embedded in these scripts are still displayed left to right.  It is
1771: also not uncommon to have small portions of text in Arabic or Hebrew
1772: embedded in an otherwise Latin document; e.g., as comments and strings
1773: in a program source file.  For these reasons, text that uses these
1774: scripts is actually @dfn{bidirectional}: a mixture of runs of
1775: left-to-right and right-to-left characters.
1776: 
1777:   This section describes the facilities and options provided by Emacs
1778: for editing bidirectional text.
1779: 
1780: @cindex logical order
1781: @cindex visual order
1782:   Emacs stores right-to-left and bidirectional text in the so-called
1783: @dfn{logical} (or @dfn{reading}) order: the buffer or string position
1784: of the first character you read precedes that of the next character.
1785: Reordering of bidirectional text into the @dfn{visual} order happens
1786: at display time.  As a result, character positions no longer increase
1787: monotonically with their positions on display.  Emacs implements the
1788: Unicode Bidirectional Algorithm (UBA) described in the
1789: @uref{http://unicode.org/reports/tr9/, Unicode Standard Annex #9}, for
1790: reordering of bidirectional text for display.
1791: It deviates from the UBA only in how continuation lines are displayed
1792: when text direction is opposite to the base paragraph direction,
1793: e.g., when a long line of English text appears in a right-to-left
1794: paragraph.
1795: 
1796: @vindex bidi-display-reordering
1797:   The buffer-local variable @code{bidi-display-reordering} controls
1798: whether text in the buffer is reordered for display.  If its value is
1799: non-@code{nil}, Emacs reorders characters that have right-to-left
1800: directionality when they are displayed.  The default value is
1801: @code{t}.
1802: 
1803: @cindex base direction of paragraphs
1804: @cindex paragraph, base direction
1805: @vindex bidi-paragraph-start-re
1806: @vindex bidi-paragraph-separate-re
1807:   Each paragraph of bidirectional text can have its own @dfn{base
1808: direction}, either right-to-left or left-to-right.  Text in
1809: left-to-right paragraphs begins on the screen at the left margin of
1810: the window and is truncated or continued when it reaches the right
1811: margin.  By contrast, text in right-to-left paragraphs is displayed
1812: starting at the right margin and is continued or truncated at the left
1813: margin.  By default, paragraph boundaries are empty lines, i.e., lines
1814: consisting entirely of whitespace characters.  To change that, you can
1815: customize the two variables @code{bidi-paragraph-start-re} and
1816: @code{bidi-paragraph-separate-re}, whose values should be regular
1817: expressions (strings); e.g., to have a single newline start a new
1818: paragraph, set both of these variables to @code{"^"}.  These two
1819: variables are buffer-local (@pxref{Locals}).
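
  Because the variables are buffer-local, one way to set them is from
a mode hook.  The sketch below makes every line start a new paragraph,
as described above; the use of @code{text-mode-hook} is an assumption
made for the example.

@lisp
(add-hook 'text-mode-hook
          (lambda ()
            ;; Treat each newline as a paragraph boundary.
            (setq-local bidi-paragraph-start-re "^")
            (setq-local bidi-paragraph-separate-re "^")))
@end lisp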
1820: 
1821: @vindex bidi-paragraph-direction
1822:   Emacs determines the base direction of each paragraph dynamically,
1823: based on the text at the beginning of the paragraph.  However,
1824: sometimes a buffer may need to force a certain base direction for its
1825: paragraphs.  The variable @code{bidi-paragraph-direction}, if
1826: non-@code{nil}, disables the dynamic determination of the base
1827: direction, and instead forces all paragraphs in the buffer to have the
1828: direction specified by its buffer-local value.  The value can be either
1829: @code{right-to-left} or @code{left-to-right}.  Any other value is
1830: interpreted as @code{nil}.
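
  For instance, buffers that hold program source might be forced to
use left-to-right paragraphs.  This is a sketch, and the choice of
@code{prog-mode-hook} is an assumption made for the example.

@lisp
(add-hook 'prog-mode-hook
          (lambda ()
            ;; Force left-to-right paragraphs in this buffer.
            (setq bidi-paragraph-direction 'left-to-right)))
@end lisp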
1831: 
1832: @cindex LRM
1833: @cindex RLM
1834:   Alternatively, you can control the base direction of a paragraph by
1835: inserting special formatting characters in front of the paragraph.
1836: The special character @code{RIGHT-TO-LEFT MARK}, or @sc{rlm}, forces
1837: the right-to-left direction on the following paragraph, while
@code{LEFT-TO-RIGHT MARK}, or @sc{lrm}, forces the left-to-right
1839: direction.  (You can use @kbd{C-x 8 @key{RET}} to insert these characters.)
1840: In a GUI session, the @sc{lrm} and @sc{rlm} characters display as very
1841: thin blank characters; on text terminals they display as blanks.
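
  If you insert these characters often, a small command may be more
convenient.  The following is only a sketch; the command name
@code{my-insert-rlm} is hypothetical and not part of Emacs.

@lisp
;; U+200F is RIGHT-TO-LEFT MARK; U+200E is LEFT-TO-RIGHT MARK.
(defun my-insert-rlm ()
  "Insert a RIGHT-TO-LEFT MARK at point (hypothetical helper)."
  (interactive)
  (insert-char #x200F))
@end lisp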
1842: 
1843:   Because characters are reordered for display, Emacs commands that
1844: operate in the logical order or on stretches of buffer positions may
1845: produce unusual effects.  For example, the commands @kbd{C-f} and
1846: @kbd{C-b} move point in the logical order, so the cursor will
1847: sometimes jump when point traverses reordered bidirectional text.
1848: Similarly, a highlighted region covering a contiguous range of
1849: character positions may look discontinuous if the region spans
1850: reordered text.  This is normal and similar to the behavior of other
1851: programs that support bidirectional text.
1852: 
1853: @kindex RIGHT@r{, and bidirectional text}
1854: @kindex LEFT@r{, and bidirectional text}
1855: @findex right-char@r{, and bidirectional text}
1856: @findex left-char@r{, and bidirectional text}
1857:   Cursor motion commands bound to arrow keys, such as @key{LEFT} and
1858: @kbd{C-@key{RIGHT}}, are sensitive to the base direction of the
1859: current paragraph.  In a left-to-right paragraph, commands bound to
1860: @key{RIGHT} with or without modifiers move @emph{forward} through
1861: buffer text, but in a right-to-left paragraph they move
1862: @emph{backward} instead.  This reflects the fact that in a
1863: right-to-left paragraph buffer positions predominantly increase when
1864: moving to the left on display.
1865: 
1866:   When you move out of a paragraph, the meaning of the arrow keys
1867: might change if the base direction of the preceding or the following
1868: paragraph is different from the paragraph out of which you moved.
1869: When that happens, you need to adjust the arrow key you press to the
1870: new base direction.
1871: 
1872: @vindex visual-order-cursor-movement
1873: @cindex cursor, visual-order motion
1874:   By default, @key{LEFT} and @key{RIGHT} move in the logical order,
1875: but if @code{visual-order-cursor-movement} is non-@code{nil}, these
1876: commands move to the character that is, correspondingly, to the left
1877: or right of the current screen position, moving to the next or
previous screen line as appropriate.  Note that this might move point
many buffer positions away, depending on the surrounding
1880: bidirectional context.
1881: