

  ______________________________________________________________________

  2   Lexical conventions                                [lex]

  ______________________________________________________________________

1 A C++ program need not all be translated at the same time.  The text
  of the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and source
  files included (_cpp.include_) via the preprocessing directive
  #include, less any source lines skipped by any of the conditional
  inclusion (_cpp.cond_) preprocessing directives, is called a
  translation unit.  Previously translated translation units can be
  preserved individually or in libraries.  The separate translation
  units of a program communicate (_basic.link_) by (for example) calls
  to functions whose identifiers have external linkage, manipulation of
  objects whose identifiers have external linkage, or manipulation of
  data files.  Translation units can be separately translated and then
  later linked to produce an executable program (_basic.link_).

  2.1  Phases of translation                                [lex.phases]

1 The precedence among the syntax rules of translation is  specified  by
  the following phases.1)

    1 Physical source file characters are mapped to the source character
      set  (introducing  new-line characters for end-of-line indicators)
      if necessary.  Trigraph sequences (_lex.trigraph_) are replaced by
      corresponding single-character internal representations.

    2 Each instance of a new-line character and an immediately preceding
      backslash character is deleted, splicing physical source lines  to
      form  logical source lines.  A source file that is not empty shall
      end in a new-line character, which shall not be immediately
      preceded by a backslash character.

    3 The   source   file   is   decomposed  into  preprocessing  tokens
      (_lex.pptoken_) and sequences of white-space characters (including
      comments).  A source file shall not end in a partial preprocessing
      token or partial comment2).  Each comment is replaced by one space
      character.  New-line characters are retained.  Whether each
      nonempty sequence of white-space characters other than new-line is
      retained or replaced by one space character is implementation-
      defined.  The process of dividing a source file's characters into
      preprocessing tokens is context-dependent.  [Example: see the
      handling of < within a #include preprocessing directive.  ]
  _________________________
  1) Implementations must behave as if these separate phases occur,
  although in practice different phases might be folded together.
  2) A partial preprocessing token would arise from a source file ending
  in one or more characters of a multi-character token followed by a
  "line-splicing" backslash.  A partial comment would arise from a
  source file ending with an unclosed /* comment, or a // comment line
  that ends with a "line-splicing" backslash.

    4 Preprocessing directives are executed and  macro  invocations  are
      expanded.   A  #include  preprocessing  directive causes the named
      header or source file to be processed from phase 1  through  phase
      4, recursively.

    5 Each source character set member and escape sequence in character
      literals and string literals is converted to a member of the
      execution character set.

    6 Adjacent  character  string  literal  tokens  are concatenated and
      adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer
      significant.  Each preprocessing token is converted into a token.
      (See _lex.token_).  The resulting tokens are syntactically and
      semantically analyzed and translated.  The result of this process
      starting from a single source file is called a translation unit.

    8 The translation units that will form a program are combined.   All
      external  object  and  function  references are resolved.  Library
      components are linked to satisfy external references to  functions
      and  objects  not  defined  in  the current translation.  All such
      translator output is collected into a program image which contains
      information needed for execution in its execution environment.
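
  As an illustration only, phases 2 and 6 can be observed together in
  the following sketch.  The names GREETING, greeting, and check_phases
  are hypothetical and are not part of this standard.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: the backslash and the new-line that follows it are deleted,
// so this definition spans two physical lines but one logical line.
#define GREETING "hello, " \
                 "world"

// Phase 6: the two adjacent string literals that remain after macro
// expansion are concatenated into one literal.
inline const char* greeting() { return GREETING; }

// Self-check: twelve characters plus the terminating null.
inline void check_phases() {
    assert(std::strcmp(greeting(), "hello, world") == 0);
    assert(sizeof(GREETING) == 13);
}
```

  Calling check_phases() from any function exercises both assertions.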

  2.2  Trigraph sequences                                 [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the
  following sequences of  three  characters  ("trigraph  sequences")  is
  replaced by the single character indicated in Table 1.

                       Table 1--trigraph sequences

  +-----------------------+------------------------+------------------------+
  |trigraph   replacement | trigraph   replacement | trigraph   replacement |
  +-----------------------+------------------------+------------------------+
  |  ??=           #      |   ??(           [      |   ??<           {      |
  +-----------------------+------------------------+------------------------+
  |  ??/           \      |   ??)           ]      |   ??>           }      |
  +-----------------------+------------------------+------------------------+
  |  ??'           ^      |   ??!           |      |   ??-           ~      |
  +-----------------------+------------------------+------------------------+

2 [Example:
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]
   --end example]
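
  The replacement in Table 1 can be modelled outside the translator.
  The sketch below is a hypothetical helper, not something an
  implementation is required to provide; it assumes a simple
  left-to-right scan.  The question marks in its own string literals
  are written with the \? escape so that the sketch contains no
  trigraph itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the replacement selected by the third character of a
// trigraph, or 0 if that character does not complete one (Table 1).
inline char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Hypothetical helper: copies the input, replacing each trigraph by
// its single-character equivalent.
inline std::string replace_trigraphs(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            char r = trigraph_replacement(in[i + 2]);
            if (r != 0) { out += r; i += 3; continue; }
        }
        out += in[i];
        ++i;
    }
    return out;
}

inline void check_trigraphs() {
    assert(replace_trigraphs("a\?\?(0\?\?)") == "a[0]");
    assert(replace_trigraphs("no change?") == "no change?");
}
```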

  2.3  Preprocessing tokens                                [lex.pptoken]
          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-literal
                  string-literal
                  preprocessing-op-or-punc
                  each non-white-space character that cannot be one of the above

1 Each  preprocessing  token  that is converted to a token (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a literal, an
  operator, or a punctuator.

2 A  preprocessing  token is the minimal lexical element of the language
  in translation phases 3 through 6.  The  categories  of  preprocessing
  token are: header names, identifiers, preprocessing numbers, character
  literals, string literals, preprocessing-op-or-punc, and  single  non-
  white-space characters that do not lexically match the other
  preprocessing token categories.  If a ' or a " character matches the
  last category, the behavior is undefined.  Preprocessing tokens can be
  separated by white space; this consists of comments (_lex.comment_), or
  white-space characters (space, horizontal tab, new-line, vertical tab,
  and form-feed), or both.  As described in  Clause  _cpp_,  in  certain
  circumstances  during translation phase 4, white space (or the absence
  thereof) serves as more than preprocessing  token  separation.   White
  space can appear within a preprocessing token only as part of a header
  name or between the quotation characters in  a  character  literal  or
  string literal.

3 If  the  input  stream  has been lexically analyzed into preprocessing
  tokens up to a given character, the next preprocessing  token  is  the
  longest  sequence  of characters that could constitute a preprocessing
  token, even if that would cause further lexical analysis to fail.

4 [Example: The program fragment 1Ex is parsed as a preprocessing number
  token  (one  that  is  not a valid floating or integer literal token),
  even though a parse as the pair of preprocessing tokens 1 and Ex might
  produce a valid expression (for example, if Ex were a macro defined as
  +1).  Similarly, the program fragment 1E1 is parsed as a preprocessing
  number  (one that is a valid floating literal token), whether or not E
  is a macro name.  ]

5 [Example: The program fragment x+++++y is parsed  as  x  ++  ++  +  y,
  which,  if  x  and  y  are of built-in types, violates a constraint on
  increment operators, even though the parse x ++ + ++ y might  yield  a
  correct expression.  ]
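
  A well-formed counterpart of the preceding example, offered only as a
  sketch: the fragment a+++b is divided into a ++ + b, because ++ is
  the longest preprocessing token that can be formed after a.  The
  function names are hypothetical.

```cpp
#include <cassert>

// a+++b is tokenized as a ++ + b, that is (a++) + b.
inline int munch(int& a, int b) { return a+++b; }

inline void check_munch() {
    int a = 1;
    int r = munch(a, 2);
    assert(r == 3);  // old value of a (1) plus b (2)
    assert(a == 2);  // a was post-incremented, b was not touched
}
```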

  2.4  Alternative tokens                                  [lex.digraph]

1 Alternative  token representations are provided for some operators and
  punctuators3).

2 In  all  respects  of the language, each alternative token behaves the
  same,  respectively,  as its primary token, except for its spelling4).
  The set of alternative tokens is defined in Table 2.

                       Table 2--alternative tokens

  +----------------------+-----------------------+-----------------------+
  |alternative   primary | alternative   primary | alternative   primary |
  +----------------------+-----------------------+-----------------------+
  |    <%           {    |     and         &&    |   and_eq        &=    |
  +----------------------+-----------------------+-----------------------+
  |    %>           }    |    bitor         |    |    or_eq        |=    |
  +----------------------+-----------------------+-----------------------+
  |    <:           [    |     or          ||    |   xor_eq        ^=    |
  +----------------------+-----------------------+-----------------------+
  |    :>           ]    |     xor          ^    |     not          !    |
  +----------------------+-----------------------+-----------------------+
  |    %:           #    |    compl         ~    |   not_eq        !=    |
  +----------------------+-----------------------+-----------------------+
  |   %:%:         ##    |   bitand         &    |                       |
  +----------------------+-----------------------+-----------------------+
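
  The following sketch uses several entries of Table 2 in place of
  their primary tokens.  The function names are hypothetical, and some
  implementations may need a conformance option before they accept the
  keyword-like alternatives.

```cpp
#include <cassert>

// <% %> stand for braces, <: :> for brackets, not/or for ! and ||.
inline int first_or(const int* v, int n, int fallback) <%
    if (not v or n <= 0) return fallback;
    return v<:0:>;
%>

// xor and not_eq stand for ^ and !=.
inline bool differs(unsigned a, unsigned b) <%
    return (a xor b) not_eq 0u;
%>

inline void check_alternative_tokens() <%
    int data<:2:> = <% 7, 8 %>;
    assert(first_or(data, 2, -1) == 7);
    assert(first_or(0, 0, -1) == -1);
    assert(differs(1u, 2u) and not differs(3u, 3u));
%>
```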

  2.5  Tokens                                                [lex.token]
          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There are five kinds of  tokens:  identifiers,  keywords,  literals,5)
  operators,  and  other  separators.   Blanks,  horizontal and vertical
  tabs, newlines, formfeeds, and comments (collectively, "white space"),
  as  described  below,  are  ignored  except  as they serve to separate
  tokens.  Some white space is required to separate  otherwise  adjacent
  identifiers, keywords, and literals.
  _________________________
  3) These include "digraphs" and additional reserved words.  The term
  "digraph" (token consisting of two characters) is not perfectly
  descriptive, since one of the alternative preprocessing-tokens is %:%:
  and of course several primary tokens contain two characters.
  Nonetheless, those alternative tokens that aren't lexical keywords are
  colloquially known as "digraphs".
  4)   Thus   [   and   <:   behave   differently   when    "stringized"
  (_cpp.stringize_), but can otherwise be freely interchanged.
  5) Literals include strings and character and numeric literals.

  2.6  Comments                                            [lex.comment]

1 The characters /* start a comment, which terminates with the
  characters */.  These comments do not nest.  The characters // start a
  comment, which terminates with the next new-line character.  If there
  is a form-feed or a vertical-tab character in such a comment, only
  white-space characters shall appear between it and the new-line that
  terminates the comment; no diagnostic is required.  The comment
  characters //, /*, and */ have no special meaning within a // comment
  and are treated just like other characters.  Similarly, the comment
  characters // and /* have no special meaning within a /* comment.
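
  A small sketch of how the two comment forms interact; the function
  names are hypothetical.

```cpp
#include <cassert>

inline int comment_demo() {
    int n = 0;
    /* // does not start a line comment here, so this block comment
       still ends at the first closing delimiter. */ n += 1;
    // Neither /* nor */ has any effect here: n += 100;
    n += 2;
    return n;
}

inline void check_comments() {
    // Only the two statements outside the comments are executed.
    assert(comment_demo() == 3);
}
```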

  2.7  Identifiers                                            [lex.name]
          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit
          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z
          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An  identifier  is an arbitrarily long sequence of letters and digits.
  The first character is a letter; the underscore _ counts as a  letter.
  Upper- and lower-case letters are different.  All characters are
  significant.
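
  The identifier grammar can be restated as a predicate.  The sketch
  below is illustrative only; is_nondigit, is_digit, and is_identifier
  are hypothetical names, and the predicate deliberately does not
  reject keywords (_lex.key_).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// nondigit: the underscore and the 52 letters listed in the grammar.
inline bool is_nondigit(char c) {
    static const std::string set =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return set.find(c) != std::string::npos;
}

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// identifier: a nondigit followed by any number of nondigits or digits.
inline bool is_identifier(const std::string& s) {
    if (s.empty() || !is_nondigit(s[0])) return false;
    for (std::size_t i = 1; i < s.size(); ++i)
        if (!is_nondigit(s[i]) && !is_digit(s[i])) return false;
    return true;
}

inline void check_identifiers() {
    assert(is_identifier("_x1"));
    assert(!is_identifier("1x"));   // first character must be a letter
    assert(!is_identifier("a-b"));  // - is neither letter nor digit
}
```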

  2.8  Keywords                                                [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords, and
  shall not be used otherwise in phases 7 and 8:

                            Table 3--keywords

  +--------------------------------------------------------------------------+
  |asm          do             inline             short         typeid       |
  |auto         double         int                signed        typename     |
  |bool         dynamic_cast   long               sizeof        union        |
  |break        else           mutable            static        unsigned     |
  |case         enum           namespace          static_cast   using        |
  |catch        explicit       new                struct        virtual      |
  |char         extern         operator           switch        void         |
  |class        false          private            template      volatile     |
  |const        float          protected          this          wchar_t      |
  |const_cast   for            public             throw         while        |
  |continue     friend         register           true                       |
  |default      goto           reinterpret_cast   try                        |
  |delete       if             return             typedef                    |
  +--------------------------------------------------------------------------+

2 Furthermore, the alternative representations shown in Table 4 for
  certain operators and punctuators (_lex.digraph_) are reserved and
  shall not be used otherwise:

                   Table 4--alternative representations

             +-----------------------------------------------+
             |bitand   and     bitor    or    xor      compl |
             |and_eq   or_eq   xor_eq   not   not_eq         |
             +-----------------------------------------------+

3 In addition, identifiers containing a double underscore (__) or
  beginning with an underscore and an upper-case letter are reserved for
  use by C++ implementations and standard libraries and shall be avoided
  by users; no diagnostic is required.

4 The lexical representation of C++ programs includes a number of
  preprocessing tokens which are used in the syntax of the preprocessor
  or are converted into tokens for operators and punctuators:
          preprocessing-op-or-punc: one of
          {       }       [       ]       #       ##      =       (       )
          <:      :>      <%      %>      %:      %:%:    ;       :       ...
          new     delete  new[]   delete[]        ?       ::
          +       -       *       /       %       ^       &       |       ~
          !       =       <       >       +=      -=      *=      /=      %=
          ^=      &=      |=      <<      >>      >>=     <<=     ==      !=
          <=      >=      &&      ||      ++      --      ,       ->*     ->
          and     bitand  bitor   compl   new<%%> delete<%%>
          not     or      xor     and_eq  not_eq  or_eq   xor_eq

  After preprocessing, each preprocessing-op-or-punc is converted  to  a
  single token in translation phase 7 (_lex.phases_).

5 [Note: Certain implementation-defined properties, such as the type of
  a sizeof (_expr.sizeof_) expression, the ranges of fundamental types
  (_basic.fundamental_), and the types of the most basic library
  functions are defined in the standard headers <limits>, <cstddef>, and
  <new> (_lib.support_).  --end note]

  2.9  Literals                                            [lex.literal]

1 There are several kinds of literals.6)
          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal
  _________________________
  6)  The  term  "literal"  generally  designates, in this International
  Standard, those tokens that are called "constants" in ISO C.

  2.9.1  Integer literals                                     [lex.icon]
          integer-literal:
                  decimal-literal integer-suffixopt
                  octal-literal integer-suffixopt
                  hexadecimal-literal integer-suffixopt
          decimal-literal:
                  nonzero-digit
                  decimal-literal digit
          octal-literal:
                  0
                  octal-literal octal-digit
          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit
          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9
          octal-digit: one of
                  0  1  2  3  4  5  6  7
          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F
          integer-suffix:
                  unsigned-suffix long-suffixopt
                  long-suffix unsigned-suffixopt
          unsigned-suffix: one of
                  u  U
          long-suffix: one of
                  l  L

1 An integer literal consisting of a sequence of digits is taken  to  be
  decimal  (base  ten) unless it begins with 0 (digit zero).  A sequence
  of octal digits7) starting with 0 is taken  to  be  an  octal  integer
  (base  eight).   A sequence of digits preceded by 0x or 0X is taken to
  be a hexadecimal  integer  (base  sixteen).   The  hexadecimal  digits
  include a or A through f or F with decimal values ten through fifteen.
  [Example: the number twelve can be written 12, 014, or 0XC.  ]

2 The type of an integer literal depends on its form, value, and suffix.
  If it is decimal and has no suffix, it has the first of these types in
  which its value can be represented: int, long int, unsigned long  int.
  If  it  is octal or hexadecimal and has no suffix, it has the first of
  these types in which its value can be represented: int, unsigned  int,
  long int, unsigned long int.  If it is suffixed by u or U, its type is
  the first of these types  in  which  its  value  can  be  represented:
  unsigned  int,  unsigned  long  int.  If it is suffixed by l or L, its
  type is the first of these types in which its value can be
  represented: long int, unsigned long int.  If it is suffixed by ul,
  lu, uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.

  _________________________
  7) The digits 8 and 9 are not octal digits.

3 A program is ill-formed if it contains an integer literal that  cannot
  be represented by any of the allowed types.
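
  The effect of the suffixes can be observed through overload
  resolution.  This is a sketch under the assumption that the value 12
  fits in every type involved; IntKind and int_kind are hypothetical
  names.

```cpp
#include <cassert>

enum IntKind { Int, UnsignedInt, LongInt, UnsignedLongInt };

// One overload per type that an integer literal can have here.
inline IntKind int_kind(int)           { return Int; }
inline IntKind int_kind(unsigned int)  { return UnsignedInt; }
inline IntKind int_kind(long)          { return LongInt; }
inline IntKind int_kind(unsigned long) { return UnsignedLongInt; }

inline void check_integer_literals() {
    // Decimal, octal, and hexadecimal spellings of twelve.
    assert(12 == 014 && 014 == 0XC);
    assert(int_kind(12) == Int);
    assert(int_kind(12u) == UnsignedInt);
    assert(int_kind(12L) == LongInt);
    assert(int_kind(12UL) == UnsignedLongInt);
    assert(int_kind(12lu) == UnsignedLongInt);
}
```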

  2.9.2  Character literals                                   [lex.ccon]
          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'
          c-char-sequence:
                  c-char
                  c-char-sequence c-char
          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or new-line character
                  escape-sequence
          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence
          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v
          octal-escape-sequence:
                  \ octal-digit
                  octal-escape-sequence octal-digit
          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit

1 A  character  literal  is  one  or  more characters enclosed in single
  quotes, as in 'x', optionally preceded by the letter L,  as  in  L'x'.
  Single  character  literals  that  do not begin with L have type char,
  with value equal to the  numerical  value  of  the  character  in  the
  machine's  character  set.   Multicharacter literals that do not begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is a
  wide-character literal.  Wide-character literals have type  wchar_t.8)
  Wide-character literals have implementation-defined values, regardless
  of the number of characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote ",
  ?, and the backslash \, can be represented according to Table 5.

  _________________________
  8) They are intended for character sets where a character does not fit
  into a single byte.

                        Table 5--escape sequences

                   +----------------------------------+
                   |new-line          NL (LF)   \n    |
                   |horizontal tab    HT        \t    |
                   |vertical tab      VT        \v    |
                   |backspace         BS        \b    |
                   |carriage return   CR        \r    |
                   |form feed         FF        \f    |
                   |alert             BEL       \a    |
                   |backslash         \         \\    |
                   |question mark     ?         \?    |
                   |single quote      '         \'    |
                   |double quote      "         \"    |
                   |octal number      ooo       \ooo  |
                   |hex number        hhh       \xhhh |
                   +----------------------------------+
  If  the character following a backslash is not one of those specified,
  the behavior is undefined.  An  escape  sequence  specifies  a  single
  character.

4 The escape \ooo consists of the backslash followed by one or more
  octal digits that are taken to specify the value of the desired
  character.  The escape \xhhh consists of the backslash followed by x
  followed by one or more hexadecimal digits that are taken to specify
  the value of the desired character.  There is no limit to the number
  of digits in either sequence.  A sequence of octal or hexadecimal
  digits is terminated by the first character that is not an octal digit
  or a hexadecimal digit, respectively.  The value of a character
  literal is implementation-defined if it exceeds that of the largest
  char (for ordinary literals) or wchar_t (for wide literals).
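
  The types and escape sequences described in this subclause can be
  checked as follows.  The sketch is illustrative; CharKind and
  char_kind are hypothetical names, and no particular execution
  character set is assumed.

```cpp
#include <cassert>

enum CharKind { PlainChar, WideChar, PlainInt };

// Overload resolution reports the type of each literal.
inline CharKind char_kind(char)    { return PlainChar; }
inline CharKind char_kind(wchar_t) { return WideChar; }
inline CharKind char_kind(int)     { return PlainInt; }

inline void check_character_literals() {
    assert(char_kind('x') == PlainChar);   // single character: char
    assert(char_kind(L'x') == WideChar);   // L prefix: wchar_t
    assert('\101' == '\x41');  // octal 101 and hexadecimal 41 are equal
    assert('\?' == '?');       // \? denotes the question mark
    assert('\0' == 0);         // octal escape with a single digit
}
```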

  2.9.3  Floating literals                                    [lex.fcon]
          floating-literal:
                  fractional-constant exponent-partopt floating-suffixopt
                  digit-sequence exponent-part floating-suffixopt
          fractional-constant:
                  digit-sequenceopt . digit-sequence
                  digit-sequence .
          exponent-part:
                  e signopt digit-sequence
                  E signopt digit-sequence
          sign: one of
                  +  -
          digit-sequence:
                  digit
                  digit-sequence digit
          floating-suffix: one of
                  f  l  F  L

1 A floating literal consists of an integer part,  a  decimal  point,  a
  fraction  part,  an e or E, an optionally signed integer exponent, and
  an optional type suffix.  The integer and fraction parts both  consist
  of  a  sequence of decimal (base ten) digits.  Either the integer part
  or the fraction part (not both) can be missing; either the decimal
  point or the letter e (or E) and the exponent (not both) can be
  missing.  The type of a floating literal is double unless explicitly
  specified by a suffix.  The suffixes f and F specify float, the
  suffixes l and L specify long double.
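
  The permitted forms and the effect of the suffixes can be sketched as
  follows.  FloatKind and float_kind are hypothetical names; the values
  compared are chosen to be exactly representable in binary floating
  point.

```cpp
#include <cassert>

enum FloatKind { Float, Double, LongDouble };

inline FloatKind float_kind(float)       { return Float; }
inline FloatKind float_kind(double)      { return Double; }
inline FloatKind float_kind(long double) { return LongDouble; }

inline void check_floating_literals() {
    assert(float_kind(1.) == Double);     // fraction part missing
    assert(float_kind(.5) == Double);     // integer part missing
    assert(float_kind(1e3) == Double);    // decimal point missing
    assert(float_kind(1.5f) == Float);
    assert(float_kind(1.5L) == LongDouble);
    assert(1e3 == 1000.0 && 1.5e1 == 15.0);
}
```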

  2.9.4  String literals                                    [lex.string]
          string-literal:
                  "s-char-sequenceopt"
                  L"s-char-sequenceopt"
          s-char-sequence:
                  s-char
                  s-char-sequence s-char
          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or new-line character
                  escape-sequence

1 A  string  literal  is  a  sequence  of  characters  (as  defined   in
  _lex.ccon_) surrounded by double quotes, optionally beginning with the
  letter L, as in "..." or L"...".  A string literal that does not begin
  with  L  has  type  "array  of  n  char"  and  static storage duration
  (_basic.stc_), where n is the size of the string as defined below, and
  is initialized with the given characters.  Whether all string literals
  are distinct (that is, are stored in nonoverlapping objects) is
  implementation-defined.  The effect of attempting to modify a string
  literal is undefined.

2 A string literal that begins with L, such as L"asdf", is a wide string
  literal.  A wide string literal has type "array of n wchar_t," where n
  is the size of the string as defined below.

3 Adjacent string literals are concatenated.  Adjacent wide string
  literals are concatenated.  If a string literal token is adjacent to a
  wide string literal token, the behavior is undefined.   Characters  in
  concatenated strings are kept distinct.  [Example:
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and not
  the single hexadecimal character '\xAB').  ]

4 After any necessary concatenation '\0' is appended so that programs
  that scan a string can find its end.  The size of a string is the
  number of its characters including this terminator.  Within a string,
  the double quote character " shall be preceded by a \.

5 Escape sequences in string literals have the same meaning as in
  character literals (_lex.ccon_).
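
  The size rule and the concatenation rule can be checked with sizeof.
  The following is a sketch only; check_string_literals is a
  hypothetical name.

```cpp
#include <cassert>
#include <cstring>

inline void check_string_literals() {
    // The size of a string counts the appended terminator.
    assert(sizeof("asdf") == 5);
    assert(sizeof("") == 1);
    assert(sizeof(L"asdf") == 5 * sizeof(wchar_t));

    // Adjacent literals are concatenated.
    const char joined[] = "ab" "cd";
    assert(std::strcmp(joined, "abcd") == 0);

    // Characters stay distinct: two characters plus the terminator,
    // not the single character with hexadecimal value AB.
    const char kept[] = "\xA" "B";
    assert(sizeof(kept) == 3);
    assert(kept[0] == '\xA' && kept[1] == 'B' && kept[2] == '\0');
}
```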

  2.9.5  Boolean literals                                     [lex.bool]
          boolean-literal:
                  false
                  true

1 The Boolean literals are the keywords false and true.   Such  literals
  have type bool and the given values.  They are not lvalues.