Friday, May 27, 2011

Python.org Lexical analysis

2. Lexical analysis

docs.python.org | May 9th 2011

2.4.1. String literals

String literals are described by the following lexical definitions:

stringliteral  ::= [stringprefix](shortstring | longstring) stringprefix  ::= "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR" | "b" | "B" | "br" | "Br" | "bR" | "BR" shortstring  ::= "'" shortstringitem* "'" | '"' shortstringitem* '"' longstring  ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""' shortstringitem ::= shortstringchar | escapeseq longstringitem  ::= longstringchar | escapeseq shortstringchar ::=  longstringchar  ::=  escapeseq  ::= "\"

One syntactic restriction not indicated by these productions is that whitespace is not allowed between the stringprefix and the rest of the string literal. The source character set is defined by the encoding declaration; it is ASCII if no encoding declaration is given in the source file; see section Encoding declarations.

In plain English: String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences. A prefix of 'u' or 'U' makes the string a Unicode string. Unicode strings use the Unicode character set as defined by the Unicode Consortium and ISO 10646. Some additional escape sequences, described below, are available in Unicode strings. A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A “quote” is the character used to open the string, i.e. either ' or ".)

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Notes:

  1. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence.
  2. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Individual code units which form parts of a surrogate pair can be encoded using this escape sequence.
  3. As in Standard C, up to three octal digits are accepted.
  4. Unlike in Standard C, exactly two hex digits are required.
  5. In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences marked as “(Unicode only)” in the table above fall into the category of unrecognized escapes for non-Unicode string literals.

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase 'n'. String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string. For example, the string literal ur"\u0062\n" consists of three Unicode characters: ‘LATIN SMALL LETTER B’, ‘REVERSE SOLIDUS’, and ‘LATIN SMALL LETTER N’. Backslashes can be escaped with a preceding backslash; however, both remain in the string. As a result, \uXXXX escape sequences are only recognized when there are an odd number of backslashes.

2.4.2. String literal concatenation

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]" # letter or underscore "[A-Za-z0-9_]*" # letter, digit or underscore )

Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).

2.4.3. Numeric literals

There are four types of numeric literals: plain integers, long integers, floating point numbers, and imaginary numbers. There are no complex literals (complex numbers can be formed by adding a real number and an imaginary number).

Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator ‘-‘ and the literal 1.

2.4.4. Integer and long integer literals

Integer and long integer literals are described by the following lexical definitions:

longinteger  ::= integer ("l" | "L") integer  ::= decimalinteger | octinteger | hexinteger | bininteger decimalinteger ::= nonzerodigit digit* | "0" octinteger  ::= "0" ("o" | "O") octdigit+ | "0" octdigit+ hexinteger  ::= "0" ("x" | "X") hexdigit+ bininteger  ::= "0" ("b" | "B") bindigit+ nonzerodigit  ::= "1"..."9" octdigit  ::= "0"..."7" bindigit  ::= "0" | "1" hexdigit  ::= digit | "a"..."f" | "A"..."F"

Although both lower case 'l' and upper case 'L' are allowed as suffix for long integers, it is strongly recommended to always use 'L', since the letter 'l' looks too much like the digit '1'.

Plain integer literals that are above the largest representable plain integer (e.g., 2147483647 when using 32-bit arithmetic) are accepted as if they were long integers instead. [1] There is no limit for long integer literals apart from what can be stored in available memory.

Some examples of plain integer literals (first row) and long integer literals (second and third rows):

7 2147483647 0177 3L 79228162514264337593543950336L 0377L 0x100000000L 79228162514264337593543950336 0xdeadbeef

2.4.5. Floating point literals

Floating point literals are described by the following lexical definitions:

floatnumber  ::= pointfloat | exponentfloat pointfloat  ::= [intpart] fraction | intpart "." exponentfloat ::= (intpart | pointfloat) exponent intpart  ::= digit+ fraction  ::= "." digit+ exponent  ::= ("e" | "E") ["+" | "-"] digit+

Note that the integer and exponent parts of floating point numbers can look like octal integers, but are interpreted using radix 10. For example, 077e010 is legal, and denotes the same number as 77e10. The allowed range of floating point literals is implementation-dependent. Some examples of floating point literals:

3.14 10. .001 1e100 3.14e-10 0e0

Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.

2.4.6. Imaginary literals

Imaginary literals are described by the following lexical definitions:

imagnumber ::= (floatnumber | intpart) ("j" | "J")

An imaginary literal yields a complex number with a real part of 0.0. Complex numbers are represented as a pair of floating point numbers and have the same restrictions on their range. To create a complex number with a nonzero real part, add a floating point number to it, e.g., (3+4j). Some examples of imaginary literals:

Original Page: http://docs.python.org/reference/lexical_analysis.html#identifiers

Shared from Read It Later

Elyssa Durant, Ed.M.

United States of America

Posted via email from Whistleblower

No comments:

Post a Comment