Python Strings


Anyone who has used Python (2 or 3) has probably seen the following two error messages:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

This is the Python world's own "锟斤拷", the notorious string of garbled characters!
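For readers who have not met it: "锟斤拷" typically appears when the Unicode replacement character U+FFFD, encoded as UTF-8, is later decoded as GBK. A minimal sketch of that mechanism follows; the variable names are illustrative only.

```python
# U+FFFD is the Unicode replacement character.
replacement = "\ufffd"

# Two replacement characters encoded as UTF-8 give six bytes.
raw = (replacement * 2).encode("utf-8")
print(raw)

# Reading those six bytes as GBK pairs them up into three Chinese characters.
garbled = raw.decode("gbk")
print(garbled)
```

The point is only that the bytes were written with one codec and read with another, which is the same mismatch behind the two Python errors above.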

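This post and the next few will focus on strings (str), bytes (bytes), and the conversion between the two (encode/decode) in Python. They may not let you fix every garbled-text problem overnight, but hopefully they will help you quickly locate where the problem lies.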
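As a rough preview of encode/decode, the sketch below round-trips a string through UTF-8 and then triggers both error types by using the ascii codec on non-ASCII data. The helper `describe` is a hypothetical name introduced here for illustration; it is not part of Python.

```python
text = "雨"

# str -> bytes (encode) and bytes -> str (decode) round-trip with the same codec.
data = text.encode("utf-8")
assert data.decode("utf-8") == text


def describe(action):
    """Run a zero-argument callable; return the name of any Unicode error raised."""
    try:
        action()
    except UnicodeError as exc:
        return type(exc).__name__
    return "ok"


# Encoding a non-ASCII character with the ascii codec fails.
encode_result = describe(lambda: text.encode("ascii"))
# Decoding UTF-8 bytes with the ascii codec fails as well.
decode_result = describe(lambda: data.decode("ascii"))

print(encode_result)
print(decode_result)
```

Encoding errors come from str to bytes conversions, decoding errors from bytes to str conversions; knowing which direction failed is usually the first step in locating the problem.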

Definition

Python defines strings as follows:

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points.

In Python 3.5, a string is an immutable sequence of Unicode code points:

('S' 'T' 'R' 'I' 'N' 'G')

'STRING'
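The example above works because adjacent string literals are joined into a single string. To make "sequence of code points" more concrete, here is a small sketch using the built-ins `ord` and `chr`; the variable names are illustrative.

```python
s = "STRING"

# Adjacent string literals are concatenated into one string.
adjacent = ('S' 'T' 'R' 'I' 'N' 'G')

# ord() maps a one-character string to its Unicode code point (an int).
code_points = [ord(ch) for ch in s]
print(code_points)  # [83, 84, 82, 73, 78, 71]

# chr() maps a code point back to a one-character string.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt)  # STRING
```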

Immutable means that the string itself cannot be modified in place:

s = 'Hello'
print(s[3])
s[3] = 'o'

l
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
      1 s = 'Hello'
      2 print(s[3])
----> 3 s[3] = 'o'

TypeError: 'str' object does not support item assignment
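Since item assignment is not allowed, the usual approach is to build a new string instead. A minimal sketch, with illustrative variable names:

```python
s = "Hello"

# Build a new string from slices; the original object is left untouched.
patched = s[:3] + "o" + s[4:]
print(patched)  # Heloo

# str.replace also returns a new string; count=1 replaces only the first match.
replaced = s.replace("l", "L", 1)
print(replaced)  # HeLlo

print(s)  # Hello
```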

Sequence, in turn, means that strings support the common operations shared by the sequence types (list/tuple/range), such as iteration:

[i.upper() for i in "hello"]

['H', 'E', 'L', 'L', 'O']
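Beyond iteration, a few other common sequence operations work on strings just as they do on lists or tuples. The sketch below is not exhaustive, and the variable names are illustrative.

```python
s = "hello"

first = s[0]           # indexing: 'h'
last = s[-1]           # negative indexing: 'o'
middle = s[1:4]        # slicing: 'ell'
has_ell = "ell" in s   # membership test: True
repeated = s * 2       # repetition: 'hellohello'
joined = s + "!"       # concatenation: 'hello!'
count_l = s.count("l") # occurrences of 'l': 2
smallest = min(s)      # smallest character by code point: 'e'

print(first, last, middle, has_ell, repeated, joined, count_l, smallest)
```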

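As for Unicode, for now think of it as a very large map that aims to record every symbol in the world; a code point is then the coordinate of a symbol on that map (details will follow in later posts).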

s = '雨'
print(s)
print(len(s))
print(s.encode())

雨
1
b'\xe9\x9b\xa8'
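To connect the "coordinate" analogy to the output above: the character is one code point, while its default (UTF-8) encoding takes three bytes. A small sketch, with illustrative variable names:

```python
s = "雨"

# The code point is the character's "coordinate" in Unicode.
cp = ord(s)
print(hex(cp))  # 0x96e8

# One code point, but three bytes once encoded as UTF-8.
utf8 = s.encode("utf-8")
print(len(s), len(utf8))  # 1 3

# chr() goes from the coordinate back to the character.
print(chr(cp) == s)  # True
```

So `len` on a str counts code points, while `len` on the encoded bytes counts bytes, and the two can differ.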

Common operations

len: the length of the string
split & join
find & index
strip
upper & lower & swapcase & title & capitalize
endswith & startswith & is*
zfill
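The operations listed above can be sketched briefly as follows. The sample strings and variable names are made up for illustration.

```python
# len
length = len("hello")                   # 5

# split & join
parts = "a,b,c".split(",")              # ['a', 'b', 'c']
joined = "-".join(parts)                # 'a-b-c'

# find & index: find returns -1 when missing, index raises ValueError
found = "hello".find("l")               # 2
missing = "hello".find("z")             # -1
try:
    "hello".index("z")
    index_raised = False
except ValueError:
    index_raised = True

# strip removes leading and trailing whitespace by default
stripped = "  hi  ".strip()             # 'hi'

# upper & lower & swapcase & title & capitalize
upper = "Hello World".upper()           # 'HELLO WORLD'
lower = "Hello World".lower()           # 'hello world'
swapped = "Hello World".swapcase()      # 'hELLO wORLD'
titled = "hello world".title()          # 'Hello World'
capitalized = "hello world".capitalize()  # 'Hello world'

# endswith & startswith & is*
is_txt = "report.txt".endswith(".txt")  # True
is_rep = "report.txt".startswith("rep") # True
digits = "123".isdigit()                # True
alpha = "abc".isalpha()                 # True

# zfill pads with zeros on the left up to the given width
padded = "42".zfill(5)                  # '00042'
```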
