Common Excel Tasks Demonstrated in Pandas (Part 2)

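Introduction

I was excited by the response to the first article in this series. Thank you all for the positive feedback. I want to keep the series going by highlighting some other tasks that you commonly perform in Excel and showing how to do the same things in pandas.

In the first article, I focused on common math tasks in Excel and their pandas counterparts. In this article, I'll focus on some common selection and filtering tasks and show how to do the same thing in pandas.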

Setup

If you would like to follow along, you can download the Excel file.

Import the pandas and numpy modules.

import pandas as pd
import numpy as np

Load the Excel file that holds our sample company's annual sales.

df = pd.read_excel("sample-salesv3.xlsx")

Take a quick look at the data types to make sure everything came through as expected.

df.dtypes

account number      int64
name               object
sku                object
quantity            int64
unit price        float64
ext price         float64
date               object
dtype: object

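You'll notice that our date column shows up as a generic object. We are going to convert it to a datetime object to make some of the later selections easier.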

df['date'] = pd.to_datetime(df['date'])
df.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

df.dtypes

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

The date column is now a datetime type, which will be useful for the operations that follow.
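To make the benefit of the conversion concrete, here is a minimal sketch on a made-up two-row frame (the name sample and its values are my own stand-ins, not the real sales file). It shows that date-aware accessors such as .dt.hour only become available once the column has been converted.

```python
import pandas as pd

# Hypothetical stand-in for the sales data: dates stored as plain strings
sample = pd.DataFrame({"date": ["2014-01-01 07:21:51", "2014-01-01 10:00:47"]})

# Convert the string column to a datetime column
sample["date"] = pd.to_datetime(sample["date"])

# The .dt accessor works only on datetime columns
hours = sample["date"].dt.hour.tolist()
print(hours)
```

If the column were left as strings, the .dt accessor would raise an error instead of returning the hours.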

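Filtering Data

I think one of the handiest features in Excel is the filter. I imagine that almost any time someone gets an Excel file of any size and wants to narrow down the data, they use this feature.

Here it is applied to this data set:

Similar to the filter function in Excel, you can use pandas to filter and select specific subsets of data.

For instance, if we want to look at just one specific account number, we can easily do that in Excel or in pandas.

Here is the Excel filter solution:

Doing this in pandas is even more straightforward than in Excel. Note that I am going to use the head function to show only the first few results. This is purely to keep the article short.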

df[df["account number"]==307599].head()

You can also filter on numeric values. I am not going to show any more Excel examples; I am sure you get the idea.

df[df["quantity"] > 22].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
15	239344	Stokes LLC	S1-06532	34	71.51	2431.34	2014-01-04 11:34:58
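It may help to see what is happening inside the square brackets. The sketch below, on a small made-up frame (sample is a hypothetical stand-in for df), splits the filter into two steps: the comparison produces a Boolean Series with one True or False per row, and indexing with that Series keeps only the True rows.

```python
import pandas as pd

# Hypothetical three-row stand-in for the sales data
sample = pd.DataFrame({
    "account number": [740150, 714466, 218895],
    "quantity": [39, -1, 23],
})

# Step 1: the comparison yields a Boolean Series, one value per row
mask = sample["quantity"] > 22
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True
filtered = sample[mask]
print(filtered)
```

Notice that the surviving rows keep their original index labels (0 and 2 here), which is why the index in the filtered tables above has gaps.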

If we want to do more complex filtering, we can use map to filter on arbitrary criteria. In this example, let's look for items whose sku starts with B1.

df[df["sku"].map(lambda x: x.startswith('B1'))].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
17	239344	Stokes LLC	B1-50809	14	16.23	227.22	2014-01-04 22:14:32
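As an alternative to map with a lambda, pandas also offers vectorized string methods under the .str accessor. The sketch below is one possible way to write the same prefix test; the frame sample and its missing value are invented for illustration. The na=False argument tells pandas to treat missing skus as non-matches, whereas the lambda version would fail on a missing value because it has no startswith method.

```python
import pandas as pd

# Hypothetical stand-in data, including one missing sku
sample = pd.DataFrame({"sku": ["B1-20000", "S2-77896", None]})

# Vectorized prefix test; missing values count as "does not match"
mask = sample["sku"].str.startswith("B1", na=False)
b1_items = sample[mask]
print(b1_items)
```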

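It's easy to chain two or more conditions together using &. Wrap each comparison in its own parentheses, because & binds more tightly than > in Python.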

df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
26	737550	Fritsch, Russel and Anderson	B1-53636	42	42.06	1766.52	2014-01-08 00:02:11
31	714466	Trantow-Barrows	B1-33087	32	19.56	625.92	2014-01-09 10:16:32
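The same pattern extends to "or" and "not". Here is a small sketch on invented data (sample and the variable names are mine) that combines two conditions with & for and, | for or, and ~ for not.

```python
import pandas as pd

# Hypothetical four-row stand-in for the sales data
sample = pd.DataFrame({
    "sku": ["B1-20000", "S2-77896", "B1-65551", "S1-06532"],
    "quantity": [39, -1, 2, 34],
})

is_b1 = sample["sku"].str.startswith("B1")
is_large = sample["quantity"] > 22

both = sample[is_b1 & is_large]        # B1 items AND quantity over 22
either = sample[is_b1 | is_large]      # B1 items OR quantity over 22
neither = sample[~(is_b1 | is_large)]  # everything else

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 2, 3]
print(neither.index.tolist())  # [1]
```

Giving each condition a name, as done here, is a matter of taste, but it tends to keep longer filters readable.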

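Another useful function that pandas supports is isin. It lets us define a list of values we want to look for.

In this case, we look for all records that include either of two specific account numbers.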

df[df["account number"].isin([714466,218895])].head()

	account number	name	sku	quantity	unit price	ext price	date
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
5	714466	Trantow-Barrows	S2-77896	17	87.63	1489.71	2014-01-02 10:07:15
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
8	714466	Trantow-Barrows	S1-50961	22	84.09	1849.98	2014-01-03 11:29:02
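Since isin returns a Boolean Series like any other comparison, it can be inverted with ~ to exclude a list of values instead. A minimal sketch, again on made-up data (sample and wanted are hypothetical names):

```python
import pandas as pd

# Hypothetical stand-in for the sales data
sample = pd.DataFrame({"account number": [740150, 714466, 218895, 307599]})

wanted = [714466, 218895]

# Rows whose account number is in the list
selected = sample[sample["account number"].isin(wanted)]

# Rows whose account number is NOT in the list
excluded = sample[~sample["account number"].isin(wanted)]

print(selected["account number"].tolist())  # [714466, 218895]
print(excluded["account number"].tolist())  # [740150, 307599]
```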

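Pandas also supports a function called query, which lets you select subsets of the data efficiently. It relies on numexpr for fast evaluation, so make sure you have it installed before trying this step if you want that speed; as far as I know, recent pandas versions fall back to a slower Python engine when numexpr is missing.

If you would like to get a list of customers by name, you can do that with query, using syntax similar to the python syntax shown above.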

df.query('name == ["Kulas Inc","Barton LLC"]').head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
33	218895	Kulas Inc	S1-06532	3	22.36	67.08	2014-01-09 23:58:27
36	218895	Kulas Inc	S2-34077	16	73.04	1168.64	2014-01-10 12:07:30

This is just a simple example; the query function can do a lot more. I am showing it here so that you are aware of it when you need it.
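As a taste of what else query can do, the sketch below shows two features I find handy: referring to a Python variable with @, and quoting a column name that contains a space with backticks. The frame sample and the variable customers are invented for illustration, and I pass engine="python" so the sketch does not depend on numexpr being installed.

```python
import pandas as pd

# Hypothetical three-row stand-in for the sales data
sample = pd.DataFrame({
    "account number": [740150, 218895, 714466],
    "name": ["Barton LLC", "Kulas Inc", "Trantow-Barrows"],
    "quantity": [39, 23, -1],
})

customers = ["Kulas Inc", "Barton LLC"]

# @customers refers to the Python variable defined above
by_name = sample.query("name in @customers", engine="python")

# Backticks quote a column name that contains a space
by_account = sample.query(
    "`account number` == 218895 and quantity > 22", engine="python"
)

print(by_name["name"].tolist())   # ['Barton LLC', 'Kulas Inc']
print(by_account.index.tolist())  # [1]
```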

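Working with Dates

Using pandas, you can do more complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results come back in the order you expect.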

df = df.sort_values(by='date')
df.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55
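Here is a minimal sketch of sorting by date on a deliberately shuffled, made-up frame, using sort_values (the sorting method available in current pandas). It also shows that sorting keeps the original index labels, and that reset_index(drop=True) renumbers the rows if you would rather have a clean 0, 1, 2 index.

```python
import pandas as pd

# Hypothetical stand-in data, deliberately out of date order
sample = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-03", "2014-01-01", "2014-01-02"]),
    "quantity": [3, 1, 2],
})

# Sort by date; the original index labels travel with their rows
ordered = sample.sort_values(by="date")
print(ordered.index.tolist())  # [1, 2, 0]

# Optionally renumber the rows after sorting
renumbered = ordered.reset_index(drop=True)
print(renumbered["quantity"].tolist())  # [1, 2, 3]
```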

The python filtering syntax shown earlier works with dates too.

df[df['date'] >='20140905'].head()

	account number	name	sku	quantity	unit price	ext price	date
1042	163416	Purdy-Kunde	B1-38851	41	98.69	4046.29	2014-09-05 01:52:32
1043	714466	Trantow-Barrows	S1-30248	1	37.16	37.16	2014-09-05 06:17:19
1044	729833	Koepp Ltd	S1-65481	48	16.04	769.92	2014-09-05 08:54:41
1045	729833	Koepp Ltd	S2-11481	6	26.50	159.00	2014-09-05 16:33:15
1046	737550	Fritsch, Russel and Anderson	B1-33364	4	76.44	305.76	2014-09-06 08:59:08

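One of the really nice features of pandas is that it understands dates, so it lets us filter on partial dates. If we only want to look at data from a given month onward, we can do that.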

df[df['date'] >='2014-03'].head()

	account number	name	sku	quantity	unit price	ext price	date
242	163416	Purdy-Kunde	S1-30248	19	65.03	1235.57	2014-03-01 16:07:40
243	527099	Sanford and Sons	S2-82423	3	76.21	228.63	2014-03-01 17:18:01
244	527099	Sanford and Sons	B1-50809	8	70.78	566.24	2014-03-01 18:53:09
245	737550	Fritsch, Russel and Anderson	B1-50809	20	50.11	1002.20	2014-03-01 23:47:17
246	688981	Keeling LLC	B1-86481	-1	97.16	-97.16	2014-03-02 01:46:44

Of course, you can chain the criteria.
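For instance, a lower and an upper bound can be combined with & to pick out a single month. This is only a sketch on invented data (sample and march are my own names); the partial date strings are interpreted as the first moment of each month, so the upper bound uses < to leave out April.

```python
import pandas as pd

# Hypothetical stand-in data straddling the month of March 2014
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-02-27 09:00:00",
        "2014-03-01 16:07:40",
        "2014-03-31 23:59:59",
        "2014-04-01 00:00:00",
    ]),
    "quantity": [5, 19, 3, 8],
})

# Keep rows on or after March 1 and strictly before April 1
march = sample[(sample["date"] >= "2014-03") & (sample["date"] < "2014-04")]
print(march["quantity"].tolist())  # [19, 3]
```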