Web Scraping with Python

This article summarizes the book Web Scraping with Python (2015).

Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads

Source code: https://bitbucket.org/wswp/code

Demo site: http://example.webscraping.com/

Demo site source code: http://bitbucket.org/wswp/places

Recommended introductory Python tutorial: http://www.diveintopython.net

HTML and JavaScript basics:

http://www.w3schools.com

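Introduction to web scraping

Why scrape the web?

When shopping online you may want to compare prices across several sites, which is what the 惠惠购物助手 (Huihui shopping assistant) does. An API makes this easy, but most sites do not offer one, and that is when web scraping is needed.

Is web scraping legal?

Using scraped data for personal purposes is generally not illegal, but commercial use or republication raises questions of authorization, and a crawler should also behave politely toward the sites it visits. Judging by cases already decided abroad, factual data such as locations and phone numbers can usually be republished, whereas original content cannot.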

http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html

http://caselaw.findlaw.com/us-supreme-court/499/340.html

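Background research

robots.txt and the Sitemap help you understand the size and structure of a site. Tools such as Google Search and WHOIS are also useful.

For example, http://example.webscraping.com/robots.txt contains: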

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
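A crawler can check these rules in code before requesting a page. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt text is inlined so the example runs offline, and the name GoodCrawler and the helper load_rules are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""


def load_rules(text):
    """Build a parser from robots.txt text instead of fetching it (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
base = "http://example.webscraping.com"

# "BadCrawler" is banned from the whole site by section 1.
print(rules.can_fetch("BadCrawler", base + "/"))        # False
# Any other agent (here the made-up "GoodCrawler") falls under section 2.
print(rules.can_fetch("GoodCrawler", base + "/"))       # True
print(rules.can_fetch("GoodCrawler", base + "/trap"))   # False
# Crawl-delay is exposed by crawl_delay() on Python 3.6 and later.
print(rules.crawl_delay("GoodCrawler"))                 # 5
```

Against a live site, the usual pattern is to call set_url() with the robots.txt address and then read() instead of parse().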

For more about web robots, see http://www.robotstxt.org.
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. The demo site's sitemap lists URLs such as:

http://example.webscraping.com/view/Afghanistan-1

http://example.webscraping.com/view/Aland-Islands-2

http://example.webscraping.com/view/Albania-3

...

Sitemaps are often incomplete, so do not rely on them to list every page.
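Collecting the page URLs from a sitemap can be sketched as follows. This assumes the sitemap is a standard urlset document as defined by the Sitemap protocol. The sample XML below is a hand-written stand-in for the demo site's real sitemap, and sitemap_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hand-written sample (an assumption, not the demo site's actual file).
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>
"""


def sitemap_links(xml_text):
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


for link in sitemap_links(SAMPLE_SITEMAP):
    print(link)
```

Because the sitemap may omit pages, a crawler would typically treat these links as a starting set rather than the full list.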

Estimating the size of a site:
Use Google's site: operator, for example: site:automationtesting.sinaapp.com

Identifying the technology behind a site:

# pip install builtwith
# ipython
In [1]: import builtwith
In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
Out[2]:
{u'issue-trackers': [u'Trac'],
 u'javascript-frameworks': [u'jQuery'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}

Identifying the site owner:

# pip install python-whois
# ipython
In [1]: import whois
In [2]: print(whois.whois('http://automationtesting.sinaapp.com'))
{
  "updated_date": "2016-01-07 00:00:00",
  "status": [
    "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
  ],
  "name": null,
  "dnssec": null,
  "city": null,
  "expiration_date": "2021-06-29 00:00:00",
  "zipcode": null,
  "domain_name": "SINAAPP.COM",
  "country": null,
  "whois_server": "whois.paycenter.com.cn",
  "state": null,
  "registrar": "XIN NET TECHNOLOGY CORPORATION",
  "referral_url": "http://www.xinnet.com",
  "address": null,
  "name_servers": [
    "NS1.SINAAPP.COM",
    "NS2.SINAAPP.COM",
    "NS3.SINAAPP.COM",
    "NS4.SINAAPP.COM"
  ],
  "org": null,
  "creation_date": "2009-06-29 00:00:00",
  "emails": null
}

Crawling your first site

The code for a simple crawler is as follows: