This article is a summary of Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site code: http://bitbucket.org/wswp/places
Recommended Python tutorial: http://www.diveintopython.net
HTML and JavaScript basics:
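Introduction to web scraping
- Why scrape the web?
When shopping online you may want to compare prices across several sites, which is what the Huihui shopping assistant (惠惠购物助手) does. An API makes this easy, but most sites do not offer one, and that is when web scraping is needed.
- Is web scraping legal?
Using scraped data for personal purposes is generally not illegal. Commercial use or republication requires you to consider licensing, and in every case you should follow crawling etiquette. Judging from cases already decided in courts abroad, factual data such as locations and phone numbers can generally be republished, while original content cannot.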
Further reading:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
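- Background research
robots.txt and the Sitemap help you understand the size and structure of a site. Tools such as Google Search and WHOIS are also useful.
For example: http://example.webscraping.com/robots.txt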
    # section 1
    User-agent: BadCrawler
    Disallow: /

    # section 2
    User-agent: *
    Crawl-delay: 5
    Disallow: /trap

    # section 3
    Sitemap: http://example.webscraping.com/sitemap.xml
For more about web robots, see http://www.robotstxt.org.
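The rules in a robots.txt file can also be checked programmatically. The following is a minimal sketch using `urllib.robotparser` from the Python 3 standard library (the sessions in this article are Python 2, so the choice of Python 3 here is an assumption). It parses the robots.txt text shown above from a string instead of downloading it; the agent name `GoodCrawler` and the constant names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown above, embedded so the sketch runs offline.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

BASE = "http://example.webscraping.com"

rp = RobotFileParser()
# parse() takes an iterable of lines. To load a live file instead,
# call rp.set_url(BASE + "/robots.txt") followed by rp.read().
rp.parse(ROBOTS_TXT.splitlines())

# "GoodCrawler" is a hypothetical user agent that matches only the "*" section.
print(rp.can_fetch("BadCrawler", BASE + "/"))       # section 1 applies
print(rp.can_fetch("GoodCrawler", BASE + "/"))      # section 2 applies
print(rp.can_fetch("GoodCrawler", BASE + "/trap"))  # section 2 applies
print(rp.crawl_delay("GoodCrawler"))                # needs Python 3.6+
print(rp.site_maps())                               # needs Python 3.8+
```

With these rules, `BadCrawler` is refused everywhere, while any other agent is refused only under `/trap` and is asked to wait 5 seconds between requests.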
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. The demo site's sitemap lists URLs such as:
    http://example.webscraping.com/view/Afghanistan-1
    http://example.webscraping.com/view/Aland-Islands-2
    http://example.webscraping.com/view/Albania-3
    ...
Sitemaps are often incomplete.
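In a sitemap file that follows the protocol, each URL is wrapped in a `<loc>` element. The sketch below shows one way to pull those URLs out with the standard library's `xml.etree.ElementTree`. The `SAMPLE` document is a hand-written stand-in built from the three URLs above, not a copy of the real file, and `sitemap_urls` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hand-written sample in the protocol's format (an assumption, not the real file).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


for url in sitemap_urls(SAMPLE):
    print(url)
```

Because sitemaps are often incomplete, a list obtained this way is best treated as a starting point for a crawl, not as a full inventory of the site.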
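Estimating the size of a site:
Use Google's site: operator, for example: site:automationtesting.sinaapp.com. The number of results reported gives a rough estimate of how many pages the site has.
Identifying the technology behind a site: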
    # pip install builtwith
    # ipython
    In [1]: import builtwith
    In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
    Out[2]:
    {u'issue-trackers': [u'Trac'],
     u'javascript-frameworks': [u'jQuery'],
     u'programming-languages': [u'Python'],
     u'web-servers': [u'Nginx']}
Finding the owner of a site:
    # pip install python-whois
    # ipython
    In [1]: import whois
    In [2]: print whois.whois('http://automationtesting.sinaapp.com')
    {
      "updated_date": "2016-01-07 00:00:00",
      "status": [
        "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
        "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
        "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
      ],
      "name": null,
      "dnssec": null,
      "city": null,
      "expiration_date": "2021-06-29 00:00:00",
      "zipcode": null,
      "domain_name": "SINAAPP.COM",
      "country": null,
      "whois_server": "whois.paycenter.com.cn",
      "state": null,
      "registrar": "XIN NET TECHNOLOGY CORPORATION",
      "referral_url": "http://www.xinnet.com",
      "address": null,
      "name_servers": [
        "NS1.SINAAPP.COM",
        "NS2.SINAAPP.COM",
        "NS3.SINAAPP.COM",
        "NS4.SINAAPP.COM"
      ],
      "org": null,
      "creation_date": "2009-06-29 00:00:00",
      "emails": null
    }
- Crawling your first website
A simple crawler looks like this: