Tips for Scraping Websites with Python Crawlers: The Advanced Edition

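I previously wrote a roundup of tips for scraping websites with Python crawlers, which covered a good number of crawling techniques. That piece still looks fairly useful today, but I was quite the novice back then (I still am, though I have improved a lot since), and much of it was far from optimal: it sat squarely at the "it merely works" level. This advanced edition aims to raise "it works" to "it works with less hassle and less worry".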

I. gzip/deflate support

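Most web pages today support gzip compression, which can often save a great deal of transfer time. Take the VeryCD home page as an example: the uncompressed version is 247K, while the compressed one is 45K, roughly 1/5 of the original. When transfer time dominates, that means fetching can be up to about 5 times faster.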
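For the figures above, 45K / 247K is about 0.18. As a rough illustration of why markup compresses so well, here is a minimal Python 3 sketch that gzips a made-up, repetitive page and prints the ratio. The sample markup is an assumption invented for the example, not VeryCD's actual HTML, and real pages will give different numbers.

```python
import gzip

# Made-up, repetitive markup standing in for a real page (not VeryCD's HTML).
page = b"".join(
    b'<li class="topic"><a href="/topics/%d/">Topic %d</a></li>\n' % (i, i)
    for i in range(2000)
)

packed = gzip.compress(page)          # what a gzip-enabled server would send
ratio = len(packed) / len(page)       # compressed size relative to the original

print("raw: %d bytes, gzip: %d bytes, ratio: %.2f" % (len(page), len(packed), ratio))
```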

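However, neither urllib nor urllib2 in Python requests compression by default. To get a compressed response you have to set 'Accept-Encoding' explicitly in the request headers, and after reading the response you still have to check its headers for a 'Content-Encoding' entry to decide whether decoding is needed. It is tedious and fiddly. So how can we make urllib2 support gzip and deflate automatically?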

In fact, you can subclass BaseHandler and plug the subclass in through build_opener:

import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests """

    # add headers to requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(
                fileobj=StringIO(resp.read()),
                mode="r"
            )
            # wrap the decoded stream with the original headers, URL and status code
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        elif resp.headers.get("content-encoding") == "deflate":
            gz = StringIO( deflate(resp.read()) )
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib

def deflate(data):   # zlib only provides the zlib compress format, not the deflate format;
    try:               # so on top of all there's this workaround:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)
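The try/except fallback in deflate() is there because servers may send either a raw deflate stream or a zlib-wrapped one under the same 'deflate' label. The following Python 3 sketch builds both flavors locally and shows one function decoding each. The name inflate and the sample payload are made up for this illustration.

```python
import zlib

payload = b"hello deflate " * 50

# zlib-wrapped stream: a small header, the deflate data, then a checksum.
wrapped = zlib.compress(payload)

# Raw deflate stream: a negative wbits value tells zlib to omit the wrapper.
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = compressor.compress(payload) + compressor.flush()

def inflate(data):
    """Try raw deflate first, then fall back to the zlib-wrapped format."""
    try:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)

print(inflate(raw) == payload, inflate(wrapped) == payload)
```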

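With the handler defined, using it is simple:

encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener( encoding_support, urllib2.HTTPHandler )

# open pages directly with the opener; if the server supports gzip/deflate,
# the response is decompressed automatically
content = opener.open(url).read()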
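The code above targets Python 2. For readers on Python 3, where urllib2 became urllib.request, a port might look roughly like the sketch below. It is an adaptation under assumptions rather than the original author's code: the inflate helper name, the HTTPS aliases, and decoding the whole body in memory are choices made for this example.

```python
import gzip
import io
import urllib.request
import zlib
from urllib.response import addinfourl


def inflate(data):
    """Decode a 'deflate' body: raw deflate first, zlib-wrapped as fallback."""
    try:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)


class ContentEncodingProcessor(urllib.request.BaseHandler):
    """Ask for gzip/deflate and transparently decode the response body."""

    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    def http_response(self, req, resp):
        encoding = (resp.headers.get("Content-Encoding") or "").lower()
        if encoding == "gzip":
            body = gzip.decompress(resp.read())
        elif encoding == "deflate":
            body = inflate(resp.read())
        else:
            return resp  # not compressed: pass the response through untouched
        # Wrap the decoded bytes with the original headers, URL and status code.
        decoded = addinfourl(io.BytesIO(body), resp.headers, resp.url, resp.code)
        decoded.msg = getattr(resp, "msg", None)
        return decoded

    # Reuse the same hooks for HTTPS (an addition; the original handles HTTP only).
    https_request = http_request
    https_response = http_response


opener = urllib.request.build_opener(ContentEncodingProcessor)
# content = opener.open("http://www.verycd.com/").read()  # needs network access
```

The opener is used the same way as in the Python 2 version; the last line is left commented out because it needs network access.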

II. More convenient multithreading

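The roundup did mention a simple multithreading template, but actually applying it inside a program only leaves the code fragmented and painful to read. I have put some thought into how to make multithreading more convenient. First, consider: what would be the most convenient way to make multithreaded calls?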

1. Asynchronous I/O fetching with twisted

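In fact, efficient fetching does not necessarily require multiple threads; asynchronous I/O works too. Just call twisted's getPage and attach a callback and an errback to run when the asynchronous I/O finishes. For example, you can do this: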

from twisted.web.client import getPage
from twisted.internet import reactor

links = [ 'http://www.verycd.com/topics/%d/'%i for i in range(5420,5430) ]

def parse_page(data,url):
    print len(data),url

def fetch_error(error,url):
    print error.getErrorMessage(),url
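The callback/errback pairing described above can also be mimicked with the standard library alone. The sketch below swaps twisted for Python 3's asyncio, so it is an analogy rather than the twisted API. fake_fetch is a hypothetical stand-in that performs no network I/O and fails for one URL on purpose, and parse_page and fetch_error return tuples instead of printing so the results are easy to inspect.

```python
import asyncio

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

async def fake_fetch(url):
    """Hypothetical stand-in for a real download; fails for one URL on purpose."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if url.endswith('/5425/'):
        raise IOError('simulated failure')
    return b'x' * 100

def parse_page(data, url):
    return (url, len(data))

def fetch_error(error, url):
    return (url, str(error))

async def fetch_one(url):
    try:
        data = await fake_fetch(url)
    except IOError as error:
        return fetch_error(error, url)   # plays the role of the errback
    return parse_page(data, url)         # plays the role of the callback

async def crawl(urls):
    # All fetches are scheduled at once and run concurrently on one thread.
    return await asyncio.gather(*(fetch_one(u) for u in urls))

results = asyncio.run(crawl(links))
for item in results:
    print(item)
```

As with the twisted approach, everything runs in a single thread; concurrency comes from the event loop switching between fetches while each one waits.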
