Lao Tangzhu's Crawler Notes: Simulating a Reply on Baidu Tieba

Lao Tangzhu's English may be excellent, but his Chinese still needs some work, so please forgive any mistakes in this post.

Prerequisites

  1. python 2.7.11
  2. requests

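Try it yourself

Post a reply in the Tieba forum for the TV series 老九门 with Chrome's developer tools open, and watch what happens.

  1. Open http://tieba.baidu.com/p/4646166740
  2. Type: 苟利国家生死以 岂因祸福避趋之
  3. Click the post (发表) button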

Screenshot 1

Look at the Network tab

1 The URL is http://tieba.baidu.com/f/commit/post/add
and the request method is POST.


Screenshot 2

2 The request headers, along with the cookie information, are shown below


Screenshot 3

3 The parameters sent with the request


Screenshot 4

4 The response


Screenshot 5

Working it out

Let's first look at those parameters
ie: presumably the encoding, utf-8

kw: almost certainly the name of the forum (贴吧)

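fid: What is this 20578208, some kind of big news? Where does it come from? Think back to the steps we just went through: first we opened the thread http://tieba.baidu.com/p/4646166740, then we clicked the post button. On reflection, 20578208 most likely appears somewhere in the thread page's source. Is that really the case? Let's press Ctrl-F in the page source.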


Screenshot 6

Ha, just as expected. A regular expression pulls it out.

re.findall("""fid\s*:\s*'\s*(.*?)\s*'""", content)[0]

Later testing confirmed that this fid is the forum's identifier.
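To see what the regular expression above actually captures, here is a minimal sketch run against a made-up fragment of page source. `SAMPLE_PAGE` and `extract_fid` are illustrative names that are not part of the original code, and the real page is far larger and may format the field differently.

```python
import re

# Hypothetical fragment of the thread page source (an assumption for
# illustration); only the fid value itself comes from the article.
SAMPLE_PAGE = "PageData.forum = { fid : '20578208', forumName:'example', };"

# Same pattern as in the article: tolerate spaces around ':' and inside quotes.
FID_PATTERN = re.compile(r"fid\s*:\s*'\s*(.*?)\s*'")


def extract_fid(page_source):
    """Return the first fid found in the page source, or None if absent."""
    match = FID_PATTERN.search(page_source)
    return match.group(1) if match else None


fid = extract_fid(SAMPLE_PAGE)
```

Returning None instead of indexing `[0]` on an empty list makes a missing field easier to diagnose, for example when the cookies have expired and a different page comes back.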

tid: 4646166740? It is right there in the URL http://tieba.baidu.com/p/4646166740, plain as day.
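Since tid is simply the last path segment of the thread URL, it can be read off with a small helper. This is a sketch under the assumption that thread URLs always have the /p/<digits> shape seen above; `extract_tid` is a hypothetical name.

```python
import re


def extract_tid(thread_url):
    """Return the numeric thread id from a /p/<tid> URL, or None."""
    match = re.search(r"/p/(\d+)", thread_url)
    return match.group(1) if match else None


tid = extract_tid("http://tieba.baidu.com/p/4646166740")
```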

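vcode_md5: just leave it empty.

floor_num: with excellent English this one is easy, it is the floor number of the reply (its position in the thread). Testing shows that the value is not checked strictly, and most values work, for example 500.

rich_text: just 1.

tbs: remember the earlier fid? This one comes out of the page source in the same way.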

re.findall("""tbs:\s*'(.*?)',""", content)[0]
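Because fid and tbs both come from the same page, the two patterns can be applied in one pass. The sketch below does that and raises a clear error when a field is missing; `FIELD_PATTERNS`, `extract_fields` and `SAMPLE` are illustrative names, and the sample text is invented rather than copied from a real page.

```python
import re

# The two patterns shown in the article, keyed by parameter name.
FIELD_PATTERNS = {
    "fid": r"fid\s*:\s*'\s*(.*?)\s*'",
    "tbs": r"tbs:\s*'(.*?)',",
}


def extract_fields(page_source):
    """Extract every known field, raising ValueError if one is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, page_source)
        if match is None:
            raise ValueError("could not find %r in page source" % name)
        fields[name] = match.group(1)
    return fields


# Invented sample text; the tbs value here is a placeholder, not a real token.
SAMPLE = "fid: '20578208', tbs: 'abc123def',"
fields = extract_fields(SAMPLE)
```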

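content: the two lines of verse posted earlier.

files: no attachment was posted, so [] is enough.

mouse_pwd: now this is big news: the mouse trajectory! As for how exactly it is computed, two words: no comment! But there is always a workaround!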

mouse_list = [    "118,112,113,110,115,123,114,117,75,115,110,114,110,115,110,114,110,115,110,114,110,115,110,114,110,115,110,114,75,117,113,118,114,75,115,112,122,114,110,122,114,114,",    "90,84,82,78,83,84,87,83,107,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,107,83,81,84,81,83,107,83,80,90,82,78,90,82,82,",    "9,4,1,28,1,9,8,0,57,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,57,4,7,6,5,57,1,2,8,0,28,8,0,0,",    "80,91,86,79,82,84,91,86,106,82,79,83,106,87,86,84,80,106,82,81,91,83,79,91,83,83,",    "96,101,96,125,96,104,96,101,88,96,125,97,125,96,125,97,125,96,125,97,125,96,125,97,125,96,125,97,88,103,98,97,105,88,96,99,105,97,125,105,97,97,",    "19,29,24,7,25,18,19,18,34,26,7,27,7,26,7,27,7,26,7,27,7,26,7,27,7,26,7,27,34,31,26,25,19,34,26,25,19,27,7,19,27,27,",    "59,63,59,38,50,63,51,50,3,59,38,58,38,59,38,58,38,59,38,58,38,59,38,58,38,59,38,58,3,60,58,56,62,3,59,56,50,58,38,50,58,58,",    "81,87,91,78,81,91,82,80,107,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,107,85,81,91,86,107,83,80,90,82,78,90,82,82,",    "30,30,28,0,30,20,28,27,37,29,0,28,0,29,0,28,0,29,0,28,0,29,0,28,0,29,0,28,37,31,26,21,25,37,29,30,20,28,0,20,28,28,",    "103,106,107,127,98,106,99,102,90,98,127,99,127,98,127,99,127,98,127,99,127,98,127,99,127,98,127,99,90,98,97,99,103,107,90,98,97,107,99,127,107,99,99,",    "37,32,32,58,36,39,34,46,31,39,58,38,58,39,58,38,58,39,58,38,58,39,58,38,58,39,58,38,31,34,46,34,39,37,31,47,32,38,58,35,34,38,",]
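The list above is not used directly in the full listing later in this post, where `get_mouse_pwd` returns one fixed comma-separated string with a timestamp appended. A plausible way to use the list, sketched here as an assumption rather than as the author's method, is to pick one entry at random and append the same timestamp that is sent as mouse_pwd_t. `MOUSE_LIST_EXCERPT` copies two entries from the list above, and `make_mouse_pwd` is a hypothetical helper.

```python
import random
import time

# Two entries copied verbatim from mouse_list above.
MOUSE_LIST_EXCERPT = [
    "9,4,1,28,1,9,8,0,57,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,57,4,7,6,5,57,1,2,8,0,28,8,0,0,",
    "80,91,86,79,82,84,91,86,106,82,79,83,106,87,86,84,80,106,82,81,91,83,79,91,83,83,",
]


def make_mouse_pwd(prefixes, timestamp):
    """Join a randomly chosen captured prefix with the given timestamp string."""
    return random.choice(prefixes) + timestamp


# Same timestamp expression as in the article; compute it once so that
# mouse_pwd and mouse_pwd_t carry an identical value.
timestamp = str(time.time()).replace(".", "")
mouse_pwd = make_mouse_pwd(MOUSE_LIST_EXCERPT, timestamp)
```

Whether the server checks that the two timestamps agree is not established here; reusing one value simply avoids the small drift that two separate `time.time()` calls would introduce.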

mouse_pwd_t: by the book, this is a timestamp.

str(time.time()).replace(".", "")
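One caveat with the expression above: `str(time.time())` does not print the same number of decimal places in every Python version (Python 2.7 keeps about two at this magnitude, Python 3 keeps more), so the resulting string varies in length. If a fixed-width value is wanted, a millisecond timestamp is one option. This is only a sketch: that the server expects milliseconds is an assumption, and `make_timestamp_ms` is a hypothetical helper.

```python
import time


def make_timestamp_ms(now=None):
    """Return Unix time in milliseconds as a string of digits.

    `now` is an optional float of seconds since the epoch, mainly for testing.
    """
    if now is None:
        now = time.time()
    return str(int(now * 1000))


timestamp_ms = make_timestamp_ms()
```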

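mouse_pwd_isclick: 0, and 0 it is.

type (sent as __type__ in the request): simply "reply".

Next comes the login problem. Lao Tangzhu takes the lazy route and uses cookies.


Screenshot 7

Copy them out of the request, then: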

def get_cookies(): 
    # sb = "your own cookies"
    sb = "BAIDUID=A4C1F1C2DC2D78995C5E96C0B5823437:FG=1; PSTM=1469801697; BIDUPSID=C44D78ADAB15C76D89820BA40622B137; BDUSS=0FKempYflhxcjdZd2Z0Wnh5WmVlVW43U1VtSlVyNHZ6UkV-Q3NWeFkzWHM4Y0pYQUFBQUFBJCQAAAAAAAAAAAEAAADgvDV7ZGFyYnJhMDE4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOxkm1fsZJtXc; H_PS_PSSID=20739_1447_18280_17948_20416_17001_15706_11927_20698_20745_20705"
    tmap = {}
    for i in sb.strip().split(";"):       
        key, value = i.split("=", 1)       
        tmap[key.strip()] = value.strip()    
    return tmap
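The parser above raises ValueError if the copied string ends with a semicolon or contains a piece without "=", because the tuple unpacking then fails. A slightly more forgiving variant is sketched below; `parse_cookie_header` is a hypothetical name and the cookie values in the usage line are shortened placeholders.

```python
def parse_cookie_header(raw):
    """Turn a 'k1=v1; k2=v2' string copied from the browser into a dict."""
    cookies = {}
    for piece in raw.split(";"):
        piece = piece.strip()
        if not piece or "=" not in piece:
            continue  # skip empty pieces, e.g. the one after a trailing ';'
        # Split on the first '=' only, since values may contain '=' themselves.
        key, value = piece.split("=", 1)
        cookies[key.strip()] = value.strip()
    return cookies


cookies = parse_cookie_header("BAIDUID=ABC:FG=1; PSTM=1469801697;")
```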

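Putting it together

Step 1: send a GET request to http://tieba.baidu.com/p/4646166740 and extract fid, tbs and the other parameters from the page.
Step 2: send a POST request to http://tieba.baidu.com/f/commit/post/add carrying fid, tid and the rest of the parameters, together with your own cookies.
If "error_code" in the final response is 0, the reply was posted successfully.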
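The success check in the last step can be written as a small function. The sketch assumes the response body is JSON, which the Accept header of the POST request suggests, and takes the field name "error_code" from the text above; the real response may spell the field differently, so the name is a parameter. `reply_succeeded` is a hypothetical helper.

```python
import json


def reply_succeeded(response_body, code_field="error_code"):
    """Return True if the JSON response reports error code 0."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        return False  # not JSON at all, e.g. an HTML error page
    if not isinstance(payload, dict):
        return False
    # Accept both the number 0 and the string "0".
    return str(payload.get(code_field)) == "0"


ok = reply_succeeded('{"error_code": 0}')
```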

By the way, what do these two lines of verse have to do with advertising? I AM ANGRY!


Screenshot 8

The code

#-*-coding:utf-8-*-
import time
import requests
import re
import random

def get_cookies():
    sb = """BAIDUID=A4C1F1C2DC2D78995C5E96C0B5823437:FG=1; PSTM=1469801697; BIDUPSID=C44D78ADAB15C76D89820BA40622B137; BDUSS=0FKempYflhxcjdZd2Z0Wnh5WmVlVW43U1VtSlVyNHZ6UkV-Q3NWeFkzWHM4Y0pYQUFBQUFBJCQAAAAAAAAAAAEAAADgvDV7ZGFyYnJhMDE4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOxkm1fsZJtXc; H_PS_PSSID=20739_1447_18280_17948_20416_17001_15706_11927_20698_20745_20705"""
    tmap = {}
    for i in sb.strip().split(";"):
        key, value = i.split("=", 1)
        tmap[key.strip()] = value.strip()
    return tmap

def get_headers1():
    return {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2',
        'Cache-Control':'max-age=0',
        'Connection':'keep-alive',
        'Host':'tieba.baidu.com',
        'Referer':'http://tieba.baidu.com/p/4695010754',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36'}

def get_headers2():
    return {
        'Accept':'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2',
        'Connection':'keep-alive',
        'Content-Length':'487',
        'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
        'Host':'tieba.baidu.com',
        'Origin':'http://tieba.baidu.com',
        'Referer':'http://tieba.baidu.com/p/4664815593',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest'}

def get_ie():
    return "utf-8"

def get_kw(content):
    return re.findall("""forumName:'(.*?)', """, content)[0]

def get_fid(content):
    return re.findall("""fid:'(.*?)',""", content)[0]

def get_tid(tid):
    return tid

def get_vcode_md5():
    return ""

def get_floor_num():
    return "500"

def get_rich_text():
    return "1"

def get_tbs(content):
    return re.findall("""tbs:\s*'(.*?)',""", content)[0]

def get_content():
    return "苟利国家生死以,岂因祸福避趋之"

def get_files():
    return "[]"

def get_sign_id(content):

    return re.findall('"sign_id":(.*?),', content)[0]

def get_mouse_pwd():
    return "113,114,115,111,113,116,118,122,74,114,111,115,111,114,111,115,111,114,111,115,111,114,111,115,111,114,111,115,74,114,115,117,112,114,74,122,117,115,111,118,119,115,"+str(time.time()).replace(".", "")

def get_mouse_pwd_t():
    return str(time.time()).replace(".", "")

def get_mouse_pwd_isclick():
    return "0"

def get_type():
    return "reply"


def post_one(tid):
    # tid is a list of thread ids; pick one at random to reply to
    tid = random.choice(tid)
    s1 = requests.session()
    headers=get_headers1()
    g1 = s1.get("http://tieba.baidu.com/p/%s"%(tid), headers= headers,cookies=get_cookies())
    data = {
        "ie": get_ie(),
        "kw": get_kw(g1.content),
        "fid": get_fid(g1.content),
        "tid": get_tid(tid),
        "vcode_md5": get_vcode_md5(),
        "floor_num": get_floor_num(),
        "rich_text": get_rich_text(),
        "tbs": get_tbs(g1.content),
        "content": get_content(),
        "files": get_files(),
        "mouse_pwd": get_mouse_pwd(),
        "mouse_pwd_t": get_mouse_pwd_t(),
        "mouse_pwd_isclick": get_mouse_pwd_isclick(),
        "__type__": get_type()

        }
    headers=get_headers2()
    headers["Referer"] = 'http://tieba.baidu.com/p/%s'%(tid)
    p1 = s1.post("http://tieba.baidu.com/f/commit/post/add", headers=headers, cookies=get_cookies(), data=data)
    print p1.content
    return p1.content


post_one(["4646166740"])

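Tested and working. This time the same content was not deleted.


Screenshot 9

Screenshot 10

Fun and delightful: that is Lao Tangzhu's little crawler exchange.

By the way, what is this?


Captcha 1

Captcha 2

Let's talk again when there's a chance.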