Python requests crawler – 槐梦 (personal study notes)

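This post covers a fairly simple kind of crawler: one built with Python's requests library and regular expressions. My IDE is PyCharm; the required libraries can be added under Settings -> Python Interpreter.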

The requests library is mainly used to construct a request and get back the page's response; the data to collect is usually in the returned response body.
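As a rough sketch of the calls that usually matter here, the helper below wraps requests.get, checks the status code, and returns the body text. The function name fetch_html, the `get` parameter (there only so a stub can be passed in when trying it offline), the 10-second timeout, and the User-Agent string are all assumptions for illustration, not part of the original notes.

```python
def fetch_html(url, params=None, get=None, timeout=10):
    """Return the HTML text of `url`, raising on HTTP error status codes.

    `get` is a hypothetical hook: by default it is requests.get, but any
    callable with the same keyword arguments can be passed in instead.
    """
    if get is None:
        # Imported lazily so the helper can be exercised with a stub `get`.
        import requests
        get = requests.get

    # Hypothetical User-Agent; some sites reject the library's default one.
    headers = {"User-Agent": "Mozilla/5.0 (study-crawler sketch)"}

    response = get(url, params=params, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text         # body decoded to str via response.encoding
```

Calling fetch_html(url, params={'pn': 0}) lets requests append the query string itself; response.content would give the raw bytes instead of decoded text.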

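A requests crawler is the most basic kind of crawler. The idea is to use the requests library to visit each target page in a prepared URL list and get back its HTML source, then use the re (regular expression) module to match the wanted data items and to discover new URLs in the page, which are appended to the target URL list. This repeats until the crawl is done.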
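The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code used later in this post: the names crawl, LINK_RE, and the `fetch` callable are hypothetical, and the link regex is deliberately naive (it only looks at double-quoted href attributes).

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive link matcher (assumption): double-quoted href values, no fragments.
LINK_RE = re.compile(r'href="([^"#]+)"')


def crawl(seed_urls, fetch, item_pattern, max_pages=10):
    """Visit pages breadth-first, collecting regex matches and new links.

    `fetch` is any callable that maps a URL to its HTML text.
    """
    queue = deque(seed_urls)   # URLs still to visit
    seen = set(seed_urls)      # URLs already queued, to avoid loops
    items = []
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages += 1
        # Step 1: match the data items on this page.
        items.extend(item_pattern.findall(html))
        # Step 2: discover new links and add them to the URL list.
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items
```

With requests installed, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. The max_pages cap is there because following every discovered link otherwise never terminates on a real site.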

Taking Baidu Tieba as an example:

# Import the libraries needed for requesting, matching, and storing
import requests
import re
import pymysql

# Database settings; mind the charset setting to avoid garbled text
config = {
    'host':'127.0.0.1',
    'port':3306,
    'user':'root',
    'password':'root',
    'db':'ccnutieba',
    'charset':'utf8',
    'cursorclass':pymysql.cursors.DictCursor,
}

# Open the database connection
connection = pymysql.connect(**config)

# Build the page offset (pn) in a for loop to page through the thread list
for i in range(0,150,50):
    url = 'https://tieba.baidu.com/f?kw=%E5%8D%8E%E4%B8%AD%E5%B8%88%E8%8C%83%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=' + str(i)
    response = requests.get(url)  # send the request with requests and get the page's response
    regex_title = 'class="j_th_tit ">(.+?)</a>'  # regex for the thread title; (.+?) is the captured part
    pattern = re.compile(regex_title)
    result = pattern.findall(response.text)  # response.text is the HTML source as text
    for r in result:  # store each match in the database
        cursor = connection.cursor()
        # Parameterized query: a quote inside a title cannot break the statement
        sql = "INSERT INTO tiebatitle (title) VALUES (%s)"
        cursor.execute(sql, (r,))

connection.commit()
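The URL construction and the title regex from the code above can be tried offline, without hitting Tieba. The sketch below is one way to do that; the helper names build_page_url and extract_titles are hypothetical, and the HTML passed to extract_titles in practice would be whatever the site returns at the time, which may no longer match this pattern.

```python
import re
from urllib.parse import urlencode

# Same pattern as in the crawler above; note the space before the closing quote.
TITLE_RE = re.compile(r'class="j_th_tit ">(.+?)</a>')


def build_page_url(keyword, offset):
    """Build a Tieba list URL; urlencode percent-encodes the keyword as UTF-8."""
    query = urlencode({"kw": keyword, "ie": "utf-8", "pn": offset})
    return "https://tieba.baidu.com/f?" + query


def extract_titles(html):
    """Return every thread title captured by TITLE_RE in the given HTML."""
    return TITLE_RE.findall(html)
```

build_page_url('华中师范大学', 50) yields the same kind of URL as the hard-coded string in the loop, so the long %E5%8D%8E... sequence is just the UTF-8 percent-encoding of the forum name.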

The final result: the matched thread titles end up as rows in the tiebatitle table.
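To illustrate storing titles and reading them back, here is a small sketch that swaps in Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, purely so it runs without a server. The single-column table layout and the helper names store_titles and load_titles are assumptions; note that sqlite3 uses ? as its placeholder where pymysql uses %s.

```python
import sqlite3


def store_titles(conn, titles):
    """Insert titles with a parameterized statement (stand-in schema)."""
    # Assumed layout: one text column named title.
    conn.execute("CREATE TABLE IF NOT EXISTS tiebatitle (title TEXT)")
    conn.executemany(
        "INSERT INTO tiebatitle (title) VALUES (?)",
        [(t,) for t in titles],
    )
    conn.commit()


def load_titles(conn):
    """Return the stored titles in insertion order."""
    rows = conn.execute("SELECT title FROM tiebatitle ORDER BY rowid")
    return [row[0] for row in rows]
```

A quick check would be conn = sqlite3.connect(':memory:'), then store_titles(conn, [...]) followed by load_titles(conn). Titles containing quotes are stored unchanged because the values are passed as parameters rather than concatenated into the SQL string.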
