Getting Started with Python Web Crawlers

I. Basic crawling with urllib

1. Import the modules

python

>>> import urllib
>>> import urllib.request

2. Fetch the page data

python

>>> url = "http://www.jd.com"
>>> data = urllib.request.urlopen(url).read().decode("utf-8","ignore")
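
The second argument to decode() above, "ignore", tells Python to drop bytes that are not valid UTF-8 instead of raising an error. The following is a minimal offline sketch of that behaviour; the helper name decode_page and the sample bytes are made up for illustration and are not part of the original session.

```python
def decode_page(raw, encoding="utf-8"):
    """Decode raw page bytes, silently dropping bytes that are invalid."""
    return raw.decode(encoding, "ignore")


# Sample bytes (an assumption): valid UTF-8 text with one stray 0xFF byte.
raw = "京东".encode("utf-8") + b"\xff" + b" JD.COM"

print(decode_page(raw))  # 京东 JD.COM

# Without "ignore", the same bytes raise UnicodeDecodeError.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

Dropping bytes keeps the crawler running, but it can also hide a wrong encoding guess, so it is worth checking the page's declared charset when the text looks damaged.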

3. Extract the title

python

>>> import re
>>> reg="<title>(.*?)</title>"
>>> re.compile(reg,re.S).findall(data)
['京东(JD.COM)-正品低价、品质保障、配送及时、轻松购物！']
>>>
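
The title lookup above can be wrapped in a small reusable function. This is only a sketch: the name extract_title and the sample HTML are assumptions, and a regular expression is a quick approach that works for simple pages rather than a full HTML parser.

```python
import re

# Compile once; re.S lets "." match newlines, re.I also accepts <TITLE>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.S | re.I)


def extract_title(html):
    """Return the stripped text of the first <title> element, or None."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Sample HTML (an assumption) so the example runs without network access.
sample = "<html><head><TITLE>\n  Sample page\n</TITLE></head></html>"
print(extract_title(sample))  # Sample page
```

Returning None when no title is found avoids the IndexError that indexing an empty findall() result would cause.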

4. Download the page

python

>>> url = "http://www.jd.com"
>>> urllib.request.urlretrieve(url,filename=r"E:\Users\asus\Desktop\jd.html")
('E:\\Users\\asus\\Desktop\\jd.html', <http.client.HTTPMessage object at 0x000001C14FFCD390>)
>>>
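
The hard-coded Windows path above only works on the author's machine. A portable variant might look like the sketch below, which builds the destination with pathlib. The function name download_page and the file names are assumptions, and a local file:// URL stands in for a real site so the demo runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_page(url, dest_dir, name="page.html"):
    """Save the resource at `url` as dest_dir/name and return the saved path."""
    dest = Path(dest_dir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    saved_path, _headers = urllib.request.urlretrieve(url, filename=str(dest))
    return Path(saved_path)


# Offline demo: write a small file and fetch it back through a file:// URL.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.html"
source.write_text("<title>demo</title>", encoding="utf-8")

saved = download_page(source.as_uri(), work_dir / "out", "copy.html")
print(saved.read_text(encoding="utf-8"))  # <title>demo</title>
```

For the page used in this step, the call would be something like download_page("http://www.jd.com", Path.home() / "Desktop", "jd.html").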

5. Disguise the crawler as a browser

python

>>> url = "https://www.qiushibaike.com/"
>>> opener = urllib.request.build_opener()
>>> UA = ('user-agent','Mozilla/5.0')
>>> opener.addheaders=[UA]
>>> urllib.request.install_opener(opener)
>>> data = urllib.request.urlopen(url).read().decode("utf-8","ignore")
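
install_opener() changes the headers for every later urlopen() call in the process. When only some requests should carry the browser header, one alternative is to attach it to a single Request object. The sketch below shows that approach; the helper name build_request is an assumption, and nothing is sent over the network.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries its own User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.qiushibaike.com/")

# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))  # Mozilla/5.0

# To actually send it:
# data = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "ignore")
```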

6. User-agent pool

python

import random
import urllib.request

url = "https://www.qiushibaike.com/"  # same target as in step 5
uapools = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0"
]
def UA():
    opener = urllib.request.build_opener()
    newua = random.choice(uapools)
    ua = ("user-agent",newua)
    opener.addheaders=[ua]
    urllib.request.install_opener(opener)
    print("Currently using UA: "+str(newua))
'''
for i in range(0,10):
    UA()
    data = urllib.request.urlopen(url).read().decode("utf-8","ignore")
    print(len(data))
'''
# Switch to a new UA every three requests
for i in range(0,10):
    if (i%3)==0:
        UA()
    data = urllib.request.urlopen(url).read().decode("utf-8","ignore")
    print(len(data))
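
The "switch every three requests" logic can also be separated from the network calls, which makes it easy to check on its own. The following is a sketch under that assumption: the generator name rotating_user_agents and the parameter every are hypothetical, and the pool is the one defined above.

```python
import itertools
import random

UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
]


def rotating_user_agents(pool, every=3):
    """Yield one user agent per request, drawing a new one every `every` requests."""
    current = None
    for count in itertools.count():
        if count % every == 0:
            current = random.choice(pool)
        yield current


# Take the user agents for six consecutive requests.
agents = list(itertools.islice(rotating_user_agents(UA_POOL), 6))
print(agents[0] == agents[1] == agents[2])  # True
print(agents[3] == agents[4] == agents[5])  # True
```

Because random.choice() draws independently each time, two consecutive groups may end up with the same user agent. Each yielded value can be passed as the User-Agent header of the next request.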

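II. Fetching data with the requests library

1. Request methods: get/post/put…
2. Request parameters: params, headers, proxies, cookies, data
3. Response attributes:
text: the response body decoded as text
content: the response body as raw bytes
encoding: the encoding of the page
cookies: the cookies set by the response
url: the URL of the current request
status_code: the HTTP status code of the response
4. Request a page: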

python

>>> import requests
>>> import re
>>> url = "http://www.baidu.com/"
>>> r = requests.get(url)
>>> r.encoding = r.apparent_encoding
>>> title = re.findall('<title>(.*?)</title>',r.text)
>>> title
['百度一下，你就知道']

5. Disguise as a browser

python

>>> hd = {"user-agent":"Mozilla/5.0"}
>>> r = requests.get(url,headers=hd)
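
To see how the params and headers arguments listed earlier end up in the outgoing request, requests can prepare a request without sending it. This is a small sketch; the helper name prepare_get and the query parameter wd=python are assumptions chosen for illustration.

```python
import requests


def prepare_get(url, params=None, user_agent="Mozilla/5.0"):
    """Build a GET request with query parameters and a browser-like header, unsent."""
    request = requests.Request(
        "GET", url, params=params, headers={"user-agent": user_agent}
    )
    return request.prepare()


prepared = prepare_get("http://www.baidu.com/", params={"wd": "python"})

print(prepared.url)                    # http://www.baidu.com/?wd=python
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (header lookup ignores case)

# To actually send it:
# r = requests.Session().send(prepared, timeout=10)
```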

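III. Using the Scrapy crawler framework

1. Common commands
(1) Initialize a project
Console: scrapy startproject PROJECT_NAME
(2) List the available spider templates
Console: scrapy genspider -l
(3) Create a spider
Console: scrapy genspider -t TEMPLATE SPIDER_NAME DOMAIN
(4) Run a spider
Console: scrapy crawl SPIDER_NAME
(5) List the spiders in the current project
Console: scrapy list
2. Steps for writing a crawler project:
Create the crawler project
Write the item
Create the spider file
Write the spider file
Write the pipelines
Configure the settings
Exercise: crawl the article titles from Aliwx (阿里文学)
① Create the crawler project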

cmd

scrapy startproject ali_first

② Write the item

python

import scrapy

class AliFirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()

③ Create the spider file

cmd

scrapy genspider -t basic fst aliwx.com.cn

④ Write the spider file

python

# -*- coding: utf-8 -*-
import scrapy
from ali_first.items import AliFirstItem

class FstSpider(scrapy.Spider):
    name = 'fst'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://www.aliwx.com.cn/']

    def parse(self, response):
        item = AliFirstItem()
        item["title"]=response.xpath("//p[@class='title']/text()").extract()
        #print(item["title"])
        yield item

⑤ Write the pipelines

python

class AliFirstPipeline(object):
    def process_item(self, item, spider):
        for i in range(len(item["title"])):
            print("-------")
            print(item["title"][i])
        return item
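
A pipeline can also modify the item before returning it. As an illustration only, the hypothetical CleanTitlePipeline below strips whitespace from each title and drops empty entries. It is not part of the original project, and a plain dict stands in for the Scrapy item so the example runs on its own.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline: strip whitespace from titles and drop empty ones."""

    def process_item(self, item, spider):
        cleaned = [title.strip() for title in item["title"]]
        item["title"] = [title for title in cleaned if title]
        return item


# A plain dict stands in for AliFirstItem; the spider argument is unused here.
item = {"title": ["  First title\n", "", "Second title "]}
result = CleanTitlePipeline().process_item(item, spider=None)
print(result["title"])  # ['First title', 'Second title']
```

To use such a class in the project, it would presumably be added to ITEM_PIPELINES with a number lower than 300 so that it runs before the printing pipeline.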

⑥ Configure the settings

python

ITEM_PIPELINES = {
    'ali_first.pipelines.AliFirstPipeline': 300,
}

Finally, run the crawler:

cmd

scrapy crawl fst

The output is as follows: