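The first step of a crawler is to send a simulated request to a web page. In Python this is usually done with either the standard-library urllib module or the requests library. requests is a higher-level wrapper (it is built on urllib3 rather than on the standard-library urllib), and from a practical standpoint requests is generally the recommended choice.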

Sending a page request with requests.get

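Using the requests library means calling requests.get with a URL and optional parameters. It returns a Response object; printing that object shows the response status code, for example <Response [200]>.
The three most important members of a Response object:

  • text: the body as a str (Unicode), decoded with the encoding that requests infers, normally the one declared in the response's Content-Type header.
  • content: the body as raw bytes.
  • json(): a method that parses a JSON body and returns the corresponding Python object, such as a dict or a list.

Use text to extract text. To fetch images, files or other binary data, use content. After calling decode() on content with the right encoding, Chinese characters also display correctly.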
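
The three members described above can be compared side by side with a small sketch. It assumes the requests package is installed; the helper names summarize and fetch, and the 10-second timeout, are illustrative choices rather than anything prescribed by the library.

```python
import requests


def summarize(resp):
    """Collect the different views of the body that a Response offers."""
    info = {
        "repr": repr(resp),           # what print(resp) shows
        "status": resp.status_code,   # integer status code
        "encoding": resp.encoding,    # encoding used to build resp.text
        "text": resp.text,            # body decoded to str
        "content": resp.content,      # body as raw bytes
    }
    try:
        info["json"] = resp.json()    # body parsed into Python objects
    except ValueError:
        info["json"] = None           # the body was not valid JSON
    return info


def fetch(url, params=None):
    """Send a GET request and summarize the response."""
    resp = requests.get(url, params=params, timeout=10)
    return summarize(resp)
```

Calling fetch with a page URL returns a dict in which text is a str and content is bytes, so content.decode(encoding) and text should agree whenever the encoding was detected correctly.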

Modifying the request headers (Headers)

pcUserAgent = {
"safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
"IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}

mobileUserAgent = {
"iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UC":"User-Agent: UCWEB7.0.2.37/28/999",
"UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
"UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
"UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
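
The values in the two tables above carry a leading "User-Agent:" label, which is not part of the header value itself. A minimal sketch of turning one entry into a headers dict follows; the helper name build_headers and the two-entry sample_agents table (copied from pcUserAgent) are illustrative assumptions.

```python
import random

# Two entries copied from the pcUserAgent table, so the sketch runs on its own.
sample_agents = {
    "IE 9.0": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
    "Maxthon": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
}


def build_headers(user_agents, name=None):
    """Return a headers dict built from one entry of a User-Agent table.

    If name is None, an entry is picked at random.
    """
    if name is None:
        name = random.choice(list(user_agents))
    raw = user_agents[name]
    # Drop the "User-Agent:" label; only the value belongs in the header.
    prefix = "User-Agent:"
    value = raw[len(prefix):] if raw.startswith(prefix) else raw
    return {"User-Agent": value.strip()}


print(build_headers(sample_agents, "Maxthon"))
```

With requests, the result can then be passed as requests.get(url, headers=build_headers(pcUserAgent)).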

Viewing cookies
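
One way to inspect cookies with requests, given as a hedged sketch: a Response exposes the cookies set by the server through its cookies attribute (a cookie jar), and requests.utils.dict_from_cookiejar turns that jar into a plain dict. The helper name show_cookies is an illustrative assumption, and the requests package is assumed to be installed.

```python
import requests


def show_cookies(resp):
    """Print each cookie on a Response and return them as a plain dict."""
    for cookie in resp.cookies:
        # Each item is an http.cookiejar.Cookie with name, value, domain, path.
        print(cookie.name, cookie.value, cookie.domain, cookie.path)
    return requests.utils.dict_from_cookiejar(resp.cookies)
```

Typical use would be show_cookies(requests.get(url)); a requests.Session keeps the same kind of jar in its cookies attribute across several requests.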

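Adding a proxy

When crawling in practice you may need to fetch some overseas websites, which usually means going through a proxy. With the requests library, a proxy is added as follows: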

import requests

# Proxy address in the form {proxy IP}:{port}
proxy = '192.168.0.1:1234'
proxies = {
    "http": "http://%s/" % proxy,
    "https": "http://%s/" % proxy,
}
headers = {"User-Agent": "Mozilla/5.0"}
url = ''  # fill in the target URL
r = requests.get(url, headers=headers, proxies=proxies)  # pass the extra proxies argument
print(r.status_code)
print(r.text)

Fetching a page with urllib.request

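Python's urllib package is mainly responsible for opening URLs and handling protocols such as HTTP.
With urllib, you first create a urllib.request.Request object and pass it to urllib.request.urlopen, which performs the HTTP request.
The return value is an http.client.HTTPResponse object that wraps the HTML of the page. read() returns the body as bytes, and .read().decode() converts it to a str, after which Chinese characters display correctly.
According to the official documentation, urllib.request.urlopen can open HTTP, HTTPS and FTP URLs; it is mainly used for HTTP.

urllib.request.urlopen(url, timeout=timeout)

The timeout parameter sets the timeout in seconds. Pass it by keyword, because the second positional parameter of urlopen is data.
Methods of the returned object:

  • geturl() returns the URL of the response; it is often used to detect URL redirects.
  • info() returns the basic information about the response (its headers).
  • getcode() returns the status code of the response. The most common codes are 200 (the server returned the page successfully), 404 (the requested page does not exist) and 503 (the server is temporarily unavailable).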
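
The three accessors and the timeout parameter can be combined in a short sketch. The helper name describe and the keys of the returned dict are illustrative assumptions. One detail worth showing is that urlopen raises urllib.error.HTTPError for error statuses such as 404 or 503, so those codes are read from the exception rather than from getcode().

```python
import urllib.error
import urllib.request


def describe(url, timeout=5.0):
    """Open url and report the final URL, the status code and the body."""
    try:
        # timeout is passed by keyword; the second positional
        # parameter of urlopen is the request body (data).
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            headers = resp.info()
            charset = headers.get_content_charset() or "utf-8"
            return {
                "final_url": resp.geturl(),   # differs from url after a redirect
                "status": resp.getcode(),     # 200 on success
                "content_type": headers.get("Content-Type"),
                "html": resp.read().decode(charset, errors="replace"),
            }
    except urllib.error.HTTPError as exc:
        # Error statuses (404, 503, and so on) arrive as exceptions.
        return {"status": exc.code, "error": str(exc.reason)}
    except urllib.error.URLError as exc:
        # No HTTP response at all: DNS failure, refused connection,
        # or a timeout while connecting.
        return {"status": None, "error": str(exc.reason)}
```

For example, describe("http://www.baidu.com") would return a dict whose status is 200 when the page loads, while a missing page yields status 404 together with the reason text.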

import http.cookiejar
import urllib.request

url = "http://www.baidu.com"
response1 = urllib.request.urlopen(url)
print("Method 1")
# Get the status code; 200 means success
print(response1.getcode())
# Get the length of the page content
print(len(response1.read()))


# Add handlers for special situations

print("Method 2")
request = urllib.request.Request(url)
# Crawl while presenting a Mozilla-style User-Agent
request.add_header("User-Agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print("Method 3")
cookie = http.cookiejar.CookieJar()
# Give urllib.request the ability to handle cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
print(cookie)

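Adding a proxy

When accessing a site, urllib.request sends its own Python version as the User-Agent (for example Python-urllib/3.x). We can change it with the add_header() method of a urllib.request.Request object.
To avoid being refused by sites that block certain IPs, we can also give urllib a proxy. Here is one way to add a proxy with urllib.request (the key part is the ProxyHandler setup in the useProxy method):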

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'hstking hst_king@hotmail.com'

import re
import sys
import urllib.request


def testArgument():
    '''Check the command-line input: exactly one argument is expected.'''
    if len(sys.argv) != 2:
        print('Exactly one argument is required')
        tipUse()
        sys.exit()
    else:
        TestProxy(sys.argv[1])


def tipUse():
    '''Print usage information.'''
    print('This program takes a single argument, which must be a usable proxy')
    print('usage: python testUrllib2WithProxy.py http://1.2.3.4:5')
    print('usage: python testUrllib2WithProxy.py https://1.2.3.4:5')


class TestProxy(object):
    '''Test whether a proxy works.'''
    def __init__(self, proxy):
        self.proxy = proxy
        self.checkProxyFormat(self.proxy)
        self.url = 'https://www.baidu.com'
        self.timeout = 5
        self.flagWord = 'www.baidu.com'  # keyword to look for in the returned page
        self.useProxy(self.proxy)

    def checkProxyFormat(self, proxy):
        '''Check that the proxy looks like protocol://a.b.c.d:port.'''
        proxyMatch = re.compile(r'https?://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}$')
        try:
            re.search(proxyMatch, proxy).group()
        except AttributeError:
            tipUse()
            sys.exit()
        flag = True
        proxy = proxy.replace('//', '')
        try:
            protocol = proxy.split(':')[0]
            ip = proxy.split(':')[1]
            port = proxy.split(':')[2]
        except IndexError:
            print('Index out of range')
            tipUse()
            sys.exit()
        # Check each part of the proxy address
        flag = flag and len(proxy.split(':')) == 3 and len(ip.split('.')) == 4
        flag = ip.split('.')[0] in map(str, range(1, 256)) and flag
        flag = ip.split('.')[1] in map(str, range(256)) and flag
        flag = ip.split('.')[2] in map(str, range(256)) and flag
        flag = ip.split('.')[3] in map(str, range(1, 255)) and flag
        flag = protocol in ['http', 'https'] and flag
        flag = port in map(str, range(1, 65536)) and flag
        if flag:
            print('The given HTTP proxy address is well formed')
        else:
            tipUse()
            sys.exit()

    def useProxy(self, proxy):
        '''Fetch Baidu through the proxy and look for the keyword.'''
        # Register the proxy for both schemes: self.url is an https URL, and a
        # handler keyed only by an http proxy's scheme would not be applied to it.
        proxy_handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
        opener = urllib.request.build_opener(proxy_handler)
        urllib.request.install_opener(opener)
        try:
            response = urllib.request.urlopen(self.url, timeout=self.timeout)
        except Exception:
            print('Connection error, exiting')
            sys.exit()
        result = response.read().decode('utf-8')
        print('%s' % result)
        if re.search(self.flagWord, result):
            print('Keyword found, the proxy works')
        else:
            print('The proxy does not work')


if __name__ == '__main__':
    testArgument()
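
The script above installs its opener globally with install_opener. As a smaller sketch of the two ideas from this section (overriding the default User-Agent with Request.add_header and routing traffic through ProxyHandler), the following keeps the opener local. The helper names build_request and build_proxy_opener, and the placeholder proxy address, are illustrative assumptions and not part of the original script.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request whose User-Agent replaces the default Python-urllib one."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def build_proxy_opener(proxy):
    """Return an opener that sends http and https requests through proxy.

    proxy is a string such as "http://1.2.3.4:5" (a placeholder address).
    """
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    # build_opener returns a private opener; unlike install_opener, it does
    # not change what plain urllib.request.urlopen() does elsewhere.
    return urllib.request.build_opener(handler)
```

A request would then be sent with build_proxy_opener(proxy).open(build_request(url), timeout=5), which needs a reachable proxy to succeed.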