Scraping Lianjia rentals with Python: collecting listing links and page details
Date: 2022-05-08
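Because of final exams, a crawler I had planned to finish in a week dragged on much longer. There was an upside, though: my earlier attempts kept getting blocked by the site's anti-scraping checks, so between revision sessions I wrote some Python code to get around them. I'm posting it below for discussion. The countermeasures I used are as follows.
Pick a random set of request headers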
headers.py
__author__ = 'Lee'
import random  # random-number module


def requests_headers():
    head_connection = ['Keep-Alive', 'close']
    head_accept = ['text/html,application/xhtml+xml,*/*']
    head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3']
    head_user_agent = [
        'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
        'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
        'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
        'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
        'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1 ',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 ',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11',
    ]
    # Build one header set by picking a random value from each list above
    header = {
        'Connection': random.choice(head_connection),
        'Accept': head_accept[0],
        'Accept-Language': random.choice(head_accept_language),
        'User-Agent': random.choice(head_user_agent),
    }
    print('headers.py connection Success!')
    return header  # the return value is the header dict


# for i in range(100):  # generate 100 random header sets
#     print(requests_headers())  # print each one
# # print(random.randrange(1, 10))
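The commented-out loop at the end of headers.py prints 100 header sets so you can eyeball them. A slightly more systematic check is to count how often each User-Agent comes up. The sketch below is only an illustration: `SAMPLE_USER_AGENTS`, `random_headers` and `user_agent_spread` are hypothetical names, and the three-entry pool is a stand-in for the full list above.

```python
import random
from collections import Counter

# Stand-in pool for illustration; the full list lives in headers.py.
SAMPLE_USER_AGENTS = [
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
]


def random_headers(rng=random):
    """Return one header dict with randomly chosen values."""
    return {
        'Connection': rng.choice(['Keep-Alive', 'close']),
        'Accept': 'text/html,application/xhtml+xml,*/*',
        'User-Agent': rng.choice(SAMPLE_USER_AGENTS),
    }


def user_agent_spread(n=300, seed=0):
    """Draw n header sets with a seeded RNG and count each User-Agent."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    return Counter(random_headers(rng)['User-Agent'] for _ in range(n))


if __name__ == '__main__':
    for agent, count in user_agent_spread().most_common():
        print(count, agent)
```

If one User-Agent dominates or never appears, the pool or the selection logic probably needs another look.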
Pick a random proxy from an IP pool
ip_proxy.py
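__author__ = 'Lee'
import random

# Pool of candidate proxies in 'host:port' form
ip_pool = [
    '117.143.109.136:80',
]


def ip_proxy():
    ip = random.choice(ip_pool)
    proxy_ip = 'http://' + ip
    # requests picks the proxy by URL scheme. The Lianjia URLs are https,
    # so with only an 'http' entry they would bypass the proxy entirely.
    proxies = {'http': proxy_ip, 'https': proxy_ip}
    print(proxies)
    return proxies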
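Free proxies tend to die without warning, so a single random pick can fail the whole request. One possible extension, sketched here under assumptions, is to retry through a different proxy when a request raises. `fetch_with_proxy_rotation` and its `fetch` parameter are hypothetical names, not part of the original code; `fetch` is injected so the logic can be tried without network access.

```python
import random


def fetch_with_proxy_rotation(url, fetch, pool, max_attempts=3):
    """Call fetch(url, proxies) through randomly chosen proxies.

    `fetch` is a caller-supplied callable (hypothetical); with requests it
    could wrap requests.get(url, proxies=proxies, timeout=10).
    A proxy that raises is dropped from the candidates for this call only.
    """
    candidates = list(pool)  # copy so the shared pool is not mutated
    last_error = None
    for _ in range(max_attempts):
        if not candidates:
            break
        ip = random.choice(candidates)
        proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
        try:
            return fetch(url, proxies)
        except Exception as exc:  # real code would catch requests.RequestException
            last_error = exc
            candidates.remove(ip)
    raise RuntimeError('all proxies failed for ' + url) from last_error


if __name__ == '__main__':
    def fake_fetch(url, proxies):
        # Simulated transport: only the 'good' proxy works.
        if 'good' not in proxies['http']:
            raise ConnectionError('proxy unreachable')
        return 'ok via ' + proxies['http']

    print(fetch_with_proxy_rotation('https://example.com', fake_fetch,
                                    ['bad:80', 'good:80']))
```

Copying the pool keeps one bad request from permanently shrinking `ip_pool`; whether dead proxies should be removed for good is a separate design decision.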
items_combination.py
__author__ = 'Lee'
from bs4 import BeautifulSoup
import requests
import pymongo
import time
from headers import requests_headers
from ip_proxy import ip_proxy

client = pymongo.MongoClient('localhost', 27017)  # connect to the database
ceshi = client['ceshi']
url_list = ceshi['url_list']
item_list = ceshi['item_info']
url_list1 = []
channel = 'https://bj.lianjia.com/zufang/dongcheng/'


# spider1: collect the listing links (the MongoDB insert is commented out)
def get_pages_url(channel, pag):
    url = channel + 'pg' + str(pag)
    wb_data = requests.get(url, headers=requests_headers(), proxies=ip_proxy())
    soup = BeautifulSoup(wb_data.text, 'lxml')
    time.sleep(1)
    # Text the site shows when a page has no results
    no_data = '呣..没有找到相关内容,请您换个条件试试吧~'
    # Breadcrumbs block
    bread_crumbs = soup.select('#house-lst > li')
    item_url = soup.select('#house-lst > li > div > h2 > a')
    blank_url = str(soup.find(string=no_data))
    if no_data != blank_url:
        for link in item_url:  # renamed so the loop no longer shadows `url`
            url1 = link.get('href')
            url_list1.append(url1)
            # url_list.insert_one({'url': url1})
            print(url1)


# get_pages_url(channel, '2')


# spider2: scrape the detail page and store the result in MongoDB
def get_massages(url):
    web_data = requests.get(url, headers=requests_headers(), proxies=ip_proxy())
    soup = BeautifulSoup(web_data.text, 'lxml')
    title = soup.title.text.split('|')[0]  # listing title
    address = soup.select('div.zf-room > p > a')[0].text  # address
    price = soup.select('div.price > span.total')[0].text + '元'  # price in yuan
    area = soup.select('div.zf-room > p')[0].text.split(':')[-1]  # floor area
    item = {
        'title': title,
        'address': address,
        'price': price,
        'area': area,
        'home_url': url,
    }
    print(item)
    item_list.insert_one(item)


get_massages('https://bj.lianjia.com/zufang/101101635089.html')

# Selector notes for the "no results" element:
#   #house-lst > li > p
#   list-no-data clear
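The script defines both spiders but never chains them: `get_pages_url` is only called (commented out) for page 2, and `get_massages` for a single hard-coded listing. One way to drive the first stage is to walk the `pg<N>` pages until one comes back empty. This is a rough sketch, not the author's method: `build_page_url`, `crawl_channel` and the `fetch_links` parameter are hypothetical names, and `fetch_links` stands in for a variant of `get_pages_url` that returns its links instead of appending to a global list.

```python
def build_page_url(channel, page):
    """Return the listing URL for a 1-based page number, e.g. .../pg2."""
    return '{}pg{}'.format(channel, page)


def crawl_channel(channel, fetch_links, max_pages=100):
    """Collect listing links page by page until a page yields none.

    `fetch_links(url)` is a caller-supplied callable (hypothetical) that
    returns the list of listing hrefs found on one result page.
    """
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        links = fetch_links(build_page_url(channel, page))
        if not links:  # empty page: assume we ran past the last page
            break
        for link in links:
            if link not in seen:  # skip duplicates across pages
                seen.add(link)
                ordered.append(link)
    return ordered


if __name__ == '__main__':
    # Fake site with two result pages, for an offline run.
    channel = 'https://bj.lianjia.com/zufang/dongcheng/'
    fake_pages = {
        channel + 'pg1': ['/zufang/a.html', '/zufang/b.html'],
        channel + 'pg2': ['/zufang/b.html', '/zufang/c.html'],
    }
    print(crawl_channel(channel, lambda url: fake_pages.get(url, [])))
```

Each collected link could then be handed to `get_massages`, ideally with a short `time.sleep` between requests as the first spider already does.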