Reference: https://blog.csdn.net/weixin_43066287/article/details/116757164
Install Docker on a local Windows machine or on a server
Tutorial: https://linux265.com/news/3787.html
2.2 Start the Splash container
Map port 8050 on the host to port 8050 in the container.
docker run -p 8050:8050 scrapinghub/splash
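Once the container is running, Splash can be queried directly over HTTP. The sketch below only builds the request URL for Splash's render.html endpoint; it assumes the default port mapping shown above. The helper name render_html_url is made up for illustration and is not part of Splash or scrapy-splash.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable on the host port mapped by `docker run`.
SPLASH_URL = "http://localhost:8050"


def render_html_url(page_url, wait=2.0, splash_url=SPLASH_URL):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    # Open the printed URL in a browser to confirm that Splash responds.
    print(render_html_url("https://example.com"))
```

If the printed URL returns the rendered HTML of the page, the container is working and Scrapy can be pointed at it.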
2.3 Install scrapy-splash
pip install scrapy-splash
2.4 Install Pillow (image processing)
pip install Pillow
# settings.py

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# List the pipelines you want to run here. Items pass through them in
# ascending order of value, so a lower number runs first.
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,  # project-defined pipeline
    'scrapy.pipelines.images.ImagesPipeline': 1,  # built into Scrapy
}

# Image handling
IMAGES_STORE = 'images'  # directory where downloaded images are saved
IMAGES_URLS_FIELD = 'img_url'  # item field that holds the image URLs

# Splash
# Address of the Splash server
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Use Splash's own duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# If you use Splash's HTTP cache, also set a custom cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
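The numbers in ITEM_PIPELINES decide the order in which items pass through the pipelines: Scrapy runs them from the lowest value to the highest. A minimal sketch of that ordering, using the two entries from the settings above (pipeline_order is a hypothetical helper, not a Scrapy function):

```python
ITEM_PIPELINES = {
    "spider.pipelines.SpiderPipeline": 300,
    "scrapy.pipelines.images.ImagesPipeline": 1,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending value (hypothetical helper)."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    for path in pipeline_order(ITEM_PIPELINES):
        print(ITEM_PIPELINES[path], path)
```

With these values the built-in ImagesPipeline (1) handles each item before the project's SpiderPipeline (300).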
# items.py
import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    img_url = scrapy.Field()  # same name as IMAGES_URLS_FIELD in the settings
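The ImagesPipeline reads the field named by IMAGES_URLS_FIELD and expects it to hold a list of absolute URLs, not a single string. The following is a rough sketch of how img_url could be prepared before the item is yielded. The function name to_image_url_list and the example.com image paths are invented for illustration and do not come from the original project.

```python
from urllib.parse import urljoin


def to_image_url_list(page_url, srcs):
    """Resolve src values against page_url and return them as a list.

    Hypothetical helper: accepts one src string or an iterable of them.
    """
    if isinstance(srcs, str):
        srcs = [srcs]
    # Skip empty values; urljoin handles scheme-relative ("//host/a.jpg")
    # and root-relative ("/a.jpg") paths.
    return [urljoin(page_url, src) for src in srcs if src]


if __name__ == "__main__":
    page = "https://item.jd.com/34637635130.html"
    print(to_image_url_list(page, "//img.example.com/a.jpg"))
```

In a spider callback the result would then be assigned with something like item['img_url'] = to_image_url_list(response.url, srcs).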
import scrapy
from scrapy_splash import SplashRequest

lua_script = '''
function main(splash)
    splash:go(splash.args.url)  -- open the page
    splash:wait(2)              -- wait for it to load
    return splash:html()        -- return the rendered HTML
end
'''


class NetbianSpider(scrapy.Spider):
    name = 'netbian'
    allowed_domains = ['jd.com']
    start_urls = ['https://item.jd.com/34637635130.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                endpoint='execute',
                args={
                    'lua_source': lua_script,
                    # Timeout in seconds. Slow pages can otherwise end in a 504,
                    # so set a large value.
                    'timeout': 90,
                    # Not read by the Lua script above, which waits a fixed 2 seconds.
                    'wait': 0.5,
                },
                cache_args=['lua_source'],
                callback=self.parse,
            )

    def parse(self, response):
        price = response.xpath('//span[@class="price J-p-34637635130"]/text()').extract_first()
        print("Price:", price)
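The XPath in parse hard-codes the product ID that also appears in the start URL. If the spider is later pointed at other product pages, the ID could be derived from the URL instead. The sketch below assumes that other pages follow the same URL shape and the same "price J-p-<id>" class pattern, which the original only demonstrates for this one product. Both function names are hypothetical.

```python
import re


def sku_from_url(url):
    """Extract the numeric product ID from a URL like .../34637635130.html."""
    match = re.search(r"/(\d+)\.html", url)
    return match.group(1) if match else None


def price_xpath(sku):
    """Build the price XPath for a product ID (assumed class pattern)."""
    return f'//span[@class="price J-p-{sku}"]/text()'


if __name__ == "__main__":
    sku = sku_from_url("https://item.jd.com/34637635130.html")
    print(price_xpath(sku))
```

Inside parse this would be used as response.xpath(price_xpath(sku_from_url(response.url))).extract_first().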
8. Pitfalls encountered
WARNING: /xxx…/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
Solution:
In /xxx…/scrapy_splash/request.py, add
from scrapy.utils.python import to_unicode
On line 41, change
url = to_native_str(url)
to
url = to_unicode(url)
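Editing the installed package works, but the change is lost whenever scrapy-splash is reinstalled. As a possible alternative that leaves the library untouched, the warning can be filtered out with Python's standard warnings module. This is only a sketch: the function name is made up, and the message prefix is copied from the warning text above.

```python
import warnings


def silence_to_native_str_warning():
    """Ignore the to_native_str deprecation warning (hypothetical helper).

    The message argument is a regex matched against the start of the
    warning text; category=Warning covers every warning class.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Call to deprecated function to_native_str",
        category=Warning,
    )

# Usage: call silence_to_native_str_warning() once at startup,
# for example at the top of settings.py.
```

The warning is harmless, so hiding it does not change how the spider behaves. It only keeps the log clean.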