Advanced Web Scraping: Automatic JS Rendering with the scrapy-splash Component

Reference: https://blog.csdn.net/weixin_43066287/article/details/116757164

2.1 Install Docker (on a local Windows machine or on a server)

Tutorial: https://linux265.com/news/3787.html

2.2 Start the Splash container
Map host port 8050 to container port 8050.

docker run -p 8050:8050 scrapinghub/splash
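
Before wiring Splash into Scrapy, it is worth confirming the container actually responds. A minimal sketch using requests against Splash's render.html endpoint (this assumes Splash is reachable at localhost:8050; the target URL is just an example):

import requests

# ask Splash to render a page and return the final HTML;
# 'wait' gives the page time to execute its JavaScript
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},
)
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # start of the rendered HTML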

2.3 Install scrapy-splash
pip install scrapy-splash

2.4 Install Pillow (for image handling)
pip install Pillow

Add the following to the project's settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# List the pipelines to run here; they execute in ascending order of
# these values, so ImagesPipeline (1) runs before SpiderPipeline (300).
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,  # defined in this project
    'scrapy.pipelines.images.ImagesPipeline': 1  # ships with Scrapy
}

### Image handling
IMAGES_STORE = 'images'        # directory where downloaded images are stored
IMAGES_URLS_FIELD = 'img_url'  # item field that holds the image URLs


# Splash

# Splash server address
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Use the Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# If you use Scrapy's HTTP cache, a Splash-aware cache storage backend is also required
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
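
HTTPCACHE_STORAGE only takes effect once Scrapy's HTTP cache is actually turned on. A sketch of the pair, assuming you want cached responses during development:

# development convenience: cache responses, with the Splash-aware backend
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'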

In items.py, define an item with a field for the image URLs (the IMAGES_URLS_FIELD setting above points at it):

import scrapy


class SpiderItem(scrapy.Item):
    # the images pipeline reads its download URLs from this field
    img_url = scrapy.Field()
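
ImagesPipeline expects that field to be a list of URLs, even for a single image; downloads land under IMAGES_STORE in a full/ subdirectory, named by a hash of the URL. A sketch of a spider callback that fills the item (the XPath is illustrative only):

def parse(self, response):
    item = SpiderItem()
    # the pipeline needs a list of absolute URLs
    src = response.xpath('//img/@src').extract_first()
    item['img_url'] = [response.urljoin(src)]
    yield item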

Now the spider. A short Lua script tells Splash to open the page, wait for the JavaScript to run, and return the rendered HTML, which Scrapy then parses as usual:

import scrapy
from scrapy_splash import SplashRequest

lua_script = '''
function main(splash)
    splash:go(splash.args.url)        -- open the page
    splash:wait(2)                    -- wait for the content to load
    return splash:html()              -- return the rendered HTML
end
'''

class NetbianSpider(scrapy.Spider):
    name = 'netbian'
    allowed_domains = ['jd.com']
    start_urls = ['https://item.jd.com/34637635130.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                endpoint='execute',
                                args={'lua_source': lua_script,
                                        'timeout': 90,  # raise to avoid 504s on slow pages; Splash caps this at its --max-timeout (start the container with a higher value if needed)
                                        'wait': 0.5},
                                cache_args=['lua_source'],
                                callback=self.parse)

    def parse(self, response):
        price = response.xpath('//span[@class="price J-p-34637635130"]/text()').extract_first()
        print("price:", price)
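
If all the rendering step needs is "open the page, wait, return the HTML", the Lua script is optional: SplashRequest defaults to Splash's render.html endpoint, which does exactly that. A sketch of the same start_requests without Lua:

def start_requests(self):
    for url in self.start_urls:
        # default endpoint 'render.html': load, wait, return HTML
        yield SplashRequest(url, args={'wait': 2}, callback=self.parse)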

8. Pitfalls encountered
WARNING: /xxx…/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

Fix:
In /xxx…/scrapy_splash/request.py, add


from scrapy.utils.python import to_unicode
then on line 41 change

url = to_native_str(url)

to

url = to_unicode(url)
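
Editing a file inside site-packages works, but the change is lost on every reinstall or upgrade. An alternative sketch that leaves the package untouched and just silences this specific warning (for example near the top of settings.py):

import warnings
from scrapy.exceptions import ScrapyDeprecationWarning

# ignore only the to_native_str deprecation raised inside scrapy_splash
warnings.filterwarnings(
    'ignore',
    message=r'.*to_native_str.*',
    category=ScrapyDeprecationWarning,
)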