Downloading and saving files with Scrapy

With Scrapy, crawling across multiple pages is easy to write with CrawlSpider, and downloading files is easy with FilesPipeline.
By default, FilesPipeline names saved files with the SHA1 hash of their URL, so it needs some customization.
Its source code is short and readable, so subclassing and customizing it is straightforward.

CrawlSpider

In short, the key points are:

  • Extract the pages to crawl with a LinkExtractor in rules
  • Extract items from the crawled pages in the callback (see the minimal sketch below)
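
A minimal CrawlSpider sketch of just these two points (the start URL, link pattern, and selectors are placeholders, not the IPA example shown later):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MinimalCrawlSpider(CrawlSpider):
    name = 'minimal'
    start_urls = ['https://example.com/']  # hypothetical start page

    # Follow links whose URL matches the pattern and pass each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract one item per crawled page
        yield {'title': response.css('title::text').get(), 'url': response.url}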

FilesPipeline

In short, the key points are:

  • Specify the download directory with FILES_STORE in settings.py
  • Enable FilesPipeline in ITEM_PIPELINES in settings.py
  • Add a file_urls field to the generated items and set it to the URLs of the files to download
  • Add a files field to the generated items to hold the download results

Using the Files Pipeline

The typical workflow, when using the FilesPipeline goes like this:

In a Spider, you scrape an item and put the URLs of the desired into a file_urls field.

The item is returned from the spider and goes to the item pipeline.

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).

When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field) , and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.

In short: scrape in the Spider and put the target URLs in file_urls; the downloads are scheduled through the standard Scheduler and Downloader, but with higher priority, so they are processed before other pages are scraped. The download results are recorded in files.

Enabling your Media Pipeline

To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.

For Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

Enable it by adding 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.
There is also an ImagesPipeline for image files; a sketch of that variant follows.
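
A minimal sketch of the image variant (the paths and the item class are placeholders; ImagesPipeline additionally requires Pillow):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'download/images'   # download directory for images

# items.py -- ImagesPipeline uses image_urls / images instead of file_urls / files
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()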

Supported Storage - File system storage

The files are stored using a SHA1 hash of their URLs for the file names.

File names default to the SHA1 hash of the URL.
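
Roughly how the default name is derived (a simplified sketch with a hypothetical URL, not the exact FilesPipeline internals):

import hashlib

url = 'https://www.example.com/files/report.pdf'
# The default file_path() is essentially the SHA1 digest of the request URL
# plus the original extension, stored under a 'full/' directory in FILES_STORE.
file_name = 'full/' + hashlib.sha1(url.encode('utf-8')).hexdigest() + '.pdf'
print(file_name)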

Trying CrawlSpider against the IPA information technology exam pages

Structure of the target pages

The starting page consists of links to the past-exam download page for each exam session.

[Screenshot: IPA past-exam index page]

Each of those pages contains links to the past-exam PDFs for each exam category.

[Screenshot: IPA past-exam download page]

project

Create a project that crawls the pages under https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html and downloads the PDFs.
When generating the Spider skeleton, pass -t crawl to get a CrawlSpider skeleton.

scrapy startproject <プロジェクト名>
cd <プロジェクト名>
scrapy genspider -t crawl ipa www.ipa.go.jp

spiders/ipa.py

The rules extract each session's past-exam download page; each page is then parsed and one item is generated per PDF.
file_urls can hold multiple URLs, but here one file is specified per item.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawldownload.items import CrawldownloadItem

class IpaSpider(CrawlSpider):
    name = 'ipa'
    allowed_domains = ['ipa.go.jp']
    start_urls = ['https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html']

    rules = (
        Rule(LinkExtractor(allow=r'1_04hanni_sukiru/mondai_kaitou'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("{}".format(response.css('title::text').get()))

        for main_area in response.css('#ipar_main'):
            exam_seasons = main_area.css('h3').xpath('string()').extract()

            season = 0
            for exam_table in main_area.css('div.unit'):
                exam_season = exam_seasons[season]
                season += 1

                # Generate an item for each PDF file on the page
                for exam_item in exam_table.css('tr'):
                    # Skip header rows, which contain no links
                    if exam_item.css('a').get() is None:
                        continue

                    for exam_link in exam_item.css('a'):
                        exam_pdf = response.urljoin(exam_link.css('a::attr(href)').get())

                        item = CrawldownloadItem()
                        item['season'] = exam_season
                        item['title'] = exam_item.css('td p::text').getall()[1].strip()
                        item['file_title'] = exam_link.css('a::text').get()
                        item['file_urls'] = [exam_pdf]
                        yield item

items.py

The file_urls and files fields are the ones FilesPipeline requires.

import scrapy

class CrawldownloadItem(scrapy.Item):
    season = scrapy.Field()
    title = scrapy.Field()
    file_title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

pipelines.py

FilesPipeline uses a SHA1-hash file name by default, so override the file_path() method in a subclass.
Directories that do not exist are created automatically, so it is enough to build the desired save path and return it.

from scrapy.pipelines.files import FilesPipeline

import os

class CrawldownloadPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_paths = request.url.split("/")
        file_paths.pop(0)  # remove 'https:'
        file_paths.pop(0)  # remove the empty segment from '//'
        file_name = os.path.join(*file_paths)

        return file_name
response.url="https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
↓↓↓
file_name="www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"

settings.py

Enable FilesPipeline:

  • Specify the download directory with FILES_STORE
  • Enable FilesPipeline in ITEM_PIPELINES

The default settings are too aggressive about concurrency, so tone them down:

  • Limit concurrent requests to 1
  • Use a 3-second download delay
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
DOWNLOAD_DELAY = 3

# ... (snip) ...

FILES_STORE = 'download'

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 1,
    'crawldownload.pipelines.CrawldownloadPipeline': 1,
}


Handling command-line arguments with Scrapy's crawl command

 

Passing command-line options to a crawler

Pass them with the -a option, as in scrapy crawl myspider -a category=electronics.

Implementing a constructor

Spiders can access arguments in their __init__ methods:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

Using the default constructor

The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

By default, the spider arguments are simply copied onto the spider as attributes.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
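
If the argument may be omitted on the command line, a small guard avoids an AttributeError (a hedged sketch, not part of the original example; the default value is arbitrary):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # -a category=... becomes self.category; fall back to a default when it is omitted
        category = getattr(self, 'category', 'default-category')
        yield scrapy.Request('http://www.example.com/categories/%s' % category)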


Session handling with Scrapy and Splash

 

Session handling in Splash

When Splash is used on its own, the headless browser (Chromium) running inside it handles sessions, just as with Selenium, so within a single Lua script state is maintained without any extra work.

Between Scrapy and Splash

Splash is stateless across requests from Scrapy, so session handling is needed between Scrapy and the Lua script.
The scrapy-splash README explains how.

Session handling

Splash itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

  1. client (Scrapy) must send current cookies to Splash;
  2. Splash script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
  3. updated cookies should be sent back to the client;
  4. client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides splash:get_cookies() and splash:init_cookies() methods which can be used in Splash Lua scripts.

Splash is stateless, so some code is needed to maintain state:

  1. Scrapy must send the current cookies to Splash
  2. The Splash script drives the site using those cookies and updates them from responses or JavaScript
  3. The updated cookies are returned to Scrapy
  4. Scrapy merges the returned cookies into its current ones

scrapy-splash provides helpers for (1) and (4): to send current cookies in ‘cookies’ field and merge cookies back from ‘cookies’ response field set request.meta[‘splash’][‘session_id’] to the session identifier. If you only want a single session use the same session_id for all request; any value like ‘1’ or ‘foo’ is fine.

scrapy-splash provides the helpers for (1) and (4): set request.meta['splash']['session_id'] to a session identifier and the current cookies are sent and the returned ones merged back automatically. If a single session is enough, use the same session_id (any value such as '1') for every request.
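
A sketch of pinning every request to one session (this assumes the session_id keyword of SplashRequest, which ends up in request.meta['splash']['session_id']; the spider and script are placeholders):

import scrapy
from scrapy_splash import SplashRequest


class SessionSpider(scrapy.Spider):
    name = 'session_example'
    start_urls = ['http://example.com/login']

    def start_requests(self):
        for url in self.start_urls:
            # Requests that share the same session_id reuse the same cookie state
            yield SplashRequest(url, self.parse_result,
                                endpoint='execute',
                                args={'lua_source': '...'},  # a Lua script like the one below
                                session_id='1')

    def parse_result(self, response):
        pass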

For scrapy-splash session handling to work you must use /execute endpoint and a Lua script which accepts ‘cookies’ argument and returns ‘cookies’ field in the result:

To enable this session handling, use the /execute endpoint and write a Lua script that accepts the cookies argument and returns a cookies field in its result.

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end

SplashRequest sets session_id automatically for /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, /execute endpoint and a compatible Lua rendering script.

In other words, with SplashRequest, the /execute endpoint, and a compatible Lua script, session handling can be implemented.

Structure of the response returned via Splash

All these responses set response.url to the URL of the original request (i.e. to the URL of a website you want to render), not to the URL of the requested Splash endpoint. “True” URL is still available as response.real_url.
SplashJsonResponse provides extra features:

  • response.data attribute contains response data decoded from JSON; you can access it like response.data[‘html’].
  • If Splash session handling is configured, you can access current cookies as response.cookiejar; it is a CookieJar instance.
  • If Scrapy-Splash response magic is enabled in request (default), several response attributes (headers, body, url, status code) are set automatically from original response body:
    • response.headers are filled from ‘headers’ keys;
    • response.url is set to the value of ‘url’ key;
    • response.body is set to the value of ‘html’ key, or to base64-decoded value of ‘body’ key;
    • response.status is set from the value of ‘http_status’ key.
  • response.url is set to the URL of the rendered page
  • response.real_url is the Splash URL (http://splash:8050/execute)
  • response.data gives access to the data returned by Splash
  • Cookies can be accessed via response.cookiejar
  • With Scrapy-Splash response magic, several response attributes are filled in automatically from the rendered page (see the sketch after this list)
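
A sketch of reading these attributes in a callback (the spider and the field names returned by the Lua script are placeholders):

import scrapy


class RenderSpider(scrapy.Spider):
    name = 'render_example'

    def parse_result(self, response):
        # response is a SplashJsonResponse when /execute returns a JSON object
        html = response.data.get('html')           # raw fields returned by the Lua script
        status = response.data.get('http_status')
        cookies = response.cookiejar                # a CookieJar, when session handling is set up
        self.logger.info('rendered %s (%s) with %d cookies',
                         response.url, status, len(cookies))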

Sample code for session handling

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""

class MySpider(scrapy.Spider):

    # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
            headers={'X-My-Header': 'value'},
        )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.

Points to note in the request

The important point is the use of the /execute endpoint.
args passes the Lua script and its parameters to Splash.

yield SplashRequest(url, self.parse_result,
    endpoint='execute',
    cache_args=['lua_source'],
    args={'lua_source': script},
    headers={'X-My-Header': 'value'},
)

The cookies are initialized from the parameters passed in via SplashRequest.

splash:init_cookies(splash.args.cookies)
assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
})
assert(splash:wait(0.5))

Points to note in the response

The headers and cookies of the last response are returned.

local entries = splash:history()
local last_response = entries[#entries].response
return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
}


Logging in efficiently with Scrapy's loginform

 

scrapy/loginform

It helps with filling in login forms. Install it with pip install loginform.

Preparing the project

$scrapy startproject scrapy_login
New Scrapy project 'scrapy_login', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_login

You can start your first spider with:
cd scrapy_login
scrapy genspider example example.com
$cd scrapy_login
$scrapy genspider github github.com
Created spider 'github' using template 'basic' in module:
scrapy_login.spiders.github
├── result.json
├── scrapy.cfg
└── scrapy_login
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        └── github.py

Customizing settings.py

ROBOTSTXT_OBEY

GitHub's robots.txt denies crawler access, so temporarily disable robots.txt handling.

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

Customizing items.py

class ScrapyLoginItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #pass
    repository_name = scrapy.Field()
    repository_link = scrapy.Field()

Customizing github.py to implement the Spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from loginform import fill_login_form
from scrapy_login.items import ScrapyLoginItem

class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ["http://github.com/login"]
    login_user = "XXXXXXX"
    login_pass = "XXXXXXX"

    def parse(self, response):
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        for q in response.css("ul.list-style-none li div.width-full"):
            _, repo_name = q.css("span.css-truncate::text").getall()
            github = ScrapyLoginItem()
            github["repository_name"] = repo_name
            github["repository_link"] = q.css("a::attr(href)").get()
            yield github

Running it produces output like the following.

[
{"repository_name": "hello-world", "repository_link": "/xxxxxxx/hello-world"},
{"repository_name": "Spoon-Knife", "repository_link": "/octocat/Spoon-Knife"}
]

fill_login_form()

The part to focus on is fill_login_form.
Calling fill_login_form() parses the page and returns the information needed to submit the login form.

$python
Python 3.8.2 (default, Apr 16 2020, 18:36:10)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from loginform import fill_login_form
>>> import requests
>>> url = "https://github.com/login"
>>> r = requests.get(url)
>>> fill_login_form(url, r.text, "john", "secret")
(
[('authenticity_token', 'R+A63AyXCpZLBzIdp6LefjsRxmkhLqsxaUPp+DLru2BlQlyID+B7yXL3FoNgoBgjF3osG3ZSyjBFriX6TsrsFg=='), ('login', 'john'), ('password', 'secret'), ('webauthn-support', 'unknown'), ('webauthn-iuvpaa-support', 'unknown'), ('timestamp', '1588766233339'), ('timestamp_secret', '115d1a1e733276fa256131e12acb6c1974912ba3923dddd3ade33ba6717b3dcd'), ('commit', 'Sign in')],
'https://github.com/session',
'POST')

The first element of the returned tuple shows that authenticity_token is included; hidden parameters can be submitted this way.
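
fill_login_form can also be used outside Scrapy. A sketch with requests (credentials are placeholders; a real GitHub login may still be blocked by additional checks):

import requests
from loginform import fill_login_form

session = requests.Session()                 # keep the cookies set by the login page
url = 'https://github.com/login'
resp = session.get(url)
args, action, method = fill_login_form(url, resp.text, 'john', 'secret')
# args already contains the hidden fields (authenticity_token, timestamp, ...)
login = session.request(method, action, data=dict(args))
print(login.status_code)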


scrapy-splash

The Scrapy middleware for Splash. Install it with pip install scrapy-splash.

Preparing the project

$ scrapy startproject scrapy_splash_tutorial
New Scrapy project 'scrapy_splash_tutorial', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_splash_tutorial

You can start your first spider with:
cd scrapy_splash_tutorial
scrapy genspider example example.com
.
├── scrapy.cfg
└── scrapy_splash_tutorial
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

Customizing settings.py

DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.

The Splash middlewares must run before HttpProxyMiddleware, so their order values need to be below 750.

SPLASH_URL

Specify the Splash URL with SPLASH_URL.

SPLASH_URL = 'http://splash:8050/'

Splash is started via docker-compose, so the host name splash is used.

SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

Enable SplashDeduplicateArgsMiddleware. Together with cache_args, it keeps the same (potentially large) Splash arguments from being stored and sent to the Splash server over and over; a sketch follows.
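
A sketch of the cache_args side of this (the URL and script are placeholders):

from scrapy_splash import SplashRequest

LUA_SCRIPT = '...'  # imagine a long Lua rendering script here

# cache_args marks 'lua_source' as cacheable, so after the first request the
# full script is referenced by a hash instead of being re-sent every time.
req = SplashRequest('http://example.com', endpoint='execute',
                    cache_args=['lua_source'],
                    args={'lua_source': LUA_SCRIPT})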

DUPEFILTER_CLASS / HTTPCACHE_STORAGE

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Scrapy does not provide a global way to override the request fingerprint calculation, so define DUPEFILTER_CLASS and HTTPCACHE_STORAGE with the Splash-aware classes instead.

Example Spider implementation

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …
  1. Use SplashRequest instead of scrapy.Request so the page is rendered by Splash
  2. args passes arguments on to Splash
  3. endpoint selects the Splash endpoint (render.html here; render.json and execute are also available)

Implementing the quotes JS page based on the example Spider

Write a test against http://quotes.toscrape.com/js/, a page generated by JavaScript.

This time the spider is named quotesjs.

$scrapy genspider quotesjs quotes.toscrape.com
Created spider 'quotesjs' using template 'basic' in module:
scrapy_splash_tutorial.spiders.quotesjs

Checking the content with Chrome's F12 developer tools

[Screenshots: inspecting the page in Chrome's developer tools]

Analyzing the page with scrapy shell

The shell goes through Splash, so start it as scrapy shell 'http://splash:8050/render.html?url=http://<target_url>&timeout=10&wait=2'.
The wait=2 parameter (choose a value that suits the target) matters: without it, HTML may be returned before rendering has finished.
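
Because the target URL and parameters have to be percent-encoded into the Splash URL, a small helper avoids quoting mistakes (a sketch; the splash host name assumes the docker-compose setup used here):

from urllib.parse import urlencode

def splash_render_url(target, wait=2, timeout=10, splash='http://splash:8050'):
    """Build a render.html URL suitable for scrapy shell."""
    query = urlencode({'url': target, 'wait': wait, 'timeout': timeout})
    return '{}/render.html?{}'.format(splash, query)

print(splash_render_url('http://quotes.toscrape.com/js/'))
# -> http://splash:8050/render.html?url=http%3A%2F%2Fquotes.toscrape.com%2Fjs%2F&wait=2&timeout=10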

$scrapy shell 'http://splash:8050/render.html?url=http://quotes.toscrape.com/js/'
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:09:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:09:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet Password: 2dd3dc32afe40826
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:09:33 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:09:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:09:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f8aaede0f10>
[s] item {}
[s] request <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] response <200 http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] settings <scrapy.settings.Settings object at 0x7f8aaede0b20>
[s] spider <DefaultSpider 'default' at 0x7f8aaeb9a9a0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css('.container .quote').get()
'<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>'
>>> response.css('.container .quote').getall()
['<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>', '<div class="quote"><span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div>', '<div class="quote"><span class="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div>', '<div class="quote"><span class="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div>', '<div class="quote"><span class="text">“Imperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.”</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div>', '<div class="quote"><span class="text">“Try not to become a man of success. Rather become a man of value.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div>', '<div class="quote"><span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span><span>by <small class="author">André Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div>', '<div class="quote"><span class="text">“I have not failed. I\'ve just found 10,000 ways that won\'t work.”</span><span>by <small class="author">Thomas A. Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div>', '<div class="quote"><span class="text">“A woman is like a tea bag; you never know how strong it is until it\'s in hot water.”</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div>', '<div class="quote"><span class="text">“A day without sunshine is like, you know, night.”</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div>']

Customizing items.py

class QuoteItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()

Customizing quotesjs.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy_splash_tutorial.items import QuoteItem

class QuotesjsSpider(scrapy.Spider):
    name = 'quotesjs'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        for q in response.css(".container .quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

Running the crawler

Run the crawler with scrapy crawl quotesjs -o result.json.

$scrapy crawl quotesjs -o result.json
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:34:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:34:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet Password: febe521f79cff551
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:34:02 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:34:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:34:02 [py.warnings] WARNING: /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:34:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/js/ via http://splash:8050/render.html> (referer: None)
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“The world as we have created it is a process of our thinking. It '
'cannot be changed without changing our thinking.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'J.K. Rowling',
'quote': '“It is our choices, Harry, that show what we truly are, far more '
'than our abilities.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“There are only two ways to live your life. One is as though '
'nothing is a miracle. The other is as though everything is a '
'miracle.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Jane Austen',
'quote': '“The person, be it gentleman or lady, who has not pleasure in a '
'good novel, must be intolerably stupid.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Marilyn Monroe',
'quote': "“Imperfection is beauty, madness is genius and it's better to be "
'absolutely ridiculous than absolutely boring.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“Try not to become a man of success. Rather become a man of value.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'André Gide',
'quote': '“It is better to be hated for what you are than to be loved for '
'what you are not.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Thomas A. Edison',
'quote': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Eleanor Roosevelt',
'quote': '“A woman is like a tea bag; you never know how strong it is until '
"it's in hot water.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Steve Martin',
'quote': '“A day without sunshine is like, you know, night.”'}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-06 18:34:04 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: result.json
2020-05-06 18:34:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 960,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 9757,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 2,
'elapsed_time_seconds': 2.285135,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 6, 9, 34, 4, 575789),
'item_scraped_count': 10,
'log_count/DEBUG': 13,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 56578048,
'memusage/startup': 56578048,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/404': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2020, 5, 6, 9, 34, 2, 290654)}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Spider closed (finished)

The generated result.json is shown below.

[
{"author": "Albert Einstein", "quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"author": "J.K. Rowling", "quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"},
{"author": "Jane Austen", "quote": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"},
{"author": "Marilyn Monroe", "quote": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"},
{"author": "Andr\u00e9 Gide", "quote": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"},
{"author": "Thomas A. Edison", "quote": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"},
{"author": "Eleanor Roosevelt", "quote": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"},
{"author": "Steve Martin", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"}
]


The official Splash tutorial

Installation

The official instruction is docker run -it -p 8050:8050 --rm scrapinghub/splash, but here it is managed with docker-compose.

Defined in docker-compose.yml:

splash:
  image: scrapinghub/splash
  ports:
    - 8050:8050

Test run

$ docker-compose run splash
Pulling splash (scrapinghub/splash:)...
latest: Pulling from scrapinghub/splash
2746a4a261c9: Pull complete
4c1d20cdee96: Pull complete
~ snip ~
50ea6de52777: Pull complete
43e94179bda5: Pull complete
Digest: sha256:01c89e3b0598e904fea184680b82ffe74524e83160f793884dc88d184056c49d
Status: Downloaded newer image for scrapinghub/splash:latest
2020-05-06 04:13:03+0000 [-] Log opened.
2020-05-06 04:13:03.106078 [-] Xvfb is started: ['Xvfb', ':2112596484', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 04:13:03.184966 [-] Splash version: 3.4.1
2020-05-06 04:13:03.217438 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 04:13:03.217581 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 04:13:03.217654 [-] Open files limit: 1048576
2020-05-06 04:13:03.217695 [-] Can't bump open files limit
2020-05-06 04:13:03.231322 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 04:13:03.231620 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 04:13:03.343525 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 04:13:03.343858 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2020-05-06 04:13:03.344260 [-] Site starting on 8050
2020-05-06 04:13:03.344470 [-] Starting factory <twisted.web.server.Site object at 0x7f23c5cb6160>
2020-05-06 04:13:03.344768 [-] Server listening on http://0.0.0.0:8050

For actual use, start it with docker-compose up -d.

Splash WebUI

Once Splash is running, you can work with it from the web UI.

[Screenshot: Splash web UI]

Run Render me! with the code that is shown by default.

[Screenshot: Splash web UI rendering result]

Intro

Splash can execute custom rendering scripts written in the Lua programming language. This allows us to use Splash as a browser automation tool similar to PhantomJS.
It can run custom rendering scripts written in Lua, making it a PhantomJS-like browser automation tool.
Lua is also used for custom scripting in Redis, Nginx, Apache, World of Warcraft, and elsewhere.

The tutorial introduces the following script.

function main(splash, args)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end

Running Render me! in the web UI returns the JSON object produced by the return statement.

[Screenshots: Render me! result in the Splash web UI]

Entry Point: the “main” Function

function main(splash)
    return {hello="world!"}
end

Running it in the Splash web UI gives the following result.

Splash Response: Object
hello: "world!"

A plain string can also be returned instead of a JSON object.

function main(splash)
    return 'hello'
end

The docker-compose service is named splash, so the host name splash is used here.

$ curl 'http://splash:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend'
hello

Where Are My Callbacks?

It is not doing exactly the same work - instead of saving screenshots to files we’re returning PNG data to the client via HTTP API.
An example that takes screenshots as PNG and returns them through the HTTP API.

function main(splash, args)
    splash:set_viewport_size(800, 600)
    splash:set_user_agent('Splash bot')
    local example_urls = {"www.google.com", "www.bbc.co.uk", "scrapinghub.com"}
    local urls = args.urls or example_urls
    local results = {}
    for _, url in ipairs(urls) do
        local ok, reason = splash:go("http://" .. url)
        if ok then
            splash:wait(0.2)
            results[url] = splash:png()
        end
    end
    return results
end

Running Render me! in the web UI shows a screenshot of each site.

[Screenshot: per-site screenshots in the Splash web UI]

Calling Splash Methods

There are two main ways to call Lua methods in Splash scripts: using positional and named arguments. To call a method using positional arguments use parentheses splash:foo(val1, val2), to call it with named arguments use curly braces: splash:foo{name1=val1, name2=val2}:

Lua methods in Splash can be called with positional arguments, splash:foo(val1, val2), or with named arguments, splash:foo{name1=val1, name2=val2}.

function main(splash, args)
    -- Examples of positional arguments:
    splash:go("http://example.com")
    splash:wait(0.5, false)
    local title = splash:evaljs("document.title")

    -- The same using keyword arguments:
    splash:go{url="http://google.com"}
    splash:wait{time=0.5, cancel_on_redirect=false}
    local title = splash:evaljs{snippet="document.title"}

    -- Mixed arguments example:
    splash:wait{0.5, cancel_on_redirect=false}

    return title
end

The snippet itself is not meant to do anything useful, but note that the official tutorial writes evaljs{source="document.title"}, which does not work;
the splash:evaljs reference shows that the named argument is snippet, as used in the listing above.

Error Handling

Splash uses the following convention:

  1. for developer errors (e.g. incorrect function arguments) exception is raised;
  2. for errors outside developer control (e.g. a non-responding remote website) status flag is returned: functions that can fail return ok, reason pairs which developer can either handle or ignore.
    If main results in an unhandled exception then Splash returns HTTP 400 response with an error message.

Splash follows these conventions:

  1. Developer errors raise an exception
  2. Errors outside the developer's control are returned as a status flag

An exception can also be raised explicitly with error().

function main(splash, args)
    local ok, msg = splash:go("http://no-url.example.com")
    if not ok then
        -- handle error somehow, e.g.
        error(msg)
    end
end

When the Lua script ends with an unhandled exception, Splash returns it as an HTTP 400 error.

{
    "error": 400,
    "type": "ScriptError",
    "description": "Error happened while executing Lua script",
    "info": {
        "source": "[string \"function main(splash, args)\r...\"]",
        "line_number": 5,
        "error": "network3",
        "type": "LUA_ERROR",
        "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network3"
    }
}

The same code can be written more compactly with assert().

function main(splash, args)
    -- a shortcut for the code above: use assert
    assert(splash:go("http://no-url.example.com"))
end

Sandbox

By default Splash scripts are executed in a restricted environment: not all standard Lua modules and functions are available, Lua require is restricted, and there are resource limits (quite loose though).

By default, Splash scripts run in a sandbox. To disable it, use the --disable-lua-sandbox option.

If you run the Docker command directly:

`docker run -it -p 8050:8050 scrapinghub/splash --disable-lua-sandbox`

With docker-compose, pass the option via command.

splash:
  image: scrapinghub/splash
  command: --disable-lua-sandbox
  ports:
    - 8050:8050

A test run with docker-compose run confirms Lua: enabled (sandbox: disabled).

PS C:\Users\g\OneDrive\devel\gggcat@github\python3-tutorial> docker-compose run splash
2020-05-06 06:02:02+0000 [-] Log opened.
2020-05-06 06:02:02.166203 [-] Xvfb is started: ['Xvfb', ':1094237403', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 06:02:02.242322 [-] Splash version: 3.4.1
2020-05-06 06:02:02.275180 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 06:02:02.275346 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 06:02:02.275497 [-] Open files limit: 1048576
2020-05-06 06:02:02.275605 [-] Can't bump open files limit
2020-05-06 06:02:02.289473 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 06:02:02.289650 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 06:02:02.398489 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 06:02:02.398754 [-] Web UI: enabled, Lua: enabled (sandbox: disabled), Webkit: enabled, Chromium: enabled
2020-05-06 06:02:02.399073 [-] Site starting on 8050
2020-05-06 06:02:02.399156 [-] Starting factory <twisted.web.server.Site object at 0x7f02ac5b61d0>
2020-05-06 06:02:02.399344 [-] Server listening on http://0.0.0.0:8050

Timeouts

By default Splash aborts script execution after a timeout (30s by default); it is a common problem for long scripts.

The timeout is 30 seconds by default.
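
The render endpoints also accept a per-request timeout argument (capped by the server's --max-timeout, 90 seconds in the logs above). A sketch of passing it from scrapy-splash (the URL is a placeholder):

from scrapy_splash import SplashRequest

# Ask Splash for a larger time budget for this particular render
req = SplashRequest('http://example.com', endpoint='render.html',
                    args={'wait': 0.5, 'timeout': 60})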


AmazonLinux

Listing Amazon Linux images

  • Amazon Linux 2
  • Region: ap-northeast-1
  • Architecture: x86_64
$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
-------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-----------------------------------------------------------------------+------------------------+----------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2 | ami-0f310fced6141e627 | x86_64 |
| amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs | ami-06aa6ba9dc39dc071 | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-gp2 | ami-052652af12b58691f | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs | ami-0c6f9336767cd9243 | x86_64 |
~ snip ~
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-gp2 | ami-6be57d0d | x86_64 |
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-ebs | ami-39e37b5f | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-gp2 | ami-2a34b64c | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-ebs | ami-1d37b57b | x86_64 |
+-----------------------------------------------------------------------+------------------------+----------+

Getting the latest Amazon Linux image

  • Version: 2
  • Region: ap-northeast-1
  • Architecture: x86_64
  • Volume: gp2

The image differs by volume type, so the query below searches for gp2 (the current general-purpose SSD) volumes.

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[1].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------
| DescribeImages |
+-------------------------------------------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs |
| ami-06aa6ba9dc39dc071 |
| x86_64 |
+-------------------------------------------+
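
The same lookup can be done from Python with boto3 (a sketch; it assumes credentials are configured and simply picks the newest image by CreationDate):

import boto3

ec2 = boto3.client('ec2', region_name='ap-northeast-1')
images = ec2.describe_images(
    Owners=['amazon'],
    Filters=[
        {'Name': 'name', 'Values': ['amzn2-ami-hvm-*-gp2']},
        {'Name': 'architecture', 'Values': ['x86_64']},
    ],
)['Images']
latest = max(images, key=lambda img: img['CreationDate'])
print(latest['Name'], latest['ImageId'])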

Amazon Linux volume types

The amzn2-ami-hvm-*-x86_64-ebs images use VolumeType: standard, the previous-generation volume type.

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:30:34.000Z",
"ImageId": "ami-0f310fced6141e627",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM gp2",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-ebs" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:14:50.000Z",
"ImageId": "ami-06aa6ba9dc39dc071",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "standard",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM ebs",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}

UbuntuLinux

Getting the latest AMI for a specific Ubuntu version

Canonical's official owner ID is 099720109477, so start from that and narrow down.

  • Version: 18.04
  • Region: ap-northeast-1
  • Architecture: x86_64
$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------------------------------+------------------------+----------+
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200430 | ami-0084e4332fdb227c6 | x86_64 |
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0a8f568a6a14353b6 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0278fe6949f6b1a06 | x86_64 |
| ubuntu-eks/k8s_1.15/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200406.1 | ami-0fd103c2168938a67 | x86_64 |
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200406.1 | ami-0c1bb33d8c0bd2145 | x86_64 |
~ snip ~
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-19d33266 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-82c928fd | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180328.1 | ami-ddcec5a1 | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180329 | ami-54747f28 | x86_64 |
+-------------------------------------------------------------------------------------------+------------------------+----------+

Combining the conditions below, get the information for the latest 18.04 image.

  • Ubuntu 18.04
  • Region: ap-northeast-1
  • Architecture: x86_64
  • Volume: gp2
$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------+
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 |
| ami-0278fe6949f6b1a06 |
| x86_64 |
+-------------------------------------------------------------------+

Ubuntu volume types

  • ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*: current-generation SSD (gp2) root volume
  • ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-*: instance-store, no EBS root volume
$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-09T16:44:23.000Z",
"ImageId": "ami-0278fe6949f6b1a06",
"ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-0cb75af02a9254c11",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-04-08",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-instance/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[1]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-03-24T21:03:50.000Z",
"ImageId": "ami-0dc413a5565744b02",
"ImageLocation": "ubuntu-images-ap-northeast-1-release/bionic/20200323/hvm/instance-store/image.img.manifest.xml",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-03-23",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200323",
"RootDeviceType": "instance-store",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}



