Session Handling with Scrapy and Splash

Session handling in Splash

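When Splash is used on its own, the headless browser engine running inside it (QtWebKit by default) handles sessions, much as a browser driven by Selenium does. Within a single Lua script, state is therefore maintained without any extra work.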

Between Scrapy and Splash

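From Scrapy's point of view, Splash is stateless: every request starts from a clean state. Session handling therefore has to be implemented between Scrapy and the Lua script.
The scrapy-splash documentation describes how.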

Session handling

Splash itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

1. client (Scrapy) must send current cookies to Splash;
2. Splash script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
3. updated cookies should be sent back to the client;
4. client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides splash:get_cookies() and splash:init_cookies() methods which can be used in Splash Lua scripts.

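Because Splash is stateless, maintaining state takes explicit code:

1. Scrapy must send its current cookies to Splash.
2. The Splash script uses those cookies for its requests and updates them.
3. The updated cookies are returned to Scrapy.
4. Scrapy merges the cookies it receives into its current cookies.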
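
The four steps above can be sketched as a small simulation. This is not scrapy-splash code: plain dicts stand in for Scrapy, Splash, and the cookies, and the names cookie_key, merge_cookies, fake_splash_render, and the sessionid cookie are all made up for illustration.

```python
# Simulation only: no Scrapy or Splash involved.

def cookie_key(cookie):
    """Identify a cookie by name, domain, and path."""
    return (cookie["name"], cookie.get("domain", ""), cookie.get("path", "/"))


def merge_cookies(current, updated):
    """Step 4: merge updated cookies into the current ones (updated wins)."""
    merged = {cookie_key(c): c for c in current}
    for c in updated:
        merged[cookie_key(c)] = c
    return list(merged.values())


def fake_splash_render(args):
    """Steps 2 and 3: start from the received cookies, update them, return them.

    Stands in for a Lua script that calls splash:init_cookies() at the start
    and returns splash:get_cookies() at the end.
    """
    cookies = list(args["cookies"])  # plays the role of splash:init_cookies(...)
    site_cookie = {"name": "sessionid", "value": "abc123",
                   "domain": "example.com", "path": "/"}
    cookies = merge_cookies(cookies, [site_cookie])  # the site sets a cookie
    return {"cookies": cookies, "html": "<html></html>"}


# Step 1: the client sends its current cookies along with the request.
client_cookies = [{"name": "lang", "value": "ja",
                   "domain": "example.com", "path": "/"}]
result = fake_splash_render({"cookies": client_cookies})

# Step 4: the client merges what came back.
client_cookies = merge_cookies(client_cookies, result["cookies"])
print(sorted(c["name"] for c in client_cookies))
```

Keying cookies by name, domain, and path is a simplification chosen for this sketch; a real cookie jar applies fuller matching rules.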

scrapy-splash provides helpers for (1) and (4): to send current cookies in 'cookies' field and merge cookies back from 'cookies' response field, set request.meta['splash']['session_id'] to the session identifier. If you only want a single session use the same session_id for all requests; any value like '1' or 'foo' is fine.

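When request.meta['splash']['session_id'] is set to a session identifier, scrapy-splash automatically sends the current cookies for that session and merges the cookies that come back.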
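
The role of the session identifier can be pictured as one cookie store per session_id. The following is only a rough sketch of that idea, not the scrapy-splash implementation; the class SessionCookies and its methods are hypothetical names, and cookies are reduced to a name-to-value dict.

```python
from collections import defaultdict


class SessionCookies:
    """Keeps one independent cookie dict per session_id (illustrative only)."""

    def __init__(self):
        # A new, empty cookie dict is created the first time a session_id is seen.
        self._jars = defaultdict(dict)

    def cookies_for(self, session_id):
        """Cookies to send with the next request of this session."""
        return dict(self._jars[session_id])

    def merge(self, session_id, returned):
        """Merge cookies returned by a render into this session only."""
        self._jars[session_id].update(returned)


store = SessionCookies()
store.merge("1", {"sessionid": "abc"})
store.merge("foo", {"sessionid": "xyz"})
store.merge("1", {"cart": "3"})

print(store.cookies_for("1"))
print(store.cookies_for("foo"))
```

Requests that share a session_id see each other's cookies, while requests with different identifiers stay isolated. That is why a single constant value is enough when only one session is needed.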

For scrapy-splash session handling to work you must use /execute endpoint and a Lua script which accepts ‘cookies’ argument and returns ‘cookies’ field in the result:

To enable this session handling, use the /execute endpoint and write a Lua script that accepts the cookies argument and returns a cookies field in its result.

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end

SplashRequest sets session_id automatically for /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, /execute endpoint and a compatible Lua rendering script.

Using SplashRequest with the /execute endpoint and a suitable Lua script is enough to get session handling, because session_id is set automatically.

Structure of responses received via Splash

All these responses set response.url to the URL of the original request (i.e. to the URL of a website you want to render), not to the URL of the requested Splash endpoint. “True” URL is still available as response.real_url.
SplashJsonResponse provides extra features:

response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].

If Splash session handling is configured, you can access current cookies as response.cookiejar; it is a CookieJar instance.

If Scrapy-Splash response magic is enabled in request (default), several response attributes (headers, body, url, status code) are set automatically from original response body:

response.headers are filled from ‘headers’ keys;

response.url is set to the value of ‘url’ key;

response.body is set to the value of ‘html’ key, or to base64-decoded value of ‘body’ key;

response.status is set from the value of ‘http_status’ key.

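response.url is set to the URL of the page being rendered.
response.real_url is the Splash URL (http://splash:8050/execute).
response.data gives access to the data returned from Splash.
Cookies are available as response.cookiejar.
With Scrapy-Splash response magic, the response attributes are filled in automatically from the response of the rendered page.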
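
The key-to-attribute mapping described above can be sketched in plain Python. This is an illustration of the mapping only, not the scrapy-splash code: apply_response_magic is a hypothetical helper, SimpleNamespace stands in for the response object, and the fallback values used when a key is missing are assumptions made for this sketch.

```python
import base64
from types import SimpleNamespace


def apply_response_magic(data, request_url):
    """Map a decoded Splash JSON result onto response-like attributes."""
    if "html" in data:
        # 'html' key: rendered HTML as text.
        body = data["html"].encode("utf-8")
    elif "body" in data:
        # 'body' key: base64-encoded raw bytes.
        body = base64.b64decode(data["body"])
    else:
        body = b""  # assumption: empty body when neither key is present
    return SimpleNamespace(
        url=data.get("url", request_url),       # 'url' key
        status=data.get("http_status", 200),    # 'http_status' key
        headers=data.get("headers", {}),        # 'headers' key
        body=body,
        data=data,                              # like response.data
    )


resp = apply_response_magic(
    {"url": "http://example.com/after-login",
     "http_status": 200,
     "html": "<html>ok</html>"},
    request_url="http://example.com/login",
)
print(resp.url, resp.status, resp.data["html"])
```

A Lua script that returns url, headers, http_status, and html therefore lets the callback treat the result much like an ordinary Scrapy response.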

Sample code for session handling

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder spider name
    start_urls = ['http://example.com']  # placeholder URL

    def start_requests(self):
        # ...
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script},
                headers={'X-My-Header': 'value'},
            )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
        self.logger.info('Rendered %s', response.url)

Points to note in the request

The key point is that the /execute endpoint is used.
args passes the Lua script and its parameters to Splash.

yield SplashRequest(url, self.parse_result,
    endpoint='execute',
    cache_args=['lua_source'],
    args={'lua_source': script},
    headers={'X-My-Header': 'value'},
)
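
As a rough picture of what the Lua script sees in splash.args, the arguments can be thought of as a JSON object sent to the /execute endpoint. The sketch below builds such an object with the standard library only. build_execute_payload is a hypothetical helper, the field set simply mirrors the splash.args names used in the script above, and the exact payload scrapy-splash sends is not reproduced here.

```python
import json


def build_execute_payload(url, lua_source, cookies=None, headers=None,
                          http_method="GET", body=None):
    """Assemble the arguments a Lua script would read from splash.args."""
    args = {
        "lua_source": lua_source,      # the script to run
        "url": url,                    # read as splash.args.url
        "http_method": http_method,    # read as splash.args.http_method
        "headers": headers or {},      # read as splash.args.headers
        "cookies": cookies or [],      # read as splash.args.cookies
    }
    if body is not None:
        args["body"] = body            # read as splash.args.body
    return json.dumps(args)


payload = build_execute_payload(
    "http://example.com",
    "function main(splash) return {} end",
    headers={"X-My-Header": "value"},
)
print(payload)
```

Every key in this object is available to the script under the same name, which is why the script can call splash:init_cookies(splash.args.cookies) without any extra wiring.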

The script initializes the cookies from the arguments passed through SplashRequest, then loads the page.

splash:init_cookies(splash.args.cookies)
assert(splash:go{
  splash.args.url,
  headers=splash.args.headers,
  http_method=splash.args.http_method,
  body=splash.args.body,
  })
assert(splash:wait(0.5))

Points to note in the response

The script returns the header information of the last response together with the cookies.

local entries = splash:history()
local last_response = entries[#entries].response
return {
  url = splash:url(),
  headers = last_response.headers,
  http_status = last_response.status,
  cookies = splash:get_cookies(),
  html = splash:html(),
}
