Trying Browser-Based Scraping with Selenium and BeautifulSoup

-----

・Update 2023-10-24

I switched to using Webdriver Manager, which keeps the driver version up to date automatically.

-----

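I had occasion to scrape a site through a browser using Selenium and BeautifulSoup, so I am leaving these notes for future reference.

Pages whose data cannot be retrieved with Requests alone (for example, because the content is rendered by JavaScript in the browser) can still be scraped this way. ~~Amazon~~ A certain major online retailer is one example. (Scraping must follow the site's terms of use and be done with care not to burden the server.)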
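As one way to act on the caveat above, here is a minimal sketch that checks robots.txt rules before fetching anything, using only the standard library. The rules text, the user agent name `my-scraper`, and the example.com URLs are made up for illustration. robots.txt is also only part of the picture; the site's terms of use still have to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# For a real site you would call set_url(".../robots.txt") and read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")

# Crawl-delay, if the site declares one, is a hint for how long to sleep between requests.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

If `crawl_delay` returns a number, sleeping at least that many seconds between page loads is a reasonable starting point for keeping the load on the server low.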


This time I assume the code runs not on a local machine but on Ubuntu 22.04 LTS on GCE (Google Compute Engine).

Python, pip, and related packages are assumed to be installed already.

The browser is Chrome.

First, install selenium and BeautifulSoup.

$ sudo pip install selenium
$ sudo pip install beautifulsoup4

Install Chrome.

$ sudo apt -y update
$ sudo apt install -y wget
$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
$ sudo apt install -y ./google-chrome-stable_current_amd64.deb
$ sudo apt install -y -f

Install Webdriver Manager.

$ sudo pip install webdriver_manager

With that, the setup is done and we are ready to scrape with Selenium and BeautifulSoup.
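Before moving on, it may be worth confirming that everything actually landed where Python can find it. The following is a small sketch, not part of the original setup, that checks for the three Python packages and for a Chrome executable on the PATH. The executable name `google-chrome` is an assumption based on the .deb package used above; adjust it if your install differs.

```python
import importlib.util
import shutil

# Import names of the three pip packages installed above
# (beautifulsoup4 is imported as bs4).
REQUIRED_MODULES = ["selenium", "bs4", "webdriver_manager"]

# Executable name assumed from the google-chrome-stable .deb package.
CHROME_BINARY = "google-chrome"


def check_environment() -> dict:
    """Return a mapping of component name -> True/False for availability."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    status[CHROME_BINARY] = shutil.which(CHROME_BINARY) is not None
    return status


for component, ok in check_environment().items():
    print(f"{component}: {'OK' if ok else 'missing'}")
```

Anything reported as missing points back to the corresponding install command above.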

Below, as a simple example, is code that retrieves the title of this blog.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Configure Selenium
options = Options()
options.add_argument("--headless")  # Run without opening a browser window
options.add_argument("--ignore-certificate-errors")  # Ignore SSL certificate errors

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Let element lookups retry for up to 10 seconds.
# This does not wait for the page to load; driver.get() below
# already blocks until the page's load event has fired.
driver.implicitly_wait(10)

try:
    # Open the web page
    url = "https://www.nakakamado.com/"
    driver.get(url)

    # Parse the page content with Beautiful Soup
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Get the title
    title = soup.find("title")

    # Print the title (soup.find returns None if there is no <title>)
    if title is not None:
        print(title.text)
except WebDriverException as e:
    print(f"WebDriver error: {e}")
finally:
    # Close the browser, even if something above failed
    driver.quit()
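
For pages that fill in their content with JavaScript after the initial load, reading page_source right away may return HTML that does not yet contain the data. One common approach is an explicit wait with Selenium's WebDriverWait. The sketch below is a variation on the code above, not what I actually ran; the function name fetch_rendered_html and its parameters are my own naming, and the CSS selector you pass in depends entirely on the target page.

```python
def fetch_rendered_html(url: str, css_selector: str, timeout: float = 10.0) -> str:
    """Open url in headless Chrome, wait for css_selector to appear, return the HTML.

    Raises selenium.common.exceptions.TimeoutException if the element
    does not appear within `timeout` seconds.
    """
    # Imported inside the function so that merely defining it does not
    # require Selenium or a browser to be present.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    try:
        driver.get(url)
        # Poll the DOM until the element exists or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always release the browser process, even on timeout.
        driver.quit()
```

The returned string can be handed to BeautifulSoup exactly as in the code above. Unlike implicitly_wait, which applies to every element lookup on the driver, the explicit wait targets one specific condition, so it is easier to see what the script is waiting for.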

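So that was browser-based scraping with Selenium and BeautifulSoup.

By the way, I had ChatGPT write the code (with a few fixes of my own).

For somewhat larger programs, or when using a library I don't know, having it write the initial skeleton and then explain each error as I build from there makes things go much faster. It is also a good way to learn.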



Tuesday, October 10, 2023
