
Wednesday, September 4, 2024

An example of writing a web scraper in Python

############################ ubuntu ############################
$ sudo apt install python3-pip
$ sudo apt install python3-selenium
$ sudo apt install python3-bs4
$ sudo apt install python3-requests
$ sudo apt install xvfb
$ export DISPLAY=:99
$ Xvfb :99 -screen 0 1024x768x16 &

############################ fedora ############################
$ sudo yum install python3-pip
$ sudo pip3 install selenium
$ sudo pip3 install bs4
$ export DISPLAY=:99
$ Xvfb :99 -screen 0 1024x768x16 &
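
Before moving on, a quick import check confirms the packages are in place on either distribution (a minimal sketch; the file name check_env.py is just an example):

$ cat check_env.py
#!/usr/bin/python3
# Smoke test: confirm the packages installed above are importable
# and that a DISPLAY is set for the Xvfb session
import os
import selenium
import bs4
import requests

print("selenium", selenium.__version__)
print("beautifulsoup4", bs4.__version__)
print("requests", requests.__version__)
print("DISPLAY =", os.environ.get("DISPLAY", "(not set)"))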


$ wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz
$ tar xvf geckodriver-v0.35.0-linux64.tar.gz
$ sudo cp geckodriver /usr/local/bin/
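
If you would rather not run Xvfb at all, newer Selenium releases can also drive Firefox headless. A minimal sketch, assuming Selenium 4 and the geckodriver installed above:

#!/usr/bin/python3
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

options = Options()
options.add_argument("-headless")  # run Firefox without a display, so Xvfb is not needed
service = Service('/usr/local/bin/geckodriver')
driver = webdriver.Firefox(service=service, options=options)
driver.get("https://www.fedex.com/zh-tw/shipping/surcharges.html")
print(driver.title)
driver.quit()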

$ cat get_fedex.py
#!/usr/bin/python3

# Load the required packages
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.service import Service
import requests
import time

# Open a browser window
# Option 1 (Chrome): chromedriver.exe must sit in the same working directory as this script
#driver = webdriver.Chrome()

# This example drives Firefox through the geckodriver installed in /usr/local/bin
#service = Service(executable_path='/usr/local/bin/geckodriver')
service = Service('/usr/local/bin/geckodriver')
driver = webdriver.Firefox(service=service)
#driver = webdriver.Firefox()

# Option 2: or specify the executable path directly
#driver = webdriver.Firefox("/usr/local/bin")

driver.implicitly_wait(3) # wait up to 3 seconds for page elements to load
driver.get("https://www.fedex.com/zh-tw/shipping/surcharges.html") # change the URL to visit a different page
#time.sleep(100)

#driver.find_elements(By.CLASS_NAME,"fxg-gdpr__accept-all-btn cc-aem-c-button cc-aem-c-button--responsive cc-aem-c-button--primary")[0].click()
#buttons=driver.find_elements(By.CLASS_NAME,"fxg-gdpr__accept-all-btn cc-aem-c-button cc-aem-c-button--responsive cc-aem-c-button--primary")

button=driver.find_elements(By.CSS_SELECTOR,".fxg-gdpr__accept-all-btn")[0] # the "Accept All Cookies" button, located with Selenium IDE
print(button)
button.click()

driver.refresh() # after accepting cookies, refresh the page so the data reloads

# The data cells to scrape, located with Selenium IDE
data=driver.find_elements(By.CSS_SELECTOR,".fuelsurcharg-dynamic-datalookup .cc-aem-c-table__tbody:nth-child(2) .cc-aem-c-table__tbody__td:nth-child(1)")[0]
print(data.get_attribute("innerText"))
data=driver.find_elements(By.CSS_SELECTOR,".fuelsurcharg-dynamic-datalookup .cc-aem-c-table__tbody:nth-child(2) .cc-aem-c-table__tbody__td:nth-child(2)")[0]
print(data.get_attribute("innerText"))
data=driver.find_elements(By.CSS_SELECTOR,".fuelsurcharg-dynamic-datalookup .cc-aem-c-table__tbody:nth-child(2) .cc-aem-c-table__tbody__td:nth-child(3)")[0]
print(data.get_attribute("innerText"))

#print (driver.title)
#html=driver.page_source
#print (html)

#soup = BeautifulSoup(driver.page_source, 'lxml')
#print(soup.prettify())
#
#with open('index.html', 'w', encoding='utf-8',) as file:
#    file.write(soup.prettify())

driver.close() # close the browser window
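
The commented-out BeautifulSoup lines above hint at another way to read the same table: grab driver.page_source once and parse it offline. A minimal sketch, to be run before driver.close(); it assumes the page keeps the CSS classes used by the Selenium selectors, and the built-in 'html.parser' avoids the lxml dependency:

soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.select_one(".fuelsurcharg-dynamic-datalookup")
if table:
    # the same three cells targeted by the Selenium selectors above
    cells = table.select(".cc-aem-c-table__tbody:nth-child(2) .cc-aem-c-table__tbody__td")[:3]
    for cell in cells:
        print(cell.get_text(strip=True))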

