
Fix Static Crawling Issue Due to Newly Implemented Anti-Scraping Mechanism #109

Open, wants to merge 3 commits into base: master

JunTingLin

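Hello,

First, thank you for developing and sharing such a useful project. While using it, I found that since the Lunar New Year the original approach of static crawling with requests to fetch all stock code data from http://isin.twse.com.tw/isin/C_public.jsp?strMode=2 no longer works. I suspect the site has strengthened its anti-scraping mechanism.

To work around this, I modified the fetch_data function in fetch.py to crawl dynamically with Selenium. Since some users may run this project in an environment without a GUI, I enabled headless mode. However, once headless mode was enabled, I frequently ran into connection failures. After some experimentation I found a workable solution: visit the main page https://isin.twse.com.tw first, pause for a few seconds, and only then visit the target URL. With that sequence the required data can be fetched successfully.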
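A minimal sketch of the workaround described above, not the PR's actual diff. The helper names (`build_headless_driver`, `fetch_page_source`), the constant names, and the choice of Chrome are assumptions for illustration; the 5-second pause mirrors the `time.sleep(5)` quoted in the review below.

```python
import time

MAIN_PAGE_URL = "https://isin.twse.com.tw"
TARGET_URL = "http://isin.twse.com.tw/isin/C_public.jsp?strMode=2"
WARMUP_SECONDS = 5  # assumed pause length, named instead of hard-coded


def build_headless_driver():
    """Create a headless Chrome driver (assumes Selenium 4 and Chrome are installed)."""
    # Imported lazily so fetch_page_source can be exercised without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_page_source(driver, pause=WARMUP_SECONDS, sleep=time.sleep):
    """Visit the main page, pause, then load the target URL and return its HTML."""
    driver.get(MAIN_PAGE_URL)
    sleep(pause)  # let the first visit settle before requesting the data page
    driver.get(TARGET_URL)
    sleep(pause)
    return driver.page_source
```

A caller would typically do `driver = build_headless_driver()`, call `fetch_page_source(driver)` inside a `try`, and call `driver.quit()` in the `finally` clause so the browser process is always released.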

If there are any problems with my changes, or if there is a better solution, please feel free to contact me.

JeffBla (Contributor) commented Apr 11, 2024

Hello JunTingLin! I think I ran into the same problem as you: the update function fails.
I analyzed it and made some adjustments in #110.
I wonder whether it would be better not to use Selenium?


mitchhuang777 left a comment


  1. Consider adding try-except blocks to handle potential exceptions.
  2. Use WebDriverWait(driver, 10).until rather than time.sleep.
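The two suggestions above could be combined roughly as follows. This is a sketch under assumptions, not code from the PR: `table_present` and `load_and_wait` are hypothetical helper names, and waiting for a `<table>` element assumes the target page renders its data as a table.

```python
DEFAULT_TIMEOUT = 10  # seconds, matching WebDriverWait(driver, 10) in the suggestion


def table_present(driver):
    """Wait condition: True once the page contains at least one <table>."""
    # "tag name" is the locator strategy string behind Selenium's By.TAG_NAME.
    return bool(driver.find_elements("tag name", "table"))


def load_and_wait(driver, url, timeout=DEFAULT_TIMEOUT):
    """Load url and return its HTML, or None if loading or waiting fails."""
    # Imported here so this module can be imported without Selenium installed.
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        driver.get(url)
        # Poll until the condition is truthy instead of sleeping a fixed time.
        WebDriverWait(driver, timeout).until(table_present)
        return driver.page_source
    except TimeoutException:
        print(f"Timed out after {timeout}s waiting for a table at {url}")
    except WebDriverException as exc:
        print(f"Browser error while loading {url}: {exc}")
    return None
```

`TimeoutException` is caught before `WebDriverException` because it is a subclass of it; reversing the order would make the timeout branch unreachable.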

driver.get(main_page_url)
time.sleep(5)  # wait for JavaScript rendering to finish
driver.get(url)
time.sleep(5)  # wait for JavaScript rendering to finish


Magic numbers are not a good approach :(

# Use WebDriver to visit the main page first, then the specified URL
main_page_url = "https://isin.twse.com.tw"
driver.get(main_page_url)
time.sleep(5)  # wait for JavaScript rendering to finish


Magic numbers are not a good approach :(
