thinh-vu/vnnews

vnnews crawler

A Python package that helps crawl updates from top Vietnamese news providers.

II. REFERENCES

2.1. How to use this package?

  • Install the latest vnnews crawler version from source: pip install git+https://github.com/thinh-vu/vnnews.git@main

  • Install the stable version: pip install vnnews
    (*) On Google Colab, you might need to insert a ! before the command to run it as a terminal command.

  • To start using the functions, import them: from vnnews import *

2.2. List of popular online news sites for investors

  1. VN Express
  2. Tuổi trẻ Online
  3. CafeF
  4. Cafebiz
  5. Kinh tế Sài Gòn Online
  6. VN Economy
  7. Pháp Luật Tp.HCM
  8. Đầu tư Online
  9. Nhịp cầu đầu tư
  10. Diễn đàn doanh nghiệp
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
  12. Forbes Việt Nam
  13. Vietstock
  14. Tin nhanh chứng khoán
  15. Cafe Land
  16. Kenh14
  17. Dân trí
  18. Thanh niên
  19. Vietnamnet
  20. Nhân dân điện tử
  21. Lao động
  22. Đời sống & pháp luật

2.3. Function references

  • url_extract(url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')

    • Purpose: Extract article info from a news source, using BeautifulSoup to pull data from an HTML/XML web page.
    • Arguments:
      • url (:obj:str, required): url of the target news source. Eg. 'https://cafef.vn/'
      • key (:obj:str, required): the HTML tag that contains the information you want to extract. Eg. 'h3', 'article', 'div'
      • tag_class (:obj:str, optional): the value of the HTML class attribute on the target element; '' as default. Eg. 'pdate' for the element holding the publication time '19-11-2022 - 15:32 PM' on CafeF.
      • type (:obj:str, optional): 'link' as default, to extract only the article links from a news homepage. Use the blank value '' when extracting article details on the article page.
      • bs_on (:obj:bool, optional): True as default. Pass a blank '' (or False) if the default raises an error.
      • user_agent (:obj:str, optional): a default desktop value is provided. You can find more user agent values here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
  • fix_url(host, url)

    • Purpose: Build a complete article url from the host name and a url string that may be missing the host.
    • Arguments:
      • host (:obj:str, required): the host name of the news source. Eg. 'https://vneconomy.vn'
      • url (:obj:str, required): the url string of the target article. This might not contain the host at the beginning. Eg. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
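
The idea behind fix_url can be illustrated with the standard library alone. The sketch below is not the package's implementation; join_host is a hypothetical helper name, and it assumes that "fixing" a url simply means resolving a relative path against the host.

```python
from urllib.parse import urljoin


def join_host(host, url):
    """Hypothetical stand-in for illustration; not vnnews.fix_url itself."""
    # urljoin resolves a relative path against the host and returns the url
    # unchanged when it already carries its own scheme and host.
    return urljoin(host, url)


relative = '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
absolute = join_host('https://vneconomy.vn', relative)
print(absolute)
```

Passing an already complete url through the same helper leaves it as it is, so it is safe to apply to every link collected from a homepage.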

2.4. Let's get our hands dirty

  1. VN Express
    • Get the list of article urls: url_extract('https://vnexpress.net/kinh-doanh', key='h3')
    • Extract article details: url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
  2. Tuổi trẻ Online
    • Get the list of article urls: url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
    • Extract article details: url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
  3. CafeF
    • Get the list of article urls: url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
    • Extract article details: url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
  4. Cafebiz
    • Get the list of article urls: url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')
  5. Kinh tế Sài Gòn Online
    • Get the list of article urls: url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
  6. VN Economy
    • Get the list of article urls: url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
    • Extract article details: url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
  7. Pháp Luật Tp.HCM
    • Get the list of article urls: url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
    • Extract article details: url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
  8. Đầu tư Online
    • Get the list of article urls: url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
    • Extract article details: url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
  9. Nhịp cầu đầu tư
    • Get the list of article urls: url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
    • Extract article details: url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')
  10. Diễn đàn doanh nghiệp
    • Get the list of article urls: url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
  11. Diễn đàn kinh tế Việt Nam - Vietnamnet
    • Get the list of article urls: url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
    • Extract article details: (not yet documented)
  12. Forbes Việt Nam
    • Get the list of article urls: url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
  13. Vietstock
    • Get the list of article urls: url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
    • Extract article details: url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
  14. Tin nhanh chứng khoán
    • Get the list of article urls (this call currently doesn't work): url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
    • Extract article details: url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
  15. Cafe Land
    • Get the list of article urls: url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
  16. Kenh14
    • Get the list of article urls: url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
    • Extract article details: url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
  17. Dân trí
    • Get the list of article urls: url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
    • Extract article details: url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
  18. Thanh niên
    • Get the list of article urls: (not yet documented)
    • Extract article details: (not yet documented)
  19. Vietnamnet
    • Get the list of article urls: (not yet documented)
    • Extract article details: (not yet documented)
  20. Nhân dân điện tử
    • Get the list of article urls: (not yet documented)
    • Extract article details: (not yet documented)
  21. Lao động
    • Get the list of article urls: (not yet documented)
    • Extract article details: (not yet documented)
  22. Đời sống & pháp luật
    • Get the list of article urls: (not yet documented)
    • Extract article details: (not yet documented)
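
The key and tag_class values in the calls above name an HTML tag and its class attribute. To make that selection rule concrete, here is a minimal standard-library sketch run on a made-up HTML snippet. It is not how vnnews works internally (the package uses BeautifulSoup); TagCollector, collect and SAMPLE are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Collect text and links found inside <key class="tag_class"> elements."""

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key
        self.required = set(tag_class.split())  # empty set matches any class
        self.depth = 0    # greater than 0 while inside a matching element
        self.texts = []   # text content of each matching element
        self.links = []   # href values of <a> tags inside matching elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == self.key:
            classes = set((attrs.get("class") or "").split())
            if self.required <= classes:
                self.depth = 1
                self.texts.append("")
        if self.depth and tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def collect(html, key, tag_class=""):
    parser = TagCollector(key, tag_class)
    parser.feed(html)
    parser.close()
    return parser


# Made-up markup that imitates a homepage listing and an article timestamp.
SAMPLE = """
<h3><a href="/bat-dong-san/article-1.chn">First headline</a></h3>
<h3><a href="/bat-dong-san/article-2.chn">Second headline</a></h3>
<span class="pdate">19-11-2022 - 15:32 PM</span>
"""

links = collect(SAMPLE, "h3").links
dates = [text.strip() for text in collect(SAMPLE, "span", "pdate").texts]
print(links)
print(dates)
```

Choosing key='h3' picks up the headline links, while key='span' with tag_class='pdate' narrows the match to the timestamp element. This mirrors the two kinds of calls listed for each news site.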

III. APPENDICES

  • Demo video: How to select the key
  • Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
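
A user agent string is sent to the server as the User-Agent HTTP request header, which is why a mobile value can be needed for a mobile site. The sketch below only builds a request object with the standard library and sends nothing over the network. build_request and the constant names are hypothetical, and how vnnews passes the header internally is not shown in this README.

```python
from urllib.request import Request

# Both strings are taken from the examples in this README.
DESKTOP_UA = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101'
MOBILE_UA = 'Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)'


def build_request(url, user_agent=DESKTOP_UA):
    """Hypothetical helper: attach a User-Agent header to a request object."""
    return Request(url, headers={'User-Agent': user_agent})


req = build_request('https://m.nhipcaudautu.vn/kinh-doanh/', MOBILE_UA)
# urllib stores header names capitalized, hence 'User-agent' in the lookup.
print(req.get_header('User-agent'))
```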

IV. 🙋‍♂️ CONTACT INFORMATION

You can contact me at one of my social network profiles:


If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fee and the development of high-quality content.
