vnnews crawler

A Python package that helps crawl updates from top Vietnamese news providers.

II. REFERENCES

2.1. How to use this package?

- Install the latest vnnews crawler version from source: pip install git+https://github.com/thinh-vu/vnnews.git@main
- Install the stable version: pip install vnnews
- On Google Colab, you might need to insert a ! before terminal commands, for example: !pip install vnnews
- Import the functions before using them: from vnnews import *

2.2. List of popular online news sources for investors

2.3. Function references

url_extract (url, key, tag_class='', type='link', bs_on=True, user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:11.0) Gecko/20100101')
- Purpose: Extract article info from a news source, using BeautifulSoup to pull data from an HTML/XML web page.
- Arguments:
  - url (:obj:str, required): url of the target news source. Eg. 'https://cafef.vn/'
  - key (:obj:str, required): HTML tag which contains the information that you want to extract. Eg. 'h3', 'article', 'div'
  - tag_class (:obj:str, optional): The value of the HTML class attribute on the target element. Defaults to ''. Eg. 'pdate' in the tag <span class="pdate">19-11-2022 - 15:32 PM</span> on CafeF.
  - type (:obj:str, optional): 'link' as default to extract only the article link from a news homepage. Use blank value '' when extracting article detail on the article page.
  - bs_on (:obj:bool, optional): True as default. Pass a blank '' or False if the default raises an error for a site, as several examples in section 2.4 do.
  - user_agent (:obj:str, optional): The default value for Desktop has been provided. You can find more user agent value here: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
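To make the roles of key and tag_class concrete, here is a minimal offline sketch of the selection idea: find every element whose tag name is key and whose class attribute contains tag_class, then return its text. This is not the vnnews implementation (which uses BeautifulSoup). It uses only Python's standard html.parser, and TagTextParser and the sample HTML are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text of every <key> element, optionally filtered by class.

    Hypothetical helper for illustration; not part of vnnews.
    """

    def __init__(self, key, tag_class=""):
        super().__init__()
        self.key = key
        self.wanted = set(tag_class.split())  # empty set matches any class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []     # one text string per matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != self.key:
            return
        if self.depth:
            self.depth += 1   # same tag nested inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted.issubset(classes):
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.key and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


# Sample markup modelled on the CafeF date tag described above.
sample_html = (
    '<div><span class="pdate">19-11-2022 - 15:32 PM</span>'
    '<span class="other">ignored</span></div>'
)
parser = TagTextParser(key="span", tag_class="pdate")
parser.feed(sample_html)
parser.close()
print(parser.results)
```

With tag_class left blank, the same sketch returns the text of every span, which mirrors why a class is usually needed to single out one element on an article page.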
fix_url(host, url)
- Purpose: Return a full article url by adding the host in front of a url string that does not already contain it.
- Arguments:
  - host (:obj:str, required): the host name of the news source. Eg. 'https://vneconomy.vn'
  - url (:obj:str, required): the url string of the target news source. This might not contain the host at the beginning. Eg. '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
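As a rough illustration of the host and url arguments, the sketch below joins them with the standard library's urljoin. The name fix_url_sketch is hypothetical, and the real fix_url in vnnews may be implemented differently. This only shows the kind of result to expect.

```python
from urllib.parse import urljoin


def fix_url_sketch(host, url):
    """Return an absolute url (hypothetical stand-in, not the vnnews fix_url)."""
    # urljoin resolves a relative url against host and leaves an
    # already-absolute url unchanged.
    return urljoin(host, url)


relative = '/de-viet-nam-thanh-digital-hub-cua-khu-vuc-vao-nam-2030-e290.htm'
print(fix_url_sketch('https://vneconomy.vn', relative))
```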

2.4. Let's get our hands dirty

VN Express
- Get the list of article urls: url_extract('https://vnexpress.net/kinh-doanh', key='h3')
- Extract article details: url_extract('https://vnexpress.net/thuong-mai-va-dau-tu-ben-vung-se-giup-apec-ung-pho-nguy-co-suy-thoai-4538015.html', key='span', tag_class='date', type='')
Tuổi trẻ Online
- Get the list of article urls: url_extract('https://tuoitre.vn/phap-luat.htm', key='h3')
- Extract article details: url_extract('https://tuoitre.vn/gap-thu-tuong-xuc-dong-chuyen-co-giao-mam-non-miet-mai-lam-thien-nguyen-cho-vung-xa-20221119175021292.htm', key='div', tag_class='date-time', type='')
CafeF
- Get the list of article urls: url_extract('https://cafef.vn/bat-dong-san.chn', key='h3', type='link')
- Extract article details: url_extract('https://cafef.vn/dau-se-la-phan-khuc-bds-giu-duoc-nhiet-trong-thoi-gian-toi-2022111913083069.chn', key='span', tag_class='pdate', type='')
Cafebiz
- Get the list of article urls: url_extract('https://cafebiz.vn/vi-mo.chn', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://cafebiz.vn/tai-sao-nha-o-my-la-tai-san-con-o-nhat-ban-thi-lai-chang-khac-gi-hang-tieu-dung-176221119095831295.chn', key='span', tag_class='time', type='')

Kinh tế Sài Gòn Online
- Get the list of article urls: url_extract('https://thesaigontimes.vn/', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://thesaigontimes.vn/kinh-te-tuan-hoan-mo-ra-nhung-mo-hinh-kinh-doanh-moi/', key='time', tag_class='', type='')
VN Economy
- Get the list of article urls: url_extract('https://vneconomy.vn/', key='h3', type='link', bs_on=False)
- Extract article details: url_extract('https://vneconomy.vn/xuat-khau-det-may-van-tu-tin-voi-muc-tieu-42-ty-usd.htm', key='div', tag_class='detail__meta', type='')
Pháp Luật Tp.HCM
- Get the list of article urls: url_extract('https://m.plo.vn/phap-luat/', key='h3', type='link')[0][1]
- Extract article details: test = url_extract('https://plo.vn/dieu-tra-trung-tam-dang-kiem-cap-so-song-sinh-cho-xe-tai-post705918.html', key='time', tag_class='', type='')
Đầu tư Online
- Get the list of article urls: url_extract('https://baodautu.vn/', key='article', type='link', bs_on='')
- Extract article details: url_extract('https://baodautu.vn/nguoi-dan-rong-ra-cau-cuu-khi-nao-co-so-do-tu-du-an-cua-cong-ty-bach-dat-an-d177946.html', key='span', tag_class='post-time', type='')
Nhịp cầu đầu tư

- Get the list of article urls: url_extract('https://m.nhipcaudautu.vn/kinh-doanh/', key='article', type='link', bs_on='', user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)')
- Extract article details: url_extract('https://m.nhipcaudautu.vn/ti-le-don-bay-tai-chinh-toan-thi-truong-giam-dan-tu-quy-i-3348999/', key='span', tag_class='date-post', type='')

Diễn đàn doanh nghiệp
- Get the list of article urls: url_extract('https://diendandoanhnghiep.vn/', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://diendandoanhnghiep.vn/https-diendandoanhnghiep-vn-dien-mat-troi-mai-nha-can-hoan-thien-co-che-ho-tro-doanh-nghiep-phat-trien-225626-html-e313.html', key='span', tag_class='created_time', type='')
Diễn đàn kinh tế Việt Nam - Vietnamnet
- Get the list of article urls: url_extract('https://vef.vn/diem-nong/', key='article', type='link', bs_on='')
- Extract article details: ``
Forbes Việt Nam
- Get the list of article urls: url_extract('https://forbes.vn', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://forbes.vn/m-village-cua-nguyen-hai-ninh-xay-lang-trong-pho/', key='div', tag_class='forbes-single__heading-time', type='')
Vietstock
- Get the list of article urls: url_extract('https://vietstock.vn/', key='h4', type='link', bs_on='')
- Extract article details: url_extract('https://vietstock.vn/2022/11/thieu-hut-iphone-14-nguoi-dung-viet-lua-chon-iphone-doi-cu-4264-1017483.htm', key='span', tag_class='date', type='')
Tin nhanh chứng khoán
- Get the list of article urls (currently doesn't work): url_extract('https://m.tinnhanhchungkhoan.vn/', key='h2', type='link', bs_on='')
- Extract article details: url_extract('https://www.tinnhanhchungkhoan.vn/big-trends-sau-con-mua-troi-lai-sang-post310328.html', key='time', tag_class='', type='')
Cafe Land
- Get the list of article urls: url_extract('https://cafeland.vn/', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://cafeland.vn/phan-tich/bien-doi-khi-hau-dang-leo-thang-nhung-doanh-nghiep-chu-yeu-doi-pho-114941.html', key='div', tag_class='info-date right', type='')
Kenh14
- Get the list of article urls: url_extract('https://m.kenh14.vn/doi-song.chn', key='h3', type='link')
- Extract article details: url_extract('https://m.kenh14.vn/phia-sau-nhung-gen-z-okela-co-luc-that-bai-co-luc-khong-on-lam-nhung-chua-bao-gio-ngung-no-luc-20221119153833146.chn', key='span', tag_class='kbwcm-time', type='')
Dân trí
- Get the list of article urls: url_extract('https://dantri.com.vn/', key='h3', type='link', bs_on='')
- Extract article details: url_extract('https://dantri.com.vn/the-gioi/moscow-cao-buoc-ukraine-kich-dong-xung-dot-quan-su-nga-nato-20221119145209276.htm', key='time', tag_class='author-time', type='')
Thanh niên
- Get the list of article urls: ``
- Extract article details: ``
Vietnamnet
- Get the list of article urls: ``
- Extract article details: ``
Nhân dân điện tử
- Get the list of article urls: ``
- Extract article details: ``
Lao động
- Get the list of article urls: ``
- Extract article details: ``
Đời sống & pháp luật
- Get the list of article urls: ``
- Extract article details: ``
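The "Get the list of article urls" calls in this section all follow one pattern: look inside every key element (h3, h4, article, ...) on a section page and collect the links found there. Here is a minimal offline sketch of that pattern using only the standard library. HeadingLinkParser, the sample markup and the example.com address are hypothetical, and vnnews itself may collect links differently.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <key> elements.

    Hypothetical helper for illustration; not part of vnnews.
    """

    def __init__(self, key):
        super().__init__()
        self.key = key
        self.depth = 0    # > 0 while inside a <key> element
        self.links = []   # unique hrefs, in page order

    def handle_starttag(self, tag, attrs):
        if tag == self.key:
            self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href and href not in self.links:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self.key and self.depth:
            self.depth -= 1


# Made-up homepage fragment: two headline links, one footer link,
# and one repeated headline.
sample_home = """
<h3><a href="/article-1.htm">Article 1</a></h3>
<h3><a href="https://example.com/article-2.htm">Article 2</a></h3>
<p><a href="/contact.htm">Contact</a></p>
<h3><a href="/article-1.htm">Article 1 again</a></h3>
"""
link_parser = HeadingLinkParser(key="h3")
link_parser.feed(sample_home)
link_parser.close()
print(link_parser.links)
```

Some of the collected links are relative (they start with /), which is the case the fix_url function in section 2.3 is meant to handle.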

III. APPENDICES

Demo video: How to select the key
Explore User Agents by Operating System: https://developers.whatismybrowser.com/useragents/explore/operating_system_name/
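A user agent string is passed to url_extract through the user_agent argument. At the HTTP level it is just a request header. The sketch below builds (but does not send) a request carrying a mobile user agent with the standard library's urllib.request. The helper name build_request is hypothetical, and vnnews may send the header through a different HTTP client.

```python
from urllib.request import Request

# Mobile user agent taken from the Nhịp cầu đầu tư example in section 2.4.
MOBILE_UA = 'Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X)'


def build_request(url, user_agent):
    """Return a Request object with a custom User-Agent header (hypothetical helper)."""
    return Request(url, headers={'User-Agent': user_agent})


req = build_request('https://m.nhipcaudautu.vn/kinh-doanh/', MOBILE_UA)
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))
```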

IV. 🙋‍♂️ CONTACT INFORMATION

You can contact me at one of my social network profiles:

If you want to support my open-source projects, you can "buy me a coffee" via Patreon or Momo e-wallet (VN). Your support helps cover my blog hosting fee and the development of high-quality content.
