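Weibo Crawler

Crawls as much information as possible about a specified user.

Implemented features

Fetch the user's basic profile information.

Fetch all of the user's posts, including the original posts they reposted. Long-post full text, image IDs, and image files are supported, and the crawl can be limited to posts published after a specified date.

Fetch basic profile information for the users who mutually follow the target user. Indirect mutual follows can be included, and both a follower-count threshold and an upper limit on the number of users fetched can be specified.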
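The mutual-follow crawl described above can be pictured as a bounded breadth-first walk. The following is a minimal sketch, not the project's actual code: `fetch_mutuals`, the `id` and `followers_count` keys, and all parameter names are hypothetical, and the threshold is assumed to be a minimum follower count.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_mutual_follows(
    root_id: str,
    fetch_mutuals: Callable[[str], Iterable[dict]],  # hypothetical fetcher
    max_depth: int = 1,      # 1 = direct mutual follows only
    min_followers: int = 0,  # assumed meaning of the follower threshold
    max_users: int = 100,    # upper limit on collected users
) -> List[dict]:
    """Breadth-first walk over mutual follows, starting from root_id."""
    seen = {root_id}
    result: List[dict] = []
    queue = deque([(root_id, 0)])
    while queue and len(result) < max_users:
        user_id, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for user in fetch_mutuals(user_id):
            if user["id"] in seen:
                continue
            seen.add(user["id"])
            # Users below the threshold are neither kept nor expanded.
            if user["followers_count"] < min_followers:
                continue
            result.append(user)
            if len(result) >= max_users:
                break
            queue.append((user["id"], depth + 1))
    return result
```

Setting `max_depth=2` in this sketch corresponds to including indirect mutual follows (mutual follows of mutual follows).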

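Usage

Use a virtual environment (recommended)

# Create the virtual environment
python -m venv ./venv
# Activate the virtual environment (Windows)
.\venv\Scripts\activate

Install dependencies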

pip install -r requirements.txt

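Get the cookie

Open the web version of Sina Weibo and log in to your account.

Then press F12 to open the Developer Tools, press F5 to refresh the page, click through the items in the order shown in the figure, then find the cookie and copy it. Keep the cookie strictly confidential.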
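For context, a copied cookie string is normally sent back to the site in the `Cookie` request header. The sketch below only builds a request object with the standard library and does not send it. The function name, the URL used in the usage note, and the `User-Agent` value are placeholders, and how this project actually passes the cookie is not shown in this README.

```python
import urllib.request


def build_request(url: str, cookie: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries the browser cookie."""
    headers = {
        # Strip the whitespace and newlines that often come along when copying.
        "Cookie": cookie.strip(),
        # Placeholder value; a real browser User-Agent string is much longer.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)
```

For example, `build_request("https://weibo.com/", cookie_text)` returns a request whose `Cookie` header holds the trimmed string. Because the cookie grants access to the account, it is safer to keep it out of version control and out of logs.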

Configure and run

The first run copies config_simple.json5 to config.json5.

Edit config.json5, then run again to start the crawl.

python main.py
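The two-step behaviour above (the first run creates the config, later runs crawl) can be sketched as follows. This is an illustration under assumptions rather than the code in main.py: the function name `ensure_config` and its return convention are hypothetical.

```python
import shutil
from pathlib import Path


def ensure_config(directory: Path) -> bool:
    """Return True if config.json5 already exists, False if it was just created."""
    template = directory / "config_simple.json5"
    config = directory / "config.json5"
    if config.exists():
        return True
    # First run: create config.json5 from the template so the user can edit it.
    shutil.copyfile(template, config)
    return False
```

A caller could stop after a `False` result and ask the user to fill in config.json5 before running again.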

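Data storage

All data is saved under output, in a folder named after the user ID.

User information is saved to user.json
Posts are saved to mblog.json
Mutual follows are saved to mutual_follow.json
Images attached to posts are saved to image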
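The layout above can be reproduced with a few lines of standard-library code. This is a sketch only: the function name and arguments are hypothetical, and the JSON formatting options (UTF-8, indentation) are assumptions rather than the project's confirmed behaviour.

```python
import json
from pathlib import Path


def save_user_data(output_root, user_id, user, mblogs, mutual_follows) -> Path:
    """Write one user's data into <output_root>/<user_id>/ and return that folder."""
    user_dir = Path(output_root) / str(user_id)
    # The image subfolder holds downloaded post attachments.
    (user_dir / "image").mkdir(parents=True, exist_ok=True)
    files = (
        ("user.json", user),
        ("mblog.json", mblogs),
        ("mutual_follow.json", mutual_follows),
    )
    for name, data in files:
        with open(user_dir / name, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps Chinese text readable in the file.
            json.dump(data, f, ensure_ascii=False, indent=2)
    return user_dir
```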

Configuration reference

See config_simple.json5

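Implementation plan

Features

Use JSON5 for the configuration file
Support fetching the accounts a user follows that are not mutual follows
Download headline articles
Download videos
Save API responses to files to reduce repeated network requests
Display the crawled data in a human-friendly way, especially the images attached to posts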
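The plan item about saving API responses to files amounts to a simple on-disk cache. One possible shape is sketched below, assuming the responses are JSON objects. The `fetch` callable, the function name, and the choice of a SHA-256 hash of the URL as the file name are all assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], dict], cache_dir) -> dict:
    """Return the cached response for url, calling fetch only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any query string maps to a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(url)
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data
```

This sketch never expires entries, so a stale response would be returned until its cache file is deleted.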

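System configuration

User information

The user's personal information
Basic information of mutually following users
Keep the different versions of user information captured at different crawl times
IDs of users who have interacted with the user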
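Keeping different versions of the user information could be done by appending a timestamped snapshot whenever the crawled data changes. The sketch below is one possible approach, not the project's design: the history file, the `crawled_at` and `data` keys, and the function name are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_snapshot(history_path, user: dict, now=None) -> bool:
    """Append user to the history file if it differs from the latest snapshot.

    Returns True if a new snapshot was written, False if nothing changed.
    """
    path = Path(history_path)
    history = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    if history and history[-1]["data"] == user:
        return False
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    history.append({"crawled_at": stamp, "data": user})
    path.write_text(json.dumps(history, ensure_ascii=False, indent=2), encoding="utf-8")
    return True
```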
