Movie-Recommendation

Description

This movie recommendation system derives each movie's attributes from its reviews and each user's attributes from the user's own reviews. It then recommends movies by matching user attributes against movie attributes.

Modules

Corpus & corpus crawling module

The corpus is built by the corpus crawling module, which crawls user reviews and movie reviews from Douban.

Data preprocessing module

The data preprocessing module builds a movie classification dictionary and a movie lexicon from the crawled corpus.

Movie recommendation module

This module uses the preprocessed data to score user and movie attributes and makes recommendations from those scores.

Overall process

In the information gathering stage, reviews are collected from Douban and the definition of each movie genre is taken from encyclopedia entries. The encyclopedia definitions are then segmented into words, stop words are removed, and a movie classification dictionary is built. The Douban reviews go through the same word segmentation and stop-word removal to build a review lexicon. Finally, attribute scores are generated from these two lexicons, and recommendations are made by matching on the attribute scores.
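The last step of this process (scoring attributes against a genre dictionary, then matching users to movies) can be sketched as follows. This is a minimal illustration, not the project's implementation: the dictionary contents, the function names, the "share of dictionary hits" score, and the dot-product match are all assumptions made for the example. Input words are assumed to be already segmented and cleaned of stop words.

```python
from collections import Counter

# Hypothetical genre dictionary: genre -> words that indicate that genre.
GENRE_DICT = {
    "comedy": {"funny", "laugh", "hilarious"},
    "horror": {"scary", "ghost", "terrifying"},
}


def attribute_scores(words, genre_dict=GENRE_DICT):
    """Return genre -> share of the words that appear in that genre's dictionary."""
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty input
    return {
        genre: sum(counts[w] for w in vocab) / total
        for genre, vocab in genre_dict.items()
    }


def recommend(user_scores, movies):
    """Rank movie titles by the dot product of user and movie attribute scores."""
    def similarity(movie_scores):
        return sum(user_scores[g] * movie_scores.get(g, 0.0) for g in user_scores)

    return sorted(movies, key=lambda title: similarity(movies[title]), reverse=True)


user = attribute_scores(["movie", "funny", "hilarious"])
movies = {
    "Movie A": attribute_scores(["scary", "ghost"]),
    "Movie B": attribute_scores(["funny", "laugh", "movie"]),
}
ranking = recommend(user, movies)  # the comedy-heavy movie is ranked first
```

Any other similarity measure could replace the dot product here; the point is only that users and movies are scored on the same attributes so that they can be compared.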

Environment

The project as a whole requires Python 2 and Python 3.
Libraries used: requests, bs4, fake_useragent, and pkuseg.
A PE file execution environment is also required.

Corpus

The corpus is divided into "movie reviews" and "user reviews".
"User reviews" are a single user's ten most recent reviews, used to determine that user's attributes.
"Movie reviews" are the first five pages of a movie's reviews, used to determine that movie's attributes.

To add more data, crawl it with user_reviews.py and movie_reviews.py.
Environment:
Python 2
requests
fake_useragent (optional)

Crawler description

The proxies setting can be changed to any working crawler proxies of your own. To change the name of the file that the crawled data is written to, change the first argument of open to the name you need. To crawl more data and improve recognition accuracy, set the final_page variable to the number of pages you want (user reviews are 10 per page, movie reviews 20 per page). For how to use the scripts, see the YouTube video: Crawler demo
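The three settings named above (proxies, the output file name passed to open, and final_page) fit together roughly as in the sketch below. It is an assumption-laden outline rather than the code in user_reviews.py or movie_reviews.py: the function names, the url_template and parse_reviews parameters, and the separator length are hypothetical, and it is written for Python 3 even though the original scripts target Python 2.

```python
def crawl(fetch_page, final_page, out_path, separator="=" * 20):
    """Fetch pages 1..final_page and write each review followed by a separator line."""
    written = 0
    # The first argument of open is the output file name.
    with open(out_path, "w", encoding="utf-8") as out:
        for page in range(1, final_page + 1):
            for review in fetch_page(page):
                out.write(review.strip() + "\n" + separator + "\n")
                written += 1
    return written


def make_fetcher(url_template, parse_reviews, proxies=None):
    """Build a fetch_page callable backed by requests.

    url_template (with a {page} placeholder) and parse_reviews (HTML text ->
    list of review strings) are hypothetical and must be supplied by the caller.
    """
    import requests  # imported lazily so crawl() can be used without it

    def fetch_page(page):
        resp = requests.get(url_template.format(page=page), proxies=proxies, timeout=10)
        resp.raise_for_status()
        return parse_reviews(resp.text)

    return fetch_page
```

Keeping the page fetcher separate from the loop means the proxy settings, the page count, and the output file can each be changed without touching the others.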

Demo

Crawling movie reviews
Movie review crawl results
Crawling user reviews
User review crawl results

Corpus description

Type	Source	Purpose	Count
User reviews	Douban, recent reviews by the same user	Determine user attributes	10 reviews
Movie reviews	Douban, first 5 pages of reviews for the same movie	Determine movie attributes	5 pages, 20 reviews per page

Reviews are separated from one another by a string of equal signs.
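Because reviews are delimited by a string of equal signs, a corpus file can be split back into individual reviews with a few lines of code. The sketch below assumes the separator sits on a line of its own and consists only of "=" characters; the exact separator length is not stated here, so the pattern accepts any length. The function name is hypothetical.

```python
import re

# A separator is assumed to be a line made up solely of "=" characters.
SEPARATOR = re.compile(r"^=+[ \t]*$", re.MULTILINE)


def split_reviews(corpus_text):
    """Split the text of a corpus file into a list of individual reviews."""
    parts = SEPARATOR.split(corpus_text)
    return [part.strip() for part in parts if part.strip()]


sample = "Great film\n=====\nToo long,\nbut fun\n=====\n"
reviews = split_reviews(sample)
```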

Copyright notice

This corpus is for non-commercial purposes. If anything here infringes your rights, please leave a message in an issue.

Preprocessing module

The Data Pre-Processing folder contains five automation scripts:

seg.py: single-file word segmentation script
clean.py: stop-word removal script
dictionary.py: dictionary construction script
count.py: word count script
whileseg.py: batch word segmentation script

For how to use the scripts, see: Data Pre Processing demo
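What the cleaning, dictionary, and counting steps do can be illustrated with the sketch below. It does not reproduce clean.py, dictionary.py, or count.py; the function names and sample data are hypothetical, and the input is assumed to be a list of already segmented words so that the example runs without pkuseg installed.

```python
from collections import Counter

# In the project, segmentation is done with pkuseg. With pkuseg installed,
# tokens could be produced like this:
#     import pkuseg
#     tokens = pkuseg.pkuseg().cut(text)


def remove_stop_words(tokens, stop_words):
    """Drop stop words and empty tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def build_dictionary(definition_tokens, stop_words):
    """Map each genre to the set of non-stop words in its definition."""
    return {
        genre: set(remove_stop_words(tokens, stop_words))
        for genre, tokens in definition_tokens.items()
    }


def word_counts(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


stop_words = {"the", "a", "is", "that"}
cleaned = remove_stop_words(["the", "film", "is", "funny", "film"], stop_words)
genres = build_dictionary({"comedy": ["a", "film", "that", "is", "funny"]}, stop_words)
counts = word_counts(cleaned)
```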

Required environment

Python 3, with the pkuseg library installed beforehand.

Demo

Review cleaning
Review cleaning results
Looking up movie genre definitions
Building the movie classification dictionary

Instructions

Run movie_attr.bat to get the movie attribute scores, and run user_attr.bat to get the user attribute scores.

Note: the full required environment must be installed before running these batch files, otherwise they will fail.
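Since the batch files fail when the environment is incomplete, a quick check before running them can save time. The script below is a hypothetical helper, not part of the repository. It only reports whether the libraries named in the environment section can be imported by the interpreter that runs it, so it would have to be run separately for each Python installation in use (it relies on importlib.util, which is Python 3 only).

```python
import importlib.util

# Import names of the libraries listed in the environment section.
REQUIRED = ["requests", "bs4", "fake_useragent", "pkuseg"]


def missing_modules(names=REQUIRED):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all required libraries found")
```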

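Acknowledgments

Thanks to @nateprewitt for the requests library
Thanks to @Aplicity for the keyword_marry algorithm
Thanks to @jingjingxupku for pkuseg, the multi-domain Chinese word segmentation toolkit
Thanks to @goto456 for the stop-word list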

Amanda-WangXiao/Movie-Recommendation
