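pro.major_news: some sources (src) return over 200 identical duplicate articles within a very short time window, crowding out other news; please deduplicate upstream #1715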

Open
firezym opened this issue Aug 2, 2023 · 0 comments


firezym commented Aug 2, 2023

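In the news feed pro.major_news, some sources (src) return more than 200 duplicate articles within a very short time window. Because the API caps the number of rows returned per call, these duplicates easily crowd out articles from other sources, so some news cannot be retrieved at all. In the window below, for example, more than 200 articles come back within 5 minutes, but only 16 remain after deduplication. I hope this can be deduplicated on the official side.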

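import tushare as ts

pro = ts.pro_api()  # assumes the token was already saved with ts.set_token(...)
df = pro.major_news(src='', start_date='2020-06-03 08:30:00', end_date='2020-06-03 08:35:00', fields='pub_time,title,content,src')
print(len(df))  # rows returned for the 5-minute window
print(len(df.drop_duplicates(subset=['title', 'content'])))  # rows left after dropping identical title/content pairs
df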
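
Until the duplicates are removed upstream, a client-side workaround might look like the sketch below. It splits a time range into short slices, fetches each slice separately so that duplicates are less likely to exhaust the per-call row limit, and then drops rows with identical title and content. `fetch_deduplicated`, `slice_minutes` and the `fetch` callable are hypothetical names introduced for illustration. The right slice length depends on the interface's actual limit, which is not stated here.

```python
from datetime import datetime, timedelta

import pandas as pd

FMT = "%Y-%m-%d %H:%M:%S"


def fetch_deduplicated(fetch, start_date, end_date, slice_minutes=5):
    """Fetch [start_date, end_date] in short slices and drop duplicate articles.

    `fetch(start, end)` is a hypothetical callable that returns a DataFrame
    with at least 'title' and 'content' columns, for example a thin wrapper
    around pro.major_news.
    """
    start = datetime.strptime(start_date, FMT)
    end = datetime.strptime(end_date, FMT)
    step = timedelta(minutes=slice_minutes)

    frames = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        # Adjacent slices share a boundary timestamp; any row fetched twice
        # is removed by the drop_duplicates call below.
        frames.append(fetch(cursor.strftime(FMT), upper.strftime(FMT)))
        cursor = upper

    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True)
    if combined.empty:
        return combined
    deduped = combined.drop_duplicates(subset=["title", "content"])
    return deduped.reset_index(drop=True)


# Example wiring (assumes `pro` is an initialised tushare pro client):
# df = fetch_deduplicated(
#     lambda s, e: pro.major_news(src='', start_date=s, end_date=e,
#                                 fields='pub_time,title,content,src'),
#     '2020-06-03 08:30:00', '2020-06-03 09:00:00')
```

This only reduces the problem: if a single slice still holds more duplicates than the per-call limit allows, articles in that slice can still be lost, which is why deduplication at the source is preferable.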

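Below are some of the time windows with heavy duplication. I have not finished pulling the data, so this list is probably incomplete. From what I have seen, src='凤凰财经' shows up most often (the problem is much rarer after February 2022):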

2020-05-18 08:30:00  2020-05-18 09:00:00
2020-06-03 08:30:00  2020-06-03 09:00:00
2020-06-05 08:30:00  2020-06-05 09:00:00
2020-06-06 08:30:00  2020-06-06 09:00:00
2020-06-06 15:00:00  2020-06-06 15:30:00
2020-06-09 09:00:00  2020-06-09 09:30:00
2020-06-11 08:30:00  2020-06-11 09:00:00
2020-06-11 15:30:00  2020-06-11 16:00:00
2020-06-12 08:30:00  2020-06-12 09:00:00
2020-06-12 20:30:00  2020-06-12 23:59:59
2020-06-14 16:00:00  2020-06-15 16:59:59
2020-12-21 09:30:00  2020-12-21 10:00:00
2021-06-21 05:00:00  2021-06-21 08:00:00
2021-06-23 00:00:00  2021-06-23 01:00:00
2021-06-25 01:00:00  2021-06-25 07:00:00
2021-06-29 03:00:00  2021-06-29 04:00:00
2021-07-01 01:00:00  2021-07-01 04:00:00
2021-07-07 01:00:00  2021-07-07 02:00:00
2021-07-11 22:00:00  2021-07-12 08:00:00
2021-07-12 17:30:00  2021-07-12 23:00:00
2021-07-14 01:00:00  2021-07-14 02:00:00
2021-07-21 20:30:00  2021-07-21 22:00:00
2021-07-22 18:30:00  2021-07-27 18:00:00
2021-07-29 13:00:00  2021-07-30 15:00:00
2021-07-31 14:00:00  2021-07-31 15:00:00
2021-08-03 18:00:00  2021-08-03 19:00:00
2021-08-04 13:30:00  2021-08-04 15:00:00
2021-08-05 21:00:00  2021-08-06 14:00:00
2021-08-08 07:00:00  2021-08-08 13:00:00
2021-08-09 18:00:00  2021-08-09 19:00:00
2021-08-10 18:00:00  2021-08-10 19:00:00
2021-08-13 19:00:00  2021-08-14 09:00:00
2021-08-19 06:00:00  2021-08-19 07:00:00
2021-08-28 21:30:00  2021-08-28 08:00:00
2021-09-03 07:00:00  2021-09-03 09:00:00
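
To re-check windows like the ones listed above, one could parse each line into a start and end timestamp, flag rows whose end is not later than their start, and compare raw and deduplicated row counts per window. The following is a minimal sketch; `parse_windows` and `duplicate_stats` are hypothetical helper names, and the two sample rows are copied from the list above.

```python
from datetime import datetime

import pandas as pd

FMT = "%Y-%m-%d %H:%M:%S"


def parse_windows(text):
    """Parse lines of 'start  end' timestamps into (start, end, is_valid) tuples."""
    windows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        start = " ".join(parts[:2])
        end = " ".join(parts[2:])
        # A window is usable only if it ends after it starts.
        is_valid = datetime.strptime(start, FMT) < datetime.strptime(end, FMT)
        windows.append((start, end, is_valid))
    return windows


def duplicate_stats(df):
    """Return (total rows, unique rows), with uniqueness judged by title and content."""
    unique = df.drop_duplicates(subset=["title", "content"])
    return len(df), len(unique)


sample = """
2020-06-03 08:30:00  2020-06-03 09:00:00
2021-08-28 21:30:00  2021-08-28 08:00:00
"""
for start, end, ok in parse_windows(sample):
    print(start, end, "ok" if ok else "end is not after start")
```

Passing the DataFrame fetched for each valid window to `duplicate_stats` gives the same two numbers as the `print` calls in the original snippet.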

The windows after 2021-09-03 are in the CSV file below:
long-news-import-error-timespan.csv

tushare id:382058
