get_or_none() got an unexpected keyword argument 'user' #674

Open
Steven630 opened this issue Apr 20, 2024 · 67 comments

@Steven630

Steven630 commented Apr 20, 2024

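I built a recipe with the Chrome extension, uploaded it, and triggered a manual "deliver now" from the advanced settings. It failed with: Failed to execute recipe "주요뉴스 연합뉴스": get_or_none() got an unexpected keyword argument 'user'

Separately, the extension is really useful. Here are three small suggestions for consideration:

  1. On the index page, allow selecting different sections.
  2. On the index page, capture each article's publication time, with a configurable time zone and time format, to make filtering by oldest articles easier.
  3. For the article body, in addition to rules that extract the wanted content, could you consider adding remove rules? That would make it quicker to strip ads and similar content.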
@cdhigh
Owner

cdhigh commented Apr 24, 2024

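The reason is that, while writing the extension, I found it awkward to display the scraping rules in the extension UI: they were too complex and made the small dialog look cluttered. I then went through the BeautifulSoup documentation and found that CSS selectors are much more concise than its dict-style arguments.
Both formats are now supported, so if you are more familiar with the old style you can keep using it.
As I noted in the code comments, the two lines in the comment are equivalent, which shows how concise CSS selectors are:

# A two-dimensional list that can hold multiple tag rules. Each rule is flexible; any valid BeautifulSoup rule works (a dict or a CSS selector string).
# Each top-level element is a list of lookup rules for HTML tags, applied in order from parent node to child node down to the last element.
# The last element must be a link, or have a link among its children; that link is the final article URL, and its text is the article title.
# For example: url_extract_rules = [[{'name': 'div', 'attrs': {'class': 'art', 'data': True}}, {'name': 'a'}],]
# Or: url_extract_rules = [['div.art[data]', 'a'],]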
url_extract_rules = []
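
The equivalence stated in the comment can be checked against BeautifulSoup directly. The sketch below is only an illustration: the HTML snippet and the helper `apply_rules` are invented for this example and are not KindleEar's actual implementation, which may walk the rule chain differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the markup is invented for this illustration.
HTML = """
<div class="art" data="1"><a href="/a1">First</a></div>
<div class="art"><a href="/skip">No data attribute</a></div>
<div class="other" data="2"><a href="/skip2">Wrong class</a></div>
"""

def apply_rules(soup, rules):
    """Apply one parent-to-child rule chain; each rule is a dict or a CSS selector."""
    nodes = [soup]
    for rule in rules:
        found = []
        for node in nodes:
            if isinstance(rule, dict):
                found.extend(node.find_all(**rule))   # dict-style arguments
            else:
                found.extend(node.select(rule))       # CSS selector string
        nodes = found
    return nodes

soup = BeautifulSoup(HTML, "html.parser")
dict_rules = [{'name': 'div', 'attrs': {'class': 'art', 'data': True}}, {'name': 'a'}]
css_rules = ['div.art[data]', 'a']

dict_links = [(a['href'], a.get_text()) for a in apply_rules(soup, dict_rules)]
css_links = [(a['href'], a.get_text()) for a in apply_rules(soup, css_rules)]
print(dict_links)
print(css_links)
```

Both chains select only the link inside the div that has the class art and a data attribute.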

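Python 2 recipes generally need minor changes; very simple ones may need none. It depends on how complex your script is.

"Automatically exclude duplicate articles" is supported.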

@Steven630
Author

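How do I write a recipe for a single web page? Nearly all existing recipes assume that each recipe contains many articles. If I know there is only one article and no index is needed, what should I change? The built-in Espresso recipe has only one article, and its base URL is the content itself, yet it still builds an extra "The world in brief" index. How should I modify it?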

@Steven630
Author

Steven630 commented Apr 29, 2024

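I find the Chrome extension so handy that I used it to generate rules even though the source is an RSS feed. Afterwards I changed both the import and the base class in the class parentheses to BasicNewsRecipe, deleted the url extract rules, and changed the feeds entry at the end to the RSS link. The delivered result did not use those rules, though; the whole page was fetched. Did I miss something? Could you add an option to generate rules for RSS articles (skipping the first step, index scraping), so that no manual changes are needed afterwards?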

@Steven630
Author

Steven630 commented Apr 29, 2024

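I did not explain myself clearly. Most RSS feeds today are not full-text, so in the end the article page still has to be opened and its structure parsed. I took a shortcut when writing the recipe and wanted the extension to pick out the body elements. For the index page I simply used an unrelated one (the source was the BBC China RSS, so I went to the corresponding China section page; the list of recent articles is the same, and I only wanted steps two and three to generate the body extraction rules). Once the .py file was generated, I removed the URL extraction rules and so on. I had assumed that extract rules and remove rules were generic. Could you consider adding these two features to the basic recipe as well? After all, article bodies are all web pages. The extension could add an RSS option where the user pastes the link of one article and goes straight to step two, body extraction. The generated recipe would then select the matching class. If a KE update gave the basic recipe those two features, the extension could also help parse article bodies for RSS feeds.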

@cdhigh
Owner

cdhigh commented Apr 30, 2024

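If it is a subclass derived from WebPageUrlNewsRecipe, yes; but if it is still a traditional RSS feed, it is subject to the oldest-article time limit.

A "no news feeds available" message is not necessarily a time issue; it may be caused by an exception raised while parsing the article list. I suggest checking the logs in the backend.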

@Steven630
Author

Steven630 commented Apr 30, 2024

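Following what you showed me yesterday, I changed the import and the class back to WebPageUrlNewsRecipe and added the function inside the class. Nothing else was changed.

I checked the logs, and the message is indeed "no article found".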

@Steven630
Author

Steven630 commented Apr 30, 2024

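I also have a question about the Economist recipe. The built-in recipe is missing articles, and Calibre updated its recipe today. The author wrote on the forum that it will "use the GraphQL query to load the content".

I tried the new recipe on GAE, and the backend logging reports: "textPayload": "parse_index() failed: [Errno 30] Read-only file system: 'u7_iuj2x.html'"

I wonder whether the new recipe relies on something KE does not support.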

https://github.com/kovidgoyal/calibre/blob/refs/heads/master/recipes/economist.recipe

@cdhigh
Owner

cdhigh commented Apr 30, 2024

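You could send him a pull request. His approach is somewhat improper: it creates temporary files in whatever the current directory happens to be.
If you want to fix it yourself, change the PersistentTemporaryFile() call to
pt = PersistentTemporaryFile('.html', dir=os.getenv('KE_TEMP_DIR'))
or, more simply,
pt = PersistentTemporaryFile('.html', dir='/tmp')

I had a look, and I will update the code to be compatible with this recipe change.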
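
The idea of pointing temporary files at a writable directory can be sketched with the standard library alone. This is an assumption-laden illustration, not calibre's or KindleEar's code: it uses `tempfile.NamedTemporaryFile` as a stand-in for calibre's `PersistentTemporaryFile`, and the helper names `writable_temp_dir` and `make_html_temp_file` are hypothetical. Only the `KE_TEMP_DIR` variable name comes from the comment above.

```python
import os
import tempfile

def writable_temp_dir():
    """Return KE_TEMP_DIR if it is set, otherwise the platform temp directory."""
    return os.getenv('KE_TEMP_DIR') or tempfile.gettempdir()

def make_html_temp_file(content: str) -> str:
    """Write content to a .html temp file that survives close, and return its path."""
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile('w', suffix='.html', dir=writable_temp_dir(),
                                     delete=False, encoding='utf-8') as f:
        f.write(content)
        return f.name

path = make_html_temp_file('<html><body>demo</body></html>')
print(path)
```

Passing `dir=` explicitly is the point: without it, code that writes relative to the current directory can hit a read-only file system on platforms where only a temp location is writable.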

@cdhigh
Owner

cdhigh commented Apr 30, 2024

Regarding the no-news problem: if you are willing, upload your recipe and I will take a look.

@Steven630
Author

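You could send him a pull request. His approach is somewhat improper: it creates temporary files in whatever the current directory happens to be. If you want to fix it yourself, change the PersistentTemporaryFile() call to pt = PersistentTemporaryFile('.html', dir=os.getenv('KE_TEMP_DIR')) or, more simply, pt = PersistentTemporaryFile('.html', dir='/tmp')

I had a look, and I will update the code to be compatible with this recipe change.

OK, I will make the change and try it, and I look forward to the next upgrade.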

@Steven630
Author

Steven630 commented Apr 30, 2024

Regarding the no-news problem: if you are willing, upload your recipe and I will take a look.

Here is the recipe:
BBC China.txt

I changed the file extension to .txt.

@cdhigh
Owner

cdhigh commented Apr 30, 2024

Change this function:

def parse_feeds(self):
    BasicNewsRecipe.parse_feeds(self)

by adding a return:

def parse_feeds(self):
    return BasicNewsRecipe.parse_feeds(self)
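
Why the missing return matters can be shown with a toy class hierarchy. The classes below are invented stand-ins, not calibre's `BasicNewsRecipe`; they only demonstrate that an override which calls the parent without returning its result yields `None`, so the caller sees no feeds at all.

```python
class Base:
    """Stand-in for a recipe base class whose parse_feeds() returns the feeds."""
    def parse_feeds(self):
        return ['feed-1', 'feed-2']

class WithoutReturn(Base):
    def parse_feeds(self):
        Base.parse_feeds(self)  # result is discarded; the method returns None

class WithReturn(Base):
    def parse_feeds(self):
        return Base.parse_feeds(self)  # result is passed back to the caller

missing = WithoutReturn().parse_feeds()
present = WithReturn().parse_feeds()
print(missing)
print(present)
```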

@Steven630
Author

Steven630 commented Apr 30, 2024

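or, more simply,
pt = PersistentTemporaryFile('.html', dir='/tmp')

I had a look, and I will update the code to be compatible with this recipe change.

I switched to this simpler version, and the logs now report many temporary files as not found. At the very start there is also a cover download error, "permission denied"; that error already occurred with the old version of the recipe, and the delivered file indeed has no cover.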

@cdhigh
Owner

cdhigh commented Apr 30, 2024

This looks like a bug in my code: URLs of the form file://// seem to be parsed into a path that starts with //.
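
The parsing behaviour is easy to reproduce with Python's standard `urllib.parse`. The sketch below is not the fix that went into KindleEar; `local_path_from_file_url` is a hypothetical helper that shows one way to collapse the extra leading slash.

```python
from urllib.parse import urlparse

def local_path_from_file_url(url: str) -> str:
    """Return a local path for a file URL, collapsing repeated leading slashes."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError(f'not a file URL: {url}')
    # 'file:////tmp/x.html' has an empty host, which leaves '//tmp/x.html' as the path.
    return '/' + parsed.path.lstrip('/')

url = 'file:////tmp/ih_vfvs2.html'
raw_path = urlparse(url).path
fixed_path = local_path_from_file_url(url)
print(raw_path)
print(fixed_path)
```

Stripping the slashes by hand is deliberate: on POSIX systems `os.path.normpath` preserves exactly two leading slashes, so it would not normalize this path.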

@cdhigh
Owner

cdhigh commented Apr 30, 2024

The code has been updated.

@Steven630
Author

Steven630 commented Apr 30, 2024

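Change this function:

def parse_feeds(self):
    BasicNewsRecipe.parse_feeds(self)

by adding a return:

def parse_feeds(self):
    return BasicNewsRecipe.parse_feeds(self)

Thanks! Deliveries now contain articles, but the article body still looks unprocessed.

The log says the extract rules failed and it fell back to readability.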

@Steven630
Author

I have another question.

"Failed to execute recipe "컨텍스트 레터": can't subtract offset-naive and offset-aware datetimes"
"Failed to execute input plugin: All feeds are empty, aborting."
"There are no new feeds available."

What is the problem this time?

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from calibre.web.feeds.recipes import WebPageUrlNewsRecipe

class CustomRecipe95934(WebPageUrlNewsRecipe):
    __created_date__ = "2024-04-30"
    title = "컨텍스트 레터"
    description = "컨텍스트 레터"
    encoding = "UTF-8"
    language = "ko-KR"
    max_articles_per_feed = 30
    oldest_article        = 2
    auto_cleanup = False

    url_extract_rules = [
        [
            "article.entry",
            "div.entry-content-wrap",
            "header.entry-header",
            "h2.entry-title",
            "a[href][rel]",
        ],
    ]

    content_extract_rules = [
        [
            "section",
            "div",
            "div",
            "header",
            "h1",
        ],
        [
            "div.entry-content",
            "div[id]",
            "div[id][style]",
        ],
    ]

    content_remove_rules = [
        [
            "div",
            "div.wp-block-kadence-spacer",
            "div",
        ],
        [
            "body",
            "div",
            "div.wp-block-spacer",
        ],
    ]

    feeds = [
        ("컨텍스트 레터", "https://slownews.kr/category/slowletter"),
    ]

@cdhigh
Owner

cdhigh commented Apr 30, 2024

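This happens because Google's datastore behaves differently from SQL:
it automatically converts stored times that have no time zone information into times with time zone 0 (UTC).
Given that, I now simply use UTC times everywhere internally. The code has been updated.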
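
The error message itself comes from Python's `datetime`: subtracting a naive value from an aware one raises `TypeError`. The sketch below reproduces it and shows one way to normalize both sides to UTC. The helper `as_utc` and the sample timestamps are made up for illustration, and treating naive values as UTC is an assumption that matches the behaviour described above rather than KindleEar's exact code.

```python
from datetime import datetime, timezone

def as_utc(dt: datetime) -> datetime:
    """Treat naive datetimes as UTC; convert aware ones to UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

naive = datetime(2024, 4, 30, 12, 0, 0)                       # no time zone attached
aware = datetime(2024, 4, 30, 13, 0, 0, tzinfo=timezone.utc)  # time zone attached

try:
    aware - naive
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)

delta = as_utc(aware) - as_utc(naive)
print(mixed_error)
print(delta)
```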

@Steven630
Author

Steven630 commented Apr 30, 2024

After redeploying, the Economist recipe still fails, with many messages like these:

"The file '//tmp/gmr6w_ze.html' does not exist"
"Could not fetch link file:////tmp/ih_vfvs2.html : No content at URL 'file:////tmp/ih_vfvs2.html'"
"Failed to download article:How strong is India’s economy? from file:////tmp/7q1kgbbv.html"

The first few articles seem to have succeeded?

DEFAULT 2024-04-30T13:57:54.104022Z Title : Politics
DEFAULT 2024-04-30T13:57:54.104027Z URL : file:////tmp/ih_vfvs2.html
DEFAULT 2024-04-30T13:57:54.104031Z Author :
DEFAULT 2024-04-30T13:57:54.104036Z Summary : ...
DEFAULT 2024-04-30T13:57:54.104040Z Date : Tue, 30 Apr, 2024 13:57
DEFAULT 2024-04-30T13:57:54.104045Z TOC thumb : None
DEFAULT 2024-04-30T13:57:54.104049Z Has content : False

The tmp/ih_vfvs2.html here is the same file as in the error message above, so it apparently did not succeed either.

The cover still fails:
"Failed to download supplied masthead_url: [Errno 13] Permission denied: '/mastheadImage.gif'"

@Steven630
Author

Also, this recipe (컨텍스트 레터) produces the following error:
GET https://slownews.kr/wp-content/uploads/2023/11/230324_삼성전기_중국_텐진공장_점검_3-963x800.jpg failed: 'latin-1' codec can't encode characters in position 54-57: ordinal not in range(256)
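
Errors of this kind typically appear when a URL containing non-ASCII characters reaches an HTTP layer that encodes it as latin-1. That is an assumption about the cause here, not a confirmed diagnosis, and the sketch below is not KindleEar's fix. The hypothetical helper `to_ascii_url` percent-encodes the path and query as UTF-8 before the request is made; it does not handle non-ASCII host names.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def to_ascii_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # Keep '/' and existing '%' escapes; encode everything else as UTF-8.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

url = ('https://slownews.kr/wp-content/uploads/2023/11/'
       '230324_삼성전기_중국_텐진공장_점검_3-963x800.jpg')
ascii_url = to_ascii_url(url)
print(ascii_url)
```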

@Steven630
Author

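Thanks for all the hard work. That site really is rather odd. On the Amazon EPUB delivery problem: there is a JavaScript project on GitHub dedicated to fixing EPUB files so that they meet Amazon's requirements. The code is short, and I wonder whether KE could use some of it:

https://github.com/innocenat/kindle-epub-fix

I also ran into Amazon rejecting a file once before, when delivering Guardian news.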

@Steven630
Author

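Which App Engine item is it?

The second-generation runtime costs somewhat more than the first-generation one, and the Python 3 version uses more resources, but the instance hours should be sufficient, so it is probably some other charge.

All of it is charged to backend instances: 16.26 hours.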

@cdhigh
Owner

cdhigh commented May 1, 2024

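Only 9 hours per day are free. Unless you have dozens of RSS feeds, I suggest downsizing the worker's instance class.
To keep people from running out of memory I set the default to B4; B2 is enough for most people, and B1 is enough for five feeds or fewer.
B2 costs twice as much as B1, and B4 costs twice as much as B2.
Open worker.yaml in the shell, edit it, and then just run gae_deploy.sh.

Consumption is only this high during testing, because the instance is woken up frequently, sometimes even with two processes running, which doubles the cost again. Each wake-up runs for at least 15 minutes. In normal use, with only one or two wake-ups per day, you will not exceed the quota.

The frontend has 28 free hours, which is generally not exceeded.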
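
The quota arithmetic can be worked through as a rough sketch. It assumes that the 9 free backend hours are counted in B1-equivalent hours and that larger classes consume them in proportion to the price ratios given above; that accounting model is an assumption for illustration, so check the App Engine billing documentation before relying on the numbers.

```python
# Relative cost per wall-clock hour, from the ratios stated above (B1 = 1).
CLASS_MULTIPLIER = {'B1': 1, 'B2': 2, 'B4': 4}
FREE_BACKEND_HOURS_PER_DAY = 9   # figure quoted in this thread
MIN_WAKEUP_HOURS = 0.25          # each wake-up runs for at least 15 minutes

def billed_hours(instance_class: str, wall_clock_hours: float) -> float:
    """B1-equivalent hours consumed by running for wall_clock_hours (assumed model)."""
    return CLASS_MULTIPLIER[instance_class] * wall_clock_hours

def free_wall_clock_hours(instance_class: str) -> float:
    """Wall-clock hours per day covered by the free quota (assumed model)."""
    return FREE_BACKEND_HOURS_PER_DAY / CLASS_MULTIPLIER[instance_class]

for cls in CLASS_MULTIPLIER:
    hours = free_wall_clock_hours(cls)
    print(cls, hours, 'hours, about', int(hours / MIN_WAKEUP_HOURS), 'minimum-length wake-ups')
```

Under these assumptions, frequent test deliveries on B4 exhaust the free hours four times faster than on B1, which fits the advice to downsize the worker.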

@Steven630
Author

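How do I open worker.yaml in the shell? Can it be edited online? Sorry for asking such a beginner question.

At the moment I have only one RSS feed that I subscribe to regularly, because the delivery results for the others have not been satisfactory...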

@cdhigh
Owner

cdhigh commented May 1, 2024

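Yes, it can be edited online. Near the top right of the shell there is an "Open Editor" button; click it, open the menu on the left, and select worker.yaml.
Change the line to instance_class: B1, save, click "Open Terminal", and run the command: kindleear/tools/gae_deploy.sh

In your case B1 is plenty, with resources to spare. Since you have few subscriptions, you can also lower idle_timeout to 15 minutes, or even 10 minutes; this value determines how long the process runs each time it is started. Setting the maximum number of instances (max_instances) to 1 saves even more.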

instance_class: B1
basic_scaling:
  max_instances: 1
  idle_timeout: 15m

app_engine_apis: true
entrypoint: gunicorn -b :$PORT -w 1 --timeout 900 main:app

@cdhigh
Owner

cdhigh commented May 1, 2024

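A better approach is to fork the GitHub repository and make your changes in your fork. When my repository is updated, sync the changes into your fork first, then deploy with the GitHub repository URL in the command below replaced by your own.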

rm -rf kindleear && \
git clone --depth 1 https://github.com/cdhigh/kindleear.git && \
chmod +x kindleear/tools/gae_deploy.sh && \
kindleear/tools/gae_deploy.sh

@Steven630
Author

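Thanks, I finally found it. After this change, will I have to redo it every time the code is upgraded and redeployed? And would you recommend B2 for five or six RSS feeds?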

@Steven630
Author

I just saw your latest reply. Thanks!

@cdhigh
Owner

cdhigh commented May 1, 2024

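You can stay on B1 and move up only when a delivery fails and you see "out of memory" in the backend.
If you do not want to fork the repository, you can keep a copy of the file and add one command that overwrites it on each deployment.

For example, now that you have made the change, back it up to your home directory with this command:

cp kindleear/worker.yaml ~/worker.yaml

For later deployments, use the commands below; they include one extra command in the middle that restores this file.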

rm -rf kindleear && \
git clone --depth 1 https://github.com/cdhigh/kindleear.git && \
chmod +x kindleear/tools/gae_deploy.sh && \
cp ~/worker.yaml kindleear/worker.yaml && \
kindleear/tools/gae_deploy.sh

@Steven630
Author

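Thanks for the guidance. I ended up forking after all. I had actually forked before, but because the old version contained some recipes of my own, I never updated it.

The Economist recipe can be delivered now, although some images still fail to download:
"Could not fetch image https://www.economist.com/cdn-cgi/image/width=600,quality=80,format=auto/media-assets/image/20240427_INT101.png: status: 429"

For some reason the magazine cover is still missing.

This error also appears along the way: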
"recipe1.publication_date error: Unknown string format: weeklyedition"

Repository owner deleted a comment from Steven630 May 10, 2024
@cdhigh
Owner

cdhigh commented May 10, 2024

That is of course the reason; you had not mentioned it before. Besides changing that line, you also need to modify:

def populate_article_metadata(self, article, soup, first):
    article.url = soup.find('h3')['title']

Repository owner deleted a comment from Steven630 May 10, 2024
@Steven630
Author

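That is of course the reason; you had not mentioned it before. Besides changing that line, you also need to modify:

def populate_article_metadata(self, article, soup, first):
    article.url = soup.find('h3')['title']

Ah, sorry. I thought it was just a matter of changing one tag; I did not realize that later processing also looks it up... Sorry for the trouble, and thank you for going as far as updating the GAE project.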

@cdhigh
Owner

cdhigh commented May 10, 2024

No need to apologize. I am really grateful that you keep finding problems.

Repository owner deleted a comment from Steven630 May 10, 2024
@Steven630
Author

Steven630 commented May 10, 2024

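After delivery, each article now starts with four parts:
h1
div (flytitle)
h3
div (subtitle)

The h1 and the h3 have the same content, the title, and differ only in font size. The h3 is obviously there because of my change; what I wanted was to have no h1 on the first line.

Is it because the recipe still contains this line?

E(article, 'h1', replace_entities(data['headline']))

That does not seem right either: this line is in the else branch, and use_archive is true, so this code is never reached. There is no other h1 anywhere else in the recipe, apart from the handling of interactive-page articles, where a notice is added saying they should be opened in a browser.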

@cdhigh
Owner

cdhigh commented May 10, 2024

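My guess is that it is an h2, not an h1. The title at the top is added automatically by KindleEar: if an article contains no h1 or h2, an h2 is prepended by default.
See news.py, lines 1074 to 1086.

The fix is simply to change h3 to h2.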
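
The behaviour described here can be sketched with the standard library. This is a hypothetical reconstruction for illustration only, not the code in news.py; the names `HeadingFinder` and `ensure_title` are invented. It shows why an article whose only heading is an h3 ends up with a second, automatically added title.

```python
import html
from html.parser import HTMLParser

class HeadingFinder(HTMLParser):
    """Record whether the document contains an h1 or h2 tag."""
    def __init__(self):
        super().__init__()
        self.has_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.has_heading = True

def ensure_title(body_html: str, title: str) -> str:
    """Prepend an h2 title when the body has no h1/h2 of its own."""
    finder = HeadingFinder()
    finder.feed(body_html)
    if finder.has_heading:
        return body_html
    return f'<h2>{html.escape(title)}</h2>' + body_html

with_h3 = ensure_title('<h3>Title</h3><p>text</p>', 'Title')
with_h2 = ensure_title('<h2>Title</h2><p>text</p>', 'Title')
print(with_h3)
print(with_h2)
```

With an h3 the title appears twice; with an h2 the body is left unchanged, which is the effect of the suggested fix.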
