[Bug] A failing request retried in a loop blocks other requests and makes the navigation page time out #105

Open
PIGfaces opened this issue Jun 16, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@PIGfaces
Contributor

PIGfaces commented Jun 16, 2022

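Problem description

While crawling the target site https://19.offcn.com/, some links were missed. After turning off headless mode and opening Chrome DevTools, I found that a failed request is resubmitted over and over, which blocks the other requests. As a result the whole page render times out and links are missed.

Console when the page is opened normally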

image

Console when crawling with crawlergo in headless mode

crawlergo

Steps to reproduce

Version

  • Commit Version: 551acb2b75403985493b56414d797ce5a1da480f
  • Browser: 1.39.122 Chromium: 102.0.5005.115 (Official Build) (arm64)

Command executed

 ./crawlergo -m 2 -c **** --no-headless https://19.offcn.com/

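Expected behavior

  • The page finishes loading

Actual behavior

  • The page never finishes loading, so the following <DIV> is not rendered and its links are missed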
						<div class="zg_personal already_login" style="display: none">
							<p class="zg_personalP"><strong><img src=""/></strong><i></i></p>
							<div class="zg_person_list" style="display: none;">
							<em>&nbsp;</em>
							<a href="/mycourse/index/">我的课程</a>
							<a href="/svipcourse/">学员专享</a>
							<a href="/orders/myorders/">我的订单</a>
							<a href="/mycoupon/index/">我的优惠券</a>
							<a href="/user/index/">账号设置</a>
							<a href="/foreuser/outlogin/">退出登录</a>
							</div>
						</div>

Opened normally, the page loads quickly, but loading it with crawlergo takes far too long. Is this a bug? All proxies were already disabled.

@Qianlitp
Owner

By default crawlergo blocks image requests to reduce static resource access. It now looks like this can cause pages to behave abnormally.
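
To illustrate the kind of check involved, here is a minimal sketch of classifying a request as a static resource by its file extension. It is not crawlergo's actual code: `staticSuffixes`, `isStaticResource`, and the example.com URLs are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// staticSuffixes is a hypothetical set of extensions treated as static
// resources. The real list used by crawlergo may differ.
var staticSuffixes = map[string]bool{"png": true, "jpg": true, "gif": true}

// isStaticResource reports whether the path of rawURL ends in one of the
// extensions in staticSuffixes. The query string is ignored.
func isStaticResource(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	ext := strings.ToLower(strings.TrimPrefix(path.Ext(u.Path), "."))
	return staticSuffixes[ext]
}

func main() {
	for _, raw := range []string{
		"https://example.com/static/logo.png?v=2", // hypothetical image URL
		"https://example.com/mycourse/index/",     // hypothetical page URL
	} {
		fmt.Printf("%s static=%v\n", raw, isStaticResource(raw))
	}
}
```

A request that matches would then be failed at the interception stage. If the page retries such a failed request in a loop, that may be what stalls the other requests here.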

@Qianlitp Qianlitp added the bug Something isn't working label Jun 22, 2022
@HeisenbergV
Contributor

HeisenbergV commented Jan 7, 2023

Could it be done like this? It works in my tests, but I don't know whether it has other side effects:

func (tab *Tab) Start() {
	// ...
	if err := chromedp.Run(*tab.Ctx,
		RunWithTimeOut(tab.Ctx, tab.config.DomContentLoadedTimeout, chromedp.Tasks{
			// ...
			// Block static resources here, before navigation starts
			network.SetBlockedURLS(config.StaticSuffix),

			// ...
			// Navigate
			chromedp.Navigate(tab.NavigateReq.URL.String()),
		}),
	); err != nil {
		// ...
	}
	// ...
}

func (tab *Tab) InterceptRequest(v *fetch.EventRequestPaused) {
	// ...
	// Remove the logic below:
	// Block all static resources
	// https://github.com/Qianlitp/crawlergo/issues/106
	// if config.StaticSuffixSet.Contains(url.FileExt()) {
	// 	_ = fetch.FailRequest(v.RequestID, network.ErrorReasonBlockedByClient).Do(ctx)
	// 	req.Source = config.FromStaticRes
	// 	tab.AddResultRequest(req)
	// 	return
	// }

	// ...
}
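
One thing to check with this approach is the form of the entries passed to the blocking call. The DevTools `Network.setBlockedURLs` command takes URL patterns in which `*` is a wildcard. If `config.StaticSuffix` holds bare extensions such as `png` (an assumption here, not verified against the crawlergo source), they may need converting first. A minimal sketch, where `blockedURLPatterns` is a hypothetical helper and the `*.ext` pattern shape is one possible choice:

```go
package main

import (
	"fmt"
	"strings"
)

// blockedURLPatterns turns bare file extensions into "*.ext" wildcard
// patterns. Hypothetical helper: it assumes the input is a list of
// extensions, with or without a leading dot, and skips empty entries.
func blockedURLPatterns(suffixes []string) []string {
	patterns := make([]string, 0, len(suffixes))
	for _, s := range suffixes {
		s = strings.ToLower(strings.TrimPrefix(strings.TrimSpace(s), "."))
		if s == "" {
			continue
		}
		patterns = append(patterns, "*."+s)
	}
	return patterns
}

func main() {
	fmt.Println(blockedURLPatterns([]string{"png", ".JPG", " gif ", ""}))
	// prints: [*.png *.jpg *.gif]
}
```

The resulting slice could then be passed in place of `config.StaticSuffix` in the task list above. Whether a pattern like `*.png` also covers URLs with a query string would need testing against the browser.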
