[Bug]: Missing ads on news sites #266

Open
tuehlarsen opened this issue Nov 20, 2023 · 3 comments
Labels
bug: Something isn't working
replay bug: Archived content is not displaying as expected

Comments

@tuehlarsen

Browsertrix Cloud Version

v1.8.0-beta.4-7d985a9

What did you expect to happen? What happened instead?

Ads are missing on the most-used news sites.
In replay, most ads on news sites are missing: some are replaced with "Archived Page Not Found", some are not displayed at all, and only a few are displayed. All ads can be seen in the Watch Crawl window while crawling.

Step-by-step reproduction instructions

e.g.

politiken.dk
crawl: "pol frontpage with all context"
https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/sched-bb9b135d-357-28341060?workflowId=bb9b135d-3573-4901-bdef-a80d35a15741#replay
Archived Page Not Found
Sorry, this page was not found in this archive:
https://0e9755db0ca066211b5983705fdb4922.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html?n=2

tv2.dk
crawl: tv2.dk frontpage complete context incl. ads
https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/manual-20231118064936-03e01f26-37d?workflowId=03e01f26-37dd-4fa6-880f-db7bd6dd6679

berlingske.dk
crawl: berlingske.dk frontpage with context
https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/manual-20231118095211-a4e6bc32-473?workflowId=a4e6bc32-4733-4a3f-8231-43b6df1c4031#replay

Additional details

No response

@tuehlarsen tuehlarsen added the bug Something isn't working label Nov 20, 2023
@Shrinks99
Member

This may be a result of switching to the Brave browser, which has more aggressive privacy settings by default. It should be possible to disable these on a per-browser-profile basis, but they should probably be off by default unless the user has enabled the "block ads" setting.

In the meantime, try creating a browser profile with some of Brave's "Shields" settings disabled.
Screenshot 2023-11-20 121639

@tuehlarsen
Author

tuehlarsen commented Nov 20, 2023 via email

@Shrinks99 Shrinks99 transferred this issue from webrecorder/browsertrix Nov 22, 2023
@Shrinks99 Shrinks99 added the replay bug Archived content is not displaying as expected label Nov 22, 2023
@tuehlarsen
Author

tuehlarsen commented Mar 23, 2024

If you download the WACZ file from https://beta.browsertrix.cloud/orgs/kb/items/crawl/manual-20240323083932-bb9b135d-357?workflowId=bb9b135d-3573-4901-bdef-a80d35a15741#files:~:text=20240323084140064%2Dbb9b135d%2D357%2D0.wacz
and load it offline with ReplayWeb.page 2.00.beta, it replays the ads that were harvested. But if you unzip the file and load only the warc.gz (kb-pol-frontpage-with-all-context-manual-20240323083932-bb9b135d-357-20240323083954557-0.warc.gz), the replay of https://politiken.dk does not show any ads, even though they are all in the warc.gz file and can be replayed individually from the image/audio/video URL list. The same happens in pywb: no ads in replay.
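
For anyone trying to reproduce the WACZ versus warc.gz comparison, here is a minimal Python sketch that lists what a WACZ contains without unzipping it by hand. It relies only on the facts that a WACZ is a ZIP file and that the WACZ format stores its WARC files under `archive/`. The function name `split_wacz_members` and the file name in the usage line are hypothetical, and the sketch only shows which files are left behind when the warc.gz is loaded on its own; it does not establish why the ads fail to replay.

```python
import sys
import zipfile


def split_wacz_members(wacz):
    """Split the members of a WACZ (ZIP) file into WARCs and everything else.

    `wacz` may be a path or a binary file-like object.
    Returns a dict with two sorted lists of member names:
      "warcs": files under archive/ ending in .warc or .warc.gz
      "other": all remaining files (indexes, page lists, metadata, ...)
    """
    warcs, other = [], []
    with zipfile.ZipFile(wacz) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz")):
                warcs.append(name)
            else:
                other.append(name)
    return {"warcs": sorted(warcs), "other": sorted(other)}


if __name__ == "__main__" and len(sys.argv) > 1:
    members = split_wacz_members(sys.argv[1])
    print("WARC files:")
    for name in members["warcs"]:
        print("  " + name)
    print("Other files (not present when only the warc.gz is loaded):")
    for name in members["other"]:
        print("  " + name)
```

Saved as, say, `list_wacz.py` (a hypothetical name), it could be run as `python list_wacz.py downloaded-crawl.wacz` to compare the two inputs being tested.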

Projects
Status: Triage

2 participants