HonzaKirchner opened this issue on Apr 4, 2024 · 2 comments
Labels
bug: Something isn't working.
t-console: Issues with this label are in the ownership of the console team.
t-tooling: Issues with this label are in the ownership of the tooling team.
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
I am working with a TripAdvisor API endpoint that returns a list of places near given coordinates in JSON format. One of the places returned in my case is called Restaurace "Otevreno", with the double quotes.
TripAdvisor encodes the string as such: `Restaurace &quot;Otevreno&quot;`.
When I take the response's body from the Cheerio Crawlee context, `&quot;` is automatically decoded to `"`, thus breaking the JSON and making JSON.parse throw an error.
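To make the failure concrete, here is a minimal sketch of what the decoding does to the payload. The payload shape and the `decodeQuotes` helper are hypothetical stand-ins, not the real TripAdvisor response or Crawlee's internal code, and the entity is assumed to be `&quot;`.

```typescript
// Hypothetical payload: the quotes in the place name are encoded as
// HTML entities (assumed to be &quot;), so the raw text is valid JSON.
const rawBody = '{"name":"Restaurace &quot;Otevreno&quot;"}';

// Stand-in for the entity decoding applied to the body.
// This is NOT Crawlee's or cheerio's actual implementation.
function decodeQuotes(text: string): string {
    return text.replace(/&quot;/g, '"');
}

// Returns true when the text parses as JSON, false when JSON.parse throws.
function isValidJson(text: string): boolean {
    try {
        JSON.parse(text);
        return true;
    } catch {
        return false;
    }
}

const decodedBody = decodeQuotes(rawBody);

console.log(isValidJson(rawBody));     // the raw body parses
console.log(isValidJson(decodedBody)); // the decoded body does not
```

After decoding, the bare quotes terminate the JSON string early, which is why the parser rejects it.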
Apparently, this was added there to "save memory for highly parallel runs" (source). However, we currently don't get this effect anyway, since the result of _parseHtml() is immediately destructured here, so by the time requestHandler runs, context.body is already a string and not the getter. (You can trivially verify this by putting a breakpoint, debugger;, or console.log() into the getter and checking at what moment, and with what call stack, it is called.)
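The point about destructuring can be shown in isolation. This is only a sketch of the language behavior: the `parsed` object and the call counter are hypothetical and do not mirror the real return value of _parseHtml().

```typescript
let getterCalls = 0;

// Hypothetical stand-in for a parse result that exposes `body` lazily.
const parsed = {
    get body(): string {
        getterCalls += 1;
        // Imagine an expensive serialization happening here.
        return "<html>...</html>";
    },
};

const callsBefore = getterCalls;

// Destructuring reads the property, so the getter runs right here,
// long before any handler would touch the value.
const { body: eagerBody } = parsed;

const callsAfter = getterCalls;

console.log(typeof eagerBody, callsBefore, callsAfter); // string 0 1
```

Once the value is destructured it is a plain string, so any memory saving from lazy evaluation is already lost at that point.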
Also, I don't think there's a way to have $.html() return what we want. When a website responds with ``, content-type: text/html:
First case is current behavior of ctx.body, and imo it's bad, because it breaks HTML. Also, it's really confusing that CheerioCrawlingContext.body and HttpCrawlingContext.body return different values (that's kind of what prompted the original report by @gullmar).
The second case is probably also unjustifiable, since it would change current behavior heavily (websites probably contain a lot of quotes, &s and idk what else cheerio decides to escape).
Therefore, I propose we remove the body getter and instead return the original body buffer via .toString("utf8"), so that we have the same data as HttpCrawler but still keep body as a string to avoid breaking Actors.
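A rough sketch of what the proposal amounts to, assuming Node's Buffer. The `bodyFromBuffer` helper and the sample payload are hypothetical and only illustrate the idea; they are not Crawlee's code.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical raw response bytes, with quotes encoded as &quot; (assumed).
const responseBuffer = Buffer.from(
    '{"name":"Restaurace &quot;Otevreno&quot;"}',
    "utf8",
);

// Sketch of the proposed behavior: convert the original bytes to a string
// directly, without a parse-and-serialize round trip through an HTML parser.
function bodyFromBuffer(buffer: Buffer): string {
    return buffer.toString("utf8");
}

const bodyText = bodyFromBuffer(responseBuffer);

// The entities are untouched, so the text still parses as JSON.
const place = JSON.parse(bodyText) as { name: string };
console.log(place.name);
```

Because nothing rewrites the text, the body stays byte-for-byte equivalent to what the server sent, which is the property the proposal is after.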
It is also ok with me to call this a wontfix, since a website returning JSON with content-type: text/html is just weird.
mtrunkat added labels on Apr 10, 2024: t-tooling, t-console, t-c&c (team covering store and finance matters)
Code sample
No response
Package version
latest
Node.js version
latest
Operating system
No response
Apify platform
I have tested this on the next release
No response
Other context
Link to slack thread