Weird behaviour with HTML entities within XML content #3689

Open
Ponynjaa opened this issue Mar 1, 2024 · 0 comments

When I have this input: <root><table><div>test&nbsp;</div></table></root> and I run this code:

import * as cheerio from 'cheerio';

const input = `<root><table><div>test&nbsp;</div></table></root>`;
const $ = cheerio.load(input, {
	xmlMode: false,
	decodeEntities: false
}, false);
console.log($.xml()); // -> "<root><div>test </div><table/></root>"

the <div> is moved out of the <table> (the output shows it hoisted in front of an empty <table/>), which is behaviour I don't want, so I set xmlMode: true instead:

import * as cheerio from 'cheerio';

const input = `<root><table><div>test&nbsp;</div></table></root>`;
const $ = cheerio.load(input, {
	xmlMode: true,
	decodeEntities: false
}, false);
console.log($.xml()); // -> "<root><table><div>test&nbsp;</div></table></root>"

Now the <div> stays inside the <table>, but &nbsp; is no longer decoded to \u00a0. If I then set decodeEntities: true, the entity gets escaped a second time:

import * as cheerio from 'cheerio';

const input = `<root><table><div>test&nbsp;</div></table></root>`;
const $ = cheerio.load(input, {
	xmlMode: true,
	decodeEntities: true
}, false);
console.log($.xml()); // -> "<root><table><div>test&amp;nbsp;</div></table></root>"
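
A possible explanation, offered as an assumption rather than a confirmed diagnosis: XML predefines only five named entities (amp, lt, gt, quot, apos), so a parser in XML mode may treat &nbsp; as literal text, and the serializer then escapes its ampersand. Under that assumption, one thing that could be tried is rewriting HTML-only named entities to numeric character references before parsing, since numeric references are valid in XML. The sketch below is plain JavaScript; htmlEntitiesToNumeric and the entity table are hypothetical names and only a small illustrative subset, not part of cheerio or htmlparser2.

```javascript
// Hypothetical helper, not part of cheerio, htmlparser2 or dom-serializer.
// Illustrative subset of HTML named entities that XML does not predefine.
const HTML_ONLY_ENTITIES = new Map([
	['nbsp', 0xa0],
	['copy', 0xa9],
	['reg', 0xae],
	['ndash', 0x2013],
	['mdash', 0x2014],
	['hellip', 0x2026],
]);

// The five named entities XML predefines; these are left untouched.
const XML_ENTITIES = new Set(['amp', 'lt', 'gt', 'quot', 'apos']);

// Replace known HTML-only named entities with decimal numeric references.
function htmlEntitiesToNumeric(input) {
	return input.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, (match, name) => {
		if (XML_ENTITIES.has(name)) return match;
		const codePoint = HTML_ONLY_ENTITIES.get(name);
		// Names missing from the table are left as-is rather than guessed.
		return codePoint === undefined ? match : `&#${codePoint};`;
	});
}

const prepared = htmlEntitiesToNumeric(
	'<root><table><div>test&nbsp;</div></table></root>'
);
console.log(prepared); // -> "<root><table><div>test&#160;</div></table></root>"
```

The prepared string would then be passed to cheerio.load with xmlMode: true. Whether the serializer writes the character back as \u00a0 or as a numeric reference depends on its entity-encoding settings, which I have not verified.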

My current workaround is to call htmlparser2 and dom-serializer directly:

import * as htmlparser2 from 'htmlparser2';
import * as domserializer from 'dom-serializer';

const input = `<root><table><div>test&nbsp;</div></table></root>`;
const parsed = htmlparser2.parseDocument(input, {
	xmlMode: false,
	decodeEntities: true
});

const serialized = domserializer.render(parsed, {
	xmlMode: false,
	encodeEntities: false,
	decodeEntities: true
});

console.log(serialized); // -> "<root><table><div>test </div></table></root>"

This behaviour is surprising, and I can't tell where the error originates, but I suspect it comes from the lack of a way to pass options to the serializer.
