Title with unicode doesn't work #90

Open
mrcnski opened this issue Feb 22, 2024 · 6 comments
Labels
bug Something isn't working waiting for feedback

Comments

@mrcnski

mrcnski commented Feb 22, 2024

Thanks for the utility! I think I've found a bug. Setting the title prop of the <SEO> component to a string containing Unicode characters seems to break the generated HTML:

[Screenshot 2024-02-22 at 15 26 40]

@ttmc

ttmc commented Feb 24, 2024

I think this might be because astro-seo currently puts the <meta charset="UTF-8" /> tag after the title tag. I'm going to create a separate issue for that.

@jonasmerlin
Owner

Thank you for reporting this @mrcnski and really good catch @ttmc, we'll discuss possible solutions to this in #91

Just to confirm that this is actually the cause here: do you set the charset tag via astro-seo, @mrcnski?

@jonasmerlin jonasmerlin self-assigned this Feb 25, 2024
@jonasmerlin jonasmerlin added bug Something isn't working waiting for feedback labels Feb 25, 2024
@mrcnski
Author

mrcnski commented Feb 25, 2024

Hey @jonasmerlin, I didn't set charset because I thought that it would default to UTF-8 if not provided. If that's not true, maybe the docs could be clarified?

@ttmc

ttmc commented Feb 25, 2024

@mrcnski While UTF-8 is by far the most common encoding on the web, browsers like Chrome don't simply assume it when a page has no explicit declaration. Determining the encoding is a multi-step process with several fallbacks. Roughly, here's what Chrome does:

  1. Byte Order Mark (BOM): Chrome first checks if the website content starts with a Byte Order Mark, which is a sequence of bytes indicating the specific encoding used. If a UTF-8 BOM is present, the browser assumes UTF-8 encoding.
  2. HTTP Headers: If no BOM is found, Chrome looks for the Content-Type header in the HTTP response. This header can explicitly specify the charset used, and if it mentions UTF-8, that will be used.
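  3. Meta Tag: If neither BOM nor the Content-Type header provides a clear answer, Chrome checks for a <meta charset="utf-8"> tag within the HTML document itself. If present, this explicitly declares UTF-8 as the encoding. The HTML standard requires this declaration to appear within the first 1024 bytes of the document, which is why it belongs at the very top of the head.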
  4. Heuristic Detection: If none of the above methods provide a clear indication, Chrome attempts to "guess" the charset based on heuristics and statistical analysis of the content itself. This involves looking for patterns and similarities with known encodings, but it's not always accurate and can lead to misinterpretations, especially for content containing characters from multiple languages.
  5. Fallback Default: If all attempts to identify the charset fail, Chrome falls back to a default encoding. This default is implementation-defined and can vary across browsers and browser versions. It is typically derived from the user's locale, for example windows-1252 for most Western European locales.
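
The lookup order above can be sketched roughly as follows. This is a simplified model for illustration, not Chrome's actual implementation: the function name `sniff_encoding`, the regular expressions, and the `fallback` parameter are all assumptions, and step 4 (heuristic detection) is deliberately left out.

```python
import re
from typing import Optional

# Byte Order Marks checked in step 1 (simplified to the common cases).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Hypothetical, simplified patterns; real browsers use a proper prescan algorithm.
HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_encoding(body: bytes, content_type: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Return the encoding a browser would plausibly pick for `body`."""
    # 1. A BOM at the very start of the content wins.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 2. Otherwise use the charset parameter of the Content-Type header, if any.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. Otherwise look for a <meta> declaration in the first 1024 bytes only.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Heuristic detection is omitted in this sketch.
    # 5. Fall back to a locale-dependent default (assumed here to be windows-1252).
    return fallback
```

With this model, a page whose head contains only a title and no declaration ends up at the fallback, which matches the situation reported in this issue.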

However, relying on browser guessing and fallback defaults is strongly discouraged for several reasons:

  • Inconsistency: Different browsers and user systems can have different fallback defaults, leading to inconsistent rendering of a website across different platforms.
  • Incorrect Interpretation: Misinterpreting the encoding can lead to garbled text, broken layouts, and potential security vulnerabilities.
  • Unnecessary Re-parsing: If the browser discovers the actual encoding only after it has started parsing with a wrong guess, it may have to restart parsing the document, which wastes resources and can slow down page loading.
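
To make the "garbled text" point concrete, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong encoding. The helper name `decode_as` is made up for this example, and Python's `cp1252` codec stands in for a browser's windows-1252 fallback.

```python
def decode_as(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode `text` as UTF-8, then decode the bytes with a different encoding."""
    raw = text.encode("utf-8")
    # errors="replace" avoids an exception for bytes the wrong codec leaves undefined.
    return raw.decode(wrong_encoding, errors="replace")


print(decode_as("Café"))  # prints "CafÃ©"
```

Pure ASCII text survives unchanged, which is why a missing declaration often goes unnoticed until a title contains a non-ASCII character.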

Therefore, it's essential for website developers to explicitly declare the character encoding using either the Content-Type header or the <meta charset> tag, preferably using UTF-8 due to its widespread adoption and compatibility.

@mrcnski
Author

mrcnski commented Feb 27, 2024

Thanks for the detailed explanation @ttmc. For some reason I thought that astro-seo would set a default of UTF-8 if this field was omitted. I can see why we wouldn't want to set any default values, as they may already be set outside of the component. My mistake, but maybe this part of the README could be clarified slightly?

Set the charset of the document. In almost all cases this should be UTF-8.

@ttmc

ttmc commented Feb 27, 2024

@mrcnski Over on issue #91, one suggestion has been to make the astro-seo integration check if the charset gets set to UTF-8 (by something or someone other than the astro-seo integration), and if it hasn't been set, then the integration will inject the charset declaration at the top of the head.
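
As a rough illustration of that suggestion, the check-then-inject logic could look something like the sketch below. It is not astro-seo's code: the function `ensure_charset` and both patterns are hypothetical, it works on a plain HTML string, and it treats any existing charset declaration as sufficient rather than checking specifically for UTF-8.

```python
import re

# Hypothetical, simplified patterns for this sketch only.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
CHARSET_DECL = re.compile(r"<meta[^>]*?charset\s*=", re.IGNORECASE)


def ensure_charset(html: str) -> str:
    """Insert a UTF-8 charset declaration at the top of <head> if none exists."""
    # Leave the document alone if something already declares a charset.
    if CHARSET_DECL.search(html):
        return html
    head = HEAD_OPEN.search(html)
    if head is None:
        # No <head> element found; nothing sensible to inject into.
        return html
    tag = '<meta charset="UTF-8" />'
    # Place the declaration immediately after the opening <head> tag,
    # so it comes before the title.
    return html[:head.end()] + tag + html[head.end():]
```

Injecting right after the opening head tag keeps the declaration ahead of the title, which is the ordering concern raised earlier in this thread.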

3 participants