Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

canonical-url includes double // and no trailing slash #9524

Open
aronatkins opened this issue Apr 29, 2024 · 11 comments · May be fixed by #9595
Open

canonical-url includes double // and no trailing slash #9524

aronatkins opened this issue Apr 29, 2024 · 11 comments · May be fixed by #9595
Labels
bug Something isn't working websites Issues creating websites

Comments

@aronatkins
Copy link
Contributor

Bug description

Given a Quarto project configuration:

project:
  type: website

website:
  site-url: https://quarto.org/

format:
  html:
    canonical-url: true

The about/index.qmd file receives the following tag:

<link rel="canonical" href="https://quarto.org//about">

This link tag has two problems:

  1. Double slash when concatenating site-url with the page path.
  2. Missing trailing slash (causing an unnecessary redirect).

Steps to reproduce

Create a Quarto project with an index file in a subdirectory. Render that project. Look at the generated HTML.

Expected behavior

<link rel="canonical" href="https://quarto.org/about/">

Actual behavior

<link rel="canonical" href="https://quarto.org//about">

Your environment

No response

Quarto check output

Quarto 1.4.551
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.11: OK
      Dart Sass version 1.69.5: OK
      Deno version 1.37.2: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.4.551
      Path: /Users/aron/quarto/quarto-1.4.551-macos/bin

[✓] Checking tools....................OK
      TinyTeX: (not installed)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Using: Installation From Path
      Path: /Library/TeX/texbin
      Version: 2024

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.12.2
      Path: /opt/homebrew/opt/python@3.12/bin/python3.12
      Jupyter: (None)

      Jupyter is not available in this Python installation.
      Install with python3 -m pip install jupyter

[✓] Checking R installation...........OK
      Version: 4.3.3
      Path: /Library/Frameworks/R.framework/Resources
      LibPaths:
        - /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
      knitr: 1.45
      rmarkdown: 2.26

[✓] Checking Knitr engine render......OK
@aronatkins aronatkins added the bug Something isn't working label Apr 29, 2024
@aronatkins
Copy link
Contributor Author

The top-level index page receives double-slashes, too:

<link rel="canonical" href="https://quarto.org//">

@mcanouil mcanouil added the websites Issues creating websites label Apr 29, 2024
@mcanouil

This comment was marked as outdated.

@mcanouil
Copy link
Collaborator

mcanouil commented May 6, 2024

@mcanouil
Copy link
Collaborator

mcanouil commented May 6, 2024

I've found where the additional / is added.

  • export function synthesizeCitationUrl(
    input: string,
    metadata: Metadata,
    outputFile?: string,
    offset?: string,
    ) {
    const siteMeta = metadata[kWebsite] as Metadata | undefined;
    const baseUrl = siteMeta?.[kSiteUrl] as string;
    if (baseUrl && outputFile && offset) {
    const rootDir = normalizePath(join(dirname(input), offset));
    if (outputFile === "index.html") {
    return `${baseUrl}/${
    pathWithForwardSlashes(relative(rootDir, dirname(input)))
    }`;
    } else {
    const relativePath = relative(
    rootDir,
    join(dirname(input), basename(outputFile)),
    );
    const part = pathWithForwardSlashes(relativePath);
    return `${baseUrl}/${part}`;
    }
    } else {
    // The url is unknown
    return undefined;
    }
    }
export function synthesizeCitationUrl(
  input: string,
  metadata: Metadata,
  outputFile?: string,
  offset?: string,
) {
  const siteMeta = metadata[kWebsite] as Metadata | undefined;
  let baseUrl = siteMeta?.[kSiteUrl] as string;
  baseUrl = baseUrl.replace(/\/$/, "");

  if (baseUrl && outputFile && offset) {
    const rootDir = normalizePath(join(dirname(input), offset));
    if (outputFile === "index.html") {
      return `${baseUrl}/${
        pathWithForwardSlashes(relative(rootDir, dirname(input)))
      }`;
    } else {
      const relativePath = relative(
        rootDir,
        join(dirname(input), basename(outputFile)),
      );
      const part = pathWithForwardSlashes(relativePath);
      return `${baseUrl}/${part}`;
    }
  } else {
    // The url is unknown
    return undefined;
  }
}

Resolves the issue.

I am not aware of the requirement for a trailing / for canonical URLs as implied in the expected behaviour though.

@cscheid
Copy link
Collaborator

cscheid commented May 6, 2024

I am not aware of the requirement for a trailing / for canonical URLs as implied in the expected behaviour though.

We have to be very careful about how to resolve this in the case of index.qmd files, because GET /subdir/ is very different from GET /subdir, and this is a constant source of confusion for users.

I know many people think this is ugly, but I think foo/index.qmd should be resolved to foo/index.html. It's the only safe client-side resolution that doesn't imply additional behavior from the HTTP server (about which Quarto shouldn't assume any additional behavior). There's no guarantee that a web server will respond to foo/ by searching for foo/index.html, and the situation gets even worse when web servers redirect foo to foo/index.html, because the parent directory structure now has changed.

I believe that as of right now, Aron's "expected behavior" is what we expected to do. The actual behavior is completely broken, but I think we should instead fix it to about/index.html instead.

@mcanouil
Copy link
Collaborator

mcanouil commented May 6, 2024

So the function would be: something like below?

export function synthesizeCitationUrl(
  input: string,
  metadata: Metadata,
  outputFile?: string,
  offset?: string,
) {
  const siteMeta = metadata[kWebsite] as Metadata | undefined;
  let baseUrl = siteMeta?.[kSiteUrl] as string;
  baseUrl = baseUrl.replace(/\/$/, "");

  if (baseUrl && outputFile && offset) {
    const rootDir = normalizePath(join(dirname(input), offset));
    if (outputFile === "index.html") {
      const part = pathWithForwardSlashes(relative(rootDir, dirname(input)));
      if (part.length === 0) {
        return `${baseUrl}/`;
      } else {
        return `${baseUrl}/${part}/`;
      }
    } else {
      const relativePath = relative(
        rootDir,
        join(dirname(input), basename(outputFile)),
      );
      const part = pathWithForwardSlashes(relativePath);
      return `${baseUrl}/${part}`;
    }
  } else {
    // The url is unknown
    return undefined;
  }
}

I get the following canonical URLs:
image

@cscheid
Copy link
Collaborator

cscheid commented May 6, 2024

Show me the diffs in a PR?

@mcanouil
Copy link
Collaborator

mcanouil commented May 6, 2024

@aronatkins
Copy link
Contributor Author

I'll also note that I went to configure canonical-url because a Quarto site may include both sub/ and sub/index.html links. Crawlers encounter both URLs and complain about duplicate content.

If Quarto could emit only sub/ links, I would not need a site-wide canonical-url configuration.

@cscheid
Copy link
Collaborator

cscheid commented May 6, 2024

Crawlers encounter both URLs and complain about duplicate content.

Huh. I think that's a problem with the crawler, for similar reasons as I stated above :( Unfortunately, there's no guarantee that sub/ and sub/index.html will be the same resource, and I don't think we should be dictating that.

@aronatkins
Copy link
Contributor Author

I think that's a problem with the crawler, for similar reasons as I stated above

I see your problems with the crawler and raise you:

// fixup index.html links if we aren't on the filesystem
if (window.location.protocol !== "file:") {
const links = window.document.querySelectorAll("a");
for (let i = 0; i < links.length; i++) {
if (links[i].href) {
links[i].dataset.originalHref = links[i].href;
links[i].href = links[i].href.replace(/\/index\.html/, "/");
}
}

PATH/index.html is rewritten in the browser to PATH/.

I don't want to shift the purpose of this issue, which is reporting URL construction problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working websites Issues creating websites
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants