feat: adds urlpattern to ada #381

miguelteixeiraa · 2023-05-09T20:39:51Z

TL;DR

WIP!

I started to implement https://wicg.github.io/urlpattern/
This seems to be a work-in-progress spec, so we may have to improvise in places during the implementation.
I'm using https://github.com/kenchris/urlpattern-polyfill and https://github.com/denoland/rust-urlpattern as references.

I don't know if there is a "right way" to do this, but I generated the Unicode identifier start/part ranges with this script:

"use strict";

const fs = require("fs");

const regexIdentifierStart = /[$_\p{ID_Start}]/u;

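function isValidIdStart(num) {
  // fromCodePoint handles code points above 0xFFFF;
  // fromCharCode would truncate them to 16 bits.
  const c = String.fromCodePoint(num);
  return regexIdentifierStart.test(c);
}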

const validRanges = [];
let start = null;

for (let c = 256; c <= 0x10ffff; c++) {
  const isValid = isValidIdStart(c);
  if (isValid && start === null) {
    start = c;
  } else if (!isValid && start !== null) {
    validRanges.push([start, c - 1]);
    start = null;
  }
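}
// Flush a range that is still open when the loop ends.
if (start !== null) {
  validRanges.push([start, 0x10ffff]);
}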

const writeRangeInFile = (filename, ranges) => {
  let file = "";
  for (const [base, upper] of ranges) {
    file += `\{${base}, ${upper}\},\n`;
  }
  fs.writeFile(filename, file, (err) => {
    // Surface write failures instead of silently ignoring them.
    if (err) throw err;
  });
};

console.log(validRanges.length);

writeRangeInFile("id_start_ranges.txt", validRanges);

Plus something similar for the identifier-part ranges, using the regex /[$_\u200C\u200D\p{ID_Continue}]/u
I took those regexes from https://github.com/kenchris/urlpattern-polyfill/blob/main/src/path-to-regex-modified.ts#LL70C1-L70C50

Then I used the ranges to create bitwise masks in unicode.cpp.
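
To illustrate the idea of turning generated ranges into bit masks, here is a minimal sketch. It is not the code in unicode.cpp: the names code_point_range and code_point_set are hypothetical, and the sketch assumes one bit per code point packed into 64-bit words.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one {start, end} pair from the generated file.
struct code_point_range {
  uint32_t first;
  uint32_t last;  // inclusive
};

// One bit per code point, packed into 64-bit words (assumed layout).
class code_point_set {
 public:
  static constexpr uint32_t max_code_point = 0x10FFFF;

  explicit code_point_set(const std::vector<code_point_range>& ranges)
      : words_((max_code_point + 64) / 64, 0) {
    for (const auto& r : ranges) {
      // Set every bit in the inclusive range, clamped to valid code points.
      for (uint32_t cp = r.first; cp <= r.last && cp <= max_code_point; cp++) {
        words_[cp / 64] |= uint64_t{1} << (cp % 64);
      }
    }
  }

  // Lookup is a single shift and mask once the table is built.
  bool contains(uint32_t cp) const {
    if (cp > max_code_point) return false;
    return ((words_[cp / 64] >> (cp % 64)) & 1) != 0;
  }

 private:
  std::vector<uint64_t> words_;
};
```

A caller would build the set once from the generated ranges and then call contains(code_point) in the tokenizer. The table costs roughly 136 KB in this layout, so a real implementation might prefer a compressed or two-level table.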

Nothing is tested yet.

It will take some time to finish!

ref nodejs/node#40844

anonrig · 2023-05-09T21:07:58Z

include/ada/urlpattern.h

+  bool ignore_case = false;
+};
+
+struct component_result {


ada::component_result seems like a really vague name, since it might be mistaken for ada::url_components.

include/ada/urlpattern.h

src/urlpattern.cpp

include/ada/unicode.h

anonrig · 2023-06-10T14:18:38Z

include/ada/urlpattern.h

+namespace ada::urlpattern {
+struct urlpattern_component_result {
+  std::string_view input;
+  std::unordered_map<std::string_view, std::optional<std::string_view>> groups;


@lemire How does the performance of std::optional here compare with std::variant<std::nullopt_t, std::string_view>?

I don't expect it matters much.

anonrig · 2023-06-10T14:19:41Z

include/ada/urlpattern_base.h

+namespace ada::urlpattern {
+
+struct urlpattern_options {
+  std::string_view delimiter = "";


Suggested change

std::string_view delimiter = "";

std::string_view delimiter{};
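
For context on this suggestion, the sketch below shows what the two initializers produce. Both views are empty and compare equal; the only observable difference is that the value-initialized view has a null data() pointer. The struct and member names are hypothetical, not the PR's code.

```cpp
#include <string_view>

// Hypothetical struct holding both spellings side by side.
struct delimiter_initializers {
  std::string_view from_literal = "";    // points at a static empty string
  std::string_view value_initialized{};  // data() == nullptr, size() == 0
};

// Both members are empty and compare equal as string views.
inline bool initializers_are_equivalent(const delimiter_initializers& d) {
  return d.from_literal.empty() && d.value_initialized.empty() &&
         d.from_literal == d.value_initialized;
}
```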

anonrig · 2023-06-10T14:20:18Z

include/ada/urlpattern_constructor_string_parser.h

+
+  std::u32string_view input;
+  std::vector<token> token_list;
+  size_t component_start = 0;


Suggested change

size_t component_start = 0;

size_t component_start{0};

anonrig · 2023-06-10T14:22:03Z

include/ada/urlpattern_internals.h

+
+namespace ada::urlpattern {
+// https://wicg.github.io/urlpattern/#component
+struct urlpattern_component {


Since this is already under ada::urlpattern namespace, why not call this struct component?

anonrig · 2023-06-10T14:23:57Z

src/urlpattern_canonicalization.cpp

+  std::string final_utf8_url(utf8_size, '\0');
+  ada::idna::utf32_to_utf8(url.data(), url.size(), final_utf8_url.data());
+
+  if (ada::can_parse(final_utf8_url)) {


Can you add a TODO here to optimize the can_parse function?

anonrig · 2023-06-10T14:30:05Z

src/urlpattern_constructor_string_parser.cpp

+ada_really_inline bool constructor_string_parser::is_group_open() {
+  // If parser’s token list[parser’s token index]'s type is "open", then
+  // return true. Else return false.
+  return token_list[token_index].type == TOKEN_TYPE::OPEN;


This is not safe. We should add development asserts to make sure token_index is smaller than token_list length.
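
A minimal sketch of such a guard, assuming the standard assert from <cassert> rather than any project-specific macro. The token and parser types here are simplified stand-ins for the ones in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-ins for the PR's types (assumed shapes).
enum class TOKEN_TYPE { OPEN, CLOSE, CHAR };

struct token {
  TOKEN_TYPE type;
};

struct constructor_string_parser {
  std::vector<token> token_list;
  std::size_t token_index = 0;

  bool is_group_open() const {
    // Development-only check: compiled out when NDEBUG is defined.
    assert(token_index < token_list.size());
    return token_list[token_index].type == TOKEN_TYPE::OPEN;
  }
};
```

With the assert in place, an out-of-range token_index aborts in debug builds instead of reading past the end of the vector.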

anonrig · 2023-06-10T14:31:29Z

src/urlpattern_pattern_parser.cpp

+  while (index < input.size()) {
+    size_t pos = input.find_first_of(U".+*?^${}()[]|/\\)");
+    if (pos == std::string_view::npos) {
+      result = result += input.substr(index, input.size());


This line is weird: `result = result += ...` assigns result back to itself, so a plain `result += ...` would do.
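
For comparison, here is one way the escaping loop could be written. This is a sketch, not the PR's code: the function name escape_regexp_string is an assumption, and the character set is taken from the snippet above. It starts each search at the current index and appends with a plain +=.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Prefixes each regexp-special character with a backslash.
std::u32string escape_regexp_string(std::u32string_view input) {
  constexpr std::u32string_view special = U".+*?^${}()[]|/\\";
  std::u32string result;
  result.reserve(input.size());
  std::size_t index = 0;
  while (index < input.size()) {
    // Search from the current position, not from the start of the input.
    std::size_t pos = input.find_first_of(special, index);
    if (pos == std::u32string_view::npos) {
      // No more special characters: copy the tail and stop.
      result += input.substr(index);
      break;
    }
    // Copy the plain run, then the escaped special character.
    result += input.substr(index, pos - index);
    result += U'\\';
    result += input[pos];
    index = pos + 1;
  }
  return result;
}
```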

anonrig · 2023-06-10T14:34:24Z

src/urlpattern_pattern_parser.cpp

+    // 1. Set type to "full-wildcard".
+    type = PART_TYPE::FULL_WILDCARD;
+    // 2. Set regexp value to the empty string.
+    regexp_value.clear();


We are assigning and later clearing this value, which degrades performance. We should eventually optimize it to reduce unnecessary allocations.

anonrig · 2023-06-10T14:35:43Z

src/urlpattern_pattern_parser.cpp

+  }
+
+  // 3. Set token to the result of running try to consume a token given parser
+  // and "asterisk".


There is a function for this on std::optional: t.value_or("default val")
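
A minimal illustration of std::optional::value_or, which returns the contained value or the supplied fallback. The helper name token_value_or_default is hypothetical and only shows the call shape.

```cpp
#include <optional>
#include <string>

// Returns the contained string, or the fallback when the optional is empty.
std::string token_value_or_default(const std::optional<std::string>& t) {
  return t.value_or("default val");
}
```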

anonrig · 2023-06-10T14:36:32Z

src/urlpattern_pattern_parser.cpp

+
+  // If name token is null and token is null, then set token to the result of
+  // running try to consume a token given parser and "asterisk".
+  if (!name_token.has_value() && !regexp_or_wildcard.has_value()) {


There is a fast path here. If name_token does not have a value, you don't need to try to consume a token, right?

bricss · 2023-12-05T19:40:13Z

Is there any way to crank 🔧 this up for nodejs/node#51060 needs? 🤔

bricss · 2024-01-15T22:45:42Z

Houston, do you read me? 📡

lemire · 2024-01-15T22:53:13Z

@bricss Are you available to help push this forward?

bricss · 2024-01-15T23:00:52Z

Yes, with only one exception, I don't have big/deep experience with C++ coding 🤷‍♂️ atm 🙄

lemire · 2024-01-15T23:21:52Z

@bricss Lack of knowledge of C++ could be a problem.

miguelteixeiraa added 4 commits May 8, 2023 13:41

urlpattern: adds id_start and id_pard lookup table

e466165

urlpattern: adds id_start & id_part + tokenizer sketch

4aa5f55

urlpattern: adds missing eof

842f5b1

merge branch 'main' into urlpattern

dadb665

miguelteixeiraa marked this pull request as draft May 9, 2023 20:41

urlpattern: adds bitset lib

ee41c6f

anonrig reviewed May 9, 2023

anonrig mentioned this pull request May 9, 2023

implement URLPattern nodejs/node#40844

Open

miguelteixeiraa added 20 commits May 10, 2023 08:47

urlpattern: adds comments to unicode.h

df58be4

urlpattern: component_result -> urlpattern_component_result

582a649

urlpattern: wip constructors

0516de4

urlpattern: WIP contructor_string_parser

179a065

urlpattern: introducing canonicalize_protocol

2e80ab3

urlpatter: adds pragma regions

17ee73d

urlpattern: breakdown in multiple files

9ae30c4

urlpattern: minor fixes

95e5c70

urlpattern: fixup to make it compile

b9041fb

Merge branch 'main' into urlpattern

50d6d09

urlpattern: adds missing cassert lib + fixes

5651a1f

urlpattern: fix assert for token type

ca49555

urlpattern: introducing tests for tokenizer

58342b3

urlpattern: update id_start and id_part tables

3f5fac8

urlpattern: WIP fix tokenizer

824725d

urlpattern: update tokenizer

88453a5

urlpattern: updates tokenizer's test

6e14684

urlpattern: make tokenizer pass the tests

81fa16d

urlpattern: WIP compile a component

5016786

urlpattern: WIP pattern parser

8c8942a

anonrig reviewed Jun 10, 2023

Merge branch 'main' into urlpattern

aa8522f
