Add guidelines on returning string offsets & lengths #521

mikekistler · 2024-01-28T20:21:29Z

This PR splits out the update for string offset and length from #517. I also reworked things a bit by moving the explanatory content over to ConsiderationsForServiceDesign.

It looks like my editor also trimmed some trailing whitespace from otherwise unchanged lines.

weidongxu-microsoft · 2024-01-30T06:17:43Z

azure/ConsiderationsForServiceDesign.md

+    "offset": {
+      "utf8": 12,
+      "utf16": 10,
+      "codePoint":  4


nit: we seem to have 2 spaces here, in "codePoint": 4

weidongxu-microsoft · 2024-01-30T06:19:08Z

azure/ConsiderationsForServiceDesign.md

@@ -515,6 +515,61 @@ For example, the client can specify an `If-Match` header with the last ETag valu
 The service processes the update only if the ETag value in the header matches the ETag of the current resource on the server.
 By computing and returning ETags for your resources, you enable clients to avoid using a strategy where the "last write always wins."

+## Returning String Offsets & Lengths (Substrings)
+
+Some Azure services return substring offset & length values within a string. For example, the offset & length within a string to a name, email address, or phone #.


nit: "phone #" seems too informal? Just "phone number"?

heaths

A few suggestions, but otherwise LGTM.

heaths · 2024-02-02T01:21:16Z

azure/ConsiderationsForServiceDesign.md

+| UTF-16 | JavaScript, Java, C# |
+| CodePoint (UTF-32) | Python |
+
+Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding.


Grammar nit:

Suggested change

-Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding.
+Because the service doesn't know in what language a client is written and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding required by its language's internal string encoding.
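For readers following the thread, here is a rough sketch of what "offset & length values for all 3 possible encodings" could look like on the service side. It is only an illustration under stated assumptions: the helper names `encoded_length` and `locate` and the sample string are hypothetical and do not come from the PR. The key names `utf8`, `utf16`, and `codePoint` follow the diff above.

```python
def encoded_length(text: str) -> dict:
    """Return the length of `text` measured in each of the three encodings."""
    return {
        "utf8": len(text.encode("utf-8")),            # bytes
        "utf16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "codePoint": len(text),                       # Python str counts code points
    }


def locate(full_string: str, substring: str) -> dict:
    """Hypothetical helper: offset and length of the first match of `substring`."""
    index = full_string.index(substring)  # code point index in Python
    return {
        # The offset in each encoding is the encoded length of the prefix.
        "offset": encoded_length(full_string[:index]),
        "length": encoded_length(substring),
    }


# Sample with a 2-byte character (U+00E9) and a supplementary-plane emoji (U+1F600).
sample = "h\u00e9llo \U0001F600 world"
print(locate(sample, "world"))
```

For this sample the three offsets all differ (12 bytes, 9 UTF-16 code units, 8 code points), which is why a single "UTF-agnostic" number cannot serve every client.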

heaths · 2024-02-02T01:22:29Z

azure/ConsiderationsForServiceDesign.md

+   name := response.fullString[ response.name.offset.utf8 : response.name.offset.utf8 + response.name.length.utf8]
+```
+
+The service must calculate the offset & length for all 3 encodings and return them because clients find it difficult working with Unicode encodings and how to convert from one encoding to another. In other words, we do this to simplify client development and ensure customer success when isolating a substring.


Should we also mention that it makes pass-through requests easier as well? That was the thing that really won me over. I think the same was true for @JeffreyRichter, IIRC.
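As a companion to the Go line quoted above, the following sketch shows how a client might pick the encoding that matches how it holds the string. The property names `fullString`, `name`, `offset`, and `length` mirror that Go snippet. The response values and the three helper functions are hypothetical and only illustrate the idea.

```python
# Hypothetical response body; the numbers correspond to this sample string only.
response = {
    "fullString": "h\u00e9llo \U0001F600 world",
    "name": {
        "offset": {"utf8": 12, "utf16": 9, "codePoint": 8},
        "length": {"utf8": 5, "utf16": 5, "codePoint": 5},
    },
}


def substring_by_code_point(full_string: str, span: dict) -> str:
    """Python str is indexed by code point, so use the codePoint values."""
    start = span["offset"]["codePoint"]
    return full_string[start:start + span["length"]["codePoint"]]


def substring_by_utf8(raw: bytes, span: dict) -> str:
    """A client holding raw UTF-8 bytes uses the utf8 values instead."""
    start = span["offset"]["utf8"]
    return raw[start:start + span["length"]["utf8"]].decode("utf-8")


def substring_by_utf16(units: bytes, span: dict) -> str:
    """A client holding UTF-16-LE data uses the utf16 values (2 bytes per unit)."""
    start = 2 * span["offset"]["utf16"]
    end = start + 2 * span["length"]["utf16"]
    return units[start:end].decode("utf-16-le")


full = response["fullString"]
span = response["name"]
print(substring_by_code_point(full, span))
print(substring_by_utf8(full.encode("utf-8"), span))
print(substring_by_utf16(full.encode("utf-16-le"), span))
```

All three calls isolate the same substring without the client converting between encodings, which is the simplification the guideline is after.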

heaths · 2024-02-02T01:27:03Z

azure/Guidelines.md

+All string values in JSON are inherently Unicode and UTF-8 encoded, but clients written in a high-level programming language must work with strings in that language's string encoding, which may be UTF-8, UTF-16, or CodePoints (UTF-32).
+When a service response includes a string offset or length value, it should specify these values in all 3 encodings to simplify client development and ensure customer success when isolating a substring.
+
+<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response.


I think we should document here in this doc the exact format we want e.g., {"utf8": 2, "utf16": 1, "codePoint":1}. We document formats for LROs, pageables, and errors. How you expanded on that in "Considerations" is perfect, but you should also link to that section e.g.,

Suggested change

-<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response.
+<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response using the schema below. See [considerations](ConsiderationsForServiceDesign.md#{actual-stub-here}) for more information.
+```json
+{
+  "length": {
+    "utf8": 2,
+    "utf16": 1,
+    "codePoint": 1
+  }
+}
+```
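One way a test suite might check that a response value follows the suggested schema is sketched below. This is an assumption-laden illustration, not part of the guidelines: `check_encoded_value` is a hypothetical helper, and the ordering check relies only on the general Unicode fact that a string never has more code points than UTF-16 code units, nor more UTF-16 code units than UTF-8 bytes.

```python
REQUIRED_KEYS = ("utf8", "utf16", "codePoint")


def check_encoded_value(value: dict) -> None:
    """Hypothetical check: raise ValueError if `value` does not fit the schema."""
    missing = [key for key in REQUIRED_KEYS if key not in value]
    if missing:
        raise ValueError(f"missing encodings: {missing}")
    for key in REQUIRED_KEYS:
        number = value[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(number, bool) or not isinstance(number, int) or number < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    # Sanity check that holds for any string prefix or substring.
    if not value["codePoint"] <= value["utf16"] <= value["utf8"]:
        raise ValueError("expected codePoint <= utf16 <= utf8")


# The length object from the suggested schema passes the check.
check_encoded_value({"utf8": 2, "utf16": 1, "codePoint": 1})
```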

Add guidelines on returning string offsets & lengths

1731170

mikekistler requested review from johanste and JeffreyRichter January 28, 2024 20:21

JeffreyRichter approved these changes Jan 29, 2024

MushMal approved these changes Jan 29, 2024

weidongxu-microsoft reviewed Jan 30, 2024

heaths approved these changes Feb 2, 2024

Address PR review feedback

03c0af7

mikekistler merged commit e465ea1 into microsoft:vNext Feb 4, 2024
1 check passed

mikekistler deleted the string-index branch February 4, 2024 12:42
