Ecoji encoding standard

Ecoji maps input data onto 1024 Unicode emojis (plus 5 padding emojis). Ten bits are needed to index 1024 values. Ecoji reads 5 bytes at a time because that is 40 bits, which is a multiple of 10. For each 5 bytes read, 4 emojis are output. When fewer than 5 bytes are available, special padding emojis are output.
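
The grouping described above can be sketched in Go as follows. This is a minimal illustration, not the code from encode.go: the names splitOrdinals and joinOrdinals are hypothetical, and the sketch assumes the 40 bits are consumed most significant bit first. Mapping ordinals to emojis and handling padding are left out.

```go
package main

import "fmt"

// splitOrdinals treats 5 bytes as one 40-bit big-endian number and returns
// its four 10-bit groups, most significant group first.
// Assumption: bits are consumed most significant first.
func splitOrdinals(b [5]byte) [4]uint16 {
	var bits uint64
	for _, v := range b {
		bits = bits<<8 | uint64(v)
	}
	var out [4]uint16
	for i := range out {
		shift := uint(30 - 10*i)
		out[i] = uint16(bits>>shift) & 0x3FF
	}
	return out
}

// joinOrdinals is the inverse: it packs four 10-bit ordinals back into 5 bytes.
func joinOrdinals(o [4]uint16) [5]byte {
	var bits uint64
	for _, v := range o {
		bits = bits<<10 | uint64(v&0x3FF)
	}
	var out [5]byte
	for i := range out {
		out[i] = byte(bits >> uint(32-8*i))
	}
	return out
}

func main() {
	in := [5]byte{'h', 'e', 'l', 'l', 'o'}
	ords := splitOrdinals(in)
	// Print the four ordinals and whether the round trip restores the input.
	fmt.Println(ords, joinOrdinals(ords) == in)
}
```

Each ordinal falls in the range 0 to 1023, so it can index directly into a 1024-entry emoji array.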

To see exactly how Ecoji works, look at encode.go and decode.go. Both use mapping.go, which contains the following.

  • An array with the 1024 Ecoji V1 emojis, used for encoding.
  • An array with the 1024 Ecoji V2 emojis, used for encoding.
  • A map from emojis to information about the emojis, used for decoding. This map is used at decoding time to quickly determine an emoji's 10-bit ordinal, version, and padding type.
  • Two arrays for the Ecoji V1 and V2 padding emojis.
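
As a rough sketch of how such a decoding map could be derived from the arrays, consider the Go code below. It is not the generated content of mapping.go: the type and function names (versionSet, emojiInfo, buildDecodeMap) are hypothetical, the padding type is simplified to an index, and the tiny arrays in main are stand-in data whose positions are not real Ecoji ordinals.

```go
package main

import "fmt"

// versionSet is a bit set recording which Ecoji versions use an emoji.
type versionSet uint8

const (
	inV1 versionSet = 1 << iota
	inV2
)

// emojiInfo is a hypothetical record stored per emoji for decoding.
type emojiInfo struct {
	ordinal  int        // 10-bit ordinal, or -1 for a padding emoji
	versions versionSet // versions in which the emoji appears
	padType  int        // 0 for data emojis, k for the k-th padding emoji
}

// buildDecodeMap indexes the encoding arrays by emoji. For data emojis the
// array position is the ordinal.
func buildDecodeMap(v1, v2, padV1, padV2 []rune) map[rune]emojiInfo {
	m := make(map[rune]emojiInfo)
	add := func(set []rune, v versionSet, isPad bool) {
		for i, r := range set {
			info, seen := m[r]
			if !seen {
				if isPad {
					info = emojiInfo{ordinal: -1, padType: i + 1}
				} else {
					info = emojiInfo{ordinal: i}
				}
			}
			info.versions |= v
			m[r] = info
		}
	}
	add(v1, inV1, false)
	add(v2, inV2, false)
	add(padV1, inV1, true)
	add(padV2, inV2, true)
	return m
}

func main() {
	// Stand-in data only: these positions are not the real Ecoji ordinals.
	v1 := []rune{'\U0001F455', '\U0001F456'}
	v2 := []rune{'\U0001F455', '\U0001F45A'}
	pad := []rune{'\u2615'}
	m := buildDecodeMap(v1, v2, pad, nil)
	fmt.Printf("%+v\n", m['\U0001F455'])
}
```

A single map lookup per emoji then yields everything the decoder needs, which is why the map is worth generating ahead of time.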

The source code in mapping.go is automatically generated by gen.go, which reads emojisV1.txt and emojisV2.txt. These two files contain the Unicode code points for the emojis used by Ecoji V1 and V2. The line number of an emoji in those files corresponds to its 10-bit ordinal for encoding. For more information see emojis.md.

To test new implementations of Ecoji, look at the test script.

New lines

When encoding data, new lines can optionally be inserted to wrap the output. An encoder should normally emit the Unix new line character \n when wrapping, but may emit \r\n if desired. A decoder ignores all new lines, including Windows new lines, so it should skip both \n and \r.

The Go implementation only emits \n, but accepts \n or \r.
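
One way a decoder could discard these characters before looking up emojis is shown below. This is only a sketch using the standard strings.Map function; the helper name stripNewlines is hypothetical and is not taken from decode.go, which may filter characters differently.

```go
package main

import (
	"fmt"
	"strings"
)

// stripNewlines removes every \n and \r so that Unix and Windows line
// endings are both ignored. strings.Map drops a rune when the mapping
// function returns a negative value.
func stripNewlines(s string) string {
	return strings.Map(func(r rune) rune {
		if r == '\n' || r == '\r' {
			return -1
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripNewlines("ab\r\ncd\n"))
}
```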

Versions

Ecoji currently has two versions, each with its own set of emojis. Ecoji V2 was designed to be backwards compatible with Ecoji V1. When decoding data, it is always possible to distinguish between V1 and V2 and decode the data correctly. This was achieved by following the principle that if an emoji is used in both Ecoji V1 and V2, then it must have the same 10-bit ordinal in both. If a program was properly written against only the Ecoji V1 standard and is given Ecoji V2 data, then it should either decode it successfully or throw an error. If data encoded with Ecoji V2 happens to use only emojis that Ecoji V1 also uses, then an Ecoji V1 program can decode it because the same 10-bit ordinals were used. If an Ecoji V1 program sees emojis used only by Ecoji V2, then it should throw an error because those emojis are unrecognized.
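
The shared-ordinal principle is easy to check mechanically. The Go sketch below shows one possible check over two emoji arrays; the function name sharedOrdinalsAgree is hypothetical and the arrays in main are stand-in data, not the real V1 and V2 sets.

```go
package main

import "fmt"

// sharedOrdinalsAgree returns an error if any emoji appears in both arrays
// at different positions. The position in each array is taken to be the
// emoji's 10-bit ordinal.
func sharedOrdinalsAgree(v1, v2 []rune) error {
	pos := make(map[rune]int, len(v1))
	for i, r := range v1 {
		pos[r] = i
	}
	for j, r := range v2 {
		if i, ok := pos[r]; ok && i != j {
			return fmt.Errorf("emoji %q has ordinal %d in V1 but %d in V2", r, i, j)
		}
	}
	return nil
}

func main() {
	// Stand-in data: the middle emoji was replaced in "V2", the others kept.
	v1 := []rune{'\U0001F455', '\U0001F456', '\U0001F459'}
	v2 := []rune{'\U0001F455', '\U0001F45A', '\U0001F459'}
	fmt.Println(sharedOrdinalsAgree(v1, v2))
}
```

When this check passes, any emoji a V1 decoder recognizes decodes to the same ordinal regardless of which version produced it.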

Ecoji V2 was created to use a better set of emojis for encoding. In addition, Ecoji V2 relaxes the padding requirements. Ecoji V1 always padded to 4 emojis when fewer than 5 bytes remained at the end of the input data. Ecoji V2 can emit fewer than 4 emojis in that case.

Ecoji V2 uses emojis from the Unicode 14 emoji standard, while Ecoji V1 uses emojis from the Unicode 11 emoji standard. Ecoji V1 includes some emojis that are not fully qualified. Ecoji V2 uses only fully qualified emojis.

Ecoji encoding efficiency

Many have asked how Ecoji compares to base64. The short answer is that a string encoded with Ecoji will have more bytes, but fewer visible characters, than the same string encoded with base64. With Ecoji, each visible character represents 10 bits, but each character is multi-byte. With base64, each character represents 6 bits and is one byte. The following table shows the size of a SHA-256 hash encoded in different ways.

| Encoding | Bytes | Characters |
|----------|-------|------------|
| none     | 32    | N/A        |
| hex      | 64    | 64         |
| base64   | 44    | 44         |
| ecoji    | 112   | 28         |

Sorting Ecoji-Encoded Data

Ecoji V1 supported sorting encoded data. However, V2 does not. It was not possible to support both sorting and backwards compatibility, so sorting was dropped as a feature in V2.

Below is an example showing that data encoded with Ecoji V1 sorts the same as the input data.

$ echo -n a | ecoji > /tmp/test.ecoji
$ echo -n ab | ecoji >> /tmp/test.ecoji
$ echo -n abc | ecoji >> /tmp/test.ecoji
$ echo -n abcd | ecoji >> /tmp/test.ecoji
$ echo -n ac | ecoji >> /tmp/test.ecoji
$ echo -n b | ecoji >> /tmp/test.ecoji
$ echo -n ba | ecoji >> /tmp/test.ecoji
$ export LC_ALL=C
$ sort /tmp/test.ecoji > /tmp/test-sorted.ecoji
$ diff /tmp/test.ecoji /tmp/test-sorted.ecoji
$ cat /tmp/test-sorted.ecoji
👕☕☕☕
👖📲☕☕
👖📸🎈☕
👖📸🎦⚜
👖🔃☕☕
👙☕☕☕
👚📢☕☕