Talk:Base62

Latest comment: 2 years ago by Johnuniq in topic References

Improve

Hello, I requested an undelete, but I still have to improve the article. Does anyone have suggestions on text or articles? --FlippyFlink (talk) 10:01, 13 August 2020 (UTC)

References

The references used are:

I have examined these after help from Bruce1ee at WP:RX. They are not readily available and are somewhat esoteric. Also, there are few other references. Therefore I have provided a summary below. This is more detail than would be suitable for the article but it might be useful when deciding what to do with this article. Johnuniq (talk) 01:14, 13 November 2021 (UTC)

IEEE summary

Base64 is used to encode binary data into a printable representation. However, data encoded with base64 is inflated to 133% of its original size. Also, different base64 implementations use different alternate characters (MIME uses '+' and '/', but file name and URL encoding often use '-' and '_').
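
To make the two base64 points above concrete, here is a small Python check using the standard library's base64 module. The payload and sample bytes are arbitrary values chosen for illustration; they do not come from the paper.

```python
import base64

# Size inflation: every 3 input bytes become 4 output characters.
payload = bytes(range(256)) * 3            # 768 arbitrary bytes
encoded = base64.b64encode(payload)
ratio = len(encoded) / len(payload)        # 4/3, i.e. about 133%

# Alphabet differences: the same two bytes under the standard (MIME-style)
# alphabet and under the URL/file-name safe alphabet.
sample = bytes([0xFB, 0xFF])               # chosen so indices 62 and 63 occur
standard = base64.b64encode(sample)        # uses '+' and '/'
url_safe = base64.urlsafe_b64encode(sample)  # uses '-' and '_'

print(len(payload), len(encoded), round(ratio, 3))
print(standard, url_safe)
```

The two printed encodings differ only in the characters for indices 62 and 63, which is the incompatibility a purely alphanumeric alphabet avoids.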

This paper presents a lossless base-62 compressed encoding method that uses only alphanumeric characters to represent the original data and which achieves a good compression level with typical text while being faster than more advanced compression systems. Applications include URL encoding, Internet Mail Transfer, and embedding binary objects in XML files. The compression is separate from the proposed base62 encoding.

Base62 encoding uses the 62 characters A–Za–z0–9 (character A represents the value 0, B represents 1, and so on, up to 9 which represents 61).

The input consists of a stream of bytes which is transformed into a stream of bits with the most-significant bits from each byte processed first.

Each step of encoding results in a single base62 8-bit ASCII character. The encoder keeps any bits left unprocessed by the previous step and appends the next input bits to them. If there are no further input bits, zeroes are prepended to the remaining unprocessed bits to fill the final group.

Each step encodes either the next five bits or the next six bits.
If the next five bits are 11110 they are encoded by the 61st character (8).
If the next five bits are 11111 they are encoded by the 62nd character (9).
Otherwise, the next six bits are encoded by the corresponding character which will be from the 1st to the 60th (A to 7).

Alphabet (0 → A, 10 → K, 20 → U, 52 → 0; the five-bit groups 11110 and 11111 map to 8 and 9 respectively):

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
 01234567890123456789012345678901234567890123456789012345678901
 0         1         2         3         4         5         6
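
As a rough illustration of the steps and alphabet above, here is a minimal Python sketch of the encoder as summarized here. It is not the paper's reference code: the names ALPHABET and base62_encode are made up for this sketch, and the handling of a short final group (treating it as left-padded with zeroes) is one reading of the summary. Decoding is not covered.

```python
# Alphabet from the summary: A-Z, a-z, 0-9 (index 0 is 'A', index 61 is '9').
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
)


def base62_encode(data: bytes) -> str:
    """Encode bytes using the 5-or-6-bit scheme described in the summary."""
    # Turn the bytes into a bit string, most significant bit of each byte first.
    bits = "".join(f"{byte:08b}" for byte in data)
    out = []
    pos = 0
    while pos < len(bits):
        five = bits[pos:pos + 5]
        if five == "11110":
            out.append(ALPHABET[60])   # 61st character, '8'
            pos += 5
        elif five == "11111":
            out.append(ALPHABET[61])   # 62nd character, '9'
            pos += 5
        else:
            # Take six bits; a shorter final group is read as if zeroes
            # were prepended (assumption based on the summary's wording).
            chunk = bits[pos:pos + 6]
            out.append(ALPHABET[int(chunk, 2)])
            pos += 6
    return "".join(out)


print(base62_encode(bytes([0x53, 0xFE, 0x92])))
```

For the bytes 0x53, 0xFE, 0x92 this prints U98kC. A six-bit group can never reach 60 or above here, because any group starting with 11110 or 11111 is caught by the five-bit checks first.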

Example (0x means hex):

Byte stream   0x53        0xFE        0x92
Bit stream    0101 0011   1111  1110  1001 0010
              0101 00|11  111|1 1110| 1001 00|10
Index           20   |  61   |  60  |  36    | 2
Encoded         U    |  9    |  8   |  k     | C

That is, hex 53, FE, 92 would be encoded to base62 U98kC. Johnuniq (talk) 01:14, 13 November 2021 (UTC)

Wiley summary

This paper presents a proposal called UTF-62 for a procedure to encode/decode program identifiers (names of variables). The aim is to allow a program in, for example, C to use variables written in a local language such as Chinese. A preprocessor transforms the source program with its Chinese identifiers into valid C code by base62 encoding non-English names and prepending "_0x" to each encoded name. That ensures they are valid C identifiers and shows that the name has been encoded. If necessary, compiler errors can be translated by decoding the names to the original language. Since variable names are generally short, there is no attempt to compress the encoded name.

An example C function is given to perform the encoding. Input to the function consists of 32-bit UCS-4 codepoints representing Unicode characters. Output is a string of ASCII bytes, each 8-bit byte consisting of a base62 character, namely 0–9A–Za–z (character 0 represents the value 0, A represents 10, and so on, up to z which represents 61). Encoding preserves the lexicographic sorting order of UCS-4.

Each input 32-bit codepoint is encoded as either 3 bytes or 6 bytes.
Codepoints <= 0xffff result in 3 bytes with most significant bit = 0.
Other codepoints give 6 bytes with most significant bits = 1000.
That allows the decoder to reverse the variable-length encoding.

Alphabet (0 → 0, 10 → A, 20 → K, 61 → z):

"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
 01234567890123456789012345678901234567890123456789012345678901
 0         1         2         3         4         5         6

Example: The Chinese variable "加總" ("total") consists of two UCS-4 32-bit codepoints, U+52A0 and U+7E3D. Since each is <= 0xFFFF, each is encoded as three bytes. The result is six UTF-62 characters, 5VA8PF, which would appear as _0x5VA8PF after prepending _0x.

Each 32-bit codepoint is encoded separately by repeatedly dividing by 62 and keeping the remainders. In the order they are produced, the remainders are 10, 31, 5 for the first codepoint and 15, 25, 8 for the second:

5×62² + 31×62 + 10 = 21152 = 0x52A0
8×62² + 25×62 + 15 = 32317 = 0x7E3D

The remainders are stored most significant digit first (5, 31, 10 and 8, 25, 15) and are translated to UTF-62 characters:

5 → 5
31 → V
10 → A
8 → 8
25 → P
15 → F
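
The division steps above can be sketched in Python for the three-character case only. This is an illustration, not the C function from the paper: the names DIGITS, utf62_encode_bmp and mangle are invented here, and codepoints above 0xFFFF are rejected because the six-byte form is not described in enough detail in this summary.

```python
# Alphabet from the summary: 0-9, A-Z, a-z (index 0 is '0', index 61 is 'z').
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def utf62_encode_bmp(name: str) -> str:
    """Encode each codepoint <= 0xFFFF as three base-62 digits."""
    out = []
    for ch in name:
        cp = ord(ch)
        if cp > 0xFFFF:
            # The six-byte form is deliberately left out of this sketch.
            raise ValueError("codepoint above 0xFFFF not handled here")
        digits = []
        for _ in range(3):
            cp, rem = divmod(cp, 62)      # remainders come out least significant first
            digits.append(DIGITS[rem])
        out.append("".join(reversed(digits)))  # store most significant first
    return "".join(out)


def mangle(name: str) -> str:
    """Prepend the _0x marker to the encoded name (hypothetical helper)."""
    return "_0x" + utf62_encode_bmp(name)


print(mangle("\u52a0\u7e3d"))
```

With the two codepoints U+52A0 and U+7E3D this prints _0x5VA8PF, matching the example above.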

Johnuniq (talk) 01:14, 13 November 2021 (UTC)