Some Thoughts on Unicode in Perl 6

All of the recent work on Rakudo, getting it to run on the JVM, and the creation of and work on MoarVM as another backend for NQP (and thus Rakudo), have created a sense that we’re really moving forward in Perl 6-land. Maybe Christmas will come this year, or perhaps in 2014?

In any case, with Rakudo now on a more mature platform, able to support the big things (such as threads), it seems as though Rakudo is making big leaps towards being fully Perl 6.

Except that actually cannot happen, what with 8 unwritten synopses:

  • S15 — Unicode
  • S18 — Compiling
  • S20 — Introspection*
  • S23 — Security
  • S25 — Portable Perl
  • S27 — Perl Culture*
  • S30 — Standard Perl Library
  • S33 — Diagnostic Messages

And that’s not counting all the other synopses that need a serious rewrite (the higher the spec number, the more likely it is to need repair). With the momentum the community currently has, perhaps it’s time we used some of it to fill out the rest of the specification?

If you haven’t guessed already, the spec I’ve been thinking about is S15. Below is a presentation of some notes on the subject I put up a couple of days ago.

Consider this humble Devanagari syllable:

नि (U+0928 U+093F)

Next to it you see the two code points that make it up. I shall now present a table showing how the various UTFs encode this syllable:

Encoding    Encoded form
UTF-8       E0 A4 A8 E0 A4 BF
UTF-16BE    0928 093F
UTF-32BE    00000928 0000093F
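
To check the UTF-8 row for yourself, here is a one-liner (assuming Str.encode as specced in S32):

my $syllable = "\x[0928]\x[093F]"; # नि
say $syllable.encode('utf8').list».fmt('%02X');
# (E0 A4 A8 E0 A4 BF)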

When it comes to Unicode there are a number of ways to count characters, depending on your view of the situation. Here’s a quick list, from lowest to highest view:

  • Bytes are a simple count of the number of bytes that make up the given Unicode text.
  • Code units are the smallest units of information in an encoding. The number after UTF indicates the number of bits in a code unit (so the code unit of a UTF-8 text is the byte, 8 bits).
  • Code points are the numbers assigned to each “character” in Unicode. This is independent of encoding. The Devanagari syllable above has two code points.
  • Graphemes are what normally constitute a character to the reader’s eyes, regardless of how many code points make it up. Both ä (U+0061 U+0308, a letter plus a combining diaeresis) and ä (U+00E4, precomposed) are just one grapheme, even though the first one is made up of two code points (sketched just below).
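
Here is that difference in code (assuming .chars and .codes behave as the spec intends; Rakudo as of this writing does not count graphemes yet):

my $combined    = "a\x[0308]"; # a plus COMBINING DIAERESIS
my $precomposed = "\x[E4]";    # precomposed ä

say $combined.codes;    # 2: two code points...
say $combined.chars;    # 1: ...but just one grapheme
say $precomposed.codes; # 1
say $precomposed.chars; # 1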

For our Devanagari syllable above, the counting based on viewpoint and encoding is outlined here:

Count by      UTF-8   UTF-16   UTF-32
bytes           6       4        8
code units      6       2        2
code points     2       2        2
graphemes       1       1        1

As you’ll notice, the counting of codepoints and graphemes is not affected by the text’s encoding. (Also note that the endianness of UTF-16 and UTF-32 doesn’t matter when it comes to counting.)

Perl 6 has some ideas about Unicode already set, such as counting by graphemes by default (so a string containing just our Devanagari syllable above has a length of 1, which is usually what you mean).

What I’m putting here today are some of my ideas on what the methods and pragmas involved should look like. I’ve yet to think about Str and Buf specifically (questions such as their relationship with each other and whether more-derived types of Str/Buf (e.g. StrGraphemes) are necessary or useful). There’s hardly enough here for a decent S15, but hopefully enough for a starting point.

Pragmas — Changing Defaults

Perl 6 handles Unicode in a couple of default ways:

  • Encodes Unicode strings in UTF-8
  • Views strings in terms of graphemes unless another view is requested

These defaults should pervade any time you’re dealing with text, whether it’s a literal string, user input, or non-binary file I/O. You can always override them for a particular case, such as with .codes to count a string by code points, or open("file", :enc<UTF-16BE>) to open a text file you know is encoded as UTF-16BE.
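
To make those per-case overrides concrete, here is a short sketch (the :enc spelling is the one I am assuming above, and file.txt is imaginary):

my $str = "नि";
say $str.codes; # 2, no matter what the default view is

my $fh = open("file.txt", :enc<UTF-16BE>); # a UTF-16BE text file
say $fh.get; # lines still come back as ordinary Strs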

But if you’re dealing with a lot of UTF-32LE encoded files, or you need to do a lot of string operations at the code unit level, then pragmas are the way to change these defaults. Here are the pragmas as I imagine spelling them:

use utf8;          # use UTF-8 encoding
use utf16 :be/:le; # use UTF-16[BE|LE] encoding
use utf32 :be/:le; # use UTF-32[BE|LE] encoding

use graphemes;  # count by graphemes
use codepoints; # count by code points   
use codeunits;  # count by code units
use bytes;      # count by bytes
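
None of these pragmas exist yet; here is how I imagine a lexically scoped counting pragma reading in practice (hypothetical code):

{
    use codepoints;
    say "a\x[0308]".chars; # 2: .chars now counts code points
}
{
    use graphemes;
    say "a\x[0308]".chars; # 1: back to the default view
}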

There is also one other pragma I’ve thought up, although its usefulness is very questionable:

use normalization :NFC/:NFD/:any
# compose/decompose/leave be all characters
# in strings at time of creation.
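
As a sketch (again, this is hypothetical), the pragma would normalize strings at the moment they are created:

use normalization :NFC;
my $s = "a\x[0308]"; # written decomposed in the source...
say $s.codes;        # 1: ...but stored precomposed as U+00E4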

Methods for Str

These methods either count characters in a certain way, (de)compose them, or change the encoding of the Str. Here’s the list, some of them already specced:

.chars  # count by the current default view (default .graphs)
.graphs # count by graphemes
.codes  # count by code points
.units  # count by code units
.bytes  # count by bytes

.compose   # convert string to NFC form
.decompose # convert string to NFD form

.convert # change the encoding of the Str

There’s likely a host of other functionality that Strs need, but these are the ones that have come to mind.
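
Applied to our Devanagari syllable under the UTF-8 default, the counts come out like this (.chars and .codes are already specced; .graphs, .units, and .bytes are my proposals, so this is a sketch):

my $syllable = "\x[0928]\x[093F]"; # नि
say $syllable.chars;  # 1: the default view (graphemes)
say $syllable.graphs; # 1
say $syllable.codes;  # 2
say $syllable.units;  # 6: in UTF-8 the code unit is the byte
say $syllable.bytes;  # 6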

Closing Thoughts

I don’t think Perl 6 needs a separate type for a single character (the Char and AnyChar found in the untouched corners of the spec). It feels like an unnecessary addition; it’s hard for me to see a time when a one-character string needs to be treated differently from a multi-character string.

The spec also mentions, in a couple of places, the idea of counting characters with adverbs such as :ArabicChars in addition to :graphs. I’d like to see examples of scripts where a “grapheme” is not always the same as a complete “character” before going along with such language-specific counting mechanisms in core.

I also feel that Buf needs a better explanation. Thinking about it now, I suppose I need some convincing that Buf should be considered the cousin of Str. I think it’s useful to have a type of array that’s designed to work with binary files (something I feel Buf is perfect for), but I have doubts about treating it as a numbers-based view of Str.

To put it another way, I’ve always used Buf, and thus see it, as a way of interacting with binary files and their data. I have a hard time believing such an object should be tasked with text-based knowledge, such as whether it holds a valid Unicode string, when that may not be the case.

(Although derived versions of Buf, such as Utf8, could perfectly well place text-based restrictions on their data. But leave Buf out of it. :) )
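
A sketch of that division of labour: Buf hands you raw bytes, and decoding into a Str is an explicit, deliberate step (Buf.new and .decode are in the current designs):

my $buf = Buf.new(0xE0, 0xA4, 0xA8, 0xE0, 0xA4, 0xBF); # bytes read from somewhere binary
my $str = $buf.decode('utf8'); # the point where we assert “this is UTF-8 text”
say $str; # नि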

Finally, I hope that we can soon get a decent S15 written, and then maybe also finish the rest of the spec. How ’bout it?

* To be fair, there are two drafts, one for S20 and one for S27, available from the front page of the HTML specs: the A20 draft (yes, an apocalypse draft) and the S27 draft. The A20/S20 draft might be worth a look, and worth combining with what jnthn’s debugger does. The S27 draft, in my opinion, should be ignored without a second thought.

Maybe there should be an S34 for 6model, making it nine unwritten.


5 Responses to Some Thoughts on Unicode in Perl 6

  1. skids says:

    All the pragma names are easily identifiable as having to do with unicode, except for “use bytes” which would just be a mystery to me if I hadn’t seen and remembered it in the spec. Str.bytes, on the other hand, is less mysterious because it has context. Although I guess if someone knew nothing at all about unicode, “codepoint” and “codeunit” might be overloaded someplace. I’d suggest finding a good base word to attach adverbs to e.g. “use unicode :bytes”

    As far as Buf goes, if you are using it for anything other than ferrying opaque data, you are probably working with a complicated binary data structure that has embedded bits of it in various text formats, e.g. a deeply nested TLV mess like TIFF or something with a lot of stricture in it like the various certificate standards that can be pretty specific about what encodings are allowed for what fields. The ability to nimbly switch between encodings for any arbitrary sub-buf once you know its start index and length is what will be needed here. In other words, it’s the text embedded in the material Buf will be handling that necessitates a tight coupling.
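
    For instance, switching encodings on a sub-range might look like this (an illustrative sketch only; .subbuf and .decode are in the current designs, and the record layout here is made up):

    my $record = Buf.new(0x00, 0x01, 0xE0, 0xA4, 0xA8, 0xE0, 0xA4, 0xBF);
    # bytes 0-1: a made-up tag; bytes 2-7: a UTF-8 text field
    my $tag  = $record[0] +< 8 +| $record[1];
    my $text = $record.subbuf(2, 6).decode('utf8');
    say $tag;  # 1
    say $text; # नि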

  2. > use utf8; # use UTF-8 encoding
    > use utf16 :be/:le; # use UTF-16[BE|LE] encoding
    > use utf32 :be/:le; # use UTF-32[BE|LE] encoding

    In Perl 5, the utf8 pragma declares that your source code is UTF-8 and is unrelated to I/O. It sounds like you’re proposing that the Perl 6 utf8 pragma should work like Perl 5 “use open qw( :encoding(UTF-8) :std )”. That certainly would be more in line with what most folks expect of the Perl 5 utf8 pragma before reading the docs, although very incompatible so I’m worried about the added confusion. I truly hope that Perl 6 doesn’t allow arbitrary encodings of source code and instead just standardizes across the board with UTF-8. The worst is Perl 5 where you can have lexically-scoped source encodings!

    > use codeunits; # count by code units

    > .units # count by code units

    Do you see any use for this? I’ve never seen or heard of someone wanting to work on the code unit level unless they’re performing low-level encoding conversions. I think this blog post may be the first place I’ve seen it mentioned for use by programmers outside of implementing an encoding or a conversion algorithm. It’s certainly not a question that I see brought up on Stack Overflow, Perl Monks, etc.

    > use normalization :NFC/:NFD/:any

    I would love to see a way to default to a normalization form, but it would be important to have the ability to specify different normalization forms for input and output. I almost always want NFC for output, but I may want the option for NFD, NFKD, or NFKC for input. Composition or decomposition based on my regex/grammar needs, and canonical or compatibility based on my comparison needs.

    > .bytes # count by bytes

    Are you proposing that this be added back to Str? If so, a Str would need to keep track of its encoding like a Buf does. It sounds like you may be against Bufs entirely though for textual data using character encodings.

    > .compose # convert string to NFC form
    > .decompose # convert string to NFD form

    I hope there will be simple options for compatibility decomposition for those who use NFKD or NFKC. When I last looked, the spec had .nfd, .nfc, .nfkd, and .nfkc. It also had .normalize with various options but I’m not a big fan.

    > I don’t think Perl 6 needs a separate type for a single character

    I agree. Although it’s worth mentioning that a nice benefit could be attributes for Unicode properties like $char.script and $char.numeric-value, but that might be overkill!

    > I’d like to see examples of scripts where a “grapheme” is not always the same as a complete “character” before going along with such language-specific counting mechanisms in core.

    Search for “tailored grapheme clusters” in UAX #29. It provides details and examples for languages like Slovak with “ch” digraph and Devanagari with “क्षि” (kshi), plus custom tailoring such as a sequence with a letter modifier like “kʷ”.

