All of the recent work on Rakudo, getting it to run on the JVM, and the creation of and work on MoarVM as another backend for NQP (and thus Rakudo), has created a sense that we’re really moving forward in Perl 6-land. Maybe Christmas will come this year, or perhaps 2014?
In any case, with Rakudo now on a more mature platform where the big things (such as threads) can be implemented, it seems as though Rakudo is making big leaps towards being fully Perl 6.
Except that actually cannot happen, what with 8 unwritten synopses:
- S15 — Unicode
- S18 — Compiling
- S20 — Introspection*
- S23 — Security
- S25 — Portable Perl
- S27 — Perl Culture*
- S30 — Standard Perl Library
- S33 — Diagnostic Messages
And that’s not counting all the other synopses that need a serious rewrite (the higher the spec number, the more likely it’s in need of repair). With the momentum currently going forward in the community, perhaps it’s time we use some of that to fill out the rest of the specification?
If you haven’t guessed already, the spec I’ve been thinking about is S15. Below is a presentation of some notes on the subject I put up a couple of days ago.
Consider this humble Devanagari syllable:
नि (U+0928 U+093F)
Next to it you see the two codepoints that make it up. I shall now present a table on how the various UTFs encode this syllable:
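The byte sequences for each encoding can be reproduced with a short sketch. It is written in Python rather than Perl 6, since the Perl 6 side of this is still being specced; the names `hex_bytes` and `encoding_table` are made up for this illustration.

```python
# The syllable from the text: U+0928 followed by U+093F.
SYLLABLE = "\u0928\u093f"

# Encoding names as Python's codec registry spells them (no BOM is added
# when the byte order is given explicitly).
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def encoding_table(text: str) -> dict:
    """Map each encoding name to the hex dump of `text` in that encoding."""
    return {enc: hex_bytes(text, enc) for enc in ENCODINGS}


if __name__ == "__main__":
    for enc, dump in encoding_table(SYLLABLE).items():
        print(f"{enc.upper():<10} {dump}")
```

For this syllable, UTF-8 needs three bytes per code point (E0 A4 A8 and E0 A4 BF), UTF-16 needs one 16-bit unit per code point, and UTF-32 needs one 32-bit unit per code point.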
When it comes to Unicode there are a number of ways to count characters, depending on your view of the situation. Here’s a quick list, from lowest to highest view:
- Bytes are a simple count of the number of bytes that make up the given Unicode text.
- Code units are the smallest units of information in an encoding. The number after UTF indicates the number of bits in a code unit (so the code unit of a UTF-8 text is the byte, 8 bits).
- Code points are the numbers assigned to each “character” in Unicode. This is independent of encoding. The Devanagari syllable above has two code points.
- Graphemes are what normally constitute a character to the reader’s eyes, regardless of how many code points make it up. Both ä and ä are just one grapheme, even though the first one is made up of two code points.
For our Devanagari syllable above, the counting based on viewpoint and encoding is outlined here:
As you’ll notice, the counting of codepoints and graphemes is not affected by the text’s encoding. (Also note that the endianness of UTF-16 and UTF-32 doesn’t matter when it comes to counting.)
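The four views can be computed for the syllable with a small sketch, again in Python as a stand-in. The grapheme count here is only an approximation: it treats every code point outside the Mark categories (Mn, Mc, Me) as the start of a new grapheme, which is enough for this syllable but is not the full Unicode segmentation algorithm. The helper names are invented for the example.

```python
import unicodedata

SYLLABLE = "\u0928\u093f"

# Size of one code unit, in bytes, for each encoding family.
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def count_graphemes_approx(text: str) -> int:
    """Approximate grapheme count: code points that are not marks (M*)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def counts(text: str, encoding: str) -> dict:
    """Count `text` by bytes, code units, code points, and graphemes."""
    data = text.encode(encoding)
    return {
        "bytes": len(data),
        "code units": len(data) // CODE_UNIT_BYTES[encoding],
        "code points": len(text),  # a Python str is a sequence of code points
        "graphemes": count_graphemes_approx(text),
    }


if __name__ == "__main__":
    for enc in CODE_UNIT_BYTES:
        print(enc, counts(SYLLABLE, enc))
```

Only the first two rows of each result depend on the encoding: 6 bytes and 6 code units in UTF-8, 4 bytes and 2 code units in UTF-16, 8 bytes and 2 code units in UTF-32. The syllable is always 2 code points and 1 grapheme.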
Perl 6 has some ideas about Unicode already set, such as counting by graphemes by default (which counts a string containing just our Devanagari syllable above as 1 long, which is what you usually mean).
What I’m putting here today are some of my ideas on what the methods and pragmas involved should look like. I’ve yet to think about Str and Buf specifically (questions such as their relationship with each other, and whether more-derived types of Str (such as StrGraphemes) are necessary or useful). There’s hardly enough here for a decent S15, but hopefully enough for a starting point.
Pragmas — Changing Defaults
Perl 6 handles Unicode in a couple of default ways:
- Encodes Unicode strings in UTF-8
- Views strings in terms of graphemes unless another view is requested
These defaults should pervade any time you’re dealing with text, whether it’s a literal string, user input, or non-binary file I/O. You can always change these, such as .codes to count the string by codepoints, or open("file", :enc<UTF-16BE>) to open a text file you know is encoded as UTF-16BE.
But if you’re dealing with a lot of UTF-32LE encoded files, or you need to do a lot of string operations at the code unit level, then pragmas are the way to change these defaults. Here are the pragmas as I imagine spelling them:
    use utf8;            # use UTF-8 encoding
    use utf16 :be/:le;   # use UTF-16[BE|LE] encoding
    use utf32 :be/:le;   # use UTF-32[BE|LE] encoding
    use graphemes;       # count by graphemes
    use codepoints;      # count by code points
    use codeunits;       # count by code units
    use bytes;           # count by bytes
There is also one other pragma I’ve thought up, although its usefulness is very questionable:
    use normalization :NFC/:NFD/:any   # compose/decompose/leave be all characters
                                       # in strings at time of creation.
Methods for Str
These methods either count characters in a certain way, or (de)compose them, or change the encoding of the Str. Here’s the list, some of these already specced:
    .chars       # count by the current default view (default .graphs)
    .graphs      # count by graphemes
    .codes       # count by code points
    .units       # count by code units
    .bytes       # count by bytes
    .compose     # convert string to NFC form
    .decompose   # convert string to NFD form
    .convert     # change the encoding of the Str
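To make the last three methods concrete, here is a rough Python analogue. It is a sketch of the behaviour I have in mind, not the Perl 6 API itself; the function names simply mirror the proposed method names, and `convert` here works on encoded bytes because Python strings carry no encoding of their own.

```python
import unicodedata


def compose(text: str) -> str:
    """Analogue of the proposed .compose: normalize to NFC."""
    return unicodedata.normalize("NFC", text)


def decompose(text: str) -> str:
    """Analogue of the proposed .decompose: normalize to NFD."""
    return unicodedata.normalize("NFD", text)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Analogue of the proposed .convert: re-encode bytes from one encoding to another."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    decomposed = "a\u0308"              # "a" plus COMBINING DIAERESIS
    composed = compose(decomposed)      # single code point U+00E4
    print(len(decomposed), len(composed))

    syllable = "\u0928\u093f"
    utf8 = syllable.encode("utf-8")
    print(convert(utf8, "utf-8", "utf-16-be").hex())
```

Note that composing does not always shrink a string: the Devanagari syllable has no precomposed form, so it stays at two code points under both NFC and NFD.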
There’s likely a host of other functionality that Strs need, but these are the ones that have come to mind.
I don’t think Perl 6 needs a separate type for a single character (the AnyChar found in the untouched corners of the spec). It feels like an unnecessary addition; it’s hard for me to see a time where a one-character string needs to be treated differently from a multichar string.
Also, a couple of times in the spec there is the idea of counting characters with adverbs such as :ArabicChars in addition to :graphs. I’d like to see examples of scripts where a “grapheme” is not always the same as a complete “character” before going along with such language-specific counting mechanisms in core.
I also feel that Buf needs better explanation. I’m thinking about it now, and I suppose I need some convincing that we need to consider Buf the cousin of Str. I think it’s useful to have a type of array that’s designed to work with binary files (something I feel Buf is perfect for), but I have doubts about treating it like a numbers-based look at Str.
To put it another way, I’ve always used, and thus see, Buf as a way of interacting with binary files and their data. I have a hard time believing such an object should be tasked with text-based knowledge, such as whether it’s a valid Unicode string, when that may not be the case.
(Although derived versions of Buf, such as Utf8, could perfectly well place text-based restrictions on their data. But leave Buf out of it. :) )
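The split I’m arguing for can be sketched in Python, where plain `bytes` plays the role of Buf and a validating subclass plays the role of Utf8. The class name `Utf8Buf` is hypothetical and exists only for this illustration; nothing here is taken from the spec.

```python
class Utf8Buf(bytes):
    """Hypothetical byte buffer that only accepts well-formed UTF-8."""

    def __new__(cls, data: bytes = b""):
        # Decoding is used purely as a validity check; malformed input
        # raises UnicodeDecodeError and no object is created.
        bytes(data).decode("utf-8")
        return super().__new__(cls, data)


if __name__ == "__main__":
    # Plain bytes know nothing about text: the first bytes of a JPEG are fine.
    raw = bytes([0xFF, 0xD8, 0xFF, 0xE0])
    print(len(raw))

    # The derived type accepts valid UTF-8...
    text = Utf8Buf("\u0928\u093f".encode("utf-8"))
    print(len(text))

    # ...and rejects the same binary data the plain buffer accepted.
    try:
        Utf8Buf(raw)
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

The base type stays a dumb container of numbers, and only the derived type carries the text-based restriction.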
Finally, I hope that we can soon get a decent S15 written, and then maybe also finish the rest of the spec. How ’bout it?
* To be fair, there are two drafts, one for S20 and one for S27. They’re available from the front page of the HTML specs: the A20 draft (yes, an apocalypse draft) and the S27 draft. The A20/S20 draft might be worth a look and combining with what jnthn’s debugger does. The S27 draft, in my opinion, should be ignored without a second thought.
Maybe there should be an S34 for 6model, making it nine unwritten.