The magic of exponential growth (or, fun with Planck times)

A while back I was thinking about how one would implement a time-oriented system for keeping notes. When thinking about the differences between points and ranges in time, I naturally came to the conclusion that a second technically covers a range of time as well.

Then, because I’m me, instead of saying “oh, milliseconds would do nicely as a ‘point’ in time”, I went straight to Planck times.

The question then, of course, is: how much space do you need to store the number of Planck times that have elapsed since the start of the universe?1

So, a Planck time is 5.39106e-44 ± .00032e-44 seconds long, and inversely a second contains 1.85492e43 ± .00011e43 Planck times.2

Additionally, we know the age of the universe to be 4.354e17 ± .012e17 seconds, or .4354 ± .0012 Es.

So, to store the number of elapsed Planck times, we need enough bits to hold 4.354e17 ± .012e17 s × 1.85492e43 ± .00011e43 tp/s = 8.07632168e60 ± .02227e60 tp. Plus plenty of room to grow, of course.

So, that’s our goal: roughly 8e60 Planck times. Let’s begin with some sizes you’ll get on most systems when programming in C or something like it:

No. of bits | Maximum number of Planck times tp             | Number of seconds s                            | 10 * log10(s)
8           | 0xFF = 255                                    | 1.3747203e-41 ± .0000816e-41                   | -409
16          | 0xFFFF = 65535                                | 3.533031171e-39 ± .000209712e-39               | -385
32          | 0xFFFF_FFFF = 4294967295                      | 2.31544263853827e-34 ± .00013743895344e-34     | -336
64          | 0xFFFF_FFFF_FFFF_FFFF = 18446744073709551615  | 9.94475041060126e-25 ± .000590295810358706e-25 | -240

None of those integers comes close to covering a lot of time. Only the 64-bit number doesn’t look (too) small when written in yoctoseconds: ≈.99 ys.

The last column (rounded to the nearest integer, as you might’ve suspected) exists for this handy graphical chart of the beginning of the universe. With eight bits, we’re almost halfway through the Grand Unification Epoch3, while 16 bits gets us farther into it. With 32 bits we finally get to experience the next phase, the beginning portion of the Electroweak Epoch (having also made it most of the way through the Inflationary Epoch). At 64 bits, we’re halfway through that Electroweak Epoch.

OK, what if we used two 64-bit numbers to represent a 128-bit integer?

128 | ≈3.4e38 (a 39-digit number, in the duodecillions) | 1.83448265701279e-05 ± .0001088903574147e-05 | -47

All of a sudden, we jump from just under 1 ys to just under 20 µs. On the handy graphical chart, we finish up the Electroweak Epoch, skip quarks, and land right at the start of the Hadron Epoch.

That’s nice and all, but still not even one second. How about a 256-bit integer?

256 | ≈1.1e77 (78 digits!) | 6.24242100603726e33 ± .000370534685559412e33 | 338

That graphical chart will not help us now. We could write 6.24242100603726e33 s as 6242421006.03726 Ys, but it still manages to look like a huge number. It’s decillions of seconds, or hundreds of quattuorvigintillions of Planck times. That’s 14.3 quadrillion times the estimated age of the universe. I could keep going, but I think you get the picture.

The reason I wrote this post is exponential growth. You’ve probably heard that old story about the rice and the chessboard, or something like it. It does a good job of demonstrating how exponential growth works.

Until I came across this question and tried to solve it, I had no deep understanding of exponential growth, despite that story and all the rest. It surprised me that doubling the number of bits from 128 to 256 would give me sufficient bits to talk about events here and now, billions of years after the universe started. I went from a couple dozen microseconds to the age of the universe 14 quadrillion times over, and only knowing I could blame exponential growth kept me from distrusting my results any longer.

So I thought I should share this, and put out there another example of how quickly exponential growth gets out of hand. Maybe it’s just because I discovered this on my own, but I think there’s nothing quite like this question to help you understand that kind of growth.

And if I ever do write a note-keeping program, you can rest assured that your timing will only ever be limited by general relativity, because it takes only 32 bytes for me to give you that :) .

Maybe in a follow-up post I can post a table of numbers of bits that aren’t just powers of two, and a graph as well. Sounds like something fun to look at, at least.

(By the way, you need only 203 bits at minimum, as that’ll get you 1.6 times the age of the universe. Since bit-level control in programming is a great bother, a minimum of 26 bytes will get you 50 times the age. Though by that point four 64-bit numbers (or eight 32-bit ones) would be easier to manage, I should think :) .)

1While choosing some sort of epoch would make the numbers smaller for things relevant to us, if you’re dealing in Planck times, chances are you’ll want to talk about the beginning of the universe no matter what epoch we count from. This way also lets us avoid throwing away a bit on indicating the number’s sign.

2You’ll forgive me if I got the error calculations wrong in this post; I don’t usually mess with them, and in fact only did so here for completeness’ sake. For those who know these things better: I used the simpler formulas that assume we’re playing with standard deviations.

3Just to point this out: it takes just 255 Planck times to get roughly halfway through the era of the universe in which the electronuclear force was a thing. Only 255.

Posted in Cantina

Buffers Aren’t Strings

So, the issue of this post has recently come up once again, because this:

my $buffer =, 66, 67);
my $string = "ABC";
say $buffer eq $string;

infinitely recurses in Rakudo. Why? It’s because both Buf and Str do Stringy, and when eq is given disparate types, it calls .Stringy on both of them, which returns a Buf for a Buf and a Str for a Str.

Str.Stringy being a Str is normal and expected, but Buf.Stringy is the problem. If Buf didn’t do Stringy, it would be converted into a Stringy object that isn’t itself (like 4.Stringy, which is why "4" eq 4 works).

This is indicative of what I think is a huge problem in Perl 6: Bufs should not be considered Stringy at all. Since the last time this discussion came up didn’t go so well, I thought I’d put up a blog post on my thoughts, to avert the problems with trying to convey the same information on IRC.

So, Perl 6 regards strings as a high-level sequence of characters. Unlike other programming languages, you’re not required to pay attention to how strings are actually stored, or encoded, to manipulate them as you would expect. Strings in Perl 6 don’t know their storage at all, so if you do in fact need to manipulate the bytes making up its storage, you have to .encode the string to a buffer, and .decode that buffer when you want a string again.

Now, I can’t say for sure why Buf does Stringy in the first place; it’s the only thing in Perl 6 I know of where the implicit definition of the word “string” is much more general than the text-based definition we’re familiar with. What I can say though, is what I find wrong with this:

Textual data is only a subset of what buffers can handle. Buffers in Perl 6 are used to handle binary data, for example reading a binary file. This is the kind of thing buffers are designed to handle. Some of that data could be text, but that’s not all it could be. So why inherit a role that handles only some of the data you receive? Rats don’t inherit Int to handle numbers whose denominators are 1, after all.

Important to note here is that while text data is a proper subset of binary data, the Stringy role that deals with text data isn’t similarly related to the Buf role that deals with binary data. There may be some overlap, but neither fits inside the other. This brings us to the larger issue…

Strings and buffers aren’t the same. The match method for strings doesn’t make too much sense for buffers. Going the other direction, the bitwise AND operator for buffers makes no sense for strings, which don’t know their bit patterns in the first place.

However, because buffers and strings are currently linked as they are, both buffers and strings need to support (in some fashion) operations that are truly only meant for the other. This is I think the biggest and most substantive problem. Buffers and strings aren’t similar. There is no good way to relate strings and buffers without getting a clunky mess.

The best evidence for this is S03’s coverage of the buffer bitwise operators. Except for the shift operators1, every single one mentions coercion of string types to some buffer type, and then says coercion probably indicates a design error. The design error is trying to say that buffers are string-like.

These issues can be fixed by simply not saying Buf does Stringy. The Stringy role is the basis for all high-level string types; the Buf role is the basis for all low-level buffer types. They do separate things and have separate purposes. Creating this link between them serves no purpose other than to cause possible design errors and issues with infinite recursion.

This leads to a particular problem though: those bitwise ops. The ~ character signifies string-like stuff in Perl 6, which (as I’ve established) buffers aren’t. This necessitates a new symbol. The problem is, looking at my ordinary keyboard, the only ASCII symbol that doesn’t yet mean something somewhere in Perl 6 is the backtick. Sadly, I don’t think many people will enjoy `+ and `> for their bitwise ops :) , so we’ll need to go past ASCII, and come up with a Texas variant too. Some possible ideas I’ve come across so far:

€& €| €^ €> €<    (E&) (E|) (E^) (E>) (E<)

Flimsily based on the theme set by $ and ¢ --- $calar, ¢apture, and €xposed
binary data, of course. (parens in the Texas version like set ops', to avoid
thinking E is a metaop)

⅋& ⅋| ⅋^ ⅋> ⅋<    (&&) (&|) (&^) (&>) (&<)

Flimsily based on the fact that ⅋ looks cool. (parens in Texas version to
prevent conflict with &&)

⋈& ⋈| ⋈^ ⋈> ⋈<    ><& ><| ><^ ><> ><<

Bowties are cool.

Additionally, there’s the question of what kinds of methods and operators on Buf we should see, to which the answer is simple: array-like things, rather than string-like things. Bufs should be seen as a kind of list, really. (This means postcircumfix:<[]> instead of .subbuf, .push instead of infix:<~>, etc.)

Finally, just to clear up this potential point: utf8, utf16, and all the other Unicode encoding scheme2 Blobs shouldn’t do Unicodey. This is because the point of those blob types, to enforce an encoding scheme, isn’t handled by Unicodey (a high-level string-like role), and the only other stuff Unicodey offers is for string-based stuff, not buffer-based stuff.3

I realize that this isn’t the end of the discussion (we’ve got a buffer symbol to decide, after all :P). However, I don’t think I’ll ever be convinced that Buf does Stringy is right; they are just too distinct for this association to be useful, and they are distinct enough for this association to be harmful.4

I think separating the two would lead to better things, for Buf especially. For instance, I have my suspicions that Perl 6’s version of pack and unpack will be heavily centered on Buf.5 :)

Perl 6 roles are usually adjectives, not nouns. Shouldn’t Buf be Buffy then?

1The buffer bitwise shift operators have no descriptions in S03 in the first place, and in any case are implied to handle strings much like the other buffer bitwise ops.
2Yes, scheme, not form. I’d like to see utf16le, utf16be, utf32le, and utf32be be added as types. My experience writing S15 tells me that specifying endianness with :be and :le adverbs is a poor reimplementation of the type system :) . utf16 and utf32 would be kept as BOM-using (but not -requiring) variants, as they are encoding schemes too.
3The fact that the utf buffers are guaranteed to be holding textual data would suggest it’s ok for them to do Unicodey, and thus also Stringy. However, if you need to be doing string-like operations on your data, might I suggest our lovely collection of Unicode string types :D !
4If you think otherwise, shouldn’t Array does Stringy too? Practically the same thing :) .
5Especially when you consider that the only functionality of pack/unpack Perl 6 is lacking is the ability to easily interchange complex data with low-level APIs, and thus it’s the only thing pack/unpack in Perl 6 needs to do.

Posted in Think Tank

I Just Love (♥!) C++

Let’s say you want to keep a bunch of byte values around in a std::vector<uint8_t> until you get the chance to write them to a file. So, here comes the first piece of data:

std::vector<uint8_t> foo;

Anyone who knows their C++ knows that won’t work; I’m merely demonstrating what I want to do at this point: put three bytes at the end of the vector, which happen to make up a magic string in the final file. So, how about insert? It puts multiple values in a vector, right?

foo.insert(foo.end(), std::vector((uint8_t*)"foo"));

Nah, insert only takes iterators. Since I don’t want to define roughly 15–20 variables to insert all the values I need, I’ll let a lambda take care of it for me.

auto pushmany = [&](uint8_t * bytes) {
    foo.insert(foo.end(), std::begin(bytes), std::end(bytes));
};

But of course not:

foo.cpp:42:42: error: no matching function for call to ‘begin(uint8_t*&)’

Ooooook then, how about I just loop through the array?

auto pushmany = [&](uint8_t * bytes) {
    for (auto & i : bytes) {
        foo.push_back(i);
    }
};

foo.cpp:42:42: error: no matching function for call to ‘begin(uint8_t*&)’

Since so far I’m stuck on a string, how’s about I just use a char* array here?

auto pushmany = [&](char * bytes) {
    for (auto & i : bytes) {
        foo.push_back(i);
    }
};

Survey says…

foo.cpp:42:42: error: invalid range expression of type 'char *'; no viable 'begin' function available

(I’ve switched from g++ to clang++ at this point to have more readable walls of errors.)

How about passing a std::vector to the lambda? Let me know how you manage to transform a string literal into a std::vector<uint8_t>; I’d love to know.

So, new approach:

foo = {(uint8_t*)"foo", (uint8_t*)"bar", ...}

The answer:

note: candidate function not viable: no known conversion from 'uint8_t *' (aka 'unsigned char *') to 'const value_type'
      (aka 'const unsigned char') for 1st argument; dereference the argument with *


foo = {*(uint8_t*)"foo", ...}

And it compiles! It works! Or… I think. Maybe. Y’see, this finally, finally compiles, but the output file from all this was far shorter than it should’ve been1, so at this point my patience ran out and I git reset --hard.

This was me trying to make some code of mine more C++ (in addition to factoring it out into its own function), avoiding the use of char arrays when possible (frustratingly, fstream binary reads make this unavoidable to a point). Here’s the kicker: beforehand, I didn’t bother collecting it all into a container, because it was all inlined, so the equivalent line I was trying to duplicate there was this:

std::ofstream thefile(...);
thefile.write("foo", 3);

So you see, at least parts of the standard library are capable of storing multiple units of something into a class at once. But pity be to him who dares desire the same of his std::vector.

I realize I could’ve just declared a char * variable and then done an insert using std::begin and std::end (even though I’m sure those weren’t working, somehow). I could’ve looked up the numbers of each character in the string and typed out a bunch of push_back statements. I realize I probably made a stupid error in this (somewhat abridged) journey just now that made me think something that would’ve worked didn’t.

The point is I shouldn’t have to do any of this. This has always been a sticking point with C++ for me; oftentimes things are frustratingly absent from the standard library, or are incomprehensibly difficult to accomplish. I used to feel like this all the time when I started using C++, and recently I thought I had grown past that — that I was comfortable with the fact that C++ makes some things more verbose, and that it wasn’t too much of a bother.

Then something like this happens.

Dear C++, adding more than one element to a list (be it a vector or deque or …) is a simple thing. It should be easy to accomplish. Why can’t push_back accept a same-typed vector? At least then my frustration would be focused on how I couldn’t convert "foo" into a std::vector<uint8_t> (which wasn’t my primary issue during this, so I’m sure the solution is simple, and that I simply didn’t spend enough time on the problem to find it).

C++, here’s how Perl 6 does this2:

my @a;
@a.push("foo".encode.list);

1And I just now realized why this probably occurred. I was really frustrated by this point, I had no interest in solving the problem further. So small wonder I didn’t see it sooner.
2The .encode.list instead of .ords is just to replicate C++’s not-handling-Unicode default.

Posted in Think Tank | 2 Comments

About Those Slangs…

I like the idea of slangs. They let you modify the grammar of Perl 6 (or maybe you’d prefer Q, or perhaps Regex?). This means you could easily switch into a more pythonesque style of programming just by importing a module, such as use Slang::Python.

However, not only are there issues with how slangs currently work (or are at least spec’d to), but they are also, at least I think, unnecessary.

The current problems:

  • They augment by default: the intended use of the slang keyword makes it so that you get augment-like behavior without actually using the augment keyword, which means you globally affect the meaning of, say, Perl6::Grammar as soon as the slang statement is interpreted.
  • No explicit way of including actions: the slang keyword is essentially another way of writing a bunch of grammar rules. What the mechanism doesn’t come with is any way to include actions, which those of you writing Perl 6 grammars are very used to by this point. To be fair, inline code blocks can do the exact same thing, but this lack of an explicit mechanism is indicative of a general forgetfulness of the importance of actions to a grammar :) .

Those things are fixable, but there’s a larger issue at play: modifying any grammar after the fact is hard, and for something like Perl 6 it’s daunting. There is no standard set of rule names for an implementation of the grammar, and how to implement the actions of those rules is much harder to standardize (because not everyone will use QAST blocks in their implementation). So this would be an implementation-dependent endeavor, both in terms of supporting multiple implementations, and in hoping said implementations don’t break their grammar/action definitions on you.

So, because of how hard it is in general to modify a grammar, I feel that the primary purpose of slangs should be to introduce a new sublanguage, since a new grammar is easy to do :) . I’ve said as much before.

However, I’ve recently realized something: the slang keyword offers no benefits over grammars and actions, at least not in its current form.

With everything at my disposal but slangs, including being able to interact with things like Perl6::Grammar directly, how would I implement a new sublang? Here’s an idea:


grammar Skylang::Grammar {
    regex TOP { ... }
}

class Skylang::Actions {
    method TOP($/) { ... }
}

augment grammar Perl6::Grammar {
    rule statement_control:sym<SKYLANG> {
        <sym> '☃' ~ '☄' $<srctext>=(<-[☄]>+)
    }
}

augment class Perl6::Actions {
    method statement_control:sym<SKYLANG>($/) {
        make Skylang::Grammar.parse(~$<srctext>, :actions(Skylang::Actions)).ast;
    }
}
(Yes, this would be implementation-dependent too (e.g., on rakudo the ast from Skylang would need to have a bunch of QAST blocks); I never said that was a problem unique to slangs :P)

Plain ol’ modifying the language would involve just the augments. Additionally, depending on what macros end up doing, the above augmentations could be replaced with a single macro declaration.

My point here is that I think Perl 6 already has what you need to modify the very grammar of the language itself. If we can work out what slangs (and those $~ variables) could be that isn’t just a synonym for grammar, then I’d be all for it. However, I can’t presently think of what slang would do differently. The only thing that comes to mind for me is a way of better linking a grammar and its actions together (though that would benefit grammar too, and you can already do it by redefining the parse method (and maybe its friends) in a grammar anyway).

The more interesting question is how to modify the parsing of Perl 6 in an implementation-independent fashion. The grammar side can be helped by standardizing the rules of the grammar, essentially Perl 6’s readable version of a BNF grammar definition in other language specifications. Whether or not the grammar should be standardized at all is another matter though :) .

The actions side, an independent AST specifier, might be far more tricky. But if I understand things correctly, quasi does that for us already.

So I don’t know if we can fashion slang into a far more distinct (and hence useful) keyword than its friend grammar, or if we don’t need it after all, but it’s certainly an interesting thought.

Of course, this is all complicated by the fact that “we must be fairly certain what we want, and we aren’t yet :)”. There’s so little specification of slangs that the point of this post (“slang is useless”) is in all likelihood just plain wrong. Allowing the user to rewrite the language they’re using with the language they’re using is a hard thing to do, and it’s no wonder that all efforts have been placed on everything else so far. But it has led to at least me thinking slang is unnecessary, and if I’m to be proven wrong, it needs to be done soon :P .

Maybe we need a “metalanguage”, like the “metamodel”… would that just be NQP?…

Posted in Think Tank | 1 Comment

Perl 6 and CPAN? Well…

So, just earlier today the issue of putting Perl 6 modules (and other assorted things) on CPAN came up. I feel that it’s pertinent to put all of my concerns with this now, instead of waiting until people are in the middle of actually implementing this.

Why not just put the modules on CPAN and be done with it?

Sure! Let’s go ahead and right now upload some of the more well-known Perl 6 modules, like File::Find, Shell::Command, and JSON::Tiny.


Yeah, that’s not happening. Perl 6 and Perl 5 are incompatible languages, at least enough so that sharing one universe of module names is absurd.

Why not just prefix Perl 6 modules with Perl6:: ?

You mean Perl 6 module-writers do this for each and every one of their modules? No.

You mean CPAN puts a fake Perl6:: in front of modules for organizational purposes? Alright. Sure hope no existing CPAN modules use Perl6:: as an actual namespace.


We could implement all the workarounds we want, but really it would be best for everyone’s sanity if we just maintained separate Perl 5 and Perl 6 worlds.

Other issues

OK, so why not just separate the two worlds on the CPAN servers? Fine, but there are other issues that then come up in the process.

PAUSE needs new scanning tools

Namely, PAUSE currently checks a tarball’s .pm files for the packages it provides, which clearly won’t work with Perl 6. A script that analyzes an S11-compliant metadata file would suffice here.

CPAN needs more metadata, and to be more like typical package managers

Have your own, non-CPAN bug tracker? Have a repo that contributors can, uh, contribute to? CPAN’s current solution is to “check the documentation of the module”. I believe this is unacceptable, especially considering how atypically (wrt CPAN) Perl 6 does module distribution. CPAN needs to have “source code” and “bug tracker” links that point to the right place.

In fact, I’d prefer it if CPAN were more like various package manager sites for various Linux distros. That is, more than a place designed to give free tarball hosting space to Perl developers. It should at the very least provide a standardized, metadata-based external link to some sort of homepage.

This brings up a more general issue: S11 and related documents are designed to specify a full package management system, something much closer to those OS package managers. Granted, I am not at all familiar with Perl 5, much less CPAN, but it just feels like a bare-bones “only what’s needed to install modules easily” kind of thing, which Perl 6 goes beyond. CPAN was built around how Perl 5 does packages; how Perl 6 does packages is designed around what package managers typically do.

The most prominent difference between how Perl 5 works with packages and how Perl 6 works with packages is in authority and versions. CPAN and PAUSE are responsible for handling versions; Perl 5 does not handle this. Additionally, module names are owned by one (or more) people, specified by the same infrastructure.

In Perl 6, the version and authority are part of the package itself. It makes little sense to place restrictions on what names you can use, or to have an “upload tarballs only once” policy that requires versioned tarballs.

The versioning shouldn’t be too hard to fix; most tarballs tend to be versioned anyway, but with version info in the module, instead of near it, PAUSE can’t rely on versioning of tarballs anymore, at least not for Perl 6.

The author part is harder: since anyone can create a module with an existing name, so long as they aren’t the same author, this destroys the idea of various people “owning” a particular module name. Where on CPAN I have to explicitly request the ability to update Shell::Command from the right sources, in Perl 6 I can just make a module with that name that holds the updates (a.k.a. “forking”). I imagine this isn’t easily fixed unless CPAN/PAUSE6 are effectively totally separate from the Perl 5 versions.


So, with the need to maintain separate worlds for Perl 5 and Perl 6, and to significantly alter CPAN itself (esp. the interfaces) for the kinds of things Perl 6 is designed for, the question arises: why don’t panda and the ecosystem work well enough already? Sure, CPAN offers a nice place to host tarballs, but aside from that, I think the existing infrastructure Perl 6 has works. From my view, this would be nothing more than a name change, one of questionable value.

Just to be clear, I’m not totally opposed to a move to CPAN (in fact it would likely give the Perl 6 crowd some much-needed structure in their module distributions). I just have some serious misgivings about what this move would entail, and on some level why putting it all under the CPAN name is better than just putting it under a different name, especially if the two languages would be so separate.

However, I would love to be convinced otherwise, that CPAN would be awesome for Perl 6. This is simply the opinion of someone who’s used cpan all of once or twice for the odd Perl 5 script that needs to be run, and has been able to get along in Perl 6 just fine without CPAN so far.

Additionally, because I believe in lighting candles when it’s dark, I’ll do my best over the next few days to design a mockup of my idea of a Perl 6 package manager, to better illustrate why I’m not sold on CPAN as it is. (Yes, this will most likely revive one certain idea, if maybe not quite in name :P)

Also, mostly as a matter of principle, I refuse to use PAUSE until I’m not forced to give out my full real name. I just don’t see the point of that, and I’m not very liberal with any of my personal information unless it’s absolutely necessary :) .

Posted in Think Tank | 3 Comments

A Brand New Spec, S15

So, for the past few days I’ve been working on a provisional S15 mostly for fun. I was considering TimToady’s long-ago suggestion of developing a libicu replacement tuned to Perl 6’s needs, and after learning some interesting things about NFG, I finally got around to writing an S15.

After those few days, S15 has become “good enough” for inclusion into the specs repository, where it will benefit from many people being able to edit the spec. Now anyone with commit access to the specs repository will be able to improve it, as well as anyone who forks the repo :) .

See it here.

The contents of S15 are far from finished. There’s a lot of stuff that still needs working out, such as the functions of the Stringy and Unicodey roles, whether Uni is a rope of multiple Normalization Forms or just a simple string containing that mixture, and the function of string operators now. For instance,

Str ~ Str

Concatenates two strings and results in a Str. But what happens when you try

Uni ~ NFC

or any of the other multitude of combinations of string types?

What’s Next?

There are three things I see that I could do at this time:

  1. Write and fudge a bunch of S15 tests. This seems to me to be the most important thing, as it allows us to see how coding with these new things feels before they ever begin to work.
  2. Copy a bunch of S15 information to the rest of the spec. This involves at least, off the top of my head, S05, S32::Str(ing), and S02. Undoubtedly more.
  3. Start migrating the other specs to Pod6. The S15 I placed in the repository makes it the second Pod6-written document in the specs repository. I should think that now’s a good time to migrate the rest of the specs, and modify/replace the relevant scripts in the mu repository to handle Pod6. All this work would of course happen in branches.

The list is in about the order I plan on doing these things, assuming others don’t work on these things first :) .

So please, read our not-yet-stellar provisional draft S15, and get ready for the Unicode Future™.

Posted in Press, Progress Happened | 2 Comments

Some Thoughts on Unicode in Perl 6

All of the recent work on Rakudo, getting it to run on the JVM, and the creation of and work on MoarVM as another backend for NQP (and thus Rakudo), has created a sense that we’re really moving forward in Perl 6-land. Maybe Christmas will come this year, or perhaps 2014?

In any case, with Rakudo now on a more mature platform, to be able to implement the big things (such as threads), it seems as though Rakudo is making big leaps towards being fully Perl 6.

Except that actually cannot happen, what with 8 unwritten synopses:

  • S15 — Unicode
  • S18 — Compiling
  • S20 — Introspection*
  • S23 — Security
  • S25 — Portable Perl
  • S27 — Perl Culture*
  • S30 — Standard Perl Library
  • S33 — Diagnostic Messages

And that’s not counting all the other synopses that need a serious rewrite (the higher the spec number, the more likely it’s in need of repair). With the momentum currently going forward in the community, perhaps it’s time we use some of that to fill out the rest of the specification?

If you haven’t guessed already, the spec I’ve been thinking about is S15. Below is a presentation of some notes on the subject I put up a couple of days ago.

Consider this humble Devanagari syllable:

नि (U+0928 U+093F)

Next to it you see the two codepoints that make it up. I shall now present a table on how the various UTFs encode this syllable:

UTF-8    | E0 A4 A8 E0 A4 BF
UTF-16BE | 0928 093F
UTF-32BE | 00000928 0000093F

When it comes to Unicode there are a number of ways to count characters, depending on your view of the situation. Here’s a quick list, from lowest to highest view:

  • Bytes are a simple count of the number of bytes that make up the given Unicode text.
  • Code units are the smallest units of information in an encoding. The number after UTF indicates the number of bits in a code unit (so the code unit of a UTF-8 text is the byte, 8 bits).
  • Code points are the numbers assigned to each “character” in Unicode. This is independent of encoding. The Devanagari syllable above has two code points.
  • Graphemes are what normally constitute a character to the reader’s eyes, regardless of how many code points make it up. Both ä and ä are just one grapheme, even though the first one is made up of two code points.
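Since a Python string is a sequence of code points, the ä example is easy to sketch as a quick experiment (this is Python standing in for illustration, not the proposed Perl 6 API):

```python
# The two renderings of "ä": one precomposed code point vs. two code points
precomposed = "\u00E4"   # ä, U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
combining   = "a\u0308"  # a + U+0308 COMBINING DIAERESIS

# One grapheme either way, but the code point counts differ:
print(len(precomposed))  # 1
print(len(combining))    # 2

# The byte counts under UTF-8 differ as well:
print(len(precomposed.encode("utf-8")))  # 2
print(len(combining.encode("utf-8")))    # 3
```

Note that Python’s `len` counts code points, not graphemes — exactly the default Perl 6 avoids by counting graphemes.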

For our Devanagari syllable above, the counting based on viewpoint and encoding is outlined here:

(count by)     UTF-8   UTF-16   UTF-32
bytes              6        4        8
code units         6        2        2
code points        2        2        2
graphemes          1        1        1

As you’ll notice, the counting of code points and graphemes is not affected by the text’s encoding. (Also note that the endianness of UTF-16 and UTF-32 doesn’t matter when it comes to counting.)
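The byte and code point rows of that table can be checked directly; here’s a small Python sketch doing so (the code unit counts follow by dividing bytes by the code unit size):

```python
s = "\u0928\u093F"  # the Devanagari syllable नि, two code points

# Byte counts under each encoding (BOM-free big-endian variants)
print(len(s.encode("utf-8")))      # 6
print(len(s.encode("utf-16-be")))  # 4
print(len(s.encode("utf-32-be")))  # 8

# Code units: bytes divided by the code unit size
print(len(s.encode("utf-16-be")) // 2)  # 2 UTF-16 code units
print(len(s.encode("utf-32-be")) // 4)  # 2 UTF-32 code units

# Code points (a Python string is a sequence of code points)
print(len(s))  # 2
```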

Perl 6 has some ideas about Unicode already set, such as counting by graphemes by default (which counts a string containing just our Devanagari syllable above as 1 long, which is what you usually mean).

What I’m putting here today are some of my ideas on what the methods and pragmas involved should look like. I’ve yet to think about Str and Buf specifically (questions such as their relationship with each other and whether more-derived types of Str/Buf (e.g. StrGraphemes) are necessary or useful). There’s hardly enough here for a decent S15, but hopefully enough for a starting point.

Pragmas — Changing Defaults

Perl 6 handles Unicode in a couple of default ways:

  • Encodes Unicode strings in UTF-8
  • Views strings in terms of graphemes unless another view is requested

These defaults should pervade any time you’re dealing with text, whether it’s a literal string, user input, or non-binary file I/O. You can always change these, such as .codes to count the string by codepoints, or open("file", :enc<UTF-16BE>) to open a text file you know is encoded as UTF-16BE.
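Per-handle encoding overrides like that `:enc<UTF-16BE>` have close analogues elsewhere; as a point of comparison, here’s the UTF-16BE round trip in Python (the file name is made up for the example):

```python
import os
import tempfile

# Write, then read back, a small UTF-16BE text file
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-16-be") as f:
    f.write("\u0928\u093F")  # नि

with open(path, "r", encoding="utf-16-be") as f:
    print(f.read() == "\u0928\u093F")  # True
```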

But if you’re dealing with a lot of UTF-32LE encoded files, or you need to do a lot of string operations at the code unit level, then pragmas are the way to change these defaults. Here are the pragmas as I imagine spelling them:

use utf8;          # use UTF-8 encoding
use utf16 :be/:le; # use UTF-16[BE|LE] encoding
use utf32 :be/:le; # use UTF-32[BE|LE] encoding

use graphemes;  # count by graphemes
use codepoints; # count by code points   
use codeunits;  # count by code units
use bytes;      # count by bytes

There is also one other pragma I’ve thought up, although its usefulness is very questionable:

use normalization :NFC/:NFD/:any;
# compose/decompose/leave be all characters
# in strings at time of creation.

Methods for Str

These methods either count characters in a certain way, or (de)compose them, or change the encoding of the Str. Here’s the list, some of these already specced:

.chars  # count by the current default view (default .graphs)
.graphs # count by graphemes
.codes  # count by code points
.units  # count by code units
.bytes  # count by bytes

.compose   # convert string to NFC form
.decompose # convert string to NFD form

.convert # change the encoding of the Str

There’s likely a host of other functionality that Strs need, but these are the ones that have come to mind.

Closing Thoughts

I don’t think Perl 6 needs a separate type for a single character (the Char and AnyChar found in the untouched corners of the spec). It feels like an unnecessary addition; it’s hard for me to see a time when a one-character string needs to be treated differently from a multi-character string.

The spec also mentions, a couple of times, the idea of counting characters with adverbs such as :ArabicChars in addition to :graphs. I’d like to see examples of scripts where a “grapheme” is not always the same as a complete “character” before going along with such language-specific counting mechanisms in core.

I also feel that Buf needs better explanation. I’m thinking about it now, and I suppose I need some convincing that we need to consider Buf the cousin of Str. I think it’s useful to have a type of array that’s designed to work with binary files (something I feel Buf is perfect for), but I have doubts about treating it like a numbers-based look at Str.

To put it another way, I’ve always used, and thus see, Buf as a way of interacting with binary files and their data. I have a hard time believing such an object should be tasked with text-based knowledge, such as whether it’s a valid Unicode string, when that may not be the case.

(Although derived versions of Buf, such as Utf8, could perfectly place text-based restrictions on its data. But leave Buf out of it. :) )

Finally, I hope that we can soon get a decent S15 written, and then maybe also finish the rest of the spec. How ’bout it?

* To be fair, there are two drafts, one each for S20 and S27. They’re available from the front page of the HTML specs, as the A20 draft (yes, an apocalypse draft) and the S27 draft. The A20/S20 draft might be worth a look, perhaps combined with what jnthn’s debugger does. The S27 draft, in my opinion, should be ignored without a second thought.

Maybe there should be an S34 for 6model, making it nine unwritten.
