7/5/2023

String reverse codepoints

This post was adapted from a talk called “String Theory”, which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016.

In my post on Unicode and UTF-8, I showed you the basis of Elixir’s great Unicode support: every string in Elixir is a series of codepoints, encoded in UTF-8. I explained what Unicode is, and we walked through the encoding process and saw the exact bits it produces.

For this post, what’s important to know is that UTF-8 represents codepoints using three kinds of bytes. Larger codepoints get a leading byte followed by one, two or three continuation bytes, and the leading byte tells how many continuation bytes we should expect. Smaller codepoints just get a single solo byte.

Each kind of byte has a distinct pattern, and by using those patterns, Elixir can do a lot of things correctly that some other languages mess up, like reverse a string without breaking up its characters.

Suppose we wanted to reverse the string "a™". Elixir represents that string as a binary with four bytes: the "a" gets a solo byte, and the "™" gets three bytes (a leading byte and two continuation bytes).

For simplicity’s sake, we can picture "a™" like this:

Solo | First of 3 | Continuation | Continuation

You wouldn’t want to reverse it like this, scrambling the multi-byte "™":

Continuation | Continuation | First of 3 | Solo

Instead, you’d want to reverse it like this, keeping the bytes for "™" intact:

First of 3 | Continuation | Continuation | Solo

Elixir does this correctly because, thanks to using UTF-8, it can tell which bytes should go together. That’s also what lets it correctly measure the length of a string, or get substrings by index: because it knows which bytes go together, it knows whether (for example) the first three bytes express one character or three.

Not only can we have multiple bytes in one codepoint; we can also have multiple codepoints in one “grapheme”. A grapheme is what most people would consider a single visible character, and in some cases, what looks like “a letter with an accent mark” may be composed of a “plain” letter followed by a “combining diacritical mark”, which says, “hey, put this mark on the previous letter”. A series of codepoints that represent a single grapheme is called a “grapheme cluster.”

Elixir lets us ask for either the codepoints or the graphemes in a string.

Now, you might wonder: if I can add one mark to a letter, can I add two? How many can I add? The answer is: you can add a boatload of them!

You may have seen “zalgo text,” where some poor website’s text box is overflowed with horrible-looking characters, and they think they’ve been hacked. You may have seen questions on Stack Overflow asking how to keep people from putting this junk on your web site. And you may have seen snarky responses that break the Stack Overflow comment section.

Because unfortunately, there’s no easy way to prevent people from putting this on your site. Remember, Unicode is trying to cover all of human language.
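To make the byte, codepoint, and grapheme distinctions in this post concrete, here is a minimal sketch that can be run as a script or pasted into `iex`. It uses only standard-library functions (`byte_size/1`, `:binary.bin_to_list/1`, `:binary.list_to_bin/1`, and the `String` module). The variable names (`s`, `scrambled`, `accented`, `zalgo`) and the sample strings other than "a™" are my own illustration, not taken from the original talk.

```elixir
# The string from the post: "a" is one solo byte, "™" (U+2122) is three bytes.
s = "a™"

IO.inspect(byte_size(s))                                     # 4
IO.inspect(:binary.bin_to_list(s), charlists: :as_lists)     # [97, 226, 132, 162]
IO.inspect(String.length(s))                                 # 2

# String.reverse/1 keeps the three bytes of "™" together.
IO.inspect(String.reverse(s))                                # "™a"

# A naive byte-level reversal (what we do NOT want) scrambles "™":
# the result starts with a continuation byte, so it is not valid UTF-8.
scrambled =
  s
  |> :binary.bin_to_list()
  |> Enum.reverse()
  |> :binary.list_to_bin()

IO.inspect(String.valid?(scrambled))                         # false

# Illustrative example: a plain "e" followed by U+0301 (combining acute accent).
# Two codepoints, but one grapheme.
accented = "e\u0301"

IO.inspect(length(String.codepoints(accented)))              # 2
IO.inspect(length(String.graphemes(accented)))               # 1
IO.inspect(String.length(accented))                          # 1

# Stacking more combining marks ("zalgo" style) still yields a single grapheme.
zalgo = "e" <> String.duplicate("\u0301", 5)

IO.inspect(length(String.codepoints(zalgo)))                 # 6
IO.inspect(String.length(zalgo))                             # 1
```

The byte-level reversal is shown only as a counterexample. In practice `String.reverse/1` is the function to reach for, since it reverses by grapheme and so keeps both multi-byte codepoints and combining marks attached to their base letters.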