Jon Keam Doesn't Exist: Getting Faster By Going Deeper

Elixir Code Points

This week is the third installment of our look at Strings. As a quick reminder, we tried to implement a function that chops off a prefix from a string.


take_prefix.("Mr. John", "Mr. ")
# returns 'John'

So far we've come up with two implementations, the second improving on the first. Our initial implementation:


# slow bc of multiple String.lengths
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base, String.length(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The second implementation used ranges:


# replace one of our slow length call with a range
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base..-1)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The next improvement is to not even use String functions at all! Wait, what? How can you do that? What would you use??

To answer this, we have to see how Strings are represented. Everything in computers essentially boil down to 0's and 1's; we call these bits. If you put 8 of these together, you get a byte. With 8 bits put together, and each digit representing 2^n, where n is the number of positions from the far right, you can represent 0 to 255. So with 255 numbers, you just represent each letter with a number. For example, we use the number 97 to represent the small letter `a`. Another way to say this is that the small letter `a` has code point 97. This mapping is called the character encoding and Elixir uses UTF-8.

Problem is we have more than 255 characters that we want to represent, like letters with accent marks, or non-latin characters like Chinese. This means we need more numbers, which means we need more bytes. Let's look at an example using iex, the interactive elixir repl.


# `?` shows us the code point
iex> ?a
97

iex> ?ł
322

The small `a` is less than 255, which means we only need one byte to represent it. But the letter `ł` is over 255, and actually will require 2 bytes to represent 322. We can check double check this to see we are right.


iex> byte_size("a") 
1

iex> byte_size("ł") 
2

We can go even further and force Elixir to spit out it's binary representation by using a trick where we concatenate a null byte `<<0>>` to the string.


iex> "a" <> <<0>>
<<97, 0>>

iex> "ł" <> <<0>>
<<197, 130, 0>>

iex> "ał" <> <<0>>
<<97, 197, 130, 0>>

We can see that the small letter `a` only needs one byte to show 97, but `ł` needs more than one byte. So Elixir splits up the 2 bytes into two different and separate bytes and then represents each byte with it's own number. This is why we get 197 and 130 to show the 322 that really is the letter `ł`.

All this that we covered is really just to say that any letter will always be between 1 to 4 COMPLETE bytes. Elixir will never use a fraction of a byte to represent a letter. We can take advantage of this and use Elixir byte functions, which are WAY faster than Elixir String functions.

There is one more caveat, we have to make absolutely sure that whatever byte functions we use, we never chop in between code points. We don't want to chop `ł` into two separate bytes because then it won't be `ł` anymore, you need both bytes to represent this letter. Let's see our new solution:


take_prefix = fn full, prefix ->
  base = byte_size(prefix)
  binary_part(full, base, byte_size(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

Instead of `String.length` we used `byte_size`. The `String.length` function as you recall gets more expensive the longer the string is, but the latter `byte_size` always runs in constant time, regardless of the input size. This is great! Similar improvements are also gained by using `binary_part` instead of `String.slice`.

This is a good example of how understanding how something works under the covers allows you to employ some neat tricks. This is a valuable bit of insight that you will see often and should employ yourself. Whenever possible, think about replacing String functions with low level byte functions, you will get nice performance gains.

Next week we will use perform our final enhancement to this solution and wrap up our series on Elixir Strings.

Jon Keam Doesn't Exist

AddThis

Tuesday, September 20, 2016

Getting Faster By Going Deeper

Elixir Code Points

No comments: