AddThis

Thursday, September 29, 2016

Elixir Destructuring

Break it all apart

Welcome to the fourth installment of our work with Strings. As usual here's a reminder of the original problem we were trying to solve:


take_prefix.("Mr. John", "Mr. ")
# returns 'John'

We want to be able to chop off some prefix and return the suffix of a string.

So far we've come up with three implementations, each one improving upon each other. Our initial implementation:


# slow bc of multiple String.lengths
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base, String.length(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The second implementation used ranges:


# replace one of our slow length call with a range
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base..-1)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The third implementation used binary functions due to their constant speed regardless of the size of the given string.


take_prefix = fn full, prefix ->
  base = byte_size(prefix)
  binary_part(full, base, byte_size(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The final improvement is really more aesthetics. We can make this solution a bit more functional and feel more Elixir-y by using a concept known as destructuring.

Destructuring is a way of taking a complicated data structure and breaking it apart into simplier components. Quick example:


[a, b, c] = [1, 2, 3]
# a now has the value 1
# b now has the value 2
# c now has the value 3


In this example, we took the array holding three numbers and broke it apart into it's separate elements. This is destructuring in a nutshell. Taking something and breaking it apart.

Here's another example using tuples (a data structure holding elements that are contiguous in memory).


{status, status_message} = {:ok, "Success"}

Sometimes our complex object that we want to break apart has information we don't care about. You might think you could do something like this:


# this will not work!
{status, status_message} = {:ok, "Success", "Junk"}
#** (MatchError) no match of right hand side value: {:ok, "Success", "Junk"}

Elixir thinks you messed up, so you have to be very explicit in telling it that you don't care about the last match. You do this by using an underscore.


{status, status_message, _} = {:ok, "Success", "Junk"}

Now this will work. Alright, I think we have all the tools we need to understand the final solution.


take_prefix = fn full, prefix ->
  base = byte_size(prefix)
  # this is the destructuring
  <<_::binary-size(base), rest::binary>> = full
  rest
end

IO.puts take_prefix.("Mr. John", "Mr. ")

Here we are destructuring the full string into two binary components. The << and >> signify that this is a binary data structure, in this case with two elements. The first element will contain a binary string that is the binary size of the prefix and the second element is the rest of the string. Let's get into a bit more detail.

Remember what we saw last week, a String is just a binary string in disguise. So what we are doing here is figuring out the number of bytes used in the prefix (so we know how much to chop off). Then we destructure the full string into two binary parts. The first binary part is the number of bytes that we calculated before (but now represented as the size in binary due to us wanting to destructure this into a binary data structure). And the second part is the actual string that we want to return, represented as a binary data structure. But as we saw last week, a String is just a UTF-8 encoded binary so returning this is just fine.

Destructuring is a powerful technique and is very useful in producing short and concise code. Elixir is not the only language that has this feature. ECMAScript 2016 has it as well.

With that, this concludes our multi-post demonstration on a simple problem I found from the Elixir docs, and how they were able to iterate through a few solutions until the settled on their favorite implementation. I felt like their explanation didn't go into enough detail, which is why we've been looking at each a bit more closely these past few weeks. Hopefully you learned something and had fun while doing it.

Next week, we'll discover and discuss a new topic that I haven't yet decided on yet :)

Tuesday, September 20, 2016

Getting Faster By Going Deeper

Elixir Code Points

This week is the third installment of our look at Strings. As a quick reminder, we tried to implement a function that chops off a prefix from a string.


take_prefix.("Mr. John", "Mr. ")
# returns 'John'

So far we've come up with two implementations, the second improving on the first. Our initial implementation:


# slow bc of multiple String.lengths
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base, String.length(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The second implementation used ranges:


# replace one of our slow length call with a range
take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base..-1)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

The next improvement is to not even use String functions at all! Wait, what? How can you do that? What would you use??

To answer this, we have to see how Strings are represented. Everything in computers essentially boil down to 0's and 1's; we call these bits. If you put 8 of these together, you get a byte. With 8 bits put together, and each digit representing 2^n, where n is the number of positions from the far right, you can represent 0 to 255. So with 255 numbers, you just represent each letter with a number. For example, we use the number 97 to represent the small letter `a`. Another way to say this is that the small letter `a` has code point 97. This mapping is called the character encoding and Elixir uses UTF-8.

Problem is we have more than 255 characters that we want to represent, like letters with accent marks, or non-latin characters like Chinese. This means we need more numbers, which means we need more bytes. Let's look at an example using iex, the interactive elixir repl.


# `?` shows us the code point
iex> ?a
97

iex> ?ł
322

The small `a` is less than 255, which means we only need one byte to represent it. But the letter `ł` is over 255, and actually will require 2 bytes to represent 322. We can check double check this to see we are right.


iex> byte_size("a") 
1

iex> byte_size("ł") 
2

We can go even further and force Elixir to spit out it's binary representation by using a trick where we concatenate a null byte `<<0>>` to the string.


iex> "a" <> <<0>>
<<97, 0>>

iex> "ł" <> <<0>>
<<197, 130, 0>>

iex> "ał" <> <<0>>
<<97, 197, 130, 0>>

We can see that the small letter `a` only needs one byte to show 97, but `ł` needs more than one byte. So Elixir splits up the 2 bytes into two different and separate bytes and then represents each byte with it's own number. This is why we get 197 and 130 to show the 322 that really is the letter `ł`.

All this that we covered is really just to say that any letter will always be between 1 to 4 COMPLETE bytes. Elixir will never use a fraction of a byte to represent a letter. We can take advantage of this and use Elixir byte functions, which are WAY faster than Elixir String functions.

There is one more caveat, we have to make absolutely sure that whatever byte functions we use, we never chop in between code points. We don't want to chop `ł` into two separate bytes because then it won't be `ł` anymore, you need both bytes to represent this letter. Let's see our new solution:


take_prefix = fn full, prefix ->
  base = byte_size(prefix)
  binary_part(full, base, byte_size(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

Instead of `String.length` we used `byte_size`. The `String.length` function as you recall gets more expensive the longer the string is, but the latter `byte_size` always runs in constant time, regardless of the input size. This is great! Similar improvements are also gained by using `binary_part` instead of `String.slice`.

This is a good example of how understanding how something works under the covers allows you to employ some neat tricks. This is a valuable bit of insight that you will see often and should employ yourself. Whenever possible, think about replacing String functions with low level byte functions, you will get nice performance gains.

Next week we will use perform our final enhancement to this solution and wrap up our series on Elixir Strings.

Tuesday, September 13, 2016

Home Home on the Elixir Range

Elixir Ranges

Last week we were digging around the Elixir docs and had some fun with strings. As a reminder, we tried to implement a function that chops off a prefix from a string. Something like this:


take_prefix.("Mr. John", "Mr. ")
# returns 'John'

We saw our initial implementation:


take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base, String.length(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

But one big problem were the multiple calls to


String.length

which as we saw last week, does a full traversal of the string. As per the Elixir docs, the first improvement they perform is replacing


String.slice(full, base, String.length(full) - base)


with


String.slice(full, base..-1)

Let's look at this a bit. What are those dots in the second argument? This is something called a range. From the docs, a range is:


A range represents a discrete number of values 
where the first and last values are integers.

Ranges can be either increasing (first <= last)
or decreasing (first > last). Ranges are also always inclusive.

Let's take an example to help illustrate this:


range = 1..5

The `range` variable holds 5 numbers (1, 2, 3, 4, 5). That's it! It's just a collection of sequential numbers. Now let's check the docs on the new slice method:


slice(string, range)

#Returns a substring from the offset given by the
#start of the range to the offset given by the end
#of the range

So our call to slice


# As we saw last week, base is 4
# base = String.length(prefix)
String.slice(full, base..-1)

really is:


String.slice("Mr. John", 4..-1)

This function takes the string starting at the 4th letter and goes to the first letter counting backwards.

To say this another way, it's saying to take the string starting from the 4th letter up to and including the last letter.

Again, the advantage of using the range is that we save having to do the `String.length` call in the Slice method that we did last week.

Ranges are fun and come in quite handy. One more quick example with Ranges


println = fn x ->
  IO.puts x
end

Enum.each 1..5, println
Enum.each [1, 2, 3, 4, 5], println

Both of the last two lines will print out the numbers 1 through 5. The first one does so using a range and the second one uses a list of numbers.

That's it this week on the second improvement to our `take_prefix` method. Next week we will look at the next improvement that we can make to this method. As a teaser, it will involve us knowing how strings are represented. Until then!

Tuesday, September 6, 2016

Elixir Prefix by Suffix

Elixir Strings

I was reading through the Elixir docs and found some interesting code snippets in the `String` section. The example was how we could implement a function that returns the ending of a string.

So in other words, we want a function like this:

take_prefix.("Mr. John", "Mr. ")
# returns 'John'

The docs actually show a few solutions, gradually iterating on each, improving it slowly. I really liked how they did it, but thought that each solution could use a bit more detail.

This post will take the first naive solution and talk about it. This solution is probably the most intuitive, so is a nice way to ease into Elixir strings.

Solution


take_prefix = fn full, prefix ->
  base = String.length(prefix)
  String.slice(full, base, String.length(full) - base)
end

IO.puts take_prefix.("Mr. John", "Mr. ")

Let's jump right in.


First line


take_prefix = fn full, prefix ->

The first line declares the function. We want a function that takes in two parameters, the full/entire string, and the prefix that we want chopped off.


Second line


base = String.length(prefix)

This line figures out how long the prefix string is. There is something to note here, and that this function `String.length` needs to traverse the entire string in order to figure out it's length. The reason for this is that some letters are made up of two characters, but are perceived by humans as one.

One example is this letter:

# é
iex> String.codepoints("é")
["e", "́"]

Two characters used to represent one. So the `String.length` function has no choice but to traverse the entire string to check for weird conditions like this. As a result, as the string gets longer, this function call takes longer to complete as it has more characters to check.


Third line


String.slice(full, base, String.length(full) - base)

The next line is a bit compact, but let's dig in and see if we can't unpack it.

Let's start inside and go out.


String.length(full) - base

We saw the first part before. But now we are taking the length of the full/entire string. And from that we are subtracting out the length of the prefix.

So from our example, the length of our full string is ("Mr. John") is 8, and the length of the prefix ("Mr.  ") is 4. Simple subtraction (8-4) and we have 4.

Our call really then is:


String.slice(full, 4, 4)

Looking up the docs for that:


slice(string, start, len)
Returns a substring starting at the offset start, and of length len

In plain English, this method says, "take the string Mr. John and starting at the 4th element (0 based), give me back a string 4 characters long".


Fourth line


IO.puts take_prefix.("Mr. John", "Mr. ")

And the last line shows the invocation of the method and the write to the console.

Tada. But as you may have noticed, this method is expensive, mainly due to the two length calls we have. And then we have an additional slice call that has to yet again traverse the string in order to give us the substring. For short strings, this solution is no problem. But when you get longer strings, this method will not be so great. There are ways we can improve on this, which we will take a look at next time.