StringScanner: Regular Expressions Made Better
I’ve been using Ruby’s StringScanner class a lot lately. It “performs lexical scanning operations on a string,” which is a fancy way of saying “read this string from left to right and stop at certain points to tell me what you found.”
It’s similar to using regular expressions. In fact, StringScanner uses regular expressions to get its work done. Let’s look at a contrived example:
dessert_choices = 'Edward enjoys ice cream. Elizabeth enjoys brownies.'
sentences = dessert_choices.split('. ')
We want to extract two pieces of data from each sentence: the name and the dessert. First with regular expressions:
sentences.each do |sentence|
  matches = sentence.match(/(\w+) enjoys ([\w\s]+)/)
  name = matches[1]
  dessert = matches[2]
  puts "#{name}: #{dessert}"
end
Then with StringScanner:
# StringScanner lives in the strscan standard library
require 'strscan'

sentences.each do |sentence|
  scanner = StringScanner.new(sentence)
  name = scanner.scan(/\w+/)
  scanner.skip(/ enjoys /)
  dessert = scanner.scan(/[\w\s]+/)
  puts "#{name}: #{dessert}"
end
These both print out exactly the same thing:
Edward: ice cream
Elizabeth: brownies
What is StringScanner doing?
StringScanner keeps track of a point in a string and can advance that pointer forward using scan. The pointer starts at the beginning:
Edward enjoys ice cream.
^
After running scanner.scan(/\w+/), the pointer moves forward, just past the name:
Edward enjoys ice cream.
      ^
Every time it scans, it returns the text that was scanned over.
In this case, it begins by scanning for one or more word characters (\w+) in a row. Then, it skips over the word enjoys along with the spaces around it. Finally, it scans for one or more word or space characters ([\w\s]+). The two scan calls return what they matched, which we store in name and dessert.
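To watch the pointer move, you can ask the scanner for its current position with pos and for the not-yet-scanned remainder with rest. Here is a small sketch using the first sentence from above. The positions array and the other local variable names are just for illustration, and pos reports a byte offset, which matches the character offset here only because the text is ASCII:

```ruby
require 'strscan'

scanner = StringScanner.new('Edward enjoys ice cream')

positions = [scanner.pos]            # the pointer starts at 0
name = scanner.scan(/\w+/)           # returns the matched text
positions << scanner.pos             # now just past "Edward"
skipped = scanner.skip(/ enjoys /)   # returns the length of what was skipped
positions << scanner.pos
remaining = scanner.rest             # everything the pointer has not passed yet
dessert = scanner.scan(/[\w\s]+/)
positions << scanner.pos

puts positions.inspect
puts "#{name}: #{dessert}"
puts "at end of string? #{scanner.eos?}"
```

Note that skip returns a length rather than the skipped text, which is why it fits the filler we don’t care about.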
Regular expressions can often extract text from strings using only a single line of very compact syntax. A StringScanner, on the other hand, usually needs multiple lines to accomplish even simple tasks like this one. Even in this example, I needed to split up the regular expression into three parts. So why would anyone use StringScanner?
It’s actually because StringScanner is broken up that I like using it. It allows me to partition my code and label discrete operations. Naming is such a huge part of code clarity, and string scanners make this so much easier.
Here’s a refactor of the StringScanner above:
sentences.each do |sentence|
  scanner = StringScanner.new(sentence)

  single_word = /\w+/
  filler = / enjoys /
  multiple_words = /[\w\s]+/

  name = scanner.scan(single_word)
  scanner.skip(filler)
  dessert = scanner.scan(multiple_words)

  puts "#{name}: #{dessert}"
end
This is even longer, but I’d argue that it’s also easier to read. Some of it even reads like English, such as scan(single_word) or skip(filler). Even in this contrived example, I find it much cleaner than a relatively simple regular expression like /(\w+) enjoys ([\w\s]+)/.
There are other useful things a StringScanner can do, like reading ahead without advancing the pointer (check) and returning nil, with the pointer left where it was, when it doesn’t find a match. The latter is useful for optional parts of a string: scan for foo and, if you don’t find it, continue on. I find that a StringScanner cleans things up especially well once I start using lots of capture groups in my regular expressions.
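As a rough sketch of both ideas, suppose some sentences contain an optional word. The second sentence and the word “really” are made up for this illustration and are not part of the example above. check peeks at what comes next without moving the pointer, and scan simply returns nil when the optional part is missing:

```ruby
require 'strscan'

# Hypothetical input: "really" is an optional part of the sentence.
sentences = ['Edward enjoys ice cream', 'Elizabeth really enjoys brownies']

results = sentences.map do |sentence|
  scanner = StringScanner.new(sentence)

  name = scanner.scan(/\w+/)
  next_word = scanner.check(/ \w+/)   # look ahead; the pointer does not move
  emphasis = scanner.scan(/ really/)  # nil if the optional word is absent
  scanner.skip(/ enjoys /)
  dessert = scanner.scan(/[\w\s]+/)

  { name: name, next_word: next_word, emphatic: !emphasis.nil?, dessert: dessert }
end

results.each { |result| puts result.inspect }
```

Because a failed scan leaves the pointer untouched, the code after the optional step is the same whether or not the word was there.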
Try out StringScanner next time you need to extract data from a string.