3 ways to extract numbers from a string

December 3, 2017

General Coding, Performance, Reference



I’m working on a project that requires me to parse some data that was formatted more for readability than for parsing. The format is similar to the output of NSStringFromCGRect, which has the form “{{x, y}, {width, height}}”.

let rect = CGRect(x: 124, y: 387, width: 74, height: 74)
let text = NSStringFromCGRect(rect)
// "{{124, 387}, {74, 74}}"

This string can easily be read back again using the counterpart function, NSRectFromString on macOS (CGRectFromString is the UIKit equivalent).

let text = "{{124.123, 387}, {74, 74}}"
let rect = NSRectFromString(text)

Since my data was similar, I decided to first implement a rectFromString method in several ways. The idea is that NSRectFromString can be used to validate the behavior and correctness of each parsing strategy. That gives me confidence in the strategy for similar situations.
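One way to sketch that validation, assuming a macOS (or other Foundation) target where NSRectFromString is available. The helper name `disagreements` and the sample strings are hypothetical and only here for illustration:

```swift
import Foundation

/// Returns the sample strings for which `parser` disagrees with NSRectFromString.
/// `disagreements` is a hypothetical helper name used only for this sketch.
func disagreements(of parser: (String) -> CGRect, samples: [String]) -> [String] {
    return samples.filter { parser($0) != NSRectFromString($0) }
}

let samples = ["{{124.123, 387}, {74, 74}}", "{{1, 2}, {3, 4}}"]

// A deliberately wrong parser that always returns .zero is flagged for every sample.
let failures = disagreements(of: { _ in CGRect.zero }, samples: samples)
print("\(failures.count) of \(samples.count) samples disagree")
```

Any of the parsing functions can be passed as `parser`; an empty result means it agreed with NSRectFromString on every sample.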

The way this method should work is to find the numbers in the string, up to 4, and ignore everything else. Then we use as many numbers as we have to fill in the components in order (x, y, width, height). Any component that does not receive a number will be filled in with zero.
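The fill-in rule on its own can be sketched like this. The helper name `makeRect` is hypothetical, and it assumes the numbers have already been extracted by one of the strategies in this post:

```swift
import Foundation

/// Hypothetical helper: builds a rect from up to four numbers, in the order
/// x, y, width, height. Missing components become zero; extras are ignored.
func makeRect(from numbers: [Double]) -> CGRect {
    let firstFour = Array(numbers.prefix(4))
    let zeros = Array(repeating: 0.0, count: 4 - firstFour.count)
    let padded = firstFour + zeros
    return CGRect(x: padded[0], y: padded[1], width: padded[2], height: padded[3])
}

// Two numbers fill x and y; width and height fall back to zero.
let partial = makeRect(from: [124, 387])
print(partial)
```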

That’s pretty straightforward, but how do we get the numbers out of the string?

I came up with three strategies: regular expression, component separation, and scanner.

Regular Expression

Regular expressions are a go-to of mine when searching strings.

public func rectFromStringRegularExpression(text: String) -> CGRect {
    do {
        let regex = try NSRegularExpression(pattern: "[\\d.]+", options: [])
        // NSRange counts UTF-16 code units, so use utf16.count rather than text.count
        var numbers = regex.matches(in: text, options: [], range: NSRange(location: 0, length: text.utf16.count))
            .map { (text as NSString).substring(with: $0.range) }
            .flatMap { Double($0) }
        
        if numbers.count < 4 {
            numbers.append(contentsOf: Array(repeating: 0.0, count: 4-numbers.count))
        }
        
        return CGRect(x: numbers[0], y: numbers[1], width: numbers[2], height: numbers[3])
    } catch {
        return .zero
    }
}

First, the regular expression is created with the pattern “[\d.]+”, which roughly means “one or more digits or decimal points”.
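To see what that pattern actually picks out, here is a small sketch. The function name `numberTokens` is hypothetical, and it uses the NSRange/Range conversions from Foundation rather than the NSString cast in the function above:

```swift
import Foundation

/// Returns every substring matched by the pattern "[\d.]+".
/// `numberTokens` is a hypothetical name used only for this sketch.
func numberTokens(in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: "[\\d.]+", options: []) else {
        return []
    }
    // NSRange is measured in UTF-16 code units, so build it from the string's indices.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let rectTokens = numberTokens(in: "{{124.123, 387}, {74, 74}}")
// rectTokens == ["124.123", "387", "74", "74"]

// The pattern is loose: it also matches text that is not a valid number,
// which is why the Double conversion has to be allowed to fail.
let looseTokens = numberTokens(in: "v1.2.3")
// looseTokens == ["1.2.3"], and Double("1.2.3") is nil
```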

If you’re not very familiar with regex, you should check out regex101.com.

Next, we check for matches. It would be nice if this method returned the matched strings, but instead it returns an array of NSTextCheckingResult objects, each carrying an NSRange. So we map those into strings by applying each range to the input string. Then those strings can be mapped to Doubles. The Double conversion can fail, so this is actually done with flatMap to ignore the failed conversions.

Now that we have the numbers from the string, we can check whether there are enough to fill the rect components. If not, we append as many zeros as needed.

Finally, create the rect using the array of numbers.

I nearly forgot to mention that the initializer for NSRegularExpression can throw, so we have to wrap it all in a do-catch and just return CGRect.zero in the case of failure.

So this method isn’t very nice actually. We have to wrap the code for throws, we have to append extra zeros, and we have to deal with substrings and ranges. Let’s see how this can be improved.

Component separation

String has a convenient method, components(separatedBy:), that splits it into an array of components. For example, if you wanted to parse a CSV string you could do something like this:

let csv: String = loadCSV()
let values = csv.components(separatedBy: ",")

We’re not too far off from that, actually, but I’ll take it a step further and separate by anything that isn’t part of a number.

public func rectFromStringComponentSeparation(text: String) -> CGRect {
    let decimalPoint = CharacterSet(charactersIn: ".")
    let decimal = CharacterSet.decimalDigits.union(decimalPoint)
    
    var numbers = text.components(separatedBy: decimal.inverted)
        .filter { !$0.isEmpty }
        .flatMap { Double($0) }
    
    if numbers.count < 4 {
        numbers.append(contentsOf: Array(repeating: 0.0, count: 4-numbers.count))
    }
    
    return CGRect(x: numbers[0], y: numbers[1], width: numbers[2], height: numbers[3])
}

We need to define the CharacterSet that we are looking for. CharacterSet has a few predefined sets, .decimalDigits gets close but doesn’t contain the decimal point. We first declare a CharacterSet with one character ‘.’ Then we define our set “decimal” by taking the union of both.

Next we separate components by the inverse of the decimal set. This will actually give us many components. Some will contain the numbers we are looking for and the others will be empty. So let’s next filter out the empty components. Then we do the flatMap to convert the string to Double as before.
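A short illustration of where the empty components come from, using a deliberately tiny input string (the variable names are just for this sketch):

```swift
import Foundation

let decimal = CharacterSet.decimalDigits.union(CharacterSet(charactersIn: "."))

// Every non-numeric character is a separator, so adjacent separators
// (and separators at either end) produce empty strings.
let pieces = "{{1, 2}}".components(separatedBy: decimal.inverted)
// pieces == ["", "", "1", "", "2", "", ""]

let nonEmpty = pieces.filter { !$0.isEmpty }
// nonEmpty == ["1", "2"]
```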

Finally, just like with the regular expression version, we append any additional zeros then fill in the CGRect components.

This version is much cleaner. We don’t have do-catch blocks and we don’t have to deal with ranges and substrings. However, we still have to append extra zeros as needed, and now we have the extra empty components that we have to filter out.
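As an aside, the filtering step can be avoided with the standard library’s split(whereSeparator:), which omits empty subsequences by default. This is a different technique from components(separatedBy:) and was not part of my measurements; the names `isSeparator` and `numbersBySplitting` are hypothetical:

```swift
// Hypothetical alternative: anything that is not an ASCII digit or "." is a separator.
let asciiDigits: ClosedRange<Character> = "0"..."9"

func isSeparator(_ character: Character) -> Bool {
    return !asciiDigits.contains(character) && character != "."
}

/// Splits on separators and converts the remaining pieces to Double,
/// dropping any piece that is not a valid number.
func numbersBySplitting(_ text: String) -> [Double] {
    return text
        .split(whereSeparator: isSeparator)
        .compactMap { Double(String($0)) }
}

let splitNumbers = numbersBySplitting("{{124.123, 387}, {74, 74}}")
// splitNumbers == [124.123, 387.0, 74.0, 74.0]
```

The zero padding would still be needed afterwards, exactly as in the version above.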

Scanner

I don’t use Scanner often but after these results I might just be using it more often 🙂

public func rectFromStringScanner(text: String) -> CGRect {
    let scanner = Scanner(string: text)
    scanner.charactersToBeSkipped = CharacterSet.decimalDigits.inverted
    
    let numbers: [Double] = (1...4)
        .map { _ in
            var number: Double = 0
            scanner.scanDouble(&number)
            return number
        }
    
    return CGRect(x: numbers[0], y: numbers[1], width: numbers[2], height: numbers[3])
}

Creating the Scanner is simple, just pass the text. Normally it will scan all of the characters in the string. For example, given the string “{12}”, we can’t immediately scan the number 12. We would first have to scan past the “{” character. In this way we could parse the syntax as well. For this experiment I don’t care about the rest of the syntax, just the numbers. Luckily there’s a way to skip characters that we aren’t looking for.
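The “{12}” case can be sketched as follows. This assumes a toolchain where Scanner’s optional-returning scanDouble() and scanString(_:) are available (macOS 10.15 / iOS 13 or later), which is a newer API than the pointer-based call used in the function above:

```swift
import Foundation

// With the default skip set (whitespace and newlines), "{" blocks the number scan.
let braceScanner = Scanner(string: "{12}")
let blocked = braceScanner.scanDouble()      // nil: the scanner is sitting on "{"

// Consuming the brace explicitly lets the number be scanned.
let brace = braceScanner.scanString("{")     // "{"
let value = braceScanner.scanDouble()        // 12.0
```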

We assign the inverted .decimalDigits set to scanner.charactersToBeSkipped. We don’t need to add the decimal point as with the previous solution, as long as the decimal point is preceded by at least one digit (e.g. 0.12 rather than .12). Now each time we scan we should get a number.

We scan for a Double four times. If the scanner finds a number, it stores the value in the location passed (the address of var number). If the scanner doesn’t find a number, the initial value (0) is returned from the map.
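The zero fallback is easy to see with an input that holds fewer than four numbers. This sketch assumes the newer optional-returning scanDouble() (macOS 10.15 / iOS 13 or later) instead of the pointer-based call above, and `scanNumbers` is a hypothetical helper name:

```swift
import Foundation

/// Scans `count` numbers from `text`, substituting 0 whenever the scanner
/// runs out of numbers. `scanNumbers` is a hypothetical helper for this sketch.
func scanNumbers(in text: String, count: Int) -> [Double] {
    let scanner = Scanner(string: text)
    scanner.charactersToBeSkipped = CharacterSet.decimalDigits.inverted
    return (0..<count).map { _ in scanner.scanDouble() ?? 0 }
}

let scanned = scanNumbers(in: "{{1.5, 2}}", count: 4)
// scanned == [1.5, 2.0, 0.0, 0.0]
```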

Finally we use the numbers array to set the components of CGRect.

This version is actually really nice. It’s easy to read and understand, there are no error cases to deal with, and there will always be exactly four numbers in the array.

Performance

I measured the time it took to read 100,000 rects from strings. Of the three methods Scanner completed the task in the least amount of time.

Rough measurements:
19 seconds – rectFromStringRegularExpression
13 seconds – rectFromStringComponentSeparation
11 seconds – rectFromStringScanner
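A timing loop along these lines is enough for rough numbers. It is a sketch rather than the exact harness behind the figures above, `measure` is a hypothetical helper name, and the workload shown is only a stand-in:

```swift
import Foundation

/// Runs `body` the given number of times and returns the elapsed wall-clock seconds.
func measure(iterations: Int, _ body: () -> Void) -> TimeInterval {
    let start = Date()
    for _ in 0..<iterations {
        body()
    }
    return Date().timeIntervalSince(start)
}

// Stand-in workload; swap in whichever rectFromString variant is being timed.
var parsed = 0
let elapsed = measure(iterations: 100_000) {
    if Double("124.123") != nil { parsed += 1 }
}
print("parsed \(parsed) values in \(elapsed) seconds")
```

Absolute times depend heavily on the machine and build configuration, so only the relative ordering between the methods is worth comparing.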

Admittedly, Apple’s own CGRectFromString beat rectFromStringScanner by a couple of tenths of a second. That’s not much considering the others were worse by several seconds. So, use the built-in method if you are parsing a rect, but you can use the scanner for custom structures and feel confident that the code will be clean and performant.


