mikeash.com: Friday Q&A 2015-11-06: Why is Swift's String API So Hard?

Posted at 2015-11-06 14:00 | RSS feed (Full text feed) | Blog Index
Next article: Friday Q&A 2015-11-20: Covariance and Contravariance
Previous article: Friday Q&A 2015-09-18: Building a Gear Warning System
Tags: fridayqna swift

Friday Q&A 2015-11-06: Why is Swift's String API So Hard?

by Mike Ash

This article is also available in Japanese (translation by POSTD).

Welcome to a very delayed edition of Friday Q&A. One of the biggest complaints I see from people using Swift is the String API. It's difficult and obtuse, and people often wish it were more like string APIs in other languages. Today, I'm going to explain just why Swift's String API is designed the way it is (or at least, why I think it is) and why I ultimately think it's the best string API out there in terms of its fundamental design.

What is a String?
Let's build a conceptual foundation before we jump into it. Strings are one of those things we understand implicitly but often don't really think about too deeply. Thinking about them deeply helps understand what's going on.

What is a string, conceptually? The high level view is that a string is some text. "Hello, World" is a string, as is "/Users/mikeash" and "Robert'); DROP TABLE Students;--".

(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

How is this general concept of "text" represented at the machine level? Well, it depends. There are a ton of different ways to do it.

In many languages, a string is an array of bytes. Giving meaning to those bytes is mostly left up to the program. This is the state of strings in C++ using std::string, in Python 2, in Go, and in many other languages.

C is a weird special case of this. In C, a string is a pointer to a sequence of non-zero bytes, terminated by a zero byte. The basic effect is the same, but C strings can't contain zero bytes, and operations like finding the length of a string require scanning memory.

A lot of newer languages define strings as a sequence of UCS-2 or UTF-16 code units. Java, C#, and JavaScript are all examples of this, as well as Objective-C using Cocoa and NSString. This is mostly a historical accident. When Unicode was first introduced in 1991, it was a pure 16-bit system. Several popular languages were designed around that time, and they used Unicode as the basis of their strings. By the time Unicode broke out of the 16-bit model in 1996, it was too late to change how these languages worked. The UTF-16 encoding allows them to encode the larger numbers as pairs of 16-bit code units, and the basic concept of a string as a sequence of 16-bit code units continues.

A variant of this approach is to define a string as a sequence of UTF-8 code units, which are 8-bit quantities. This is overall similar to the UTF-16 approach, but allows for a more compact representation for ASCII strings, and avoids conversion when passing strings to functions that expects C-style strings, as those often accept UTF-8 strings.

Some languages represent strings as a sequence of Unicode code points. Python 3 works this way, as do many C implementations of the built-in wchar_t type.

In short, a string is usually considered to be a sequence of some kind of characters, where a character is typically a byte, a UTF-16 code unit, or a Unicode code point.

Problems
Having a string be a sequence of some "character" type is convenient. You can typically treat them like arrays (often they are arrays) which makes it easy to grab subranges, slice pieces off the beginning or end, delete portions, count elements, etc.

The trouble is that we live in a Unicode universe, and Unicode makes things hard. Let's look at an example string and see how it works out:

    aé∞𝄞

Each Unicode code point has a number (written as U+nnnn) and a human-readable name (written in ALL CAPS for some reason) which make it easier to talk about the individual code points. This particular string consists of:

U+0061 LATIN SMALL LETTER A
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
U+221E INFINITY
U+1D11E MUSICAL SYMBOL G CLEF

Let's remove a "character" from the middle of this string, treating a "character" as a UTF-8 byte, a UTF-16 code unit, or a Unicode code point.

Let's start with UTF-8. Here's what this string looks like as UTF-8:

    61 65 cc 81 e2 88 9e f0 9d 84 9e
    -- -- ----- -------- -----------
    a  e    ´      ∞          𝄞

Let's remove the 3rd "character," which we're treating as the 3rd byte. That produces:

    61 65 81 e2 88 9e f0 9d 84 9e

This string is no longer valid UTF-8. UTF-8 bytes fall into three categories. Bytes of the form 0xxxxxxx with the top bit set to 0 represent plain ASCII characters and stand alone. Bytes of the form 11xxxxxx denote the start of a multi-byte sequence whose length is indicated by the location of the first zero bit. Bytes of the form 10xxxxxx denote the remainder of a multi-byte sequence. The byte cc formed the start of a multi-byte sequence a total of two bytes long, and the byte 81 was the trailing byte of that sequence. By removing cc, the trailing byte 81 is left standing alone. Any validating UTF-8 reader will reject this string. This same problem will occur when removing any of the bytes from the third place onward in this string.

How about the second byte? If we remove that, we get:

    61 cc 81 e2 88 9e f0 9d 84 9e
    -- ----- -------- -----------
    a    ´      ∞          𝄞

This is still valid UTF-8, but the result is not what we might expect:

    á∞𝄞

To a human, the "second character" of this string is "é." But the second byte is just the "e" without the accent mark. The accent mark is added separately as a "combining character." Removing the second byte of the string just removes the "e" which causes the combining accent mark to attach to the "a" instead.

What if we remove the very first byte? This result, at least, is what we'd expect:

    65 cc 81 e2 88 9e f0 9d 84 9e
    -- ----- -------- -----------
    e    ´      ∞          𝄞

Let's take a look at UTF-16 now. Here's what the string looks like as UTF-16:

    0061 0065 0301 221e d834 dd1e
    ---- ---- ---- ---- ---------
     a    e    ´    ∞       𝄞

Let's try removing the second "character":

    0061 0301 221e d834 dd1e
    ---- ---- ---- ---------
     a    ´    ∞       𝄞

This has the same problem as we had above with UTF-8, deleting only the "e" but not the accent mark, causing it to attach to the "a" instead.

What if we delete the 5th character? We get this sequence:

    0061 0065 0301 221e dd1e

Similar to the problem we had with invalid UTF-8 above, this sequence is no longer valid UTF-16. The sequence d834 dd1e forms a surrogate pair, where two 16-bit units are used to represent a code point beyond the 16-bit limit. Leaving a single piece of the surrogate pair standing alone is invalid. Code that deals with UTF-8 usually rejects this sort of thing outright, but UTF-16 is often more forgiving. For example, Cocoa renders the resulting string as:

    aé∞�

What if the string is a sequence of unicode code points? It would look like this:

    00061 00065 00301 0221E 1D11E
    ----- ----- ----- ----- -----
      a     e     ´     ∞     𝄞

With this representation, we can remove any "character" without producing an invalid string. But the problem with the combining accent mark still occurs. Removing the second character produces:

    00061 00301 0221E 1D11E
    ----- ----- ----- -----
      a     ´     ∞     𝄞

Even with this representation, we're not safe from unintuitive results.

These are hardly artifical concerns, too. English is one of the few languages you can write with pure ASCII, and even then you'll have trouble, unless you feel like applying for a job with your "resume" instead of your résumé. The moment you step beyond ASCII, all these weird things start to appear.

Grapheme Clusters
Unicode has the concept of a grapheme cluster, which is essentially the smallest unit that a human reader would consider to be a "character." For many code points, a grapheme cluster is synonymous with a single code point, but it also extends to include things like combining accent marks. If we break the example string into grapheme clusters, we get something fairly sensible:

    a é ∞ 𝄞

If you remove any single grapheme cluster, you get something that a human would generally consider to be reasonable.

Note that I didn't include any numeric equivalents in this example. That's because, unlike UTF-8, UTF-16, or plain unicode code points, there is no single number that can describe a grapheme cluster in the general case. A grapheme cluster is a sequence of one or more code points. A single grapheme cluster is often one or two code points, but it can be a lot of code points in the case of something like Zalgo. For example, consider this string:

    e⃝⃞⃟⃠⃣⃤⃥⃦⃪⃧꙰꙲꙱

This mess consists of 14 separate code points:

U+0065
U+20DD
U+20DE
U+20DF
U+20E0
U+20E3
U+20E4
U+20E5
U+20E6
U+20E7
U+20EA
U+A670
U+A672
U+A671

All of these code points form a single grapheme cluster.

Here's an interesting example. Consider a string containing the Swiss flag:

    🇨🇭

This one symbol is actually two code points: U+1F1E8 U+1F1ED. What are these code points?

U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C
U+1F1ED REGIONAL INDICATOR SYMBOL LETTER H

Rather than include a separate code point for the flag of every country on the planet, Unicode just includes 26 "regional indicator symbols." Add together the indicator for C and the indicator for H and you get the Swiss flag. Combine M and X and you get the Mexican flag. Each flag is a single grapheme cluster, but two code points, four UTF-16 code units, and eight UTF-8 bytes.

Implications For Implementations
We've seen that there are a lot of different ways to look at strings, and a lot of different things you can call a "character." A "character" as a grapheme cluster comes closest to what a human thinks of as a "character," but when manipulating a string in code, which definition you want to use will depend on the context. When moving an insertion point in text in response to an arrow key, you probably want to go by grapheme clusters. When measuring a string to ensure that it fits in a 140-character tweet, you want to go by unicode code points. When squeezing a string into an 80-character database table column, you're probably dealing in UTF-8 bytes.

How do you reconcile these when writing a string implementation, and balancing the conflicting requirements of performance, memory consumption, and clean code?

The typical answer is to pick a single canonical representation, then possibly allow conversions for the cases where other representations are needed. For example, NSString uses UTF-16 as its canonical representation. The entire API is built around UTF-16. If you want to deal with UTF-8 or Unicode code points, you can convert to UTF-8 or UTF-32 and then manipulate the result. Those are provided as data objects rather than strings, so it's not as convenient. If you want to deal with grapheme clusters, you can find their boundaries using rangeOfComposedCharacterSequencesForRange:, but it's a lot of work to do anything interesting.

Swift's String type takes a different approach. It has no canonical representation, and instead provides views on various representations of the string. This lets you use whichever representation makes the most sense for the task at hand.

Swift's String API, In Brief
In older versions of Swift, String conformed to CollectionType and presented itself as a collection of Character. As of Swift 2, this is no longer the case, and String mostly presents the various views as the proper way to access it.

This is not entirely the case, though, and String still somewhat favors Character and presents a bit of a collection-like interface:

    public typealias Index = String.CharacterView.Index
    public var startIndex: Index { get }
    public var endIndex: Index { get }
    public subscript (i: Index) -> Character { get }

You can index into a String to get individual Characters, but that's about it. Notably, you can't iterate using the standard for in syntax.

What is a "character" in Swift's eyes? As we've seen, there are a lot of possibilities. Swift has settled on the grapheme cluster as its idea of a "character." This seems like a good choice, since as we saw above it best matches our human idea of a "character" in a string.

The various views are exposed as properties on String. For example, here's the characters property:

    public var characters: String.CharacterView { get }

CharacterView is a collection of Characters:

    extension String.CharacterView : CollectionType {
        public struct Index ...
        public var startIndex: String.CharacterView.Index { get }
        public var endIndex: String.CharacterView.Index { get }
        public subscript (i: String.CharacterView.Index) -> Character { get }
    }

This looks a lot like the interface of String itself, except it conforms to CollectionType and so gets all of the functionality that provides, like slicing and iteration and mapping and counting. So while this is not allowed:

    for x in "abc" {}

This works fine:

    for x in "abc".characters {}

You can get a string back out of a CharacterView by using an initializer:

    public init(_ characters: String.CharacterView)

You can even get a String from an arbitrary sequence of Characters:

    public init<S : SequenceType where S.Generator.Element == Character>(_ characters: S)

Working our way down the hierarchy, the next view is the UTF-32 view. Swift calls UTF-32 code units "unicode scalars," since UTF-32 code units correspond exactly to Unicode code points. Here's what the (abbreviated) interface looks like:

    public var unicodeScalars: String.UnicodeScalarView

    public struct UnicodeScalarView : CollectionType, _Reflectable, CustomStringConvertible, CustomDebugStringConvertible {
        public struct Index ...
        public var startIndex: String.UnicodeScalarView.Index { get }
        public var endIndex: String.UnicodeScalarView.Index { get }
        public subscript (position: String.UnicodeScalarView.Index) -> UnicodeScalar { get }
    }

Like CharacterView, there's a String initializer for UnicodeScalarView:

    public init(_ unicodeScalars: String.UnicodeScalarView)

Unfortunately, there's no initializer for an arbitrary sequence of UnicodeScalars, so you have to do a little extra work if you prefer to manipulate, for example, an array of them and then turn them back into a string. There isn't even an initializer for UnicodeScalarView that takes an arbitrary sequence of UnicodeScalars. There is, however, a mutable append function, so you can build a String in three steps:

    var unicodeScalarsView = String.UnicodeScalarView()
    unicodeScalarsView.appendContentsOf(unicodeScalarsArray)
    let unicodeScalarsString = String(unicodeScalarsView)

Next is the UTF-16 view. It looks much like the others:

    public var utf16: String.UTF16View { get }

    public struct UTF16View : CollectionType {
        public struct Index ...
        public var startIndex: String.UTF16View.Index { get }
        public var endIndex: String.UTF16View.Index { get }
        public subscript (i: String.UTF16View.Index) -> CodeUnit { get }
    }

The String initializer for this view is slightly different:

    public init?(_ utf16: String.UTF16View)

Unlike the others, this is a failable initializer. Any sequence of Characters or UnicodeScalars is a valid String, but it's possible to have a sequence of UTF-16 code units that don't form a valid string. This initializer will produce nil if the view's contents aren't valid.

Going from an arbitrary sequence of UTF-16 code units back to a String is pretty obscure. UTF16View has no public initializers and few mutating functions. The solution is to use the global transcode function, which works with the UnicodeCodecType protocol. There are three implementations of this protocol: UTF8, UTF16, and UTF32. The transcode function can be used to convert between them. It's pretty gnarly, though. For the input, it takes a GeneratorType which produces the input, and for the output it takes a function which is called for each unit of output. This can be used to build up a string piece by piece by converting to UTF32, then converting each UTF-32 code unit to a UnicodeScalar and appending it to a String:

    var utf16String = ""
    transcode(UTF16.self, UTF32.self, utf16Array.generate(), { utf16String.append(UnicodeScalar($0)) }, stopOnError: true)

Finally, there's the UTF-8 view. It's what we'd expect from what we've seen so far:

    public var utf8: String.UTF8View { get }

    public struct UTF8View : CollectionType {
        /// A position in a `String.UTF8View`.
        public struct Index ...
        public var startIndex: String.UTF8View.Index { get }
        public var endIndex: String.UTF8View.Index { get }
        public subscript (position: String.UTF8View.Index) -> CodeUnit { get }
    }

There's an initializer for going the other way. Like with UTF16View, the initializer is failable, since a sequence of UTF-8 code units may not be valid:

    public init?(_ utf8: String.UTF8View)

Like before, there's no convenient way to turn an arbitrary sequence of UTF-8 code units into a String. The transcode function can be used here too:

    var utf8String = ""
    transcode(UTF8.self, UTF32.self, utf8Array.generate(), { utf8String.append(UnicodeScalar($0)) }, stopOnError: true)

Since these transcode calls are pretty painful, I wrapped them up in a pair of nice failable initializers:

    extension String {
        init?<Seq: SequenceType where Seq.Generator.Element == UInt16>(utf16: Seq) {
            self.init()

            guard transcode(UTF16.self,
                            UTF32.self,
                            utf16.generate(),
                            { self.append(UnicodeScalar($0)) },
                            stopOnError: true)
                            == false else { return nil }
        }

        init?<Seq: SequenceType where Seq.Generator.Element == UInt8>(utf8: Seq) {
            self.init()

            guard transcode(UTF8.self,
                            UTF32.self,
                            utf8.generate(),
                            { self.append(UnicodeScalar($0)) },
                            stopOnError: true)
                            == false else { return nil }
        }
    }

Now we can create Strings from arbitrary sequences of UTF-16 or UTF-8:

    String(utf16: utf16Array)
    String(utf8: utf8Array)

Indexes
The various views are all indexable collections, but they are very much not arrays. The index types are weird custom structs. This means you can't index views by number:

    // all errors
    string[2]
    string.characters[2]
    string.unicodeScalars[2]
    string.utf16[2]
    string.utf8[2]

Instead, you have to start with either the collection's startIndex or endIndex, then use methods like successor() or advancedBy() to move around:

    // these work
    string[string.startIndex.advancedBy(2)]
    string.characters[string.characters.startIndex.advancedBy(2)]
    string.unicodeScalars[string.unicodeScalars.startIndex.advancedBy(2)]
    string.utf16[string.utf16.startIndex.advancedBy(2)]
    string.utf8[string.utf8.startIndex.advancedBy(2)]

This is not fun. What's going on?

Recall that these are all views on the same underlying data, which is stored in some canonical form within the string object. When you use a view which doesn't match that canonical form, the data has to be converted when accessed.

Recall from above that these various encodings have different sizes and lengths. That means that there's no straightforward way to map a location in one view to a location in another view, because the mapping depends on the underlying data. Consider this string for example:

    AƎ工🄞

Imagine the canonical representation within String is UTF-32. That representation will be an array of 32-bit integers:

    0x00041 0x0018e 0x05de5 0x1f11e

Now imagine we get the UTF-8 view of this data. Conceptually, that data is a sequence of 8-bit integers:

    0x41 0xc6 0x8e 0xe5 0xb7 0xa5 0xf0 0x9f 0x84 0x9e

Here's how this sequence maps back to the original UTF-32:

    | 0x00041 |  0x0018e  |     0x05de5    |       0x1f11e       |
    |         |           |                |                     |
    |  0x41   | 0xc6 0x8e | 0xe5 0xb7 0xa5 | 0xf0 0x9f 0x84 0x9e |

If I ask the UTF-8 view for the value at index 6, it has to scan the UTF-32 array starting from the beginning to figure out where that value is and what it contains.

Obviously, this can be done. Swift provides the necessary functionality, it's just not pretty: string.utf8[string.utf8.startIndex.advancedBy(6)]. Why not make it easier, and allow indexing with an integer? It's essentially Swift's way of reinforcing the fact that this is an expensive operation. In a world where UTF8View provided subscript(Int), we'd expect these two pieces of code to be pretty much equivalent:

    for c in string.utf8 {
        ...
    }

    for i in 0..<string.utf8.count {
        let c = string.utf8[i]
        ...
    }

They would work the same, but the second one would be drastically slower. The first loop is a nice linear scan, whereas the second loop has to do a linear scan on each iteration, giving the whole loop a quadratic runtime. It's the difference between scanning a million-character string in a tenth of a second, and having that scan take three hours. (Approximate times taken from my 2013 MacBook Pro.)

Let's take another example, of simply getting the last character in the string:

    let lastCharacter = string.characters[string.characters.endIndex.predecessor()]

    let lastCharacter = string.characters[string.characters.count - 1]

The first version is fast. It starts at the end of the string, scans backwards briefly to figure out where the last Character starts, and then fetches it. The second version does a full scan of the entire string... twice! It has to scan the entire string to count how many Characters it contains, then scan it again to figure out where the specified numeric index is.

Everything you could do with an API like this can still be done with Swift's API, it's just different, and a little harder. These differences show the programmer that these views are not just arrays and they don't perform like arrays. When we see subscript indexing, we assume with pretty good reason that the indexing operation is fast. If String's views supported integer subscripting, it would be break that assumption, and make it easy to write really slow code.

Writing Code With String
What does all of this mean when writing code that uses String for practical purposes?

Use the highest-level API you can. If you need to see if a string starts with a certain letter, for example, don't index into the string to retrieve the first character and compare. Use the hasPrefix method, which takes care of the details for you. Don't be afraid to import Foundation and use NSString methods. For example, if you want to remove whitespace at the beginning and end of a String, don't manually iterate and look at the characters, use stringByTrimmingCharactersInSet.

If you do need to do character-level manipulation yourself, think carefully about exactly what a "character" means to you in this particular case. Often, the right answer is a grapheme cluster, which is what's represented by the Swift Character type and the characters view.

Whatever you're doing with the text, think about it in terms of a linear scan from the beginning or the end. Operations like asking for the count of characters or seeking into the middle will likely be linear time scans anyway, so it's better if you can arrange your code to do so explicitly. Grab the appropriate view, get its start or end index, then move that index around as you need it using advancedBy() and similar functions.

If you really need random access, or don't mind an efficiency hit and want the convenience of a more straightforward container, you can convert a view into an Array of whatever that view contains. For example, Array(string.characters) will produce an array of the grapheme clusters in that string. This is probably not a very efficient representation and will chew up some extra memory, but it's going to be much easier to work with. You can convert back to a String when done.

Conclusion
Swift's String type takes an unusual approach to strings. Many other languages pick one canonical representation for strings and leave you to fend for yourself if you want anything else. Often they simply give up on important questions like "what exactly is a character?" and pick something which is easy to work with in code but which sugar-coats the inherently difficult problems encountered in string processing. Swift doesn't sugar-coat it, but instead shows you the reality of what's going on. This can be difficult, but it's no more difficult than it needs to be.

The String API does have some holes in it, and could use some extra functionality to make life a little easier. In particular, converting from UTF-8 or UTF-16 to a String is unreasonably difficult and annoying. It would be nice to have facilities for initializing a UTF8View and UTF16View from arbitrary sequences of code units, as well as some more useful mutating functions on these views so they can be manipulated directly.

That's it for today. Come back next time for more shenanigans and terror. Friday Q&A is driven by reader ideas, so in the meantime, send me your requests for topics.

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

Gerard Guillemette at 2015-11-06 14:49:25:

It was pointed out in "The Swift Apprentice" that Swift does the right thing with something like

let stringA = "café"
let stringB = "cafe\u{0301}"
let equal = stringA == stringB

In stringA the bytes are 99-97-102-233 while for the second it is 99-97-102-101-769. "equal" ends up being true despite the difference in representation of é.

calicoding at 2015-11-06 15:15:30:

I would also add that swift's API did not bring over all the "path" methods of NSString, which seems to be a pain point for a lot of people. But here Apple is trying to reduce the scope of the problem and have people just use NSURL instead. This seems reasonable to me, and to be honest, more correct.

Ps. Great stuff as always

mikeash at 2015-11-06 15:33:06:

Gerard Guillemette: Yes, that is really nice. Even NSString doesn't do that by default.

stringA as NSString == stringB as NSString // false

You have to use one of the more specific comparison methods to make that say true.

calicoding: As I said, I think it's a mistake to have "path" be the same type as other kinds of text, so I think moving paths over to NSURL is a pretty good thing. Of course, we end up with a similar problem where, where NSURL can contain arbitrary URLs with arbitrary schemes, but 99% of the framework methods that take an NSURL only accept file: URLs. Note that the path methods are still available if you use as NSString to get an explicit NSString first. Then they return String so it's a bit annoying, but the functionality is at least still there.

Marc P at 2015-11-06 16:04:18:

Is there any way to approach the native NSString bridging performance in pure-Swift? If I compare creating a string using your UTF-16 extensions with that of constructing an NSString with the bytes, the NSString constructing is two orders of magnitude faster.

Here are the test cases I used:



// a very big UTF-16 array

let utf16Array = Array(String(count: 999999, repeatedValue: Character("X")).utf16)



class StringPerformanceTests: XCTestCase {





    func testMikeAshStringConversionPerformance() {

        var str = ""

        measureBlock {

            str = String(utf16: utf16Array)!

        }

        XCTAssertEqual(Array(str.utf16), utf16Array)

    }



    func testNSStringConversionPerformance() {

        var str = ""

        measureBlock {

            str = utf16Array.withUnsafeBufferPointer { ptr in

                NSString(characters: ptr.baseAddress, length: ptr.count) as String

            }

        }

        XCTAssertEqual(Array(str.utf16), utf16Array)

    }

}

Anon at 2015-11-06 16:35:04:

mikeash at 2015-11-06 16:43:12:

Marc P: Good question. I think the transcode function is just too general to be very fast. Calling a user-supplied function for every code point is tough to optimize. Which is all the more reason the standard library ought to provide a direct initializer for String that takes UTF-8 and UTF-16.

Do make sure you have optimizations enabled when testing, though. Swift is still slower, but it speeds up by about a factor of 10. Also, it depends pretty heavily on the encoding and the data. UTF-16 is NSString's native encoding, so creating an NSString from a UTF-16 array is basically just a memcpy. Try with UTF-8 and a Unicode flag as the repeated Character and while Swift still loses the race, it "only" loses by a factor of 3-4 instead of a factor of a bazillion.

ARaybold at 2015-11-06 17:28:10:

It may seem odd that I read this despite never having written a line of Swift code, and not likely to do so in the near future at least, but it is an interesting case of API design.

I think you make a good case, but this got my attention: "Why not make it easier, and allow indexing with an integer? It's essentially Swift's way of reinforcing the fact that this is an expensive operation."

This is a pretty indirect way of making the point. Perhaps the documentation for advancedBy() contains that warning? I went to the Apple Developer web site and searched around for a while, but not only did I not find a statement to this effect, I did not even find a concise reference document covering string functions, operators and methods. Maybe someone who has spent even a little more time with Swift than I have will have stumbled upon (and bookmarked) the sort of documentation that a programmer will need, but I am (for now) left with the impression that the underlying problem here is a failure to communicate.

Your article here also indirectly makes a good counter-example against the proposition that code can be adequately self-documenting, as it shows some real-world examples of where a function name cannot convey all the information you need to know to use it properly.

Jon at 2015-11-06 18:36:38:

ARaybold: Most of the documentation for the Swift core language exists in the library "header" files. You can check them out, rendered in a nice, searchable website, at swiftdoc.org.

mikeash at 2015-11-06 18:39:12:

ARaybold: I think that if any programmer sees this, they will assume that it's O(1):

someVar1[someVar2]

On the other hand, they will not assume any particular performance for this:

someVar1.advancedBy(someVar2)

Yes, this doesn't tell you what the performance actually is, but it at least doesn't lead you to make assumptions. (Note that in Swift, the [] indexing operation is still fast. The cost is in getting the right index value to pass to it.)

As far as the documentation goes, command-click a Swift symbol in Xcode and then read through all the comment documentation in the Swift module. The same information is also available on http://swiftdoc.org, and from Apple at https://developer.apple.com/library/ios/documentation/Swift/Reference/Swift_ForwardIndexType_Protocol/index.html. The documentation for advancedBy explicitly states that the complexity is O(1) on a RandomAccessIndexType, and O(n) otherwise.

Marc P. at 2015-11-06 18:57:01:

It seems like the NSString bridging is helped along by a hidden String implementation _SwiftNativeNSStringBase, which bodes poorly for the imminent Linux port of Swift, since the absence of an accompanying Foundation port will leave the String implementation non-performant when it needs to work with real-world encodings like UTF-8 and UTF-16.

Thanks for the tip about testing with optimization; it does help out the string performance quite a bit. But I don't think that the calling of the user-supplied block accounts for most of the overhead. The transcode() function be manually performed without any blocks like so:



    func testUTF16StringConversion() {

        var str = ""



        measureBlock {

            var string = ""

            var utf16 = UTF16()

            var gen = utf16Array.generate()

            var done = false

            while !done {

                switch utf16.decode(&gen) {

                case .Result(let val): string.append(val)

                case .EmptyInput: done = true

                case .Error: fatalError("bad string")

                }

            }



            str = string

        }



        XCTAssertEqual(Array(str.utf16), utf16Array)

    }

This method gives about a 15% boost to the extension where transcode() is being invoked, but it is still a far cry from the internal optimization that the NSString conversion gives you.

OTOH, NSString's constructor almost certainly is not doing any validation of the byte array, so it's not an entirely fair comparison. But it would be nice to have a fast-track mechanism to create a String from a sequence of encoded bytes available in Swift.

araybold at 2015-11-06 20:25:09:

@Jon: tanks for the links.

@mikeash: I am not sure if it entirely consistent to say that the implication is clear, while also providing a detailed explanation here - does that not imply that you think some people are not, in fact, getting it?

mikeash at 2015-11-06 20:38:18:

araybold: I'm not sure what you're referring to. Which implication is clear, and what "it" are people not getting?

coldtea at 2015-11-06 20:42:22:

> I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

Rebol does that.

Pierre Lebeaupin at 2015-11-06 21:35:45:

First, there is a small mistake: the third Unicode code point of your example string in "problems" is not U+00B4 (ACUTE ACCENT), but U+0301 (COMBINING ACUTE ACCENT), among other things we can see it encodes to 0xCC 0X81 in UTF-8 (not to mention that this accent… combines, you know). Also, the last possible Unicode code point (so as to be encodable with two surrogates) is at U+10FFFF, you may need more than 20 bits (five nibbles), so I never represent "A" as 0x00041 or U+00041, rather 0x000041 (or 0x0041 if dealing with UTF-16 or 0x41 if dealing with UTF-8), because, granted, 0x00000041 or U+00000041 is a bit too long…

I argue (at http://wanderingcoder.net/2015/07/24/string-character-processing/ among others) that ordinary programmers need never care about the individual constituent of a string. Doesn't matters if you consider a string to be a sequence of bytes, words, code points, graphemes clusters: simply don't. Need to move the insertion point? Send the advanceInsertionPoint/moveBackInsertionPoint message to the text editing engine, it is going to worry about what that means, not you. A tweet? Serialize the string to a byte array with the UTF-32BE encoding (I haven't checked what the Twitter API takes, to be honest; adapt as appropriate), divide the length of the byte array by 4, that will give you whether you are at less, more, or exactly at 140. Database column? Same, except the encoding it UTF-8, and again you check if the byte array fits. Need to check whether the file name has extension "avi"? Do a case-insensitive, anchored, reverse, locale-independent search for ".avi" in the file name string. etc.

As a result, I completely agree with the Swift design of opaque indexes from string. Besides saving you from quadratic algorithms, it forces you to think about what it is you are actually doing to the string.

I am surprised that Swift still does not have a way to create a string by decoding a byte array in a specified encoding, and to create a byte array in a specified encoding from a string as part of the type. In particular, these UTF-x views should only be though as ways to integrate with foreign APIs (and as first step to decode from/encode to a byte array, since there is no direct way to do so), not means in and of themselves. Python 3 has the right idea (though I think it does not go far enough): there is no character type, merely very short strings when one does character-like processing, and the encodings (UTF-8, UTF-16, UTF-32) are only that: encodings that you can specify when decoding from/encoding to a byte array (for a file, network packet, etc.)

Matt at 2015-11-06 21:59:23:

Thanks for this write up. I'm not a Swift programmer whatsoever and still found it useful and interesting.

mikeash at 2015-11-06 22:14:31:

Pierre Lebeaupin: Thanks for pointing out the mistake with the accent. I don't know what happened there. Maybe I had some temporary brain damage. (I can hear everybody now, "What do you mean, 'temporary'?")

And yes, you're right, avoid actually peeking at the components of the string whenever you can. Unfortunately you can't always avoid it, because the APIs aren't always there for you, but definitely look real hard first.

Note that with Twitter, you need to count code points but first you need to normalize the string, so that's a bit of an extra complication there. Unfortunately Swift doesn't provide anything for normalization at the moment, although you can use the NSString APIs for it.

araybould at 2015-11-06 23:18:41:

@mikeash: My mistake - when I wrote the last reply I was still thinking that "why not make it easier, and allow indexing with an integer? It's essentially Swift's way of reinforcing the fact that this is an expensive operation" was intended to imply that programmers would deduce, from the absence of the operation, that it would be expensive. I see now that your point is that in the presence of the operation, many programmers would assume that it is efficient. I was wrongly thinking the contrapositive was implied, but programmers are not necessarily assuming anything from the operator's absence. I imagine some of them will go on to use advancedBy() inefficiently, but at least they are warned.

ttilley at 2015-11-07 00:05:21:

you can use hidden functionality to convert UTF16 arrays. here is an example from my project (UChar is an alias pulled in from C, its a single UTF16 value):



internal func ucharCollectionToString<T:CollectionType where T.Generator.Element == UChar>(collection: T) -> String {

    let count = collection.underestimateCount()

    var sc = _StringCore.init()

    sc.reserveCapacity(count)

    for codeunit in collection {

        // terminate processing at NULL like C string behavior

        if codeunit != UChar(0) {

            sc.append(codeunit)

        } else { break }

    }

    return String(sc)

}

I'm pretty sure this is going to be much faster than transcoding before creating your string object, but it depends on an internal API.

ttilley at 2015-11-07 00:23:32:

...and normal people can ignore the null check. I need it for my particular use case for unrelated reasons.

Eelco at 2015-11-07 00:38:12:

Re: different string types, this is a great article (from almost a decade ago!) on how to achieve this in Haskell: http://blog.moertel.com/posts/2006-10-18-a-type-based-solution-to-the-strings-problem.html

Keith Thompson at 2015-11-07 00:48:09:

In C, a string is a pointer to a sequence of non-zero bytes, terminated by a zero byte.

Not quite. In C, a string is by definition "a contiguous sequence of characters terminated by and including the ﬁrst null character". Strings are manipulated using pointers, but the pointer itself is not the string; it's a pointer to a string.

Reference: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf 7.1.1p1

Josh Bleecher Snyder at 2015-11-07 01:56:12:

In Go, strings are UTF8 sequences, not byte sequences, even though indexing into a string returns the nth byte. See https://blog.golang.org/strings.

mikeash at 2015-11-07 02:03:32:

araybould: Gotcha, and yes, I just meant it in the sense that allowing [integer] would be misleading, but not that leaving it out somehow explains the limitations.

Josh Bleecher Snyder: "It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes." And their very first example shows a string literal that does not contain valid UTF-8. Am I missing something here?

Pierre Lebeaupin at 2015-11-07 10:43:39:

re: "you can't always avoid it", I am trying to define a set of minimal primitives (with any other operation expressed as convenience methods over these primitives) for the string type to support all non-specialist text operations such that peeking at the components is never necessary:
* defining literal ASCII strings (typically for dictionary keys and debugging)
* reading and writing strings from byte arrays with a specified encoding
* printing the value of variables to a string, possibly under the control of a format and locale
* attempting to interpret the contents of a string as an integer or floating-point number, possibly under the control of a format and locale
* concatenating strings
* hashing strings (with an implementation of hashing that takes into account the fact strings that only vary in character composition are considered equal and so must have equal hashes)
* searching within a string with appropriate options (regular expression or not, case sensitive or not, anchored or not, etc.) and getting the first match (which may compare equal while not being the exact same Unicode sequence as the searched string), the part before that match, and the part after that match, or nothing if the search turned empty.
* comparing strings for equality and sorting with appropriate options (similar to that of searching, plus specific options such as numeric sort, that is "1" < "2" < "100")
* and for very specific purposes, a few text transformations: mostly convert to lowercase, convert to uppercase, and capitalize words.

Though I probably need to add string folding with the same options as comparing for equality (in order to use the folded strings as keys to a dictionary, for instance).

I have yet to hear of an ordinary text operation that cannot be factored as a combination of these primitives.

But what about text editing operations, typesetting, text rendering, full-text indexing, figuring word boundaries, word breaks, etc? Those are to be implemented by specialists who write libraries providing these services and ordinary programmers are to use the APIs of these libraries. Yes, even "translating" to pig latin is such a specialist operation, because you have to think first about what it means for Chinese text, for instance.

Anthony Bailey at 2015-11-07 15:21:23:

This is one of those "nice clear expression of how I fuzzily thought things should be" blog posts that I really enjoy and appreciate. Thanks!

Does the approach fit nicely with another real-world concern - a time/space-efficient implementation of string interning? Or is that kind of optimization outside of Swift's intended domain?

mikeash at 2015-11-08 03:12:20:

I don't know if Swift cares much about interning strings. It does care a lot about using them as dictionary keys, though, which has essentially the same performance requirements.

I think this approach would fit that well, simply because the implementation details are completely hidden, so String is free to use whatever internal representation makes the most sense for this. Contrast with NSString, where an internal representation using (say) UTF-32 would conflict badly with the API. Whatever internal representation makes the most sense for hashing, String could use it.

Josh Ballanco at 2015-11-08 04:18:22:

I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.

Another, more recent, language that does this is Julia with its "Non-Standard String Literals": http://docs.julialang.org/en/release-0.4/manual/metaprogramming/#man-non-standard-string-literals2 . It takes Python's r"foo.*bar" syntax for regexes and extends it in a user-definable way. For example, v"1.1.0" creates a VersionNumber.

Chad at 2015-11-10 03:00:21:

As a unicode nerd, I'm really happy that someone with deep domain knowledge finally developed a string API that matches reality. I like that Swift takes inspiration from functional programming languages where the type system exists to enforce correct usage of the API, not just to say what the bytes will be in memory.

On this aside:

You should checkout Haskell's newtype operator and Scala's AnyVal that allow for this. Basically 0-cost at runtime types that don't let you mix during compile time.

Keegan at 2015-12-18 10:53:20:

What if all you only need to deal with ascii? Is there a swift library that has less verbose string functionality?

Now that swift is open sourced many people are going to want to use it for things don't require Unicode, and thus don't need the complexity of the built in string type.

Aaron at 2016-01-11 21:04:16:

I wondered about your thoughts on why integer subscripts are not provided for the various string views? For example, it would be handy to be able to write:

let char = "Hello!".characters[0]

It would be easy to implement:

subscript(index: Int) -> Character {

    let index = startIndex.advancedBy(index)

    return self[index]

}

I really like the String API, but I think what trips people up more than anything is not having simple integer subscript access to the elements, and having instead to go through the process of creating an Index and advancing it.

Given how easy it would be to implement, any ideas why it is not available? Is it a performance thing?

Aaron at 2016-01-11 21:06:19:

I'm sorry, I just re-read your post and saw that you covered this. Please ignore my last comment.

Lisper at 2016-05-29 19:19:21:

"(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)"

Common Lisp essentially does this: #P"foo" denotes a pathname string, though many stdlib functions still also accept ordinary strings.

Python also recently added Path objects, though there's no syntax for them, and I'm not sure anyone uses them yet.

Database interfaces don't tend to be part of programming languages, but most database *libraries* do this. Again, though, since most languages don't offer extensible syntax, they have to use function calls or classes, like SQLAlchemy's text() wrapper.

الوليد at 2016-09-30 15:55:46:

Do make sure you have optimizations enabled when testing, though. Swift is still slower, but it speeds up by about a factor of 10. Also, it depends pretty heavily on the encoding and the data. UTF-16 is NSString's native encoding, so creating an NSString from a UTF-16 array is basically just a memcpy. Try with UTF-8 and a Unicode flag as the repeated Character and while Swift still loses the race, it "only" loses by a factor of 3-4 instead of a factor of a bazillion.

Rennie at 2016-12-14 08:03:50:

"(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)"

Boy, do I disagree. I have painful memories of working with some C++ programs where there were about a half-dozen different representations for string. The basic 8-bit zero-terminated strings, a variation that had a 16-bit length in front, two different "wide character" types from two different development groups in Microsoft and two other text string "standards" created by independent organisations or library implementers. What a mess. Every time some text data had to be conveyed from one part of the program to another, or calling a function at a different level in the implementation, it almost always involved converting from one kind of string representation to another.

Please, never again!

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Code syntax highlighting thanks to Pygments.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
	Formatting: `<i> <b> <blockquote> <code>`.
	NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.