Posted at 2018-04-28 01:27
Tags: fridayqna
Friday Q&A 2018-04-27: Generating Text With Markov Chains in Swift
by Mike Ash

Markov chains make for a simple way to generate realistic looking but nonsensical text. Today, I'm going to use that technique to build a text generator based on this blog's contents, an idea suggested/inspired by reader Jordan Pittman.

Markov Chains
At a theoretical level, a Markov chain is a state machine where each transition has a probability associated with it. You can walk through the chain by choosing a start state and then transitioning to subsequent states randomly, weighted by the transition probabilities, until you reach a terminal state.

Markov chains have numerous applications, but the most amusing is text generation. There, each state is some unit of text, typically a word. The states and transitions are generated from some input corpus, and then text is generated by walking through the chain and outputting the word for each state. The result rarely makes sense, as the chain doesn't contain enough information to retain any of the underlying meaning of the input corpus, or even much of the grammatical structure, but that lack of sense can be hilarious.
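
To make the walk concrete before diving into the real implementation below, here's a toy chain over an invented six-word corpus. Everything here (the `transitions` table, the `walk` function) is made up for illustration, and it uses Swift 4.2's `randomElement()` rather than the `arc4random_uniform` the article's code targets:

```swift
// A toy Markov chain. Each state maps to an array of possible successors;
// duplicates encode transition weights (the same trick used below), and
// nil marks "end of sentence".
let transitions: [String: [String?]] = [
    "START": ["the", "the", "a"],  // "the" is twice as likely as "a"
    "the":   ["cat", "dog"],
    "a":     ["cat"],
    "cat":   ["sat", nil],         // a sentence may end after "cat"
    "dog":   [nil],
    "sat":   [nil],
]

// Walk from START, picking a weighted-random successor each step,
// until we hit the nil end marker.
func walk() -> [String] {
    var result: [String] = []
    var state = "START"
    while true {
        let options = transitions[state]!
        if let word = options.randomElement()! {
            result.append(word)
            state = word
        } else {
            return result
        }
    }
}
```

Running `walk()` a few times produces short sentences like "the cat sat" or "a cat", with "the" showing up about twice as often as "a".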

Representation
The nodes in the chain will be represented as instances of a `Word` class. This class will store a `String` for the word it represents, and a set of links to other words.

How do we represent that set of links? The most obvious approach would be some sort of counted set, which would store other `Word` instances along with the number of times each transition was seen in the input corpus. Randomly choosing a link from such a set can be tricky, though. A simple way is to generate a random number between 0 and the total count of the entire set, then iterate through the set, accumulating counts until the running total passes that number, and choose the link you landed on. This is easy but slow. Another approach is to precompute an array that stores the cumulative total for each link, then do a binary search on a random number between 0 and the total. This is harder but faster. If you want to get really fancy, you can do even more preprocessing (e.g. Walker's alias method) and end up with a compact structure you can query in constant time.
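
To illustrate the "harder but faster" option, here's a standalone sketch of the cumulative-array-plus-binary-search idea. None of this appears in the article's code; the function names are mine, and it uses Swift 4.2's `Int.random` rather than `arc4random_uniform`:

```swift
// Build the running totals for an array of per-link counts.
// e.g. [2, 3, 4] -> [2, 5, 9]
func cumulativeTotals(_ counts: [Int]) -> [Int] {
    var running = 0
    var result: [Int] = []
    for count in counts {
        running += count
        result.append(running)
    }
    return result
}

// Binary search: find the first index whose cumulative total exceeds
// `value`, where 0 <= value < cumulative.last!.
func weightedIndex(cumulative: [Int], value: Int) -> Int {
    var low = 0
    var high = cumulative.count - 1
    while low < high {
        let mid = (low + high) / 2
        if cumulative[mid] <= value {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return low
}

// Pick a random link index, weighted by counts.
func randomWeightedIndex(counts: [Int]) -> Int {
    let cumulative = cumulativeTotals(counts)
    let value = Int.random(in: 0 ..< cumulative.last!)
    return weightedIndex(cumulative: cumulative, value: value)
}
```

With counts `[2, 3, 4]`, values 0–1 map to index 0, 2–4 to index 1, and 5–8 to index 2, so each link is chosen in proportion to its count, in O(log n) per draw after O(n) preprocessing.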

Ultimately, I decided to be lazy and use a structure that's extremely wasteful in space, but efficient in time and easy to implement. Each `Word` contains an array of subsequent `Word`s. If a link occurs multiple times, the duplicates remain in the array. Choosing a random element with the appropriate weight consists of choosing a random index in the array.

Here's what the `Word` class looks like:

```
class Word {
    let str: String?
    var links: [Word] = []

    init(str: String?) {
        self.str = str
    }

    func randomNext() -> Word {
        let index = arc4random_uniform(UInt32(links.count))
        return links[Int(index)]
    }
}
```

Note that the `links` arrays will likely produce lots of circular references. To avoid leaking memory, we'll need something to break those cycles manually.

That something will be the `Chain` class, which will manage all of the `Word`s in a chain:

```
class Chain {
    var words: [String?: Word] = [:]
```

In `deinit`, it clears all of the `links` arrays to eliminate any cycles:

```
    deinit {
        for word in words.values {
            word.links = []
        }
    }
```

Without this step, a lot of the `Word` instances would leak.

Let's look at how words are added to the chain. The `add` method will take an array of `String`s, each one of which holds a word (or any other unit that the caller wants to work with):

```
    func add(_ words: [String]) {
```

If there aren't actually any words, bail out early:

```
        if words.isEmpty { return }
```

We want to iterate over pairs of words, where the second element in the pair is the word that follows the first element. For example, in the sentence "Help, I'm being oppressed," we want to iterate over `("Help", "I'm")`, `("I'm", "being")`, `("being", "oppressed")`.

Actually, we want a bit more as well, since we want to encode the beginning and end of the sentence. We represent those as `nil`, so the actual sequence we want to iterate over is `(nil, "Help")`, `("Help", "I'm")`, `("I'm", "being")`, `("being", "oppressed")`, `("oppressed", nil)`.

To allow for `nil`, we need an array whose contents are `String?` rather than plain `String`:

```
        let words = words as [String?]
```

Next, we'll construct two arrays, one by prepending `nil` and one by appending `nil`. Zipping them together produces the sequence we want:

```
        let wordPairs = zip([nil] + words, words + [nil])
        for (first, second) in wordPairs {
```

For each word in the pair, we'll fetch the corresponding `Word` object using a handy helper function:

```
            let firstWord = word(first)
            let secondWord = word(second)
```

Then all we have to do is add `secondWord` into the links of `firstWord`:

```
            firstWord.links.append(secondWord)
        }
    }
```

The `word` helper fetches the instance from the `words` dictionary if it exists, otherwise it creates a new instance and puts it into the dictionary. This frees other code from worrying about whether there's already a `Word` for any given string:

```
    func word(_ str: String?) -> Word {
        if let word = words[str] {
            return word
        } else {
            let word = Word(str: str)
            words[str] = word
            return word
        }
    }
```

Finally, we want to generate new sequences of words:

```
    func generate() -> [String] {
```

We'll generate the words one by one, accumulating them here:

```
        var result: [String] = []
```

Loop "forever." The exit condition doesn't map cleanly to a loop condition, so we'll handle that inside the loop:

```
        while true {
```

Fetch the `Word` instance for the last string in `result`. This neatly handles the initial case where `result` is empty, since `last` produces `nil` which indicates the first word:

```
            let currentWord = word(result.last)
```

Randomly get a linked word:

```
            let nextWord = currentWord.randomNext()
```

If the linked word isn't the end, append it to `result`. If it is the end, terminate the loop:

```
            if let str = nextWord.str {
                result.append(str)
            } else {
                break
            }
        }
```

Return the accumulated result:

```
        return result
    }
}
```

One last thing: we're using `String?` as the key type for `words`, but `Optional` doesn't conform to `Hashable`. Here's a quick extension that adds it when its wrapped type conforms:

```
extension Optional: Hashable where Wrapped: Hashable {
    public var hashValue: Int {
        switch self {
        case let wrapped?: return wrapped.hashValue
        case .none: return 42
        }
    }
}
```

Generating Input
That's the Markov chain itself, but it's pretty boring without some real text to put into it.

I decided to pull text from an RSS feed. What better feed to choose than my own blog's full text feed?

```
let feedURL = URL(string: "https://www.mikeash.com/pyblog/rss.py?mode=fulltext")!
```

RSS is an XML format, so use `XMLDocument` to parse it:

```
let xmlDocument = try! XMLDocument(contentsOf: feedURL, options: [])
```

The article bodies are in XML `description` nodes which are nested inside `item` nodes. An XPath query retrieves those:

```
let descriptionNodes = try! xmlDocument.nodes(forXPath: "//item/description")
```

We want the strings in the XML nodes, so extract those and throw away any that are `nil`:

```
let descriptionHTMLs = descriptionNodes.compactMap({ $0.stringValue })
```

We don't care about the markup at all. `NSAttributedString` can parse HTML and produce a string with attributes, which we can then throw away:

```
let descriptionStrings = descriptionHTMLs.map({
    NSAttributedString(html: $0.data(using: .utf8)!, options: [:], documentAttributes: nil)!.string
})
```

Let's take a quick detour to a function that breaks up a string into parts. We ultimately want to consume arrays of `String`, where each array represents a sentence. A string will contain zero or more sentences, so this `wordSequences` function returns an array of arrays of `String`:

```
func wordSequences(in str: String) -> [[String]] {
```

Results get accumulated into a local variable:

```
    var result: [[String]] = []
```

Breaking a `String` into sentences isn't always easy. You could search for the appropriate punctuation, but consider a sentence like "Mr. Jock, TV quiz Ph.D., bags few lynx." That's one sentence, despite having four periods.

`NSString` provides some methods for intelligently examining parts of a string, and `String` gets those when you `import Foundation`. We'll ask `str` to enumerate its sentences, and let `Foundation` figure out how:

```
    str.enumerateSubstrings(in: str.startIndex..., options: .bySentences, { substring, substringRange, enclosingRange, stop in
```

We face a similar problem splitting each sentence into words. `NSString` does provide a method for enumerating over words, but it presents some problems, like losing punctuation. I ultimately decided to take a dumb approach for word splitting and just split on spaces. This means that words end up containing punctuation as part of their string. This constrains the Markov chain more than if the punctuation were removed, but on the other hand it means that the output naturally contains something like reasonable punctuation. It seemed like a good tradeoff.

Some newlines make their way into the data set, so we'll cut those off at this point:

```
        let words = substring!.split(separator: " ").map({
            $0.trimmingCharacters(in: CharacterSet.newlines)
        })
```

The sliced-up sentence gets added to `result`:

```
        result.append(words)
    })
```

After enumeration is complete, `result` is filled out with the sentences from the input, and we return it to the caller:

```
    return result
}
```

Back to the main code. Now that we have a way to convert a string into a list of sentences, we can build our Markov chain. We'll start with an empty `Chain` object:

```
let chain = Chain()
```

Then we go through all the strings, extract the sentences, and add them to the chain:

```
for str in descriptionStrings {
    for sentence in wordSequences(in: str) {
        chain.add(sentence)
    }
}
```

All that's left is to generate some new sentences! We'll call `generate()` and then join the result with spaces. The output is hit-or-miss (which is no surprise given the random nature of the technique) so we'll generate a lot:

```
for _ in 0 ..< 200 {
    print("\"" + chain.generate().joined(separator: " ") + "\"")
}
```

Example Output
For your entertainment, here are some examples of the output of this program:

• "We're ready to be small, weak references in New York City."
• "It thus makes no values?"
• "Simple JSON tasks, it's wasteful if you can be."
• "Another problem, but it would make things more programming-related mystery goo."
• "The escalating delays after excessive focus on Friday, September 29th."
• "You may not set."
• "Declare conformance to use = Self.init() to detect the requested values."
• "The tagged pointer is inlined at this nature; even hundreds of software and writing out at 64 bits wide."
• "We're ready to express that it works by reader ideas, so the decoding methods for great while, it's inaccessible to 0xa4, which takes care of increasing addresses as the timing."
• "APIs which is mostly a one-sided use it yourself?"
• "There's no surprise."
• "I wasn't sure why I've been called 'zero-cost' in control both take serious effort to miss instead of ARC and games."
• "For now, we can look at the filesystem."
• "The intent is intended as reader-writer locks."
• "For example, we can use of the code?"
• "Swift's generics can all fields of Swift programming, with them is no parameters are static subscript, these instantiate self = cluster.reduce(0, +) / Double(cluster.count)"
• "However, the common case, you to the left-hand side tables."

There's a lot of nonsense as well, so you have to dig through to find good ones, but Markov chains can produce some pretty funny output.

Conclusion
Markov chains have many practical uses, but they can also be hilariously useless when used to generate text. Aside from being entertaining, this code also demonstrates how to deal with circular references in a situation where there's no clear directionality, how to use `NSString`'s intelligent enumeration methods to extract features from text, and, briefly, the power of conditional conformances.

That wraps it up for today. Stop by next time for more fun, games, and maybe even a little education. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

Could a method like this also be used for forensics - determining if a given body of text was written by a suspect?

For example, if you had a large body of writing from User_123 and used it to populate the probabilities of a Markov chain, you would expect that another large body of writing from the same user would generate similar probabilities. So given a body of writing from an unknown user, you could compare the probabilities and determine whether or not User_123 wrote both. Although I imagine this approach would have quite large error bars - and would be more error prone as you dealt with smaller and smaller bodies of writing.

I'm also guessing the accuracy of the above approach would be affected by the topic covered by the body of writing - for instance I'm guessing the Markov chain generated from your blog would be very different from one generated from your text messages to family members - probably fewer references to ARC and APIs.

In any case - very interesting write up!
NLTokenizer in 12 beta/10.14 beta looks to be really useful for something like this.

https://developer.apple.com/documentation/naturallanguage/nltokenizer?changes=latest_minor
A lot of confusion, but useful.
