Markov chains make for a simple way to generate realistic-looking but nonsensical text. Today, I'm going to use that technique to build a text generator based on this blog's contents, an idea suggested by reader Jordan Pittman.
Markov Chains
At a theoretical level, a Markov chain is a state machine where each transition has a probability associated with it. You can walk through the chain by choosing a start state and then transitioning to subsequent states randomly, weighted by the transition probabilities, until you reach a terminal state.
Markov chains have numerous applications, but the most amusing is text generation. There, each state is some unit of text, typically a word. The states and transitions are generated from some input corpus, and then text is generated by walking through the chain and outputting the word for each state. For example, a corpus containing "the cat sat" and "the dog sat" yields a chain where the state for "the" transitions to "cat" and "dog" with equal probability. The result rarely makes sense, as the chain doesn't contain enough information to retain any of the underlying meaning of the input corpus, or even much of its grammatical structure, but that lack of sense can be hilarious.
Representation
The nodes in the chain will be represented as instances of a Word class. This class stores a String for the word it represents, and a set of links to other words.
How do we represent that set of links? The most obvious approach would be some sort of counted set, which would store other Word instances along with a count of the number of times that transition was seen in the input corpus. Randomly choosing a link from such a set can be tricky, though. A simple way is to generate a random number between 0 and the total count of the entire set, then iterate through the set, accumulating counts until you reach that number, and choose the link you landed on. This is easy but slow. Another approach is to precompute an array that stores the cumulative total for each link, then binary search it for a random number between 0 and the grand total. This is harder but faster. If you want to get really fancy, you can do even more preprocessing and end up with a compact structure you can query in constant time.
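To make the middle option concrete, here's a sketch of the cumulative-total-plus-binary-search idea, using a hypothetical WeightedPicker type that isn't part of the final implementation. (The constant-time structure alluded to above is known as the alias method.)

import Foundation

// Hypothetical helper illustrating the cumulative-total approach; the
// final implementation below doesn't use it.
struct WeightedPicker<Element> {
    private let elements: [Element]
    private let cumulativeCounts: [Int]
    private let total: Int

    // `counts` pairs each candidate with the number of times its
    // transition was seen. All counts are assumed to be positive.
    init(counts: [(element: Element, count: Int)]) {
        var running = 0
        var elements: [Element] = []
        var cumulative: [Int] = []
        for (element, count) in counts {
            running += count
            elements.append(element)
            cumulative.append(running)
        }
        self.elements = elements
        self.cumulativeCounts = cumulative
        self.total = running
    }

    // Binary search for the first cumulative total that exceeds a
    // random number in 0..<total.
    func randomElement() -> Element {
        let target = Int(arc4random_uniform(UInt32(total)))
        var low = 0
        var high = cumulativeCounts.count - 1
        while low < high {
            let mid = (low + high) / 2
            if cumulativeCounts[mid] <= target {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return elements[low]
    }
}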
Ultimately, I decided to be lazy and use a structure that's extremely wasteful in space but efficient in time and easy to implement. Each Word contains an array of subsequent Words. If a link occurs multiple times, the duplicates remain in the array. Choosing a random element with the appropriate weight then consists of nothing more than choosing a random index in the array.
Here's what the Word class looks like:
class Word {
    let str: String?
    var links: [Word] = []

    init(str: String?) {
        self.str = str
    }

    func randomNext() -> Word {
        let index = arc4random_uniform(UInt32(links.count))
        return links[Int(index)]
    }
}
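As an aside, on Swift 4.2 and later the same weighted choice can be written with the standard library's random API. A minimal sketch, relying on what the chain's construction guarantees, namely that reachable words always have at least one link:

func randomNext() -> Word {
    // randomElement() returns nil only for an empty array, which the
    // chain never produces for a word reachable during generation.
    return links.randomElement()!
}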
Note that the links arrays will likely produce lots of circular references. To avoid leaking memory, we'll need something to manually clean those up.
That something will be the Chain class, which manages all of the Words in a chain:
class Chain {
    var words: [String?: Word] = [:]
In deinit, it clears all of the links arrays to eliminate any cycles:
    deinit {
        for word in words.values {
            word.links = []
        }
    }
Without this step, a lot of the Word instances would leak.
Let's look at how words are added to the chain. The add method takes an array of Strings, each of which holds a word (or any other unit the caller wants to work with):
    func add(_ words: [String]) {
If there aren't actually any words, bail out early:
        if words.isEmpty { return }
We want to iterate over pairs of words, where the second element in the pair is the word that follows the first. For example, in the sentence "Help, I'm being oppressed," we want to iterate over ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed").
Actually, we want a bit more than that, since we also want to encode the beginning and end of the sentence. We represent those as nil, so the actual sequence we want to iterate over is (nil, "Help"), ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed"), ("oppressed", nil).
To allow for nil, we need an array whose contents are String? rather than plain String:
        let words = words as [String?]
Next, we'll construct two arrays, one by prepending nil and one by appending nil. Zipping them together produces the sequence we want:
        let wordPairs = zip([nil] + words, words + [nil])
        for (first, second) in wordPairs {
For each word in the pair, we fetch the corresponding Word object using a handy helper function:
            let firstWord = word(first)
            let secondWord = word(second)
Then all we have to do is add secondWord to the links of firstWord:
            firstWord.links.append(secondWord)
        }
    }
The word helper fetches the instance from the words dictionary if it exists; otherwise, it creates a new instance and puts it into the dictionary. This frees other code from worrying about whether there's already a Word for any given string:
    func word(_ str: String?) -> Word {
        if let word = words[str] {
            return word
        } else {
            let word = Word(str: str)
            words[str] = word
            return word
        }
    }
Finally, we want to generate new sequences of words:
    func generate() -> [String] {
We'll generate the words one by one, accumulating them here:
        var result: [String] = []
Loop "forever." The exit condition doesn't map cleanly to a loop condition, so we'll handle that inside the loop:
        while true {
Fetch the Word instance for the last string in result. This neatly handles the initial case where result is empty, since last produces nil, which indicates the first word:
            let currentWord = word(result.last)
Randomly get a linked word:
            let nextWord = currentWord.randomNext()
If the linked word isn't the end, append it to result. If it is the end, terminate the loop:
            if let str = nextWord.str {
                result.append(str)
            } else {
                break
            }
        }
Return the accumulated result:
        return result
    }
}
One last thing: we're using String? as the key type for words, but Optional doesn't conform to Hashable. Here's a quick extension that adds the conformance when the wrapped type conforms. (Swift 4.2 and later provide this conformance in the standard library, so on modern toolchains this extension is unnecessary and should be deleted.)
extension Optional: Hashable where Wrapped: Hashable {
    public var hashValue: Int {
        switch self {
        case let wrapped?: return wrapped.hashValue
        case .none: return 42
        }
    }
}
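With that conformance in place, the whole chain compiles and can be exercised end to end. Here's a minimal smoke test, hypothetical and not part of the original program:

let testChain = Chain()
testChain.add(["the", "cat", "sat"])
testChain.add(["the", "dog", "sat"])
// Prints either "the cat sat" or "the dog sat", chosen at random.
print(testChain.generate().joined(separator: " "))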
Generating Input
That's the Markov chain itself, but it's pretty boring without some real text to put into it.
I decided to pull text from an RSS feed. What better feed to choose than my own blog's full text feed?
let feedURL = URL(string: "https://www.mikeash.com/pyblog/rss.py?mode=fulltext")!
RSS is an XML format, so we'll use XMLDocument to parse it:
let xmlDocument = try! XMLDocument(contentsOf: feedURL, options: [])
The article bodies are in XML description nodes nested inside item nodes. An XPath query retrieves them:
let descriptionNodes = try! xmlDocument.nodes(forXPath: "//item/description")
We want the strings inside the XML nodes, so extract those and throw away any that are nil:
let descriptionHTMLs = descriptionNodes.compactMap({ $0.stringValue })
We don't care about the markup at all. NSAttributedString can parse HTML and produce a string with attributes, which we can then throw away:
let descriptionStrings = descriptionHTMLs.map({
    NSAttributedString(html: $0.data(using: .utf8)!, options: [:], documentAttributes: nil)!.string
})
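One caveat: the html: initializer belongs to AppKit, so the line above is macOS-only. A cross-platform sketch of this step, my substitution rather than the article's code, uses the throwing data-based initializer instead:

let descriptionStrings = descriptionHTMLs.map({ html -> String in
    // The data-based initializer throws, and HTML parsing must happen
    // on the main thread.
    let data = html.data(using: .utf8)!
    let attributed = try! NSAttributedString(
        data: data,
        options: [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue,
        ],
        documentAttributes: nil)
    return attributed.string
})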
Let's take a quick detour to a function that breaks a string into parts. We ultimately want to consume arrays of Strings, where each array represents a sentence. A string will contain zero or more sentences, so this wordSequences function returns an array of arrays of String:
func wordSequences(in str: String) -> [[String]] {
Results get accumulated into a local variable:
    var result: [[String]] = []
Breaking a String into sentences isn't always easy. You could search for the appropriate punctuation, but consider a sentence like "Mr. Jock, TV quiz Ph.D., bags few lynx." That's one sentence, despite containing four periods.
NSString provides some methods for intelligently examining parts of a string, and String gets those when you import Foundation. We'll ask str to enumerate its sentences and let Foundation figure out how:
    str.enumerateSubstrings(in: str.startIndex..., options: .bySentences, { substring, substringRange, enclosingRange, stop in
We face a similar problem splitting each sentence into words. NSString does provide a method for enumerating over words, but it presents some problems, like losing punctuation. I ultimately decided to take a dumb approach to word splitting and just split on spaces. This means you end up with words that contain punctuation as part of their string. That constrains the Markov chain more than if the punctuation were removed, but on the other hand it means the output naturally contains something like reasonable punctuation. It seemed like a good tradeoff.
Some newlines make their way into the data set, so we'll cut those off at this point:
        let words = substring!.split(separator: " ").map({
            $0.trimmingCharacters(in: CharacterSet.newlines)
        })
The sliced-up sentence gets added to result:
        result.append(words)
    })
After enumeration completes, result is filled out with the sentences from the input, and we return it to the caller:
    return result
}
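To make the shape of the result concrete, here's a hypothetical example; the exact tokens depend on how Foundation detects sentence boundaries:

let sentences = wordSequences(in: "Help, I'm being oppressed. Come see the violence!")
// Roughly: [["Help,", "I'm", "being", "oppressed."],
//           ["Come", "see", "the", "violence!"]]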
Back to the main code. Now that we have a way to convert a string into a list of sentences, we can build our Markov chain. We start with an empty Chain object:
let chain = Chain()
Then we go through all the strings, extract the sentences, and add them to the chain:
for str in descriptionStrings {
    for sentence in wordSequences(in: str) {
        chain.add(sentence)
    }
}
All that's left is to generate some new sentences! We'll call generate() and join the result with spaces. The output is hit-or-miss, which is no surprise given the random nature of the technique, so we'll generate a lot:
for _ in 0 ..< 200 {
    print("\"" + chain.generate().joined(separator: " ") + "\"")
}
Example Output
For your entertainment, here are some examples of the output of this program:
- "We're ready to be small, weak references in New York City."
- "It thus makes no values?"
- "Simple JSON tasks, it's wasteful if you can be."
- "Another problem, but it would make things more programming-related mystery goo."
- "The escalating delays after excessive focus on Friday, September 29th."
- "You may not set."
- "Declare conformance to use = Self.init() to detect the requested values."
- "The tagged pointer is inlined at this nature; even hundreds of software and writing out at 64 bits wide."
- "We're ready to express that it works by reader ideas, so the decoding methods for great while, it's inaccessible to 0xa4, which takes care of increasing addresses as the timing."
- "APIs which is mostly a one-sided use it yourself?"
- "There's no surprise."
- "I wasn't sure why I've been called 'zero-cost' in control both take serious effort to miss instead of ARC and games."
- "For now, we can look at the filesystem."
- "The intent is intended as reader-writer locks."
- "For example, we can use of the code?"
- "Swift's generics can all fields of Swift programming, with them is no parameters are static subscript, these instantiate self = cluster.reduce(0, +) / Double(cluster.count)"
- "However, the common case, you to the left-hand side tables."
There's a lot of nonsense as well, so you have to dig through to find good ones, but Markov chains can produce some pretty funny output.
Conclusion
Markov chains have many practical uses, but they can also be hilariously useless when used to generate text. Aside from being entertaining, this code also demonstrates how to deal with circular references in a situation where there's no clear directionality, how to use NSString's intelligent enumeration methods to extract features from text, and the power of conditional conformances.
That wraps it up for today. Stop by next time for more fun, games, and maybe even a little education. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!
Comments:
https://developer.apple.com/documentation/naturallanguage/nltokenizer?changes=latest_minor
For example, if you had a large body of writing from User_123 and used it to populate the probabilities of a Markov chain, you would expect that another large body of writing from the same user would generate similar probabilities. So given a body of writing from an unknown user, you could compare the probabilities and determine whether or not User_123 wrote both. Although I imagine this approach would have quite large error bars - and would be more error prone as you dealt with smaller and smaller bodies of writing.
I'm also guessing the accuracy of the above approach would be affected by the topic covered by the body of writing - for instance I'm guessing the Markov chain generated from your blog would be very different from one generated from your text messages to family members - probably fewer references to ARC and APIs.
In any case - very interesting write up!