Compsci 100, Fall 2011, Markov

You should snarf assignment markov from http://www.cs.duke.edu/courses/fall11/cps100/snarf or browse the code directory for code files provided. See the markov howto file for help and more instructions.

Genesis of Assignment

*typing monkey*
(image from Wikipedia, public domain)
This assignment has its roots in several places. I first read a story as a young boy that is now found on pages 91-98 of Fantasia Mathematica (Google Books), edited by Clifton Fadiman: the 1940 New Yorker story "Inflexible Logic" by Russell Maloney. You can see a modern summary of this idea in the Wikipedia entry for the Infinite Monkey Theorem.

The true mathematical roots are in a 1948 paper by Claude Shannon, A Mathematical Theory of Communication, which discusses in detail the mathematics and intuition behind this assignment. The idea was popularized by A. K. Dewdney in Scientific American and later reprinted in the books that collected his articles.

In 2005 two students at MIT had a randomly generated paper accepted at a conference. Their paper-generating tool, SCIgen, has a Wikipedia entry with links to the program. In 1984 an Internet personality named Mark V. Shaney participated in the chat rooms of the time using randomly generated text, as described in this Wikipedia entry.

In December 2008 Herbert Schlangemann had an auto-generated paper accepted and published in an IEEE "conference" (quotes used judiciously).

Joe Zachary developed this into a nifty assignment in 2003, and the folks at Princeton have been using it as the basis for an assignment since 2005, possibly earlier.


High Level View of Assignment

You'll do three things for this assignment.

  1. Improve the performance of code that generates random text based on predicting characters. For this you'll use a map to store information rather than (re)computing the information.

  2. Write a new random-text generation program based on words rather than characters.

  3. Modify, run, and report on a performance test comparing HashMap and TreeMap implementations of the Map interface.

You'll be provided with code that uses a brute-force approach to generate random text using an order-k Markov Model based on characters. You'll first improve the code to make it more efficient, then you'll write a new model based on words rather than on characters.

The term brute-force refers to the characteristic that the entire text forming the basis of the Markov model is rescanned to generate each letter of the random text.

For the first part of the assignment you'll develop and use a Map (with other appropriate structures as keys and values) so that the text is scanned only once. When you scan once, your code will store information so that generating random text requires looking up information rather than rescanning the text.

Background on Markov Models

An order-k Markov model uses strings of k letters, called k-grams, to predict text. An order-2 Markov model uses two-character strings, or bigrams, to calculate probabilities in generating random letters. For example, suppose that in the text we're using to generate random letters with an order-2 Markov model, the bigram "th" is followed 50 times by the letter 'e', 20 times by the letter 'a', and 30 times by the letter 'o' --- that is, the sequences "the", "tha" and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the text we're modeling.

Now suppose that in generating random text we generate the bigram "th" and based on this we must generate the next random character using the order-2 model. The next letter will be an 'e' with a probability of 0.5 (50/100); will be an 'a' with probability 0.2 (20/100); and will be an 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he" since the last part of the old bigram is combined with the new letter to create the next bigram used in the Markov process.

In general here's pseudo-code to generate random letters (and thus random text) using an order-k Markov model and a training text from which probabilities are calculated.

  seed = random k-character substring from the training text --- the initial seed
  repeat N times to generate N random letters
     for each occurrence of seed in training text 
        record the letter that follows the occurrence of seed in a list
     choose a random element of the list as the generated letter C
     print or store C
     seed = (last k-1 characters of seed) + C 
Using this algorithm, whose Java equivalent is shown below, here is a series of 200 randomly generated letters using order-k Markov models. The training texts for these models are speeches made by Barack Obama (his acceptance speech at the Democratic National Convention), John McCain (his speech to the American Legion), and Sarah Palin (her acceptance speech as McCain's vice-presidential running mate), all from late August 2008. See training texts for the training data files.

Order-k Obama McCain Palin
2 n rentrion't jobackeentry I knompin he and the in ard ould th cou'll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I've then no muc nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This
3 y new dreams, have same time that prevel, I save of that eign purplummethe me from tough," to Democrations onces. I've have for our smally keeps talking American it. The new lege have by are future retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t
4 nse on is at we his countries to progress for you'll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America's promise of go to resourc e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn't just his opport forms, and the PTA and to just at all the ol
5 y watched my life is just takes us from harm's way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record's es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential ough, he'll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average "hockey mom" in American president of the United States. Well,
8 gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans' service to one another, of leaving no one behind, , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that's not why the

Here's the Java code that implements the brute-force solution to generate numLetters at random from the training-text in instance variable myString using an order-k Markov model.

public void brute(int k, int numLetters) {
    // pick random k-character substring as initial seed
    int start = myRandom.nextInt(myString.length() - k + 1);
    String seed = myString.substring(start, start + k);
    // copy first k characters to back to simulate wrap-around
    String wrapAroundString = myString + myString.substring(0, k);
    StringBuilder build = new StringBuilder();
    ArrayList<Character> list = new ArrayList<Character>();
    for (int i = 0; i < numLetters; i++) {
        list.clear();
        int pos = 0;
        while ((pos = wrapAroundString.indexOf(seed, pos)) != -1 && pos < myString.length()) {
            char ch = wrapAroundString.charAt(pos + k);
            list.add(ch);
            pos++;
        }
        int pick = myRandom.nextInt(list.size());
        char ch = list.get(pick);
        build.append(ch);
        seed = seed.substring(1) + ch;
    }
}

The code above works fine, but to generate N letters from a training text of size T the code does roughly N*T work, since it rescans the entire text to generate each character.

A Smarter Approach for Characters

Instead of scanning the training text N times to generate N random characters, you'll scan the text once to create a structure representing every k-gram that occurs in the training text for an order-k Markov model. This is the first part of the assignment: creating the smarter/faster/better method and then reasoning empirically about the new method.

For example, suppose the training text is "bbbabbabbbbaba" and we're using an order-3 Markov model.

The 3-letter string (3-gram) "bbb" occurs three times, twice followed by 'a' and once by 'b'. We represent this by saying that the 3-gram "bbb" is followed twice by "bba" and once by "bbb", since the next 3-gram in generating Markov-modeled random text is formed by taking the last two characters of the seed 3-gram "bbb", which are "bb", and appending the letter that follows the original 3-gram seed.

The 3-letter string "bba" occurs three times, each time followed by 'b'. The 3-letter string "bab" occurs three times, followed twice by 'b' and once by 'a'. However, we treat the original string/training-text as circular, i.e., the end of the string is followed by the beginning of the string. This means "bab" also occurs at the end of the string (last two characters followed by the first character) again followed by 'b'.

In processing the training text from left-to-right we see the following transitions between 3-grams starting with the left-most 3-gram "bbb"

  bbb -> bba -> bab -> abb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> aba -> bab -> abb -> bbb
This can be represented as a map from each 3-gram to the list of 3-grams that follow it:

3-gram   following 3-grams
bbb      bba, bbb, bba
bba      bab, bab, bab
bab      abb, abb, aba, abb
abb      bba, bbb, bbb
aba      bab

This information can also be represented in a state diagram as shown below (from the Princeton website).

In your code you'll replace the brute-force rescanning algorithm for generating random text based on characters with code that builds a data structure you'll then use to follow the state transitions diagrammed above. Specifically, you'll create a map to make the generation of random text more efficient. Keys in the map are the k-grams of an order-k Markov model; each different k-gram in the training text is a key in the map. The value associated with a k-gram key is a list of every k-gram that follows that key in the training text, in order of occurrence. See the table of 3-grams above as an example using the training text "bbbabbabbbbaba". Note that the 3-gram key "bbb" would map to the list ["bba", "bbb", "bba"], the 3-gram key "bba" would map to the list ["bab", "bab", "bab"], and the 3-gram key "abb" would map to the list ["bba", "bbb", "bbb"] (think about why this is).
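Here is a minimal sketch of how such a map might be built. The names buildMap and myMap are illustrative, not part of the provided code; myString is the training text, as in the brute-force code above, and the sketch uses java.util.ArrayList, java.util.HashMap, and java.util.Map.

  private Map<String, ArrayList<String>> myMap;   // illustrative field holding the k-gram map

  private void buildMap(int k) {
      myMap = new HashMap<String, ArrayList<String>>();
      // copy the first k characters to the back to simulate wrap-around,
      // exactly as the brute-force code does
      String wrapAroundString = myString + myString.substring(0, k);
      for (int pos = 0; pos < myString.length(); pos++) {
          String gram = wrapAroundString.substring(pos, pos + k);          // k-gram starting at pos
          String next = wrapAroundString.substring(pos + 1, pos + k + 1);  // k-gram that follows it
          ArrayList<String> followers = myMap.get(gram);
          if (followers == null) {
              followers = new ArrayList<String>();
              myMap.put(gram, followers);
          }
          followers.add(next);   // lists stay in order of occurrence in the text
      }
  }

Running this sketch with k = 3 on "bbbabbabbbbaba" produces exactly the table above; for example, the key "bbb" maps to ["bba", "bbb", "bba"].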

To generate random text your code should generate an initial seed k-gram at random from the training text, exactly as in the brute-force approach. Then use the pseudo-code outlined below.

  seed = random k-character substring (k-gram) from the training text (key) --- the initial seed
  repeat N times to generate N random letters
     find the list (value) associated with seed (key) using the map
     next-k-gram = choose a random k-gram from the list (value)
     print or store C, the last character of next-k-gram
     seed = next-k-gram    // Note this is (last k-1 characters of seed) + C

Construct the map once --- don't construct the map each time the user tries to generate random text unless the value of k in the order-k Markov model has changed. See the howto pages for details on the class you'll implement and how it fits into the other classes.
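For concreteness, here is a sketch of how that pseudo-code might look in Java. It assumes the map built in the sketch above (myMap) and the same myRandom and myString instance variables as the brute-force code; the method name smart is just a placeholder.

  public void smart(int k, int numLetters) {
      // pick a random k-gram from the training text as the initial seed
      int start = myRandom.nextInt(myString.length() - k + 1);
      String seed = myString.substring(start, start + k);
      StringBuilder build = new StringBuilder();
      for (int i = 0; i < numLetters; i++) {
          ArrayList<String> followers = myMap.get(seed);     // the list (value) for this seed (key)
          String nextGram = followers.get(myRandom.nextInt(followers.size()));
          char ch = nextGram.charAt(nextGram.length() - 1);  // C, the last character of next-k-gram
          build.append(ch);
          seed = nextGram;                                   // equals (last k-1 characters of seed) + C
      }
      // print, return, or store build.toString() as the howto describes
  }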


Markov Models for Words

This is the second part of the assignment. You'll use the character-generating Markov code you wrote as a model to create a new program that generates word models: instead of generating random characters based on the preceding k-gram of characters, your program will generate random words based on the preceding k-gram of words. You'll create a WordMarkovModel class that doesn't use brute force; see the howto pages for details.

In the order-k Markov model with letters you just coded, k characters are used to predict/generate another character. In an order-k word Markov model, k words are used to predict/generate another word --- words replace characters. Instead of a k-gram consisting of k letters, a WordNgram consists of k words. You are to write two classes. One has been started for you: WordNgram, which represents N words from a training text. The other class, which you write from scratch, generates random text using an order-k word Markov model. The howto pages provide details on how your new model class fits in with the existing classes of this project.

The idea is that whereas in your previous program you mapped every k-gram (a String) to a list of following k-grams (also Strings), in this new program the key in the map will be a WordNgram representing a sequence of k words. The associated value will be a list of the word k-grams that follow it. Again, the howto has details. Your WordNgram class must correctly and efficiently implement .equals, .hashCode, and .compareTo so that WordNgram objects can be used as keys in HashMap or TreeMap maps. For full credit the hashCode value should be calculated once, not every time the method is called.
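As an illustration only --- the starter WordNgram you snarf may store its words differently --- here is one way the three methods could be written, with the hash code cached so it is computed at most once:

  import java.util.Arrays;

  public class WordNgram implements Comparable<WordNgram> {
      private String[] myWords;      // the k words of this n-gram
      private int myHash;            // cached hash code
      private boolean myHashSet;     // true once myHash has been computed

      // illustrative constructor: the n-gram is words[start] .. words[start+k-1]
      public WordNgram(String[] words, int start, int k) {
          myWords = new String[k];
          for (int i = 0; i < k; i++) {
              myWords[i] = words[start + i];
          }
      }

      @Override
      public boolean equals(Object o) {
          if (this == o) return true;
          if (!(o instanceof WordNgram)) return false;
          return Arrays.equals(myWords, ((WordNgram) o).myWords);
      }

      @Override
      public int hashCode() {
          if (!myHashSet) {                       // compute once, reuse afterwards
              myHash = Arrays.hashCode(myWords);
              myHashSet = true;
          }
          return myHash;
      }

      @Override
      public int compareTo(WordNgram other) {
          // all n-grams in one order-k model have the same number of words
          for (int i = 0; i < myWords.length; i++) {
              int cmp = myWords[i].compareTo(other.myWords[i]);
              if (cmp != 0) return cmp;
          }
          return 0;
      }
  }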

Document all .java files you write or modify. Submit all the .java files in your project. You should also submit an Analysis.pdf (described below) and a README.

Analysis

The analysis has two parts.

Part 1

Answer these questions:
  1. How long does it take, using the provided brute-force code, to generate order-5 text from Romeo and Juliet (romeo.txt) when generating 100, 200, 400, 800, and 1600 random characters? Do these timings change substantially when using order-1 or order-10 Markov models? Why?
  2. Romeo and Juliet has roughly 153,000 characters. Hawthorne's The Scarlet Letter contains roughly 500,000 characters. How long do you expect the brute-force code to take to generate order-5 text when trained on hathorne.txt, given the timings you observe for romeo.txt when generating 400, 800, and 1600 random characters? Do the empirical results match what you expect? How long do you think it will take to generate 1600 random characters using an order-5 Markov model when the King James Bible is used as the training text --- our online copy of this text contains roughly 4.4 million characters? Justify your answer --- don't test empirically; use reasoning.
  3. Provide timings using your map/smart model for both creating the map and generating 200, 400, 800, and 1600-character random texts with an order-5 model and romeo.txt. Provide some explanation for the timings you observe.

Part 2

The goal of the second part of the analysis is to analyze the performance of WordMarkovModel using a HashMap (and the hashCode method you wrote) and a TreeMap (and the compareTo method you wrote). The main difference between them should be their performance as the number of keys (that is, the WordNgram keys in your map) gets large. So set up a test with the lexicons we give you and a few of your own. Figure out how many keys each lexicon generates (for a specific n-gram size). Then generate some text and see how long it takes.

Graph the results. On one axis you'll have the number of keys; on the other you'll have the time it took to generate a constant number of words (you decide --- choose something fairly big to get more accurate results). Your two lines will be HashMap and TreeMap. Try to see whether you can spot any differences in their performance as the number of WordNgram keys in the map gets large. If you can't, that's fine. Briefly write up your analysis (a paragraph or two) and include both that and the graph in a PDF you submit.
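If it helps to see the timing idea in isolation, here is a self-contained sketch that compares HashMap and TreeMap lookup times as the number of keys grows. It uses synthetic String keys purely for illustration; in your test the keys would be the WordNgram objects produced by real training texts, and the work timed would be generating a fixed, fairly large number of words.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Random;
  import java.util.TreeMap;

  public class MapTiming {
      public static void main(String[] args) {
          Random rand = new Random();
          for (int numKeys = 10000; numKeys <= 160000; numKeys *= 2) {
              Map<String, Integer> hash = new HashMap<String, Integer>();
              Map<String, Integer> tree = new TreeMap<String, Integer>();
              ArrayList<String> keys = new ArrayList<String>();
              for (int i = 0; i < numKeys; i++) {
                  String key = "gram-" + rand.nextInt(1000000) + "-" + i;  // unique synthetic key
                  keys.add(key);
                  hash.put(key, i);
                  tree.put(key, i);
              }
              System.out.println(numKeys + " keys: HashMap " + time(hash, keys, rand)
                  + " ms, TreeMap " + time(tree, keys, rand) + " ms");
          }
      }

      // time a fixed number of random lookups in the given map, in milliseconds
      private static long time(Map<String, Integer> map, ArrayList<String> keys, Random rand) {
          long start = System.currentTimeMillis();
          for (int i = 0; i < 2000000; i++) {
              map.get(keys.get(rand.nextInt(keys.size())));
          }
          return System.currentTimeMillis() - start;
      }
  }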

Call the file Analysis.pdf so that our checking program can know that you submitted the right thing.

Grading

Criteria and points:
  Map-based Markov model works and is more efficient than the provided brute-force code --- 3 points
  WordNgram Markov model works and meets the requirements of hashCode and Comparable --- 3 points
  Analysis --- 3 points
  README/style of code (see guidelines) --- 1 point

This assignment carries a weight of 1.5 whereas the previous assignments were weighted as 1.0.