CompSci 100e: Program Design & Analysis II(Spring 2010) | ||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||
Markovian Text GenerationYou should snarf assignment markov from w or browse the code directory for code files provided. See the markov howto file for help and more instructions.
For the extra credit Music part snarf the extra credit code which
includes a modified GUI/Model that you can use. Be sure to look at the
Genesis of Assignment![]() (image from Wikipedia, public domain) The true mathematical roots are from a 1948 monolog by Claude Shannon, A Mathematical Theory of Communication which discusses in detail the mathematics and intuition behind this assignment. This article was popularized by AK Dewdney in both Scientific American and later reprinted in his books that collected the articles he wrote. In 2005 two students at MIT had a randomly-generated paper accepted at a conference. Their paper-generating tool, SCIgen, has a Wikipedia entry with links to the program. In 1984 an Internet Personality named Mark V Shaney participated in the then chat-room-of-the time using randomly generated text as described in this Wikipedia entry. In December 2008 Herbert Schlangemann had a auto-generated paper accepted and published in an IEEE "conference" (quotes used judiciously). Joe Zachary developed this into a nifty assignment in 2003 and the folks at Princeton have been using this as the basis for an assignment since 2005 and possibly earlier.
High Level View of AssignmentYou'll do two things for this assignment.
You'll be provided with code that uses brute-force to generate random text using an order-k Markov Model based on characters. You'll first improve the code to make it more efficient, then you'll write a new model based on words rather than on characters. The term brute-force refers to the characterstic that the entire text that forms the basis of the Markov Model is rescanned to generate each letter of the random text. For the first part of the assignment you'll develop and use appropriate data structures so that the text is scanned only once. When you scan once, your code will store information in appropriate data structures so that generating random text requires looking up information rather than rescanning the text.
Background on Markov ModelsAn order-k Markov model uses strings of k-letters to predict text, these are called k-grams. An order-2 Markov model uses two-character strings or bigrams to calculate probabilities in generating random letters. For example suppose that in the text we're using for generating random letters using an order-2 Markov model the bigram "th" is followed 50 times by the letter 'e', 20 times by the letter 'a', and 30 times by the letter 'o', because the sequences "the", "tha" and "tho" occur 50, 20, and 30 times, respectively while there are no other occurrences of "th" in the text we're modeling.Now suppose that in generating random text we generate the bigram "th" and based on this we must generate the next random character using the order-2 model. The next letter will be an 'e' with a probability of 0.5 (50/100); will be an 'a' with probability 0.2 (20/100); and will be an 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he" since the last part of the old bigram is combined with the new letter to create the next bigram used in the Markov process. In general here's pseudo-code to generate random letters (and thus random text) using an order-k Markov model and a training text from which probabilities are calculated.
seed = random k-character substring from the training text --- the initial seed
repeat N times to generate N random letters
for each occurrence of seed in training text
record the letter that follows the occurrence of seed in a list
choose a random element of the list as the generated letter C
print or store C
seed = (last k-1 characters of seed) + C
Using this algorithm, whose Java equivalent is shown below, here are a
series of 200-randomly generated letters using the order-k Markov model.
The training text for these models are speeches made by Barack Obama as
his acceptance speech at the Democractic National Convention, John
McCain in his speech to the American Legion, and Sarah Palin in her
acceptance speech as McCain's vice president (all late August, 2008).
See training texts for
the training data files.
Here's the Java code that implements the brute-force solution to generate numLetters at random from the training-text in instance variable myString using an order-k Markov model.
The code above works fine, but to generate N letters in a text of size T the code does NT work since it rescans the text each time a character is found. A Smarter Approach for CharactersInstead of scanning the training text N times to generate N random characters, you'll scan the text once to create a structure representing every possible k-gram used in an order-k Markov Model. text. This is the first part of the assignment: creating the smarter/faster/better method and then reasoning about the new method empirically.For example, Suppose the training text is "bbbabbabbbbaba" and we're using an order-3 Markov Model. The 3-letter string (3-gram) "bbb" occurs three times, twice followed by 'a' and once by 'b'. We represent this by saying that the 3-gram "bbb" is followed twice by "bba" and once by "bbb" since the next 3-gram in generating Markov-modeled, random text is formed by taking the last two characters of the seed 3-gram "bbb", which are "bb" and adding the letter that follows the original 3-gram seed. The 3-letter string "bba" occurs three times, each time followed by 'b'. The 3-letter string "bab" occurs three times, followed twice by 'b' and once by 'a'. However, we treat the original string/training-text as circular, i.e., the end of the string is followed by the beginning of the string. This means "bab" also occurs at the end of the string (last two characters followed by the first character) again followed by 'b'. In processing the training text from left-to-right we see the following transitions between 3-grams starting with the left-most 3-gram "bbb" bbb -> bba -> bab -> abb -> bba -> bab ->abb -> bbb ->bbb -> bba -> bab -> aba ->bab -> abb -> bbbThis can be represented as a map of each possible three grams to the three grams that follow it:
This information can also be represented in a state diagram as shown below (from the Princeton website).
In your code you'll replace the brute-force re-scanning algorithm for generating random text based on characters with code that builds a data structure that you'll then use to follow the state transitions diagrammed above. Specifically, you'll create a map to make the operation of creating random text more efficient. Keys in the map are k-grams in an k-order Markov model. The value associated with each key is a list of related k-grams. Each different k-gram in the training text will be a key in the map. The value associated with a k-gram key is a list of every k-gram that follows key in the training text. The list of k-grams that constitute a value should be in order of occurrence in the training text. See the table of 3-grams above as an example using the training text "bbbabbabbbbaba". Note that the 3-gram key "bbb" would map to the list ["bba", "bbb", "bba"], the 3-gram key "bba" would map to the list ["bab", "bab", "bab"], and the 3-gram key "abb" would map to the list ["bba", "bbb", "bbb"] (think about why this is.) To generate random text your code should generate an initial seed k-gram at random from the training text, exactly as in the brute-force approach. Then use the pseudo-code outlined below.
seed = random k-character substring (k-gram) from the training text (key) --- the initial seed
repeat N times to generate N random letters
find the list (value) associated with seed (key) using the map
next-k-gram = choose a random k-gram from the list (value)
print or store C, the last character of next-k-gram
seed = next-k-gram // Note this is (last k-1 characters of seed) + C
Construct the map once --- don't construct the map each time the user tries to generate random text unless the value of k in the order-k Markov model has changed. See the howto pages for details on the class you'll implement and how it fits into the other classes. In your README file you will answer several questions regarding the run-time of the brute-force method and the new, map-based method you create. The README template supplied in the assignment provides questions you must answer as part of doing the assignment. The README accounts for a significant portion of your grade on this assignment. You should characterize the running time of both the brute force solution and the smarter map solution in terms of N the number of letters generated and T the length of the training text. You'll need to make assumptions about the number of k-grams, but in the worst case this is T/k and that assumes every character is unique. For te purposes of this assignment you can treak the k in k-gram as a constant (and thus ignore it).
Markov Models for WordsThis is the second part of the assignment. You'll use the character-generating Markov code you wrote as a model to create a new program that generates word models: instead of generating random characters based on the preceding character k-grams your program will generate random words based on the preceding word k-grams. You'll create aWordMarkovModel class that doesn't use brute force, see
the howto pages for details.
In the k-order Markov model with letters you just coded, k characters are used to
predict/generate another character. In a k-order word Markov
model k words are used to predict/generate another word --- words
replace characters. Here
instead of a k-gram consisting of
The idea is that whereas in your previous program you mapped every
k-gram represented by a String to a list of following k-grams
as Strings in this new program the key in the map will be a
Document all .java files you write or modify. Submit all the .java files
in your project. You should also submit a README as outlined
in the howto pages and these pages. Do not
submit jmusic.jar or .class files.
|
||||||||||||||||||||||||||||||||||||
| Last updated Fri Feb 26 14:13:15 EST 2010 | ||||||||||||||||||||||||||||||||||||