CompSci 6
Fall 2008
Program Design and Analysis

Making Random Text

Write a program that reads one or more reference files and, based on frequency and order that characters appear in the given text, generates new text randomly that sounds like it might have been written by the author of the reference files.

The model used to generate the new random text is a called an N-gram model, which uses the assumption that the previous n characters mostly determine the probabilities for the next character. For our purposes, we will use the most recently randomly generated n characters to determine the next randomly generated character by looking at all occurrences of those n characters in the reference text.

Our algorithm starts by picking a random n-character substring as the seed that starts the random text generation. This is done simply by picking an n-character long string at random from the reference text. This initial n-character string is called the predictor in the code that follows. Then, every occurrence of the predictor string is located in the reference text. The character that follows each occurrence is a possible candidate to be the next randomly generated character.

For example, imagine 1-gram generation as a typewriter with lots of 'e' keys, lots of 't' keys, lots of 'o' keys, but fewer 'r' keys, still fewer 'f' keys, and only one or two 'z' and 'y' keys (just like the distribution of Scrabble tiles or the letters given for the final game in Wheel of Fortune).

For example, suppose for a 3-gram the predictor is "the" and suppose the reference file is:

these are the authentic anthems of the ethernet, I hypothesize mathematically

Then the characters that follow the predictor are, in order (note the two space characters):

s nm rsm

One of these is chosen at random to be the next character in the generated text (notice that 's', ' ', and 'm' are the most likely since there are two occurrences of each, while 'n' and 'r' are less likely since there is only one occurrence of each). Then the predictor is updated to be the last n-1 characters of the old predictor followed by the chosen character. So, if 'm' is chosen in the example, the new predictor will be "hem" (which has following characters 's' and 'a').

This process is repeated to randomly generate a text of any given length that resembles the original reference text used as its source.

Writing the Code

If you download the project examples/10_ngram, you will see a shell that shows the basic algorithm described above, but that is missing three key details. You are to complete and test the three methods of the class NGram described below that work together to generate random text. For fun, we have provided a few large data files that have very characteristic styles.

Note, this program is based primarily on randomness to achieve an interesting result. Whenever randomness is involved, it is hard to verify that a program actually works as intended; thus, it is important to test each method you write individually to verify that each works as intended. Only by testing each method separately can you have any confidence that the program as a whole works. To test your program, we have also provided the class NGramTester that calls each method individually with specific parameter values and prints the results for you to inspect. You may also want to make some smaller reference files than those we provided to help verify each method works as expected.

Choosing a Random Substring

Complete the method getRandomSubstring of the class NGram that chooses a random substring of length subSize from the string s, given as parameters. Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning a string of the correct length that occurs within the given string.

Getting all the Following Characters

Complete the method getFollowingCharacters of the class NGram which has been started for you. The code should:

  1. Find every occurrence of parameter toFind in the string s
  2. Add the character immediately following each occurrence of toFind to the string result (without any delimiter between the added characters, as in the example given above)

Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning all of the possible following characters given a variety of strings to find.

Putting it all together

Complete the function makeNGram that is passed the reference text as one big string of every word in order, a value for the n in N-gram, and the length of the text to generate randomly. The function returns a new string constructed based on N-gram frequencies. You should implement the body of the loop. You will need to call both of the functions you have written previously to find the possible next characters and then choose one randomly.

Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning the correct total number of characters and that it follows the text more closely as n increases.