Making Random Text


Write a program that will read one or more reference files and, based on the frequency and order in which characters appear in the given text, randomly generate new text that sounds like it might have been written by the same author.

The model used to generate the new random text is called an N-gram model, which assumes that the previous n characters largely determine the probabilities for the next character. For our purposes, we will use the n most recently generated characters to determine the next randomly generated character by looking at all occurrences of those n characters in the reference text.

Our algorithm starts by picking a random n-character substring of the reference text as the seed that starts the random text generation. This initial n-character string is called the predictor in the code that follows. Then every occurrence of the predictor string is located in the reference text, and the character that follows each occurrence becomes a candidate to be the next randomly generated character.

For example, imagine 1-gram generation as a typewriter with lots of 'e' keys, lots of 't' keys, lots of 'o' keys, but fewer 'r' keys, still fewer 'f' keys, and only one or two 'z' and 'y' keys (just like the distribution of Scrabble tiles or the letters given for the final game in Wheel of Fortune).

As a concrete 3-gram example, suppose the predictor is "the" and the reference file is:

these are the authentic anthems of the ethernet, I hypothesize mathematically

Then the characters that follow the predictor are, in order:

's', ' ', 'n', 'm', ' ', 'r', 's', 'm'

One of these is chosen at random to be the next character in the generated text (notice that 's', ' ', and 'm' are the most likely since there are two occurrences of each, while 'n' and 'r' are less likely since there is only one occurrence of each). Then the predictor is updated to be the last n-1 characters of the old predictor followed by the chosen character. So, if 'm' is chosen in the example, the new predictor will be "hem" (which has following characters 's' and 'a').
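That sliding update of the predictor is a one-liner in code. A minimal sketch, assuming the predictor is kept in a std::string (the helper name here is illustrative, not part of the starter code):

```cpp
#include <string>

// Slide the predictor window forward by one character:
// drop its first character and append the newly chosen one.
// (Hypothetical helper; the assignment does this inline in makeNGram.)
std::string updatePredictor(const std::string& predictor, char chosen) {
    return predictor.substr(1) + chosen;
}
```

With predictor "the" and chosen character 'm', this yields "hem", matching the example above.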

This process is repeated to randomly generate a text of any given length that resembles the original reference text used as its source.

Writing the Code

If you check out the project examples/10_ngram, you will see a shell that implements the basic algorithm described above but is missing three key details. You are to complete and test the three functions described below that work together to generate random text. To test your program, we have provided a few large data files. You should also make some smaller test files of your own to help verify that each function works as expected.

Choosing a Random Substring

Complete the function getRandomSubstring, which chooses a random substring of length subSize from the string s; both subSize and s are given as parameters.
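One possible implementation, assuming the standard C++ &lt;random&gt; facilities are acceptable (the starter code may expect a different random-number helper):

```cpp
#include <string>
#include <random>

// Return a random substring of length subSize from s.
// A valid starting index ranges from 0 to s.length() - subSize,
// so every length-subSize window of s is equally likely.
std::string getRandomSubstring(const std::string& s, int subSize) {
    static std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, (int)s.length() - subSize);
    return s.substr(dist(gen), subSize);
}
```

Note that the last legal start index is s.length() - subSize; an off-by-one here (using s.length() - 1 as the upper bound) would occasionally produce a too-short substring from the end of the text.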

Getting all the Following Characters

Complete the function getFollowingCharacters which has been started for you. The code should:

  1. Find every occurrence of parameter toFind in the string s
  2. Add the character immediately following each occurrence of toFind to the string result
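The two steps above can be sketched with a std::string::find loop; this is one reasonable shape for the function, not necessarily the exact structure of the starter code:

```cpp
#include <string>

// Collect the character immediately following every occurrence of
// toFind in s. A match at the very end of s has no following
// character, so it contributes nothing to the result.
std::string getFollowingCharacters(const std::string& s, const std::string& toFind) {
    std::string result;
    for (size_t pos = s.find(toFind); pos != std::string::npos;
         pos = s.find(toFind, pos + 1)) {
        size_t next = pos + toFind.length();
        if (next < s.length()) {
            result += s[next];
        }
    }
    return result;
}
```

On the reference text from the 3-gram example, getFollowingCharacters(text, "the") returns "s nm rsm", i.e. the characters 's', ' ', 'n', 'm', ' ', 'r', 's', 'm' in order. Restarting the search at pos + 1 (rather than past the whole match) also handles overlapping occurrences of toFind.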

Putting it all together

Complete the function makeNGram, which is passed the reference text as one big string of every word in order, the value of n for the N-gram, and the length of the text to generate. The function returns a new string constructed based on N-gram frequencies. You should implement the body of the loop. You will need to call both of the functions you wrote previously to find the possible next characters and then choose one at random.
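Putting the pieces together, here is a sketch of the whole generator under the assumptions above (helpers repeated so the example is self-contained; in the project they are the functions you just wrote). The reseed-on-empty step handles the case where the predictor occurs only at the very end of the reference text, which the handout does not specify; treat it as one possible design choice:

```cpp
#include <string>
#include <random>

static std::mt19937 gen(std::random_device{}());

std::string getRandomSubstring(const std::string& s, int subSize) {
    std::uniform_int_distribution<int> dist(0, (int)s.length() - subSize);
    return s.substr(dist(gen), subSize);
}

std::string getFollowingCharacters(const std::string& s, const std::string& toFind) {
    std::string result;
    for (size_t pos = s.find(toFind); pos != std::string::npos;
         pos = s.find(toFind, pos + 1)) {
        size_t next = pos + toFind.length();
        if (next < s.length()) result += s[next];
    }
    return result;
}

// Generate 'length' characters of random text from 'reference'
// using an n-gram model.
std::string makeNGram(const std::string& reference, int n, int length) {
    std::string predictor = getRandomSubstring(reference, n);
    std::string output;
    while ((int)output.length() < length) {
        std::string candidates = getFollowingCharacters(reference, predictor);
        if (candidates.empty()) {
            // Predictor only occurs at the end of the text: reseed.
            predictor = getRandomSubstring(reference, n);
            continue;
        }
        std::uniform_int_distribution<int> dist(0, (int)candidates.length() - 1);
        char chosen = candidates[dist(gen)];
        output += chosen;
        predictor = predictor.substr(1) + chosen;  // slide the window
    }
    return output;
}
```

Because repeated characters in the candidates string are chosen with proportionally higher probability, picking a uniformly random index into it automatically reproduces the N-gram frequencies of the reference text.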
