Making Random Text
Write a program that reads one or more reference files and, based on frequency and order that characters appear in the given text, generates new text randomly that sounds like it might have been written by the author of the reference files.
The model used to generate the new random text is a called an N-gram model, which uses the assumption that the previous n characters mostly determine the probabilities for the next character. For our purposes, we will use the most recently randomly generated n characters to determine the next randomly generated character by looking at all occurrences of those n characters in the reference text.
Our algorithm starts by picking a random n-character substring as the seed that starts the random text generation. This is done simply by picking an n-character long string at random from the reference text. This initial n-character string is called the predictor in the code that follows. Then, every occurrence of the predictor string is located in the reference text. The character that follows each occurrence is a possible candidate to be the next randomly generated character.
For example, imagine 1-gram generation as a typewriter with lots of 'e' keys, lots of 't' keys, lots of 'o' keys, but fewer 'r' keys, still fewer 'f' keys, and only one or two 'z' and 'y' keys (just like the distribution of Scrabble tiles or the letters given for the final game in Wheel of Fortune).
For example, suppose for a 3-gram the predictor is "the" and
suppose the reference file is:
these are the authentic anthems of the ethernet, I hypothesize mathematically
Then the characters that follow the predictor are, in order (note the two space characters):
s nm rsm
One of these is chosen at random to be the next character in the
generated text (notice that 's', ' ', and 'm' are the most likely
since there are two occurrences of each, while 'n' and 'r' are less
likely since there is only one occurrence of each). Then the predictor
is updated to be the last n-1 characters of the old predictor
followed by the chosen character. So, if 'm' is chosen in the example,
the new predictor will be "hem" (which has following
characters 's' and 'a').
This process is repeated to randomly generate a text of any given length that resembles the original reference text used as its source.
Writing the Code
If you download the project examples/10_ngram, you will see a shell
that shows the basic algorithm described above, but that is missing three key
details. You are to complete and test the three methods of the class NGram described
below that work together to generate random text. For fun, we have provided
a few large data files that have very characteristic styles.
Note, this program is based primarily on randomness to achieve an interesting
result. Whenever randomness is involved, it is hard to verify
that a program actually works as intended; thus, it is important to test each
method you write individually to verify that each works as intended. Only by
testing each method separately can you have any confidence
that the program as a whole works. To test your program, we have also provided
the class NGramTester that calls each method individually with
specific parameter values and prints the results for you to inspect. You may
also want to make some smaller reference files than those we provided to help
verify each method works as expected.
Choosing a Random Substring
Complete the method getRandomSubstring of the class NGram that
chooses a random substring of length subSize from the string
s, given as parameters.
Once you are confident you have completed this method, test it by using the class
NGramTester to verify that you are returning a string of the correct length that
occurs within the given string.
Getting all the Following Characters
Complete the method getFollowingCharacters of the class NGram which
has been started for you. The code should:
- Find every occurrence of parameter
toFindin the strings - Add the character immediately following each occurrence of
toFindto the stringresult(without any delimiter between the added characters, as in the example given above)
Once you are confident you have completed this method, test it by using the
class NGramTester to verify that you are returning all of the
possible following characters given a variety of strings to find.
Putting it all together
Complete the function makeNGram that is passed the reference text
as one big string of every word in order, a value for the n in N-gram,
and the length of the text to generate randomly. The function returns a new string
constructed based on N-gram frequencies.
You should implement the body of the loop. You will need to call both
of the functions you have written previously to find the possible next
characters and then choose one randomly.
Once you are confident you have completed this method, test it by using the
class NGramTester to verify that you are returning the correct
total number of characters and that it follows the text more closely as n increases.