forked from chriszf/markov_exercise
/
test_stage1.py
224 lines (150 loc) · 9.63 KB
/
test_stage1.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
"""
Tests
"""
import markov
import markov_solution
import pytest
import random
def fn_exists(name):
try:
getattr(markov, name)
return True
except:
return False
def test_initial_layout():
"""Start your function with high-level pseudocode"""
assert fn_exists("main"), """\
Your program is missing a 'main' function. Please add it.
Before you do anything, you should produce a rough sketch of your code, starting with your main function.
Your main function should contain your main loop:
def main():
some_setup()
loop some number of times:
read_input()
process_input()
When you put your code in a main function, you also need to tell python to run that function when it starts, so you should also add the following lines to the bottom of your file:
if __name__ == "__main__":
main()
"""
assert fn_exists("process_file"), """\
First, we start by doing our setup. In this case, our setup is some sort of file processing. We roughly sketch it out by creating a function named process_file. The signature looks like this
process_file(str) -> ???
Right now, we don't know what the output should look like, but we'll just place it in there to start. For now, it should return nothing and do nothing. Remember, you can make an empty function by doing the following:
def fn(arg):
pass
You should call this function right before your main loop."""
def test_basic_markov():
assert markov.process_file("sample1.txt") == {}, """\
Here is the heart of the markov analysis. Let us assume for now that all markov analysis uses a prefix length of two. We have provided for you a series of incrementally complex files to test your markov function.
The first file, 'sample1.txt' is completely empty, and should produce an empty output.
Now is the time to decide what kind of output process_file returns. The functionality can be stated (in non-programming terms) in the following way:
process_file opens a text file, analyzes the text, and returns a mapping of word prefix chains to their suffixes.
The only data structure we've learned so far that supports the concept of 'mapping' one value to another is the dictionary, or hash.
If sample1.txt is an empty file, we can assume that running process_file("sample1.txt") will produce an empty dictionary, {}.
This solidifies our interface, the signature now looks like this:
process_file(str) -> dict
Where 'str' is the filename and 'dict' contains the Markov mappings.
Note: we do not need to specifically check that 'sample1.txt' is an empty file. If we do our processing correctly, our function will return an empty dictionary if there is no input. For now, it is sufficient to blindly return an empty dictionary without actually opening the file.
"""
output = {("hi", "there"): ["friend."]}
assert markov.process_file("sample2.txt") == output, """\
If you open 'sample2.txt', you will find it contains the phrase 'hi there friend'. Aside from an empty file, this is what we call the 'degenerate case' of the input. According to our instructions, we should use a prefix of length 2, and each prefix should map to a suffix that contains a single word. In this case,
hi there => friend
In the vein of choosing the right data structures, we need to think about how to represent the prefix and the suffix.
As I've mentioned before, string manipulation is difficult. In this case, we're working on the level of individual words, and as we'll see later, after we've read the initial input, we won't want to worry about splitting and rejoining strings until the very end. For now, trust that you'll want to keep individual words of the prefixes separated, so that means using a collection, like a list or a tuple.
prefix example: ["hi", "there"] or ("hi", "there")
Remember, this prefix is used as the key in your mapping. Try playing around in the console to see which one makes sense to use.
The other thing to think about is what data structure to use to hold your suffix. In this example, the suffix is simple, there is only one word.
However, consider this example:
would you, could you, in a box
would you, could you, with a fox
would you, could you, in a house
would you, could you, with a mouse
Notice that 'could you,' an be followed by either 'in' or 'with'. The partial mapping might look like:
could you, => in a, with a
in a => box, house
with a => fox, mouse
The suffix must then be some sort of collection as well, either a set or a tuple.
As you implement this, remember the mantra: do it first for a single case, then figure out how to repeat it.
It will help to simply open your file, read only one line, then close it, like this:
f = open(filename)
line = f.readline()
f.close()
Then try writing your function to process that single line.
"""
def test_shift():
assert fn_exists("shift"), """\
Utility function: shift
As a quick aside, we need to build a function called 'shift' with certain useful behavior. I promise that you will use this later, but it is non-obvious right now how. Still, this is a good time to build it.
The signature of shift looks like this:
shift(tuple, anything) => tuple
Shift takes a tuple, removes the first element, then adds the second argument to the end, and returns the new tuple.
t = (1,2)
shift(t, 3)
=> (2, 3)
'Shifting' values onto the end of a collection while removing them from the front is a commonly used operation.
Try experimenting with adding tuples together in the console to get an idea of how to write this function."""
assert markov.shift((1,2), 3) == (2,3), """Here are more test cases to try out your shift function."""
assert markov.shift(("cat", "in"), "the") == ("in","the"), """Notice the specification of 'any' type for the second argument. Shift should work on a tuple, and any variable that can go in a tuple."""
def test_phrase_generation():
assert fn_exists("build_sentence"), """\
Now that we've produced a super simple markov mapping, we need to check to see if we can use it to build a phrase. We'll start by creating a function with the following signature:
build_sentence(dict) => str
The dictionary that it accepts is the markov mapping we produced in our process_file function.
Notice how we are explicit about inputs and return values from our function, and how the output of the other function is used as the input to this one. So far, we've been using global variables to store our intermediate values, which is considered a bad practice."""
your_sentence = markov.build_sentence({("this", "old"): ["man."]})
expected = "this old man."
print "Expected sentence:", expected
print "Your sentence:", your_sentence
assert your_sentence == expected, """\
To produce a random sentence from a markov mapping, we choose a random key/value pair from our map, which should look like this:
("word1", "word2") : ["word3"]
We then assemble the three words into a single sentence. Using 'sample2.txt':
# after running process_file, our mapping looks like this:
mapping = {("hi", "there"): ["friend."]}
We randomly choose one element in the dictionary, but since there is only one, we use that and assemble the sentence.
build_sentence(mapping)
=> "hi there friend."\
"""
def test_first_integration(capsys):
markov.main()
out, err = capsys.readouterr()
assert out == "hi there friend.\n", """\
Now is the time to bring it all together. When you run your file, markov.py from the command line
python markov.py
It should should then try to run markov analysis and build a single sentence from sample2.txt, outputting
hi there friend.
You have to chain this behavior together in your main function, calling process_file and feeding the output of that into build_sentence, then printing the output of that."""
def test_complex_input_1():
mapping = markov.process_file("sample3.txt")
output = {
("how", "are"): ["you"],
("are", "you"): ["doing?"]
}
print "Expected map:", output
print "Your map:", mapping
assert mapping == output, """\
Now that we've done a full integration on our degenerate (basic) case, we're increasing complexity. Try to improve your 'process_file' function to process more than one prefix-suffix pair.
Remember, a prefix consists of two words, and a suffix is a single word, so the maximum number of pairs that can be produced from three words is one.
In general, if there are N words in the input, it should be possible to generate N-2 pairs."""
def test_complex_input_2():
correct_map = markov_solution.build_chains("sample4.txt", 2)
print "Expected map:", correct_map
new_map = markov.process_file("sample4.txt")
print "Your map:", new_map
assert new_map == correct_map, """\
The previous example tested a single line longer than 3 words, now update your function to work on a multiline file.
Remember that calling string.split() with no arguments will automatically split on all whitespace, including newlines, tabs, and spaces."""
def test_complex_input_3():
correct_map = markov_solution.build_chains("sample5.txt", 2)
new_map = markov.process_file("sample5.txt")
for key in correct_map.keys():
print "Expected: %r => %r"%(key, correct_map[key])
print "Received: %r => %r"%(key, new_map[key])
assert sorted(correct_map[key]) == sorted(new_map[key]), """\
Sometimes a prefix will have more than allowable suffix, as in the following example:
Humpty Dumpty sat on a wall
Humpty Dumpty had a great fall
The prefix "Humpty Dumpty" can become either "sat" or "had".
Make sure that your processing function can produce handle that scenario.
"""