Manual testing
==============
I am assuming that you have cloned the source code, along with the sys.config file,
to your local directory and compiled all the .erl files into .beam files.
All .beam files need to be in the corresponding ebin directory.
When you clone the project, it creates a directory erlang_raft, and all other
directories (raft, replicated_log, state_machine) are created under it.
erlang_raft is the root directory. If you wish, you can create a data directory under
this root directory and modify the data_dir parameter in the sys.config file. If you
do not create a data directory, all the raft data files (*.repl.log, *meta.dat)
will be created under the root directory.
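As an illustration, a sys.config that sets data_dir might look like the sketch below.
The application name erlang_raft, the value format of data_dir and the ./data path are
assumptions made for this example; only the parameter names data_dir, optimistic_mode and
log_retention_size_max come from this document. Check the sys.config shipped with the
project for the real layout.

```erlang
%% Sketch only: the application key (erlang_raft) and the value formats are assumptions.
[
 {erlang_raft,
  [
   {data_dir, "./data"},            %% hypothetical path under the root directory
   {optimistic_mode, true},         %% this document describes true as the current default
   {log_retention_size_max, 15}     %% demo/testing value mentioned in this document
  ]}
].
```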
Delete any raft data that was created during the previous run. You need not do this
in an actual run, but it is convenient for test runs.
Start 6 terminal shells and cd to the erlang_raft root directory in each of them.
In the first shell, type the following command to create an Erlang node
(on Windows, use werl instead of erl):
erl -sname er_n1 -pa ./raft/ebin -pa ./replicated_log/ebin -pa ./state_machine/ebin -config ./sys
Once you are in the Erlang shell, type the following command to create a set of local erlang_raft servers.
Note : a number followed by > denotes the Erlang shell prompt.
1> erlang_raft:start().
You will see that this command starts 7 new processes, as given below:
1. er_sup
2. er_raft_server
3. er_peer_server
4. er_event
5. rl_server
6. sm_sup
7. sm_server
Processes 1 to 5 are part of the erlang_raft implementation. The last 2 have been created only for demo
and testing purposes; in your actual run, you will replace them with your own state machine.
Repeat this step in the next 4 shells with the node names er_n2, er_n3, er_n4 and er_n5.
At this stage, we have 5 erlang_raft servers running stand-alone. They have not been
clustered yet, so you will not see any activity in these Erlang shells.
All these nodes are running with status NOT_IN_CONFIG. When an erlang_raft server is in
NOT_IN_CONFIG status, it does not belong to an erlang_raft cluster and it does not
time out. It is just waiting to join an erlang_raft cluster.
Now, let us start the last Erlang shell. This shell acts as an erlang_raft client, and
we perform all further activities from this 6th Erlang shell.
Type the following command in the 6th shell:
erl -sname er_nz -pa ./raft/ebin
Once the 6th Erlang shell is running, create the following node variables. These node names are based
on my computer and will be different on yours, so adjust the host part accordingly.
1> Ern1 = 'er_n1@Gyanendras-MacBook-Pro'.
2> Ern2 = 'er_n2@Gyanendras-MacBook-Pro'.
3> Ern3 = 'er_n3@Gyanendras-MacBook-Pro'.
4> Ern4 = 'er_n4@Gyanendras-MacBook-Pro'.
5> Ern5 = 'er_n5@Gyanendras-MacBook-Pro'.
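If you prefer not to type the host part by hand, the full node names can be derived from
the client's own node name. The module below is a minimal sketch and is not part of
erlang_raft; the module name node_names and the function peer are made up for this
example. It assumes that all nodes were started with -sname on the same machine.

```erlang
-module(node_names).
-export([peer/1, peer/2]).

%% Build a full node name such as 'er_n1@myhost' from a short name,
%% reusing the host part of the local node (started with erl -sname ...).
peer(Sname) ->
    peer(Sname, node()).

%% Same as peer/1, but the reference node is passed in explicitly.
peer(Sname, LocalNode) ->
    %% Split 'er_nz@myhost' into its name and host parts at the first "@".
    [_LocalName, Host] = string:split(atom_to_list(LocalNode), "@"),
    list_to_atom(Sname ++ "@" ++ Host).
```

Once the compiled module is on the code path of the client shell, Ern1 = node_names:peer("er_n1").
should give the same value as typing the node name by hand.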
To create an erlang_raft cluster of 3 nodes, type the following command in the Erlang shell:
6> erlang_raft:config_entry(Ern1, [Ern1, Ern2, Ern3]).
Here, the first parameter Ern1 is the node to which we send our command. This could have been
any node (Ern1, Ern2 or Ern3) that is going to be part of the erlang_raft cluster. The second parameter is
the list of nodes that will form the cluster.
Once you issue this command, you will see some activity on these 3 nodes: they become FOLLOWERs,
time out for leader election, and eventually one of the nodes becomes LEADER.
For the rest of this discussion, I am assuming that Ern1 has become LEADER.
Now let us enter our first log entry. Type the following command:
7> erlang_raft:log_entry(Ern1, make_ref(), "cmd01").
Here, the first parameter is any active node in the cluster; it need not be the LEADER.
The second parameter is a unique id for the state machine command, used to check for duplicate entries;
you can use any suitable id. The third parameter is the actual command (any Erlang term)
that you want to issue to your state machine.
Once you issue this, you will see more activity on all 3 nodes.
Issue this command 3 more times, changing only the value of the last parameter, as given below:
8> erlang_raft:log_entry(Ern1, make_ref(), "cmd02").
9> erlang_raft:log_entry(Ern1, make_ref(), "cmd03").
10> erlang_raft:log_entry(Ern1, make_ref(), "cmd04").
At this point, you will see the following on the LEADER node:
[er_raft_server] "log_entry.99" status=er_leader, leader_id='er_n1@Gyanendras-MacBook-Pro', current_term=1,
prev_log_term=1, prev_log_index=4, commit_term=1, commit_index=4, applied_term=1, applied_index=4
Both FOLLOWER nodes will display the following:
[er_raft_server] "peer_append_entries_op.99" status=er_follower, leader_id='er_n1@Gyanendras-MacBook-Pro',
current_term=1, prev_log_term=1, prev_log_index=4, commit_term=1, commit_index=3, applied_term=1, applied_index=3
Here the LEADER's commit_index is always 1 entry ahead of the FOLLOWERs' because the LEADER commits the latest
entry as soon as it receives acceptance from a majority of servers (including itself).
Now, bring down one of the FOLLOWER nodes (say Ern3): go to the Erlang shell of Ern3 and type the following command:
2> erlang_raft:stop().
Now, we have only 2 active erlang_raft nodes. This is ok because 2 out of 3 nodes still fulfill the quorum requirement.
Issue 4 more log entries from the erlang_raft client shell (6th node) and you will see that these 4 entries
are also accepted. Please check the display on the LEADER and FOLLOWER nodes to ensure that they are
still in sync.
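The quorum arithmetic behind this step can be sketched as follows. This is an illustration of
the usual Raft majority rule and is not taken from the erlang_raft source; the module and
function names are made up for this example.

```erlang
-module(quorum_sketch).
-export([quorum/1, committed/2]).

%% Smallest number of servers that forms a majority of the cluster.
quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
    ClusterSize div 2 + 1.

%% true when the number of servers that accepted an entry
%% (the LEADER counts as one of them) reaches a majority.
committed(Acks, ClusterSize) ->
    Acks >= quorum(ClusterSize).
```

With 3 configured nodes, quorum_sketch:quorum(3) is 2, so the LEADER plus one FOLLOWER is
enough to accept new entries while Ern3 is down.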
Now, bring back Ern3: go to the Ern3 Erlang shell and type the following command:
3> erlang_raft:start().
This will start erlang_raft node 3. We also know that this node is not current.
Now let us make 2 more log entries from the erlang_raft client shell.
You will notice that, at this stage, LEADER (Ern1) and FOLLOWER (Ern2) are current with 10 log entries,
but FOLLOWER (Ern3) may or may not be current, depending upon the setting of optimistic_mode
in the sys.config file.
If it is set to true, Ern3 will not become current, because the LEADER can still fulfill
the quorum requirement with node Ern2 alone. This is the default setting right now, and it leads to
faster normal operation.
If it is set to false, Ern3 will become current immediately after the first log entry.
Let us assume we are testing with optimistic_mode set to true.
Now, stop the FOLLOWER node Ern2, which is current, and make 2 more log entries.
You will notice that FOLLOWER node Ern3, which was not current, becomes current immediately after the first entry,
because the LEADER now needs Ern3 to be current to fulfill the quorum requirement.
At this moment we should have 12 entries on both LEADER and FOLLOWER. If you wish, you can check
the file size of the replicated log files (er_n1_repl.log, er_n3_repl.log) for Ern1 and Ern3.
They should be the same.
Now let us change our cluster configuration. We are still assuming that Ern1 is the LEADER
and Ern3 is the only active FOLLOWER.
In the Erlang client shell, type the following command:
> erlang_raft:config_entry(Ern1, [Ern3, Ern4, Ern5]).
We are taking out Ern1 (our current LEADER) and adding Ern4 and Ern5, which are not current,
to the cluster.
You will notice that Ern1 (LEADER) makes Ern4 and Ern5 current immediately, before it leaves the
cluster and its status becomes NOT_IN_CONFIG.
At this stage, the cluster does not have a LEADER, so the nodes go through a leader election
and elect a new LEADER.
Once a new LEADER has been elected, make 4 more entries. You cannot use Ern1 as the first parameter
of your log_entry command because it is no longer an active node, so use one of the nodes
in the new cluster instead.
Once you make these 4 entries, you will notice something strange: everything looks ok, but
the file size of your replicated log files has shrunk. This is because the log files are compacted once
they reach the limit specified in sys.config (log_retention_size_max).
We have set it to 15 for demo/testing purposes, but normally this value will be much higher.
This should be the happy ending of the test. If you see any discrepancy, please report back.
Automated Testing
=================
This implementation of RAFT supports the following features:
1. Leader Election
2. Replicated Log
3. Config Change
4. Log Compaction
5. Snapshot Installation
Log Compaction is performed automatically when we make log entries and the log file size exceeds
the configured limit.
Snapshot Installation is performed automatically as a part of log entries and also as a part
of config change.
So we do not need to create any test cases for these 2 features. If the other features work
correctly, we can be sure that these 2 features are also working correctly.
We create a random number of test cases, and each test case is selected randomly from the
following 5 test scenarios.
We assume that we are going to use N RAFT nodes (Full Configuration) for testing
and that we start our RAFT cluster with M RAFT nodes (Initial Configuration, M =< N).
Each test case starts with a RAFT cluster configuration change, so we randomly choose one of
the 5 scenarios as the configuration change for the next test case. The 5 test scenarios are:
1. Current LEADER will be retained and 1 RAFT node from current RAFT cluster will be
replaced by 1 new RAFT node that is NOT_IN_CONFIG.
2. Current LEADER will be retained and 2 RAFT nodes from current RAFT cluster will be
replaced by 2 new RAFT nodes that are NOT_IN_CONFIG.
These test cases help to determine whether the RAFT implementation performed the Config Change correctly
and whether every RAFT node in the new RAFT cluster is consistent (possibly by installing
a snapshot if needed).
3. Current LEADER is removed and replaced by 1 new RAFT node that is NOT_IN_CONFIG.
4. Current LEADER and 1 more RAFT node from current cluster are removed and replaced by
2 new RAFT nodes that are NOT_IN_CONFIG.
These test cases help to determine whether the RAFT implementation performed Leader Election correctly,
in addition to the Config Change.
5. Bring all the RAFT nodes to cluster (Full Configuration).
This test case helps to determine that all the RAFT nodes eventually become consistent.
A Test Case
===========
We perform the following tasks and validate correctness for each test case:
1. Make a Config Change as specified in the test case
and validate the following properties
a. There is only 1 LEADER in the new configuration and rest of the nodes in the
configuration are FOLLOWER.
b. Value of current_term and leader_id is same for every RAFT node in the new configuration.
c. Value of prev_log_index for FOLLOWERs is =< value of prev_log_index of LEADER, this property
ensures that the right LEADER was chosen if Leader Election was performed.
d. All other RAFT nodes that are not in new configuration must have their status as NOT_IN_CONFIG
and leader_id undefined.
2. Make N number of log entries (N is randomly chosen for each test case)
and validate the following properties
a. Value of prev_log_index is same for majority of RAFT nodes (including LEADER).
b. For majority of FOLLOWERS, commit_index (FOLLOWER) is =< commit_index (LEADER)
and commit_index (FOLLOWER) >= commit_index-1 (LEADER).
c. For LEADER, value of prev_log_index after making N entries should be equal to
prev_log_index value of LEADER just after Config Change + N.
Note : if we have 5 RAFT nodes in our configuration, then a majority of nodes consists
of 1 LEADER and at least 2 FOLLOWERs.
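Property 2b above can be sketched as a small check. This is an illustration only and is not
the code used by erlang_raft_test; the module and function names are made up. It reads
"majority of FOLLOWERs" as in the Note above, that is, enough FOLLOWERs to form a majority
of the cluster together with the LEADER, which is an assumption about the intended meaning.

```erlang
-module(commit_check).
-export([follower_ok/2, majority_ok/2]).

%% Property 2b for a single FOLLOWER: its commit_index is equal to the
%% LEADER's commit_index or exactly 1 behind it.
follower_ok(LeaderCommit, FollowerCommit) ->
    FollowerCommit =< LeaderCommit andalso FollowerCommit >= LeaderCommit - 1.

%% true when enough FOLLOWERs pass follower_ok/2 to form, together with
%% the LEADER, a majority of the cluster (assumed reading of the Note).
majority_ok(LeaderCommit, FollowerCommits) ->
    ClusterSize = length(FollowerCommits) + 1,   % FOLLOWERs plus the LEADER
    Needed = ClusterSize div 2,                  % cluster majority minus the LEADER
    Ok = length([C || C <- FollowerCommits, follower_ok(LeaderCommit, C)]),
    Ok >= Needed.
```

For the manual test display shown earlier (LEADER commit_index=4, both FOLLOWERs at 3),
commit_check:majority_ok(4, [3, 3]) returns true.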
How to run automated test
=========================
Use raft_erl.sh to start the er_n1, er_n2, er_n3, er_n4 and er_n5 Erlang shells.
Use erlang_raft:start() to start a RAFT node in each Erlang shell.
Use raft_erl.sh er_nz to start another Erlang shell. This shell will be used to start the automated test.
Run erlang_raft_test:raft_test() to perform the automated test. It uses the parameters listed below.
If you want to run with a different set of parameters, you can use another function in the same module
where you can specify them as part of the function invocation.
The parameters used by this function are:
GENERATE_NODE_NAME  : true  -> the user has given only the sname for the next 2 parameters, so the
                               actual node names are generated. This is possible only when we are
                               testing on a single machine.
                      false -> the user has given actual node names.
RAFT_NODES          : sname of the nodes that participate in this test (Full Configuration).
INITIAL_NODES       : sname of the nodes that participate in the first test case.
SET_STATE_ON_FAIURE : If this value is set to true and a test case fails, the state of the RAFT
                      cluster is set to the final state of the previous successful test case.
SLEEP_TIME          : Time in milliseconds. If a Leader Election happens because of a Config Change,
                      the test needs to wait for the Leader Election to complete before it can
                      proceed with further testing.
CONFIG_CHANGE_MIN   :
CONFIG_CHANGE_MAX   : These 2 parameters are used to generate a random number in the range of the
                      MIN and MAX values. If this number is N, we generate N test cases.
                      When creating each of these N test cases, we randomly select a test scenario
                      (from the first 4 scenarios) and insert Full Configuration test cases
                      in between.
FULL_CONFIG_CHANGE  : The minimum value of this parameter should be 1.
                      This parameter denotes the number of times a Full Configuration is
                      introduced in the automated test. The last test case is always Full Configuration.
                      If this value is more than 1, we introduce (value - 1) Full Configuration
                      test cases in between the other test cases.
LOG_ENTRIES_MIN     :
LOG_ENTRIES_MAX     : These 2 parameters are used to generate a random number in the range of the
                      MIN and MAX values. This number is generated for each test case and
                      specifies the number of log entries we need to make for that test case.
RESULT_FILE_NAME    : Name of the file where the test results are saved.
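The MIN/MAX pairs above describe drawing a random number from a range. A minimal sketch of
such a draw is shown below. It is not taken from erlang_raft_test; the module and function
names are made up, and treating both MIN and MAX as inclusive is an assumption.

```erlang
-module(rand_range).
-export([between/2]).

%% Random integer N with Min =< N =< Max (both ends inclusive, assumed).
%% rand:uniform(K) returns an integer between 1 and K.
between(Min, Max) when is_integer(Min), is_integer(Max), Min =< Max ->
    Min + rand:uniform(Max - Min + 1) - 1.
```

For example, rand_range:between(LogEntriesMin, LogEntriesMax) would pick the number of log
entries for one test case, where both arguments stand for the configured parameter values.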
Once you finish your automated test run, you should see the following line under the cur_test_result
section:
config_state=er_valid_state, not_in_config_state=er_valid_state,
log_index_state=er_valid_state, log_entries_state=er_valid_state
This indicates that the RAFT implementation passed the test. If you see anything different,
please report back your results.