
Implemented #221, called tronstore #258

Merged
merged 53 commits into Yelp:release_0.6.2 on Aug 13, 2013

Conversation

Codeacious

This is, hopefully, a very-close-to-final version of tronstore, a new process that runs as a child alongside trond and parallelizes state serialization for #221. Trond will now spawn this process as part of the PersistenceManagerFactory by default, rather than independently creating each store type class.

Here's a basic overview of what's happened structurally:

  • There's now a new folder in tron/serialize/runstate, called tronstore, which holds all the new files related to tronstore.
  • PersistenceManagerFactory now spawns only ParallelStore objects, a new top-level class that implements the same front-facing methods as the old *Store objects (ShelveStore, YamlStore, etc.), as well as one new front-facing method, update_from_config.
  • The PersistentStateManager now has an update_from_config method that calls ParallelStore's update_from_config method, and StateChangeWatcher.update_from_config has been updated to use this.
  • The ParallelStore object creates a StoreProcessProtocol object (located in tronstore/process.py) that handles all low-level process management and communication. The two are somewhat coupled: StoreProcessProtocol assumes that everything sent to tronstore is either a StoreRequest or a StoreResponse (more on that later), and it relies on the ParallelStore object to update its StoreRequestFactory whenever the configuration is updated.
  • The StoreProcessProtocol object works by using the Python multiprocessing module. It spawns tronstore.main() from tronstore/tronstore.py as a daemonized process, passing in a Pipe object for message communication. Pipes, by default, use pickle (not cPickle) for serialization, so instead there are factories for request and response objects, which are serialized based on the Tron configuration given at runtime; the raw, serialized text is what is actually sent over the Pipe object. Pipes work somewhat like a two-way Queue, so there's no need to deal with chunking (despite some chunking-related code left over in tronstore/chunking.py). A minimal sketch of this message flow is included after this list.
  • The two factory objects, StoreRequestFactory and StoreResponseFactory, live in tronstore/messages.py. They deal with constructing and reconstructing StoreRequests and StoreResponses to and from tronstore, respectively. There's not too much here: just an id for matching requests and responses together (which is actually done by the StoreProcessProtocol) and a serialization class mapping (the actual serialization/deserialization methods live in tronstore/transport.py; the mapping references those classes).
  • There are four types of requests, with enumerators in tronstore/msg_enums.py: SAVE, RESTORE, CONFIG, and SHUTDOWN. Which type is sent is controlled by ParallelStore, which constructs requests according to whatever method was called. SAVE is a non-blocking request, while RESTORE, CONFIG, and SHUTDOWN are blocking. Blocking requests time out, after which a log entry is written and a failure is returned.
  • tronstore itself works by polling for requests on the Pipe. When a request is received, it usually delegates a thread to handle and serve the request. It puts the thread in a queue, which is then actually started by another thread that acts as a thread pool (it simply makes sure more than a certain number of threads aren't already running). If the request was a SHUTDOWN or CONFIG request, tronstore will wait until all other requests have finished, and then take appropriate action. Responses from all requests are sent back over the Pipe to trond.
  • There are signal handlers in tronstore for SIGINT, SIGTERM, and SIGHUP, which are the three signals also registered to custom handlers in trond. The handlers do nothing, so that trond can take the correct, complete action when one of these signals is received. This is necessary because signals are sent to all processes in a process group, and tronstore is spawned in the same group as trond. To allow for multiple processes with signal handling, a flag, installSignalHandlers, had to be switched off on the Twisted reactor. (A sketch of these no-op handlers also appears after the list.)
  • There are a couple of new configuration options, both living in the state_persistence section of the config, called transport_method and db_store_method. Both must be set to one of json, msgpack, pickle, or yaml, but db_store_method isn't actually enforced unless store_type is set to sql or mongo. store_type is still the same as before, but is now instead used by tronstore to set up its own internal store object (the new store objects are located in tronstore/store.py). transport_method determines what serialization method tronstore uses to send requests and responses, while db_store_method determines what serialization method tronstore uses to store data within a database (only for SQLAlchemy and MongoDB configurations, obviously). Everything else in the state_persistence section is effectively the same as before.
  • There is a script in tools/ that will migrate the stored state of Tron from 0.6.1 to 0.6.2. It's got a really long docstring describing how to use it. The old migrate_state.py script has been updated to migrate state for tronstore state objects.
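To make the message flow concrete, here is a minimal, hypothetical sketch of a blocking request sent over a Pipe to a daemonized child, in the spirit of what's described above. The JSON serialization, the enum values, and the tronstore_main stub are illustrative stand-ins, not the actual tronstore/messages.py or tronstore.py code.

```python
import json
from multiprocessing import Pipe, Process

# Illustrative stand-ins for the enumerators in tronstore/msg_enums.py.
SAVE, RESTORE, CONFIG, SHUTDOWN = range(4)

def tronstore_main(conn):
    """Toy stand-in for tronstore.main(): poll the Pipe for serialized
    requests and send a serialized response back for each one."""
    while True:
        if not conn.poll(1.0):
            continue
        request = json.loads(conn.recv_bytes().decode('utf-8'))
        # A real implementation would hand SAVE/RESTORE work off to a
        # worker thread; here every request gets an immediate success reply.
        response = {'id': request['id'], 'success': True, 'data': None}
        conn.send_bytes(json.dumps(response).encode('utf-8'))
        if request['req_type'] == SHUTDOWN:
            return

if __name__ == '__main__':
    trond_end, store_end = Pipe()
    store_process = Process(target=tronstore_main, args=(store_end,))
    store_process.daemon = True      # tronstore runs as a daemonized child
    store_process.start()

    # A blocking request: serialize it, send the raw bytes over the Pipe,
    # and wait (with a timeout) for the matching response.
    request = {'id': 1, 'req_type': SHUTDOWN, 'data': None}
    trond_end.send_bytes(json.dumps(request).encode('utf-8'))
    if trond_end.poll(5.0):          # blocking requests time out
        response = json.loads(trond_end.recv_bytes().decode('utf-8'))
        assert response['id'] == request['id']
    store_process.join()
```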
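And since signals are delivered to every process in the group, the no-op handlers in tronstore could look roughly like this (only the handler name and the three signals come from the description above; the body is a sketch):

```python
import signal

def _register_null_handlers():
    # trond installs its own handlers for these three signals; tronstore
    # ignores them so that trond alone drives shutdown/reload, even though
    # each signal is delivered to every process in the group.
    null_handler = lambda signum, frame: None
    for signum in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
        signal.signal(signum, null_handler)
```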

@@ -426,7 +438,7 @@ def validate_jobs_and_services(config, config_context):
     config_utils.unique_names(fmt_string, config['jobs'], config['services'])


-DEFAULT_STATE_PERSISTENCE = ConfigState('tron_state', 'shelve', None, 1)
+DEFAULT_STATE_PERSISTENCE = ConfigState('tron_state', 'shelve', 'pickle', 1, 'json', 'pickle')
Contributor

This first pickle should still be None. It's ignored so we don't want it set to anything.

@Codeacious
Author

Okay, so the post at the top is still pretty accurate, but here's what's changed:

  • We now use cPickle as the default method for serializing messages across the Pipe object, instead of letting it be configurable. Normally, Pipe objects use pickle, but cPickle is faster and we already had a simple, clean class for serializing request and response objects. As such, the transport_method configuration flag is now gone.
  • There was some renaming of the message transport/serialization objects. They're now in serialize.py, and have been renamed to use the word Serializer instead of Transport where appropriate.
  • The update_from_config method in the StateChangeWatcher has been updated to not rely on an object already being created. This was coupled with a change so that we can spawn StateChangeWatcher objects at any time; they are created with a null configuration, and update_from_config must be called on the object before it is actually usable. This is in line with other objects in Tron, which initialize with null objects until configured. To allow for this, null objects had to be created all the way down into tronstore.py.
  • Because of the above change, the PersistenceManagerFactory was now completely useless, as we aren't configuring or creating anything other than null objects when creating the PersistentStateManager. It has been removed.
  • chunking.py was deleted. It wasn't used anywhere.
  • The SQLAlchemy store mechanism now encodes all data via unicode(repr(serialize(data))), where serialize is the serialization function assigned to a SQLStore instance, and calls deserialize(eval(str(data))) when loading from a SQL table. This is to ensure that all strings passed through SQL databases are Unicode, which SQLAlchemy warned about in testing. <-- THIS WAS FIXED, IGNORE. In addition, the serialization method used to encode data in any table entry is now saved along with the encoded data, and SQLStore ensures that it uses the right serialization method when reading it back. This way, the serialization method can be safely changed via reconfiguration with SQL state storing configurations. Right now, Tronstore silently handles this, but there is a TODO about emitting a log entry (to be done once Tronstore can log).
  • There were two fixes in job.py that weren't really related to this branch, but are in here. The first is a simple addition to JobState.status that returns STATUS_RUNNING if there's an ActionRun in the STATE_STARTING state. Previously, the job would show up in an unknown state in tronview if it was polled in the brief moment between a run being started and a run actually running.
  • The second job.py fix is to the restore_state functions of JobScheduler and JobContainer. The JobContainer function now has its job_scheduler object watch all of the JobRun objects that were restored; this happened in the old (less than 3 months old) version of the code, but I missed it in "First pass at refactoring the Job class structure" (#253). In addition, there is some extra logic in JobScheduler.restore_state that starts the first queued run for a Job if there were no scheduled ones. This change means that Jobs that had a queued run when Tronstore saved/exited will actually continue to run properly once Tron starts up again.
  • The JSON and MsgPack serializers have changed to use custom encoders and decoders; this is to deal with datetime objects, which can't be encoded by default. JSON also preserves tuples and lists properly, but MsgPack spits everything out as an immutable tuple. (A minimal datetime-aware JSON encoder/decoder sketch follows this list.)
  • There's an extra case in version checking that handles version info having been saved as a tuple rather than a list.
  • tronstore.py was refactored. See below.
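For reference, a datetime-aware JSON round trip along the lines described above might look like the sketch below. The marker key and timestamp format are assumptions for illustration, not the actual encoder/decoder in serialize.py.

```python
import json
from datetime import datetime

DATETIME_KEY = '__datetime__'   # hypothetical marker key

class DateTimeEncoder(json.JSONEncoder):
    """Encode datetime objects, which the stock JSON encoder rejects."""
    def default(self, obj):
        if isinstance(obj, datetime):
            return {DATETIME_KEY: obj.strftime('%Y-%m-%dT%H:%M:%S.%f')}
        return super(DateTimeEncoder, self).default(obj)

def decode_datetime(obj):
    """object_hook that turns the marker dict back into a datetime."""
    if DATETIME_KEY in obj:
        return datetime.strptime(obj[DATETIME_KEY], '%Y-%m-%dT%H:%M:%S.%f')
    return obj

state = {'run_time': datetime(2013, 8, 13, 12, 30, 0)}
encoded = json.dumps(state, cls=DateTimeEncoder)
decoded = json.loads(encoded, object_hook=decode_datetime)
assert decoded['run_time'] == state['run_time']
```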

@Codeacious
Author

tronstore.py was refactored into a much more class-oriented structure. Here's an overview of how it looks now:

  • We still start with tronstore.main() with the same arguments. However, this method is now only three lines, and just registers the null signal handlers (_register_null_handlers), spawns a TronstoreMain object, and then calls main_loop on that object.
  • TronstoreMain is a new, main class for Tronstore. It keeps track of everything the store process needs and supports a main_loop method for running it. It takes a configuration object (which should be the state_persistence configuration object from Tron) and a Pipe object (realistically, this could be any socket-like object that supports the send_bytes and recv_bytes methods). On creation, TronstoreMain makes one SyncPipe, one SyncStore, and one TronstorePool object.
  • The SyncPipe object is a simple class with one mutex and its own send_bytes and recv_bytes functions; each calls the underlying Pipe method with the same args and kwargs, but holds the mutex for the duration of the call to avoid contention.
  • The SyncStore object is similar to the SyncPipe object, but also creates the needed store.py object for the passed configuration. It locks over all save and restore requests. This could probably be better, but due to the modular nature of the store.py implementations, either we specifically implement synchronization in each of the store classes, or we simply don't allow multiple operations on them at the same time; I opted for the latter for simplicity and practicality.
  • The TronstorePool object is a thread pool that spawns TronstorePool.POOL_SIZE worker threads, which attempt to consume work from a native Python Queue (which is synchronized by default). It has enqueue_work and has_work functions to support this. Note that the thread pool won't actually be running until start is called (and it can be stopped with stop). It uses a PoolBool class to signal all threads to stop; PoolBool is just a simple wrapper class that allows us to have a mutable boolean. Workers in the pool are still run on the tronstore.handle_request function, and continuously attempt to grab work from the queue and perform it. Workers should only get save and restore requests, as TronstoreMain handles config and shutdown requests. (A rough sketch of SyncPipe and TronstorePool follows this list.)

-        scheduled = self.job_runs.get_scheduled()
+        scheduled = [run for run in self.job_runs.get_scheduled()]
         for job_run in scheduled:
             self._set_callback(job_run)
Contributor

If this is just to get a list,

scheduled = list(self.job_runs.get_scheduled())

is probably more appropriate.

dnephin added a commit that referenced this pull request Aug 13, 2013
dnephin merged commit 804c332 into Yelp:release_0.6.2 on Aug 13, 2013