Abstracted Replay Data Parsers #119

cole-maclean · 2017-11-09T22:00:24Z

PR to implement an abstraction of the replay_actions module that separates running the replays from the data scraping, as per issue #51, similar to how running agents is abstracted to allow for arbitrary user-written agents. This will likely be a fairly large redesign, so the PR is currently a work in progress to get things started. The current commit is a proof of concept that abstracts a parser class out of the replay setup to run the action-stats scraping of the current replay_actions script.

cole-maclean · 2017-11-10T02:20:29Z

Ok @tewalds, do you want to take a look and see if this is on the right track? I've separated replay_actions (this name should probably change; I'm thinking parse_replays) from the actual replay parsers, which can be found in the new folder "replay_parsers". There's a base_parser class that implements the basic stats and methods needed to parse a replay. The key parsing function is parse_step, which takes obs, feat and info as input from the ReplayParser. Other optional overrides are valid_replay, for users to write their own definition of what constitutes a valid replay, and stats merging/printing if they want custom printed stats. If the parser's parse_step function returns anything, whatever is returned is appended to a data list at each step in the replay, and this data list is saved to a JSON data file for that specific replay and player ID. There's an optional data_dir argument that users can pass in to set where these files are saved, if their parser requires it.
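To make the interface described above concrete, here is a minimal sketch. It is not the code from this PR: the class bodies, the `ActionCountParser` and `run_parser` names, and the dict-shaped `obs` are assumptions for illustration. Only the method names `parse_step` and `valid_replay`, the stats merging, and the append-then-dump-to-JSON behavior come from the description.

```python
import collections
import json


class BaseParser(object):
  """Hypothetical stand-in for the base parser described above."""

  def __init__(self):
    self.replays = 0
    self.steps = 0

  def valid_replay(self, info, ping):
    """Override to define which replays are worth parsing."""
    return True

  def parse_step(self, obs, feat, info):
    """Override; a truthy return value is collected for this step."""
    return None

  def merge(self, other):
    """Combine stats gathered by another parser instance."""
    self.replays += other.replays
    self.steps += other.steps


class ActionCountParser(BaseParser):
  """Illustrative subclass: counts action ids and returns them per step."""

  def __init__(self):
    super(ActionCountParser, self).__init__()
    self.made_actions = collections.Counter()

  def parse_step(self, obs, feat, info):
    self.steps += 1
    actions = list(obs.get("actions", []))  # assumed dict-shaped observation
    self.made_actions.update(actions)
    return actions or None

  def merge(self, other):
    super(ActionCountParser, self).merge(other)
    self.made_actions.update(other.made_actions)


def run_parser(parser, observations, feat=None, info=None):
  """Collect whatever parse_step returns, as the replay runner would."""
  data = []
  for obs in observations:
    step_data = parser.parse_step(obs, feat, info)
    if step_data:
      data.append(step_data)
  return data


parser = ActionCountParser()
data = run_parser(
    parser, [{"actions": [1, 2]}, {"actions": []}, {"actions": [2]}])
serialized = json.dumps(data)  # what would be written to the JSON file
```

A user-written parser would then only need to subclass the base class and override the methods it cares about, in the same way agents are swapped in.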

I've also included the state-scraping script I wrote for a project as an example of how to parse state info from a replay and save it to files.

Let me know your thoughts, and if you think this is the right way to go I'll work on updating documentation. Could probably use some testing as well.

Thanks!
-Cole

cole-maclean · 2017-11-10T02:21:17Z

pysc2/bin/replay_actions.py

+flags.DEFINE_integer("screen_resolution", 16,
+                     "Resolution for screen feature layers.")
+flags.DEFINE_integer("minimap_resolution", 16,
+                     "Resolution for minimap feature layers.")


Added resolution args to be consistent with play.py

cole-maclean · 2017-11-10T02:23:09Z

pysc2/bin/replay_actions.py

 flags.mark_flag_as_required("replays")
+FLAGS(sys.argv)


I needed this to get around Issue #77

This should be dealt with by app.run and shouldn't be explicitly needed.

cole-maclean · 2017-11-10T02:23:41Z

pysc2/bin/replay_actions.py

+interface.feature_layer.resolution.x = FLAGS.screen_resolution
+interface.feature_layer.resolution.y = FLAGS.screen_resolution
+interface.feature_layer.minimap_resolution.x = FLAGS.minimap_resolution
+interface.feature_layer.minimap_resolution.y = FLAGS.minimap_resolution



Copied from play.py to be consistent

cole-maclean · 2017-11-10T02:24:17Z

pysc2/bin/replay_actions.py

    self.proc_id = proc_id
    self.time = time.time()
    self.stage = ""
    self.replay = ""
-    self.replay_stats = ReplayStats()
+    self.parser = parser_cls()


renamed replay_stats to parser, updated throughout code

cole-maclean · 2017-11-10T02:24:49Z

pysc2/bin/replay_actions.py

-    merge_dict(self.made_actions, other.made_actions)
-    self.crashing_replays |= other.crashing_replays
-    self.invalid_replays |= other.invalid_replays
-


merge moved to parser class

cole-maclean · 2017-11-10T02:25:52Z

pysc2/bin/replay_actions.py

@@ -192,7 +122,7 @@ def run(self):
              self._print("Empty queue, returning")
              return
            try:
-              replay_name = os.path.basename(replay_path)[:10]
+              replay_name = os.path.basename(replay_path)


Changed the 10-character truncation to the full replay name.

The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.

Okay, I've left the full replay name but I'll leave it up to you to decide on readability. The only catch is that data files are uniquely named from their replay name and will be overwritten if there are colliding names.
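The overwrite risk mentioned above can be checked before a run. A minimal sketch, assuming output files are keyed only on the replay's base name; `find_name_collisions` and its `prefix_len` parameter are hypothetical helpers, not part of pysc2, and the paths are made up:

```python
import collections
import os


def find_name_collisions(replay_paths, prefix_len=None):
  """Group replay paths whose (optionally truncated) base names collide."""
  by_name = collections.defaultdict(list)
  for path in replay_paths:
    name = os.path.basename(path)
    if prefix_len is not None:
      name = name[:prefix_len]
    by_name[name].append(path)
  # Keep only names claimed by more than one replay.
  return {name: paths for name, paths in by_name.items() if len(paths) > 1}


paths = [
    "ladder/abcdef0123456789.SC2Replay",
    "ladder/abcdef0123ffffff.SC2Replay",
    "custom/abcdef0123456789.SC2Replay",
]
full_name_collisions = find_name_collisions(paths)
truncated_collisions = find_name_collisions(paths, prefix_len=10)
```

With these made-up paths, the full names collide only for the two files that share a base name across directories, while the 10-character prefix collides for all three.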

cole-maclean · 2017-11-10T02:26:54Z

pysc2/bin/replay_actions.py

@@ -216,16 +146,16 @@ def run(self):
                  self._print("Starting %s from player %s's perspective" % (
                      replay_name, player_id))
                  self.process_replay(controller, replay_data, map_data,
-                                      player_id)
+                                      player_id,info,replay_name)


passing info and replay_name to process_replay to allow user to parse info data, replay_name for file naming

cole-maclean · 2017-11-10T02:27:19Z

pysc2/bin/replay_actions.py

-      for ability_id in feat.available_actions(obs.observation):
-        self.stats.replay_stats.valid_actions[ability_id] += 1
-
-      if obs.player_result:


action stats parsing moved to custom ActionParser class

cole-maclean · 2017-11-10T02:27:55Z

pysc2/bin/replay_actions.py

+        if data:
+          data_file = FLAGS.data_dir + replay_name + "_" + str(player_id) + '.json'
+          with open(data_file,'w') as outfile:
+            json.dump(data,outfile)


Added saving to a file on disk at the end of the replay if the custom parser's parse_step returns data
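A minimal sketch of the save step, assuming the same `<replay_name>_<player_id>.json` naming as the diff above. It uses `os.path.join` rather than string concatenation so that a `data_dir` without a trailing separator still works. `save_replay_data` is a hypothetical helper name, not something in the PR.

```python
import json
import os
import tempfile


def save_replay_data(data, data_dir, replay_name, player_id):
  """Write per-step data to <data_dir>/<replay_name>_<player_id>.json.

  Returns the file path, or None when there is nothing to save.
  """
  if not data:
    return None
  file_name = "%s_%s.json" % (replay_name, player_id)
  data_file = os.path.join(data_dir, file_name)
  with open(data_file, "w") as outfile:
    json.dump(data, outfile)
  return data_file


# Example run against a throwaway directory.
out_dir = tempfile.mkdtemp()
saved_path = save_replay_data([[1, 2], [3]], out_dir, "example.SC2Replay", 1)
skipped = save_replay_data([], out_dir, "example.SC2Replay", 2)
```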

cole-maclean · 2017-11-10T02:29:42Z

pysc2/replay_parsers/state_parser.py

+					friendly_army,enemy_army,all_features['player'].tolist(),
+					all_features['available_actions'].tolist(),actions,winner,
+					race,enemy_race]
+		return full_state


Whatever this function returns is appended to a per-step data list, which is saved to a file for this replay and player perspective

tewalds · 2017-11-10T16:43:55Z

pysc2/bin/replay_actions.py

 FLAGS = flags.FLAGS
 flags.DEFINE_integer("parallel", 1, "How many instances to run in parallel.")
 flags.DEFINE_integer("step_mul", 8, "How many game steps per observation.")
 flags.DEFINE_string("replays", None, "Path to a directory of replays.")
+flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser",
+                    "Which agent to run")


update help text.

tewalds · 2017-11-10T16:47:06Z

pysc2/bin/replay_actions.py

 FLAGS = flags.FLAGS
 flags.DEFINE_integer("parallel", 1, "How many instances to run in parallel.")
 flags.DEFINE_integer("step_mul", 8, "How many game steps per observation.")
 flags.DEFINE_string("replays", None, "Path to a directory of replays.")
+flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser",
+                    "Which agent to run")
+flags.DEFINE_string("data_dir", "C:/",


Please don't set a default value that will be wrong or invalid for a large number of users. C:/ doesn't exist on linux, and any default location will be wrong for someone.

Okay, I've changed the default to be None, and data is only saved to a file if the user supplies a data_dir. Another option would be to use os.getcwd() and use the current working directory as the default directory.
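The two defaults discussed above can be sketched as follows. `resolve_data_dir` and its `fall_back_to_cwd` parameter are hypothetical names; the point is only that `None` means "do not save" unless the caller explicitly opts into the current working directory.

```python
import os


def resolve_data_dir(data_dir, fall_back_to_cwd=False):
  """Return the directory to save into, or None if saving is disabled."""
  if data_dir:
    return data_dir
  if fall_back_to_cwd:
    return os.getcwd()
  return None


no_saving = resolve_data_dir(None)
explicit = resolve_data_dir("replay_data")
cwd_default = resolve_data_dir(None, fall_back_to_cwd=True)
```

Keeping `None` as the default avoids writing files somewhere the user did not ask for, which is the concern raised in the review.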

tewalds · 2017-11-10T16:51:31Z

pysc2/bin/replay_actions.py

@@ -192,7 +122,7 @@ def run(self):
              self._print("Empty queue, returning")
              return
            try:
-              replay_name = os.path.basename(replay_path)[:10]
+              replay_name = os.path.basename(replay_path)


The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.

tewalds · 2017-11-10T16:52:13Z

pysc2/bin/replay_actions.py

@@ -216,16 +146,16 @@ def run(self):
                  self._print("Starting %s from player %s's perspective" % (
                      replay_name, player_id))
                  self.process_replay(controller, replay_data, map_data,
-                                      player_id)
+                                      player_id,info,replay_name)


Please use the google python style guide: spaces between args.

tewalds · 2017-11-10T16:52:53Z

pysc2/bin/replay_actions.py

@@ -237,7 +167,7 @@ def _update_stage(self, stage):
    self.stats.update(stage)
    self.stats_queue.put(self.stats)

-  def process_replay(self, controller, replay_data, map_data, player_id):
+  def process_replay(self, controller, replay_data, map_data, player_id,info,replay_name):


more spaces for style guide

tewalds · 2017-11-10T17:51:20Z

pysc2/replay_parsers/state_parser.py

+def calc_armies(screen):
+	friendly_army = []
+	enemy_army = []
+	unit_list = np.unique(screen[6])


Do not hardcode these numbers. Look them up in features.SCREEN_FEATURES. That's both for readability and so they don't break if we add new layers.

tewalds · 2017-11-10T17:52:05Z

pysc2/replay_parsers/state_parser.py

+class StateParser(base_parser.BaseParser):
+	"""Action statistics parser for replays."""
+	def valid_replay(self,info, ping):
+		return True


ahh yeup. Removed.

tewalds · 2017-11-10T17:59:59Z

pysc2/replay_parsers/state_parser.py

+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Action statistics parser for replays."""


Looking through this file it may be better to put it in your repo. The action stats is already a little bit complicated as an example. It'd be nice to have a simpler example. Some ideas:

distribution of steps between actions, which could be useful for finding reasonable APMs and step_mul

distribution of APM and MMRs to graph APM vs MMR
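As a rough illustration of the first idea, the gaps between consecutive actions can be tallied like this. It is a sketch under the assumption that a parser has already recorded the game-step index of each action; `steps_between_actions` is a hypothetical helper and the step values are made up.

```python
import collections


def steps_between_actions(action_steps):
  """Histogram of gaps (in game steps) between consecutive actions."""
  ordered = sorted(action_steps)
  gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
  return collections.Counter(gaps)


# Game-step indices at which one player issued an action (made-up data).
action_steps = [8, 16, 24, 48, 56, 80]
gap_histogram = steps_between_actions(action_steps)
most_common_gap, count = gap_histogram.most_common(1)[0]
```

The most common gap gives a first hint at a step_mul that would not skip over many actions.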

Agreed, updated example to just scrape the General player information layer from each step, along with the winning player. Could be used to build a winner prediction model based on resource/supply/army size data as the game progresses.

tewalds · 2017-11-10T18:16:14Z

pysc2/replay_parsers/base_parser.py

+import collections
+import six
+
+class BaseParser(object):


Have you read about MapReduce? I haven't done enough research, but consider whether that would be a good framework to use here.

Unfortunately no, not currently in my wheelhouse, but might be a good opportunity to dig into it a little.
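For context, the per-replay parse followed by a stats merge already has a map/reduce shape. The toy sketch below shows that shape in plain Python; it is not a distributed framework and not a proposal for the actual design, and the function names and action ids are made up.

```python
import collections
import functools


def map_replay(replay_actions):
  """Map step: turn one replay's action list into partial counts."""
  return collections.Counter(replay_actions)


def reduce_stats(total, partial):
  """Reduce step: merge partial counts into the running total."""
  total.update(partial)
  return total


replays = [[1, 2, 2], [2, 3], [1]]  # made-up action ids per replay
partials = map(map_replay, replays)
totals = functools.reduce(reduce_stats, partials, collections.Counter())
```

Because the map step touches one replay at a time and the reduce step is an order-independent sum, the map calls could be spread across processes without changing the totals.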

tewalds · 2017-11-10T18:17:08Z

pysc2/bin/replay_actions.py

 flags.mark_flag_as_required("replays")
+FLAGS(sys.argv)


This should be dealt with by app.run and shouldn't be explicitly needed.

cole-maclean · 2017-11-10T20:45:02Z

pysc2/bin/replay_actions.py

+        # Save scraped replay data to file at end of replay if parser returns
+        # and data_dir provided        
+        if data:
+          if FLAGS.data_dir:


Data is only saved to a file if the user provides data_dir

cole-maclean · 2017-11-10T22:03:45Z

Alright, @tewalds can you take a look when you get a chance? Think I hit all your comments. I also added some documentation about replays and parsers.

tewalds · 2017-11-11T07:15:23Z

I haven't looked at the recent changes, but it's probably also worth renaming replay_actions.py to something more general like process_replays.py or similar. Also, thanks for adding documentation. If you have a reasonable way to do it, some unit tests would be appreciated as well.

cole-maclean · 2017-11-12T20:34:31Z

Agreed, updated to process_replays. I'll put some thought into testing.

cole-maclean · 2017-11-17T20:31:40Z

pysc2/bin/process_replays.py

+              self._print("Empty queue, returning")
+              return
+            try:
+              self.load_replay(replay_path, controller, ping)


replay loading refactored into own method 'load_replay'

cole-maclean · 2017-11-17T20:32:20Z

pysc2/bin/process_replays.py

+        return
+
+  def load_replay(self, replay_path, controller, ping):
+    replay_name = os.path.basename(replay_path)


refactored load_replay method to modularize replay processing

cole-maclean · 2017-11-17T20:38:06Z

Ok @tewalds, I've added some tests in replay_parser_test. They're probably not comprehensive, but they should provide a good framework for building up future tests if needed, and they closely follow the design of the other test scripts in the repo. I think at this point this closes out a first pass for this functionality. We can discover and experiment with improvements (e.g. MapReduce) in a future PR. Let me know if you have any feedback or see some snags! Thanks Timo.
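For readers who want a feel for what a parser test can look like, here is a minimal sketch using the standard `unittest` module. It is not the contents of replay_parser_test: `CountingParser` is a hypothetical parser defined only so the tests have something to check, and no replay or game binary is involved.

```python
import collections
import unittest


class CountingParser(object):
  """Hypothetical parser used only to give the tests something to check."""

  def __init__(self):
    self.made_actions = collections.Counter()

  def parse_step(self, obs, feat, info):
    self.made_actions.update(obs["actions"])
    return obs["actions"] or None

  def merge(self, other):
    self.made_actions.update(other.made_actions)


class CountingParserTest(unittest.TestCase):

  def test_parse_step_returns_none_without_actions(self):
    parser = CountingParser()
    self.assertIsNone(parser.parse_step({"actions": []}, None, None))

  def test_merge_adds_counts(self):
    first, second = CountingParser(), CountingParser()
    first.parse_step({"actions": [1, 2]}, None, None)
    second.parse_step({"actions": [2]}, None, None)
    first.merge(second)
    self.assertEqual(dict(first.made_actions), {1: 1, 2: 2})


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountingParserTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```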

cole-maclean · 2017-11-17T20:40:34Z

pysc2/replay_parsers/action_parser.py

+    merge_dict(self.made_actions, other.made_actions)
+
+  def valid_replay(self, info, ping):
+    """Make sure the replay isn't corrupt, and is worth looking at."""


Lowered the valid_replay thresholds so that the test replay passes. The checks can stand as an example for users who wish to make their valid_replay checks more aggressive.
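One way to keep lenient defaults while still letting users tighten the checks is to expose the thresholds as parameters. A minimal sketch, with the caveat that the `ReplayInfo` and `PlayerInfo` tuples are stand-ins for the real replay info object, and that the parameter names and threshold values are assumptions for illustration:

```python
import collections

# Stand-ins for the replay info fields used below; the real objects differ.
PlayerInfo = collections.namedtuple("PlayerInfo", ["apm", "mmr"])
ReplayInfo = collections.namedtuple(
    "ReplayInfo", ["game_duration_loops", "players"])


def valid_replay(info, min_loops=0, min_apm=0, min_mmr=0, num_players=2):
  """Return True if the replay clears every (configurable) threshold."""
  if info.game_duration_loops < min_loops:
    return False
  if len(info.players) != num_players:
    return False
  return all(p.apm >= min_apm and p.mmr >= min_mmr for p in info.players)


test_replay = ReplayInfo(
    game_duration_loops=500,
    players=[PlayerInfo(apm=5, mmr=0), PlayerInfo(apm=7, mmr=0)])

lenient = valid_replay(test_replay)
strict = valid_replay(test_replay, min_loops=1000, min_apm=10, min_mmr=1000)
```

Here the short, low-APM test replay passes with the defaults and is rejected once the stricter thresholds are supplied.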

cole-maclean added 7 commits November 7, 2017 15:26

added SetProcessDPIAware() to fix screen size issue for windows users

bffd088

abstarting parser class

1a78352

replar parser abstracted, analog to agents abstraction

3eb8869

further abstraction of replay processing

1bdf502

action stats scrapper fully abstracted into seperate class using base parser class

d31a9e9

updated args to allow user inputter feature resolutions, same as play module

ad28d78

added state scrapper example and ability to save to file

09e3274

tewalds requested changes Nov 10, 2017

cole-maclean added 2 commits November 10, 2017 13:36

updates as per @tewalds review

6c6caf4

remove local flags workaround

bf4ad86

cole-maclean added 6 commits November 10, 2017 13:46

trailing line

e642f72

trailing line

a25dff6

renamed state parser to player_info_parser

5dd050b

added documentation for replay parsing

6fa4d1e

remove state-parser example

9969d86

doc edits and typo fixes

b65ed0d

rename StateParser to PlayerInfoParser

677ec46

cole-maclean added 2 commits November 12, 2017 13:31

updated replay_actions to process_replays and documentation references

e9e4924

fix typo

9b62b47

cole-maclean added 5 commits November 17, 2017 10:42

building out unit tests for replay parsing

ea30f55

refactored process_replays to have load_replays method for single replay loading

be6cda9

removed required flag for replays argument, default is string 'None' and caught in path doesn't exist exception

c72f8a9

added unit tests for replay parsing

6aa04aa

remove debug lines

14d732d

cole-maclean changed the title ~~[WIP] Abstracted Replay Data Parsers~~ Abstracted Replay Data Parsers Nov 17, 2017

cole-maclean mentioned this pull request Nov 18, 2017

Replay logger script cole-maclean/autocraft#1

Closed
