
v0.2.47..v0.2.48 changeset GenericLineConflation.asciidoc

diff --git a/docs/algorithms/GenericLineConflation.asciidoc b/docs/algorithms/GenericLineConflation.asciidoc
index 81c456b..27d2708 100644
--- a/docs/algorithms/GenericLineConflation.asciidoc
+++ b/docs/algorithms/GenericLineConflation.asciidoc
@@ -87,21 +87,21 @@ See also:
 
 [[coastlines]]
 .Coastlines
-image::algorithms/images/Coastlines.png[Coastlines at two scales,scalewidth="50%"]
+image::images/Coastlines.png[Coastlines at two scales,scalewidth="50%"]
 
 In <<coastlines>> is an example of two coastlines extracted at two different
 scales. While there are obvious similarities, there are also significant
 differences, especially with regard to the angular differences. Using a standard
 road conflation routine with coastlines would likely produce sub-optimal
 results, but conflating with a user-derived set of rules could perform quite
-well. 
+well.
 
 === Approach
 
 For the next phase of development we propose creating a new JavaScript API for
 conflating linear data. This new approach will support some of the more complex
 linear conflation situations such as one-to-many, many-to-many, and partial
-matches. In the following sections we will present an approach that provides the 
+matches. In the following sections we will present an approach that provides the
 mechanisms for defining the following functionality:
 
 * Designate line candidate bounds
@@ -244,14 +244,14 @@ performance. The `perty-test` command is used with the following configuration:
 }
 -----
 
-During each test run the `perty.systematic.error.x` and 
+During each test run the `perty.systematic.error.x` and
 `perty.systematic.error.y` values are modified to vary the amount of error in
 the tests.
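
Below is a minimal Python sketch of how the test matrix described above could be generated. The specific error values, the two-axis sweep, and the surrounding driver code are illustrative assumptions; only the `perty.systematic.error.x` and `perty.systematic.error.y` option names come from the configuration shown above.

[source,python]
-----
import itertools

# Hypothetical sweep of the PERTY systematic error settings.  The values
# below are placeholders, not the ones used in the actual test matrix.
error_values = [5.0, 10.0, 20.0, 40.0]

test_configs = [
    {
        "perty.systematic.error.x": ex,
        "perty.systematic.error.y": ey,
    }
    for ex, ey in itertools.product(error_values, repeat=2)
]

for cfg in test_configs[:3]:
    print(cfg)
print(f"{len(test_configs)} configurations in total")
-----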
 
 ==== Jakarta Easy Test
 
 This test scenario uses manually conflated data in a simple region of
-Jakarta as a baseline for evaluation. The methods described in the 
+Jakarta as a baseline for evaluation. The methods described in the
 <<Evaluation,Evaluation>> section are used for comparison. A higher value means
 closer agreement with the manually conflated data.
 
@@ -276,7 +276,7 @@ are merged.
 
 [[GenericConflationQuality]]
 .Conflation Quality
-image::algorithms/images/GenericLineTestGraph.png[Random Forest vs. Generic Line Road Conflation Performance,scalewidth="50%"]
+image::images/GenericLineTestGraph.png[Random Forest vs. Generic Line Road Conflation Performance,scalewidth="50%"]
 ////
 #TODO: replace with MPL - #267
 #[gnuplot, algorithms/GenericLineTestGraph.png]
@@ -322,7 +322,7 @@ for this requires more investigation.
 
 [[Conflation Speed]]
 .Conflation Speed
-image::algorithms/images/GenericLineTestTimingGraph.png[Random Forest vs. Generic Line Road Conflation Timing,scalewidth="50%"]
+image::images/GenericLineTestTimingGraph.png[Random Forest vs. Generic Line Road Conflation Timing,scalewidth="50%"]
 ////
 #TODO: replace with MPL - #267
 #[gnuplot, algorithms/GenericLineTestTimingGraph.png]
@@ -358,56 +358,56 @@ differences.
 
 ==== Overview
 
-A variant of generic line conflation specific to conflating rivers (linear waterways) is available within 
+A variant of generic line conflation specific to conflating rivers (linear waterways) is available within
 Hootenanny.  A river conflation model was developed based on test scenarios using manually matched data
 in the regions of Haiti and San Diego, USA as a baseline for evaluation.
 
 ==== Approach
 
-The goal for the initial conflation of rivers was that the number of correctly conflated features plus 
-the number of features marked for manual review would equal at least 80% of the overall conflated 
+The goal for the initial conflation of rivers was that the number of correctly conflated features plus
+the number of features marked for manual review would equal at least 80% of the overall conflated
 features in each test dataset.  It is very likely that over time a higher accuracy than 80% could be
-achieved with Hootenanny, however, this seemed a realistic goal for the initial implementation of generic river 
-conflation.  An attempt was made to have a as close to a minimum of 200 manually matched features as possible in each dataset, 
-while keeping datasets small enough that a single test against them could be run in roughly ten minutes or 
-less.  One dataset at a time was tested against until the conflation performance goal was reached before moving 
-onto testing against additional datasets.  After all datasets were tested against, a final model 
-was constructed that performed acceptably against all tested datasets.  
-
-The original plan was to test against three datasets.  It was later found that the third acquired dataset, rivers in 
+achieved with Hootenanny; however, this seemed a realistic goal for the initial implementation of generic river
+conflation.  An attempt was made to have as close to a minimum of 200 manually matched features as possible in each dataset,
+while keeping datasets small enough that a single test against them could be run in roughly ten minutes or
+less.  One dataset at a time was tested against until the conflation performance goal was reached before moving
+on to testing against additional datasets.  After all datasets were tested against, a final model
+was constructed that performed acceptably against all tested datasets.
+
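As a quick illustration of the success criterion above, the combined measure is simply the percentage of correctly conflated features plus the percentage flagged for review. A minimal sketch follows, using the San Diego row from the results table later in this section.

[source,python]
-----
def combined_score(percent_correct, percent_review):
    """Combined measure used for the 80% goal: correctly conflated plus
    flagged-for-review, as percentages of all conflated features."""
    return percent_correct + percent_review

# San Diego river test from the results table: 55.3% correct + 17.0%
# reviewable = 72.3% combined, which falls short of the 80% goal.
print(combined_score(55.3, 17.0))
-----
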
+The original plan was to test against three datasets.  It was later found that the third acquired dataset, rivers in
 Mexico, contained longer rivers with higher node counts such that the subline matching routine took
 unreasonable amounts of time to complete.  Optimizing the subline matching to address this issue requires work outside of
-the scope of this initial river conflation task (see "Future Work" section), therefore, 
+the scope of this initial river conflation task (see "Future Work" section), therefore,
 the third dataset was not tested against.
 
-During testing, an optimal search radius (controlled by the "error:circular" attribute) was first determined empirically for each 
-river dataset.  After testing, the capability to automatically calculate this value was added, so 
+During testing, an optimal search radius (controlled by the "error:circular" attribute) was first determined empirically for each
+river dataset.  After testing, the capability to automatically calculate this value was added, so
 manually determining it is no longer necessary, though it is still allowed in case the auto-calculation does
 not provide an acceptable value.
 
 Initially during testing, reference rubber sheeting was used to bring the OSM river data towards the dataset it was being
-conflated with.  This helped increase the conflation accuracy.  Later during testing, the addition 
-of the automatically calculated search radius used without rubber sheeting provided even better results. 
-Using the automatically calculated search radius precludes rubber sheeting of the input data since 
+conflated with.  This helped increase the conflation accuracy.  Later during testing, using the
+automatically calculated search radius without rubber sheeting provided even better results.
+Using the automatically calculated search radius precludes rubber sheeting of the input data since
 tie points from the rubber sheeting algorithm are used to calculate the search radius.
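
The sketch below illustrates one plausible way a search radius could be derived from rubber-sheeting tie points, for example as a function of the tie-point offset distances. The formula and multiplier are assumptions made for illustration only and are not necessarily Hootenanny's actual calculation.

[source,python]
-----
import math

def estimated_search_radius(tie_points, sigma_multiplier=2.0):
    """Illustrative only: derive a search radius from rubber-sheeting tie
    points as the mean tie-point offset plus a multiple of the standard
    deviation.  Hootenanny's actual calculation may differ."""
    offsets = [math.dist(a, b) for a, b in tie_points]
    mean = sum(offsets) / len(offsets)
    variance = sum((d - mean) ** 2 for d in offsets) / len(offsets)
    return mean + sigma_multiplier * math.sqrt(variance)

# Example: three tie points with offsets of 5, 7, and 9 map units.
ties = [((0, 0), (3, 4)), ((10, 0), (10, 7)), ((0, 10), (9, 10))]
print(round(estimated_search_radius(ties), 1))
-----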
 
 Several features were extracted from the river data tested against to help determine which ones
 could be used to most accurately conflate the data.  Weka (<<hall2009>>) was used to build models
-based on extracted features.  After running many tests, the two most influential features were found 
-to be weighted shape distance and a sampled angle histogram value.  Those features were used to 
-derive a model for conflating the river data.  
+based on extracted features.  After running many tests, the two most influential features were found
+to be weighted shape distance and a sampled angle histogram value.  Those features were used to
+derive a model for conflating the river data.
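
Purely as an illustration of how two extracted feature values might feed a match decision, a toy linear scorer is sketched below; the weights, bias, and threshold are invented and bear no relation to the coefficients of the actual Weka-derived model.

[source,python]
-----
def is_river_match(weighted_shape_distance, sampled_angle_hist_diff,
                   w1=-1.0, w2=-1.5, bias=1.2):
    """Toy two-feature scorer: smaller feature distances push the score up,
    and a positive score is treated as a match.  Hypothetical values only."""
    score = bias + w1 * weighted_shape_distance + w2 * sampled_angle_hist_diff
    return score > 0.0

print(is_river_match(0.2, 0.1))  # similar lines -> True
print(is_river_match(0.9, 0.8))  # dissimilar lines -> False
-----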
 
-Weighted shape distance is similar to Shape Distance as described in <<savary2005>>.  
+Weighted shape distance is similar to Shape Distance as described in <<savary2005>>.
 
-Angle histogram extraction works by calculating the angle of each line segment in a line string 
-and adding that angle to a histogram where the weight is the length of the line segment.  
-A histogram is built up in this way for both input lines, then normalized, and the difference calculated.  
-To conflate rivers, a sampled angle histogram value was derived, which allows for specifying a configurable 
-distance along each line segment to sample the angle value.  The distance from the sampled location to 
+Angle histogram extraction works by calculating the angle of each line segment in a line string
+and adding that angle to a histogram where the weight is the length of the line segment.
+A histogram is built up in this way for both input lines, then normalized, and the difference calculated.
+To conflate rivers, a sampled angle histogram value was derived, which allows for specifying a configurable
+distance along each line segment at which to sample the angle value.  The distance from the sampled location over
 which the directional heading is calculated is also configurable.
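
A minimal Python sketch of the basic angle histogram idea follows: each segment's heading is binned with a weight equal to the segment length, both histograms are normalized, and their difference is summed. The bin count and the difference metric are illustrative choices; the sampled variant described above would additionally sample headings at a configurable spacing along each segment.

[source,python]
-----
import math
from collections import defaultdict

def angle_histogram(points, num_bins=16):
    """Bin each segment's heading (direction-agnostic, modulo pi) weighted
    by segment length, then normalize.  Bin count is an illustrative choice."""
    hist = defaultdict(float)
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        length = math.hypot(x2 - x1, y2 - y1)
        angle = math.atan2(y2 - y1, x2 - x1) % math.pi
        hist[int(angle / math.pi * num_bins) % num_bins] += length
    total = sum(hist.values())
    return {b: w / total for b, w in hist.items()}

def histogram_difference(h1, h2):
    """Sum of absolute bin differences between two normalized histograms."""
    bins = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in bins)

line_a = [(0, 0), (10, 0), (20, 2)]
line_b = [(0, 1), (9, 1), (19, 4)]
print(histogram_difference(angle_histogram(line_a), angle_histogram(line_b)))
-----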
 
-Two additional approaches were attempted that did not increase performance of the river conflation model 
-against the datasets tested, but are worth mentioning: increasing the unnecessary reviewable feature 
+Two additional approaches were attempted that did not increase performance of the river conflation model
+against the datasets tested, but are worth mentioning: increasing the unnecessary reviewable feature
 count to aid in decreasing the incorrectly conflated feature count, and weighting matches between
 extracted sublines that are closer in distance higher than those that are further apart.
 
@@ -423,7 +423,7 @@ See the User Guide section on river conflation for details on configuration opti
 | San Diego | 784 | 55.3 | 17.0 | 72.3 | 02:19:22
 |======
 
-The goal of 80% combined correct and reviewable was met with the Haiti river data, but was not met with the San Diego river data.  
+The goal of 80% combined correct and reviewable was met with the Haiti river data, but was not met with the San Diego river data.
 Future work listed in a following section should help to increase the conflation accuracy further.
 
 ==== Future Work
@@ -438,11 +438,11 @@ with larger amounts of nodes and overlapping sublines will improve the performan
 
 ==== Overview
 
-The goal for the initial conflation of power lines was that the number of correctly conflated features plus the number of features marked 
-for manual review would equal at least 80% of the overall conflated features in each test dataset.  An attempt was made to have a as close 
-to a minimum of 200 manually matched features as possible in each dataset, while keeping datasets small enough that a single test against 
+The goal for the initial conflation of power lines was that the number of correctly conflated features plus the number of features marked
+for manual review would equal at least 80% of the overall conflated features in each test dataset.  An attempt was made to have as close
+to a minimum of 200 manually matched features as possible in each dataset, while keeping datasets small enough that a single test against
 them could be run in roughly ten minutes or less.  A rules-based conflation model was constructed that maximized the conflation output quality
-for all input datasets.  
+for all input datasets.
 
 ==== Approach
 
@@ -480,22 +480,22 @@ turned on by default, for the situation resulted in an increased conflation scor
 For the input data that contained them, voltage tags were very valuable when trying to disambiguate matches in dense areas of power lines
 (especially near power stations).  Fortunately, a lot of the open source data had fairly accurate voltage tags.
 
-To a lesser degree, the location tag used in the OpenStreetMap power line mapping specification was valuable.  Since power lines can exist 
+To a lesser degree, the location tag used in the OpenStreetMap power line mapping specification was valuable.  Since power lines can exist
 both above ground and underground, it can be difficult to correctly conflate underground power lines if they are not labeled as such.
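
A sketch of how such tags might be used to adjust an ambiguous match score follows. The helper, weights, and penalties are hypothetical and are not Hootenanny's actual power line scoring logic; only the voltage and location tag names come from the OpenStreetMap specification referenced above.

[source,python]
-----
# Illustrative only: re-rank ambiguous power line match candidates using
# voltage and location tags.  Weights and penalties are made-up values.
def tag_adjusted_score(base_score, tags1, tags2):
    score = base_score
    v1, v2 = tags1.get("voltage"), tags2.get("voltage")
    if v1 and v2:
        # Matching voltages strengthen the match; conflicting ones weaken it.
        score += 0.2 if v1 == v2 else -0.3
    l1, l2 = tags1.get("location"), tags2.get("location")
    if l1 and l2 and l1 != l2:
        # e.g. an overhead line should not match an underground one.
        score -= 0.5
    return score

print(tag_adjusted_score(0.6, {"voltage": "220000"}, {"voltage": "220000"}))
print(tag_adjusted_score(0.6, {"location": "overhead"}, {"location": "underground"}))
-----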
 
 ==== Difficulties
 
-The largest difficulty in conflating power lines was due to: 1) differing standards on mapping power lines on towers, 2) data mapped incorrectly based off of aerial imagery, 3) power lines that start as overhead and are later buried underground.  
+The largest difficulties in conflating power lines were: 1) differing standards for mapping power lines carried on towers, 2) data mapped incorrectly from aerial imagery, and 3) power lines that start overhead and are later buried underground.
 
 The OpenStreetMap power line mapping specification states that power towers carrying multiple cables be mapped as
-a single way with tags indicating the number of cables carried.  Other datasets (ENERGYDATA.INFO, California State Government) map each 
-individual cable as a separate way.  The desired outcome when conflating these two types of data would be to generate a review for user 
-to decide which way they want to represent the features.  However, that proved difficult to implement in Hootenanny for this situation and 
+a single way with tags indicating the number of cables carried.  Other datasets (ENERGYDATA.INFO, California State Government) map each
+individual cable as a separate way.  The desired outcome when conflating these two types of data would be to generate a review for the user
+to decide which representation they want for the features.  However, that proved difficult to implement in Hootenanny for this situation and
 needs improvement.
 
 Since power lines can become very dense in urban areas, it can be difficult to correctly map them from aerial imagery.  Some of
 the OpenStreetMap data used during testing that had been mapped from aerial imagery appeared to incorrectly connect power line ways when
-compared to the more specialized open source power line datasets coming from power companies (ENERGYDATA.INFO, California State Government). 
+compared to the more specialized open source power line datasets coming from power companies (ENERGYDATA.INFO, California State Government).
 
 ==== Results
 
@@ -507,17 +507,17 @@ compared to the more specialized open source power line datasets coming from pow
 | 2 | Namibia | EDI | OSM | 200 | 51.7 | 0.0 | 51.7 | 23:15
 | 3 | California Bay Area | California State Govt | OSM | 229 | 78.4 | 0.0 | 78.4 | 01:39
 | 4 | Los Angeles | California State Govt | OSM | 204 | 69.6 | 1.2 | 70.8 | 00:56
-| 5 | Namibia | MGCP | EDI | 41 | 71.7 | 0.0 | 71.7 | 00:30 
-| 6 | Namibia | EDI | OSM | 51 | 94.7 | 0.0 | 94.7 | 03:53 
+| 5 | Namibia | MGCP | EDI | 41 | 71.7 | 0.0 | 71.7 | 00:30
+| 6 | Namibia | EDI | OSM | 51 | 94.7 | 0.0 | 94.7 | 03:53
 |======
 
-The 80% correct conflation threshold was met by two of the tests, with three additional tests within 10% of that value.  The Nambia 
-test with only 51.7% correct obviously still requires quite a bit of attention.  
+The 80% correct conflation threshold was met by two of the tests, with three additional tests within 10% of that value.  The Namibia
+test with only 51.7% correct obviously still requires quite a bit of attention.
 
-It is worth noting that some of the ENERGYDATA.INFO (EDI) and MGCP data contain previously added OpenStreetMap (OSM) data.  Therefore, in some cases 
-nearly identical sections of data are being conflated together, which Hootenanny performs very well against (as expected).  In those areas 
-test scores could be considered artificially inflated.  However, since it is a quite common workflow to conflate OpenStreetMap into other 
-custom data sources due to OSM's richness as a result of open source contribution, testing conflating such overlapping data is still quite 
+It is worth noting that some of the ENERGYDATA.INFO (EDI) and MGCP data contain previously added OpenStreetMap (OSM) data.  Therefore, in some cases
+nearly identical sections of data are being conflated together, which Hootenanny performs very well against (as expected).  In those areas
+test scores could be considered artificially inflated.  However, since it is quite a common workflow to conflate OpenStreetMap into other
+custom data sources due to OSM's richness resulting from open source contributions, testing conflation of such overlapping data is still quite
 valid.
 
 [[GenericLineConflationFutureWork]]