Commit
GitBook: [#56] No subject
sonalgoyal authored and gitbook-bot committed Aug 4, 2022
1 parent 923bf0a commit 21a3478
Showing 12 changed files with 37 additions and 15 deletions.
1 change: 1 addition & 0 deletions docs/dataSourcesAndSinks/connectors.md
@@ -2,6 +2,7 @@
title: Data Sources and Sinks
nav_order: 3
has_children: true
description: Data sources and file formats supported by Zingg
---

# Data Sources and Sinks
6 changes: 5 additions & 1 deletion docs/setup/link.md
@@ -1,6 +1,10 @@
---
description: To match two datasets against each other
---

# Linking across datasets

In many cases like reference data mastering, enrichment, etc, two individual datasets are duplicates free but they need to be matched against each other. The link phase is used for such scenarios.
In many cases like reference data mastering, enrichment, etc, two individual datasets are free of duplicates but they need to be matched against each other. The link phase is used for such scenarios.

`./zingg.sh --phase link --conf config.json`

13 changes: 7 additions & 6 deletions docs/setup/match.md
@@ -1,17 +1,18 @@
---
layout: default
title: Find the matches
parent: Step By Step Guide
nav_order: 8
description: Identifying matching records
---

### match
Finds the records which match with each other.
# Finding the matches

Finds the records which match with each other.

`./zingg.sh --phase match --conf config.json`

As can be seen in the image below, matching records are given the same z_cluster id. Each record also gets a z_minScore and z_maxScore which shows the least/greatest it matched with other records in the same cluster.
As can be seen in the image below, matching records are given the same z\_cluster id. Each record also gets a z\_minScore and z\_maxScore which shows the least/greatest it matched with other records in the same cluster.

![Match results](/assets/match.gif)
![Match results](../../assets/match.gif)

If records across multiple sources have to be matched, the [link phase](./link.md) should be used.
If records across multiple sources have to be matched, the [link phase](link.md) should be used.
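
The cluster columns described above lend themselves to a quick post-processing check. The following is a minimal sketch, assuming the match output has been written as CSV; only the column names `z_cluster`, `z_minScore` and `z_maxScore` come from the docs, while the sample rows, the `name` column, the helper names and the 0.9 threshold are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of match output. Only the z_cluster, z_minScore and
# z_maxScore column names come from the docs; the rows and the name column
# are made up for illustration.
SAMPLE = """z_cluster,z_minScore,z_maxScore,name
0,0.91,0.97,Jon Smith
0,0.91,0.97,John Smith
1,0.85,0.85,A. Kumar
1,0.85,0.85,Anil Kumar
"""


def group_by_cluster(rows):
    """Collect records that share the same z_cluster id."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["z_cluster"]].append(row)
    return dict(clusters)


def weak_clusters(clusters, threshold):
    """Return ids of clusters where any record's z_minScore is below threshold."""
    return sorted(
        cid
        for cid, members in clusters.items()
        if any(float(m["z_minScore"]) < threshold for m in members)
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
clusters = group_by_cluster(rows)
for cid, members in sorted(clusters.items()):
    print(cid, [m["name"] for m in members])
print("clusters to review:", weak_clusters(clusters, 0.9))
```

Grouping on `z_cluster` and filtering on `z_minScore` is one way to surface clusters that may deserve a manual look; the right threshold depends on the data and is not prescribed by Zingg.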
10 changes: 7 additions & 3 deletions docs/setup/train.md
@@ -1,10 +1,14 @@
---
layout: default
title: Build and save the model
parent: Step By Step Guide
nav_order: 7
description: Guide to build and save model
---
### train - training and saving the models

# Building and saving the model

Builds up the Zingg models using the training data from the above phases and writes them to the folder zinggDir/modelId as specified in the config.

./zingg.sh --phase train --conf config.json
```
./zingg.sh --phase train --conf config.json
```
1 change: 1 addition & 0 deletions docs/setup/training/addOwnTrainingData.md
@@ -3,6 +3,7 @@ parent: Creating training data
nav_order: 3
title: Using preexisting training data
grand_parent: Step By Step Guide
description: Instructions on using existing training data with Zingg
---

# Using pre-existing training data
3 changes: 2 additions & 1 deletion docs/setup/training/createTrainingData.md
@@ -3,8 +3,9 @@ parent: Step By Step Guide
nav_order: 6
title: Creating training data
has_children: true
description: Guide to working with training data
---

# Training data
# Working With Training Data

Zingg builds models to predict similarity. Training data is needed to build these models. The next sections describe how you can use the Zingg Interactive Labeler to create the training data.
3 changes: 2 additions & 1 deletion docs/setup/training/exportLabeledData.md
@@ -3,10 +3,11 @@ parent: Creating training data
title: Exporting labeled data as csv
grand_parent: Step By Step Guide
nav_order: 4
description: Writing labeled data to CSV for exporting
---

# Exporting Labeled Data

If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training effort](addOwnTrainingData.md) from earlier, we can write our labeled data to a csv 
If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training efforts](addOwnTrainingData.md) from earlier, we can write our labeled data to a CSV.

`./scripts/zingg.sh --phase exportModel --conf <path to conf> --location <folder to save the csv>`
3 changes: 2 additions & 1 deletion docs/setup/training/findAndLabel.md
@@ -3,6 +3,7 @@ parent: Creating training data
title: Find training data and labelling
grand_parent: Step By Step Guide
nav_order: 2
description: Phase which creates training data
---

# Find And Label
@@ -11,4 +12,4 @@ This phase is composed of two phases namely [findTrainingData](findTrainingData.

`./zingg.sh --phase findAndLabel --conf config.json`

As this is phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the the user will have to wait long for the console for labeling.&#x20;
As this phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the user will have to wait long for the console for labeling.&#x20;
2 changes: 1 addition & 1 deletion docs/setup/training/findTrainingData.md
@@ -2,7 +2,7 @@
parent: Creating training data
nav_order: 1
grand_parent: Step By Step Guide
description: pairs of records that could be similar to train Zingg
description: Pairs of records that could be similar to train Zingg
---

# Finding Records For Training Set Creation
2 changes: 1 addition & 1 deletion docs/setup/training/label.md
@@ -15,4 +15,4 @@ The label phase opens an interactive learner where the user can mark the pairs f

Proceed running findTrainingData followed by label phases till you have at least 30-40 positives, or when you see the predictions by Zingg converging with the output you want. At each stage, the user will get different variations of attributes across the records. Zingg performs pretty well with even a small number of training, as the samples to be labeled are chosen by the algorithm itself.

The showConcise flag when passed to the Zingg command line only shows fields which are NOT DONT\_USE
The showConcise flag when passed to the Zingg command line only shows fields that are NOT DONT\_USE.
@@ -1,3 +1,7 @@
---
description: Requirements to optimize the performance
---

# Tuning Label, Match And Link Jobs

#### numPartitions
4 changes: 4 additions & 0 deletions docs/updatingLabels.md
@@ -1,3 +1,7 @@
---
description: To update the existing labeled pairs as the data modifies
---

# Updating Labeled Pairs

**Please note: This is an experimental feature. Please keep a backup copy of your model folder in a separate place before running this**