Consider catalog file when populating a schema message #90

agrandotech · 2022-06-02T09:39:42Z

Description of change

In the current implementation, the catalog.json being passed is not considered for the creation of the schema message.
The schema always matches the schema in the schema json files within the repository.
This leads to the following issues:

Fields that are not selected will still stay in the schema
Fields provided by the catalog file are not present in the schema at all

We encountered this issue when we synced to a BigQuery target. The table created did not match the data in the records.
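The mismatch described above can be made visible by comparing the properties declared in a schema message with the keys of an extracted record. This is a minimal sketch, not code from this PR: the helper name `schema_record_diff` and the sample field names are hypothetical. A field that appears only in the schema is not always an error (a record may simply omit it), so treat the result as a hint.

```python
def schema_record_diff(schema, record):
    """Compare top-level schema properties with the keys of one record."""
    declared = set(schema.get("properties", {}))
    present = set(record)
    return {
        # Declared in the schema but absent from the record
        "only_in_schema": sorted(declared - present),
        # Present in the record but never declared in the schema
        "only_in_record": sorted(present - declared),
    }


# Hypothetical sample data: "summary" was deselected in the catalog,
# "customfield_10001" exists only in the catalog.
default_schema = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "summary": {"type": "string"}},
}
record = {"id": "10000", "customfield_10001": "value"}

diff = schema_record_diff(default_schema, record)
```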

Manual QA steps

Run the tap in sync mode without providing a catalog.json file -> it should use the default schema and populate this in the schema message
Run the tap in sync mode with a catalog.json file where you set certain fields to "selected": false -> those fields should not show up in the schema message
Run the tap in sync mode with a catalog.json file that contains fields that are not part of the default schema (e.g. custom Jira fields) -> those fields should show up in the schema message
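The QA steps above could be checked by inspecting the SCHEMA message in the tap's output. This is a rough sketch that assumes the output was captured as JSON lines; the helper name `schema_properties`, the stream name, and the sample messages are hypothetical and only illustrate the shape of the check.

```python
import json


def schema_properties(output_lines, stream):
    """Return the property names of the first SCHEMA message for `stream`."""
    for line in output_lines:
        message = json.loads(line)
        if message.get("type") == "SCHEMA" and message.get("stream") == stream:
            return set(message["schema"].get("properties", {}))
    return None  # No SCHEMA message was emitted for this stream


# Hypothetical captured output for a run with a catalog.json that
# deselects "summary" and adds "customfield_10001".
sample_output = [
    json.dumps({
        "type": "SCHEMA",
        "stream": "issues",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "customfield_10001": {"type": ["null", "string"]},
            },
        },
        "key_properties": ["id"],
    }),
    json.dumps({
        "type": "RECORD",
        "stream": "issues",
        "record": {"id": "10000", "customfield_10001": "value"},
    }),
]

props = schema_properties(sample_output, "issues")
```

With such a helper, step 2 becomes an assertion that a deselected field is absent from `props`, and step 3 an assertion that a catalog-only field is present.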

Risks

Existing implementations might show different behavior in case they provided a catalog.json file where certain fields are unselected

Rollback steps

revert this branch

The provided catalog.json should be used for the schema message generation to avoid mismatches between schema and extracted records.

Use provided catalog in schema message

singer-bot · 2022-06-02T09:39:43Z

Hi @agrandotech, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

agrandotech · 2022-06-02T09:40:41Z

tap_jira/__init__.py

+    if "anyOf" not in property_schema and "type" not in property_schema:
+        return None  # Could not detect data type
+    for property_type in property_schema.get("anyOf", [property_schema.get("type")]):
+        if "object" in property_type or property_type == "object":
+            return True
+    return False
+
+
+def is_property_selected(
+    stream_name,
+    breadcrumb,
+):
+    """Return True if the property is selected for extract.
+    Breadcrumb of `[]` or `None` indicates the stream itself. Otherwise, the
+    breadcrumb is the path to a property within the stream.
+    The code is based on https://github.com/meltano/sdk/blob/c9c0967b0caca51fe7c87082f9e7c5dd54fa5dfa/singer_sdk/helpers/_catalog.py#L63
+    """
+    breadcrumb = breadcrumb or tuple()
+    if isinstance(breadcrumb, str):
+        breadcrumb = tuple([breadcrumb])
+
+    if not Context.catalog:
+        return True
+
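+    catalog_entry = Context.get_catalog_entry(stream_name)
+    if not catalog_entry:
+        # Check before calling to_dict(), which would fail on a missing entry.
+        LOGGER.warning(f"Catalog entry missing for '{stream_name}'. Skipping.")
+        return False
+    catalog_entry = catalog_entry.to_dict()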
+
+    if not catalog_entry.get('metadata'):
+        return True
+
+    md_map = metadata.to_map(catalog_entry['metadata'])
+    md_entry = md_map.get(breadcrumb)
+    parent_value = None
+    if len(breadcrumb) > 0:
+        parent_breadcrumb = tuple(list(breadcrumb)[:-2])
+        parent_value = is_property_selected(
+            stream_name, parent_breadcrumb
+        )
+    if parent_value is False:
+        return parent_value
+
+    if not md_entry:
+        LOGGER.warning(
+            f"Catalog entry missing for '{stream_name}':'{breadcrumb}'. "
+            f"Using parent value of selected={parent_value}."
+        )
+        return parent_value or False
+
+    if md_entry.get("inclusion") == "unsupported":
+        return False
+
+    if md_entry.get("inclusion") == "automatic":
+        if md_entry.get("selected") is False:
+            LOGGER.warning(
+                f"Property '{':'.join(breadcrumb)}' was deselected while also set "
+                "for automatic inclusion. Ignoring selected==False input."
+            )
+        return True
+
+    if "selected" in md_entry:
+        return bool(md_entry['selected'])
+
+    if md_entry.get('inclusion') == 'available':
+        return True
+
+    raise ValueError(
+        f"Could not detect selection status for '{stream_name}' breadcrumb "
+        f"'{breadcrumb}' using metadata: {md_map}"
+    )
+
+
+def pop_deselected_schema(
+    schema,
+    stream_name,
+    breadcrumb,
+):
+    """Remove anything from schema that is not selected.
+    Walk through schema, starting at the index in breadcrumb, recursively updating in
+    place.
+    This code is based on https://github.com/meltano/sdk/blob/c9c0967b0caca51fe7c87082f9e7c5dd54fa5dfa/singer_sdk/helpers/_catalog.py#L146
+    """
+    for property_name, val in list(schema.get("properties", {}).items()):
+        property_breadcrumb = tuple(
+            list(breadcrumb) + ["properties", property_name]
+        )
+        selected = is_property_selected(
+            stream_name, property_breadcrumb
+        )
+        if not selected:
+            schema["properties"].pop(property_name)
+            continue
+
+        if is_object_type(val):
+            # call recursively in case any subproperties are deselected.
+            pop_deselected_schema(
+                val, stream_name, property_breadcrumb
+            )
+


Reviewer note: These helper functions remove unselected fields from the schema message.
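To illustrate the idea behind these helpers without the tap's `Context` or the singer library, here is a simplified, self-contained sketch. The names `is_selected` and `prune` and the sample metadata are hypothetical. It differs from the diff in a few ways: it takes the breadcrumb-to-metadata map as an argument, it returns False instead of raising when no rule matches, and it ignores `anyOf` when detecting object types.

```python
import copy


def is_selected(md_map, breadcrumb=()):
    """Simplified selection check against a breadcrumb -> metadata map."""
    parent = None
    if breadcrumb:
        # A property breadcrumb ends in ("properties", name), so the parent
        # breadcrumb is found by dropping the last two elements.
        parent = is_selected(md_map, breadcrumb[:-2])
        if parent is False:
            return False
    entry = md_map.get(breadcrumb)
    if not entry:
        return bool(parent)  # No metadata: inherit from the parent
    inclusion = entry.get("inclusion")
    if inclusion == "unsupported":
        return False
    if inclusion == "automatic":
        return True  # Automatic fields cannot be deselected
    if "selected" in entry:
        return bool(entry["selected"])
    return inclusion == "available"


def prune(schema, md_map, breadcrumb=()):
    """Remove deselected properties from `schema` in place and return it."""
    for name, sub in list(schema.get("properties", {}).items()):
        crumb = breadcrumb + ("properties", name)
        if not is_selected(md_map, crumb):
            del schema["properties"][name]
        elif "object" in sub.get("type", []):
            prune(sub, md_map, crumb)  # Recurse into nested objects
    return schema


# Hypothetical metadata map and schema
md_map = {
    (): {"selected": True},
    ("properties", "id"): {"inclusion": "automatic"},
    ("properties", "summary"): {"inclusion": "available", "selected": False},
    ("properties", "fields"): {"inclusion": "available", "selected": True},
    ("properties", "fields", "properties", "secret"): {"selected": False},
}
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "summary": {"type": ["null", "string"]},
        "fields": {
            "type": ["null", "object"],
            "properties": {
                "secret": {"type": "string"},
                "status": {"type": "string"},
            },
        },
    },
}

pruned = prune(copy.deepcopy(schema), md_map)
```

Here `pruned` keeps `id` and `fields.status`, while `summary` and `fields.secret` are dropped. The deep copy leaves the original schema untouched, which matters if the default schema is reused elsewhere.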

agrandotech · 2022-06-02T09:42:17Z

tap_jira/__init__.py

-    schema = load_schema(stream.tap_stream_id)
-    singer.write_schema(stream.tap_stream_id, schema, stream.pk_fields)
+    stream_id = stream.tap_stream_id
+    catalog_entry = Context.get_catalog_entry(stream_id).to_dict()


Reviewer note: Using Context.get_catalog_entry(stream_id).to_dict() ensures that the catalog file, if one is provided, is used to create the schema message. Otherwise, the default schema from load_schema is used.
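The rest of this hunk is not shown here, so the following is only a sketch of the fallback the note describes, not the PR's actual code. It assumes the catalog entry has already been converted to a dict with a `schema` key; the function name `resolve_schema` and the `load_default` parameter are hypothetical stand-ins.

```python
def resolve_schema(stream_id, catalog_entry, load_default):
    """Prefer the schema from the catalog entry, else load the default one.

    `catalog_entry` is a plain dict (or None); `load_default` is a callable
    that takes a stream id and returns the bundled default schema.
    """
    if catalog_entry and catalog_entry.get("schema"):
        return catalog_entry["schema"]
    return load_default(stream_id)


# Hypothetical default schemas standing in for the bundled schema files
DEFAULTS = {"issues": {"type": "object", "properties": {"id": {"type": "string"}}}}

from_catalog = resolve_schema(
    "issues",
    {"schema": {"type": "object", "properties": {"customfield_10001": {}}}},
    DEFAULTS.get,
)
from_default = resolve_schema("issues", None, DEFAULTS.get)
```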

singer-bot · 2022-06-02T09:46:30Z

You did it @agrandotech!

Thank you for signing the Singer Contribution License Agreement.

edgarrmondragon · 2024-04-16T21:53:43Z

I guess it's an old PR, but would the maintainers be interested in getting this merged?

agrandotech added 2 commits June 1, 2022 14:20

Use provided catalog in schema message

4a67485

The provided catalog.json should be used for the schema message generation to avoid mismatches between schema and extracted records.

Merge pull request #1 from agrandotech/use-catalog-for-schema-creation

8adf4d4

Use provided catalog in schema message

singer-bot added the cla-missing label Jun 2, 2022

singer-bot removed the cla-missing label Jun 2, 2022
