Parent-child detection metrics in the case of multiple parents configuration #290

mohamedgy · 2023-01-02T21:22:03Z

Problem Description

The current version of the parent-child detection metric works well when applied to denormalized data with a linear parent-child relationship scheme. However, we think the evaluation of the denormalized data can be improved when a child table has multiple parents.
To illustrate this case, we will use the biodegradability dataset as an example.
The bond table has two parent tables (the atom table, referenced twice). The current version of parent-child detection repeats the denormalization for each parent table separately, independently of the other parents. That is, the parent-child detection will:

Denormalize the bond table using the atom_id foreign key as the join field, then compute the first detection metric score (s1) on the resulting table.
Denormalize the original bond table using only the second foreign key, atom_id2, then compute the second detection metric score (s2) on the resulting table.
Compute the mean of s1 and s2.
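The three steps above can be sketched roughly as follows. This is only an illustration of the procedure as described in this issue, not the library's actual implementation: the function names, the `score_fn` callable (standing in for whichever detection metric is used), and the dict-of-tables layout are all assumptions, and the atom table is assumed to have an `atom_id` primary key.

```python
from statistics import mean

import pandas as pd


def denormalize_by_key(bond: pd.DataFrame, atom: pd.DataFrame, foreign_key: str) -> pd.DataFrame:
    """Join bond with atom through a single foreign key (one parent at a time)."""
    # Suffix the parent columns to avoid clashes (both tables have a "type" column).
    parent = atom.add_suffix("_parent")
    flat = bond.merge(parent, left_on=foreign_key, right_on="atom_id_parent", how="left")
    return flat.drop(columns="atom_id_parent")


def per_parent_score(real, synthetic, score_fn, foreign_keys=("atom_id", "atom_id2")):
    """Score each parent-child link separately (s1, s2), then average them."""
    scores = []
    for foreign_key in foreign_keys:
        real_flat = denormalize_by_key(real["bond"], real["atom"], foreign_key)
        synth_flat = denormalize_by_key(synthetic["bond"], synthetic["atom"], foreign_key)
        # score_fn is a hypothetical stand-in for a single-table detection metric.
        scores.append(score_fn(real_flat, synth_flat))
    return mean(scores)
```

Note that in this sketch each per-parent table still carries both `atom_id` and `atom_id2` as columns, because only one of them is used as the join field at a time.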
This computation method successfully evaluates the relationship between each parent table and the child table separately, but we can identify two drawbacks:

The evaluation includes the foreign key of the other parent table at each iteration.
This method doesn't take into account the indirect relationship that the two parent tables may have via the child table.
For this reason, we think that denormalizing the parent and child tables into a single table is more relevant. For example, for the previous database, the denormalized table would be constructed in a single step, giving the following table, which can also evaluate the indirect relationship between the parents:

type_bond	type_atom	type_atom2
1	c	h
2	o	n
2	n	o
1	n	c
7	c	c
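A minimal pandas sketch of that single-step denormalization is shown here. The toy ids (`a1`, `b1`, ...) and the function name `denormalize_all` are made up for illustration, and the real atom and bond tables have more columns than this; the rows are chosen so that the result matches the table above.

```python
import pandas as pd

# Toy tables with hypothetical ids, chosen to reproduce the table above.
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "type": ["c", "h", "o", "n"],
})
bond = pd.DataFrame({
    "bond_id": ["b1", "b2", "b3", "b4", "b5"],
    "atom_id": ["a1", "a3", "a4", "a4", "a1"],
    "atom_id2": ["a2", "a4", "a3", "a1", "a1"],
    "type": [1, 2, 2, 1, 7],
})


def denormalize_all(bond: pd.DataFrame, atom: pd.DataFrame) -> pd.DataFrame:
    """Join bond with atom through both foreign keys, then drop all key columns."""
    flat = bond.rename(columns={"type": "type_bond"})
    for foreign_key, out_col in (("atom_id", "type_atom"), ("atom_id2", "type_atom2")):
        # Rename the parent's key to the foreign key so the merge is on one column.
        lookup = atom.rename(columns={"atom_id": foreign_key, "type": out_col})
        flat = flat.merge(lookup, on=foreign_key, how="left")
    return flat.drop(columns=["bond_id", "atom_id", "atom_id2"])


flat = denormalize_all(bond, atom)
print(flat.to_string(index=False))
```

Because both parent columns end up in the same row, a detection metric run on this table could pick up on pairs such as (`type_atom`, `type_atom2`), which the per-parent tables never place side by side.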

npatki · 2023-01-04T20:13:13Z

Hi @mohamedgy, thanks for filing this feature request. Definitely seems like we could refine the parent-child detection metrics a bit more. A few of my own thoughts:

Scope: I believe the current metric(s) were only scoped for a single parent-child relationship due to potential issues that may arise in performance and accuracy. The biodegradability dataset is small, but I imagine if both parents had 100s of columns, then the resulting denormalized table will have many columns -- which isn't always the best for computation or predictive accuracy.

Schema: Seems like it's not just a multi-parent scenario that may run into this problem. If you have a schema of higher depth such as grandparent --> parent --> child, then denormalizing all 3 tables may also provide some unique insights (correlations between grandparent and child). But this goes back to the scoping problem above.

Let's keep this issue open as we figure out how best to support these tradeoffs. It may involve new metrics, or parameters where users can control the denormalization.

Workaround

At least for now, there is a workaround where you can denormalize the tables yourself before applying the detection metrics.
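One way this workaround could look in practice is sketched below. Everything here is an assumption for illustration: the helper names, the `(foreign_key, parent_table, parent_key, prefix)` tuple layout, and the `metric` callable, which stands in for whichever single-table detection metric you apply to the two flattened tables.

```python
import pandas as pd


def denormalize(child: pd.DataFrame, relationships, drop_keys: bool = True) -> pd.DataFrame:
    """Flatten a child table with all of its parents.

    relationships: iterable of (foreign_key, parent_table, parent_key, prefix).
    The prefix keeps columns from different parents (or the same parent
    referenced twice) from colliding.
    """
    flat = child.copy()
    for foreign_key, parent, parent_key, prefix in relationships:
        renamed = parent.add_prefix(prefix)
        flat = flat.merge(renamed, left_on=foreign_key, right_on=prefix + parent_key, how="left")
        flat = flat.drop(columns=prefix + parent_key)
    if drop_keys:
        # Foreign keys are identifiers, not features, so drop them before scoring.
        flat = flat.drop(columns=[fk for fk, *_ in relationships])
    return flat


def score_denormalized(real_child, real_rels, synth_child, synth_rels, metric):
    """Denormalize real and synthetic data the same way, then apply the metric."""
    return metric(denormalize(real_child, real_rels), denormalize(synth_child, synth_rels))
```

For the bond table, `relationships` would contain two entries that both point at the atom table, one per foreign key, with different prefixes. The child's own primary key (for example a bond id) is not removed by this helper, so you would probably want to drop it yourself before scoring.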

mohamedgy added the feature request and new labels on Jan 2, 2023

npatki mentioned this issue Jan 3, 2023

Does removing foreign keys in detection metrics for multi-tables make sense? #285

Closed

npatki added the under discussion, data:multi-table, and feature:metrics labels and removed the new label on Jan 4, 2023

npatki mentioned this issue Jan 4, 2023

Missing documentation for ParentChild Detection Metrics #293

Open
