Parent-child detection metrics in the case of multiple parents configuration #290

Open
mohamedgy opened this issue Jan 2, 2023 · 1 comment
Labels
data:multi-table Related to multi-table, relational datasets feature:metrics Related to any of the individual metrics feature request Request for a new feature under discussion Issue is currently being discussed

Comments

@mohamedgy

Problem Description

The current version of the parent-child detection metric works well on denormalized data with a linear parent-child relationship scheme. However, we think the evaluation of the denormalized data can be improved for the case where a child table has multiple parents.
To illustrate this case, we will use the biodegradability dataset as an example.
The bond table has two parent references, both to the atom table (the atom table acts as a parent twice, through the foreign keys atom_id and atom_id2). The current version of parent-child detection denormalizes the child table with each parent separately, independently of the other parent. That is, the parent-child detection will:

  • Denormalize the bond table using the atom_id foreign key as the join field, then compute the first detection metric score (s1) on the resulting table.
  • Denormalize the original bond table using only the second foreign key, atom_id2, then compute the second detection metric score (s2) on the resulting table.
  • Compute the mean of s1 and s2.

This method successfully evaluates the relationship between each parent table and the child table separately, but we can identify two drawbacks:

  1. The evaluation includes the foreign key of the other parent table at each iteration.
  2. The method doesn't take into account the indirect relationship that the two parent tables may have via the child table.

For this reason, we think that denormalizing the parent and child tables into a single table is more relevant. For the database above, the denormalized table would be constructed in a single step, giving the following table, which can also be used to evaluate the indirect relationship between the parents:

| type_bond | type_atom | type_atom2 |
|-----------|-----------|------------|
| 1         | c         | h          |
| 2         | o         | n          |
| 2         | n         | o          |
| 1         | n         | c          |
| 7         | c         | c          |
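
To make the difference concrete, here is a minimal pandas sketch of both approaches. The atom ids and the toy rows are made up so that the result reproduces the five rows above, and `join_parent` is a hypothetical helper, not part of the library. It only shows how the tables are built, not how the detection score is computed.

```python
import pandas as pd

# Toy tables (hypothetical ids/values) chosen to reproduce the five rows above.
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "type": ["c", "h", "o", "n"],
})
bond = pd.DataFrame({
    "atom_id": ["a1", "a3", "a4", "a4", "a1"],
    "atom_id2": ["a2", "a4", "a3", "a1", "a1"],
    "type": [1, 2, 2, 1, 7],
})


def join_parent(child, parent, fk, suffix):
    """Left-join the atom table onto the child through foreign key `fk`."""
    renamed = parent.rename(
        columns={"atom_id": fk, "type": "type_atom" + suffix}
    )
    return child.merge(renamed, on=fk, how="left")


child = bond.rename(columns={"type": "type_bond"})

# Per-parent denormalization: each table still carries the other foreign key.
per_parent_1 = join_parent(child, atom, "atom_id", "")
per_parent_2 = join_parent(child, atom, "atom_id2", "2")

# Single-step denormalization: both parents joined, key columns dropped.
combined = join_parent(per_parent_1, atom, "atom_id2", "2")
combined = combined.drop(columns=["atom_id", "atom_id2"])
print(combined)
```

In this sketch `per_parent_1` keeps the `atom_id2` column and `per_parent_2` keeps `atom_id`, which is the first drawback listed above, while `combined` puts both atom types on the same row.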
@npatki
Contributor

npatki commented Jan 4, 2023

Hi @mohamedgy, thanks for filing this feature request. It definitely seems like we could refine the parent-child detection metrics a bit more. A few of my own thoughts:

Scope: I believe the current metric(s) were only scoped for a single parent-child relationship due to potential issues that may arise in performance and accuracy. The biodegradability dataset is small, but I imagine if both parents had 100s of columns, then the resulting denormalized table will have many columns -- which isn't always the best for computation or predictive accuracy.

Schema: Seems like it's not just the multi-parent scenario that may run into this problem. If you have a schema of higher depth, such as grandparent --> parent --> child, then denormalizing all 3 tables may also provide some unique insights (for example, correlations between grandparent and child). But this goes back to the scoping problem above.

Let's keep this issue open as we figure out how best to support these tradeoffs. It may involve new metrics, or parameters where users can control the denormalization.

Workaround

At least for now, there is a workaround where you can denormalize the tables yourself before applying the detection metrics.
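
A rough sketch of what that could look like with pandas, under a few assumptions: `denormalize` and `detection_on_denormalized` are hypothetical helpers (not library functions), each dataset is passed as a dict with a `"child"` and a `"parent"` DataFrame, and `metric` is any callable that scores a pair of single tables. A single-table detection metric could be plugged in there; the placeholder below only returns the table shapes.

```python
import pandas as pd


def denormalize(child, parent, foreign_keys, parent_key):
    """Join `parent` onto `child` once per foreign key, then drop the keys.

    Parent columns get the foreign key name as a suffix so that repeated
    joins against the same parent table do not collide.
    """
    wide = child.copy()
    for fk in foreign_keys:
        renamed = parent.rename(columns={
            col: fk if col == parent_key else f"{col}__{fk}"
            for col in parent.columns
        })
        wide = wide.merge(renamed, on=fk, how="left")
    return wide.drop(columns=list(foreign_keys))


def detection_on_denormalized(real, synthetic, metric, **kwargs):
    """Denormalize both datasets the same way and score them with `metric`."""
    real_wide = denormalize(real["child"], real["parent"], **kwargs)
    synth_wide = denormalize(synthetic["child"], synthetic["parent"], **kwargs)
    return metric(real_wide, synth_wide)


# Toy data; the "synthetic" tables are just a stand-in for a real sample.
atom = pd.DataFrame({"atom_id": ["a1", "a2"], "type": ["c", "h"]})
bond = pd.DataFrame({
    "atom_id": ["a1", "a2"],
    "atom_id2": ["a2", "a1"],
    "type": [1, 2],
})
real = {"child": bond, "parent": atom}
synthetic = {"child": bond.iloc[::-1].reset_index(drop=True), "parent": atom}

# Placeholder metric: swap in a single-table detection metric here.
shapes = detection_on_denormalized(
    real,
    synthetic,
    metric=lambda r, s: (r.shape, s.shape),
    foreign_keys=["atom_id", "atom_id2"],
    parent_key="atom_id",
)
print(shapes)
```

Dropping the foreign key columns before scoring keeps identifiers out of the evaluation. As noted under Scope, the joined table grows with every parent column, so it may be worth selecting a subset of parent columns before joining when the parents are wide.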

@npatki npatki added under discussion Issue is currently being discussed data:multi-table Related to multi-table, relational datasets feature:metrics Related to any of the individual metrics and removed new Label applied to new issues labels Jan 4, 2023