Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Always use get_src_tbl_names() to fetch tbl Ids within dm_from_con() #1934

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

owenjonesuob
Copy link
Contributor

According to the docs of dm_from_con(), the table_names parameter allows the user to include only specific tables in the returned dm object:

#' @param table_names
#'   A character vector of the names of the tables to include.

Currently, we use two different methods to fetch tbls:

  • If table_names was not provided, use dm:::get_src_tbl_names() to fetch tbl Ids, which are later passed to tbl()
  • If table_names was provided, pass the character-type names directly to tbl()

But the second method does not work with non-default schema names, or with multiple schemas; and we also create additional problems when we try to set names for the list of tbls (#1933).

I think we can actually use the first method in all cases, i.e. use get_src_tbl_names() to return a named list of tbl Ids.

Then we can use table_names to take a subset from that list, before passing to tbl().


Fixes #1933.

Copy link
Member

@TSchiefer TSchiefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good to me, thank you for your contribution!

I tried using this for SQL Server, trying to access tables in schemas, making use of the schema argument passed on in the ellipsis. This did not work, therefore I rewrote some of your code slightly, cf. comment.
Also, I think the function should fail in case specifically requested tables cannot be found, which is how it was supposed to behave before. I also changed that in the suggested code.
Does this make sense to you, could you please adapt?

set_names(src_tbl_names, nms) %>%
quote_ids(con) %>%
map(possibly(tbl, NULL), src = src)
tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using schemas this code fails as well, since table_names normally does not include the schema, yet names(tbl_ids) does.
Therefore, I would suggest to replace lines 96-109 with the following:

  tbl_ids <- get_src_tbl_names(src, ..., names = .names)
  # Fetch only the tbls which were specifically requested
  if (!is.null(table_names)) {
    nms <- purrr::map_chr(tbl_ids, ~ .x@name[["table"]])
    bad_table_names <- table_names[!(table_names %in% nms)]
    tbl_ids <- tbl_ids[which(nms %in% table_names)]
  }
  tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)
  bad <- map_lgl(tbls, is_null)
  if (any(bad) || (exists("bad_table_names", inherits = FALSE) && length(bad_table_names) > 0)) {
    if (is_null(table_names)) {
      warn_tbl_access(names(tbls)[bad])
      tbls <- tbls[!bad]
    } else {
      abort_tbl_access(union(names(tbls)[bad], bad_table_names))
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for this!! I hadn't spotted the *_tbl_access() functions, those look super handy 😁 I'll certainly adjust to include those.

Regarding table_names, I've just been thinking about the following situation:

Suppose there is a table name which exists in multiple schemas, e.g. suppose schema1.table1 and schema2.table1 both exist.

Now if a user asks for:

dm_from_con(con, table_names = "table1", schemas = c("schema1", "schema2"))

What should we return in such a case?

One option would be for the user to provide "schema.table"-format names in table_names. The documentation for table_names would perhaps need some adjustment - but:

Or maybe we should include both schema1.table1 and schema2.table1 in the returned dm object; which I think is what your suggested changes will result in? I'm conflicted about this, because I feel like once we start dealing with multiple schemas, we should always be explicit about those schemas - but maybe that's just me being too strict with SQL ideals 😛

To offer one more alternative, we could include only the first table found with that name, and then raising a warning, as described here? #1789 (comment)

I'll happily proceed with whichever you think is best!

Copy link
Member

@TSchiefer TSchiefer Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply, good catch about the same table name in different schemas.

In my opinion, if the user provides 1 table name and more than 1 schema name, as in your example, this should lead to an error: either schema is not provided, has length 1 or needs to have a length matching the length of table_names.

names(tbl_ids) is a vector of the resulting names in the dm according to the specification in the .names argument. In case schema is given with multiple schema names, the default for this argument -- as you know -- is "{.schema}.{.table}", which leads to dm-tables just like you suggested (schema1.table1 and schema2.table1), if the user gave table_names = c("table1", "table1") and schema = c("schema1", "schema2").
In general names(tbl_ids) could be anything though and for this reason it cannot be used in any conditions.

I think the error messages need to be adapted to include the schema names of the tables that could not be accessed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okidoki! So based on these comments so far, I think we might need to modify how table_names works, such that it can accept a list in the multiple-schemas case - something like:

# No `schema` arg => use default schema, i.e. same as one-schema case (next example)
dm_from_con(con, table_names = c("table1", "table2"))

# One specified schema => `table_names` is a single vector of names
# Returned dm object contains: schema1.table1, schema1.table2
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))

# Multiple schemas => `table_names` is a list of vectors (must it be named?)
# Returned dm object contains: schema1.table1, schema1.table2, schema2.table1, schema2.table3
dm_from_con(
  con,
  schema = c("schema1", "schema2"),
  table_names = list(
    "schema1" = c("table1", "table2"),
    "schema2" = c("table1", "table3")
  )
)

I'll make one more case for leaving things as they are currently implemented, i.e. based on names(tbl_ids), because it avoids adding complexity to table_names.

In general names(tbl_ids) could be anything though

True - but names are always constructed using the .names pattern, which is either the default (which multi-schema users must be aware of already), or specified by the user!

So the user does always know how elements of the returned dm object will be named; and therefore they know how the strings in table_names should look.

# Default schema / one schema
dm_from_con(con, table_names = c("table1", "table2"))
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))

# Multiple schemas, default `.names`
dm_from_con(
  con,
  schemas = c("schema1", "schema2"),
  table_names = c("schema1.table1", "schema1.table2", "schema2.table1", "schema2.table3")
)

# Multiple schemas, custom `.names` - user knows that `table_names` should match `.names`
dm_from_con(
  con,
  schemas = c("schema1", "schema2"),
  # The user chooses `.names`...
  .names = "{.schema}__{.table}",
  # ... so, they know what `table_names` should look like!
  table_names = c("schema1__table1", "schema1__table2", "schema2__table1", "schema2__table3")
)

Thinking back to before multiple schema functionality was added, I think one interpretation of table_names was:

"I only want the returned dm object to contain elements with names like I've specified in table_names."

And I think that with this implementation, that interpretation is still valid now in the multi-schema case.


What do you think? For either of those approaches, I reckon I now have a solid specification to go away and implement 😇 many thanks for your patience and guidance with this one!

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 14, 2023

Thanks for your contribution! What's the next action here? Can we add a test for this change?

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 14, 2023

I think we want the following:

  • Try dm_meta(simple = FALSE), works for select databases and allows learning keys
  • If this fails, try dm_meta(simple = TRUE), this should work for all databases (we need to do some work to support SQLite)

This gets rid of a lot of legacy code and simplifies things. Let me tinker with that a bit, no action needed here for now.

@krlmlr krlmlr marked this pull request as draft August 15, 2023 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Naming pattern from .names ignored in dm_from_con()
3 participants