Lineagex #966

Merged
merged 4 commits into from
Aug 3, 2023
24 changes: 23 additions & 1 deletion README.md
@@ -184,14 +184,36 @@ read_sql("postgresql://username:password@server:port/database", "SELECT * FROM l

Check out [here](https://github.com/sfu-db/connector-x#supported-sources--destinations) for supported databases and dataframes, and for more usage examples.


## Lineage
A column-level lineage graph for SQL. This module creates an interactive graph on a webpage so that you can explore the column-level lineage among the tables and views defined by your SQL files.
A general introduction to the project can be found in this [blog post](https://medium.com/@shz1/lineagex-the-python-library-for-your-lineage-needs-5e51b77a0032).

### The lineage module offers:
- **Automatic dependency creation**: When the SQL files depend on one another and the tables they reference are not yet in the database, the lineage module automatically tries to find the dependency table and create it.
- **Clean, simple, and interactive user interface**: The user interface is simple to use, with minimal clutter on the page, while still showing all of the necessary information.
- **Variety of SQL statements**: Aside from the typical `SELECT` statement, the lineage module also supports `CREATE TABLE/VIEW [IF NOT EXISTS]` as well as `INSERT` and `DELETE` statements.
- **[dbt](https://docs.getdbt.com/) support**: The lineage module is also implemented in [dbt-LineageX](https://github.com/sfu-db/dbt-lineagex). It is added to a dbt project and, by using the dbt library [fal](https://github.com/fal-ai/fal), it reuses the Python core to create similar output from the dbt project.

### Uses and Demo
The interactive graph looks like this:
<img src="https://raw.githubusercontent.com/sfu-db/lineagex/main/docs/example.gif"/>
Here is a [live demo](https://zshandy.github.io/lineagex-demo/) built from the [mimic-iv concepts_postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iv/concepts_postgres) files ([navigation instructions](https://sfu-db.github.io/lineagex/output.html#how-to-navigate-the-webpage)). It is created with one line of code:
```python
from dataprep.lineage import lineagex
lineagex(sql="path/to/sql", target_schema="schema1", conn_string="postgresql://username:password@server:port/database", search_path_schema="schema1, public")
```
Check out more detailed usage and examples [here](https://sfu-db.github.io/lineagex/intro.html).
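
If the password or database name contains characters such as `@` or `/`, the connection string may need percent-encoding. The sketch below is only an illustration: `build_conn_string` is a hypothetical helper that is not part of DataPrep or lineagex, and it assumes the driver accepts percent-encoded PostgreSQL connection URIs.

```python
from urllib.parse import quote


def build_conn_string(username, password, server, port, database):
    """Hypothetical helper: assemble a postgresql:// URI with percent-encoded parts."""
    return (
        f"postgresql://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{server}:{port}/{quote(database, safe='')}"
    )


# Placeholder credentials, for illustration only.
conn_string = build_conn_string("username", "p@ss/word", "localhost", 5432, "mimic")
print(conn_string)
```

The resulting string could then be passed as `conn_string` to `lineagex`.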

## Documentation

The following documentation can give you an impression of what DataPrep can do:

- [Connector](https://docs.dataprep.ai/user_guide/connector/introduction.html)
- [EDA](https://docs.dataprep.ai/user_guide/eda/introduction.html)
- [Clean](https://docs.dataprep.ai/user_guide/clean/introduction.html)

- [Lineage](https://sfu-db.github.io/lineagex/intro.html)
-
## Contribute

There are many ways to contribute to DataPrep.
1 change: 1 addition & 0 deletions dataprep/lineage/__init__.py
@@ -0,0 +1 @@
from .lx import lineagex
40 changes: 40 additions & 0 deletions dataprep/lineage/lx.py
@@ -0,0 +1,40 @@
"""
This module contains the lineagex method.
It is a wrapper around the lineagex.lineagex function.
"""
from typing import Optional, Union, List

try:
    import lineagex as lx

    _WITH_LX = True
except ImportError:
    _WITH_LX = False


def lineagex(
    sql: Optional[Union[List, str]] = None,
    target_schema: Optional[str] = "",
    conn_string: Optional[str] = None,
    search_path_schema: Optional[str] = "",
) -> dict:
    """
    Produce the lineage information.
    Please check out https://github.com/sfu-db/lineagex for more details.
    :param sql: The input SQL. It can be a path to a file, a path to a folder containing SQL files, a list of SQLs, or a list of view names and/or schemas
    :param target_schema: The schema where the SQL files would be created, defaults to public, or the first schema in the search_path_schema if provided
    :param conn_string: The postgres connection string in the format postgresql://username:password@server:port/database, defaults to None
    :param search_path_schema: The SET search_path TO ... schemas, defaults to public or the target_schema if provided
    :return: The output_dict of the underlying lineagex.lineagex call
    """

    if not _WITH_LX:
        # Keep a space between the two sentences of the error message.
        raise ImportError("lineagex is not installed. Please run pip install lineagex")

    return lx.lineagex(
        sql=sql,
        target_schema=target_schema,
        conn_string=conn_string,
        search_path_schema=search_path_schema,
    ).output_dict
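
The wrapper above guards an optional dependency with a module-level flag. As a minimal sketch of the same idea in a reusable form, the hypothetical helper below (`require_optional` is an illustrative name, not part of DataPrep) imports a module on demand and raises an `ImportError` with an install hint when it is missing.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            f"{module_name} is not installed. Please run pip install {pip_name}"
        ) from exc


# A standard-library module is used here so the sketch runs anywhere.
json_module = require_optional("json", "json")
print(json_module.__name__)
```

Importing lazily like this defers the failure to the first call, which matches the behavior of the wrapper: the package can be imported without lineagex installed, and the error only appears when the function is used.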
Empty file.
2 changes: 2 additions & 0 deletions dataprep/tests/lineage/dependency_example/a_table.sql
@@ -0,0 +1,2 @@
SELECT subject_id, gender
FROM `physionet-data.mimiciii_derived.no_dob`;
6 changes: 6 additions & 0 deletions dataprep/tests/lineage/dependency_example/aa_table.sql
@@ -0,0 +1,6 @@
CREATE VIEW aa_table AS
SELECT a.subject_id, b.gender
FROM `physionet-data.mimiciii_derived.a_table` a, `physionet-data.mimiciii_derived.no_dob` b;
CREATE TABLE a_table AS
SELECT subject_id, gender
FROM `physionet-data.mimiciii_derived.no_dob`;
11 changes: 11 additions & 0 deletions dataprep/tests/lineage/dependency_example/basic_patient_info.sql
@@ -0,0 +1,11 @@
-- ------------------------------------------------------------------
-- Title: Retrieves basic patient information from the patients table
-- Notes: this query does not specify a schema. To run it on your local
-- MIMIC schema, run the following command:
-- SET SEARCH_PATH TO mimiciii;
-- Where "mimiciii" is the name of your schema, and may be different.
-- ------------------------------------------------------------------


SELECT subject_id, gender, dob
FROM `physionet-data.mimiciii_clinical.patients`;
@@ -0,0 +1 @@
SELECT * FROM aa_table;
2 changes: 2 additions & 0 deletions dataprep/tests/lineage/dependency_example/no_dob.sql
@@ -0,0 +1,2 @@
SELECT subject_id, gender, dob
FROM `physionet-data.mimiciii_derived.basic_patient_info`;
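
The files in `dependency_example` reference one another, so they have to be resolved in dependency order. The sketch below illustrates that ordering with the standard-library `graphlib` module (Python 3.9+). The dependency map is written by hand from the `FROM` clauses of the named files above; it is an assumption for illustration, not lineagex output, and it does not claim to be how lineagex resolves dependencies internally. The query whose file name is not shown is left out.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tables it reads from, as read off the FROM
# clauses of the example files (hand-written for illustration).
dependencies = {
    "basic_patient_info": {"patients"},
    "no_dob": {"basic_patient_info"},
    "a_table": {"no_dob"},
    "aa_table": {"a_table", "no_dob"},
}

# static_order() yields every node after all of its predecessors.
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

Every table appears after the tables it reads from, so `patients` comes first and `aa_table` comes last.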
32 changes: 32 additions & 0 deletions dataprep/tests/lineage/test_lineagex.py
@@ -0,0 +1,32 @@
# type: ignore
import os
from os import environ

import pytest

from ...lineage import lineagex


@pytest.mark.skipif(
    environ.get("DB_URL", "") == "",
    reason="Skip tests that require database setup and sql query specified",
)
def test_read_sql() -> None:
    db_url = environ["DB_URL"]
    # Resolve the example folder relative to this file so the test does not
    # depend on the current working directory.
    sql = os.path.join(os.path.dirname(os.path.abspath(__file__)), "dependency_example")
    lx = lineagex(sql, "mimiciii_derived", db_url, "mimiciii_clinical, public")
    print("dependency test with database connection", lx)
    lx = lineagex(
        sql=sql,
        target_schema="mimiciii_derived",
        search_path_schema="mimiciii_clinical, public",
    )
    print("dependency test without database connection", lx)


if __name__ == "__main__":
    test_read_sql()