Initial pull #981

Closed
wants to merge 61 commits into from
61 commits
ca90e89
Implement CI manual trigger
fatbuddy Jan 12, 2024
81c7f2f
Update poetry dependency
fatbuddy Jan 12, 2024
93d0b19
Fix the format of test_lineagex.py
fatbuddy Jan 12, 2024
8b0aa9f
Fixed key error
fatbuddy Jan 12, 2024
8ec3f80
Create a empty set(str) if the key is not found
fatbuddy Jan 12, 2024
b7fde5f
Fixing the test_levenshtein_clusters
fatbuddy Jan 12, 2024
7062e54
Add eralchemy and remove levenshtein test case
Jan 29, 2024
578e0d8
Modify the workflow dispatch arguments
Jan 29, 2024
316a02e
Try to push
WenbinLiworks Feb 1, 2024
1acc90d
Update CI for graphviz install
Feb 4, 2024
e59b2df
Merge branch 'develop' of github.com:fatbuddy/dataprep into develop
Feb 4, 2024
f3ea9b5
Graphviz installation update
Feb 4, 2024
c747c17
Graphviz installation update #2
Feb 4, 2024
5094cd4
Remove eralchemy from poetry
Feb 4, 2024
1789a2c
Upgrade eralchemy to v2
Feb 4, 2024
952e677
Attempt to install ts-graphviz/setup-graphviz
Feb 4, 2024
f275b70
Downgrade ERAlchemy to v1
Feb 4, 2024
349f2ef
Revert "Downgrade ERAlchemy to v1"
Feb 5, 2024
f113891
Fix attribute problem on sql_metadata.py:364
Feb 5, 2024
7fbf1aa
Fix the compatbility issue on sqlalchemy 2.0.25
Feb 5, 2024
2d89a1e
Fix the code style issue
Feb 5, 2024
d26368c
Windows dependency fix
fatbuddy Feb 10, 2024
5d13750
Add version control on graphviz
fatbuddy Feb 10, 2024
adde380
Windows dependency fix #2
fatbuddy Feb 10, 2024
5fbaa40
Windows dependency fix #3
fatbuddy Feb 10, 2024
4bae7e5
Windows dependency fix #4
fatbuddy Feb 10, 2024
9c5b95e
Windows dependency fix #5
fatbuddy Feb 10, 2024
426f182
Windows dependency fix #6
fatbuddy Feb 10, 2024
b22e6af
Windows dependency fix #7
fatbuddy Feb 10, 2024
ecc7792
Windows dependency fix #8
fatbuddy Feb 10, 2024
e427379
Windows dependency fix #9
fatbuddy Feb 10, 2024
aae0b84
Windows dependency fix #10
fatbuddy Feb 10, 2024
0128e0d
Windows dependency fix #11
fatbuddy Feb 10, 2024
80e064f
Windows dependency fix #12
fatbuddy Feb 10, 2024
d5a6f88
Windows dependency fix #13
fatbuddy Feb 10, 2024
1186461
Windows dependency fix #14
fatbuddy Feb 10, 2024
d612b59
Windows dependency fix #15
fatbuddy Feb 10, 2024
d008aff
Windows dependency fix #16
fatbuddy Feb 10, 2024
568c047
Docs build fix #1
fatbuddy Feb 10, 2024
dbc61ff
Docs build fix #2
fatbuddy Feb 10, 2024
a8f5ed1
Added support on python 3.10 and 3.11
fatbuddy Feb 16, 2024
a043783
Update version support to python 3.11
fatbuddy Feb 24, 2024
8400cba
Remove Ray Package from the project
fatbuddy Feb 29, 2024
7bcd5bd
Update version in pyproject.toml
fatbuddy Feb 29, 2024
a0a67ac
Fixed unclosed array problem
fatbuddy Feb 29, 2024
c3782e5
Update lock file
fatbuddy Feb 29, 2024
91d5559
Support on Python 3.12.x
fatbuddy Mar 13, 2024
f627ee6
Support on Python 3.12.x #2
fatbuddy Mar 13, 2024
2339f07
Support on Python 3.12.x #3
fatbuddy Mar 13, 2024
6c8268c
Support on Python 3.12.x #4
fatbuddy Mar 13, 2024
7f7102d
Support on Python 3.12.x #5
fatbuddy Mar 15, 2024
c5ecbfb
Support on Python 3.12.x #6
fatbuddy Mar 15, 2024
d398af0
Merge pull request #1 from fatbuddy/dev
fatbuddy Mar 16, 2024
b073754
Fix dependency package + python 3.12 support
fatbuddy Mar 30, 2024
1ec2245
Fix the styling issues caused by black
fatbuddy Mar 30, 2024
5b873b6
Merge branch 'pr' into develop
fatbuddy Mar 30, 2024
1cb426c
Fix the styling issues caused by black lib
fatbuddy Mar 30, 2024
726a9ed
Merge branch 'pr' into develop
fatbuddy Mar 30, 2024
9103155
Fix dependency package + python 3.12 support
fatbuddy Mar 30, 2024
8eca88d
Fix the styling issues caused by black lib
fatbuddy Mar 30, 2024
ddd043e
Merge branch 'pr2' into develop
fatbuddy Mar 30, 2024
56 changes: 43 additions & 13 deletions .github/workflows/ci.yml
@@ -1,9 +1,21 @@
name: CI

on:
workflow_dispatch:
inputs:
logLevel:
description: 'Log level'
required: true
default: 'warning'
type: choice
options:
- info
- warning
- debug
push:
branches:
- develop
- dev
- release
pull_request:
branches:
@@ -15,26 +27,39 @@ jobs:
strategy:
fail-fast: false
matrix:
python: ["3.8", "3.9"]
python: ["3.8", "3.9", "3.10", "3.11", "3.12"]
os: [ubuntu-latest, macos-latest, windows-latest]
include:
- os: ubuntu-latest
install_graphviz:
sudo apt install graphviz graphviz-dev
- os: macos-latest
install_graphviz: brew install graphviz
- os: windows-latest
install_graphviz:
choco install graphviz --version=2.48.0;
poetry run pip install --global-option=build_ext --global-option="-IC:\Program Files\Graphviz\include" --global-option="-LC:\Program Files\Graphviz\lib" pygraphviz;
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
- name: Checkout
uses: actions/checkout@v2

- uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python }}

- name: "Windows Graphviz install"
if: runner.os == 'Windows'
uses: crazy-max/ghaction-chocolatey@v3
with:
args: -h

- name: Install Graphviz for Windows
if: runner.os == 'Windows'
run: |
choco install graphviz --version=2.49.3

- name: Install pygraphviz for Windows
if: runner.os == 'Windows'
run: |
python -m pip install --use-pep517 --config-settings="--global-option=build_ext" --config-settings="--global-option=-IC:\\Program Files\\Graphviz\\include" --config-settings="--global-option=-LC:\\Program Files\\Graphviz\\lib" pygraphviz

- name: Install Graphviz for other platforms
if: runner.os != 'Windows'
uses: ts-graphviz/setup-graphviz@v2
with:
macos-skip-brew-update: 'true'

- name: Cache venv
uses: actions/cache@v2
with:
@@ -47,7 +72,7 @@ jobs:
${{ matrix.install_graphviz }}
echo "Cache Version ${{ secrets.CACHE_VERSION }}"
poetry install
-poetry run pip install ERAlchemy
+poetry run pip install ERAlchemy2
poetry config --list

- name: Print tool versions
@@ -95,6 +120,9 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Setup Graphviz
uses: ts-graphviz/setup-graphviz@v1.2.0

- name: Install dependencies
run: |
pip install poetry
@@ -110,6 +138,8 @@ jobs:
run: |
pip install poetry
poetry install
poetry run pip install ERAlchemy2


- name: Build docs
run: poetry run sphinx-build -M html docs/source docs/build
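
To get a feel for how many jobs the widened matrix in `.github/workflows/ci.yml` schedules, the fan-out can be sketched in Python. This is only an illustration: `expand_matrix` is a hypothetical helper, not part of the workflow, and it ignores the `include` entries, which attach `install_graphviz` to matching combinations rather than adding new ones.

```python
from itertools import product

# Values copied from the matrix in the workflow diff above.
PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11", "3.12"]
OPERATING_SYSTEMS = ["ubuntu-latest", "macos-latest", "windows-latest"]


def expand_matrix(pythons, systems):
    """Return one dict per (python, os) pair, mirroring how a job matrix fans out.

    Hypothetical helper for illustration only.
    """
    return [
        {"python": py, "os": os_name}
        for py, os_name in product(pythons, systems)
    ]


jobs = expand_matrix(PYTHON_VERSIONS, OPERATING_SYSTEMS)

# Steps guarded by `if: runner.os == 'Windows'` only run in these jobs.
windows_jobs = [job for job in jobs if job["os"] == "windows-latest"]

print(len(jobs), len(windows_jobs))
```

Under these assumptions the matrix grows from 6 combinations (two Python versions on three operating systems) to 15, five of which take the Windows-only Graphviz and pygraphviz steps.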
1 change: 1 addition & 0 deletions dataprep/__init__.py
@@ -4,6 +4,7 @@

Dataprep let you prepare your data using a single library with a few lines of code.
"""

import logging

DEFAULT_PARTITIONS = 1
1 change: 1 addition & 0 deletions dataprep/clean/address_utils.py
@@ -1,6 +1,7 @@
"""
Constants used by the clean_address() and validate_address() functions
"""

# pylint: disable=C0301, C0302, E1101

from builtins import zip
1 change: 1 addition & 0 deletions dataprep/clean/clean_ad_nrt.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
Andorra NRT (Número de Registre Tributari, Andorra tax number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_al_nipt.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
NIPT (Numri i Identifikimit për Personin e Tatueshëm, Albanian VAT number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ar_cbu.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
CBU (Clave Bancaria Uniforme, Argentine bank account number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ar_cuit.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
CUIT (Código Único de Identificación Tributaria, Argentinian tax number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ar_dni.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
DNI (Documento Nacional de Identidad, Argentinian national identity nr.).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_at_uid.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
UID (Umsatzsteuer-Identifikationsnummer, Austrian VAT number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_at_vnr.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing
VNR, SVNR, VSNR (Versicherungsnummer, Austrian social security number).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument, E1101, E1133
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_au_abn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Australian Business Numbers (ABNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_au_acn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Australian Company Numbers (ACNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_au_tfn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Australian Tax File Numbers (TFNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_be_iban.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Belgian IBANs.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_be_vat.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Belgian VAT numbers (VATs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_bg_egn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Bulgarian national identification numbers (EGNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_bg_pnf.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Bulgarian personal number of a foreigner.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_bg_vat.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Bulgarian VAT numbers (VATs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_bic.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing ISO 9362 Business identifier codes.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_bitcoin.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Bitcoin Addresses.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_br_cnpj.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing CNPJ numbers, Brazilian company identifier.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_br_cpf.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing CPF numbers, Brazilian national identifier.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_by_unp.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Belarusian UNP numbers (UNPs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ca_bn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Canadian Business Numbers (BNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ca_sin.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Canadian Social Insurance Numbers(SINs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_casrn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing CAS Registry Numbers.
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches, unused-argument
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ch_esr.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Swiss EinzahlungsSchein mit Referenznummer (ESRs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ch_ssn.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Swiss social security numbers (SSNs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ch_uid.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Swiss business identifiers (UIDs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_ch_vat.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Swiss VAT numbers (VATs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_cl_rut.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Chile RUT/RUN numbers (RUTs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_cn_ric.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Chinese Resident Identity Card Number (RICs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_cn_uscc.py
@@ -2,6 +2,7 @@
Clean and validate a DataFrame column containing Chinese Unified Social Credit Code
(China tax number) (USCCs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
1 change: 1 addition & 0 deletions dataprep/clean/clean_co_nit.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Colombian identity codes (NITs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter
5 changes: 2 additions & 3 deletions dataprep/clean/clean_country.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing country names.
"""

from functools import lru_cache
from operator import itemgetter
from os import path
@@ -371,9 +372,7 @@ def _get_format_if_allowed(input_format: str, allowed_formats: Tuple[str, ...])
return (
"name"
if "name" in allowed_formats
-else "official"
-if "official" in allowed_formats
-else None
+else "official" if "official" in allowed_formats else None
)

return input_format if input_format in allowed_formats else None
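
The change to this `return` appears to be layout only: the three-line `else "official"` / `if "official" in allowed_formats` / `else None` chain is collapsed onto a single line. A minimal sketch, assuming the surrounding function is otherwise unchanged, that checks the two layouts evaluate identically. `pick_before` and `pick_after` are hypothetical names used only for this comparison.

```python
from typing import Optional, Tuple


def pick_before(allowed_formats: Tuple[str, ...]) -> Optional[str]:
    # Layout before the change: each part of the nested conditional on its own line.
    return (
        "name"
        if "name" in allowed_formats
        else "official"
        if "official" in allowed_formats
        else None
    )


def pick_after(allowed_formats: Tuple[str, ...]) -> Optional[str]:
    # Layout after the change: the nested conditional sits on one line.
    return (
        "name"
        if "name" in allowed_formats
        else "official" if "official" in allowed_formats else None
    )


# Sample inputs chosen for illustration; they are not taken from the test suite.
cases = [
    ("name", "official"),
    ("official", "alpha-2"),
    ("alpha-2", "alpha-3"),
    (),
]
results = [(pick_before(case), pick_after(case)) for case in cases]
```

Conditional expressions chain right to left, so both spellings parse as `"name" if ... else ("official" if ... else None)` and each pair in `results` holds two equal values.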
1 change: 1 addition & 0 deletions dataprep/clean/clean_cr_cpf.py
@@ -1,6 +1,7 @@
"""
Clean and validate a DataFrame column containing Costa Rica physical person ID number (CPFs).
"""

# pylint: disable=too-many-lines, too-many-arguments, too-many-branches
from typing import Any, Union
from operator import itemgetter