Validation schema design #7

lchen-2101 · 2023-12-04T19:46:03Z

Persist results from data-validator to DB

Schema design things to consider:

context (associated with submission id)
groupings of validations (e.g. error 1 associated with failures for records 2, 5, 7, etc.)

hkeeler · 2023-12-14T07:09:51Z

I started to look at PR #10, but I thought I would take a couple steps back and try to get my thoughts down on the overall schema for this database, not just the submission portion. Hopefully it can be a conversation starter, and we can do a few iterations on this, and it'll help us map out the remaining work and the order we want to take it on.

I also took a stab at what the filing and submissions state machines might look like.

Hope this is useful. Let's sync up on all this soon. Probably early next week.

Also, this is my first time playing with the inline Mermaid diagrams. It's not my favorite style-wise, but it's hard to beat having it just work right in a GitHub comment. ✨

---
title: Filing API Database Schema
---
erDiagram
    filing_type {
        varchar id PK
        varchar name UK
    }

    filing_period {
        varchar id PK
        varchar name UK
        date start_date
        date end_date
        date due_date
        varchar filing_type_id FK
    }
    filing_period }|--|| filing_type : ""

    filing_state {
        varchar id PK
        varchar name UK
    }

    filing {
        integer id PK "auto-increment"
        varchar lei "pseudo FK to `institution` table"
        varchar filing_period_id PK,FK
        varchar state FK
        varchar institution_snaphot_id "This needs some thought"
    }
    filing }|--|| filing_period : ""
    filing }|--|| filing_state : ""

    submission_state {
        varchar id PK
        varchar name UK        
    }

    submission {
        integer id PK 
        integer filing_id FK
        integer attempt "Q: Do we need this AND `time`?"
        timestamp time
        varchar user_id "Login.gov user id"
        varchar state FK
        varchar validation_ruleset_version "Useful if we update validation logic mid filing season"
        jsonb findings "JSONB alternative to additional `*_findings` tables"
    }
    submission }|--|| filing : ""
    submission }|--|| submission_state : ""

    submission_finding {
        integer submission_id FK
        varchar validation_code "Q: Do we need additional validation metadata?"
    }
    submission_finding }|--|| submission : ""

    submission_finding_record {
        integer record_number
        varchar field_name
        varchar field_value
    }
    submission_finding_record }|--|| submission_finding : ""

---
title: Filing State Machine
---
stateDiagram
    [*] --> filing_started
    filing_started --> institution_data_set
    institution_data_set --> submissions_in_progress
    submissions_in_progress --> filing_complete
    filing_complete --> submissions_in_progress
    filing_complete --> must_refile
    must_refile --> submissions_in_progress
    filing_complete --> [*]

---
title: Submission State Machine
---
stateDiagram
    classDef failure fill:#f00,color:white
    classDef error fill:#ff0


    [*] --> submission_started
    submission_started --> uploading
    uploading --> upload_failed:::failure
    uploading --> uploaded
    uploaded --> validating
    validating --> system_failure:::failure
    validating --> invalid_file:::error
    validating --> validated_with_errors:::error
    validating --> validated_with_warnings
    validating --> validated_clean
    validated_with_warnings --> signed
    validated_clean --> signed
    signed --> [*]

jcadam14 · 2023-12-14T16:00:15Z

Wow, mermaid is cool. I used to just throw stuff together in Visio but wasn't sure how somewhat high level designs were done here so figured I'd let the code be the design. Like the mermaid thing!

Read through comments and diagram. I'm putting my thoughts down here so I don't forget them later.

I normally don't like to store data like "count", "attempt", etc since those are easily derived from data.
Same with "state". If there's a single finding with severity "error" (which is a simple filter), the state of hard constraints, validated_with_errors, etc is known versus needing to be stored. But it looks like there's the potential for filings to be paused or what not during this process? (there's more to it than upload file, get back results?)
I stored the name of the field for an error at the same level as the error code because it's one to one. Records are a list of "rows" within a single submission that threw a particular error and they will all have the same field. So it made sense in my brain to move that up a level (unless I'm thinking about that wrong).
I didn't store validation code metadata other than what we might need to search over (error/warning code, severity). Things like description could be pulled from the phase_validation schema itself. It did make me wonder if we wanted to store all that in a static table to reference "just in case". Wasn't sure what all was needed.
The JSONB concept is nice, simplifies the schema a bit (I've worked with data that had deep parent/child relations and they can turn into problems to maintain and performance), but it makes it more procedural if we need to query on, say, "All submissions that resulted in error code E1234" or "All submissions that had just warnings". But if that sort of query isn't really needed, storing the results as JSON is a good way to go.

I'm scheduling a chat for Monday if that works.

lchen-2101 · 2023-12-15T16:21:39Z

Maybe #6 should've been worked on first. That would open up the big picture of how the flows would look like; original thought of this ticket was only about storing the validation results. I think the filing state and things aside from the validation should be worked out there. I've linked the API flow from HMDA in that ticket: https://ffiec.cfpb.gov/documentation/api/filing/platform#endpoints, shares similarities with the state machine mentioned here aside from the confirm institution data part not being present on the HMDA end.

jcadam14 · 2023-12-27T23:02:19Z

---
title: Filing API Database Classes
---
classDiagram
    class `FilingDAO(AuditMixin, Base):`{
      __tablename__ = "filing"
      id: Mapped[int] = mapped_column#40;primary_key=True, autoincrement=True#41;
      lei: Mapped[str]
      state: Mapped[FilingState] = mapped_column#40;Enum#40;FilingState#41;#41;
      filing_period = Mapped[str] = mapped_column(ForeignKey#40;"filing_period.id"#41;#41;
      institution_snapshot_id = Mapped[str] # not sure what this is
   }

class `FilingPeriodDAO(AuditMixin, Base):`{
    __tablename__ = "filing_period"
    id: Mapped[int] = mapped_column#40;primary_key=True, autoincrement=True#41;
    name: Mapped[str]
    start: Mapped[datetime]
    end: Mapped[datetime]
    due: Mapped[datetime]
    filing_type: Mapped[FilingType] = mapped_column#40;Enum#40;FilingType#41;#41;
}

class `SubmissionDAO(AuditMixin, Base):`{
    __tablename__ = "submission"
    id: Mapped[str] = mapped_column#40;index=True, primary_key=True#41;
    submitter: Mapped[str] # user_id instead?
    results: Mapped[List["FindingDAO"]] = relationship#40;back_populates="submission"#41;
    state: Mapped[SubmissionState] = mapped_column#40;Enum#40;SubmissionState#41;#41;
    validation_ruleset_version: Mapped[str]
    json_dump: Mapped[dict[str, Any]] = mapped_column#40;JSON, nullable=True#41;
    filing: Mapped[str] = mapped_column#40;ForeignKey#40;"filing.id"#41;#41;
}

class `FindingDAO(AuditMixin, Base):`{
    __tablename__ = "submission_finding"
    id: Mapped[int] = mapped_column#40;primary_key=True, autoincrement=True#41;
    submission_id: Mapped[str] = mapped_column(ForeignKey#40;"submission.id"#41;#41;
    submission: Mapped["SubmissionDAO"] = relationship#40;back_populates="results"#41; # if we care about bidirectional
    validation_code: Mapped[str]
    severity: Mapped[Severity] = mapped_column#40;Enum#40;*get_args#40;Severity#41;#41;#41;
    records: Mapped[List["RecordDAO"]] = relationship#40;back_populates="result"#41;
}
note for `FindingDAO(AuditMixin, Base):` "Severity = Literal['error', 'warning']"

class `RecordDAO(AuditMixin, Base):`{
    __tablename__ = "submission_finding_record"
    id: Mapped[int] = mapped_column#40;primary_key=True, autoincrement=True#41;
    result_id: Mapped[str] = mapped_column#40;ForeignKey#40;"submission_finding.id"#41;#41;
    result: Mapped["FindingDAO"] = relationship#40;back_populates="records"#41;  # if we care about bidirectional
    record: Mapped[int]
    field_name: Mapped[str]
    field_value: Mapped[str]
}
`FilingPeriodDAO(AuditMixin, Base):` *-- `FilingDAO(AuditMixin, Base):`
`FilingDAO(AuditMixin, Base):` *-- `SubmissionDAO(AuditMixin, Base):`
`SubmissionDAO(AuditMixin, Base):` *-- `FindingDAO(AuditMixin, Base):`
`FindingDAO(AuditMixin, Base):` *-- `RecordDAO(AuditMixin, Base):`

---
title: State Enums
---
classDiagram
note for `SubmissionState(Enum):` "SUBMISSION_UPLOAD_FAILED seems to only be 
applicable if the storage to the S3 fails.  
Otherwise, if there is a failure to send the
 file to the initial upload endpoint, the endpoint 
wouldn't know about it.  Does not being able to 
store to S3 indicate an overall processing failure?"
class `SubmissionState(Enum):`{
    SUBMISSION_UPLOADED = 1
    SUBMISSION_UPLOAD_FAILED = 2
    SUBMISSION_WITH_ERRORS = 3  
    SUBMISSION_WITH_WARNINGS = 4
    SUBMISSION_SUCCESSFUL = 5
    SUBMISSION_SIGNED = 6
}
    
class `FilingState(Enum):`{
    FILING_STARTED = 1
    FILING_IN_PROGRESS = 2
    FILING_COMPLETE = 3
}
class `FilingType(Enum):`{
    TYPE_A = "Type_A"
    TYPE_B = "Type_B"
}

---
title: Filing State Machine
---
stateDiagram
    [*] --> filing_started
    filing_started --> filing_in_progress
    filing_in_progress --> filing_in_progress
    filing_in_progress --> filing_complete
    filing_complete --> [*]

---
title: Submission State Machine
---
stateDiagram
    classDef failure fill:#f00,color:white
    classDef warning fill:#ff0


    [*] --> submission_uploaded
    submission_uploaded --> submission_upload_failed:::failure
    submission_uploaded --> submission_with_errors:::failure
    submission_uploaded --> submission_with_warnings:::warning
    submission_uploaded --> submission_successful
    submission_with_warnings --> submission_signed
    submission_successful --> submission_signed
    submission_signed --> [*]

…rom #7) Merged in branch 7-validation-schema-design-wip to support latest DAOs Updated alembic scripts for new tables Updated tests to test alembic migration, and also updated the repo test since I changed the FilingType enum to be MANUAL instead of TYPE_A and TYPE_B

…s branch

Added pytests to test the update_submission fuction. Please note the comment on the _generator_function in test_submission_repo.py Needed to merge in parts from story #7

Closes #13 Created a new branch that focuses purely on the alembic script needs instead of a full merge of #7 and #18 into the #13 branch. Makes it so only the code related to the migrations is needed for review. Originally I merged them all together to test starting the service in poetry to make sure the tables are created correctly in Postgres but that made the PR have a lot of files that were already covered in other PRs. Still the case with the entities/models directory but couldn't get around that. ## Additions - Added db_revision folder using alembic init and updated files for SBL specifics (using user-fi as an example) - Added migration files using alembic revision for each submission table (submission, filing, filing_period) - Updated migration files for table specifics - Added pytest migrations tests for new tables, columns, and fks - Had to merge in the entities/models dir since the migrations use the enumerations ## Removals - ## Changes - ## Testing 1. Run pytest in poetry shell (or poetry run pytest) and verify the tests pass ## Screenshots ## Notes - ## Todos - ## Checklist - [ ] PR has an informative and human-readable title - [ ] Changes are limited to a single goal (no scope creep) - [ ] Code can be automatically merged (no conflicts) - [ ] Code follows the standards laid out in the [development playbook](https://github.com/cfpb/development) - [ ] Passes all existing automated tests - [ ] Any _change_ in functionality is tested - [ ] New functions are documented (with a description, list of inputs, and expected output) - [ ] Placeholder code is flagged / future todos are captured in comments - [ ] Visually tested in supported browsers and devices (see checklist below 👇) - [ ] Project documentation has been updated (including the "Unreleased" section of the CHANGELOG) - [ ] Reviewers requested with the [Reviewers tool](https://help.github.com/articles/requesting-a-pull-request-review/) :arrow_right: ## Testing checklist ### Browsers - [ ] Chrome - [ ] Firefox - [ ] Safari - [ ] Internet Explorer 8, 9, 10, and 11 - [ ] Edge - [ ] iOS Safari - [ ] Chrome for Android ### Accessibility - [ ] Keyboard friendly - [ ] Screen reader friendly ### Other - [ ] Is useable without CSS - [ ] Is useable without JS - [ ] Flexible from small to large screens - [ ] No linting errors or warnings - [ ] JavaScript tests are passing

jcadam14 · 2024-01-24T15:50:10Z

Closing this as the models are merged into main under other stories. As the filing process matures we'll write new stories to capture those updates.

jcadam14 self-assigned this Dec 8, 2023

jcadam14 linked a pull request Dec 12, 2023 that will close this issue

7 validation schema design wip #10

Closed

25 tasks

jcadam14 mentioned this issue Dec 12, 2023

Add filer api tables to init.sql script for submission, validation_result, and validation_result_record cfpb/sbl-project#61

Closed

jcadam14 changed the title ~~Validation schema design (WIP)~~ Validation schema design Dec 12, 2023

jcadam14 mentioned this issue Jan 3, 2024

7 validation schema design wip #10

Closed

25 tasks

jcadam14 added a commit that referenced this issue Jan 9, 2024

Cleaned up branch to no longer be a full merge of #7 and #18 into thi…

142d896

…s branch

jcadam14 mentioned this issue Jan 9, 2024

Cleaned up branch to no longer be a full merge of #7 and #18 #20

Merged

25 tasks

This was referenced Jan 10, 2024

Add json "dump" to SubmissionDAO #12

Closed

21 add update submission function to submission repo #25

Closed

jcadam14 closed this as completed Jan 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validation schema design #7

Validation schema design #7

lchen-2101 commented Dec 4, 2023

hkeeler commented Dec 14, 2023

jcadam14 commented Dec 14, 2023

lchen-2101 commented Dec 15, 2023

jcadam14 commented Dec 27, 2023

jcadam14 commented Jan 24, 2024

Validation schema design #7

Validation schema design #7

Comments

lchen-2101 commented Dec 4, 2023

hkeeler commented Dec 14, 2023

jcadam14 commented Dec 14, 2023

lchen-2101 commented Dec 15, 2023

jcadam14 commented Dec 27, 2023

jcadam14 commented Jan 24, 2024