Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug CORE-4089: Onedrive partitioning fails - datetime formatting error #2638

Merged
merged 15 commits into from Mar 15, 2024

Conversation

potter-potter
Copy link
Contributor

@potter-potter potter-potter commented Mar 11, 2024

Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are microsoft products)
#2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with others)

Adds unit tests for isofomat.

json_to_dict already unit tested here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called their api

@ahmetmeleq
Copy link
Contributor

ahmetmeleq commented Mar 14, 2024

ensure_isoformat_datetime already unit tested here: ...

Can we add a comment to the file such as # this also tests ensure_isoformat_datetime under the hood

so that when we search for a test for this function it comes up? Also it'll guarantee that if a change is made to the test in the future, the author will be aware that their change might affect test coverage

Edit: As a note, ideally we should decouple the function's unit test from Sharepoint since we decouple the function from the connector in this PR

Comment on lines +77 to +78
caller_name=integration_name,
caller_version=integration_version,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are Astra changes relevant to our PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they got in unexpectedly, let's remove; if intentional, let's add explanations in the PR description and changelog

@ahmetmeleq
Copy link
Contributor

ahmetmeleq commented Mar 14, 2024

Also where have we spotted the Onedrive dates breaking (being inconsistent and failing ingest tests, or something else), or is this PR preventive in case they'll break in the future?

@potter-potter
Copy link
Contributor Author

ensure_isoformat_datetime already unit tested here: ...

Can we add a comment to the file such as # this also tests ensure_isoformat_datetime under the hood

so that when we search for a test for this function it comes up? Also it'll guarantee that if a change is made to the test in the future, the author will be aware that their change might affect test coverage

Edit: As a note, ideally we should decouple the function's unit test from Sharepoint since we decouple the function from the connector in this PR

I just made new tests that cover ensure_isoformat_datetime

@potter-potter
Copy link
Contributor Author

Also where have we spotted the Onedrive dates breaking (being inconsistent and failing ingest tests, or something else), or is this PR preventive in case they'll break in the future?

Improved the PR overview. We are seeing this issue live when processing docs.

Copy link
Contributor

@ahmetmeleq ahmetmeleq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@potter-potter potter-potter added this pull request to the merge queue Mar 15, 2024
Merged via the queue into main with commit 5b92e0b Mar 15, 2024
46 checks passed
@potter-potter potter-potter deleted the CORE-4089/OnedriveFailsPartitioning branch March 15, 2024 14:36
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
Unstructured-IO#2638)

Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both
are microsoft products)
Unstructured-IO#2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)

Adds unit tests for isofomat.

json_to_dict already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called
their api
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
Unstructured-IO#2638)

Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both
are microsoft products)
Unstructured-IO#2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)

Adds unit tests for isofomat.

json_to_dict already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called
their api
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants