bug CORE-4089: Onedrive partitioning fails - datetime formatting error (Unstructured-IO#2638)

Fixes the OneDrive bug the same way Ryan fixed the SharePoint error (both
are Microsoft products):
Unstructured-IO#2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing inconsistent timestamp formats returned by OneDrive when
fetching created and modified dates. Furthermore, future versions of the
office365 library will return a datetime object rather than a string.

Changes
This adds logic to guarantee that OneDrive dates are formatted as ISO
strings, regardless of the format provided by the OneDrive library.
Bumps the timestamp output format to include the timezone offset (as we
do with others).

Adds unit tests for ensure_isoformat_datetime.

json_to_dict is already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds a small change for AstraDB so that they can see which source called
their API.
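
The date normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the commit's implementation: the function name `to_iso_timestamp` is hypothetical, and strings are parsed with the standard library's `datetime.fromisoformat`, which may accept fewer formats than whatever parser the real helper uses.

```python
from datetime import datetime, timezone
from typing import Union


def to_iso_timestamp(value: Union[datetime, str]) -> str:
    """Hypothetical stand-in for the helper: return an ISO-formatted string."""
    if isinstance(value, datetime):
        # isoformat() includes the UTC offset only when tzinfo is set.
        return value.isoformat()
    if isinstance(value, str):
        # fromisoformat() raises ValueError for strings it cannot parse.
        return datetime.fromisoformat(value).isoformat()
    # Anything else (for example an int) is rejected outright.
    raise TypeError(f"expected datetime or str, got {type(value).__name__}")


print(to_iso_timestamp(datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)))
print(to_iso_timestamp("2021-01-01T12:00:00"))
```

Accepting both types in one place means the connector does not need to know which form the upstream library returns.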
potter-potter authored and kaaloo committed Apr 8, 2024
1 parent fb4874f commit 1558aeb
Showing 14 changed files with 145 additions and 106 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.12.7-dev3
+## 0.12.7-dev4

 ### Enhancements

@@ -9,6 +9,8 @@
 ### Fixes

 * **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
+* **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
+* **Adds tracking for AstraDB** Adds tracking info so AstraDB can see what source called their api.

 ## 0.12.6

2 changes: 1 addition & 1 deletion requirements/ingest/astra.txt
@@ -8,7 +8,7 @@ anyio==3.7.1
     # via
     #   -c ingest/../constraints.in
     #   httpx
-astrapy==0.7.6
+astrapy==0.7.7
     # via -r ingest/astra.in
 cassandra-driver==3.29.0
     # via cassio
@@ -3,8 +3,8 @@
     "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -24,8 +24,8 @@
     "element_id": "a9d4657034aa3fdb5177f1325e912362",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -45,8 +45,8 @@
     "element_id": "9c218520320f238595f1fde74bdd137d",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -66,8 +66,8 @@
     "element_id": "39a3ae572581d0f1fe7511fd7b3aa414",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -87,8 +87,8 @@
     "element_id": "fc1adcb8eaceac694e500a103f9f698f",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -108,8 +108,8 @@
     "element_id": "0b61e826b1c4ab05750184da72b89f83",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:09",
-        "date_modified": "2023-08-24T03:00:09",
+        "date_created": "2023-08-24T03:00:09+00:00",
+        "date_modified": "2023-08-24T03:00:09+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -3,8 +3,8 @@
     "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -24,8 +24,8 @@
     "element_id": "a9d4657034aa3fdb5177f1325e912362",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -45,8 +45,8 @@
     "element_id": "9c218520320f238595f1fde74bdd137d",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -66,8 +66,8 @@
     "element_id": "39a3ae572581d0f1fe7511fd7b3aa414",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -87,8 +87,8 @@
     "element_id": "fc1adcb8eaceac694e500a103f9f698f",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -108,8 +108,8 @@
     "element_id": "0b61e826b1c4ab05750184da72b89f83",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:27",
-        "date_modified": "2023-08-24T03:00:27",
+        "date_created": "2023-08-24T03:00:27+00:00",
+        "date_modified": "2023-08-24T03:00:27+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/nested/fake-text.txt",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -3,8 +3,8 @@
     "element_id": "a5c9668a6055bca2865ea5e6d16ea1e0",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -26,8 +26,8 @@
     "element_id": "1d34c23ff08573afa07b42842b41277a",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -49,8 +49,8 @@
     "element_id": "05440c6ca94cb55f6d185d8bd92ce9d6",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -72,8 +72,8 @@
     "element_id": "e39c724f1b09a4c3286b6368538e05fc",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -95,8 +95,8 @@
     "element_id": "1d34c23ff08573afa07b42842b41277a",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -118,8 +118,8 @@
     "element_id": "85ada878f2345c23b8a74a931d2e20a4",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -141,8 +141,8 @@
     "element_id": "0e570ca6fabe24f94e52c1833f3ffd25",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -164,8 +164,8 @@
     "element_id": "4cf4ff5597274d0c1ce8ae5a17ead4df",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -187,8 +187,8 @@
     "element_id": "dd167905de0defcaf72de673ee44c074",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -210,8 +210,8 @@
     "element_id": "5f9d7b40d332fef76efdd0a97bcb8617",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -233,8 +233,8 @@
     "element_id": "2b5c3d26721ae9c350cf3009318b626f",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -256,8 +256,8 @@
     "element_id": "53d2273ac70fc31640cc45af840dbd42",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -279,8 +279,8 @@
     "element_id": "4efca0d10c5feb8e9b35eb1d994f2905",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
@@ -302,8 +302,8 @@
     "element_id": "4c9720f1540cc84d33e30e09aca8c077",
     "metadata": {
       "data_source": {
-        "date_created": "2023-08-24T03:00:43",
-        "date_modified": "2023-08-24T03:00:43",
+        "date_created": "2023-08-24T03:00:43+00:00",
+        "date_modified": "2023-08-24T03:00:43+00:00",
         "record_locator": {
           "server_relative_path": "utic-test-ingest-fixtures/tests-example.xls",
           "user_pname": "devops@unstructuredio.onmicrosoft.com"
36 changes: 35 additions & 1 deletion test_unstructured_ingest/unit/test_utils.py
@@ -1,10 +1,14 @@
 import json
 import typing as t
 from dataclasses import dataclass, field
+from datetime import datetime

+import pytest
+import pytz
+
 from unstructured.ingest.cli.utils import extract_config
 from unstructured.ingest.interfaces import BaseConfig
-from unstructured.ingest.utils.string_utils import json_to_dict
+from unstructured.ingest.utils.string_and_date_utils import ensure_isoformat_datetime, json_to_dict


 @dataclass
@@ -128,3 +132,33 @@ def test_json_to_dict_path():
     expected_result = "/path/to/file.json"
     assert json_to_dict(json_string) == expected_result
     assert isinstance(json_to_dict(json_string), str)
+
+
+def test_ensure_isoformat_datetime_for_datetime():
+    dt = ensure_isoformat_datetime(datetime(2021, 1, 1, 12, 0, 0))
+    assert dt == "2021-01-01T12:00:00"
+
+
+def test_ensure_isoformat_datetime_for_datetime_with_tz():
+    dt = ensure_isoformat_datetime(datetime(2021, 1, 1, 12, 0, 0, tzinfo=pytz.UTC))
+    assert dt == "2021-01-01T12:00:00+00:00"
+
+
+def test_ensure_isoformat_datetime_for_string():
+    dt = ensure_isoformat_datetime("2021-01-01T12:00:00")
+    assert dt == "2021-01-01T12:00:00"
+
+
+def test_ensure_isoformat_datetime_for_string2():
+    dt = ensure_isoformat_datetime("2021-01-01T12:00:00+00:00")
+    assert dt == "2021-01-01T12:00:00+00:00"
+
+
+def test_ensure_isoformat_datetime_fails_on_string():
+    with pytest.raises(ValueError):
+        ensure_isoformat_datetime("bad timestamp")
+
+
+def test_ensure_isoformat_datetime_fails_on_int():
+    with pytest.raises(TypeError):
+        ensure_isoformat_datetime(1111)
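
The two datetime tests above expect different strings only because one input carries `tzinfo`. A small standard-library sketch of that behavior follows; it uses `datetime.timezone.utc` instead of `pytz.UTC`, and attaching UTC with `replace` assumes the naive value already represents UTC.

```python
from datetime import datetime, timezone

naive = datetime(2021, 1, 1, 12, 0, 0)
# Attach UTC without shifting the clock time (assumes the value is already UTC).
aware = naive.replace(tzinfo=timezone.utc)

naive_iso = naive.isoformat()
aware_iso = aware.isoformat()

print(naive_iso)  # 2021-01-01T12:00:00
print(aware_iso)  # 2021-01-01T12:00:00+00:00
```

The `+00:00` suffix is the same offset that now appears in the updated expected-output fixtures.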
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.12.7-dev3" # pragma: no cover
+__version__ = "0.12.7-dev4" # pragma: no cover
7 changes: 6 additions & 1 deletion unstructured/ingest/connector/astra.py
@@ -2,6 +2,8 @@
 import typing as t
 from dataclasses import dataclass, field

+from unstructured import __name__ as integration_name
+from unstructured.__version__ import __version__ as integration_version
 from unstructured.ingest.enhanced_dataclass import enhanced_field
 from unstructured.ingest.enhanced_dataclass.core import _asdict
 from unstructured.ingest.error import DestinationConnectionError, SourceConnectionNetworkError
@@ -67,10 +69,13 @@ def astra_db_collection(self) -> "AstraDBCollection":
         if self._astra_db_collection is None:
             from astrapy.db import AstraDB

-            # Build the Astra DB object
+            # Build the Astra DB object.
+            # caller_name/version for AstraDB tracking
             self._astra_db = AstraDB(
                 api_endpoint=self.connector_config.access_config.api_endpoint,
                 token=self.connector_config.access_config.token,
+                caller_name=integration_name,
+                caller_version=integration_version,
             )

             # Create and connect to the newly created collection
2 changes: 1 addition & 1 deletion unstructured/ingest/connector/fsspec/gcs.py
@@ -13,7 +13,7 @@
 from unstructured.ingest.enhanced_dataclass import enhanced_field
 from unstructured.ingest.error import SourceConnectionError
 from unstructured.ingest.interfaces import AccessConfig
-from unstructured.ingest.utils.string_utils import json_to_dict
+from unstructured.ingest.utils.string_and_date_utils import json_to_dict
 from unstructured.utils import requires_dependencies

