{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":541798154,"defaultBranch":"main","name":"unstructured","ownerLogin":"Unstructured-IO","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2022-09-26T21:53:41.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/108372208?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1714612343.0","currentOid":""},"activityList":{"items":[{"before":"7de1c9ab13694bda017c65e6c07c1fcfb18a7b9d","after":"5aec3cde8c23aabbd927853d5a758520e535d94d","ref":"refs/heads/scanny/fix-doc-tempfile-test","pushedAt":"2024-05-02T03:10:36.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"485fb3fb435500d5d1863659ac030d0912ecbf62","after":"79b064205906460bd39a5a015e637b6a894ebb56","ref":"refs/heads/scanny/tidy-doc-and-ppt-tests","pushedAt":"2024-05-02T03:10:11.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"15430c27fe552b9b4b71896c078df5dc12d74aff","after":null,"ref":"refs/heads/scanny/fix-docx-short-table-rows","pushedAt":"2024-05-02T01:12:23.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"}},{"before":"601594d373dbf43f50a0792bbe5369cb6d32913f","after":null,"ref":"refs/heads/gh-readonly-queue/main/pr-2943-eff84afe243dc38295d4be263ed31028f0f5572b","pushedAt":"2024-05-02T01:12:22.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"eff84afe243dc38295d4be263ed31028f0f5572b","after":"601594d373dbf43f50a0792bbe5369cb6d32913f","ref":"refs/heads/main","pushedAt":"2024-05-02T01:12:21.000Z","pushType":"merge_queue_merge","commitsCount":1,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"fix(docx): fix short-row DOCX table (#2943)\n\n**Summary**\nThe DOCX format allows a table row to start late and/or end early,\nmeaning cells at the beginning or end of a row can be omitted. While\nthere are legitimate uses for this capability, using it in practice is\nrelatively rare. However, it can happen unintentionally when adjusting\ncell borders with the mouse. Accommodate this case and generate accurate\n`.text` and `.metadata.text_as_html` for these tables.","shortMessageHtmlLink":"fix(docx): fix short-row DOCX table (#2943)"}},{"before":null,"after":"601594d373dbf43f50a0792bbe5369cb6d32913f","ref":"refs/heads/gh-readonly-queue/main/pr-2943-eff84afe243dc38295d4be263ed31028f0f5572b","pushedAt":"2024-05-02T00:46:08.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"fix(docx): fix short-row DOCX table (#2943)\n\n**Summary**\nThe DOCX format allows a table row to start late and/or end early,\nmeaning cells at the beginning or end of a row can be omitted. While\nthere are legitimate uses for this capability, using it in practice is\nrelatively rare. However, it can happen unintentionally when adjusting\ncell borders with the mouse. Accommodate this case and generate accurate\n`.text` and `.metadata.text_as_html` for these tables.","shortMessageHtmlLink":"fix(docx): fix short-row DOCX table (#2943)"}},{"before":"49b431124ac89c900c7ea79ac97a8244a1061692","after":"15430c27fe552b9b4b71896c078df5dc12d74aff","ref":"refs/heads/scanny/fix-docx-short-table-rows","pushedAt":"2024-05-01T23:53:41.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"25e29cde9bf1f8c087f8c1cff1b8b2cd2d68118c","after":"49b431124ac89c900c7ea79ac97a8244a1061692","ref":"refs/heads/scanny/fix-docx-short-table-rows","pushedAt":"2024-05-01T23:46:16.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"9f62c42123f83b4a7b00b695e4607df204d033c6","after":"7de1c9ab13694bda017c65e6c07c1fcfb18a7b9d","ref":"refs/heads/scanny/fix-doc-tempfile-test","pushedAt":"2024-05-01T23:35:41.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"ab897da028db760132c654416da260e801666964","after":"485fb3fb435500d5d1863659ac030d0912ecbf62","ref":"refs/heads/scanny/tidy-doc-and-ppt-tests","pushedAt":"2024-05-01T23:32:12.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":null,"after":"ab897da028db760132c654416da260e801666964","ref":"refs/heads/scanny/tidy-doc-and-ppt-tests","pushedAt":"2024-05-01T23:29:26.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"rfctr: tidy up ppt+doc tests","shortMessageHtmlLink":"rfctr: tidy up ppt+doc tests"}},{"before":"23554a6886302215c78facfef5354f5bfdbd503e","after":"9f62c42123f83b4a7b00b695e4607df204d033c6","ref":"refs/heads/scanny/fix-doc-tempfile-test","pushedAt":"2024-05-01T23:25:51.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":null,"after":"23554a6886302215c78facfef5354f5bfdbd503e","ref":"refs/heads/scanny/fix-doc-tempfile-test","pushedAt":"2024-05-01T23:23:35.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"d3c9c7f4054d6614a711f1ec456d09032f7fc07a","after":null,"ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-05-01T22:16:17.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"}},{"before":"eff84afe243dc38295d4be263ed31028f0f5572b","after":null,"ref":"refs/heads/gh-readonly-queue/main/pr-2952-542d442699602ff5bdd5a9cc80d4795933166b4c","pushedAt":"2024-05-01T22:16:16.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"542d442699602ff5bdd5a9cc80d4795933166b4c","after":"eff84afe243dc38295d4be263ed31028f0f5572b","ref":"refs/heads/main","pushedAt":"2024-05-01T22:16:15.000Z","pushType":"merge_queue_merge","commitsCount":1,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"chore: update python-docx version dependency (#2952)\n\n**Summary**\n`unstructured` will use table features added in the most recent version\nof `python-docx`.\n\nAlso update the `lxml` version constraint because `lxml>4.9.2` will not\ninstall on Apple Silicon\n(https://github.com/Unstructured-IO/unstructured/issues/1707).\n\n`python-docx` requires `lxml` although other file formats require it as\nwell.","shortMessageHtmlLink":"chore: update python-docx version dependency (#2952)"}},{"before":null,"after":"eff84afe243dc38295d4be263ed31028f0f5572b","ref":"refs/heads/gh-readonly-queue/main/pr-2952-542d442699602ff5bdd5a9cc80d4795933166b4c","pushedAt":"2024-05-01T21:36:46.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"chore: update python-docx version dependency (#2952)\n\n**Summary**\n`unstructured` will use table features added in the most recent version\nof `python-docx`.\n\nAlso update the `lxml` version constraint because `lxml>4.9.2` will not\ninstall on Apple Silicon\n(https://github.com/Unstructured-IO/unstructured/issues/1707).\n\n`python-docx` requires `lxml` although other file formats require it as\nwell.","shortMessageHtmlLink":"chore: update python-docx version dependency (#2952)"}},{"before":"ce866d5377e9f2ae47586a16f86c44c1f8424672","after":"d3c9c7f4054d6614a711f1ec456d09032f7fc07a","ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-05-01T20:55:12.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"1b32eb63d249cf43ba90d8782e0d5eb1eb3b3275","after":"ce866d5377e9f2ae47586a16f86c44c1f8424672","ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-04-30T16:35:00.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"7edda3ce739788d99fe68667dbabd9e966be2cda","after":"1b32eb63d249cf43ba90d8782e0d5eb1eb3b3275","ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-04-30T16:24:42.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"75a2017e1124537990b482910f77617b6730121a","after":null,"ref":"refs/heads/yuming/remove_html_page_numer_metadata_field","pushedAt":"2024-04-30T16:08:47.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"yuming-long","name":"Yuming Long","path":"/yuming-long","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/63475068?s=80&v=4"}},{"before":"542d442699602ff5bdd5a9cc80d4795933166b4c","after":null,"ref":"refs/heads/gh-readonly-queue/main/pr-2942-0d80886578dad40a557010e8d90506f61aedb0a9","pushedAt":"2024-04-30T16:08:47.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"}},{"before":"0d80886578dad40a557010e8d90506f61aedb0a9","after":"542d442699602ff5bdd5a9cc80d4795933166b4c","ref":"refs/heads/main","pushedAt":"2024-04-30T16:08:46.000Z","pushType":"merge_queue_merge","commitsCount":1,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"chore CORE-4775: remove html page number metadata field (#2942)\n\n### Summary\n\nRip off page_number metadata fields until we have page counting for all\nkinds of html files (not just limited to news articles with multiple\n`
` tag)\n\n### Test\nUnit tests\n`test_add_chunking_strategy_on_partition_html_respects_multipage` and\n`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`\nremoved since they relay on the `page_number` fields from the SEC html\nfile - now test moved to mock test for chunk_by_title -> revisit those\ntests when we find test file for this\n\nAlso changed the element ids from partition outputs for html files -\nelement id change due to page number change (in element id hashing) ->\ntodo ticket: update other deterministic element id tests per crag's\ncomment\n\n---------\n\nCo-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>\nCo-authored-by: yuming-long ","shortMessageHtmlLink":"chore CORE-4775: remove html page number metadata field (#2942)"}},{"before":null,"after":"542d442699602ff5bdd5a9cc80d4795933166b4c","ref":"refs/heads/gh-readonly-queue/main/pr-2942-0d80886578dad40a557010e8d90506f61aedb0a9","pushedAt":"2024-04-30T15:20:42.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"github-merge-queue[bot]","name":null,"path":"/apps/github-merge-queue","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9919?s=80&v=4"},"commit":{"message":"chore CORE-4775: remove html page number metadata field (#2942)\n\n### Summary\n\nRip off page_number metadata fields until we have page counting for all\nkinds of html files (not just limited to news articles with multiple\n`
` tag)\n\n### Test\nUnit tests\n`test_add_chunking_strategy_on_partition_html_respects_multipage` and\n`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`\nremoved since they relay on the `page_number` fields from the SEC html\nfile - now test moved to mock test for chunk_by_title -> revisit those\ntests when we find test file for this\n\nAlso changed the element ids from partition outputs for html files -\nelement id change due to page number change (in element id hashing) ->\ntodo ticket: update other deterministic element id tests per crag's\ncomment\n\n---------\n\nCo-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>\nCo-authored-by: yuming-long ","shortMessageHtmlLink":"chore CORE-4775: remove html page number metadata field (#2942)"}},{"before":"fce70aa4f1b16f28cc7171905f9357e131dcda38","after":"75a2017e1124537990b482910f77617b6730121a","ref":"refs/heads/yuming/remove_html_page_numer_metadata_field","pushedAt":"2024-04-30T14:48:09.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"yuming-long","name":"Yuming Long","path":"/yuming-long","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/63475068?s=80&v=4"},"commit":{"message":"no need for release","shortMessageHtmlLink":"no need for release"}},{"before":"3361d295bd997ca2a18ac12e9dddce4bb2339850","after":"fce70aa4f1b16f28cc7171905f9357e131dcda38","ref":"refs/heads/yuming/remove_html_page_numer_metadata_field","pushedAt":"2024-04-30T14:31:17.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"yuming-long","name":"Yuming Long","path":"/yuming-long","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/63475068?s=80&v=4"},"commit":{"message":"Merge branch 'main' into yuming/remove_html_page_numer_metadata_field","shortMessageHtmlLink":"Merge branch 'main' into yuming/remove_html_page_numer_metadata_field"}},{"before":"03a626e14b92a4a435d86229628e4f9aea4e0821","after":"7edda3ce739788d99fe68667dbabd9e966be2cda","ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-04-30T06:35:33.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}},{"before":"7ce80df017b8a6d99c08198d24d84162e89b1acb","after":"25e29cde9bf1f8c087f8c1cff1b8b2cd2d68118c","ref":"refs/heads/scanny/fix-docx-short-table-rows","pushedAt":"2024-04-30T06:31:25.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"spike: fix short-row DOCX table","shortMessageHtmlLink":"spike: fix short-row DOCX table"}},{"before":"7720e724240b2a1ea1f431bc2ec6bb28ac377659","after":"0d80886578dad40a557010e8d90506f61aedb0a9","ref":"refs/heads/main","pushedAt":"2024-04-30T05:53:44.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"cragwolfe","name":null,"path":"/cragwolfe","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/28578599?s=80&v=4"},"commit":{"message":"fix: parse URL response Content-Type according to RFC 9110 (#2950)\n\nCurrently, `file_and_type_from_url()` does not correctly handle the\r\n`Content-Type` header. Specifically, it assumes that the header contains\r\nonly the mime-type (e.g. `text/html`), however, [RFC\r\n9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows\r\nfor additional directives — specifically the `charset` — to be returned\r\nin the header. This leads to a `ValueError` when loading a URL with a\r\nresponse Content-Type header such as `text/html; charset=UTF-8`.\r\n\r\nTo reproduce the issue:\r\n\r\n```python\r\nfrom unstructured.partition.auto import partition\r\n\r\nurl = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"\r\npartition(url=url)\r\n```\r\n\r\nWhich will result in the following exception:\r\n\r\n```python\r\n{\r\n\t\"name\": \"ValueError\",\r\n\t\"message\": \"Invalid file. The FileType.UNK file type is not supported in partition.\",\r\n\t\"stack\": \"---------------------------------------------------------------------------\r\nValueError Traceback (most recent call last)\r\nCell In[1], line 4\r\n 1 from unstructured.partition.auto import partition\r\n 3 url = \\\"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\\\"\r\n----> 4 partition(url=url)\r\n\r\nFile ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)\r\n 539 else:\r\n 540 msg = \\\"Invalid file\\\" if not filename else f\\\"Invalid file {filename}\\\"\r\n--> 541 raise ValueError(f\\\"{msg}. The {filetype} file type is not supported in partition.\\\")\r\n 543 for element in elements:\r\n 544 element.metadata.url = url\r\n\r\nValueError: Invalid file. The FileType.UNK file type is not supported in partition.\"\r\n}\r\n```\r\n\r\nThis PR fixes the issue by parsing the mime-type out of the\r\n`Content-Type` header string.\r\n\r\n\r\nCloses #2257","shortMessageHtmlLink":"fix: parse URL response Content-Type according to RFC 9110 (#2950)"}},{"before":"d31408685e9fd0ae99e47fcbecc6ef1a7d8d515a","after":"03a626e14b92a4a435d86229628e4f9aea4e0821","ref":"refs/heads/scanny/bump-docx-dependency","pushedAt":"2024-04-30T04:36:58.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"scanny","name":"Steve Canny","path":"/scanny","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/2062718?s=80&v=4"},"commit":{"message":"chore: bump CHANGELOG + __version__","shortMessageHtmlLink":"chore: bump CHANGELOG + __version__"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEP2I78QA","startCursor":null,"endCursor":null}},"title":"Activity · Unstructured-IO/unstructured"}