end to end table understanding using python API with demo on WD tooling #245

Open: wants to merge 1 commit into base: master

Monireh2

Made table understanding end to end using the WD Python SDK, and included a video tutorial showing how to use the WD tooling up to querying the project.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@review-notebook-app

review-notebook-app bot commented Feb 1, 2022

View / edit / reply to this conversation on ReviewNB

frreiss commented on 2022-02-01T00:39:46Z
----------------------------------------------------------------

Web link to Cloud Pak for Data is not rendering properly on ReviewNB. Is there a typo in the Markdown?


Monireh2 commented on 2022-02-02T02:32:45Z
----------------------------------------------------------------

The link worked on my local machine, and also here in ReviewNB when I clicked it. I think it failed for you because of a newline at the start of the URL. Just fixed it. Thanks for pointing that out.
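
For anyone who wants to check a notebook's Markdown for the same problem, here is a minimal sketch. It assumes the link was written in the inline `[label](url)` form, which the thread does not confirm. The function name `links_with_newline` and the example.com URLs are made up for illustration.

```python
import re
from typing import List

# Matches inline Markdown links of the form [label](target).
# The negated character classes also match newlines, so a target
# that spans several lines is still captured.
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def links_with_newline(markdown: str) -> List[str]:
    """Return every inline link whose target contains a newline."""
    return [
        match.group(0)
        for match in LINK_PATTERN.finditer(markdown)
        if "\n" in match.group(2)
    ]


# Placeholder text: the first link has a newline before the URL.
sample = (
    "See [Cloud Pak for Data](\nhttps://example.com/cpd) "
    "and [docs](https://example.com/docs)."
)
for broken in links_with_newline(sample):
    print(repr(broken))
```

Only links with a line break inside the parentheses are reported, so a link with a quoted title after the URL is not flagged.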


frreiss commented on 2022-02-01T00:39:47Z
----------------------------------------------------------------

"We start" ==> "Allison starts"


Monireh2 commented on 2022-02-02T02:33:40Z
----------------------------------------------------------------

fixed, thanks!


frreiss commented on 2022-02-01T00:39:48Z
----------------------------------------------------------------

There's no need to embed Python code to display the video. You can directly embed the video file into the Markdown in the previous cell. Syntax:

<video controls src="./images/Table_Understanding.mp4">Creating a collection in IBM Watson Discovery</video>

Documentation here: https://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html#Local-files

Overall the video looks good, but I do have some suggestions:

* You need to blur/black out the PII -- user names, people's names, account names. Instructions here: https://www.youtube.com/watch?v=54KYsEVJlWQ.

* If you have time to re-record the clip, I think it would work better if you shrunk the browser window to a smaller size and just recorded the window (Press command-shift-5 to select a portion of the screen to record).

* I recommend you edit out or speed up the parts where you're waiting for Discovery to perform an action.


Monireh2 commented on 2022-02-02T02:42:51Z
----------------------------------------------------------------

Thanks for the pointer, Fred (@frreiss). I actually tried that, but it gives me a black screen with an inactive play button. The only way I could resolve the issue was to use the Python snippet above. I will address your other comments.


frreiss commented on 2022-02-01T00:39:48Z
----------------------------------------------------------------

I think it would be better to move this cell and the ones that follow (up to the heading, "Query the project") to a separate notebook file to avoid breaking up the flow. You can put a hyperlink to the other notebook file directly into your Markdown, i.e.

For more information, refer to [this additional notebook](./other_notebook.ipynb)


frreiss commented on 2022-02-01T00:39:49Z
----------------------------------------------------------------

Can you truncate this output a bit? Maybe print out the first 20 lines, followed by something like [200 more lines]?


Monireh2 commented on 2022-02-02T16:43:53Z
----------------------------------------------------------------

done!
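
For reference, the truncation requested above can be done with a small helper along these lines. This is only a sketch: the name `truncate_lines` and the sample text are made up, and the notebook may well do it differently.

```python
def truncate_lines(text: str, max_lines: int = 20) -> str:
    """Keep the first `max_lines` lines and append a note counting the rest."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        # Short output is returned unchanged.
        return text
    hidden = len(lines) - max_lines
    return "\n".join(lines[:max_lines] + [f"[{hidden} more lines]"])


# Made-up 220-line output, cut down to 20 lines plus a one-line note.
long_output = "\n".join(f"line {i}" for i in range(220))
print(truncate_lines(long_output))
```

In a notebook cell, the text could come from `json.dumps(result, indent=2)` or any other string that would otherwise flood the output area.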


frreiss commented on 2022-02-01T00:39:50Z
----------------------------------------------------------------

This table is rendering as empty (no body cells) in ReviewNB.


Monireh2 commented on 2022-02-03T00:27:23Z
----------------------------------------------------------------

That is weird. It renders for me on my local machine.


frreiss commented on 2022-02-01T00:39:51Z
----------------------------------------------------------------

The data shown doesn't match the screenshot. The screenshot shows 2013-2014 data; the data here is for 2014-2014.


Monireh2 commented on 2022-02-03T00:28:36Z
----------------------------------------------------------------

Resolved. It had changed in the final run!


frreiss commented on 2022-02-01T00:39:51Z
----------------------------------------------------------------

Those error messages ('ERROR READING VALUE:"" Filling with <NA>') shouldn't be there. Can you track down the root cause and open a GitHub issue with code/data to reproduce?


Monireh2 commented on 2022-02-03T00:43:15Z
----------------------------------------------------------------

@frreiss: The error does make sense to me. Whenever the code gets an empty value, it substitutes pd.NA for the value and prints the above error. I can open an issue on that if you think the code should be changed.

Please see lines 229-231 here:

https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/io/watson/tables.py

except ValueError: 
  ans = pd.NA 
  print(f"ERROR READING VALUE:\"{val}\"\t Filling with <NA>")

Here the values for "Major markets", "Growth Markets" and "BRIC countries" are empty.
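
If the message should be kept only for genuinely malformed values, one option is to treat empty cells as missing without printing anything. The sketch below is not the code in tables.py. The helper name `parse_cell_value` is hypothetical, `None` stands in for `pd.NA` so the example needs no third-party imports, and reading parentheses as a negative number is an assumption based on accounting notation.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_cell_value(text: str) -> Optional[float]:
    """Convert a table cell string to a float, or None when it has no number.

    Hypothetical helper for illustration only. None stands in for pd.NA.
    """
    cleaned = text.strip()
    if not cleaned:
        # Empty cells are expected (for example group header rows),
        # so they become missing values without any message.
        return None
    # Assumption: "(48)" or "(3.5)%" is accounting notation for a negative.
    negative = cleaned.startswith("(") and cleaned.rstrip("%").endswith(")")
    number_text = cleaned
    for ch in "$,%()":
        number_text = number_text.replace(ch, "")
    try:
        value = float(number_text)
    except ValueError:
        # Only non-empty text that still fails to parse is reported.
        logger.warning("Could not parse cell value %r; treating it as missing", text)
        return None
    return -value if negative else value


for raw in ["", "$5,247", "(3.5)%", "NM"]:
    print(repr(raw), "->", parse_cell_value(raw))
```

Percentages come back as plain numbers (7.1 rather than 0.071), and a marker such as "NM" still produces a warning because it is non-empty text that does not parse.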


frreiss commented on 2022-02-01T00:39:52Z
----------------------------------------------------------------

Several incorrect values are present in this table: "of intellectual property", "Licensing/royalty-based fees", "Custom development income", "2009. The increase in total expense and other", "Examples of the company's investments include:", NaN, "Industry sales skills to support Smarter Planet".

These incorrect values most likely come from incorrect JSON input from Watson Discovery. Can you please trace these incorrect values back to the corresponding portions of the Watson Discovery output? If there is a bug in Discovery, we should submit a bug report. If there is a bug in our Text Extensions for Pandas code, it needs to be fixed.
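
One way to start that trace is to search the table objects for the suspicious strings and print the matching cell IDs and character offsets. The sketch below assumes each table is a dict with `row_headers`, `column_headers` and `body_cells` lists whose entries carry `text`, `cell_id` and `location`, which is the layout of the Discovery excerpt posted in this thread. The function name `find_cells` is hypothetical, and the sample data is a cut-down copy of two row headers from that excerpt.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Lists inside a table object that hold individual cells.
CELL_LISTS = ("row_headers", "column_headers", "body_cells")

Hit = Tuple[int, str, Any, Any]


def find_cells(tables: List[Dict[str, Any]], needle: str) -> Iterator[Hit]:
    """Yield (table index, list name, cell_id, location) for exact text matches."""
    for table_index, table in enumerate(tables):
        for list_name in CELL_LISTS:
            for cell in table.get(list_name, []):
                if cell.get("text") == needle:
                    yield table_index, list_name, cell.get("cell_id"), cell.get("location")


# Cut-down stand-in for the list of tables in the Discovery output.
tables = [
    {
        "row_headers": [
            {"text": "Sales and other transfers",
             "cell_id": "rowHeader-714949-714975",
             "location": {"begin": 714949, "end": 714975}},
            {"text": "of intellectual property",
             "cell_id": "rowHeader-715441-715466",
             "location": {"begin": 715441, "end": 715466}},
        ],
        "column_headers": [],
        "body_cells": [],
    }
]

for hit in find_cells(tables, "of intellectual property"):
    print(hit)
```

A hit in `row_headers` means the string is already a separate row header in the Discovery JSON, which would point toward the input rather than the DataFrame conversion. The `location` offsets can then be compared against the source document text.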


Monireh2 commented on 2022-02-03T01:36:03Z
----------------------------------------------------------------

{'section_title': {'location': {'end': 627943, 'begin': 627925},
   'text': 'Geographic Revenue'},
  'row_headers': [{'column_index_begin': 0,
    'row_index_begin': 0,
    'location': {'end': 703825, 'begin': 703796},
    'text': 'Total consolidated research,',
    'row_index_end': 0,
    'cell_id': 'rowHeader-703796-703825',
    'column_index_end': 0,
    'text_normalized': 'Total consolidated research,'},
   {'column_index_begin': 0,
    'row_index_begin': 1,
    'location': {'end': 704313, 'begin': 704285},
    'text': 'development and engineering',
    'row_index_end': 1,
    'cell_id': 'rowHeader-704285-704313',
    'column_index_end': 0,
    'text_normalized': 'development and engineering'},
   {'column_index_begin': 0,
    'row_index_begin': 2,
    'location': {'end': 705414, 'begin': 705389},
    'text': 'Non-operating adjustment',
    'row_index_end': 2,
    'cell_id': 'rowHeader-705389-705414',
    'column_index_end': 0,
    'text_normalized': 'Non-operating adjustment'},
   {'column_index_begin': 0,
    'row_index_begin': 3,
    'location': {'end': 705914, 'begin': 705881},
    'text': 'Non-operating retirement-related',
    'row_index_end': 3,
    'cell_id': 'rowHeader-705881-705914',
    'column_index_end': 0,
    'text_normalized': 'Non-operating retirement-related'},
   {'column_index_begin': 0,
    'row_index_begin': 4,
    'location': {'end': 706394, 'begin': 706379},
    'text': '(costs)/income',
    'row_index_end': 4,
    'cell_id': 'rowHeader-706379-706394',
    'column_index_end': 0,
    'text_normalized': '(costs)/income'},
   {'column_index_begin': 0,
    'row_index_begin': 5,
    'location': {'end': 707502, 'begin': 707471},
    'text': 'Operating (non-GAAP) research,',
    'row_index_end': 5,
    'cell_id': 'rowHeader-707471-707502',
    'column_index_end': 0,
    'text_normalized': 'Operating (non-GAAP) research,'},
   {'column_index_begin': 0,
    'row_index_begin': 6,
    'location': {'end': 707990, 'begin': 707962},
    'text': 'development and engineering',
    'row_index_end': 6,
    'cell_id': 'rowHeader-707962-707990',
    'column_index_end': 0,
    'text_normalized': 'development and engineering'}],
  'table_headers': [],
  'location': {'end': 708798, 'begin': 703796},
  'text': 'Total consolidated research,       development and engineering $5,247 $5,437 (3.5)%\nNon-operating adjustment\n      Non-operating retirement-related       (costs)/income (48) 77 NM\nOperating (non-GAAP) research,\n      development and engineering $5,200 $5,514 (5.7)%\n',
  'body_cells': [{'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 1,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 703903, 'begin': 703902},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-703902-703903'},
   {'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 2,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 703968, 'begin': 703967},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-703967-703968'},
   {'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 3,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704033, 'begin': 704032},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-704032-704033'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 1,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704581, 'begin': 704574},
    'attributes': [{'location': {'end': 704580, 'begin': 704574},
      'text': '$5,247',
      'type': 'Currency'}],
    'text': '$5,247',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-704574-704581'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 2,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704848, 'begin': 704841},
    'attributes': [{'location': {'end': 704847, 'begin': 704841},
      'text': '$5,437',
      'type': 'Currency'}],
    'text': '$5,437',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-704841-704848'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 3,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705118, 'begin': 705111},
    'attributes': [{'location': {'end': 705115, 'begin': 705112},
      'text': '3.5',
      'type': 'Number'}],
    'text': '(3.5)%',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-705111-705118'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 1,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705492, 'begin': 705491},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705491-705492'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 2,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705557, 'begin': 705556},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705556-705557'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 3,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705622, 'begin': 705621},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705621-705622'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 1,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705992, 'begin': 705991},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-705991-705992'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 2,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706057, 'begin': 706056},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-706056-706057'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 3,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706122, 'begin': 706121},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-706121-706122'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 1,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706662, 'begin': 706657},
    'attributes': [{'location': {'end': 706660, 'begin': 706658},
      'text': '48',
      'type': 'Number'}],
    'text': '(48)',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-706657-706662'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 2,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706929, 'begin': 706926},
    'attributes': [{'location': {'end': 706928, 'begin': 706926},
      'text': '77',
      'type': 'Number'}],
    'text': '77',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-706926-706929'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 3,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707196, 'begin': 707193},
    'attributes': [],
    'text': 'NM',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-707193-707196'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 1,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707580, 'begin': 707579},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707579-707580'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 2,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707645, 'begin': 707644},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707644-707645'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 3,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707710, 'begin': 707709},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707709-707710'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 1,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708259, 'begin': 708252},
    'attributes': [{'location': {'end': 708258, 'begin': 708252},
      'text': '$5,200',
      'type': 'Currency'}],
    'text': '$5,200',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708252-708259'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 2,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708527, 'begin': 708520},
    'attributes': [{'location': {'end': 708526, 'begin': 708520},
      'text': '$5,514',
      'type': 'Currency'}],
    'text': '$5,514',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708520-708527'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 3,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708798, 'begin': 708791},
    'attributes': [{'location': {'end': 708795, 'begin': 708792},
      'text': '5.7',
      'type': 'Number'}],
    'text': '(5.7)%',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708791-708798'}],
  'contexts': [{'location': {'end': 702649, 'begin': 702242},
    'text': 'For the year ended December 31: 2015 2014'},
   {'location': {'end': 702865, 'begin': 702855}, 'text': 'Yr.-to-Yr.'},
   {'location': {'end': 703240, 'begin': 703046}, 'text': 'Percent\nChange'},
   {'location': {'end': 709029, 'begin': 709012}, 'text': 'NM-Not meaningful'},
   {'location': {'end': 709287, 'begin': 709227},
    'text': 'Research, development and engineering (RD&E) expense was'},
   {'location': {'end': 709775, 'begin': 709521},
    'text': '6.4 percent of revenue in 2015 and 5.9 percent of revenue in 2014.'},
   {'location': {'end': 710189, 'begin': 709945},
    'text': 'RD&E expense decreased 3.5 percent in 2015 versus 2014 primarily driven by:'}],
  'key_value_pairs': [{'value': [{'location': {'end': 704580, 'begin': 704574},
      'text': '$5,247',
      'cell_id': 'bodyCell-704574-704581'}],
    'key': {'location': {'end': 704312, 'begin': 704285},
     'text': 'development and engineering',
     'cell_id': 'rowHeader-704285-704313'}},
   {'value': [{'location': {'end': 708258, 'begin': 708252},
      'text': '$5,200',
      'cell_id': 'bodyCell-708252-708259'}],
    'key': {'location': {'end': 707989, 'begin': 707962},
     'text': 'development and engineering',
     'cell_id': 'rowHeader-707962-707990'}}],
  'title': {},
  'column_headers': []},
 {'section_title': {'location': {'end': 627943, 'begin': 627925},
   'text': 'Geographic Revenue'},
  'row_headers': [{'column_index_begin': 0,
    'row_index_begin': 0,
    'location': {'end': 714975, 'begin': 714949},
    'text': 'Sales and other transfers',
    'row_index_end': 0,
    'cell_id': 'rowHeader-714949-714975',
    'column_index_end': 0,
    'text_normalized': 'Sales and other transfers'},
   {'column_index_begin': 0,
    'row_index_begin': 1,
    'location': {'end': 715466, 'begin': 715441},
    'text': 'of intellectual property',
    'row_index_end': 1,
    'cell_id': 'rowHeader-715441-715466',
    'column_index_end': 0,
    'text_normalized': 'of intellectual property'},
   {'column_index_begin': 0,
    'row_index_begin': 2,
    'location': {'end': 716567, 'begin': 716538},
    'text': 'Licensing/royalty-based fees',
    'row_index_end': 2,
    'cell_id': 'rowHeader-716538-716567',
    'column_index_end': 0,
    'text_normalized': 'Licensing/royalty-based fees'},
   {'column_index_begin': 0,
    'row_index_begin': 3,
    'location': {'end': 717670, 'begin': 717644},
    'text': 'Custom development income',
    'row_index_end': 3,
    'cell_id': 'rowHeader-717644-717670',
    'column_index_end': 0,
    'text_normalized': 'Custom development income'},
   {'column_index_begin': 0,
    'row_index_begin': 4,
    'location': {'end': 718753, 'begin': 718747},
    'text': 'Total',
    'row_index_end': 4,
    'cell_id': 'rowHeader-718747-718753',
    'column_index_end': 0,
    'text_normalized': 'Total'}],
  'table_headers': [],
  'location': {'end': 719555, 'begin': 714949},
  'text': 'Sales and other transfers       of intellectual property $303 $283 7.1%\nLicensing/royalty-based fees 117 129 (9.8)\nCustom development income 262 330 (20.5)\nTotal $682 $742 (8.1)%\n',
  'body_cells': [{'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 1,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715053, 'begin': 715052},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715052-715053'},
   {'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 2,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715118, 'begin': 715117},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715117-715118'},
   {'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 3,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715183, 'begin': 715182},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715182-715183'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 1,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715734, 'begin': 715729},
    'attributes': [{'location': {'end': 715733, 'begin': 715729},
      'text': '$303',
      'type': 'Currency'}],
    'text': '$303',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-715729-715734'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 2,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716001, 'begin': 715996},
    'attributes': [{'location': {'end': 716000, 'begin': 715996},
      'text': '$283',
      'type': 'Currency'}],
    'text': '$283',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-715996-716001'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 3,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716269, 'begin': 716264},
    'attributes': [{'location': {'end': 716268, 'begin': 716264},
      'text': '7.1%',
      'type': 'Percentage'}],
    'text': '7.1%',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-716264-716269'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 1,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716835, 'begin': 716831},
    'attributes': [{'location': {'end': 716834, 'begin': 716831},
      'text': '117',
      'type': 'Number'}],
    'text': '117',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-716831-716835'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 2,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717104, 'begin': 717100},
    'attributes': [{'location': {'end': 717103, 'begin': 717100},
      'text': '129',
      'type': 'Number'}],
    'text': '129',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-717100-717104'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 3,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717374, 'begin': 717368},
    'attributes': [{'location': {'end': 717372, 'begin': 717369},
      'text': '9.8',
      'type': 'Number'}],
    'text': '(9.8)',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-717368-717374'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 1,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717939, 'begin': 717935},
    'attributes': [{'location': {'end': 717938, 'begin': 717935},
      'text': '262',
      'type': 'Number'}],
    'text': '262',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-717935-717939'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 2,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 718208, 'begin': 718204},
    'attributes': [{'location': {'end': 718207, 'begin': 718204},
      'text': '330',
      'type': 'Number'}],
    'text': '330',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-718204-718208'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 3,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 718473, 'begin': 718466},
    'attributes': [{'location': {'end': 718471, 'begin': 718467},
      'text': '20.5',
      'type': 'Number'}],
    'text': '(20.5)',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-718466-718473'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 1,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719022, 'begin': 719017},
    'attributes': [{'location': {'end': 719021, 'begin': 719017},
      'text': '$682',
      'type': 'Currency'}],
    'text': '$682',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719017-719022'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 2,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719287, 'begin': 719282},
    'attributes': [{'location': {'end': 719286, 'begin': 719282},
      'text': '$742',
      'type': 'Currency'}],
    'text': '$742',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719282-719287'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 3,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719555, 'begin': 719548},
    'attributes': [{'location': {'end': 719552, 'begin': 719549},
      'text': '8.1',
      'type': 'Number'}],
    'text': '(8.1)%',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719548-719555'}],
  'contexts': [{'location': {'end': 713812, 'begin': 713406},
    'text': 'For the year ended December 31: 2015 2014'},
   {'location': {'end': 714027, 'begin': 714017}, 'text': 'Yr.-to-Yr.'},
   {'location': {'end': 714400, 'begin': 714207}, 'text': 'Percent Change'},
   {'location': {'end': 720499, 'begin': 719763},
    'text': 'The timing and amount of Sales and other transfers of IP may vary significantly from period to period depending upon the timing of divestitures, economic conditions, industry consolidation and the timing of new patents and know-how development.'},
   {'location': {'end': 720730, 'begin': 720500},
    'text': 'There were no material individual IP transactions in 2015 or 2014.'},
   {'location': {'end': 720953, 'begin': 720927},
    'text': 'Other (Income) and Expense'}],
  'key_value_pairs': [{'value': [{'location': {'end': 719021, 'begin': 719017},
      'text': '$682',
      'cell_id': 'bodyCell-719017-719022'}],
    'key': {'location': {'end': 718752, 'begin': 718747},
     'text': 'Total',
     'cell_id': 'rowHeader-718747-718753'}}],
  'title': {},
  'column_headers': []},

If you look at the JSON, you can see that it also covers other tables under the Geographic Revenue section title; please check pages 56-57 of IBM_Annual_Report_2015. I would say this is an error in neither Text Extensions for Pandas nor WD.
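
One way to see which tables share a section title is to group the parsed table dicts by `section_title` and list the first row header of each. This is only a sketch: the helper name `first_row_headers_by_section` is hypothetical, and the two sample dicts keep only the fields needed here, with values taken from the output above.

```python
from collections import defaultdict

def first_row_headers_by_section(tables):
    """Map each section title to the first row header of every table under it."""
    grouped = defaultdict(list)
    for table in tables:
        # 'section_title' may be an empty dict when Watson Discovery found no title.
        title = table.get("section_title", {}).get("text", "")
        headers = table.get("row_headers", [])
        grouped[title].append(headers[0]["text"] if headers else "")
    return dict(grouped)

# Trimmed-down versions of two tables from the JSON output above.
tables = [
    {"section_title": {"text": "Geographic Revenue"},
     "row_headers": [{"text": "Total consolidated research,"}]},
    {"section_title": {"text": "Geographic Revenue"},
     "row_headers": [{"text": "Sales and other transfers"}]},
]
by_section = first_row_headers_by_section(tables)
```

Both tables land under the same "Geographic Revenue" key even though neither is a geographic revenue table, which matches what the JSON shows.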

@review-notebook-app

review-notebook-app bot commented Feb 1, 2022

View / edit / reply to this conversation on ReviewNB

frreiss commented on 2022-02-01T00:39:53Z
----------------------------------------------------------------

This table contains more duplicates than it did before. Why is that happening? Is the latest version of Watson Discovery returning multiple copies of the same table?


Monireh2 commented on 2022-02-03T21:25:50Z
----------------------------------------------------------------

I checked this carefully. For 2012-2011, for example, we have two geographic revenue tables, and the same holds for 2011-2010, so we end up with 4 values for America for 2011. You can search for 44,944 to validate this. In IBM_Annual_Report_2012.pdf I can see two Geographic Revenue tables, one for 2012-2011 and another for 2011-2010, which explains why we have 4 values for each region for each year. Watson Discovery has listed both tables for each document correctly.
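
A quick way to check this kind of repetition is to count rows per region and year, and then per source document. The sketch below uses illustrative rows only: the column names, the second document name, and the two-rows-per-document layout are assumptions that mirror the explanation above, not the notebook's actual DataFrame.

```python
import pandas as pd

# Illustrative rows: the same 2011 value for one region, reported by two
# tables in each of two documents (assumed layout, see lead-in).
rows = pd.DataFrame({
    "document": ["IBM_Annual_Report_2012.pdf"] * 2
                + ["IBM_Annual_Report_2011.pdf"] * 2,
    "region": ["America"] * 4,
    "year": [2011] * 4,
    "value": ["44,944"] * 4,
})

# Number of values per region and year across all documents.
counts = rows.groupby(["region", "year"]).size()

# The same count broken down by document shows where the copies come from.
per_doc = rows.groupby(["region", "year", "document"]).size()
```

If `per_doc` shows two rows per document, the repeats come from two tables in the same report rather than from Watson Discovery returning one table twice.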

@review-notebook-app

review-notebook-app bot commented Feb 1, 2022

View / edit / reply to this conversation on ReviewNB

frreiss commented on 2022-02-01T00:39:54Z
----------------------------------------------------------------

Data from 2018 and 2019 is no longer here. What happened to it?


Monireh2 commented on 2022-02-04T00:36:56Z
----------------------------------------------------------------

Checking why the data from 2019.pdf has not been processed; I can see that WD has returned some results for 2019.pdf!

Monireh2 commented on 2022-02-10T23:12:12Z
----------------------------------------------------------------

The column_header_texts for the 2018-2019 table is empty in Watson Discovery's JSON output, which is why we were not retaining the rows for the 2019-2018 table:

"column_header_texts": [
        "",
        "",
        "",
        "",
        ""
      ],

	text	row_header_texts_0	column_header_texts	attributes.type	value
0	2019	For the year ended December 31:		[DateTime]	2019
1	2018	For the year ended December 31:		[Number]	2018

Monireh2 commented on 2022-02-10T23:14:39Z
----------------------------------------------------------------

I am changing the retaining condition, or copying the value from the text column into column_header_texts when the text matches the \d{4} regex pattern, so that the 2018-2019 info is included as well.
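
A minimal sketch of the copy-from-text option, assuming a DataFrame with `text` and `column_header_texts` columns like the excerpt above. The helper name `fill_year_headers` is hypothetical and is not part of Text Extensions for Pandas.

```python
import pandas as pd

def fill_year_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Copy four-digit `text` values into empty `column_header_texts` cells."""
    out = df.copy()
    # A cell qualifies when its text is exactly four digits, e.g. "2019".
    is_year = out["text"].astype(str).str.fullmatch(r"\d{4}")
    # Only fill cells for which no usable column header was returned.
    no_header = out["column_header_texts"].fillna("").astype(str).str.strip() == ""
    mask = is_year & no_header
    out.loc[mask, "column_header_texts"] = out.loc[mask, "text"]
    return out

cells = pd.DataFrame({
    "text": ["2019", "2018", "$682"],
    "column_header_texts": ["", "", ""],
})
fixed = fill_year_headers(cells)
```

Rows whose text is not a bare four-digit string (such as "$682") keep their empty header, so the retaining condition can still filter them out.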

Monireh2 commented on 2022-02-11T18:02:11Z
----------------------------------------------------------------

Created an issue with the Discovery team: https://github.ibm.com/Watson-Discovery/disco-issue-tracker/issues/10974

Author

Monireh2 commented Feb 2, 2022

The link was working on my local machine and here in ReviewNB for me when I was clicking. I think it was not working because of the new line in the start of the url link. Just fixed it. Thanks for pointing that out.


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 2, 2022

fixed, thanks!


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 2, 2022

Thanks for the pointer, Fred (@frreiss). I actually tried that, but it gives me a black screen with an inactive play button. The only way I could resolve the issue was to use the Python snippet above. I will fix your other comments.


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 2, 2022

done!


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 3, 2022

That is weird. It renders for me on my local machine.


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 3, 2022

Resolved. It had changed in the final run!


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 3, 2022

@frreiss: The error does make sense to me. Whenever an empty value is encountered, the code substitutes pd.NA for it and prints the above error. I can open an issue on that if you think the code should be changed:

Please see lines 229-231 here:

https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/io/watson/tables.py

except ValueError:
    ans = pd.NA
    print(f"ERROR READING VALUE:\"{val}\"\t Filling with <NA>")
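
The fallback quoted above can be reproduced in isolation. This is a simplified sketch and not the code in tables.py: the function name `to_number` and the float parsing step are assumptions made for illustration; only the `except` branch follows the quoted lines.

```python
import pandas as pd

def to_number(val: str):
    """Parse a cell value, falling back to pd.NA when it cannot be read."""
    try:
        # Assumed parsing step: drop thousands separators, then convert.
        ans = float(val.replace(",", ""))
    except ValueError:
        # An empty string (or text such as "NM") cannot be parsed as a number.
        ans = pd.NA
        print(f"ERROR READING VALUE:\"{val}\"\t Filling with <NA>")
    return ans

values = [to_number(v) for v in ["5,247", "", "NM"]]
```

With this shape, every empty body cell in the table produces one error line, which would explain the messages seen when a table has blank cells in its header-like rows.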


View entire conversation on ReviewNB

Author

Monireh2 commented Feb 3, 2022

{'section_title': {'location': {'end': 627943, 'begin': 627925},
   'text': 'Geographic Revenue'},
  'row_headers': [{'column_index_begin': 0,
    'row_index_begin': 0,
    'location': {'end': 703825, 'begin': 703796},
    'text': 'Total consolidated research,',
    'row_index_end': 0,
    'cell_id': 'rowHeader-703796-703825',
    'column_index_end': 0,
    'text_normalized': 'Total consolidated research,'},
   {'column_index_begin': 0,
    'row_index_begin': 1,
    'location': {'end': 704313, 'begin': 704285},
    'text': 'development and engineering',
    'row_index_end': 1,
    'cell_id': 'rowHeader-704285-704313',
    'column_index_end': 0,
    'text_normalized': 'development and engineering'},
   {'column_index_begin': 0,
    'row_index_begin': 2,
    'location': {'end': 705414, 'begin': 705389},
    'text': 'Non-operating adjustment',
    'row_index_end': 2,
    'cell_id': 'rowHeader-705389-705414',
    'column_index_end': 0,
    'text_normalized': 'Non-operating adjustment'},
   {'column_index_begin': 0,
    'row_index_begin': 3,
    'location': {'end': 705914, 'begin': 705881},
    'text': 'Non-operating retirement-related',
    'row_index_end': 3,
    'cell_id': 'rowHeader-705881-705914',
    'column_index_end': 0,
    'text_normalized': 'Non-operating retirement-related'},
   {'column_index_begin': 0,
    'row_index_begin': 4,
    'location': {'end': 706394, 'begin': 706379},
    'text': '(costs)/income',
    'row_index_end': 4,
    'cell_id': 'rowHeader-706379-706394',
    'column_index_end': 0,
    'text_normalized': '(costs)/income'},
   {'column_index_begin': 0,
    'row_index_begin': 5,
    'location': {'end': 707502, 'begin': 707471},
    'text': 'Operating (non-GAAP) research,',
    'row_index_end': 5,
    'cell_id': 'rowHeader-707471-707502',
    'column_index_end': 0,
    'text_normalized': 'Operating (non-GAAP) research,'},
   {'column_index_begin': 0,
    'row_index_begin': 6,
    'location': {'end': 707990, 'begin': 707962},
    'text': 'development and engineering',
    'row_index_end': 6,
    'cell_id': 'rowHeader-707962-707990',
    'column_index_end': 0,
    'text_normalized': 'development and engineering'}],
  'table_headers': [],
  'location': {'end': 708798, 'begin': 703796},
  'text': 'Total consolidated research,       development and engineering $5,247 $5,437 (3.5)%\nNon-operating adjustment\n      Non-operating retirement-related       (costs)/income (48) 77 NM\nOperating (non-GAAP) research,\n      development and engineering $5,200 $5,514 (5.7)%\n',
  'body_cells': [{'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 1,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 703903, 'begin': 703902},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-703902-703903'},
   {'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 2,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 703968, 'begin': 703967},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-703967-703968'},
   {'row_header_ids': ['rowHeader-703796-703825'],
    'column_index_begin': 3,
    'row_index_begin': 0,
    'row_header_texts': ['Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704033, 'begin': 704032},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Total consolidated research,'],
    'cell_id': 'bodyCell-704032-704033'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 1,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704581, 'begin': 704574},
    'attributes': [{'location': {'end': 704580, 'begin': 704574},
      'text': '$5,247',
      'type': 'Currency'}],
    'text': '$5,247',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-704574-704581'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 2,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 704848, 'begin': 704841},
    'attributes': [{'location': {'end': 704847, 'begin': 704841},
      'text': '$5,437',
      'type': 'Currency'}],
    'text': '$5,437',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-704841-704848'},
   {'row_header_ids': ['rowHeader-704285-704313', 'rowHeader-703796-703825'],
    'column_index_begin': 3,
    'row_index_begin': 1,
    'row_header_texts': ['development and engineering',
     'Total consolidated research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705118, 'begin': 705111},
    'attributes': [{'location': {'end': 705115, 'begin': 705112},
      'text': '3.5',
      'type': 'Number'}],
    'text': '(3.5)%',
    'row_index_end': 1,
    'row_header_texts_normalized': ['development and engineering',
     'Total consolidated research,'],
    'cell_id': 'bodyCell-705111-705118'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 1,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705492, 'begin': 705491},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705491-705492'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 2,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705557, 'begin': 705556},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705556-705557'},
   {'row_header_ids': ['rowHeader-705389-705414'],
    'column_index_begin': 3,
    'row_index_begin': 2,
    'row_header_texts': ['Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705622, 'begin': 705621},
    'attributes': [],
    'text': '',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Non-operating adjustment'],
    'cell_id': 'bodyCell-705621-705622'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 1,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 705992, 'begin': 705991},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-705991-705992'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 2,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706057, 'begin': 706056},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-706056-706057'},
   {'row_header_ids': ['rowHeader-705881-705914', 'rowHeader-705389-705414'],
    'column_index_begin': 3,
    'row_index_begin': 3,
    'row_header_texts': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706122, 'begin': 706121},
    'attributes': [],
    'text': '',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Non-operating retirement-related',
     'Non-operating adjustment'],
    'cell_id': 'bodyCell-706121-706122'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 1,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706662, 'begin': 706657},
    'attributes': [{'location': {'end': 706660, 'begin': 706658},
      'text': '48',
      'type': 'Number'}],
    'text': '(48)',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-706657-706662'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 2,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 706929, 'begin': 706926},
    'attributes': [{'location': {'end': 706928, 'begin': 706926},
      'text': '77',
      'type': 'Number'}],
    'text': '77',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-706926-706929'},
   {'row_header_ids': ['rowHeader-706379-706394',
     'rowHeader-705389-705414',
     'rowHeader-705881-705914'],
    'column_index_begin': 3,
    'row_index_begin': 4,
    'row_header_texts': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707196, 'begin': 707193},
    'attributes': [],
    'text': 'NM',
    'row_index_end': 4,
    'row_header_texts_normalized': ['(costs)/income',
     'Non-operating adjustment',
     'Non-operating retirement-related'],
    'cell_id': 'bodyCell-707193-707196'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 1,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707580, 'begin': 707579},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707579-707580'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 2,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707645, 'begin': 707644},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707644-707645'},
   {'row_header_ids': ['rowHeader-707471-707502'],
    'column_index_begin': 3,
    'row_index_begin': 5,
    'row_header_texts': ['Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 707710, 'begin': 707709},
    'attributes': [],
    'text': '',
    'row_index_end': 5,
    'row_header_texts_normalized': ['Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-707709-707710'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 1,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708259, 'begin': 708252},
    'attributes': [{'location': {'end': 708258, 'begin': 708252},
      'text': '$5,200',
      'type': 'Currency'}],
    'text': '$5,200',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708252-708259'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 2,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708527, 'begin': 708520},
    'attributes': [{'location': {'end': 708526, 'begin': 708520},
      'text': '$5,514',
      'type': 'Currency'}],
    'text': '$5,514',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708520-708527'},
   {'row_header_ids': ['rowHeader-707962-707990', 'rowHeader-707471-707502'],
    'column_index_begin': 3,
    'row_index_begin': 6,
    'row_header_texts': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 708798, 'begin': 708791},
    'attributes': [{'location': {'end': 708795, 'begin': 708792},
      'text': '5.7',
      'type': 'Number'}],
    'text': '(5.7)%',
    'row_index_end': 6,
    'row_header_texts_normalized': ['development and engineering',
     'Operating (non-GAAP) research,'],
    'cell_id': 'bodyCell-708791-708798'}],
  'contexts': [{'location': {'end': 702649, 'begin': 702242},
    'text': 'For the year ended December 31: 2015 2014'},
   {'location': {'end': 702865, 'begin': 702855}, 'text': 'Yr.-to-Yr.'},
   {'location': {'end': 703240, 'begin': 703046}, 'text': 'Percent\nChange'},
   {'location': {'end': 709029, 'begin': 709012}, 'text': 'NM-Not meaningful'},
   {'location': {'end': 709287, 'begin': 709227},
    'text': 'Research, development and engineering (RD&E) expense was'},
   {'location': {'end': 709775, 'begin': 709521},
    'text': '6.4 percent of revenue in 2015 and 5.9 percent of revenue in 2014.'},
   {'location': {'end': 710189, 'begin': 709945},
    'text': 'RD&E expense decreased 3.5 percent in 2015 versus 2014 primarily driven by:'}],
  'key_value_pairs': [{'value': [{'location': {'end': 704580, 'begin': 704574},
      'text': '$5,247',
      'cell_id': 'bodyCell-704574-704581'}],
    'key': {'location': {'end': 704312, 'begin': 704285},
     'text': 'development and engineering',
     'cell_id': 'rowHeader-704285-704313'}},
   {'value': [{'location': {'end': 708258, 'begin': 708252},
      'text': '$5,200',
      'cell_id': 'bodyCell-708252-708259'}],
    'key': {'location': {'end': 707989, 'begin': 707962},
     'text': 'development and engineering',
     'cell_id': 'rowHeader-707962-707990'}}],
  'title': {},
  'column_headers': []},
 {'section_title': {'location': {'end': 627943, 'begin': 627925},
   'text': 'Geographic Revenue'},
  'row_headers': [{'column_index_begin': 0,
    'row_index_begin': 0,
    'location': {'end': 714975, 'begin': 714949},
    'text': 'Sales and other transfers',
    'row_index_end': 0,
    'cell_id': 'rowHeader-714949-714975',
    'column_index_end': 0,
    'text_normalized': 'Sales and other transfers'},
   {'column_index_begin': 0,
    'row_index_begin': 1,
    'location': {'end': 715466, 'begin': 715441},
    'text': 'of intellectual property',
    'row_index_end': 1,
    'cell_id': 'rowHeader-715441-715466',
    'column_index_end': 0,
    'text_normalized': 'of intellectual property'},
   {'column_index_begin': 0,
    'row_index_begin': 2,
    'location': {'end': 716567, 'begin': 716538},
    'text': 'Licensing/royalty-based fees',
    'row_index_end': 2,
    'cell_id': 'rowHeader-716538-716567',
    'column_index_end': 0,
    'text_normalized': 'Licensing/royalty-based fees'},
   {'column_index_begin': 0,
    'row_index_begin': 3,
    'location': {'end': 717670, 'begin': 717644},
    'text': 'Custom development income',
    'row_index_end': 3,
    'cell_id': 'rowHeader-717644-717670',
    'column_index_end': 0,
    'text_normalized': 'Custom development income'},
   {'column_index_begin': 0,
    'row_index_begin': 4,
    'location': {'end': 718753, 'begin': 718747},
    'text': 'Total',
    'row_index_end': 4,
    'cell_id': 'rowHeader-718747-718753',
    'column_index_end': 0,
    'text_normalized': 'Total'}],
  'table_headers': [],
  'location': {'end': 719555, 'begin': 714949},
  'text': 'Sales and other transfers       of intellectual property $303 $283 7.1%\nLicensing/royalty-based fees 117 129 (9.8)\nCustom development income 262 330 (20.5)\nTotal $682 $742 (8.1)%\n',
  'body_cells': [{'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 1,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715053, 'begin': 715052},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715052-715053'},
   {'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 2,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715118, 'begin': 715117},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715117-715118'},
   {'row_header_ids': ['rowHeader-714949-714975'],
    'column_index_begin': 3,
    'row_index_begin': 0,
    'row_header_texts': ['Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715183, 'begin': 715182},
    'attributes': [],
    'text': '',
    'row_index_end': 0,
    'row_header_texts_normalized': ['Sales and other transfers'],
    'cell_id': 'bodyCell-715182-715183'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 1,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 715734, 'begin': 715729},
    'attributes': [{'location': {'end': 715733, 'begin': 715729},
      'text': '$303',
      'type': 'Currency'}],
    'text': '$303',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-715729-715734'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 2,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716001, 'begin': 715996},
    'attributes': [{'location': {'end': 716000, 'begin': 715996},
      'text': '$283',
      'type': 'Currency'}],
    'text': '$283',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-715996-716001'},
   {'row_header_ids': ['rowHeader-715441-715466', 'rowHeader-714949-714975'],
    'column_index_begin': 3,
    'row_index_begin': 1,
    'row_header_texts': ['of intellectual property',
     'Sales and other transfers'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716269, 'begin': 716264},
    'attributes': [{'location': {'end': 716268, 'begin': 716264},
      'text': '7.1%',
      'type': 'Percentage'}],
    'text': '7.1%',
    'row_index_end': 1,
    'row_header_texts_normalized': ['of intellectual property',
     'Sales and other transfers'],
    'cell_id': 'bodyCell-716264-716269'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 1,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 716835, 'begin': 716831},
    'attributes': [{'location': {'end': 716834, 'begin': 716831},
      'text': '117',
      'type': 'Number'}],
    'text': '117',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-716831-716835'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 2,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717104, 'begin': 717100},
    'attributes': [{'location': {'end': 717103, 'begin': 717100},
      'text': '129',
      'type': 'Number'}],
    'text': '129',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-717100-717104'},
   {'row_header_ids': ['rowHeader-716538-716567'],
    'column_index_begin': 3,
    'row_index_begin': 2,
    'row_header_texts': ['Licensing/royalty-based fees'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717374, 'begin': 717368},
    'attributes': [{'location': {'end': 717372, 'begin': 717369},
      'text': '9.8',
      'type': 'Number'}],
    'text': '(9.8)',
    'row_index_end': 2,
    'row_header_texts_normalized': ['Licensing/royalty-based fees'],
    'cell_id': 'bodyCell-717368-717374'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 1,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 717939, 'begin': 717935},
    'attributes': [{'location': {'end': 717938, 'begin': 717935},
      'text': '262',
      'type': 'Number'}],
    'text': '262',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-717935-717939'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 2,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 718208, 'begin': 718204},
    'attributes': [{'location': {'end': 718207, 'begin': 718204},
      'text': '330',
      'type': 'Number'}],
    'text': '330',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-718204-718208'},
   {'row_header_ids': ['rowHeader-717644-717670'],
    'column_index_begin': 3,
    'row_index_begin': 3,
    'row_header_texts': ['Custom development income'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 718473, 'begin': 718466},
    'attributes': [{'location': {'end': 718471, 'begin': 718467},
      'text': '20.5',
      'type': 'Number'}],
    'text': '(20.5)',
    'row_index_end': 3,
    'row_header_texts_normalized': ['Custom development income'],
    'cell_id': 'bodyCell-718466-718473'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 1,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 1,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719022, 'begin': 719017},
    'attributes': [{'location': {'end': 719021, 'begin': 719017},
      'text': '$682',
      'type': 'Currency'}],
    'text': '$682',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719017-719022'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 2,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 2,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719287, 'begin': 719282},
    'attributes': [{'location': {'end': 719286, 'begin': 719282},
      'text': '$742',
      'type': 'Currency'}],
    'text': '$742',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719282-719287'},
   {'row_header_ids': ['rowHeader-718747-718753'],
    'column_index_begin': 3,
    'row_index_begin': 4,
    'row_header_texts': ['Total'],
    'column_header_texts': [],
    'column_index_end': 3,
    'column_header_ids': [],
    'column_header_texts_normalized': [],
    'location': {'end': 719555, 'begin': 719548},
    'attributes': [{'location': {'end': 719552, 'begin': 719549},
      'text': '8.1',
      'type': 'Number'}],
    'text': '(8.1)%',
    'row_index_end': 4,
    'row_header_texts_normalized': ['Total'],
    'cell_id': 'bodyCell-719548-719555'}],
  'contexts': [{'location': {'end': 713812, 'begin': 713406},
    'text': 'For the year ended December 31: 2015 2014'},
   {'location': {'end': 714027, 'begin': 714017}, 'text': 'Yr.-to-Yr.'},
   {'location': {'end': 714400, 'begin': 714207}, 'text': 'Percent Change'},
   {'location': {'end': 720499, 'begin': 719763},
    'text': 'The timing and amount of Sales and other transfers of IP may vary significantly from period to period depending upon the timing of divestitures, economic conditions, industry consolidation and the timing of new patents and know-how development.'},
   {'location': {'end': 720730, 'begin': 720500},
    'text': 'There were no material individual IP transactions in 2015 or 2014.'},
   {'location': {'end': 720953, 'begin': 720927},
    'text': 'Other (Income) and Expense'}],
  'key_value_pairs': [{'value': [{'location': {'end': 719021, 'begin': 719017},
      'text': '$682',
      'cell_id': 'bodyCell-719017-719022'}],
    'key': {'location': {'end': 718752, 'begin': 718747},
     'text': 'Total',
     'cell_id': 'rowHeader-718747-718753'}}],
  'title': {},
  'column_headers': []},

If you look at the JSON, you can see that it covers multiple tables under the Geographic Revenue section; please check pages 41-43 in IBM_Annual_Report_2016.
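To make the JSON above easier to inspect, here is a minimal sketch that flattens one table entry into one row per body cell. The sample cells are copied from the output above; the helper name `flatten_body_cells` is hypothetical, and the `body_cells` key is an assumption about the name of the enclosing list, which is not visible in this excerpt.

```python
def flatten_body_cells(table):
    """Return one flat dict per body cell of a single table entry."""
    rows = []
    # "body_cells" is assumed to be the key that holds the list of cells.
    for cell in table.get("body_cells", []):
        rows.append({
            # Nested row headers are joined in the order the service returns them.
            "row_header": " | ".join(cell.get("row_header_texts", [])),
            "row": cell.get("row_index_begin"),
            "column": cell.get("column_index_begin"),
            "text": cell.get("text", ""),
            "types": [attr.get("type") for attr in cell.get("attributes", [])],
        })
    return rows


# Two cells taken from the JSON output above (fields trimmed for brevity).
sample_table = {
    "body_cells": [
        {
            "row_header_texts": ["of intellectual property",
                                 "Sales and other transfers"],
            "column_index_begin": 2,
            "row_index_begin": 1,
            "attributes": [{"text": "$283", "type": "Currency"}],
            "text": "$283",
        },
        {
            "row_header_texts": ["Total"],
            "column_index_begin": 1,
            "row_index_begin": 4,
            "attributes": [{"text": "$682", "type": "Currency"}],
            "text": "$682",
        },
    ]
}

flat_rows = flatten_body_cells(sample_table)
for row in flat_rows:
    print(row)
```

Listing the cells this way makes it easier to see which row headers and column indexes repeat when several tables are returned for the same section.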


Author

Monireh2 commented Feb 3, 2022

I checked this carefully. For example, there are two Geographic Revenue tables for 2012-2011, and the same is true for 2011-2010, so we end up with 4 values for America for 2011. You can validate this by searching for 44,944. In IBM_Annual_Report_2012.pdf I can see two Geographic Revenue tables, one for 2012-2011 and another for 2011-2010, which explains why we have 4 values for each region for each year. Watson Discovery has listed both tables for each document correctly.


Author

Monireh2 commented Feb 4, 2022

Checking why the data from 2019.pdf has not been processed; I can see that some results have been returned by WD for 2019.pdf.


Author

The column_header_texts for the 2018-2019 table is empty in Watson Discovery's JSON output, which is why we were not retaining the rows for the 2019-2018 table:

"column_header_texts": [
        "",
        "",
        "",
        "",
        ""
      ],

   text  row_header_texts_0               column_header_texts  attributes.type  value
0  2019  For the year ended December 31:                       [DateTime]       2019
1  2018  For the year ended December 31:                       [Number]         2018
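A quick way to spot this situation is to check whether every entry in `column_header_texts` is blank. The sketch below is only an illustration: `has_empty_column_headers` is a hypothetical helper, and the non-empty header value in the second check is made up for contrast.

```python
def has_empty_column_headers(cell):
    """True when a cell has no column headers or only blank ones."""
    headers = cell.get("column_header_texts", [])
    # all() over an empty list is True, so a missing list also counts as empty.
    return all(not header.strip() for header in headers)


# Shape reported above for the 2018-2019 table: five blank header strings.
blank_cell = {"column_header_texts": ["", "", "", "", ""]}
# Hypothetical cell with a usable header, for contrast.
labelled_cell = {"column_header_texts": ["2015"]}

print(has_empty_column_headers(blank_cell))
print(has_empty_column_headers(labelled_cell))
```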


Author

I am changing the retaining condition, or copying the value from the text column into column_header_texts when the text matches the \d{4} regex pattern, so that the 2018-2019 info is included as well.
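A minimal sketch of that fallback, assuming each row is a plain dict with the column names shown in the table above and that the pattern is meant to match a four-digit year (`\d{4}`). The helper `fill_year_header` is hypothetical and is not the notebook's actual code.

```python
import re

# Assumption: the intended pattern is exactly four digits, for example "2019".
YEAR_PATTERN = re.compile(r"\d{4}")


def fill_year_header(row):
    """Copy a year-like `text` into an empty `column_header_texts` field."""
    header = row.get("column_header_texts", "")
    text = row.get("text", "").strip()
    if not header.strip() and YEAR_PATTERN.fullmatch(text):
        # Return a copy so the original row is left untouched.
        return {**row, "column_header_texts": text}
    return row


# Rows mirroring the two records shown above.
rows = [
    {"text": "2019", "row_header_texts_0": "For the year ended December 31:",
     "column_header_texts": "", "value": "2019"},
    {"text": "2018", "row_header_texts_0": "For the year ended December 31:",
     "column_header_texts": "", "value": "2018"},
]

fixed_rows = [fill_year_header(row) for row in rows]
for row in fixed_rows:
    print(row["text"], "->", repr(row["column_header_texts"]))
```

Using `fullmatch` keeps the fallback narrow: only cells whose whole text is a four-digit number are treated as year headers, so values such as `$682` are left alone.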


Author

Created an issue with the Discovery team: https://github.ibm.com/Watson-Discovery/disco-issue-tracker/issues/10974

