New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Process to disambiguate author affiliations using gpt-4 #40

Open

dtgupta wants to merge 2 commits into dspinellis:main from dtgupta:aa-process-llm

Contributor

dtgupta commented Jan 23, 2024

This process adds a different dimension to author affiliation disambiguation. It uses NLP to extract affiliations from text. This process produces better results than the currently implemented matching strategy. It has a matching rate of 81.24% as compared to 36.73% of the currently implemented algorithm. It also performs much better to match authors to multiple affiliations. This process adds value to the project as it helps researchers with more accurate affiliation results.


          process to disambiguate author affiliations using gpt-4

88d2a1e

dspinellis requested changes

View reviewed changes

Owner

dspinellis left a comment

This is very interesting work; well done! I added some comments as a first approximation to code that can be merged. Please also add tests for each process. Thank you again for the PR.

src/alexandria3k/processes/distinguish_affiliations.py

		@@ -0,0 +1,172 @@
		""""This process is used to distinguish the affiliations mentioned in the crossref dataset."""

Owner

dspinellis Jan 23, 2024

Please start with a license comment, identifying you as the contributor.

Contributor Author

dtgupta Jan 23, 2024

Is there a specific approach to writing a license or can I copy the license comment from the other link_aa_base_ror.py file?

Owner

dspinellis Jan 23, 2024

You shouldn't invent new licenses 😃 Nor is it a good practice to mix different ones. So just copy-paste the existing text, replacing your name and setting the year to 2024.

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+                  """
+                  This process is used to distinguish the affiliations mentioned in the crossref dataset. It
+                  uses the GPT-4 model to extract the affiliation and city from the affiliation text and
+                  match it to the ROR database based on levenshtein distance.

Owner

dspinellis Jan 23, 2024

Levenschtein

Contributor Author

dtgupta Jan 23, 2024

The library used to calculate this distance has the spelling mentioned in the comment (without the 'c'). Should I still change it?

Owner

dspinellis Jan 23, 2024

The c was my mistake, sorry. The comment was for you to capitalize it here, because it's proper name.

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+              def process(database_path):
+                  """
+                  This process is used to distinguish the affiliations mentioned in the crossref dataset. It

Owner

dspinellis Jan 23, 2024

Please model the comment after the existing one in link_aa_base_ror.py. It's not bad to copy-paste in this case. Be very clear regarding which ROR level you're linking to.

Consider changing the existing link_aa comments so as to clarify which method each process is using.

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+                      if not mentioned_name:
+                          continue
+                      # Prompt for the GPT-4 model to extract the affiliation and city from the affiliation text
+                      prompt = (

Owner

dspinellis Jan 23, 2024

Place the prompt in a constant at the beginning of the file as a multi-line string.

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+                  ensure_table_exists(database, "research_organizations")
+                  select_cursor = database.cursor()
+                  select_cursor_2 = database.cursor()

Owner

dspinellis Jan 23, 2024

Please use more descriptive names.

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+              def find_best_ror(gpt_org, select_cursor_2):
+                  """
+                  This function is used to find the best affiliation match based on levenshtein distance.

Owner

dspinellis Jan 23, 2024

Do not start your comments with "This function…" Just write in imperative voice what the function does. (In all functions.)

src/alexandria3k/processes/distinguish_affiliations.py Outdated

+                  try:
+                      # Extract the affiliation and city from the provided textual affiliation
+                      completion = client.chat.completions.create(
+                          model="gpt-4-1106-preview",

Owner

dspinellis Jan 23, 2024

Please put the model in a constant at the top-level of the file.

src/alexandria3k/processes/link_aa_llm.py Outdated

+                  ) in select_cursor.execute(
+                      """
+                      SELECT id, mentioned_name, gpt_name, city, ror_id FROM distinct_affiliations
+                      WHERE ror_id != 20233

Owner

dspinellis Jan 23, 2024

What is this number? Please document it and use it as a constant.

Contributor Author

dtgupta Jan 24, 2024

This ror_id 20233 corresponds to the research organization 2B based in Italy. Some organizations cannot be identified by gpt-4 (returns an empty string in response). Levenshtein distance comparison between empty string "" and 2B assigned empty strings to this ror_id. We filter out the organizations in this category to make fewer comparisons.

Owner

dspinellis Jan 24, 2024

Thank you for the explanation. This approach sounds very brittle. Shouldn't we address the general case of empty strings?


          remove trailing whitespace

36047b9

dspinellis force-pushed the main branch 2 times, most recently from b4eb879 to bd775b6 Compare

February 1, 2024 10:18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment