(getTextContent()): inaccurate spacing on current version vs older (v2.0.550) #17839

Open
ryparker opened this issue Mar 25, 2024 · 0 comments

I've noticed that older versions of pdf.js produce more accurate spacing for the PDFs I'm parsing. I only use pdf.js to extract text. I'd like to use the latest version, but its spacing is too inaccurate for me to fix in post-processing. Are there any options that would let me fine-tune the spacing?

Things I've tried:

  • .getDocument({... , disableFontFace: true})
  • .getTextContent({disableNormalization: true})

Attach (recommended) or Link to PDF file here: congressional-daily-record-170-13.pdf

Configuration:

  • Web browser and its version: Only extracting text in Node.js env.
  • Operating system and its version: macOS Version 14.4
  • PDF.js version: 4.0.379
  • Is a browser extension: No

Steps to reproduce the problem:

  1. Extract the text using v4.0.379 and v2.0.550. For example:
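import * as fs from 'fs';
import * as path from 'path';
import { fileURLToPath } from 'url';

// The top-level await below requires an ES module, where __dirname is not defined, so derive it.
const __dirname = path.dirname(fileURLToPath(import.meta.url));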

const pdfjsPaths = {
	v4_0_379: path.join(__dirname, 'pdfjs-versions', 'v4.0.379', 'build', 'pdf.mjs'),
	v2_0_550: path.join(__dirname, 'pdfjs-versions', 'v2.0.550', 'build', 'pdf.js'),
}

async function extractText(version: keyof typeof pdfjsPaths, pdfPath: string) {
	const PDFJS = await import (pdfjsPaths[version])
	const pdfBuffer = await fs.promises.readFile(pdfPath)
	const pdfArrBuffer = new Uint8Array(pdfBuffer);
	const doc = await PDFJS.getDocument({
		data: pdfArrBuffer,
	}).promise;
	let text = ''
	for (let i = 1; i <= doc.numPages; i++) {
		const page = await doc.getPage(i);
		const { items } = await page.getTextContent();
		for (const item of items) {
			text += item.str;
		}
		text += '\n\n';
	}
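	// writeFile expects a string or buffer, so serialize the result object first.
	await fs.promises.writeFile(path.join(__dirname, `${version}.json`), JSON.stringify({ version, text }, null, 2))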
	return text
}

const pdfPath = path.join(__dirname, 'congressional-daily-record-170-13.pdf');
await extractText('v4_0_379', pdfPath);
await extractText('v2_0_550', pdfPath);
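
One possible contributor to the missing spaces, not confirmed in this report: recent PDF.js releases mark line ends with a `hasEOL` flag on each text item instead of adding whitespace to `str`, so concatenating `str` alone glues the last word of a line to the first word of the next. The sketch below is a hypothetical helper (`joinTextItems` and `TextItemLike` are illustrative names, not part of PDF.js) that inserts a separator wherever `hasEOL` is set. It assumes a single space is an acceptable stand-in for a line break, and it does not rejoin hyphenated words such as "of-fered" or repair misplaced spaces such as "PETERW ELCH".

```typescript
// Minimal shape of a text item. `hasEOL` is set by recent PDF.js releases and
// is simply undefined on items from old releases such as v2.0.550.
interface TextItemLike {
	str: string;
	hasEOL?: boolean;
}

// Join items into one string, adding a separator wherever an item ends a line.
// Assumption: one space is good enough to represent a line break.
function joinTextItems(items: TextItemLike[], eolSeparator = ' '): string {
	let text = '';
	for (const item of items) {
		text += item.str;
		// Skip the separator if the text already ends with it, to avoid doubling.
		if (item.hasEOL && !text.endsWith(eolSeparator)) {
			text += eolSeparator;
		}
	}
	return text;
}

// Hypothetical items resembling the "was" / "called" line break in the report.
const sample: TextItemLike[] = [
	{ str: 'and was', hasEOL: true },
	{ str: 'called to order', hasEOL: false },
];
console.log(joinTextItems(sample)); // "and was called to order"
```

In the reproduction script, this would replace the inner `for (const item of items)` loop with `text += joinTextItems(items)`.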

v4.0.379 outputs the following (text is shortened for GitHub):

{
    "version": "4.0.379",
    "text": "Congressional RecordUNUMEPLURIBUSUnited Statesof America PROCEEDINGS AND DEBATES OF THE 118 th CONGRESS, SECOND SESSION∑ This ‘‘bullet’’ symbol identifies statements or insertions which are not spoken by a Member of the Senate on the floor..S227Vol. 170 WASHINGTON, WEDNESDAY, JANUARY 24, 2024 No. 13House of RepresentativesThe House was not in session today. Its next meeting will be held on Thursday, January 25, 2024, at 3 p.m.SenateWEDNESDAY, JANUARY 24, 2024The Senate met at 10 a.m. and wascalled to order by the Honorable PETERW ELCH, a Senator from the State ofVermont.fPRAYERThe Chaplain, Dr. Barry C. Black, of-fered the following prayer:Let us pray.Eternal God who rules the raging ofthe sea, draw our Senators to Youtoday by the cords of Your eternallove. Help them to strive to know You,cultivating a relationship of peacefultrust in Your prevailing providence.May the experience of being in Yourpresence enable them to better com-prehend the role You desire for them toplay in fulfilling Your purposes onEarth. Sharpen their vision to perceiveYour movements in our Nation andworld. Where there is anxiety, give ourlawmakers the poise that comes from aconfident faith in You.We pray in Your merciful Name.Amen.fPLEDGE OF ALLEGIANCEThe Presiding Officer led the Pledgeof Allegiance, as follows:I pledge allegiance to the Flag of theUnited States of America, and to the Repub-lic for which it stands, one nation under God,indivisible, with liberty and justice for all.fAPPOINTMENT OF ACTINGPRESIDENT PRO TEMPOREThe PRESIDING OFFICER. Theclerk will please read a communicationto the Senate from the President protempore (Mrs. M URRAY).The senior assistant legislative clerkread the following letter:U.S. 
SENATE,P RESIDENT PRO TEMPORE,Washington, DC, January 24, 2024.To the Senate:Under the provisions of rule I, paragraph 3,of the Standing Rules of the Senate, I herebyappoint the Honorable P ETER W ELCH, a Sen-ator from the State of Vermont, to performthe duties of the Chair.PATTY M URRAY,President pro tempore.Mr. WELCH thereupon assumed theChair as Acting President pro tempore.fRESERVATION OF LEADER TIMEThe ACTING PRESIDENT pro tem-pore. Under the previous order, theleadership time is reserved.fCONCLUSION OF MORNINGBUSINESSThe ACTING PRESIDENT pro tem-pore. Morning business is closed.fEXECUTIVE SESSIONEXECUTIVE CALENDARThe ACTING PRESIDENT pro tem-pore. Under the previous order, theSenate will proceed to executive ses-sion to resume consideration of the fol-lowing nomination, which the clerkwill report.The senior assistant legislative clerkread the nomination of Jacquelyn D.Austin, of South Carolina, to be UnitedStates District Judge for the Districtof South Carolina.RECOGNITION OF THE MAJORITY LEADERThe ACTING PRESIDENT pro tem-pore. The majority leader is recog-nized.SUPPLEMENTAL FUNDINGMr. SCHUMER. Mr. President, well,the latest round of Ukrainian securityassistance was a $250 million packagethat included 155mm rounds, Stingeranti-aircraft missiles, and other crit-ical weapons that have been crucial forUkraine on the battlefield. That an-nouncement was made on December 27.That is 28 days ago—4 weeks. Sincethen, no more aid—no more aid—hasbeen sent to Ukraine. And there won’tbe more unless Congress acts.In the meantime, it has been re-ported that Russia is beginning to re-stock its own supplies with help fromNorth Korea, including North Koreanmissiles.Right now, Senate negotiators onboth sides are working furiously to ap-prove another round of Ukraine aid byfinalizing our national security supple-mental package. 
This package wouldnot only deliver a lifeline for Ukraine,it would secure our border, send aid toIsrael, provide humanitarian assistancefor innocent civilians in Gaza, andshore up security in the Indo-Pacific. …<SHORTENED FOR GITHUB ISSUE>"
}

Notice that spaces are missing, e.g.:
WEDNESDAY, JANUARY 24, 2024 No. 13House of RepresentativesThe

Notice that some spaces are in the wrong place, e.g.:
called to order by the Honorable PETERW ELCH

v2.0.550 outputs the following (text is shortened for GitHub):

{
    "version": "2.0.550",
    "text": "Congressional  RecordUNUMEPLURIBUSUnited  Statesof AmericaPROCEEDINGS  AND  DEBATES  OF THE 118th  CONGRESS,  SECOND  SESSION∑ This  ‘‘bullet’’  symbol  identifies  statements  or  insertions  which  are  not  spoken  by  a  Member  of  the  Senate  on  the  floor..S227 Vol.  170 WASHINGTON,  WEDNESDAY,  JANUARY  24,  2024 No.  13 House  of  Representatives The House was not in session today. Its next meeting will be held on Thursday, January 25, 2024, at 3 p.m. Senate WEDNESDAY, JANUARY24, 2024 The  Senate  met  at  10  a.m.  and  was called to order by the Honorable PETER WELCH,  a  Senator  from  the  State  of Vermont. f PRAYER The Chaplain, Dr. Barry C. Black, of-fered the following prayer: Let us pray. Eternal  God  who  rules  the  raging  of the   sea,   draw   our   Senators   to   You today  by  the  cords  of  Your  eternal love. Help them to strive to know You, cultivating  a  relationship  of  peaceful trust  in  Your  prevailing  providence. May  the  experience  of  being  in  Your presence  enable  them  to  better  com-prehend the role You desire for them to play   in   fulfilling   Your   purposes   on Earth. Sharpen their vision to perceive Your  movements  in  our  Nation  and world. Where there is anxiety, give our lawmakers the poise that comes from a confident faith in You. We   pray   in   Your   merciful   Name. Amen. f PLEDGE  OF  ALLEGIANCE The  Presiding  Officer  led  the  Pledge of Allegiance, as follows: I  pledge  allegiance  to  the  Flag  of  the United States of America, and to the Repub-lic for which it stands, one nation under God, indivisible, with liberty and justice for all. f APPOINTMENT  OF  ACTING PRESIDENT  PRO  TEMPORE The    PRESIDING    OFFICER.    The clerk will please read a communication to  the  Senate  from  the  President  pro tempore (Mrs. MURRAY). The senior assistant legislative clerk read the following letter: U.S. SENATE, PRESIDENT PRO TEMPORE, Washington, DC, January 24, 2024. 
To the Senate: Under the provisions of rule I, paragraph 3, of the Standing Rules of the Senate, I hereby appoint  the  Honorable  PETERWELCH,  a  Sen-ator  from  the  State  of  Vermont,  to  perform the duties of the Chair. PATTYMURRAY, President pro tempore. Mr.  WELCH  thereupon  assumed  the Chair as Acting President pro tempore. f RESERVATION  OF  LEADER  TIME The  ACTING  PRESIDENT  pro  tem-pore.  Under  the  previous  order,  the leadership time is reserved. f CONCLUSION  OF  MORNING BUSINESS The  ACTING  PRESIDENT  pro  tem-pore. Morning business is closed. f EXECUTIVE  SESSION EXECUTIVE  CALENDAR The  ACTING  PRESIDENT  pro  tem-pore.  Under  the  previous  order,  the Senate  will  proceed  to  executive  ses-sion to resume consideration of the fol-lowing   nomination,   which   the   clerk will report. The senior assistant legislative clerk read  the  nomination  of  Jacquelyn  D. Austin, of South Carolina, to be United States  District  Judge  for  the  District of South Carolina. RECOGNITION OF THE MAJORITY LEADER The  ACTING  PRESIDENT  pro  tem-pore.   The   majority   leader   is   recog-nized. SUPPLEMENTAL FUNDING Mr.  SCHUMER.  Mr.  President,  well, the  latest  round  of  Ukrainian  security assistance  was  a  $250  million  package that  included  155mm  rounds,  Stinger anti-aircraft  missiles,  and  other  crit-ical weapons that have been crucial for Ukraine  on  the  battlefield.  That  an-nouncement was made on December 27. That  is  28  days  ago—4  weeks.  Since then,  no  more  aid—no  more  aid—has been  sent  to  Ukraine.  And  there  won’t be more unless Congress acts. In  the  meantime,  it  has  been  re-ported  that  Russia  is  beginning  to  re-stock  its  own  supplies  with  help  from North  Korea,  including  North  Korean missiles. Right   now,   Senate   negotiators   on both sides are working furiously to ap-prove  another  round  of  Ukraine  aid  by finalizing our national security supple-mental  package.  
This  package  would not  only  deliver  a  lifeline  for  Ukraine, it would secure our border, send aid to Israel, provide humanitarian assistance for   innocent   civilians   in   Gaza,   and shore  up  security  in  the  Indo-Pacific. …<SHORTENED FOR GITHUB ISSUE>"
}

Notice that the spacing is more accurate, e.g.:
WEDNESDAY, JANUARY 24, 2024 No. 13 House of Representatives The

Notice that the space is in the correct place, e.g.:
called to order by the Honorable PETER WELCH
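
To locate discrepancies like these without reading both outputs by eye, the two extracted strings can be compared after collapsing whitespace, so that the doubled spaces in the v2.0.550 output do not count as differences. This is a minimal sketch; `collapseWhitespace` and `firstDifference` are hypothetical helper names, and the sample strings are short excerpts from the outputs above.

```typescript
// Collapse every run of whitespace to a single space and trim the ends.
function collapseWhitespace(text: string): string {
	return text.replace(/\s+/g, ' ').trim();
}

// Return the index of the first differing character, or -1 if the strings are equal.
function firstDifference(a: string, b: string): number {
	const limit = Math.min(a.length, b.length);
	for (let i = 0; i < limit; i++) {
		if (a[i] !== b[i]) return i;
	}
	return a.length === b.length ? -1 : limit;
}

const older = collapseWhitespace('the Honorable PETER WELCH,  a  Senator');
const newer = collapseWhitespace('the Honorable PETERW ELCH, a Senator');
const at = firstDifference(older, newer);

// Print the position and a short window from each string around the mismatch.
console.log(at, JSON.stringify(older.slice(at, at + 8)), JSON.stringify(newer.slice(at, at + 8)));
```

Running this over the full outputs reports only the first mismatch. Finding all of them would need a proper diff, which is outside this sketch.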

Here's a screenshot of the first page of the PDF; I've highlighted the text mentioned above in red:
[Screenshot: CleanShot 2024-03-25 at 14 09 37]
