Extract text with whitespace and all new lines #383

Answered by topcat30
EspressoWillie asked this question in Q&A
Hi, this gets me most of the way. I usually set GAP to 0.3 and OrigRow to true. GAP basically controls how much space is added between characters, so you can tune it to the font size. Things get a bit messy if the font size changes a lot.
Let me know if you find a better way.

Oh, and you can switch the output to either spool or lines if you want an XML dump of the text data.

using System;
using System.Text;
using System.IO;
using Org.BouncyCastle.Cms;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Linq;
using System.Collections;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
namespace PDFTools
{
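
For anyone reading along, here is a minimal sketch of the gap idea described above. It is not the code from this answer and it does not call PdfPig at all: `Glyph`, `LayoutText.Rebuild`, `gap`, and `rowTolerance` are hypothetical names, and treating the gap as a fraction of the previous glyph's width is only an assumption about how a value like 0.3 might be applied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a positioned character (not a PdfPig type).
public sealed class Glyph
{
    public string Value { get; }
    public double Left { get; }
    public double Bottom { get; }
    public double Width { get; }

    public Glyph(string value, double left, double bottom, double width)
    {
        Value = value;
        Left = left;
        Bottom = bottom;
        Width = width;
    }
}

public static class LayoutText
{
    // Rebuilds text row by row, padding horizontal gaps with spaces.
    // Assumption: a gap wider than `gap` times the previous glyph's width
    // counts as whitespace; `rowTolerance` is in the same units as Bottom.
    public static string Rebuild(IEnumerable<Glyph> glyphs, double gap = 0.3, double rowTolerance = 2.0)
    {
        // PDF y coordinates grow upwards, so the highest row comes first.
        var sorted = glyphs.OrderByDescending(g => g.Bottom).ThenBy(g => g.Left).ToList();

        // Cluster glyphs into rows by comparing against the row's first glyph.
        var rows = new List<List<Glyph>>();
        foreach (var g in sorted)
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Bottom - g.Bottom) <= rowTolerance)
                last.Add(g);
            else
                rows.Add(new List<Glyph> { g });
        }

        var sb = new StringBuilder();
        foreach (var row in rows)
        {
            Glyph prev = null;
            foreach (var g in row.OrderBy(x => x.Left))
            {
                if (prev != null)
                {
                    double space = g.Left - (prev.Left + prev.Width);
                    double unit = Math.Max(prev.Width, 0.01); // avoid dividing by zero
                    if (space > gap * unit)
                    {
                        // Roughly one space per glyph-width of empty room.
                        int count = Math.Max(1, (int)Math.Round(space / unit));
                        sb.Append(' ', count);
                    }
                }
                sb.Append(g.Value);
                prev = g;
            }
            sb.Append('\n'); // one output line per detected row
        }
        return sb.ToString();
    }
}
```

With PdfPig, the inputs would presumably be built from `page.Letters`, taking each letter's `Value` and its `GlyphRectangle` (`Left`, `Bottom`, `Width`). Scaling by the previous glyph's width is one way to make the threshold follow the font size, but as noted above it will still misbehave when sizes change a lot within a row. This sketch also only breaks lines between rows; it does not try to reproduce blank lines.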

Answer selected by EspressoWillie