davebrokit edited this page May 27, 2024 · 33 revisions

This wiki contains more detail on various aspects of the public API and the PDF document format.

Features

  • Extracts the position and size of letters from any PDF document. This enables access to the text and words in a PDF document.
  • Allows the user to retrieve images from the PDF document.
  • Allows the user to read PDF annotations, PDF forms, embedded documents and hyperlinks from a PDF.
  • Provides access to metadata in the document.
  • Exposes the internal structure of the PDF document.
  • Creates PDF documents containing text and path operations.
  • Reads content from encrypted files when the password is provided.
  • Document Layout Analysis: PdfPig includes tools for document layout analysis, such as the Recursive XY Cut, Document Spectrum (Docstrum) and Nearest Neighbour algorithms, among others. It also supports exporting page contents to the ALTO, PageXML and hOCR formats. See Document Layout Analysis.
  • Tables are not directly supported, but you can use Tabula Sharp or Camelot Sharp. As of 2023, Tabula Sharp is the most complete port (source).
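
As a rough illustration of a few of the features listed above, the sketch below opens a possibly encrypted file, reads some metadata and lists the images on each page. `SummariseDocument` is a hypothetical helper written for this page, not part of PdfPig; the path and password are placeholders you would supply yourself.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical helper (not part of PdfPig): collects a few facts about a document.
static List<string> SummariseDocument(string path, string password)
{
    var summary = new List<string>();

    // The password only matters for encrypted files.
    var options = new ParsingOptions { Password = password };

    using (PdfDocument document = PdfDocument.Open(path, options))
    {
        // Document-level metadata.
        summary.Add($"Pages: {document.NumberOfPages}");
        summary.Add($"Title: {document.Information.Title}");
        summary.Add($"Author: {document.Information.Author}");

        foreach (Page page in document.GetPages())
        {
            foreach (IPdfImage image in page.GetImages())
            {
                // TryGetPng returns false when the image cannot be converted to PNG.
                bool converted = image.TryGetPng(out var png);
                int pngLength = converted ? png.Length : 0;
                summary.Add($"Page {page.Number}: image {image.Bounds.Width:F0} x {image.Bounds.Height:F0}, PNG bytes: {pngLength}");
            }
        }
    }

    return summary;
}
```

Returning the lines instead of printing them keeps the helper easy to reuse; pass an empty string as the password for unencrypted files.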

For some use cases, PdfPig provides an alternative to commercial libraries such as SpirePDF and to copyleft libraries such as iText 7 (AGPL).

Things you can't do:

Getting Started

PdfPig aims to provide two main areas of functionality:

  • Extracting PDF content.
  • Creating PDFs.

The simplest usage of the library for extracting content involves opening a document and extracting the position and text of all words across all pages:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	foreach (Page page in document.GetPages())
	{
		IEnumerable<Word> words = page.GetWords();
	}
}
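
The loop above collects the words but does nothing with them. A minimal sketch of one way to use them is shown here: it turns each word into a line containing its page number, text and bounding box. `DescribeWords` is a hypothetical helper name, not a PdfPig method.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical helper (not part of PdfPig): one descriptive line per word.
static List<string> DescribeWords(PdfDocument document)
{
    var lines = new List<string>();

    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            // BoundingBox is expressed in PDF page coordinates.
            var box = word.BoundingBox;
            lines.Add($"page {page.Number}: '{word.Text}' at ({box.Left:F1}, {box.Bottom:F1}), size {box.Width:F1} x {box.Height:F1}");
        }
    }

    return lines;
}
```

Call `DescribeWords(document)` inside the `using` block, while the document is still open, and write the returned lines wherever you need them.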

Pages can also be accessed individually with an index starting at 1. You can also access the positions and sizes of the individual letters on a page:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	Page page = document.GetPage(1);
	IReadOnlyList<Letter> letters = page.Letters;
}
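
To show what the letter data looks like, here is a small sketch that reports the value, point size and glyph rectangle of each letter on a page. `DescribeLetters` is a hypothetical helper introduced for illustration only.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig.Content;

// Hypothetical helper (not part of PdfPig): one descriptive line per letter.
static List<string> DescribeLetters(Page page)
{
    var lines = new List<string>();

    foreach (Letter letter in page.Letters)
    {
        // GlyphRectangle is the box around the drawn glyph in page coordinates.
        var box = letter.GlyphRectangle;
        lines.Add($"'{letter.Value}' {letter.PointSize:F1}pt at ({box.Left:F1}, {box.Bottom:F1}), size {box.Width:F1} x {box.Height:F1}");
    }

    return lines;
}
```

Passing the result of `document.GetPage(1)` to the helper gives one line per letter, in the order the page exposes them.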

New documents can be created using the Standard 14 fonts, which are defined in the PDF specification:

using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] b = builder.Build();

The resulting bytes are a valid PDF document and can be saved to the file system, served from a web server, etc.
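
One way to check this is a round trip: build the document, save the bytes, then open the same bytes again with PdfPig and read the text back. The sketch below assumes a temporary output file named `hello-world.pdf`, which is a placeholder chosen for this example.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

// Build a one-page document, as in the example above.
PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder pageBuilder = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
pageBuilder.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] bytes = builder.Build();

// Save the bytes to disk (placeholder location in the temp directory).
string outputPath = Path.Combine(Path.GetTempPath(), "hello-world.pdf");
File.WriteAllBytes(outputPath, bytes);

// Open the same bytes again and read the page text back.
string roundTrippedText;
using (PdfDocument document = PdfDocument.Open(bytes))
{
    roundTrippedText = document.GetPage(1).Text;
}

Console.WriteLine(roundTrippedText);
```

Opening from the byte array avoids touching the file system a second time; opening `outputPath` instead should behave the same way.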

You can use the document builder to visualise what PdfPig has extracted by copying a page from the source PDF and drawing a rectangle around each word using its bounding box:

using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

var sourcePdfPath = "";
var outputPath = "";

using (var document = PdfDocument.Open(sourcePdfPath))
{
    var builder = new PdfDocumentBuilder { };
    var pageBuilder = builder.AddPage(document, 1);
    pageBuilder.SetStrokeColor(255, 0, 0);
    var page = document.GetPage(1);
    foreach (var word in page.GetWords())
    {
        var box = word.BoundingBox;
        pageBuilder.DrawRectangle(box.BottomLeft, box.Width, box.Height);
    }

    byte[] fileBytes = builder.Build();
    File.WriteAllBytes(outputPath, fileBytes); // save to file
}

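This example performs a more advanced extraction: it groups letters into words, segments the page into text blocks, determines their reading order, and draws the result onto a copy of the page: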

using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

var sourcePdfPath = "";
var outputPath = "";
var pageNumber = 1;
using (var document = PdfDocument.Open(sourcePdfPath))
{
    var builder = new PdfDocumentBuilder { };
    PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
    var pageBuilder = builder.AddPage(document, pageNumber);
    pageBuilder.SetStrokeColor(0, 255, 0);
    var page = document.GetPage(pageNumber);
    var letters = page.Letters; // no preprocessing

    // 1. Extract words
    var wordExtractor = NearestNeighbourWordExtractor.Instance;
    var words = wordExtractor.GetWords(letters);

    // 2. Segment page
    var pageSegmenter = DocstrumBoundingBoxes.Instance;
    var textBlocks = pageSegmenter.GetBlocks(words);

    // 3. Postprocessing
    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
    var orderedTextBlocks = readingOrder.Get(textBlocks);

    // 4. Add debug info - Bounding boxes and reading order
    foreach (var block in orderedTextBlocks)
    {
        var bbox = block.BoundingBox;
        pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
        // AddText expects a point, so the label is anchored at the block's top-left corner
        pageBuilder.AddText(block.ReadingOrder.ToString(), 8, bbox.TopLeft, font);
    }

    // 5. Write result to a file
    byte[] fileBytes = builder.Build();
    File.WriteAllBytes(outputPath, fileBytes); // save to file
}

Contents

More details on the API can be found here.

Additional automated documentation from doc-comments can be found on DotNetApis.

Release Notes

Release notes and downloadable packages can be found on the releases page: https://github.com/UglyToad/PdfPig/releases