Skip to content

UniPDF Developer Guide

Gunnsteinn Hall edited this page May 3, 2020 · 7 revisions

UniPDF Developer Guide

The goal with this document is to define what conventions to follow when contributing to UniPDF. This helps unify the codebase and makes it easier for new contributors to get started. It can be referenced when reviewing Pull Requests.

It is a work in progress that will evolve and improve over time. If there are any comments, feel to contact us or file an issue.

Overview:

1. Follow recommended best practices

We generally try to follow https://github.com/golang/go/wiki/CodeReviewComments although there are a few exceptions, notably

  • We are flexible on the case of acronyms, for example we prefer "Pdf" in type names rather than "PDF", or "Id" rather than "ID"
  • Type names in unidoc should be explicit as the PDF standard has a plethora of data models that would get cluttered if trying to abbreviate each one. We believe we are following the approach of keeping it as simple as possible but no simpler than that.

2. Use goimports.

Run goimports to clean up imports and formatting within the entire project.

3. go test passing at all times

That is:

go test -v github.com/unidoc/unidoc/...

should pass (without exception).

4. go vet returning 0 issues

Go vet should be passing at all times. That is:

go vet github.com/unidoc/unidoc/...

should not return any issues.

5. golint not regressing.

Over time we hope to be fully (or at least mostly) compatible with golint, i.e. golint returning as few issues as possible.

Ensure that golint issue count is not increasing on new contributions. I.e.

golint github.com/unidoc/unidoc/... | wc -l

should be equal or lower as the base branch that the PR is made to (typically master).

6. Test coverage.

All code should have test coverage, aiming for 90% coverage, although flexible depending on complexity. We try to be practical: No need to test every single getter and setter.

To assess test coverage, run

go test -cover github.com/unidoc/unidoc/...

on both the feature branch and compare with output from master. Ensure that there is no regression. Packages should aim for 90% coverage.

In addition to the go test coverage, we have benchmarks that test a large number of real PDF files from a PDF corpus to avoid regression and test for a lot of cases that can occur in complex PDF files made by many different sources.

7. Keep PdfObject type conversions simple.

Use utility functions when converting PdfObject's to specific type and minimize use of TraceToDirectObject. (Applies to v3 only)

In v3 we have added type conversion functions that also perform tracing. Thus mostly eliminating the need to use TraceToDirectObject. For instance the following code (v2)

subtype, ok := core.TraceToDirectObject(obj).(*core.PdfObjectName)
if !ok {
	common.Log.Debug("Incompatibility ERROR: subtype not a name (%T) ", obj)
	return nil, errors.New("Type check error")
}

should be (v3)

subtype, ok := core.GetNameVal(d.Get("Subtype"))
if !ok {
	common.Log.Debug("ERROR: Font Incompatibility. Subtype (Required) missing")
	return nil, nil, ErrRequiredAttributeMissing
}

The previous syntax is still supported although the new cleaner syntax is preferred for readability.

More type conversion methods in package core (available in v3), function signatures:

func GetBool(obj PdfObject) (bo *PdfObjectBool, found bool) {}
func GetBoolVal(obj PdfObject) (b bool, found bool) {}
func GetInt(obj PdfObject) (into *PdfObjectInteger, found bool) {}
func GetIntVal(obj PdfObject) (val int, found bool) {}
func GetFloat(obj PdfObject) (fo *PdfObjectFloat, found bool) {}
func GetFloatVal(obj PdfObject) (val float64, found bool) {}
func GetString(obj PdfObject) (so *PdfObjectString, found bool) {}
func GetStringVal(obj PdfObject) (val string, found bool) {}
func GetStringBytes(obj PdfObject) (val []byte, found bool) {}
func GetName(obj PdfObject) (name *PdfObjectName, found bool) {}
func GetArray(obj PdfObject) (arr *PdfObjectArray, found bool) {}
func GetDict(obj PdfObject) (dict *PdfObjectDictionary, found bool) {}
func GetIndirect(obj PdfObject) (ind *PdfIndirectObject, found bool) {}
func GetStream(obj PdfObject) (stream *PdfObjectStream, found bool) {}
func GetNumberAsFloat(obj PdfObject) (float64, error) {}
func GetNumbersAsFloat(objects []PdfObject) (floats []float64, err error) {}
func GetNumberAsInt64(obj PdfObject) (int64, error) {}

8. License header in each go file

Ensure that the license header is present in the beginning of each go file before the package name. Should start with:

/*
 * This file is subject to the terms and conditions defined in
 * file 'LICENSE.md', which is part of this source code package.
 */

package ...

9. Avoid use of dot imports in new code.

Our goal is to eliminate use of dot imports in the code base over time. While dot imports between the model and core package are safe as we fully control both namespaces, it is not a recommended practice and will be refactored out progressively.

This also helps making the interface of the core package simple so that code without dot import can be readable and simple to work with.

10. Include references to the PDF standard

When implementing parts of the PDF standard, provide a reference to the section and page numbers. A short summary can also be provided where appropriate.

By default we refer to the PDF32000_2008 standard, accessible here.

For example:

// PdfFontDescriptor specifies metrics and other attributes of a font and 
// can refer to a FontFile for embedded fonts.
// 9.8 Font Descriptors (page 281)
type PdfFontDescriptor struct {
...
}

When referencing newer standard such as ISO32000-2 (aka. PDF 2.0), specify that also, e.g. (ISO32000-2) and possibly any other information needed to find the right document such that the section and page numbers are accurate.

11. Prefer type switches for multiple types.

Use .(type) switches in preference to multiple type assertions, except if there is only a single type check.

func GetNumberAsFloat(obj PdfObject) (float64, error) {
	switch t := obj.(type) {
	case *PdfObjectFloat:
		return float64(*t), nil
	case *PdfObjectInteger:
		return float64(*t), nil
	}
	return 0, ErrNotANumber
}

rather than

func GetNumberAsFloat(obj PdfObject) (float64, error) {
	if t, ok := obj.(*PdfObjectFloat); ok {
		return float64(*t), nil
      }
	if t, ok := obj.(*PdfObjectInteger); ok {
		return float64(*t), nil
      }
	return 0, ErrNotANumber
}

In general prefer switches to multiple if else as go checks for missed cases.

12. Doc comments

All exported names should have godoc-styled comment, as should any non-trivial unexported names. See https://golang.org/doc/effective_go.html#commentary for more information about commentary conventions.

When referencing a variable in the godoc-styled comment, place the variable name in ticks, example:

// newFontBaseFieldsFromPdfObject returns `fontObj` as a dictionary the common fields from that
// dictionary in the fontCommon return.  If there is a problem an error is returned.
// The fontCommon is the group of fields common to all PDF fonts.
func newFontBaseFieldsFromPdfObject(fontObj core.PdfObject) (*core.PdfObjectDictionary, *fontCommon,
	error) {

When documenting functions that implement an interface, place a comment indicating that the function implements the interface, e.g.

// ToPdfObject implements interface PdfModel.
func (stamp *PdfAnnotationStamp) ToPdfObject() core.PdfObject {
	stamp.PdfAnnotation.ToPdfObject()
	container := stamp.container
	d := container.PdfObject.(*core.PdfObjectDictionary)
	stamp.PdfAnnotationMarkup.appendToPdfDictionary(d)

	d.SetIfNotNil("Subtype", core.MakeName("Stamp"))
	d.SetIfNotNil("Name", stamp.Name)
	return container
}

If the function that implements the interface has any unusual behavior then add a note of that in the comments as well, e.g.

// ToPdfObject implements interface PdfModel.
// Note: Call the sub-annotation's ToPdfObject to set both the generic and non-generic information.
func (a *PdfAnnotation) ToPdfObject() core.PdfObject {
...
}

13. Logging conventions.

There are 3 primary log levels.

type Logger interface {
	// Error logs system or other critical errors. Intended to be used for system errors, rather than PDF
	// incompatibility errors (prefer Debug for that).
	Error(format string, args ...interface{})

	// Debug logs debug messages that are intended to assist with troubleshooting during development. Expected
	// to report messages infrequently (compared to Trace).
        // Should also indicate incompatibilities such as when documents not following the PDF standard.
	Debug(format string, args ...interface{})

	// Trace logs debug messages with a pretty high frequency. Intended for understanding program flow, when
	// debugging tricky PDF files. Typically leaves a large number of messages and trace logs can grow large.
	Trace(format string, args ...interface{})
}

Example logging a PDF incompatibility where flow is not stopped

if len(*name) > 127 {
	common.Log.Debug("INCOMPATIBLE: Name too long (%s)", *name)
}

or when flow is stopped (due to PDF file error)

encoder, err := NewEncoderFromStream(streamObj)
if err != nil {
	common.Log.Debug("ERROR: Stream decoding failed: %v", err)
	return nil, err
}

Error is intended to report critical errors, while we always pass the error upwards.

// Load the image with default handler.
img, err := model.ImageHandling.Read(imgReader)
if err != nil {
	common.Log.Error("Error loading image: %s", err)
	return nil, err
}

Trace outputs frequent information to help follow the flow, e.g.

bb, _ := parser.reader.Peek(100)
common.Log.Trace("OBJ peek \"%s\"", string(bb))

14. Test and supplementary files.

Test and supplementary data should be placed in a testdata folder under the relevant package. If used by more than one package, then can be placed in the testdata folder of the more fitting package or a more toplevel testdata folder, depending on what is the most fitting location for the files in question (e.g. what package uses those files the most etc).

15. TODO/FIXME/NOTE comments should refer to author by GitHub handle.

For example

// FIXME(gunnsth): Not working for case t = 3.

Also, let's use the tags TODO/FIXME/NOTE and avoid XXX as the meaning is more clear.

16. Error handling - don't panic

We don't use panic under any circumstances. The library should never crash due to a bogus input file.

Errors should generally be passed up the stack

err := doStuff()
if err != nil {
  common.Log.Debug("ERROR: doStuff failed: %v", err)
  return err
}

When an error occurs we log an error in Debug mode. We use Debug mode because the error is associated with the input PDF file (not a system error).

name, ok := op.Params[0].(*PdfObjectName)
if !ok {
	common.Log.Debug("ERROR: CS command with invalid parameter, skipping over")
	return errors.New("type check error")
}

Unfortunately, deviations from the PDF standard are quite common in PDF files. Thus, in some cases we may choose to ignore an error. In those cases we log a debug message and return out or continue (depending on context).

…
} else {
	common.Log.Debug("ERROR: --------INVALID TYPE XrefStm invalid?-------")
	// Continue, we do not define anything -> null object.
	// 7.5.8.3:
	//
	// In PDF 1.5 through PDF 1.7, only types 0, 1, and 2 are
	// allowed. Any other value shall be interpreted as a
	// reference to the null object, thus permitting new entry
	// types to be defined in the future.
	continue
}

In some cases we back out and use a default value.

fVal, err := strconv.ParseFloat(numStr, 64)
if err != nil {
	common.Log.Debug("Error parsing number %q err=%v. Using 0.0. Output may be incorrect", numStr, err)
 	fVal = 0.0
	err = nil
}
return core.MakeFloat(fVal), nil

This is never a preferred approach. We always start by handling errors heads on, i.e. passing them up the stack. Only when we find that a significant number of PDF files (that are considered acceptable by mainstream programs) fail as a result, we consider implementing this kind of compatibility hack/workaround.

17. Variable declarations.

A basic knowledge of type inference as per https://tour.golang.org/basics/14 is assumed (for int, float64).

When declaring and initializing variables, follow the convention of declaring what matters, e.g. make the value and type explicit if it matters.

i) If the initial value matters, make it explicit

i := 0

ii) If the type matters, make it explicit

var offset uint64

iii) If both initial value and type matters, make both explicit (except if obvious: bool, string, int, float64)

offset := uint64(0)
useX := false

Cases where it is obvious (type inference)

pi := 3.14  // float64
i := 0 // int
str := "string of runes" // string
useBorders := false // bool

iv) If the initial value does not matter declare using var.

var i int

v) When declaring multiple variables that are related, enclose within a var block. See also https://golang.org/doc/effective_go.html#commentary

vi) When declaring empty slices, prefer

var t []string

over

t := []string{}

as per https://github.com/golang/go/wiki/CodeReviewComments#declaring-empty-slices

vii) Empty maps When the map size is known a-priori, use

  m := make(map[x]y, 10)

otherwise prefer

  m := make(map[x]y)

over

  m := map[x]y{}

although in terms of performance it is equivalent and only a matter or preference. For consistency most of unidoc code uses the recommended approach.

For maps with pre-defined data, use a map literal:

commits := map[string]int{
    "rsc": 3711,
    "r":   2138,
    "gri": 1908,
    "adg": 912,
}

Examples

  1. Iterator - initial value matters.
for i := 0; i < 20; i++ { ... }
  1. Default value - can change based on logic - initial value matters, type matters, float64 - inferred.
tx := 2.0 // Default offset.
if condition {
    tx += 1.0
}
  1. Variable declaration where initial value does not matter.
var alignment quadding
if condition {
   alignment = quaddingLeft
} else {
  alignment = quaddingRight
}

although this could be better written as

alignment := quaddingRight
if condition {
  alignment = quaddingLeft
}

18. Avoid unnecessary dependencies.

Dependencies can be very useful, while they can also be a burden and require updating and testing to ensure non-regression. Thus, we generally prefer to avoid dependencies for any minimal tasks. As a rule of thumb that could mean 500+ lines of unidoc code although that is up for discussion for each case.

19. Import group styles.

The style used in imports for consistency is:

import (
    // standard library imports

    // external dependency imports

    // unipdf public imports

    // unipdf internal imports
)

For example:

import (
	"errors"
	"io/ioutil"
	"strings"

	"golang.org/x/text/unicode/norm"

	"github.com/unidoc/unipdf/v3/common"
	"github.com/unidoc/unipdf/v3/core"

	"github.com/unidoc/unipdf/v3/internal/textencoding"
	"github.com/unidoc/unipdf/v3/model/internal/fonts"
)

20. Blank lines.

Blank lines can be great to separate logic, but they should not be used randomly. Examples:

  1. Blank lines should not appear- at top of a block. Prefer
func x() {
    return
}

// and

func x() {
    if 2 > 1 {
        return
    }
}

rather than (bad)

func x() {

    return
}

// or

func x() {
    if 2 > 1 {

        return
    }
}
  1. Blank lines should not appear at bottom of blocks. Prefer
func negative(x float64) {
    return -x
}

func redundant() {
    if 3 < 4 {
        fmt.Printf("Great\n")
    }
}

over

func negative(x float64) {
    return -x

}

func redundant() {
    if 3 < 4 {
        fmt.Printf("Great\n")

    }
}
  1. Blank lines are great to separate logic within long blocks e.g.
if condition1 {
    // many lines of code ...
}

if condition2 {
    // many lines of code ...
}

21. No naked returns.

Named returns are great to indicate the meaning of the return values. Especially if there are more than 2 return values or if the name of the function does not indicate what it returns.

However, using them with a naked return makes the return statement less explicit.

Prefer:

func x() (err error) {
    return err
}

over

func x() (err error) {
    return
}

22. Section comments.

Section comments are comments that denote a section within a Go source code file. They typically come before a series of related declarations or functions.

Those should be declared starting with an empty line, and one line with just // then the comment describing the section/group and another // followed by an empty line.

The section comment itself is typically a title and does not need to end with a period.

Example:

// Context defines operations for rendering to a particular target.
type Context interface {
	//
	// Graphics state operations
	//

	// Push adds the current context state on the stack.
	Push()

	// Pop removes the most recent context state from the stack.
	Pop()

	//
	// Matrix operations
	//

	// Matrix returns the current transformation matrix.
	Matrix() transform.Matrix

	// SetMatrix modifies the transformation matrix.
	SetMatrix(m transform.Matrix)

If the section comment follows an opening bracket, a newline before can be omitted:

switch op.Operand {
	//
	// Graphics stage operators
	//

	// Push current graphics state to the stack.
	case "q":
		ctx.Push()
	// Pop graphics state from the stack.
	case "Q":
		ctx.Pop()

23. Package naming guidelines.

Package names are important for at least 2 reasons:

  • They are the first thing that the user sees when encountering a package
  • They are the text that the user needs to type every time when they use the package

Therefore:

  • They should give some context about what the package contains.
  • They should not be too short or generic that they conflict with variable names or that it is likely that there are many packages with the same name.
  • Since the namespace is only double layered, i.e. package.Export, it makes sense that the user should get some idea of the context from the package name.
  • Package names should not be too cumbersome that users would want to use package aliases.

In general, package naming is nontrivial and those only serve as guidelines, whereas a good package name comes from insight as well.

24. Avoid package aliases in imports.

This expands upon the previous section.

Avoid using package aliases in imports when importing packages with exactly the same name. E.g. only in cases such as:

import (
  aimage "a/image"
  bimage "b/image"
)

would be acceptable. If package under a/image was considered more default or more used, it would be recommended to do

import (
  "a/image"
  bimage "b/image"
)

Generally, prefer

import (
  "f/somepackage"
)

over

import (
  somepkg "f/somepackage"
)

We should rather create package names that are nice to import and use.

In the unlikely case that an external dependency package has a really generic name that conflicts with many variables in our code, we may consider a type alias. Or if it has a really bad name that says nothing about the package contents (although, I doubt we would ever use such a package).