file-converter

A module-based .NET application that converts files and generates documentation for archiving.

This application provides a framework that lets different conversion libraries and software work together. It aims to offer a comprehensive open-source alternative to the many paid file conversion tools, and it supports multi-step conversion across different external libraries.

🗒️Table of Contents

Background
Install
Usage
Further Development
Acknowledgments
Contributing
Licensing

📖 Background

This project is a collaboration with the Innlandet County Archive and is a Bachelor's thesis project for a Bachelor's in Programming at the Norwegian University of Science and Technology (NTNU).

In Norway, archiving is regulated by the Archives Act, which states that public bodies have a duty to register and preserve documents that are created as part of their activity ¹. As society becomes more digitized, so does information: documents that were previously physical and stored physically are now digital and stored digitally. The Innlandet County Archive is an inter-municipal archive cooperation that registers and preserves documents from 48 municipalities. However, not all file types they receive are suitable for archiving, as some run a risk of becoming obsolete (for further reading see: Obsolescence: File Formats and Software). The Innlandet County Archive wished to streamline its conversion process into one application that could deal with a vast array of file formats. Furthermore, archiving is based on the principles of accessibility, accountability and integrity, which is why this application also documents all changes made to files.

Much like programmers and software developers, archivists believe in an open-source world. Therefore it would only be right for this program to be open-source.

⏬ Install

Install from source

Note

Cloning with the Git submodules is required for the application to work. If you did not clone the repository recursively, or do not see the Git submodules in your local repository, run:

  git submodule init
  git submodule update

To download the application source code run:

 git clone --recursive https://github.com/larsmhaugland/file-converter.git

Build it using mingw32 make (For further instructions see: mingw Tutorial) from the command line using:

make build_win #Build for Windows
make build_linux #Build for Linux (not stable, use Windows build)

The resulting binaries will be located in a new "Windows" or "Linux" directory.

Warning

If you want to build using dotnet build or an IDE you need to build both file-converter-prog2900.csproj and GUI/ChangeConverterSettings/ChangeConverterSettings.csproj.

👪 Dependencies

OS	Dependencies	Needed for
Windows and Linux	dotnet version 8.0	Needed to build and run the program.
Windows and Linux	Java JDK (only JRE also works)	Needed for converting emails.
Windows	Ghostscript (download the newest version and follow the installer instructions)	Needed for converting PostScript to PDF and PDF to image.
Windows	LibreOffice (download version 7.6.6)	Required for converting Office documents.
Linux	LibreOffice should already be present on Linux. This can be checked with `soffice --version`. Otherwise, download from the link above.	Required for converting Office documents.
Linux	Ghostscript. Should be installed on most distros, which can be checked by running `gs -version`.	Required for PostScript and PDF to image conversion.
Windows and Linux	wkhtmltopdf version 0.12.6	Needed for converting emails.
Linux	Siegfried	Needed to identify files and keep track of the conversion process.
Linux	email-outlook-message-perl. Can be installed with `sudo apt-get install libemail-outlook-message-perl`.	Needed to convert msg files on Linux.

Note

If you are on Linux see Installation for Linux for more info on Siegfried installation.

Further download instructions for LibreOffice and wkhtmltopdf

🪟 Windows

LibreOffice must be manually added to PATH on Windows for the program to convert Office files. The default installation path of LibreOffice is "C:\Program Files\LibreOffice", but the PATH entry needs to be "C:\Program Files\LibreOffice\program".

Open Settings -> Home -> About (scroll down on the left) -> Advanced system Settings (on the right) -> Environment variables.

Tip

Alternatively use the Windows key + R on the keyboard, then type in "sysdm.cpl" and hit enter. Thereafter, press Advanced and then Environment variables.

Locate the PATH variable and highlight it. Press Edit -> New -> paste the path to the program folder -> press OK. This adds it to the user's environment variables.

wkhtmltopdf must also be manually added to PATH. On Windows, this can be done as described above; just swap "C:\Program Files\LibreOffice\program" with "C:\Program Files\wkhtmltopdf\bin".

🐧 Linux

LibreOffice should already be installed on Linux, but wkhtmltopdf needs to be added. For Linux the default installation directory is ...
To add it as an environment variable:

Open the file .bashrc using nano ~/.bashrc.
Navigate to the bottom of the file with the arrow keys and add this line at the end: export PATH="$PATH:DefaultPathHere". Remember to save the file and exit.
To apply the changes immediately run the command source ~/.bashrc. Alternatively, log in and out.
To verify, run the command echo $PATH and the path added should be at the end of the output from the command.
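
Whether the tools can actually be found on PATH can also be checked programmatically. The following is a minimal sketch and not code from the application: it searches the directories listed in PATH for the `soffice` and `wkhtmltopdf` executables. The helper name `FindOnPath` is hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of the application): returns the full path of
// the first match for `tool` in a PATH-style string, or null if none exists.
static string? FindOnPath(string tool, string pathVariable)
{
    // On Windows the executable carries an .exe extension.
    string fileName = OperatingSystem.IsWindows() ? tool + ".exe" : tool;

    return pathVariable
        .Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries)
        .Select(folder => Path.Combine(folder.Trim(), fileName))
        .FirstOrDefault(File.Exists);
}

string path = Environment.GetEnvironmentVariable("PATH") ?? "";
foreach (string tool in new[] { "soffice", "wkhtmltopdf" })
{
    string? location = FindOnPath(tool, path);
    Console.WriteLine(location is null
        ? $"{tool}: not found on PATH"
        : $"{tool}: {location}");
}
```

If a tool is reported as not found right after editing PATH, open a new terminal first, since running terminals keep the old value.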

External libraries and software

Libraries

iText7 under the GNU Affero General Public License v3.0.
BouncyCastle.NetCore under the MIT License.
iText7 Bouncycastle Adapter under the GNU Affero General Public License v3.0.
CommandLineParser under the MIT License.
SharpCompress under the MIT License.
Avalonia under the MIT License.

Software

Ghostscript under the GNU Affero General Public License v3.0.
LibreOffice under the Mozilla Public License 2.0.
wkhtmltopdf under the GNU Lesser General Public License v3.0.
email-outlook-message-perl under the GNU Affero General Public License v3.0.
Rebex Mail Converter under Freeware.
email-to-pdf-converter under the Apache License 2.0.
Siegfried under the Apache License 2.0.

🪟 Installation for Windows

Download a pre-built binary from the Releases page and unzip it to a location in your system.

🐧 Installation for Linux

Download a pre-built binary from the Releases page and unzip it to a location in your system.

The application has been tested on the following Linux images:

Debian "bookworm" 12
Ubuntu Jammy Jellyfish 22.04 LTS
Fedora Workstation 39
Arch (kernel: Linux 6.7.7-arch1-1)

Running it on other distributions or other versions should be possible as long as they support dotnet version 8.0.

Important

Although running our application on other distributions should be fine, it may reduce the number of supported external libraries and software.

Installing Siegfried on Linux

If you are using a Debian, Arch or Red Hat based distro the application will guide you through Siegfried installation if it isn't already installed.

Screenshot of guided installation of Siegfried on Linux

Please see the dependencies needed for installation below:

Distro	Dependency
Ubuntu/Debian	curl
Arch Linux	curl brew ²
Fedora/Red Hat	brew ²

If you are not using one of these distros please see the Siegfried GitHub for information on downloading Siegfried.

🚀 Usage

Main CLI application

🔨 Beta

Since the program is still in beta, it has some limitations and bugs. The program has mostly been tested on Windows, so Linux-specific issues may be missing from the list.

Known bugs

GUI
- Starting GUI from the main program will crash the program on Linux
Office conversion (Linux)
- Office conversion using LibreOffice does not work correctly
PDF to Image
- Some files get an "IO security problem" error
- Signed documents get an "A generic error occurred in GDI+." error
Image to Image
- Documentation logs the Image to PDF and PDF to Image as two separate files

CLI

To run in CLI navigate to the path of the executable in the terminal and run:

$ .\file-converter-prog2900.exe

Alternatively, one can run the program using dotnet run.

Arguments

Note

All paths must be absolute or relative to the executable.

Set custom input folder

Default: input

$ .\example -i "C:\Users\user\Downloads"
$ .\example --input "C:\Users\user\Downloads"

Set custom output folder

Default: output

$ .\example -o "C:\Users\user\Downloads"
$ .\example --output "C:\Users\user\Downloads"

Set custom settings file

Default: Settings.xml

$ .\example -s "C:\Users\user\custom_Settings.xml"
$ .\example --settings "C:\Users\user\custom_Settings.xml"

Accept all queries in CLI

$ .\example -y
$ .\example --yes

GUI

GUI-version of settings

The GUI provides a more user-friendly way of editing the settings of the application (see Settings for further information). Here one can set all the metadata for running the program and which PRONOMs files should be converted to. A format's Default PRONOM is a list of all the PRONOMs belonging to that file format (e.g. all PRONOMs associated with the PDF file format).

Settings

Warning

The program copies files from the input to the output directory.
The output directory is not cleared between runs and if a file already exists in the output directory, it will not be replaced.
Therefore, if you have updated a file that exists in both directories you will need to manually delete the file from the output directory.

Settings can be set manually in an XML file.

Setting run time arguments

    <Requester></Requester>              <!-- Name of person requesting the conversion -->
    <Converter></Converter>              <!-- Name of person doing the conversion -->
    <ChecksumHashing></ChecksumHashing>  <!-- SHA256 (standard) or MD5 -->
    <InputFolder></InputFolder>          <!-- Specify input folder, default is "input" -->
    <OutputFolder></OutputFolder>        <!-- Specify output folder, default is "output" -->
    <MaxThreads></MaxThreads>            <!-- Write a number, default is cores*2 -->
    <Timeout></Timeout>                  <!-- Timeout in minutes, default is 30 min -->
    <MaxFileSize></MaxFileSize>          <!-- Max total input bytes per file for merged files, default is 1GB -->
    <!-- Note: the output file size of a merged file may differ from the total file size of the individual files that are merged -->

The first part of the XML file concerns arguments needed to run the program. The second part allows you to set up two things:

Global Settings stating that file format x should be converted to file format y.
Folder Settings stating that file format x should be converted to file format y in a specific folder.

Global Settings

<FileClass>
    <ClassName>pdf</ClassName>
    <Default>fmt/477</Default>  <!-- The target PRONOM code the class should be converted to -->
    <FileTypes>
        <Filename>pdf</Filename>
        <Pronoms>       <!-- List of all PRONOMs that should be converted to the target PRONOM -->
            fmt/95, fmt/354, fmt/476, fmt/477, fmt/478, fmt/479, fmt/480
        </Pronoms>
        <Default></Default>
    </FileTypes>
</FileClass>

Folder Settings

<FolderOverride>
	<FolderPath>apekatter</FolderPath>      <!-- Path after input folder example: /documents -->
	<Pronoms>fmt/41, fmt/42, fmt/43, fmt/44, x-fmt/398</Pronoms>
	<ConvertTo>fmt/14</ConvertTo>
	<MergeImages></MergeImages>             <!-- Yes, No -->
</FolderOverride>
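
As an illustration of how such an entry can be read, the sketch below parses the FolderOverride fragment shown above with System.Xml.Linq and splits the comma-separated Pronoms list. It is not the application's own parsing code, and the local function `ReadPronoms` is hypothetical.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper: splits a comma-separated PRONOM list and trims whitespace.
static string[] ReadPronoms(XElement element) =>
    (element.Element("Pronoms")?.Value ?? "")
        .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

XElement folderOverride = XElement.Parse("""
    <FolderOverride>
        <FolderPath>apekatter</FolderPath>
        <Pronoms>fmt/41, fmt/42, fmt/43, fmt/44, x-fmt/398</Pronoms>
        <ConvertTo>fmt/14</ConvertTo>
        <MergeImages></MergeImages>
    </FolderOverride>
    """);

string folder = (string?)folderOverride.Element("FolderPath") ?? "";
string[] pronoms = ReadPronoms(folderOverride);
string target = (string?)folderOverride.Element("ConvertTo") ?? "";

Console.WriteLine($"Folder {folder}: {string.Join(", ", pronoms)} -> {target}");
```

Trimming each entry means that lists written with or without spaces around the commas are read the same way in this sketch.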

Currently supported file formats

For a more extensive PRONOM-based overview you can click on the following image to be taken to a codesandbox code snippet:

The code snippet is based on data from the following JSON file: Supported Conversions.
For a more extensive overview for each external converter see the following TXT file: Supported Conversions per converter.

Documentation and logging

The .txt log files are automatically generated each time the program is run and use the following convention:

Type | (Error) Message | Format | Filetype | Filename

All log files can be found in the logs folder.

Additionally, a documentation.json file is created which lists all files and their respective data.

{
  "Metadata": {
    "requester": "Name",
    "converter": "Name",
    "hashing": "SHA256"
  },
  "Files": [
    {
      "Filename": "output\\filename.pdf",
      "OriginalPronom": "fmt/14",
      "OriginalChecksum": "6c6458545d3a41967a5ef2f12b1b03ad6a6409641670f823635cfb766181f636",
      "OriginalSize": 513631,
      "TargetPronom": "fmt/477",
      "NewPronom": "fmt/477",
      "NewChecksum": "b462a8261d26ece8707fac7f6921cc0ddfb352165cb608a38fed92ed044a6a05",
      "NewSize": 519283,
      "Converter": [
        "iText7 8.0.3.0"
      ],
      "IsConverted": true
    }
  ]
}
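
The recorded checksums make it possible to check the integrity of the output files later. The following is a minimal sketch of such a check, not a tool shipped with the application. It assumes that the hashing method is SHA256, that "NewChecksum" is the digest of the file named in "Filename", and that the file names resolve from the current directory; the helper name `Sha256Hex` is hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical helper: lowercase hexadecimal SHA-256 digest of a stream.
static string Sha256Hex(Stream data)
{
    using SHA256 sha = SHA256.Create();
    return Convert.ToHexString(sha.ComputeHash(data)).ToLowerInvariant();
}

// Assumption: the path to documentation.json is passed as the first argument.
string docPath = args.Length > 0 ? args[0] : "documentation.json";

if (File.Exists(docPath))
{
    using JsonDocument doc = JsonDocument.Parse(File.ReadAllText(docPath));
    foreach (JsonElement entry in doc.RootElement.GetProperty("Files").EnumerateArray())
    {
        string fileName = entry.GetProperty("Filename").GetString() ?? "";
        if (!entry.TryGetProperty("NewChecksum", out JsonElement checksum) || !File.Exists(fileName))
        {
            Console.WriteLine($"SKIPPED  {fileName}");
            continue;
        }

        using FileStream stream = File.OpenRead(fileName);
        bool matches = string.Equals(Sha256Hex(stream), checksum.GetString(),
                                     StringComparison.OrdinalIgnoreCase);
        string status = matches ? "OK      " : "MISMATCH";
        Console.WriteLine($"{status} {fileName}");
    }
}
else
{
    Console.WriteLine($"{docPath} not found");
}
```

If MD5 was selected in the settings, the digest would have to be computed with MD5 instead.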

Adding a new converter

All source code for external converters is based on the same parent Converter class, located in \ConversionTools\Converter.cs.

Converter class

    public string Name;
    public string Version;  
    public string NameAndVersion;  
    public Dictionary<string, List<string>>? SupportedConversions;  
    public List<string> SupportedOperatingSystems;  
    public bool DependenciesExists;  
    public Dictionary<string, List<string>> BlockingConversions;  

    public virtual Dictionary<string, List<string>>? getListOfSupportedConversions(){ } 
    public virtual Dictionary<string, List<string>> GetListOfBlockingConversions(){ }  
    public virtual void SetNameAndVersion(){ }  
    public virtual void GetVersion(){ }  
    async public virtual Task ConvertFile(FileToConvert file, string pronom){ }  
    public virtual void CombineFiles(List<FileInfo2> files, string pronom){ }

All fields shown in the code block above must be included in the subclass for the new external converter to work properly. If you are adding a library-based converter we would suggest having a look at iText7.cs for examples on how to structure the subclass. For external converters where you want to parse arguments and use an executable in CLI we would suggest looking at GhostScript.cs.
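
To give a rough idea of the shape of such a subclass, here is a minimal sketch. It is not taken from the repository: the `Converter` class below is a reduced stand-in so that the example compiles on its own, `ExampleConverter` and its version string are hypothetical, the PRONOM codes are placeholders, and the reading of SupportedConversions as "source PRONOM to list of target PRONOMs" is an assumption. ConvertFile and CombineFiles are left out because they depend on project types.

```csharp
using System;
using System.Collections.Generic;

// Usage: construct the converter and list what it claims to support.
var example = new ExampleConverter();
foreach (var (source, targets) in example.SupportedConversions!)
    Console.WriteLine($"{example.NameAndVersion}: {source} -> {string.Join(", ", targets)}");

// Reduced stand-in for the project's Converter class (assumption, see above).
public class Converter
{
    public string Name = "";
    public string Version = "";
    public string NameAndVersion = "";
    public Dictionary<string, List<string>>? SupportedConversions;
    public List<string> SupportedOperatingSystems = new();
    public bool DependenciesExists;

    public virtual Dictionary<string, List<string>>? getListOfSupportedConversions() => null;
    public virtual void SetNameAndVersion() { }
    public virtual void GetVersion() { }
}

// Hypothetical converter: name, version and PRONOM codes are placeholders.
public class ExampleConverter : Converter
{
    public ExampleConverter()
    {
        Name = "ExampleConverter";
        SetNameAndVersion();
        SupportedConversions = getListOfSupportedConversions();
        // Names of the .NET PlatformID values for Windows and Linux.
        SupportedOperatingSystems = new List<string> { "Win32NT", "Unix" };
        DependenciesExists = true;
    }

    public override void GetVersion() => Version = "1.0.0";

    public override void SetNameAndVersion()
    {
        GetVersion();
        NameAndVersion = $"{Name} {Version}";
    }

    // Assumed meaning: key = source PRONOM, value = PRONOMs it can be converted to.
    public override Dictionary<string, List<string>>? getListOfSupportedConversions() =>
        new() { ["fmt-code1"] = new List<string> { "fmt-code2" } };
}
```

The existing subclasses in the repository remain the authoritative reference for how the fields are populated.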

Tip

If the converter relies on an executable file, include the file in the .csproj file as follows so that it is copied to the output directory and can be loaded at runtime:

<ItemGroup>
  <None Update="PathToExecutableFile">
     <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
</ItemGroup>

This will make the executable file available at the path file-converter\bin\Debug\net8.0\PathToExecutableFile.

To add the converter to the list of converters, add the line converters.Add(new NameOfConverter()); in the AddConverter class. Assuming that the source code written for the converter is correct, and the settings are set correctly, the application should now use the new converter for the conversions it supports.

    public List<Converter> GetConverters(){
        if (converters == null){
            converters = new List<Converter>();
            converters.Add(new iText7());
            converters.Add(new GhostscriptConverter());
            /* Add a new converter here! */
            var currentOS = Environment.OSVersion.Platform.ToString();
            // Drop converters that do not support the current OS or lack their dependencies
            converters.RemoveAll(c => c.SupportedOperatingSystems == null ||
                                      !c.SupportedOperatingSystems.Contains(currentOS) ||
                                      !c.DependenciesExists);
        }
        return converters;
    }
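
Because the filter compares against `Environment.OSVersion.Platform.ToString()`, the entries in SupportedOperatingSystems have to be names of the .NET PlatformID enum, such as "Win32NT" on Windows and "Unix" on Linux. The sketch below replays the same filter on stand-in tuples rather than real Converter objects; all converter names in it are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// ToString() on a PlatformID yields its name, e.g. "Win32NT" or "Unix".
string currentOS = Environment.OSVersion.Platform.ToString();

// Stand-ins for Converter objects: (name, supported OS list, dependencies found).
var candidates = new List<(string Name, List<string>? SupportedOperatingSystems, bool DependenciesExists)>
{
    ("RunsEverywhere",    new List<string> { "Win32NT", "Unix" },  true),
    ("WrongSpelling",     new List<string> { "Windows", "Linux" }, true),  // never matches a PlatformID name
    ("MissingDependency", new List<string> { "Win32NT", "Unix" },  false),
    ("NoOsList",          null,                                    true),
};

// Same predicate as in GetConverters above.
candidates.RemoveAll(c => c.SupportedOperatingSystems == null ||
                          !c.SupportedOperatingSystems.Contains(currentOS) ||
                          !c.DependenciesExists);

Console.WriteLine($"{currentOS}: {string.Join(", ", candidates.ConvertAll(c => c.Name))}");
```

A converter that silently disappears from the list is therefore worth checking for a misspelled OS name or an unset DependenciesExists flag.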

Commenting scheme

All subclasses of Converter follow the same commenting scheme for consistency and ease of maintenance and debugging. The comment should state that the class is a subclass of the Converter class and which conversions it supports. Other functionality of the converter, such as combining images, can be added after.

/// <summary>
/// iText7 is a subclass of the Converter class.                                                     <br></br>
///                                                                                                  <br></br>
/// iText7 supports the following conversions:                                                       <br></br>
/// - Image (jpg, png, gif, tiff, bmp) to PDF 1.0-2.0                                                <br></br>
/// - Image (jpg, png, gif, tiff, bmp) to PDF-A 1A-3B                                                <br></br>
/// - HTML to PDF 1.0-2.0                                                                            <br></br>
/// - PDF 1.0-2.0 to PDF-A 1A-3B                                                                     <br></br>                                                                          
///                                                                                                  <br></br>
/// iText7 can also combine the following file formats into one PDF (1.0-2.0) or PDF-A (1A-3B):      <br></br>
/// - Image (jpg, png, gif, tiff, bmp)                                                               <br></br>
///                                                                                                  <br></br>
/// </summary>

Adding a new conversion path (Multistep conversion)

Multistep conversion means that one can combine the functionality of several converters to convert a file to a file type that would not have been possible if you were using only one of the converters. For example, LibreOffice can convert Word documents to PDF and iText7 can convert PDF documents to PDF-A. Multistep conversion means that the functionalities can be combined so that a Word document can be converted to a PDF-A document.

To add a new multistep conversion you need to add a route in the initMap function in ConversionManager.cs following this convention:

private void initMap(){
    Converter1 converter1 = new Converter1();
    List<string> supportedConversionsConverter1 = new List<string>(converter1.SupportedConversions?.Keys);

    string firstPronom = "fmt-code1";
    string secondPronom = "fmt-code2";
    string targetPronom = "fmt-code";

    foreach (FileInfo file in Managers.FileManager.Instance.files.Values){
        if (ConversionSettings.GetTargetPronom(file) == targetPronom && converter1.PRONOMList.Contains(file.OriginalPronom) && supportedConversionsConverter1.Contains(file.OriginalPronom)){
            ConversionMap.TryAdd(new KeyValuePair<string, string>(file.OriginalPronom, targetPronom), [firstPronom, secondPronom, targetPronom]);
        }
    }
}

The first converter in the path needs a new instance and a list of supported conversions. Then a new if statement can be added to the foreach loop. The list passed as the second argument to ConversionMap.TryAdd works as a route: every PRONOM in it except the last is a stepping stone for the file on its way from its original PRONOM to the target PRONOM. You can add as many stepping stones as you want, but they have to be added in the correct order from left to right.
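
The sketch below illustrates why the order of the stepping stones matters. It is not how ConversionManager executes a route: the converters are stand-in dictionaries, `PlanSteps` is a hypothetical helper, and all PRONOM codes are placeholders. Each hop in the route must be supported by some converter, starting from the file's original PRONOM.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: resolves each hop of a route to a converter that supports it.
static List<string> PlanSteps(string original, List<string> route,
    Dictionary<string, Dictionary<string, List<string>>> available)
{
    var plan = new List<string>();
    string current = original;
    foreach (string next in route)
    {
        // Pick the first converter that supports current -> next.
        string? name = available
            .Where(c => c.Value.TryGetValue(current, out var targets) && targets.Contains(next))
            .Select(c => c.Key)
            .FirstOrDefault();
        if (name is null)
            throw new InvalidOperationException($"No converter for {current} -> {next}");
        plan.Add($"{name}: {current} -> {next}");
        current = next;
    }
    return plan;
}

// Stand-in converters: name -> (source PRONOM -> possible target PRONOMs).
var standIns = new Dictionary<string, Dictionary<string, List<string>>>
{
    ["Converter1"] = new() { ["fmt-original"] = new() { "fmt-code1" } },
    ["Converter2"] = new() { ["fmt-code1"] = new() { "fmt-target" } },
};

// Route keyed by (original PRONOM, target PRONOM), mirroring ConversionMap.
var conversionMap = new Dictionary<KeyValuePair<string, string>, List<string>>();
conversionMap.TryAdd(new KeyValuePair<string, string>("fmt-original", "fmt-target"),
                     new List<string> { "fmt-code1", "fmt-target" });

var key = new KeyValuePair<string, string>("fmt-original", "fmt-target");
List<string> steps = PlanSteps(key.Key, conversionMap[key], standIns);
steps.ForEach(Console.WriteLine);
```

Swapping the two route entries makes the first hop unsupported, so `PlanSteps` throws instead of producing a plan.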

Further Development

The PronomHelper.cs class has a static method string PronomToFullName(string pronom) to retrieve the full name of file formats based on data in the British National Archives PRONOM lookup tool. The method was created using a small C++ program. As the British National Archives publishes more PRONOM PUIDs the method must be updated. The program is located here, see the README in the repo for usage.

🌟 Acknowledgments

Our application makes use of several external libraries and software under their respective licenses. For further information, see External libraries and software.

We would like to thank the Innlandet County Archive for giving us such an interesting task for our bachelor thesis. You have provided clear guidelines and invaluable feedback to us in the beta phase of our application.

Our bachelor thesis would also not have been possible without our supervisor Giorgio Trumpy. Thank you for keeping us on track, taking the initiative to connect us with archivists and librarians and delivering meaningful and constructive feedback.

🌍 Contributing

Important

We are currently not open for contributors, due to this being part of a bachelor thesis.
Hopefully, we will be able to open up for contributors after the thesis has been approved.

Contributors

This project exists thanks to these wonderful people:

Bachelor students from NTNU:

Aleksander Solhaug
Philip Alexander Sundt
Lars Martin Haugland
Aurora Skomsvold

📄 Licensing

This project is licensed under the GNU Affero General Public License v3.0, as listed on https://spdx.org/licenses/

¹ Kultur- og likestillingsdepartementet. Lov om arkiv [arkivlova]. URL: https://lovdata.no/dokument/NL/lov/1992-12-04-126?q=arkivloven (visited on 17th Jan. 2024).
² Homebrew on Linux. URL: https://docs.brew.sh/Homebrew-on-Linux (visited on 3rd Mar. 2024).
