Sep 07, 2021 (by Jesse Stone)
Part of a series on Archiving Tips
1. Why PDF/A
Imagine that you finally get a PDF copy of a paper that you need. When you open it, you find that it’s password protected. Now, you have to hunt down the password, either by finding it on another website, or contacting the author or the journal. Once you open it, the paper has images that are missing, because the PDF needs to download them, but you’re on a plane, and can’t access the internet. On top of that, while you are reading the document, a video is playing on the bottom of the page, making it hard to focus on the text. Shouldn’t PDF documents be more like a printed page? Shouldn’t they be complete at the time you download them? These are some of the problems that PDF/A tries to address.
The PDS4 Requirement for PDF/A
PDS is dedicated to archiving data for the long term, not only for today’s users, but for future generations. This means that data in the PDS should stand on its own, without the need for support websites, or even the internet itself. This means that we need documents that stand on their own, as well, that will always be viewable, and that don’t require special software to understand. PDF/A, which is designed for archiving printed documents, is well aligned with these goals, since they are also designed to stand on their own, without the need for services that could disappear at any time.
2. PDF/A vs PDF
What does PDF/A guarantee?
- Color management
- Embedded fonts
What does PDF/A leave out?
Luckily the features that PDF/A leaves out doesn’t generally affect documents that are designed to be printed.
- External references (downloadable fonts, images, etc)
- Embedded files
The Dreaded Font Copyrights
- This is a surprising issue to many people. Nobody thinks about the copyright status of their fonts. PDF/A does. The fonts that you use for your document must be legally embeddable. Microsoft states that its software will only allow you to embed fonts that are legal to be embedded (https://docs.microsoft.com/en-us/typography/fonts/font-faq). As a general rule, the fonts that came with the software will be okay to embed; however, if you want full control, you can use fonts that are freely licensed, such as the Google Fonts, or the Liberation Fonts.
PDF/A comes in many varieties, referred to as profiles. Profiles are the specific sets of validation rules that each document conforms to. The profiles that the PDS uses are PDF/A 1-A and PDF/A 1-B.
PDF/A 1-A and 1-B
PDFA 1-A and 1-B are the earliest versions of the PDF/A standard that were drafted. They are also the simplest, and suitable to archiving most printable content. These are also the only files that are accepted by the PDS. 1-A is more restrictive than 1-B, but all 1-A files are also valid 1-B files.
1-A documents are technically similar to 1-B, but they have additional accessibility and semantic features.
There are other versions of PDF/A out there that add new features, but are not backwards-compatible with PDF/A 1-A. For instance, a PDF/A 2-A document may not be a valid PDF/A 1-A or 1-B document. Unfortunately, due to the additional features that these documents add, it may not even be possible to convert the documents down to 1-A or 1-B. These are not currently accepted by the PDS.
3. How to create a PDF/A
Creating a PDF/A from scratch
The Windows version of Microsoft Office supports PDF-A 1-B export. Choose
File -> Export -> Create PDF/XPS. Check the “ISO 19005-1 Compliant (PDF/A)” checkbox and click OK.
LibreOffice is a free and open source replacement for Microsoft Office. It is available on all major desktop platforms. To export a PDF/A, select
File -> Export... -> Export as PDF.... You will then see a menu with several PDF export options. Make sure that “Archive (PDF/A, ISO 19005)” is checked, and select PDF/A 1-b under “PDF/A version”. Then click “Export” and you are done!
Abobe Acrobat supports 1-A and 1-B export. There are multiple ways to do this in Acrobat. See (https://helpx.adobe.com/acrobat/using/pdf-x-pdf-a-pdf.html) for more information.
Converting a PDF/A
If you don’t have a source document, and only have a PDF document to work with, you may still be able to convert the PDF to PDF/A. Several programs are available that will help with this. However, no matter which one you choose, make sure that you visually inspect your document, to make sure that it’s formatting hasn’t changed out from under you. PDF is difficult to read in and convert faithfully (which is why we are doing this in the first place).
Adobe Acrobat is probably your best choice for converting PDF/A documents. In Adobe Acrobat DC, you can choose
File -> Save as Other -> Archivable PDF (PDF/A). Adobe supports several other programs, and there are other ways to do this that give you more fine-grained control. See (https://helpx.adobe.com/acrobat/using/pdf-x-pdf-a-pdf.html) for more information.
Libre Office can import PDF documents and re-export them as PDF/A, as described above. However, you’ll need to spend some time cleaning up the document. You may run into font issues, or your paragraphs may be out of order.
If you are good with the command line, GhostScript will generally do a good job of converting PDF documents to PDF/A. However, you will need to define several Postscript commands to help with this, and find a color profile that is suitable. Check (https://github.com/sbn-psi/data-provider-tools/tree/master/pdf2pdfa) for a good place to start. The greatest advantage of this method is that it can be automated as part of a document pipeline, and it can also convert a large group of files at once.
Online converters are an easy way to convert small batches of PDF documents. One converter that seems relatively easy to use is PDFTron (https://www.pdftron.com/pdf-tools/pdfa-converter/). Just drag your file onto the page, then select the output profile that you want, and click “Convert”. The biggest issue with online converters is that they are more difficult to work with for large batches of files, and you are giving a copy of your file to a third party. Make sure that you are comfortable with this before uploading it.
What Won’t Work
Many applications don’t support PDF/A, but in this section, we’ll cover applications that seem to support PDF/A, but don’t actually work for some reason.
MacOS Big Sur has a PDF/A converter in the preview app, but it currently uses a profile that PDS can’t accept. There is a chance that it will be eventually updated to support 1-A or 1-B (or the PDS may start accepting other profiles), but for now, it can’t be used.
4. How to validate a PDF/A file
VeraPDF is a PDF/A validator that supports every profile level of PDF/A, and has both a usable GUI and command line interface. This is the best bet, if your system runs Java.
PDFBox is another strong candidate, if you have a system that supports Java. However, it is a code library, and not a direct tool, so it is mostly useful if you want to include PDF/A validation in an application.
Other PDF/A validators are available, such as the 3-Heights validator, or the Solid Documents validator. However, they are not free so we have not evaluated them. But they do exist.
The online validator that I personally have had the most success with is the Solid Documents PDF/A validator, at (http://www.validatepdfa.com/). This works by emailing them your document, and they will email you back a validation report. This takes some time, however, and it can be hard to keep track of multiple versions of the same document, so it is probably only suitable for small batches.
PDF/A Documents are a great format for long-term, archiving, since they are designed as standalone files, and remove obstacles to usabilty. They are so well suited that they are the standard document archivng format for PDS. There are many ways to create a PDF/A document, either from scratch or by converting an existing PDF document. Once a PDF/A document is created, it will have to be validated to ensure conformance. This process will ensure that your documents are ready to be archived in the PDS.