PDS4 Line Feed Changes

The Planetary Data System (PDS) is changing its standards in a way that may affect users of PDS data, especially those who have written custom scripts or software to read that data.

This change is scheduled to go into effect in June 2021 with Information Model (IM) version 1.16.0.0.

What is changing?

Current standards require that Text Stream (such as ASCII text), Character Table (such as fixed-width tables), and Delimited Table (such as CSV) files follow the Windows convention, in which every line ends with both the Carriage Return (CR) and Line Feed (LF) characters. This change will also allow archived files in which every line ends with a Line Feed (LF) character alone, the convention used by Linux and macOS, in addition to the existing Windows format. Each file must still use its chosen line ending consistently throughout the file.
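
As a rough illustration of the consistency requirement, the sketch below classifies the line endings found in a file's raw bytes. The function name classify_line_endings and its return labels are hypothetical, chosen for this example only; this is not an official PDS validation tool.

```python
def classify_line_endings(data: bytes) -> str:
    """Return 'CRLF', 'LF', 'mixed', or 'none' for the given file contents."""
    crlf_count = data.count(b"\r\n")
    # Every CR+LF pair also contains an LF, so subtract to count bare LFs.
    bare_lf_count = data.count(b"\n") - crlf_count
    if crlf_count and bare_lf_count:
        return "mixed"
    if crlf_count:
        return "CRLF"
    if bare_lf_count:
        return "LF"
    return "none"
```

A file reported as "mixed" would not meet the requirement that one line ending be used consistently.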

What are line feed characters?

In digital files, special invisible "control characters" indicate where a line ends. These may be used in a text document, table, image, or any other format. On Windows and many older computing systems, a line ending is two adjacent characters: "Carriage Return" (CR, hex 0x0D, "\r") followed by "Line Feed" (LF, hex 0x0A, "\n"). On Unix-based systems such as macOS and Linux, a line ending is typically a single Line Feed character instead. Further background can be found here.
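
To make the difference concrete, this small sketch prints the bytes of the same three-character line in each format. It assumes Python 3.8 or later, where bytes.hex() accepts a separator argument.

```python
# The same three-character record with each style of line ending.
windows_line = "ABC\r\n".encode("ascii")
unix_line = "ABC\n".encode("ascii")

print(windows_line.hex(" "))  # 41 42 43 0d 0a
print(unix_line.hex(" "))     # 41 42 43 0a

# The Windows-format line is one byte longer than the Unix-format line.
print(len(windows_line), len(unix_line))  # 5 4
```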

How will this affect me?

If you use your own custom software or scripts to parse and use data from the PDS, that code may need to be updated to support this change.

If you use software distributed by the PDS on our archived data, you will need to download the latest versions once they become available after this change goes into effect in June 2021.

What might I need to change in my code?

Maybe nothing! If you use widely-distributed image/text/file parsing libraries in your code, many of them already know how to handle variable format line endings.

In many programming languages (such as Python), these characters are written in code as "\n" (for LF) and "\r" (for CR). In those languages, you would need to change any code that expects "\r\n" so that it also accepts "\n" alone.

If your code makes assumptions about the byte length of a line (for example, in a fixed-width table), files in the new format may not be read properly, because each line that ends in LF alone is one byte shorter than the same line ending in CR+LF.

Don't do this:

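# newline='' turns off newline translation, so CR and LF are read as-is
f = open(filename, 'r', newline='')
lines = []
while True:
    line = f.read(20) # Read twenty characters
    if not line:
        break
    lines.append(line)

    # Skip CR+LF characters (this breaks on files that end lines with LF alone)
    f.read(2)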

Instead do this:

lines = []
with open(filename, 'r') as f:
    for line in f:
        # Each line ends with either CR+LF or LF; remove those characters.
        # Iterating over the file stops at end of file, not at a blank line.
        lines.append(line.rstrip('\r\n'))
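
If you do need to read by byte count, one possible approach is to read the fixed-width part of each record and then check which delimiter follows it. The sketch below is only an illustration: the function name read_fixed_width_records and the field_bytes parameter (the record length excluding the delimiter) are hypothetical, and it assumes ASCII data.

```python
def read_fixed_width_records(path, field_bytes):
    """Read fixed-width records whose delimiter is CR+LF or LF."""
    records = []
    with open(path, "rb") as f:  # binary mode: no newline translation
        while True:
            record = f.read(field_bytes)
            if not record:
                break
            records.append(record.decode("ascii"))
            # Read the delimiter: one byte for LF, two bytes for CR+LF.
            delimiter = f.read(1)
            if delimiter == b"\r":
                delimiter += f.read(1)
            # An empty delimiter is tolerated for a final unterminated record.
            if delimiter not in (b"\r\n", b"\n", b""):
                raise ValueError(f"unexpected record delimiter: {delimiter!r}")
    return records
```

Checking the delimiter explicitly means a file with an unexpected layout raises an error instead of silently shifting every later record.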

If your code splits lines by expecting a CR+LF line ending, modify it to accept either CR+LF or LF. Switching to expect LF alone is not enough: in CR+LF files, every line would keep a trailing CR, which could cause errors. You must handle both cases.
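
One way to accept both endings when splitting manually is a regular expression with an optional CR. This is a minimal sketch; split_records is a hypothetical helper name, and it assumes the text was read without newline translation so that any CR characters are still present.

```python
import re

def split_records(text):
    """Split text on CR+LF or LF, dropping the delimiter from each line."""
    lines = re.split(r"\r?\n", text)
    # A trailing delimiter leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Blank lines inside the file are preserved as empty strings, so line numbering stays aligned with the original file.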

Alternatively, just use built-in methods of splitting lines:

Don't do this:

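# newline='' turns off newline translation, so CR and LF are read as-is
f = open(filename, 'r', newline='')
lines = f.read().split('\r\n')  # Finds no line breaks in an LF-only file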

Instead do this:

with open(filename, 'r') as f:
    # Read the entire file into a list of lines (without line-ending characters)
    lines = f.read().splitlines()

Some examples of methods you can use to handle this automatically:

Python

To read files, use the open function, which by default recognizes either form of newline and translates it to "\n". From there, iterate over the file object or call readline().
If you already have the file data in a string, you can use splitlines() instead.
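
As a quick check of the behavior described above, the sketch below writes the same two records with each line-ending style to temporary files and reads them back with open in its default text mode. The helper name read_lines and the sample records are made up for this illustration.

```python
import os
import tempfile

def read_lines(path):
    """Read a text file written with either CR+LF or LF line endings."""
    # The default newline=None translates "\r\n" to "\n" while reading.
    with open(path, "r", encoding="ascii") as f:
        return [line.rstrip("\n") for line in f]

# Write the same two records with each line-ending style and compare.
results = []
for ending in (b"\r\n", b"\n"):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"first" + ending + b"second" + ending)
    results.append(read_lines(tmp.name))
    os.remove(tmp.name)

print(results[0] == results[1])  # True
```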

Java

Java's standard library includes the readLine method of BufferedReader, which reads a file line by line and accepts either line ending.
In Java 8 and later, you can use Files.lines as well.
If you're using Apache Commons IO, you may wish to use FileUtils.readLines().

What do I not need to worry about?

Data that has already been archived under the existing PDS3 or PDS4 standards will not change, and can continue to be used with the tools available today. This change will only affect some data files archived under future model versions (date TBD).

Software written and distributed by the PDS will be updated to work with this new format at the same time, so you will not need to modify that software. You only need to use the latest versions.

Data providers already supplying data using Windows-format line endings will not need to change anything. The record delimiter type has always been specified in the accompanying labels, and will continue to be so.
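
Because the delimiter is declared in the label, a reader can look it up rather than guess. The sketch below is an assumption-laden illustration, not an official PDS tool: it assumes the label is XML, that the delimiter appears in an element named record_delimiter, and that its values are "Carriage-Return Line-Feed" and "Line-Feed". Check these names against the Information Model version you are using. The function name record_delimiters is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from label values to delimiter bytes.
DELIMITER_BYTES = {
    "Carriage-Return Line-Feed": b"\r\n",
    "Line-Feed": b"\n",
}

def record_delimiters(label_xml):
    """Return the delimiter bytes for each record_delimiter element in a label."""
    root = ET.fromstring(label_xml)
    found = []
    for element in root.iter():
        # Compare the local name only, so any XML namespace is ignored.
        if element.tag.rsplit("}", 1)[-1] == "record_delimiter":
            # An unrecognized value raises KeyError instead of being guessed.
            found.append(DELIMITER_BYTES[(element.text or "").strip()])
    return found
```

The returned bytes can then be used to decide how many delimiter bytes follow each record when reading the data file.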

I am a data provider. How will this affect me?

This is merely an addition to what our standards allow. Any existing data that you provide will continue to be archivable. Going forward, though, you will have more flexibility in the types of files you can archive, and you may need less work to make your data archive-ready.

Who can I contact for feedback or questions?

Please contact us at sbn@psi.edu if you have any questions or need assistance adjusting your workflow with this change.