ReliableTXT Specification

Preface

Reading an arbitrary text file without knowing how it was written is an ambiguous process. If the exact encoding scheme and line break convention used to write the file are unknown, assumptions have to be made while reading it. These assumptions introduce the risk that the file is read incorrectly and that its data is therefore misinterpreted.

Text files that use an encoding without a defined preamble, or that omit the preamble, are especially problematic to read. In such cases an algorithm has to scan the data and guess which encoding might have been used. Encodings whose preambles are not clearly distinguishable cause the same problem: an algorithm again has to try to determine the encoding.

Another source of uncertainty is the line break convention. Several single characters and character sequences exist that indicate a line break or end of line. Sometimes multiple line break conventions are even mixed within a single text file.

The Reliable Text File Format addresses these problems and defines rules that allow Reliable Text Files to be read without uncertainty or ambiguity.

Overview

A Reliable Text File always starts with a few bytes, called the preamble, that indicate which encoding was used to encode the characters of a text into raw bytes.

After the preamble has been read, the encoding is known, and the raw bytes that follow can be decoded into an array of characters. In the next step, this array of characters is split into lines at each occurrence of the Reliable Text File line break character.

Encoding

A Reliable Text File can only use one of the following four Unicode encodings:

UTF-8
UTF-16 (Big Endian)
UTF-16 Reverse (Little Endian)
UTF-32 (Big Endian)

The preamble must be written for every encoding, leading to the following possible preamble bytes:

UTF-8: EF BB BF
UTF-16: FE FF
UTF-16 Reverse: FF FE
UTF-32: 00 00 FE FF

UTF-8 without a byte order mark (BOM) is therefore not supported. A file written that way is not a valid Reliable Text File and would first need to be imported.
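The rule that the preamble is always written can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name `encode_reliable_txt` and the use of Python codec names as dictionary keys are assumptions made for this example, and the preamble bytes are the standard byte order marks of the four encodings.

```python
# Preamble bytes for the four encodings allowed by ReliableTXT.
# The keys are Python codec names, chosen for this sketch only.
PREAMBLES = {
    "utf-8": b"\xef\xbb\xbf",
    "utf-16-be": b"\xfe\xff",           # UTF-16
    "utf-16-le": b"\xff\xfe",           # UTF-16 Reverse
    "utf-32-be": b"\x00\x00\xfe\xff",   # UTF-32
}


def encode_reliable_txt(text: str, codec: str = "utf-8") -> bytes:
    """Return the preamble followed by the encoded text (hypothetical helper)."""
    if codec not in PREAMBLES:
        raise ValueError(f"encoding not allowed in ReliableTXT: {codec}")
    # The "-be"/"-le" codecs never emit a BOM themselves, so the
    # preamble is prepended explicitly. errors="strict" makes lone
    # surrogates in `text` raise UnicodeEncodeError instead of being replaced.
    return PREAMBLES[codec] + text.encode(codec, errors="strict")
```

Calling `encode_reliable_txt("A", "utf-16-le")` yields the bytes FF FE 41 00, while an unsupported codec such as `"utf-32-le"` is rejected with a `ValueError`.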

The Reliable Text File Format does not prohibit the use of null characters. For this reason, the UTF-32 little endian encoding is not supported either: its preamble (FF FE 00 00) could be misinterpreted as the UTF-16 little endian preamble (FF FE) followed by a null character as the first character. To resolve this and add the UTF-32 Little Endian encoding to a future Reliable Text File version, a clearly distinguishable preamble would need to be defined.
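The ambiguity described above becomes visible in a small preamble detector. The sketch below assumes a hypothetical function `detect_encoding` that returns a Python codec name together with the preamble length; neither name is part of the specification.

```python
from typing import Tuple


def detect_encoding(data: bytes) -> Tuple[str, int]:
    """Return (Python codec name, preamble length) for a ReliableTXT file.

    Raises ValueError when none of the four allowed preambles is present.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    if data.startswith(b"\xff\xfe"):
        # This branch also matches FF FE 00 00, the UTF-32 little endian
        # BOM. Because null characters are allowed, those bytes are a
        # valid UTF-16 Reverse file whose first character is U+0000.
        return "utf-16-le", 2
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    raise ValueError("no ReliableTXT preamble found")


# The letter "A" written as UTF-32 little endian with its BOM:
utf32_le_file = b"\xff\xfe\x00\x00" + b"A\x00\x00\x00"
codec, length = detect_encoding(utf32_le_file)
text = utf32_le_file[length:].decode(codec)
```

Here `codec` is `"utf-16-le"` and `text` is the three-character string `"\x00A\x00"`, which is exactly the misreading the specification avoids by excluding UTF-32 little endian.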

When the encoded text contains invalid bytes, unpaired surrogates, or invalid code points, an error message must be shown. Such characters must not be silently ignored or replaced.
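One way to meet this requirement, sketched here under the assumption that the preamble has already been stripped from `payload`, is to decode in strict mode and turn any decoding failure into an error. The helper name `decode_strict` is hypothetical; how the error is finally shown to the user is left to the calling program.

```python
def decode_strict(payload: bytes, codec: str) -> str:
    """Decode the bytes that follow the preamble, failing on any invalid data.

    `payload` must not include the preamble; `codec` is a Python codec
    name such as "utf-8" or "utf-16-be".
    """
    try:
        # errors="strict" raises instead of dropping or replacing bad input,
        # unlike errors="ignore" or errors="replace".
        return payload.decode(codec, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"invalid {codec} data at byte offset {exc.start}: {exc.reason}"
        ) from exc
```

With Python's built-in codecs, this rejects a stray byte such as `b"\xff"` in UTF-8, a high surrogate that is not followed by a low surrogate in UTF-16, and a UTF-32 value above U+10FFFF.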

ReliableTXT does not use the terms big endian and little endian. Instead, the word reverse is used to distinguish the byte order. Because people who read from left to right read numbers in big endian order, the term reverse indicates little endianness.

Line Breaks

In a Reliable Text File two lines are separated by a single line feed character (U+000A or '\n' in C). The following syntax diagram illustrates this:

To determine the number of lines, the number of occurring line feed characters needs to be counted and increased by one, as shown in the following formula:

NumLines = count('\n') + 1

Other line break characters, such as Line Tabulation (U+000B), Form Feed (U+000C), Carriage Return (U+000D), Next Line (U+0085), Line Separator (U+2028), and Paragraph Separator (U+2029), are not treated as line breaks and are considered normal characters of a line. This is a deliberate deviation from the Unicode Line Breaking Algorithm, intended to keep derivative plain-text file formats simple.

A trailing line feed character at the end of a text file leads to an empty last line.
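The line rules of this section can be illustrated with a short sketch. The function name `split_lines` is an assumption made for this example. Note that Python's `str.splitlines()` is not suitable here, because it also splits on carriage returns and other Unicode line break characters and drops the empty line after a trailing line feed.

```python
def split_lines(text: str) -> list:
    """Split decoded text into ReliableTXT lines.

    Only the line feed (U+000A) separates lines. Every other character,
    including carriage return, stays part of the line it appears in.
    """
    return text.split("\n")


lines = split_lines("first\nsecond\r\n")
```

In this example `lines` is `["first", "second\r", ""]`: the carriage return remains in the second line, and the trailing line feed produces an empty last line. The result always has `text.count("\n") + 1` entries, so an empty text yields one empty line.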

Characters

ReliableTXT documents are based on Unicode. A character in ReliableTXT is a Unicode scalar value, so valid code points lie in the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The high and low surrogate code points in the range U+D800 to U+DFFF are not allowed.
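The two ranges translate directly into a check on code points. The following sketch uses the hypothetical helper names `is_scalar_value` and `is_valid_text`; it is useful in languages such as Python, where an in-memory string may contain lone surrogates.

```python
def is_scalar_value(code_point: int) -> bool:
    """True if the code point is a Unicode scalar value (no surrogates)."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def is_valid_text(text: str) -> bool:
    """True if every character of the string is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)
```

For instance, `is_valid_text("a\x00b")` is true because the null character is allowed, whereas `is_valid_text("a\ud800")` is false because U+D800 is a high surrogate.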

Comparison to POSIX

Text files according to POSIX use the line feed character to terminate a line, so the last line in a file also ends with a line feed character. Otherwise that line is considered an incomplete line. In ReliableTXT files the line feed character separates two lines rather than terminating one. According to the POSIX specification, a POSIX text file can contain no lines at all, whereas a ReliableTXT file by definition always has at least one line.

Another difference is that lines in a POSIX text file are limited to a certain size, expressed in bytes (LINE_MAX), and are not allowed to contain null characters. A line in a ReliableTXT file has no length limit and can contain null characters.
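The difference between terminating and separating lines shows up when the lines of the same text are counted under both conventions. This is a simplified sketch with hypothetical function names. The POSIX count covers complete lines only, meaning those terminated by a line feed, and ignores the LINE_MAX and null character restrictions.

```python
def reliabletxt_line_count(text: str) -> int:
    """Line feeds separate lines, so there is always at least one line."""
    return text.count("\n") + 1


def posix_complete_line_count(text: str) -> int:
    """Line feeds terminate lines; text after the last one is incomplete."""
    return text.count("\n")


for sample in ("", "a\nb", "a\nb\n"):
    print(repr(sample), posix_complete_line_count(sample),
          reliabletxt_line_count(sample))
```

For the text `"a\nb\n"` the POSIX view sees two lines, while ReliableTXT sees three, the last of which is empty. For the empty text the counts are zero and one respectively.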

Conclusion

The Reliable Text File Format offers a clean and unambiguous way of writing and reading text files. With a defined encoding scheme and a fixed line break convention, it is always clear how the text file bytes should be written or read without ambiguities and misinterpretations. Programs that work with Reliable Text Files can fully rely on the automatic and deterministic handling of the encoding and decoding process.

Textual file formats that are based on the Reliable Text File Format therefore do not need to declare their encoding again at the content level. This is in contrast to XML or HTML, where such a declaration is required in some cases, for example:

<?xml version="1.0" encoding="utf-8"?>
