
Text Files Do Not Exist

Discussing two misleading terms of software engineering: text files and binary files.

It's a bitter truth for every coder, beginner or experienced. We all know that everything stored in memory of any kind is physically a sequence of bytes: byte sequences that mean nothing on their own and acquire meaning only through the code programmers write to interpret them.

Thus, at a low level there is no difference between so-called text files and binary files, and I believe we should stop drawing this distinction at higher abstraction levels in our code as well.

How Do We Say We Work With Text Files?

It differs from language to language. C and C++ let you write fopen("text.txt", "rt") to open a file in "text mode" (the "t" flag is an extension offered by some C runtimes, such as Microsoft's; standard C only distinguishes "r" from "rb"). Python offers something very similar, open("test.txt", "rt"), and even C# has an explicit File.OpenText(...) method alongside File.Open(...). All of these were among the top 10 most used programming languages in 2022.

Thankfully, many other programming languages, such as Java, Ruby, and JavaScript (in a Node.js environment), decouple file APIs from the APIs that work with file contents, so when opening, reading, or writing files we do not explicitly say whether we're working with a text or a binary file.
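As a minimal sketch of what "text mode" adds in Python, the snippet below writes some bytes to a temporary file and reads them back in both modes. The file name and sample string are made up for illustration, and the encoding is passed explicitly because the default depends on the platform.

```python
import os
import tempfile

# Three characters stored as raw bytes in the UTF-8 encoding.
payload = "añ說".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    # "Binary mode": the file API hands back the stored bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()

    # "Text mode": the very same bytes, plus an implicit decoding step.
    with open(path, "rt", encoding="utf-8", newline="") as f:
        text = f.read()

print(len(raw), len(text))          # 6 bytes on disk, 3 characters after decoding
print(raw.decode("utf-8") == text)  # True: text mode is bytes + a decoder
```

The file on disk is identical in both cases; only the interpretation applied while reading differs.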

We Work In Unicode But Still Think In ASCII

Well, hopefully not all of us, but I see that it is still a problem for many programmers. We still divide files into text files and binary files, and some programming languages still reflect this division at the language level, or at least at the standard library level as part of the public file API.

I'm quite serious about this problem. Just Google "difference between text and binary files" and you will find not only misleading "differences" created out of thin air but also claims that are simply not true (to be fair, they may have been true in the 1970s-1980s):

  • "In a text file, text, character, and numbers are stored one character per byte" - but how do I store the traditional Chinese character '說' in one byte if one byte can take only 256 distinct values and traditional Chinese uses thousands of characters?
  • "text files are more user-friendly" - but who or what is the user? What quality of a file does the user consider friendly? I would say a game developer would find a compact binary format for storing game levels very friendly because it significantly speeds up game loading.
  • "text files store information in a human-readable format" - even this claim, obvious at first sight, is not quite true, because it all depends on the viewer we use: a good hex viewer can show us information in a much more human-readable form than a text viewer that opened UTF-8 encoded text in UTF-16 mode. And is a Base64-encoded image in a human-readable format (while it is definitely text data)?
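The counterexamples above are easy to check. The following is a small Python sketch; the sample strings and the four "image" bytes (the start of the PNG signature) are chosen only for illustration.

```python
import base64

ch = "說"  # U+8AAA

# One character, more than one byte, and a different count per encoding.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-le")
print(len(utf8), len(utf16))  # 3 2

# The same six bytes seen through three different "viewers".
data = "hello!".encode("utf-8")
as_utf8 = data.decode("utf-8")       # the intended text
as_utf16 = data.decode("utf-16-le")  # decodes without error, but to unrelated characters
hex_view = data.hex(" ")             # what a hex viewer would show (Python 3.8+)
print(as_utf8, as_utf16, hex_view)

# Base64 turns arbitrary bytes into pure ASCII text that no human can "read".
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # first bytes of the PNG signature
b64 = base64.b64encode(image_bytes).decode("ascii")
print(b64)
```

Here the hex view is arguably more readable than the UTF-16 misreading, and the Base64 string is valid text that carries no readable meaning.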

The root cause is obviously historical. We perceive text files as something different because we work with them much more often than with audio, image, video, and other kinds of files, which usually belong to narrower domains.

So What's The Point?

The point is that we must stop thinking of text and binary files as technically different concepts at the language level, and start thinking and writing about text as just another kind of information that can be encoded in many different ways (there are a lot of text encodings), just as images are encoded in TGA, JPEG, PNG, BMP, and many other formats, or audio is encoded in MP3, FLAC, ALAC, etc.

Thus we are not actually talking about text files and binary files; we are talking about data that goes through different encoding stages but is not a different kind (nature) of data.

That requires keeping the reading/writing logic (usually some form of Stream or Reader/Writer objects) decoupled from the file API itself, and thankfully this is the approach almost all programming languages follow.

And the main thing: I propose that we stop using text files and binary files as terms in professional discussions. I truly believe we have to change how we think about files and start talking about files and the kind of data stored in them, thus mentally decoupling these terms:
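One way to picture this decoupling is sketched below in Python. The helper names (open_bytes, read_text, read_level) and the record layout are hypothetical, and an in-memory io.BytesIO stands in for a real file so the example stays self-contained.

```python
import io
import struct

def open_bytes(data: bytes) -> io.BytesIO:
    """Stand-in for a file API: it knows about bytes and nothing else."""
    return io.BytesIO(data)

def read_text(stream, encoding="utf-8"):
    """Reader 1: treat the whole byte stream as text in a single encoding."""
    return io.TextIOWrapper(stream, encoding=encoding).read()

def read_level(stream):
    """Reader 2: treat the stream as a structured record.

    Hypothetical layout: little-endian uint16 id, uint8 name length,
    then the name itself as UTF-8 bytes.
    """
    level_id, name_len = struct.unpack("<HB", stream.read(3))
    name = stream.read(name_len).decode("utf-8")
    return level_id, name

text_bytes = "clarity wins".encode("utf-8")
level_bytes = struct.pack("<HB", 7, 4) + b"Cave"

# The same "file" layer serves both; only the reader on top differs.
print(read_text(open_bytes(text_bytes)))
print(read_level(open_bytes(level_bytes)))
```

Note that the structured record still contains text (the level name), which is one reason a text/binary split at the file level says little about the data itself.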

  • Files - without the division into text and binary files (IMHO, the term binary file is as much nonsense as text file, since it describes how the underlying hardware works rather than some quality of the data storage or format as such)
  • Uniformly Encoded Text Data (or simpler: Text Data)
  • Non-Uniformly Encoded Structured Data (or simpler: Structured Data)

This small (even insignificant at first sight) change can:

  • eliminate ambiguities and make understanding the concept of files simpler and clearer for newbies
  • foster a better culture of designing APIs that rely on the concept of files

In the end, while these terms may seem more complex than what we use now, clarity wins. 
