The Ultimate Guide to CSV Files

author-avatar

What exactly is a CSV file?

Chances are you have at some point worked with a CSV file, or at least, a file that began as a CSV. By the time you received the information, it might have been presented in an Excel or Google Sheets format, but CSV files are far too common for anyone not to have been exposed to one. They are by far one of the most commonly used file formats to transfer business, scientific, engineering, and consumer data. It is also widely used in health, manufacturing, finance and other things. But what exactly is it, and why is it the de facto standard for tabular information transfer, even today?

CSV is designed to allow tabular data to be presented in a delimited text format, readable by any text editor. It delineates the data by separating the data with commas, hence the acronym, CSV, or comma-separated value. In a CSV file, each delimited line is a data record, and each record has one or more fields, each separated by a comma.

History of the CSV Format

The CSV format has been around longer than the personal computer. It was first used by the IBM Fortran compiler in 1972 and further defined over time. Free-form list directed input-output was defined in 1978 by Fortran 77. The terms comma-separated value and CSV were in use by 1983, when Osborn Computing Company (OCC) and its SuperCalc spreadsheet defined the quoting convention that allows strings to contain embedded commas. Since then there have been few attempts to standardize the format. It was not until 2005 that serious attempts to standardize the well-known but poorly documented format with the RFC 4180 initiative. It has gone under numerous revisions since, but because of how long it went without regulation, there are still a lot of different versions out there. 

How to Write a Basic CSV file

As stated, CSV is ridiculously easy to learn and use, in its most simple format. In its simplest format, it is written simply as lines of entries, separated by commas. A top line establishes headers, and each subsequent line matches records and fields to the top header line.

For example, the below lines of text if saved as .csv file would be read in your spreadsheet as:

Email Address, Name, Postal Code

1111@email.com, Prime Directive, 10111

2222@email.com, Dos Equis, 20222

3333@email.com, Tres Chic, 90210 

Email Address Name Postal Code
1111@email.com Prime Directive 10111
2222@email.com Dos Equis 20222
3333@email.com Tres Chic 90210

This is the CSV file structure in its most basic form. Pretty simple, right?

Relevant and Prevalent 

Why is a format created in the early days of computing still relevant and prevalent today? Well, it can be attributed to the simplicity and the transferability of the format. For one thing, it is human-readable, and therefore simple to reproduce, even if you are not a programming expert. They can be opened by any text editor, and by nearly every spreadsheet program since its inception, including the modern ones in use today (Microsoft Excel and Google Sheets). Even software that does not look or feel like a spreadsheet often offers it as an input format to import large amounts of data, such as contact lists and more. (Note, it is these latter applications and the formatting challenges inherent in them that Flatirons Fuse is designed to accommodate). 

Advantages

  • Human readable, and human writable – It is a simple format to learn, even accounting for more complicated additions such as quotation marks. It is not encoded or converted into bits and bytes and can be easily edited. 
  • Widely used and compatible everywhere – People in every walk of life and industry are used to working with CSV files. It is not restricted to PC or Mac users, or any OS. It can be edited or read in any text editor, or any spreadsheet program. 
  • Simple schema – Databases can crunch numerical data without requiring a great deal of compute power.
  • Fast – Because of the simplicity of the format, it is not taxing for the parser to parse the data. This means data read and write operations are quick and painless. 
  • Business friendly – Because it is so prevalent, it is the best choice if you want a format that is accessible and readable by everyone. The goal of any business is to reach the most clients, and the CSV format is perfect for that goal. 

Challenges to the CSV format

As with any protocol, there are some challenges to the CSV format. One of the most problematic is the lack of standardization. While there have been efforts to standardize the format, notably RFC 4180 and several revisions in the years to follow, it is still ill-defined and can experience difficulties with special characters or characters from other languages (such as vowels with accents and other differences). Older versions of CSV could not handle fields with commas, assuming these were column delimiters in and of themselves. Quotations were introduced to handle these cases, but the more complex the data, the more challenging it could be to express in a standardized .csv format.

Another challenge is that there are other types of delimited text formats lumped in with comma-separated (or comma delimited) values, including pipe-delimited, tab-delimited, and semi-colon delimited versions of the format. While they sometimes have different extensions, (.tsv, for example) they are often considered as CSV. 

TSV is expressed as follows: 

Email Name Title Role

KK@123.com “Kubla Khan” Emperor Conqueror

AtG@123.com “Alexander the Great” King Conqueror

GW@123.com “George Washington” President Revolutionary

 

Email Name Title Role
KK@123.com “Kubla Khan” Emperor Conqueror
AtG@123.com “Alexander the Great” King Conqueror
GW@123.com “George Washington” President Revolutionary

 

Note that the above example uses tab as the delimiter between each entry, but between larger entries, you cannot tell at a glance whether it is a space or a tab between. This can make the TSV format a less readable format and easier to make a mistake with, particularly in text format. But functionally speaking, it operates in exactly the same way as CSV. The same goes for versions that use the semi-colon (;) as the delimiter, or the pipe (|).

These alternative versions can be a challenge to convert, save manually because a simple replace operation cannot be safely used – there may be commas within the data that do not serve as delimiters, or some of the same variables mentioned earlier that can cause problems for CSV, such as special characters, etc.

Importing CSV

One of the most common use cases for importing CSV files is uploading customer contacts. For example, if creating a campaign for email or phone marketing campaigns, the call center or email marketing software may accept a CSV import to tell it where the emails should be sent. Online tax software might accept either CSV or Excel Spreadsheet data for import as well. CRMs such as Hubspot or Nutshell also allow you to upload a list of customers or contacts. 

There are many more solutions out there that make use of, and/or accept CSV as an import format. Each will have its own template and may use a specific version of CSV. It is important to be cognizant of the different issues that can arise in the import process and to consider any processes or products that can streamline it for your customers, such as Flatirons Fuse.

Challenges to importing CSV files

Because there are many versions of CSV, determining whether it is a compatible format can be a challenge in itself. Most systems today use more modern implementations such as UTF-8, but there are many to choose from. One of the first challenges to establishing a common format for import can be establishing which of the various versions of CSV is in use, especially when they all look nearly alike in Excel or a text editor. One needs only to look at the numerous CSV options that Excel supports and can export/save as to see that the CSV landscape can still be considered the Wild West. 

Another issue can be the clean-up of the data itself. A human-readable, human-writable format is subject to that most prevalent of problems, human error. There can be missing data, missing headers, problems with misconstrued strings of data, non-digestible data formats, and any number of other issues that can be difficult to find by eyeballing the data. These issues are where a system like Flatirons Fuse can help, ensuring that the format of the data matches that of the target system, and with clear and simple validation, can help you to find and eliminate data record errors.

Exporting CSV

Nearly every spreadsheet out there has multiple options to export via CSV. Analytics software (such as IBM Cognos, for example) allows data to be exported as CSV because it is so readily usable for import purposes on other systems. JSON and other languages are common as well, but CSV is almost always an option for simpler tabular datasets.

When exporting CSV files, we recommend always using the latest CSV version, (UTF-8) as it is the most commonly used format. This is very important if using an exported file to import elsewhere. The transferability of CSV is the key reason for its popularity, and standardizing on the latest format and version is one way to limit potential problems in moving your data. 

Summary

CSV is one of the most commonly used file formats in the world today and has been for over three decades. Because of the many advantages we discussed today, it is not likely to go away anytime soon. As with any software, it is prone to human error and has seen a lot of changes over the years, resulting in many different solutions to common problems. Some of these have resulted in non-standard file formats that get lumped in with CSV, older versions that are not as flexible, and other potential issues. 

Whether the reason is the format, version, or just human error, sometimes the transferability (the key advantage of CSV) can be compromised. To mitigate these issues, Flatirons Fuse has created a product to help automate the import of CSVs to your website or web application solution, ensuring a streamlined onboarding process by preventing frustrations with the import of your customer’s data.