Last Updated: February 25, 2016 · jasonscheirer

Coping with unknown delimited text data in Python

The csv module has been around in Python for quite some time. One of its hidden gems is the Sniffer class, which tries to work out what quoting rules and delimiter a file uses. Mighty nice for importing data from unknown sources.
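To make that concrete, here is a minimal sketch of the Sniffer working on an in-memory string. The sample data is made up for illustration; only `csv.Sniffer().sniff()` and `csv.reader()` come from the standard library.

```python
import csv

# Made-up sample: semicolon-delimited, no quoting.
sample = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\n"

# sniff() returns a Dialect subclass describing its best guess.
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))  # prints ';'

# The guessed dialect can be handed straight to csv.reader.
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])  # prints ['Alice', '30', 'Paris']
```

If the Sniffer cannot settle on a delimiter it raises `csv.Error`, so real code will usually want to catch that.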

The documentation suggests using it like this:

import csv

# Python 3: open in text mode with newline='' (Python 2 used 'rb' here)
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

I've found that this is a bit error-prone and fails more often than it should. What seems to trip it up is the arbitrary truncation at character 1024, which can cut the last row off partway through. A more effective solution, I've found, is to just feed it a few lines:

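import csv

# Python 3: open in text mode with newline='' (Python 2 used 'rb' here)
with open('example.csv', newline='') as csvfile:
    # Sample the first three whole lines instead of a fixed character count
    sample_text = ''.join(csvfile.readline()
                          for _ in range(3))
    dialect = csv.Sniffer().sniff(sample_text)
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...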

In my experience, that takes it from about 85% good to about 95% good.
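Since even the few-lines approach still misses some files, it can help to wrap it in a small helper that falls back to a default dialect. The sketch below is one way to do that, not a definitive implementation: the names `sniff_dialect` and `read_rows`, the `num_lines` parameter and the choice of `csv.excel` as the fallback are all assumptions made for illustration.

```python
import csv
from itertools import islice


def sniff_dialect(path, num_lines=3, fallback=csv.excel):
    """Guess the dialect of the file at `path` from its first few lines.

    Hypothetical helper: the name, parameters and fallback are illustrative.
    """
    with open(path, newline='') as csvfile:
        # islice stops early if the file has fewer than num_lines lines.
        sample_text = ''.join(islice(csvfile, num_lines))
    try:
        return csv.Sniffer().sniff(sample_text)
    except csv.Error:
        # Raised when the Sniffer cannot determine a delimiter.
        return fallback


def read_rows(path, num_lines=3):
    """Read every row of `path` using the sniffed (or fallback) dialect."""
    dialect = sniff_dialect(path, num_lines)
    with open(path, newline='') as csvfile:
        return list(csv.reader(csvfile, dialect))
```

Calling `read_rows('example.csv')` would then return the parsed rows as lists of strings. One caveat with sampling by line: a quoted field that contains a newline spans more than one physical line, so a slightly larger `num_lines` may be safer for such files.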