Coping with unknown delimited text data in Python
The CSV module has been around in Python for quite some time. One of the hidden gems in it is the Sniffer class, which will try to determine what your quoting rules are and what delimiter is being used. Mighty nice for importing data from unknown sources.
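To get a feel for what the Sniffer hands back, here is a minimal sketch run on an in-memory string. The sample data is made up for illustration; the only calls used are `csv.Sniffer().sniff()` and `csv.reader()`.

```python
import csv

# Hypothetical sample: semicolon-delimited rows whose first field is quoted
# because it contains the delimiter itself.
sample = 'name;city;note\n"Smith; Jane";Boston;ok\n"Doe; John";Denver;ok\n'

dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))   # ';'
print(repr(dialect.quotechar))   # '"'

# The returned dialect can be passed straight to csv.reader.
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])                   # ['Smith; Jane', 'Boston', 'ok']
```

The object returned by `sniff()` is a dialect, so it plugs into `csv.reader` the same way a built-in one like `csv.excel` would.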
The documentation suggests using it like this:
import csv

# Python 3: open in text mode; newline='' lets the csv module handle line endings
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...
I've found that this is a bit error-prone and fails more often than it should. What seems to trip it up is the arbitrary truncation at character 1024, which can cut the sample off partway through a row. A more effective solution I've found is to just feed it a few whole lines:
import csv

with open('example.csv', newline='') as csvfile:
    # Sample the first three complete lines instead of a fixed character count
    sample_text = ''.join(csvfile.readline() for _ in range(3))
    dialect = csv.Sniffer().sniff(sample_text)
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...
By my rough estimate, that takes the Sniffer from guessing right about 85% of the time to about 95%.
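For the cases where the Sniffer still gives up, one option is to wrap the few-lines approach in a helper that falls back to a default dialect. This is only a sketch under my own assumptions: the `sniff_dialect` name, the `sample_lines` parameter, and the choice of `csv.excel` as the fallback are all hypothetical, not something the csv module provides.

```python
import csv
import io
from itertools import islice


def sniff_dialect(csvfile, sample_lines=3, fallback=csv.excel):
    """Guess the dialect from the first few lines of an open text file.

    Hypothetical helper: returns `fallback` when the Sniffer cannot decide.
    """
    # Join whole lines so the sample never ends in the middle of a row.
    sample = ''.join(islice(csvfile, sample_lines))
    csvfile.seek(0)  # rewind so the caller reads from the start
    try:
        return csv.Sniffer().sniff(sample)
    except csv.Error:
        # sniff() raises csv.Error when it cannot determine a delimiter.
        return fallback


# Usage, with an in-memory file standing in for open(path, newline='').
csvfile = io.StringIO('id\tscore\n1\t9.5\n2\t7.25\n')
reader = csv.reader(csvfile, sniff_dialect(csvfile))
rows = list(reader)
```

Falling back silently is a design choice; if a wrong guess would be costly for your data, it may be better to let the `csv.Error` propagate and ask a human.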
Written by Jason Scheirer