Ignoring delimiter while reading CSV files from URLs - Python

I have some URLs for downloading CSV files.

import pandas as pd
import io
import requests

url1 = 'https://www.ons.gov.uk/generator?format=csv&uri=/economy/economicoutputandproductivity/output/timeseries/' + 'k22a' + '/diop'

url2 = 'https://www.ons.gov.uk/generator?format=csv&uri=/economy/economicoutputandproductivity/output/timeseries/' + 'k24c' + '/diop'

s = requests.get(url1).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

When I use url1, there is a ',' in the 4th record, but some URLs (such as url2) don't have this unexpected separator. This causes

ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 2

when I try to merge the CSV files into a single dataframe. How do I ignore these unexpected separators? The first seven records are going to be deleted anyway, but I still get this error.

This solution suggests pre-parsing each line before converting to CSV. Since I have many such URLs and don't know which unexpected delimiters I might encounter in the future, I'm not sure how to debug this. Can pre-parsing before converting to CSV work? How can it be implemented so that it also handles other separators encountered in the future?

Best answer, by mozway:

Since you don't need the metadata, just skip it using the skiprows parameter of read_csv. As a nice side effect, you'll also have the correct dtypes automatically:

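url = url1
N = 7  # number of metadata rows that follow the header line

s = requests.get(url).content
# keep line 0 as the header and skip the N metadata rows after it
c = pd.read_csv(io.StringIO(s.decode('utf-8')), header=0, skiprows=range(1, N+1))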

Output:

  Title  IOP: C:MANUFACTURING: CVMSA
0  1948                         25.2
1  1949                         27.0
2  1950                         29.0
3  1951                         29.9
4  1952                         28.4
...
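To try the same call without a network request, here is a minimal sketch on an inline string. The sample text, its series name, and the two-row metadata block are made up for illustration and only imitate the layout described in the question, with one metadata line carrying a stray comma:

```python
import io

import pandas as pd

# Hypothetical sample: header, two metadata rows (one with an extra comma), then data.
sample = (
    "Title,Example series\n"
    "CDID,ABCD\n"
    "Notes,first part, second part\n"
    "1948,25.2\n"
    "1949,27.0\n"
)

N = 2  # number of metadata rows in this made-up sample

# Line 0 stays as the header; lines 1..N are skipped, so their
# extra separator never reaches the column-count check.
df = pd.read_csv(io.StringIO(sample), header=0, skiprows=range(1, N + 1))
print(df)
print(df.dtypes)
```

Because only numeric rows remain after the header, the columns should come out with numeric dtypes rather than object.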

If you don't even need headers:

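url = url1
N = 8  # header line plus the 7 metadata rows

s = requests.get(url).content
# skip the first N lines entirely and let pandas number the columns
c = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None, skiprows=N)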

Output:

      0     1
0  1948  25.2
1  1949  27.0
2  1950  29.0
3  1951  29.9
4  1952  28.4
...
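If the number of metadata rows is not the same for every URL, the pre-parsing idea raised in the question could look roughly like the sketch below. It is not part of the answer above: the helper name `split_metadata` is hypothetical, and the rule "a data row has as many fields as the header and starts with a four-digit year" is an assumption that should be checked against the real files.

```python
import csv
import io
import re

# Assumption: every data row starts with a four-digit year.
YEAR_PREFIX = re.compile(r"^\d{4}\b")


def split_metadata(text):
    """Return (header, data_rows) from CSV text, dropping metadata rows.

    A row is kept as data only if it has as many fields as the header
    and its first field starts with a four-digit year.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header = rows[0]
    data = [
        row
        for row in rows[1:]
        if len(row) == len(header) and YEAR_PREFIX.match(row[0])
    ]
    return header, data


# Made-up sample imitating the layout in the question.
sample = (
    '"Title","Example series"\n'
    '"CDID","ABCD"\n'
    'Notes,first part, second part\n'
    '"1948","25.2"\n'
    '"1949","27.0"\n'
)

header, data = split_metadata(sample)
print(header)
print(data)
```

The resulting rows are still strings; they could be passed to pd.DataFrame(data, columns=header) and converted with pd.to_numeric afterwards. Filtering on the shape of a data row rather than on a fixed row count means extra or malformed metadata lines are dropped regardless of which separator they contain.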