DolphinDB: loadText fails to correctly parse Chinese column names in a CSV file


I want to load a CSV file into memory with the loadText function, but the system fails to correctly parse the Chinese column names. As a workaround, I used Python to read the CSV file with encoding='gb18030' before loading, but that is very slow. I then switched to loadTextEx for parallel writing, but a prompt popped up indicating that since strings in DolphinDB are encoded in UTF-8, input text files must also be encoded in UTF-8. Is there an effective way to solve this issue?
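For reference, a minimal sketch of the failing call (the file name matches the answer below; the schema inspection line is illustrative):

t = loadText('SH600000.CSV')
// The Chinese column names display as mojibake because the file is
// GBK/GB18030-encoded while DolphinDB strings are UTF-8.
t.schema().colDefs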

1 Answer

Answer by dbaa9948:

Generate the schema table for the input data file using the extractTextSchema function:

schema=extractTextSchema('SH600000.CSV')

// output: the column names appear as mojibake because the GBK-encoded header is misread as UTF-8
name    type
����    SYMBOL
ʱ�� DATETIME
���̼�   DOUBLE
��߼�    DOUBLE
��ͼ�    DOUBLE
���̼�   DOUBLE
�ɽ���_��_   INT
�ɽ���_Ԫ_    INT
��Ȩϵ��  DOUBLE

Convert the column names in the schema table to UTF-8 using toUTF8 (or convertEncode; see the sketch after the output below):

schema[`name]=toUTF8(schema.name,"gbk");

// output
name    type
代码  SYMBOL
时间  DATETIME
开盘价 DOUBLE
最高价 DOUBLE
最低价 DOUBLE
收盘价 DOUBLE
成交量_手_  INT
成交额_元_  INT
复权系数    DOUBLE
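Equivalently, convertEncode performs the same conversion; a minimal sketch:

// convertEncode(str, srcEncode, destEncode) converts strings between
// encodings; here it maps the GBK column names to UTF-8, like toUTF8 above.
schema[`name] = convertEncode(schema.name, "gbk", "utf-8");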

Load the CSV file with loadText or loadTextEx, specifying the schema parameter as the corrected schema table:

loadText('SH600000.CSV',,schema)

// output
代码  时间  开盘价 最高价 最低价 收盘价 成交量_手_  成交额_元_  复权系数
SH600000    2019.01.02T09:31:00 120.1828    120.7998    120.1828    120.4296    4,174   4,072,450   12.3391
SH600000    2019.01.02T09:32:00 120.553 120.6764    120.4296    120.4296    1,494   1,459,340   12.3391
SH600000    2019.01.02T09:33:00 120.6764    120.7998    120.4296    120.6764    1,189   1,162,432   12.3391
SH600000    2019.01.02T09:34:00 120.6764    120.6764    120.0594    120.1828    4,601   4,487,235   12.3391
...

SH600000    2019.01.08T10:09:00 123.0208    123.0208    122.774 122.774 846 842,200 12.3391
SH600000    2019.01.08T10:10:00 122.774 122.8974    122.6507    122.6507    360 358,480 12.3391
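For parallel loading into a partitioned database, the same schema table can be passed to loadTextEx. A minimal sketch, assuming a date-partitioned DFS database; the database path, table name, and partition scheme here are illustrative assumptions:

// Assumed VALUE-partitioned database keyed on trade date; the DATETIME
// column 时间 (time) serves as the partitioning column.
db = database("dfs://stock", VALUE, 2019.01.01..2019.12.31)
loadTextEx(db, "minuteBar", "时间", 'SH600000.CSV', , schema)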