I have a performance issue with the PowerShell script below. I want to convert a CSV file to a JSON file. The conversion performs well when the CSV file is small, but the script hangs (never completes) when I run it on a CSV file of around 200 MB or larger.
Import-Csv -Path "F:\csvs\stack.csv" -UseCulture -Encoding UTF8 |
    Select-Object "Name", "Mobile", "DOB", "Email",
        @{ n = 'Father_Name';     e = { $_."Father Name" } },
        @{ n = 'Other_mobile_no'; e = { [long] $_."Other mobile no." } },
        "Pincode", "State" |
    ConvertTo-Json -Compress -Depth 100 |
    Add-Content -Encoding ASCII -Path "F:\csvs\stack.json"
Could you please let me know if you see any possible improvements to the script, or any changes I could make?
A simple way to improve performance is to avoid pipeline commands as much as possible, at the expense of requiring more memory. The pipeline is very flexible, but for several reasons it has an overhead that becomes significant when doing large amounts of low-level data crunching.
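To see this overhead for oneself, one can time a cmdlet pipeline against a plain loop. A minimal sketch (not from the original answer; the data and workload are arbitrary):

```powershell
$data = 1..100000

# Pipeline cmdlet: a script block is invoked once per input object.
(Measure-Command { $data | ForEach-Object { $_ * 2 } }).TotalMilliseconds

# Plain language loop: no per-item pipeline dispatch, typically much faster.
(Measure-Command { foreach ($x in $data) { $x * 2 } }).TotalMilliseconds
```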
- `Select-Object` creates a `PSCustomObject`, which has some inherent overhead due to its dynamic nature. As Santiago Squarzon points out, a faster way is to use a class instead, which also uses less memory.
- Avoid `ForEach-Object`, though. As pointed out by mklement0, due to the way script blocks are invoked, it is much slower than a `foreach ($x in $y)` loop. What turned out to be even faster (and more memory-efficient) than a `foreach` loop is piping directly to a simple script block `| & {…}` (which is able to do pipeline processing by using a `process` section). Piping to a non-advanced function without parameters, that contains only a `process` section, works as well.
- By passing the data through the `-InputObject` parameter of `ConvertTo-Json` (instead of piping it), another small performance gain can be achieved, because this prevents unrolling of the input array.

Using my own sample data, code rewritten along these lines runs about 5 times faster than the original.
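A sketch of what the rewritten script could look like, combining the points above. This is an assumption-based reconstruction using the column names from the question, not the exact code from the original answer:

```powershell
# A class with typed properties avoids the dynamic overhead of PSCustomObject.
class Record {
    [string] $Name
    [string] $Mobile
    [string] $DOB
    [string] $Email
    [string] $Father_Name
    [long]   $Other_mobile_no
    [string] $Pincode
    [string] $State
}

# Pipe into a plain script block with a process section instead of ForEach-Object.
$records = Import-Csv -Path 'F:\csvs\stack.csv' -UseCulture -Encoding UTF8 |
    & { process {
        [Record] @{
            Name            = $_.Name
            Mobile          = $_.Mobile
            DOB             = $_.DOB
            Email           = $_.Email
            Father_Name     = $_.'Father Name'
            Other_mobile_no = [long] $_.'Other mobile no.'
            Pincode         = $_.Pincode
            State           = $_.State
        }
    } }

# -InputObject prevents unrolling of the input array.
ConvertTo-Json -InputObject $records -Compress -Depth 100 |
    Set-Content -Encoding ASCII -Path 'F:\csvs\stack.json'
```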
Bigger performance improvements could be possible by rewriting the inner loop in (inline) C# and using a .NET CSV parser. This way we could also get around the memory overhead of the current solution.
I did a preliminary test using the `TextFieldParser` class, which cut down the runtime by another half. I had expected more from a C# solution, and comments here also suggest it isn't the fastest. It could be worth a try to test another one (see CSV parser benchmark).
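For reference, a minimal sketch of reading the CSV with `TextFieldParser` directly from PowerShell. It assumes the `Microsoft.VisualBasic` assembly is available (as on Windows PowerShell); the mapping step is left as a comment because it depends on the target object shape:

```powershell
Add-Type -AssemblyName Microsoft.VisualBasic

$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new('F:\csvs\stack.csv')
$parser.SetDelimiters(',')
$parser.HasFieldsEnclosedInQuotes = $true

$header = $parser.ReadFields()          # first row: column names
while (-not $parser.EndOfData) {
    $fields = $parser.ReadFields()      # one string[] per CSV record
    # ... map $fields (indexed by position in $header) to typed objects here ...
}
$parser.Close()
```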