I have a performance issue with the PowerShell script below. I want to convert a CSV file to a JSON file. The conversion performs well when the CSV file is small, but the script hangs (never completes) when I run it on a CSV file of around 200 MB or larger.
Import-Csv -Path "F:\csvs\stack.csv" -UseCulture -Encoding UTF8 |
    Select-Object "Name", "Mobile", "DOB", "Email",
        @{ n = 'Father_Name';     e = { $_."Father Name" } },
        @{ n = 'Other_mobile_no'; e = { [long] $_."Other mobile no." } },
        "Pincode", "State" |
    ConvertTo-Json -Compress -Depth 100 |
    Add-Content -Encoding ASCII -Path "F:\csvs\stack.json"
Could you please let me know if you see any possible improvements to the script, or any changes I could make?
A simple way to improve performance is to avoid pipeline commands as much as possible, at the expense of requiring more memory. The pipeline is very flexible, but for several reasons it has an overhead that becomes significant when doing large amounts of low-level data crunching.
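To see this overhead for oneself, one can time a cmdlet pipeline against a plain loop. A minimal sketch (not from the original answer; the data and workload are arbitrary):

```powershell
$data = 1..100000

# Pipeline cmdlet: a script block is invoked once per input object.
(Measure-Command { $data | ForEach-Object { $_ * 2 } }).TotalMilliseconds

# Plain language loop: no per-item pipeline dispatch, typically much faster.
(Measure-Command { foreach ($x in $data) { $x * 2 } }).TotalMilliseconds
```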
- `Select-Object` creates a `PSCustomObject`, which has some inherent overhead due to its dynamic nature. As Santiago Squarzon points out, a faster way is to use a class instead, which also uses less memory.
- Avoid `ForEach-Object`, though. As pointed out by mklement0, due to the way script blocks are invoked, it is much slower than a `foreach ($x in $y)` loop. What turned out to be even faster (and more memory-efficient) than a `foreach` loop is piping directly to a simple script block `| & {…}` (which is able to do pipeline processing by using a `process` section). Piping to a non-advanced function without parameters, that contains only a `process` section, works as well.
- By passing the data through the `-InputObject` parameter of `ConvertTo-Json` (instead of piping it), another small performance gain can be achieved, because this prevents unrolling of the input array.

Using my own sample data, code rewritten along these lines runs about 5 times faster than the original.
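A sketch of what the rewritten script could look like, combining the points above. This is an assumption-based reconstruction using the column names from the question, not the exact code from the original answer:

```powershell
# A class with typed properties avoids the dynamic overhead of PSCustomObject.
class Record {
    [string] $Name
    [string] $Mobile
    [string] $DOB
    [string] $Email
    [string] $Father_Name
    [long]   $Other_mobile_no
    [string] $Pincode
    [string] $State
}

# Pipe into a plain script block with a process section instead of ForEach-Object.
$records = Import-Csv -Path 'F:\csvs\stack.csv' -UseCulture -Encoding UTF8 |
    & { process {
        [Record] @{
            Name            = $_.Name
            Mobile          = $_.Mobile
            DOB             = $_.DOB
            Email           = $_.Email
            Father_Name     = $_.'Father Name'
            Other_mobile_no = [long] $_.'Other mobile no.'
            Pincode         = $_.Pincode
            State           = $_.State
        }
    } }

# -InputObject prevents unrolling of the input array.
ConvertTo-Json -InputObject $records -Compress -Depth 100 |
    Set-Content -Encoding ASCII -Path 'F:\csvs\stack.json'
```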
Bigger performance improvements could be possible by rewriting the inner loop in (inline) C# and using a .NET CSV parser. This way we could also get around the memory overhead of the current solution.
I did a preliminary test using the `TextFieldParser` class, which cut down the runtime by another half. I had expected more from a C# solution, and comments here also suggest it isn't the fastest. It could be worth a try to test another one (see CSV parser benchmark).
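For reference, a minimal sketch of reading the CSV with `TextFieldParser` directly from PowerShell. It assumes the `Microsoft.VisualBasic` assembly is available (as on Windows PowerShell); the mapping step is left as a comment because it depends on the target object shape:

```powershell
Add-Type -AssemblyName Microsoft.VisualBasic

$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new('F:\csvs\stack.csv')
$parser.SetDelimiters(',')
$parser.HasFieldsEnclosedInQuotes = $true

$header = $parser.ReadFields()          # first row: column names
while (-not $parser.EndOfData) {
    $fields = $parser.ReadFields()      # one string[] per CSV record
    # ... map $fields (indexed by position in $header) to typed objects here ...
}
$parser.Close()
```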