PowerShell, SQL Server, SqlBulkCopy: syntax to specify codepage or encoding


I'm using a PowerShell script (found at https://blog.netnerds.net/2015/01/powershell-high-performance-techniques-for-importing-csv-to-sql-server/) to load data from a quote-delimited, comma-separated .csv into a SQL Server (2016) database.

The source .csv files contain extended characters within the data, such as umlauts in some first names and m² (with a superscript 2) for metre squared. I believe I need to specify a codepage of 65001 in order for such extended character values to flow from the .csv file and be inserted into varchar datatypes.

What is the syntax to specify a particular codepage (or encoding) value?

I've attempted this:

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring,([System.Data.SqlClient.SqlBulkCopyOptions] 'TableLock'))
$bulkcopy.DestinationTableName = $table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize
#$bulkcopy.Codepage = $codepage

where $codepage has already been assigned the value 65001, but the line above produces an error. I've googled this for ages but not found an example. Any help will be much appreciated.

1 Answer

Answered by Carsten.R

Using a simple split command to read CSV files is not recommended, because the CSV format is more complex than it appears: delimiters can vary, fields can be quoted, and fields can contain line breaks. To handle these complexities and parse CSV files correctly, use a dedicated CSV parsing library such as CsvHelper, which supports a variety of CSV specifications, including RFC 4180. Additionally, if the CSV file contains characters that are not in the default encoding, the output might display incorrectly; to ensure the file is read correctly, check its encoding and set the appropriate encoding before reading it. (The snippet below shows why a naive split falls short.)
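For example (an illustration, using made-up sample data): a quoted field that contains the delimiter defeats a naive split.

$line = 'x,"Smith, John",42'
$line -split ','    # yields 4 pieces: x | "Smith |  John" | 42 -- not the 3 fields intended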

Here is a sample using CsvHelper, which is actually a NuGet package, inside PowerShell. This sample adds records to a System.Data.DataTable and to a PSCustomObject. You need to implement the batched bulk inserts from the DataTable on your own (this should not be too difficult; see the sketch just below).
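A minimal sketch of that step could look like this, assuming $connectionstring, $table and $batchsize are defined as in the question's script, and $datatable is the table populated by the code further down:

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring,([System.Data.SqlClient.SqlBulkCopyOptions] 'TableLock'))
$bulkcopy.DestinationTableName = $table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize        # BatchSize controls how many rows go per batch
$bulkcopy.WriteToServer($datatable)     # writes every row of the populated DataTable
$bulkcopy.Close()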

As mentioned in the comments, there are several potential pitfalls here: non-compliant source files, incorrect reading of the source files, database collations, and applications that read from the database incorrectly. On top of that, you don't have full control over all parts of these environments. If your changes are limited to the PowerShell script, this approach at least gives you correct parsing via CsvHelper and some control over the encoding.

To test the script below, just create a C:\temp\test.csv containing:

entry1,entry2
x,Üüüü¢+Это текст
a,z

Advantage: RFC 4180-compliant parsing and control over the read/write encoding into the DataTable.

# Re-interprets the bytes of $text: encode with the source encoding, then decode
# those bytes with the target encoding. If the two encodings differ, the result
# is deliberately mangled (mojibake) -- as the sample output further down shows.
function Convert-Encoding {
    param (
        [Parameter(Mandatory = $true)]
        [string]$text,
        [Parameter(Mandatory = $true)]
        [System.Text.Encoding]$sourceEncoding,
        [Parameter(Mandatory = $true)]
        [System.Text.Encoding]$targetEncoding
    )

    $bytes = $sourceEncoding.GetBytes($text)
    $newText = $targetEncoding.GetString($bytes)
    return $newText
}
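# Illustrative only: with identical encodings the text round-trips unchanged, e.g.
#   Convert-Encoding -text 'Üü' -sourceEncoding ([System.Text.Encoding]::UTF8) -targetEncoding ([System.Text.Encoding]::UTF8)
# returns 'Üü'; with differing encodings it does not.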

# Using the CsvHelper NuGet package inside PowerShell
Install-Module -Name GetNuGet
$nuget = Get-NuGet -Version 6.5.0
& $nuget install CsvHelper -Version 30.0.1 -OutputDirectory "$env:USERPROFILE\.nuget\packages\"
Add-Type -Path "$env:USERPROFILE\.nuget\packages\CsvHelper\30.0.1\lib\net45\CsvHelper.dll"
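# Note: the net45 build matches Windows PowerShell 5.1; on PowerShell 7+ a
# .NET Standard build of CsvHelper from the same package would be needed instead.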

# Source: UTF-8; target: the system's default ANSI codepage (the mismatch is deliberate)
$sourceEncoding = [System.Text.Encoding]::UTF8
$targetEncoding = [System.Text.Encoding]::Default

# Open the file with an explicit source encoding and hand the reader to CsvHelper
$reader = [System.IO.StreamReader]::new("C:\temp\test.csv",$sourceEncoding)
$csv = [CsvHelper.CsvReader]::new($reader, [System.Globalization.CultureInfo]::InvariantCulture)

$records = [System.Collections.Generic.List[object]]::new()
$csv.Read()        # advance to the first row
$csv.ReadHeader()  # treat it as the header
$datatable = New-Object System.Data.DataTable
$header = $csv.HeaderRecord

foreach($item in $header)
{
    $datatable.Columns.Add($item);
}

while ($csv.Read()) {

    $record = [PSCustomObject] @{
        Id   = Convert-Encoding -text $csv.GetField("entry1") -sourceEncoding $sourceEncoding -targetEncoding $targetEncoding
        Name = Convert-Encoding -text $csv.GetField("entry2") -sourceEncoding $sourceEncoding -targetEncoding $targetEncoding
    }

    $records.Add($record)

    $workRow = $datatable.NewRow()
    foreach($item in $header)
    {
        $workRow[$item] = Convert-Encoding -text $csv.GetField($item) -sourceEncoding $sourceEncoding -targetEncoding $targetEncoding
    }
    $datatable.Rows.Add($workRow)

}

$csv.Dispose()
$reader.Dispose()

Write-Output $records

This will end up in the DataTable as:

entry1,entry2
x,Üüüü¢+Это текÑÑ‚
a,z

That's because I changed the encoding in between:

$sourceEncoding = [System.Text.Encoding]::UTF8
$targetEncoding = [System.Text.Encoding]::Default
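If the goal is instead to keep the extended characters intact, one option is to use the same encoding on both sides, which makes the conversion a no-op:

$sourceEncoding = [System.Text.Encoding]::UTF8
$targetEncoding = [System.Text.Encoding]::UTF8

The DataTable then holds the original text; SqlBulkCopy transfers .NET strings as-is, so what matters on the database side is that the destination columns (ideally nvarchar) and their collation can represent those characters.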