I've tried to follow a data cleaning tutorial on youtube . The mentor uses SQL Server and I use BigQuery (too much trouble install SQL Server on a Mac and I would like learn to adapt SQL syntax to BigQuery).
When it was instructed in SQL Server how to Populate Property Address data, the SQL syntax is
Select *
From PortfolioProject.dbo.NashvilleHousing
--Where PropertyAddress is null
order by ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
Update a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
From PortfolioProject.dbo.NashvilleHousing a
JOIN PortfolioProject.dbo.NashvilleHousing b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
I've managed to adapt some to BigQuery language
Select *
From `sturdy-filament-415311.NashvilleHousing.NHData`
order by ParcelID
;
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, COALESCE(a.PropertyAddress,b.PropertyAddress)
From `sturdy-filament-415311.NashvilleHousing.NHData` a
JOIN `sturdy-filament-415311.NashvilleHousing.NHData` b
on a.ParcelID = b.ParcelID
AND a.UniqueID_ <> b.UniqueID_
Where a.PropertyAddress is null
;
But somehow got error when it comes to the below query
Update a
SET PropertyAddress = COALESCE(a.PropertyAddress,b.PropertyAddress)
From sturdy-filament-415311.NashvilleHousing.NHData a
JOIN sturdy-filament-415311.NashvilleHousing.NHData b
on a.ParcelID = b.ParcelID
AND a.UniqueID_ <> b.UniqueID_
Where a.PropertyAddress is null
I've got Table "a" must be qualified with a dataset (e.g. dataset.table).
Then I fix it by this
Update sturdy-filament-415311.NashvilleHousing.NHData
SET PropertyAddress = COALESCE(a.PropertyAddress,b.PropertyAddress)
From sturdy-filament-415311.NashvilleHousing.NHData a
JOIN sturdy-filament-415311.NashvilleHousing.NHData b
on a.ParcelID = b.ParcelID
AND a.UniqueID_ <> b.UniqueID_
Where a.PropertyAddress is null
And got UPDATE/MERGE must match at most one source row for each target row
How do I fix this?
Unlike some other DBMS's, SQL Server allows an update on multiple matching rows. However, without additional logic, such updates run only on first instance of joined matches based on indexes which is a non-deterministic outcome that can change with each run and hence not reliable.
Possibly, due to this scenario, BigQuery raises an error on such multi-row updates.
If goal is to populate missing property addresses with any non-missing property address by corresponding ParcelID, consider using an aggregate query.
Also, be sure to consistently use backticks as hyphen
-in table names can raise syntax errors. Also, let's heed, Bad Habits to Kick: Table aliases like (a, b, c) or (t1, t2, t3), by using more informative aliases: