python rounding error when sampling variable Y as a function of X with histogram?

81 Views Asked by At

I'm trying to sample a variable (SST) as a function of another variable (TCWV) using the function histogram, with weights set to the sample variable like this:

# average sst over bins
num, _   = np.histogram(tcwv, bins=bins)
sstsum, _ = np.histogram(tcwv, bins=bins,weights=sst)
out=np.zeros_like(sstsum)
out[:]=np.nan
sstav  = np.divide(sstsum,num,out=out, where=num>100)

The whole code for reproducability is given below. My problem is that when I plot a scatter plot of the raw data and then I plot my calculated averages, the averages lie way outside the data "cloud" like this (see points on right):

enter image description here

I can't think why this is happening, unless it is a rounding error perhaps?

This is my whole code:

import numpy as np
import matplotlib.pyplot as plt
from netCDF4 import Dataset

# if you have a recent netcdf libraries you can access it directly here 
url = ('http://clima-dods.ictp.it/Users/tompkins/CRM/data/WRF_1min_mem3_grid4.nc#mode=bytes')
ds=Dataset(url)

### otherwise need to download, and use this:
###ifile="WRF_1min_mem3_grid4.nc"
###ds=Dataset(idir+ifile)


# axis bins
bins=np.linspace(40,80,21)

iran1,iran2=40,60

# can put in dict and loop 
sst=ds.variables["sst"][iran1:iran2+1,:,:]
tcwv=ds.variables["tcwv"][iran1:iran2+1,:,:]

# don't need to flatten, just tried it to see if helps (it doesn't)
sst=sst.flatten()
tcwv=tcwv.flatten()

# average sst over bins
num, _   = np.histogram(tcwv, bins=bins)
sstsum, _ = np.histogram(tcwv, bins=bins,weights=sst)
out=np.zeros_like(sstsum)
out[:]=np.nan
sstav  = np.divide(sstsum,num,out=out,where=num>100)

# bins centroids
avbins=(np.array(bins[1:])+np.array(bins[:-1]))/2

#plot
subsam=2
fig,(ax)=plt.subplots()
plt.scatter(tcwv.flatten()[::subsam],sst.flatten()[::subsam],s=0.05,marker=".")
plt.scatter(avbins,sstav,s=3,color="red")
plt.ylim(299,303)
plt.savefig("scatter.png")
1

There are 1 best solutions below

0
Nick ODell On BEST ANSWER

I can't think why this is happening, unless it is a rounding error perhaps?

It is in fact a rounding error.

Specifically, when you're calculating the sum of sst within each bin here:

sstsum, _ = np.histogram(tcwv, bins=bins,weights=sst)

The results come out wrong by 0.1% versus two alternate methods I tried for calculating the sum.

I have two ideas for ways to fix this.

Approach #1

The simplest fix is to do the calculation in more precision.

sstsum, _ = np.histogram(tcwv, bins=bins,weights=sst.astype('float64'))

Without this change, sst has a dtype of float32.

Approach #2

You might want to keep the calculation in 32-bit floats for performance reasons. They are somewhat faster than 64-bit floats. An alternate solution would be to subtract the mean before summing to improve numerical stability.

sst_mean = sst.mean()
num, _   = np.histogram(tcwv, bins=bins)
sstsum, _ = np.histogram(tcwv, bins=bins,weights=sst - sst_mean)
out=np.zeros_like(sstsum)
out[:]=np.nan
sstav  = np.divide(sstsum,num,out=out,where=num>100)
sstav += sst_mean

This subtracts the overall mean of sst from each data point, then adds it back at the end. Since floats have more precision around 0, this makes the calculation more precise.

Comparison

Here is a plot of what approach #1 looks like:

plot of sst vs tcwv done in higher precision

The plot of approach #2 looks the same. The two methods are equal within 1.32 * 10-5 of each other.