Postgres errors on ARM-based M1 Mac w/ Big Sur

Ever since I got a new ARM-based M1 MacBook Pro, I've been experiencing severe and consistent PostgreSQL issues (PostgreSQL 13.1). Whether I run a Rails server or Foreman, I receive errors in both my browser and terminal such as PG::InternalError: ERROR: could not read block 15 in file "base/147456/148555": Bad address, PG::Error (invalid encoding name: unicode), or Error during failsafe response: PG::UnableToSend: no connection to the server. The strange thing is that I can often refresh the browser repeatedly to get things to work (until they inevitably break again).

I'm aware of all the configuration challenges related to ARM-based M1 Macs, which is why I've uninstalled and reinstalled everything from Homebrew to Postgres multiple times and in numerous ways (with Rosetta, without Rosetta, using arch -x86_64 brew commands, using the Postgres app instead of the Homebrew install). I've encountered a couple of other people on random message boards who are experiencing the same issue (also on new Macs) and not having any luck, which is why I'm reluctant to believe it's a drive-corruption issue. (I've also run the Disk Utility First Aid check multiple times; it says everything's healthy, but I have no idea how reliable that is.)

I'm using thoughtbot's parity to sync my development database with what's currently in production. When I run development restore production, I get hundreds of lines in my terminal like the output below, immediately after the download completes but before it goes on to create defaults, process data, set sequences, etc. I believe this is at the root of the issue, but I'm not sure what the solution would be:

pg_restore: dropping TABLE [table name1]
pg_restore: from TOC entry 442; 1259 15829269 TABLE [table name1] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name1]" does not exist
Command was: DROP TABLE "public"."[table name1]";
pg_restore: dropping TABLE [table name2]
pg_restore: from TOC entry 277; 1259 16955 TABLE [table name2] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name2]" does not exist
Command was: DROP TABLE "public"."[table name2]";
pg_restore: dropping TABLE [table name3]
pg_restore: from TOC entry 463; 1259 15830702 TABLE [table name3] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name3]" does not exist
Command was: DROP TABLE "public"."[table name3]";
pg_restore: dropping TABLE [table name4]
pg_restore: from TOC entry 445; 1259 15830421 TABLE [table name4] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name4]" does not exist
Command was: DROP TABLE "public"."[table name4]";
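
(For context: a dump taken with the --clean option emits DROP statements, which fail harmlessly when the target database doesn't yet contain those tables. If you restore the dump manually rather than through parity, pg_restore's --if-exists flag suppresses these errors; the dump filename and database name below are placeholders:

pg_restore --clean --if-exists --no-owner -d myapp_development latest.dump

I'm not certain these DROP errors are actually related to the Bad address problem, though.)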

Has anyone else experienced this? Any solution ideas would be much appreciated. Thanks!

EDIT: I was able to reproduce the same issue on an older MacBook Pro (also running Big Sur), so it seems unrelated to M1 but potentially related to Big Sur.

4 Answers

Ben Wilson

UPDATE #2:

The WAL buffer and other adjustments extended the time between errors but didn't eliminate them completely. I ended up installing a fresh Apple Silicon build of Postgres with Homebrew, then doing a pg_dump of my existing database (the one experiencing the errors) and restoring it to the new installation/cluster.

Here's the interesting bit: pg_restore failed to restore one of the indexes in the database, and noted it during the restore process (which otherwise completed). My hunch is that corruption or some other issue with this index was causing the Bad Address errors. As such, my final suggestion on this issue is to perform a pg_dump and then restore it with pg_restore, which surfaced the problem where pg_dump alone didn't, writing a clean DB sans the faulty index.
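
A minimal sketch of that dump-and-restore cycle (the database name and dump path are placeholders; this assumes the new Apple Silicon cluster is already running):

# Dump the error-prone database in custom format
pg_dump -Fc mydb > mydb.dump

# Create an empty database on the fresh cluster, then restore into it
createdb mydb
pg_restore -d mydb mydb.dump

pg_restore prints an error for any object it cannot rebuild, which is how the faulty index showed up here.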

UPDATE:

I continued to experience this issue after attempting several workarounds, including a full pg_dump and restore of the affected database. While some of the fixes seemed to extend the time between occurrences (particularly increasing the shared buffer memory), none proved to be a permanent fix.

That said, some more digging on the Postgres mailing lists revealed that this "Bad Address" error can occur in conjunction with WAL (write-ahead log) issues. As such, I've now set the following in my postgresql.conf file, significantly increasing the WAL buffer size:

wal_buffers = 4MB

and have not experienced the issue since (knock on wood, again).

It makes sense that this would have some effect, since the default wal_buffers size scales in proportion to shared_buffers (and, as mentioned above, increasing the shared buffer size provided temporary relief). Anyway, it's something else to try until we get definitive word on what's causing this bug.
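
For reference, a sketch of applying and verifying the same setting without hand-editing the file (wal_buffers is only read at server start, so a restart is required; the Homebrew service name is an assumption for a Homebrew install):

psql -c "ALTER SYSTEM SET wal_buffers = '4MB';"
brew services restart postgresql
psql -c "SHOW wal_buffers;"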


I was having this exact issue sporadically on an M1 MacBook Air: could not read block and Bad Address errors in various permutations.

I read on a Postgres forum that this issue can occur in virtual-machine setups, so I assume it's somehow caused by Rosetta. Even if you're using the universal version of Postgres, you're likely still running an x86 binary for some adjunct process (e.g., Python in my case).
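
A quick way to check which architecture a given binary was built for on macOS (the file tool is standard; the paths depend on your install):

file "$(which postgres)"
file "$(which python3)"

The output ends in x86_64 for an Intel binary (which runs under Rosetta on M1) or arm64 for a native one; a universal binary lists both.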

Regardless, here's what has solved the issue (so far): reindexing the database

Note: you need to reindex from the command line, not using SQL commands. When I attempted to reindex using SQL, I encountered the same Bad Address error over and over, and the reindexing never completed.

When I reindexed using the command line, the process finished, and the Bad Address error has not recurred (knock on wood).

For me, it was just:

reindexdb name_of_database

It took 20-30 minutes for a 12 GB database. Not only am I no longer getting these errors, but the database seems snappier to boot. I only hope the issue doesn't return with repeated reads/writes/index creation under Rosetta. I'm not sure why this works... maybe indexes created on M1 Macs are prone to corruption? Maybe the indexes become corrupt on write or access because of the Rosetta interaction?
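
If you want progress output during a long reindex like this, reindexdb can echo what it's doing (same placeholder database name as above):

reindexdb --verbose --echo name_of_database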

Ben Wilson

Definitive workaround for this:

After trying all the workarounds in the other answer, I was STILL getting this error occasionally, even after dumping and restoring the database, switching to M1-native Postgres, running all manner of maintenance scripts, etc.

After much tinkering with postgresql.conf, the only change that has reliably worked around this issue indefinitely (I have not received the error since):

In postgresql.conf, change:

max_worker_processes = 8

to

max_worker_processes = 1

After making this change, I have thrown every test at my previously error-ridden database and it hasn't displayed the same error once. Previously, an extraction routine I run on a database of about 20 million records would hit the Bad Address error after processing 1-2 million records; now it completes the whole process.

Obviously there is a performance penalty to reducing the number of parallel workers, but this is the only way I've found to reliably and permanently resolve this issue.
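
The same ALTER SYSTEM pattern from the earlier update works here too (max_worker_processes is also read only at server start, so restart afterward); a one-line check that the running server actually picked up the new value:

psql -c "SHOW max_worker_processes;"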

Ian Gow

Is it possible that something in the Big Sur Beta 11.3 fixed this issue?

I've been having the same issues as OP since installing PostgreSQL 13 using MacPorts on my Mac mini M1 (now on PostgreSQL 13.2).

I would see could not read block errors:

  1. Occasionally when running ad hoc queries
  2. Always when compiling a book in R Markdown that makes several queries
  3. Always when running VACUUM FULL on my main database (there's about 620 GB in the instance on this machine and the error would be thrown very quickly relative to how long a VACUUM FULL would take).

(My "fix" so far has been to point my Mac to the Ubuntu server I have running in the corner of my office, so no real problem for me.)

But I've managed to do 2 and 3 without the error since upgrading to Big Sur Beta 11.3 today (both failed immediately prior to upgrading). Is it possible that something in the OS fixed this issue?

Kamil Tomšík

I restored postgresql.conf from postgresql.conf.sample (and restarted the DB server), and it has worked fine ever since.

To be clear, I had already tried both the wal_buffers and max_worker_processes changes here, and they didn't help. I discovered this fix accidentally: I had tried so many things that I just needed to go back to the defaults. I did not reinitialize the whole database or anything like that, just the config file.
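
A minimal sketch of that reset, assuming pg_config and psql are on the PATH and the server runs as a Homebrew service (paths and the service name will vary per setup; back up the old file first):

# Locate the live config and keep a backup
DATADIR="$(psql -tAc 'SHOW data_directory;')"
cp "$DATADIR/postgresql.conf" "$DATADIR/postgresql.conf.bak"

# Copy the pristine sample over it and restart
cp "$(pg_config --sharedir)/postgresql.conf.sample" "$DATADIR/postgresql.conf"
brew services restart postgresql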