Recently I tried to find an article with a title like this, but found nothing. So, here it is.
Solid-state drives wear out from writes. Manufacturers even put the amount of writes into their warranty conditions. For example, my 1 TB "Samsung 870 EVO" is marked as "5 years or 600 TB of writes — whatever comes first".
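By the way, you can check how much your drive has already written. A quick sketch, assuming smartmontools is installed, the drive is /dev/sda, and it reports Total_LBAs_Written in 512-byte units (Samsung SATA drives do):

# read the SMART attribute and convert the raw value to terabytes written
sudo smartctl -A /dev/sda | awk '/Total_LBAs_Written/ { printf "%.1f TB written\n", $10 * 512 / 1024^4 }'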
These are the results of my experiments in reducing writes of @rss2tg_bot by tuning postgresql.conf. The charts are kinda spiky and hard to read precisely, but that's roughly a 50% reduction in bytes written per day:
Of course, the best way to reduce writes is to do fewer INSERTs and UPDATEs, and also to reduce the "width" of your tables. Remember, each time you UPDATE a single column, a new version of the entire row has to be written. This is how MVCC in PostgreSQL works, and it is inevitable. But that's not the subject of this article.
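Still, one cheap trick for the "width" part, sketched below with hypothetical feeds and feed_state tables (the mydb database is an assumption too): keep the frequently updated state in its own narrow table, so an UPDATE rewrites a small row instead of the whole wide one.

# hypothetical schema split: wide, rarely-changed data separated from narrow, frequently-updated state
psql -d mydb <<'SQL'
CREATE TABLE feeds (
    id          bigint PRIMARY KEY,
    url         text,
    title       text,
    description text
);
CREATE TABLE feed_state (
    feed_id      bigint PRIMARY KEY REFERENCES feeds(id),
    last_checked timestamptz,
    last_post_id bigint
);
SQL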
To estimate how much your database actually writes, use iostat. It outputs stats accumulated since the machine booted. Take kB_wrtn/s × 60 × 60 × 24 × 365 ÷ 1024 ÷ 1024 ÷ 1024 and you'll get an estimate in terabytes per year. That number can be surprisingly big, especially if you UPDATE a lot, or INSERT after DELETE.
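A worked example of that arithmetic, assuming the drive is sda and the usual sysstat column layout where kB_wrtn/s is the fourth field:

# terabytes per year, extrapolated from the average write rate since boot
iostat -d | awk '$1 == "sda" { printf "%.1f TB/year\n", $4 * 60 * 60 * 24 * 365 / 1024^3 }'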
postgresql.conf:
synchronous_commit = off
Short explanation: it disables invoking fsync() for the WAL on each COMMIT.
What are fsync() and WAL? Here goes a long explanation.
When you ask your operating system to write something to disk, the OS won't actually do it right away. Instead, your data will sit for some time in a memory cache, or "write buffer". So if you want to be sure your data will survive a system crash or a power-off, you use the fsync() system call to ask the operating system to actually flush the data to disk.
But PostgreSQL doesn't rely on fsync() to keep the database files in a consistent state. In fact, PostgreSQL doesn't keep the database consistent on disk at all: it is a permanent mess of chunks of data already written to disk and chunks still sitting in memory. Instead, before writing into the database, PostgreSQL describes everything it's going to do in the Write-Ahead Log (WAL). With the WAL, the database can be restored even after a crash. The WAL is an append-only file, so it's impossible to corrupt it in a catastrophic way.
By default, PostgreSQL invokes fsync() on the WAL on each transaction commit. If you disable that, the last transactions appended to the WAL can be lost in case of a crash. But not flushing on every commit also gives better buffering, so writes go to disk in bigger chunks.
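Note that synchronous_commit can also be overridden per session or per transaction, so you can keep the global default off and still get a durable commit where it really matters. A sketch, assuming a hypothetical payments table in a mydb database:

# force one important transaction to be flushed to disk despite the global "off"
psql -d mydb <<'SQL'
BEGIN;
SET LOCAL synchronous_commit = on;
INSERT INTO payments (user_id, amount) VALUES (42, 9.99);
COMMIT;
SQL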
postgresql.conf:
wal_writer_delay = 10s # default 200ms; how often the WAL writer process flushes the WAL
wal_writer_flush_after = 8MB # default 1MB; or after this much new WAL, whichever comes first
postgresql.conf:
wal_compression = on
I bet you already understand what compression means.
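On PostgreSQL 14 and newer you can check how much WAL is actually generated, which makes it easy to see whether compression (and the WAL writer settings above) change anything:

# bytes of WAL and full-page images generated since the stats were last reset
psql -c "SELECT wal_bytes, wal_fpi, stats_reset FROM pg_stat_wal;"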
postgresql.conf:
checkpoint_timeout = 3h # or bigger; default 5min
max_wal_size = 20GB # or bigger; default 1GB
As I said before, a database is always in a state of permanent mess. But there are moments when it is in a completely consistent state on disk. These moments are called "checkpoints", and PostgreSQL does checkpoints in order to be able to remove old segments of the write-ahead log.
The fewer checkpoints you make, the longer your data can live in the write buffer.
By doing this you will also increase performance between checkpoints, but every checkpoint will cause a spike in latency. You will also increase the time your database needs to start up after a crash.
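To see how often checkpoints actually happen, the easiest thing is to set log_checkpoints = on; the statistics views work too, though the columns moved between versions. This sketch assumes the pre-17 pg_stat_bgwriter layout:

# timed vs. requested checkpoints and how many buffers they wrote (PostgreSQL 16 and older)
psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint FROM pg_stat_bgwriter;"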
postgresql.conf:
bgwriter_lru_maxpages = 0
The background writer, well, writes dirty buffers from the cache to disk when the database has nothing better to do. This is how it achieves steady performance.
In the case of SSDs, this is the dumbest thing ever. Our disks are fast enough to flush all the buffers quickly, but the background writer produces a lot of writes we don't want.
This setting works best with the increased checkpoint_timeout above.
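Before zeroing it out you can check how much the background writer was actually writing on your setup; buffers_clean counts the buffers it flushed (times the default 8 kB block size):

# megabytes written by the background writer since the stats were last reset
psql -c "SELECT buffers_clean * 8192 / 1024 / 1024 AS bgwriter_mb, stats_reset FROM pg_stat_bgwriter;"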
Almost every setting above was aiming at one goal: to keep data in a memory buffer as long as possible, so a data block can be changed multiple times before it is actually written to disk.
The problem is that there are two layers of these memory buffers. One is the "shared buffers" of PostgreSQL (you might have heard about them already), and the other is the page cache of your operating system (I hope it's Linux, since that's all I use). And you should choose one of them to rely on, because splitting your memory 50%/50% is the worst-case scenario: you can use only 50% of your memory as a cache!
This is the easiest one. You just set shared_buffers to 80% of your free memory or something.
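On a hypothetical 32 GB machine mostly dedicated to PostgreSQL that would look roughly like this (raising shared_buffers requires a restart):

postgresql.conf:
shared_buffers = 24GB # about 75-80% of RAM, so PostgreSQL itself does most of the caching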
But I don't like this approach. The main reason is that I usually have no servers dedicated only to a database, and I don't want anything to be OOM-killed.
I'll just show you how I set everything up:
echo 360000 > /proc/sys/vm/dirty_expire_centisecs
echo 360000 > /proc/sys/vm/dirty_writeback_centisecs
echo 70 > /proc/sys/vm/dirty_background_ratio
echo 70 > /proc/sys/vm/dirty_ratio
That means: occupy up to 70% of free memory for buffers, flush them every hour.
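These echo commands don't survive a reboot. To make the values persistent you can drop them into a sysctl configuration file (the file name is just an example) and reload, as root:

cat > /etc/sysctl.d/90-pg-writeback.conf <<'EOF'
vm.dirty_expire_centisecs = 360000
vm.dirty_writeback_centisecs = 360000
vm.dirty_background_ratio = 70
vm.dirty_ratio = 70
EOF
sysctl --system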
The problem with all these settings is that, despite the documentation, none of them behaves quite the way you'd expect.
dirty_expire_centisecs doesn't make any difference, no matter what value is set. Maybe it's because the files of a PostgreSQL database are too large, so there are always some expired pages in them, and the system can only flush the whole file. I don't know.
dirty_background_ratio works independently of dirty_writeback_centisecs. I mean, it's like "whatever comes first".
The good part is that you can monitor the current amount of dirty buffers via:
grep Dirty /proc/meminfo
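For a quick look without any monitoring system, watching the counters directly is enough:

# refresh the amount of dirty (not yet written) and in-flight data every 10 seconds
watch -n 10 'grep -E "^(Dirty|Writeback):" /proc/meminfo'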
I even made a munin plugin for that:
Is it harmful to update a single row a million times?
No. It makes zero difference where exactly you write on an SSD. Even if you target a single block on disk, the internal controller will write the data to different physical blocks every time.
Moreover, at the physical level there is no such thing as "modifying a row" in PostgreSQL. It's kind of an append-only database with a VACUUM cleanup on top.
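If you do hammer a single row with updates, the thing to keep an eye on is dead row versions, which autovacuum should be reclaiming. A quick check:

# tables with the most dead tuples and when autovacuum last cleaned them
psql -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"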