What a surprise that I did not hear about containers at all during the whole conference. That is quite special today; database experts are probably rather reluctant toward such hyped technologies. Anyway, it was a very interesting one-day conference around PostgreSQL, so I am sharing my notes, since they might be an interesting starting point for others' further study. Most of the slides are here:
http://p2d2.cz/rocnik-2016/prednasky
The first presentation, given by Pavel Stehule as a last-minute replacement for Tomas Vondra, was about the news in PostgreSQL 9.5, and even without slides it was a very interesting summary with enough detail at the same time. PostgreSQL 9.5 turns out to be wanted by quite a lot of attendees.
Main news in PostgreSQL 9.5
Most of the main PostgreSQL upstream developers work on it as a full-time job, which means the development is professional, even though the project is still true open source and always will be. On the other hand, as the project gets more and more mature, some features stay in development for as long as three years.
As for version 9.5, many users are interested in UPSERT, while performance is not that different from previous versions. UPSERT is similar to MERGE from ANSI SQL, but because MERGE was not implemented completely, the feature is named differently.
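As a minimal sketch of the new 9.5 syntax, here is an insert-or-update on a hypothetical counters table (the table and column names are made up for this example):

```sql
-- Hypothetical table used only for this example
CREATE TABLE counters (name text PRIMARY KEY, hits bigint NOT NULL DEFAULT 0);

-- Insert a new row, or update the existing one when the key already exists
INSERT INTO counters (name, hits)
VALUES ('homepage', 1)
ON CONFLICT (name)
DO UPDATE SET hits = counters.hits + EXCLUDED.hits;
```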
Grouping sets allow creating something like multidimensional views of data.
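A minimal sketch of what that looks like, assuming a hypothetical sales(region, product, amount) table:

```sql
-- One scan produces per-(region, product) sums, per-region sums, and a grand total
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region, product), (region), ());
```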
Locks were reworked a lot, and arrays in PL/pgSQL may improve performance considerably in some complicated scenarios. Also, a sort using the C locale is usually much quicker than with a non-C locale; language-aware sorting is simply slow, but 9.5 makes it much better even for language collations.
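When a language-aware collation is not actually needed for a particular sort, forcing the C collation is a common trick; a sketch with a hypothetical customers table:

```sql
-- Byte-wise comparison under the C collation avoids expensive locale rules
SELECT name FROM customers ORDER BY name COLLATE "C";
```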
Bitmap indexes for columns with many duplicate values do not exist like in Oracle. Something new was created though, originally called minmax, now the BRIN index, which is much smaller than other indexes and is primarily meant for append-only data.
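A minimal sketch of creating one, assuming an append-only events table with a timestamp column:

```sql
-- BRIN keeps only per-block-range summaries (e.g. min/max), so it stays tiny
CREATE INDEX events_created_at_brin ON events USING brin (created_at);
```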
The pg_rewind tool avoids cloning the master: during fail-over we can turn the old master into a slave even though its data is already a bit out of date.
Two things where PostgreSQL is behind commercial DBMSs are utilizing multiple processors and partitioning. Today the HDD is not the bottleneck anymore; with RAM disks and many CPUs, the limit is the speed of one CPU. In 9.6 there is already a couple of commits for parallel queries, which is so great that we might even call 9.6 a 10.0. There are also plans to have logical or BDR replication, and some more patches for making the planner's estimates better using correlations, especially useful for users who do not understand query execution and cannot use EXPLAIN.
Another feature is RLS (Row Level Security), especially interesting for the banking sector, for example.
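A minimal sketch of what RLS looks like in 9.5; the accounts table and the app.branch_id setting are assumptions made up for this example:

```sql
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

-- Each session sees only the rows belonging to its own branch
CREATE POLICY branch_isolation ON accounts
    USING (branch_id = current_setting('app.branch_id')::int);
```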
Andres Freund from Citus Data talked about tuning, giving an overview of the memory architecture and explaining why there is shared memory and private per-process memory. One cache is handled by PostgreSQL, another cache is kept by the operating system when reading files. The clock-sweep algorithm invalidates parts of the buffer so that we know which part of the buffer we can replace; that might be expensive in some cases.
IO in Postgres - Architecture, Tuning, Problems
The reason why WAL is efficient is that it is written sequentially. A checkpoint then writes a bunch of data at one point, and that can be expensive. We can extend the checkpoint timeout, and when a lot of data is written, checkpoint_segments should be tuned up. To speed up starting, we can do an explicit checkpoint before shutdown.
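A sketch of those knobs; the values are arbitrary examples rather than recommendations, and note that 9.5 replaced checkpoint_segments with max_wal_size:

```sql
-- Spread checkpoints out in time and allow more WAL between them
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '4GB';   -- checkpoint_segments on 9.4 and older
SELECT pg_reload_conf();

-- An explicit checkpoint right before a planned shutdown makes the next start faster
CHECKPOINT;
```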
Checkpoints explain why there is sometimes more and sometimes less load in benchmark graphs. Andres also showed how tuning the buffer size and the checkpoint values influences the graphs.
It is interesting that turning off caching in the OS may do a better job, although we need to be aware of the consequences.
Tuning buffers is basically about understanding the workload; no advice can be good enough if it is given without that knowledge. Dropping an index and re-indexing can be faster with lower shared_buffers settings.
Batch runs may use different settings. Slides are available at anarazel.de/talks/pgdevday-prague-2016-02-18/io.pdf.
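As an example of per-batch settings, session-level overrides avoid touching the global configuration; the value here is an arbitrary illustration:

```sql
-- Give this session's sorts and hashes more memory for a one-off batch run
SET work_mem = '256MB';
-- ... run the batch queries here ...
RESET work_mem;
```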
Sasha from Shards.io was speaking about his cool project SQL Tabs.
SQL Tabs
Sometimes the psql console is not enough, even though people love it. So he started a new SQL client with the requirements of being non-blocking, having a black-and-white theme, allowing switching between connections, and so on.
react.js, nodejs, and other technologies were used to create a web-based system that offers browsing, datatype exploration, a context advisor, function listing, query history, and more. It uses libpq in the backend, so the output is as usual. It can also generate simple charts by adding some semantics into SQL comments, selecting the proper type of chart, and writing the comments in markdown. It also allows combining more queries and having a simple report with text, charts, and tables.
It understands some datatypes, so time series are nicely formatted as one would expect.
More info at www.sqltabs.com.
How we started running PostgreSQL (and what we learned)

Ales Zeleny from Ceska Sporitelna shared experiences with PostgreSQL. He advised:
- Try backups before going to production.
- Start with a simple app, such as one for ordering lunches.
- Do not try to make an Oracle out of PostgreSQL.
- Think about what to log, because auditing is missing in PostgreSQL.
- Create a schema for the app instead of using the public schema.
- Separate app modules into their own schemas.
- Sometimes even use separate databases.
- Put tablespaces on their own clusters to avoid influencing other databases when the character of one app changes rapidly.
- Monitoring is important but sometimes troublesome.
- Use check_postgres.pl and autovacuum, configured well for the workload.
- Default privileges help in creating fine-grained permissions (see the sketch after this list).
- Logging can supplement auditing.
- Backup and recovery with barman or pgbackrest, Pacemaker from Red Hat for recovery, and streaming replication with delay.
- Test data recovery scenarios.
- For monitoring they use Telegraf + InfluxDB + Grafana.
- Visualization helps to understand where the problem is.
- Configure autovacuum so that it runs often; then it does a little work each time, so it is quick.
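A minimal sketch of two pieces of advice from the list above; the schema, table, and role names are made up for this example:

```sql
-- Tables created later in schema app_core will automatically be readable by reporting
ALTER DEFAULT PRIVILEGES IN SCHEMA app_core
    GRANT SELECT ON TABLES TO reporting;

-- Per-table autovacuum tuning: vacuum after ~1% dead tuples, so each run stays small
ALTER TABLE app_core.orders SET (autovacuum_vacuum_scale_factor = 0.01);
```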
Open-source clustering solutions for PostgreSQL
Petr Jelínek from 2ndQuadrant talked about general scaling concepts (vertical, horizontal), focusing on OLTP, OLAP, ...
For vertical scaling and OLTP we already have a couple of features in 9.5 (sampling, BRIN indexes, sort optimizations), and in 9.6 there will be the first implementation of parallel query execution.
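The sampling mentioned is the TABLESAMPLE clause new in 9.5; a quick sketch against a hypothetical measurements table:

```sql
-- Reads roughly 1% of the table's pages instead of scanning the whole table
SELECT count(*) FROM measurements TABLESAMPLE SYSTEM (1);
```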
Hot standby was shortly introduced as the integrated solution for scaling; it solves only read scaling, since the slaves are read-only.
PL/Proxy is now almost legacy: developed at Skype, no longer maintained, a map-reduce implementation in PostgreSQL.
Greenplum by Pivotal (a fork of PostgreSQL), stable and open source since autumn 2015 (Apache license), is a kind of MPP (massively parallel processing) database and diverges quite a lot from vanilla PostgreSQL. It has its own tooling and tables optimized for inserts.
Citus Data is going to release CitusDB as an extension only very soon; currently it is a fork of PostgreSQL plus extensions. It will be open source soon as well; for now only pg_shard is open-sourced. It has logical shards, so it duplicates data across the distributed tables.
Postgres-XL comes from Postgres-XC and StormDB and is now open source (MPL), taken over by the community after the company crashed. It will be on 9.5 soon. It is an MPP solution that supports both OLTP and OLAP, but it is more complex to administer and install.
BDR from 2ndQuadrant is also a fork of PostgreSQL (9.4), but its goal is to be integrated back one day. It has already pushed a couple of features into vanilla PostgreSQL (background workers, event triggers, ...). It is an asynchronous multi-master solution (all data on all servers) that uses optimistic conflict detection and resolution (in contrast to MVCC in vanilla PostgreSQL). So it makes the data consistent eventually, after conflict resolution (after commit). It primarily focuses on latency optimization (avoiding global writes across the universe).
Questions about eventual consistency and sequences were answered with the note that nobody can expect a normal application to work the same as before once it runs on a multi-master setup.
Pavel Stehule talked about administration and monitoring, starting with the interesting thought that moving from "it may work" to "I know it works" is quite a small step. Architecture is always the key: client-server is for one server, a cluster scales. A wrong choice at the beginning will affect everything.

Monitoring PostgreSQL in practice
We should think about the database already during user interface design, so that the database can return only a limited set of rows. The important values are the parameters of RAM and IO: while we fit in RAM, everything is fine; then performance degrades, and it does not degrade linearly but by jumps.
Configuration of the database and optimization of queries do not help when the architecture and the app design are wrong.
Sometimes users add indexes without thinking. Sometimes users deploy replication and clusters when it is not necessary; more RAM may help much better, and aggregated data in the database is also not a bad idea.
work_mem, shared_buffers, effective_cache_size, and max_connections are the important parameters: work_mem × max_connections × 2 + shared_buffers + filesystem cache + OS < RAM.
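To make that rule of thumb concrete with made-up numbers: with max_connections = 100 and work_mem = 32MB, queries alone may claim up to 100 × 32MB × 2 = 6.4GB, so together with, say, shared_buffers = 8GB there must still be room left below the total RAM for the filesystem cache and the OS.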
A database server must never swap.
Setting shared_buffers too high (over 20 GB) could make finding free cache too slow, but it always depends on the cache workload characteristics.
effective_cache_size says how big a portion of the index size is actively used.
Strong negative feedback relates to the 80/20 rule: 80% of the workload is generated by 20% of the queries. Sometimes even modules that are not used eat 50% of the workload.
People cannot work with memory today; it is better to read data once, work with it, and write to the database from time to time, not to connect to the database for every little piece.
Monitoring is important for spotting regressions.
Good query performance should be verified so we know it is not an accident.
A disk under load has a much different characteristic than a disk on a machine without load. pg_bench is the other extreme; it does not simulate any real workload.
Some overview is in the pg_stat tables: many rollbacks are wrong, many temp files are wrong. A change in the transaction count is suspicious, like a firewall killing connections by mistake.
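For instance, a quick glance at pg_stat_database shows the rollback and temp-file counters mentioned above:

```sql
-- Many rollbacks or temp files relative to commits usually signal a problem
SELECT datname, xact_commit, xact_rollback, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
```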
Kill very long running queries before they kill the server.
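One possible sketch of such a watchdog query; the 5-minute threshold is an arbitrary example:

```sql
-- Terminate active queries that have been running for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
  AND pid <> pg_backend_pid();
```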
Especially at the beginning of an app's life it is important to monitor queries, because a database is a living organism.
Queries around 50 ms cannot get much faster; if we need more, we need a cache and should not use the database at all.
Monitoring regressions in CPU utilization using profiling tools works well on Linux; the fix is usually to distribute the workload across more processors.
Autovacuum monitoring is also important.
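A simple starting point for that, assuming nothing beyond the standard statistics views:

```sql
-- Tables with many dead tuples and no recent autovacuum deserve a closer look
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```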
And finally, benchmarking is needed, and it should be done soon enough, ideally in the production environment.
Finding out whether something is effective is hard and takes a long time; finding out whether something behaves the same as yesterday is simple and quick. Graphs are the quick way from hoping to knowing.
Too wide tables are usually wrong; in OLTP we should respect the normal forms.
How large is a typical OLTP database today? Usually tens of GB. Today people often store bitmaps in the database, which makes it larger.
RAM should be bigger than 1/10 of the database size.
pg_paxos was introduced by Marco Slot, and we learned it is a framework for distributed problems. Servers sometimes need to agree on something, and that is what the Paxos algorithm is about.

pg_paxos: Table replication through distributed consensus
It has two phases: nodes first agree on participation, then a proposal asks for acceptance, and a majority must agree; with three servers, for example, at least two must accept.
If anything fails, we start over.
The Paxos state machine (multi-paxos) helps to replicate a log across multiple servers.
pg_paxos is an extension with low throughput and high latency, so it is no alternative to streaming or logical replication, and not a magic distributed Postgres either, but it is a useful building block for distributed systems. It is somewhat experimental, written in PL/pgSQL, and available on GitHub in the citusdata organization.
The demo showed how to use three Paxos servers in the cloud to acquire a lock; then communication can begin normally and we know that only one server works with the data.
That's all folks.