Wednesday, August 15, 2007

Learn PostgreSQL: Newbie Questions

A poster on the PostgreSQL pgsql-general mailing list asked some good questions that could be useful to just about anyone. The pgsql-general list is described as a:

General discussion area for users. Apart from compile, acceptance test, and bug problems, most new users will probably only be interested in this mailing list (unless they want to contribute to development or documentation). All non-bug related questions regarding PostgreSQL's version of SQL, and all installation related questions that do not involve bugs or failed compiles, should be restricted to this area. Please note that many of the developers monitor this area.

The poster says he is a long-time MySQL user who is switching to PostgreSQL. He had several reasons but summed it up as, "...the true features included in 5.1 (as of this moment) are nothing to write home about. The InnoDB stuff is highly advocated but it has its own set of issues,...".

So he posted some questions to make sure he was heading in the right direction. You can read the actual email here. I am going to sum up some of the responses for you below, but if these questions interest you, I would suggest reading the entire thread.

The poster asked four questions, but for this entry I am going to concentrate on his first one. I may add additional entries later to cover the others.

His first question was about DML performance on a hot table: roughly 10k inserts/day, 500k selects/day, and 1M updates/day. He had heard that wrapping the statements in a BEGIN TRANSACTION; and COMMIT; would increase performance. This generated many great responses.
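
To make the batching idea concrete, here is a minimal SQL sketch. The table hot_table and its payload column are hypothetical stand-ins for the poster's hot table; the only point is where the COMMIT (and its fsync) happens.

    -- A hypothetical hot table standing in for the poster's workload.
    CREATE TABLE hot_table (id serial PRIMARY KEY, payload text NOT NULL);

    -- In autocommit mode, each statement is its own transaction,
    -- so each one pays the fsync cost at its implicit COMMIT.
    INSERT INTO hot_table (payload) VALUES ('row 1');
    INSERT INTO hot_table (payload) VALUES ('row 2');

    -- Wrapped in one explicit transaction, the fsync cost is paid
    -- once, at the single COMMIT.
    BEGIN;
    INSERT INTO hot_table (payload) VALUES ('row 3');
    INSERT INTO hot_table (payload) VALUES ('row 4');
    COMMIT;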

Here is a partial list of comments and considerations from the thread. Many of these I knew, but there were a few nuggets that were new to me. Some are exact copies of responses and some are slightly paraphrased by me, but the knowledge comes from others.

  • A transaction is a bunch of queries which you want to all get committed or aborted together. The expensive step in Postgres is the COMMIT. Postgres does an fsync, which forces the data to go to disk (if you're using good hardware, and don't have fsync=off). That takes 5-15ms depending on how much data and how fast your drives are.
  • Grouping more work into a single transaction makes the delay for the fsync at COMMIT time less of a problem. Having more connections also helps (but not too many: more than a few per processor is probably not helping, and more than 100 is probably slowing you down), since another process can do some of its work while you're waiting for the fsync.
  • You want "fsync=on" on any machine which holds data you care about. And you want hardware which doesn't lie to you so that "fsync is finished" really means the data is on-disk. Else PostgreSQL cannot ensure ACID compliance.
  • psql runs in autocommit mode by default. If you want multiple queries in a transaction you have to issue a BEGIN statement. Drivers may do various things by default. (There is a short sketch of this, and of the rollback rule below, after this list.)
  • The rule is, if any query within the transaction errors, then all queries within the transaction are rolled back.
  • Build your application around the application needs first, then later look at how to optimize it.
  • Remember the two rules of optimization: 1) Don't 2) (for experts only) Don't yet
  • The only case where you should consider batching together transactions like that is if you're processing a batch data load of some kind. In that case you have a large volume of updates and they're all single-threaded. But usually in that case you want to abort the whole load if you have a problem.
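
Here is the psql sketch referred to above, again using the hypothetical hot_table from the earlier example. It shows the autocommit default, an explicit BEGIN, and what happens when a statement inside the transaction fails; the SHOW at the end simply reports whether fsync is on.

    -- Outside a transaction block, psql autocommits each statement.
    INSERT INTO hot_table (payload) VALUES ('committed on its own');

    -- Inside an explicit transaction, one error aborts the whole thing.
    BEGIN;
    INSERT INTO hot_table (payload) VALUES ('never persisted');
    INSERT INTO hot_table (payload) VALUES (NULL);  -- ERROR: violates NOT NULL
    COMMIT;   -- psql reports ROLLBACK here; neither insert survives

    -- Confirm whether commits are actually being flushed to disk.
    SHOW fsync;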

All in all, some really good info. If you are picking up Postgres for the first time, I would suggest you sign up for the general mailing list and lurk for a while, or just head over to the link above and browse.

LewisC