Problem Statement
Slow-running queries against large tables (>5 GB).
Areas You May Have Already Looked Into
- SQL Query Tuning
- JDBC Code Optimization
 
Areas to Explore More
- DB Design
  - Disk I/O
    - Number of Columns in a Table
    - Choose the Right Data Types for Columns
  - Tune the DB Buffer Size
    - To improve caching
  - The standard recommended solution for dealing with large data set tables: Table Partitioning
    - At a certain point you realize the data size in the database itself is the bottleneck
- Application Design
  - Tune JDBC Code and Design
  - Apply an Application-Level Cache
 
How Postgres Stores Data
To understand the performance improvement suggestions that follow, we first need to understand some of PostgreSQL's basic architecture:
- Disk Files
  - All tables and indexes are stored in OS files
  - The files can be found under <Postgres home>/pgdata/base/
- Page Size
  - The size of a page is fixed at 8,192 bytes (8 KB)
  - You can increase or decrease the page size if you build your own copy of PostgreSQL from source, but all pages within a database will be the same size
  - All disk I/O is performed on a page-by-page basis; when you select a single row from a table, PostgreSQL will read at least one page
- Heap Pages & Index Pages
  - Two different page types: one for the actual table data and another for the index data
- Heap and Index Cache Hits
  - Disk I/O is expensive
  - PostgreSQL caches heap and index pages
  - PostgreSQL itself tracks the access patterns of your data and will, on its own, keep frequently accessed data in cache
  - Memory access is fast; disk access is slow
  - Insertion order matters for effective caching (see the query sketch after this list)
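
To see how well the heap and index caches are doing, you can query PostgreSQL's cumulative statistics views. A minimal sketch using the standard pg_statio_user_tables view:

```sql
-- Approximate heap vs. index cache hit ratios across all user tables
SELECT
  sum(heap_blks_hit) * 100.0
    / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS heap_hit_pct,
  sum(idx_blks_hit) * 100.0
    / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0)   AS idx_hit_pct
FROM pg_statio_user_tables;

-- The fixed page size mentioned above can be confirmed with:
SHOW block_size;  -- typically 8192
```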
 
Minimize Disk I/O
- Normalize the Database
  - Remove unnecessary columns
- Choose the Right Data Types (see the sketch after this list)
  - Avoid using a larger data type when the values are small and fit into a smaller type
  - This has a direct impact on cache hits
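
To illustrate the data-type point, here is a hypothetical events table shown with a wide and a narrow column layout (the table and column names are made up). Narrower types pack more rows into each 8 KB page, so a larger fraction of the table stays in cache:

```sql
-- Hypothetical example: the narrow layout fits more rows per 8 KB page
CREATE TABLE events_wide (
  id         bigint,
  status     varchar(255),  -- only ever holds a handful of short codes
  created_at timestamptz
);

CREATE TABLE events_narrow (
  id         integer,       -- fine as long as ids stay below ~2.1 billion
  status     smallint,      -- map the status codes to small integers
  created_at timestamptz
);
```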
 
Tune DB Buffer Size
- Adjust the DB buffer size (see the sketch after this list)
  - To improve heap-block and index-block cache hits
  - You could increase the cache size until it is large enough to hold the entire table, but that means a large shared memory segment, and if you don't have enough physical memory your system will start to thrash
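
In PostgreSQL the buffer cache is sized by the shared_buffers setting. A minimal sketch, assuming PostgreSQL 9.4 or later for ALTER SYSTEM (on older versions edit postgresql.conf instead); the 2GB value is only an example and should be sized against your tables and physical RAM:

```sql
-- Check the current buffer cache size
SHOW shared_buffers;

-- Example value only; a common starting point is roughly 25% of physical RAM
ALTER SYSTEM SET shared_buffers = '2GB';

-- shared_buffers takes effect only after a server restart
```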
 
Dynamic Table Partitioning for Large Data Set Tables
- Table partitioning is the standard solution for large data set tables (see the sketch after this list)
  - The average number of heap/index blocks you have to navigate in order to find a row goes down
  - Partitioning also helps the planner choose the right scan type for a query
  - There are some maintenance advantages too: you can DROP an individual partition to erase all of the data from that range. This is a common technique for pruning historical data out of a partitioned table, one that avoids the VACUUM cleanup work that DELETE leaves behind.
  - Dynamic partition-creation rules can be set up, which minimizes maintenance overhead and is transparent to the application layer
- Tips:
  - Which column you partition on matters
  - The number of partitions should not be large (< 100)
  - Race condition: if two separate transactions insert at the same time, they may both try to create a child partition table with the same name unless this is handled properly
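
A minimal sketch of range partitioning (table and column names are made up). On PostgreSQL 10 and later this can be declared directly as below; older versions implement the same idea with table inheritance plus an insert trigger that creates child tables on the fly, which is where the partition-naming race condition mentioned in the tips can show up:

```sql
-- Parent table partitioned by a time column
CREATE TABLE measurements (
  id          bigint      NOT NULL,
  recorded_at timestamptz NOT NULL,
  value       numeric
) PARTITION BY RANGE (recorded_at);

-- One child partition per month
CREATE TABLE measurements_2024_01 PARTITION OF measurements
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE measurements_2024_02 PARTITION OF measurements
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Pruning historical data: dropping a partition removes the whole range
-- without the VACUUM debt a bulk DELETE would leave behind
DROP TABLE measurements_2024_01;
```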
 