Problem Statement
Slow-running queries against large data set tables (> 5 GB)
Areas You May Have Already Looked Into
- SQL Query Tuning
- JDBC Code Optimization
Areas to Explore More
- DB Design
  - Disk I/O
  - Number of Columns in a Table
  - Choose the Right Data Types for Columns
  - Tune the DB Buffer Size to improve caching
- Standard recommended solution for large data set tables
  - Table Partitioning, since at a certain point the data size in the database itself becomes the bottleneck
- Application Design
  - Tune JDBC Code and Design
  - Apply an Application-Level Cache
How Postgres Stores Data
To understand the performance improvement suggestions, we first need to understand some basics of PostgreSQL's architecture:
- Disk Files
  - All tables and indexes are stored in OS files
  - The files can be found under <Postgres home>/pgdata/base/ (see the query below for locating a specific table's file)
- Page Size
  - The size of a page is fixed at 8,192 bytes (8 KB)
  - You can increase or decrease the page size if you build your own copy of PostgreSQL from source, but all pages within a database will be the same size
  - All disk I/O is performed on a page-by-page basis; when you select a single row from a table, PostgreSQL will read at least one page
- Heap Pages & Index Pages
  - Two different page types: one holds the actual table data (heap), the other holds index data
- Heap and Index Cache Hits
  - Disk I/O is expensive, so PostgreSQL caches heap and index pages
  - Postgres itself tracks the access patterns of your data and will, on its own, keep frequently accessed data in cache
  - Memory access is fast; disk access is slow
  - Insertion order matters for effective caching
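As an illustration, assuming a hypothetical table named orders, the following queries show the page size the server was built with, the OS file backing the table, and how many pages it currently occupies:

    -- Page size for this server build (normally 8192 bytes)
    SHOW block_size;

    -- On-disk file backing the table, relative to the data directory
    SELECT pg_relation_filepath('orders');

    -- Number of 8 KB pages and estimated row count, as of the last VACUUM/ANALYZE
    SELECT relpages, reltuples FROM pg_class WHERE relname = 'orders';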
Minimize Disk I/O
- Normalize the Database
- Remove unnecessary columns
- Choose the Right Data Types
  - Avoid a larger data type when the values are small enough to fit into a smaller one
  - This has a direct impact on cache hits, as the example below illustrates
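As a rough illustration (the values are arbitrary), pg_column_size shows what each integer type costs per value; narrower types pack more rows into each 8 KB page and therefore improve cache hits:

    -- Bytes per value: smallint (2), integer (4), bigint (8)
    SELECT pg_column_size(1::smallint) AS smallint_bytes,
           pg_column_size(1::integer)  AS integer_bytes,
           pg_column_size(1::bigint)   AS bigint_bytes;

    -- e.g. a status code that never exceeds a few hundred fits in smallint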
Tune DB Buffer Size
- Adjust the DB Buffer Size
  - To improve heap block and index block cache hits (these can be checked with the query below)
- You could increase the cache size to be large enough to hold the entire table, but that is a large shared memory segment, and if you don't have enough physical memory, your system will start to thrash
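In PostgreSQL the buffer cache size is controlled by the shared_buffers setting, and the statistics views show whether heap and index block hits actually improve after tuning. A minimal sketch:

    -- Current buffer cache size (set in postgresql.conf; changing it requires a restart)
    SHOW shared_buffers;

    -- Heap and index block read/hit counters per table; a low hit ratio on a
    -- frequently queried table suggests the buffer cache is too small
    SELECT relname,
           heap_blks_read, heap_blks_hit,
           idx_blks_read,  idx_blks_hit,
           round(heap_blks_hit::numeric
                 / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
    FROM pg_statio_user_tables
    ORDER BY heap_blks_hit + heap_blks_read DESC
    LIMIT 10;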
Dynamic Table Partitioning for Large Data Set Tables
- Table partitioning is the standard solution for large data set tables
- The average number of heap/index blocks you have to navigate in order to find a row goes down
- Partitioning also helps the planner choose the right scan type during queries
- There are some maintenance advantages too. You can DROP an individual partition to erase all of the data from that range. This is a common technique for pruning historical data out of a partitioned table, one that avoids the VACUUM cleanup work that DELETE leaves behind.
- Dynamic partition rules can be set up, which minimizes maintenance overhead and is transparent to the application layer (see the sketch after the tips below)
- Tips:
  - The column you partition on matters
  - The number of partitions should not be large (< 100)
  - Watch for race conditions: if two separate transactions insert at the same time and partition creation is not handled carefully, you may end up with two child partition tables with the same name
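Below is a minimal sketch of trigger-based dynamic partitioning using table inheritance; the events table, the monthly range, and the naming scheme are hypothetical and only illustrate the pattern:

    -- Parent table; child tables are created on demand by the trigger below
    CREATE TABLE events (
        id         bigint      NOT NULL,
        created_at timestamptz NOT NULL,
        payload    text
    );

    CREATE OR REPLACE FUNCTION events_insert_trigger() RETURNS trigger AS $$
    DECLARE
        partition_name text;
        month_start    timestamptz;
    BEGIN
        partition_name := 'events_' || to_char(NEW.created_at, 'YYYYMM');
        month_start    := date_trunc('month', NEW.created_at);

        -- Create the child partition on first use; IF NOT EXISTS narrows, but does
        -- not fully close, the race between two concurrent "first" inserts
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS %I (
                 CHECK (created_at >= %L AND created_at < %L)
             ) INHERITS (events)',
            partition_name, month_start, month_start + interval '1 month');

        -- Route the row into the child table; the parent itself stays empty
        EXECUTE format('INSERT INTO %I SELECT ($1).*', partition_name) USING NEW;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_partition_insert
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE events_insert_trigger();

With constraint_exclusion enabled, queries that filter on created_at touch only the matching child tables, and old months can be pruned with a plain DROP TABLE on the child instead of DELETE followed by VACUUM.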