Partitioning solutions

Why partitioning ?

It is a technique to physically distribute the data into separate data stores to improve scalability , reduce contention and optimize performance.

Case Study 2 : Online food delivery

All the orders that the company receives gets stored in the transactional SQL server. Customer needs to be authenticated to place the order. There is a FTP server which collects invoice PDS and mail them to the customer and analytics team. This architecture is chosen to have faster transactions and maintain consistency. But, the business analytics team is having difficulties to run the historical queries Because

They queries on OLTP system are running for a longer duration
Also there are some columns with PII data which the organization doesn’t want to show to the business analytics team.

Partitioning strategy

Horizontal Partitioning :

→ Also known as sharding.
→ In this partition, every partition is a separate data store
→ But all the partitions has same schema
→ Each partition is called a shard, which holds a subset of data.
Horizontal Partitioning

Vertical Partitioning :

→ Each partitioning, each partition holds a subset of fields .
→ Vertical partitioning helps separate the PII columns and speed up the query by setting up the indexes on the different partitions.

Functional Partitioning :

→ Refers to separate data from different domains into different physical nodes to ensure better isolation and performance. .
Functional Partitioning

Partitioning will gives us
→ Scalability
→ performance
→ Availability

Pain points

→ The sharding key we chose should have high cardinality.
→ If we choose a boolean key which just has two states , then it doesn’t matter if we have 10 nodes in the cluster, data will be partitioned into two nodes only.
→ If we chose a date column as partitioning column but downsides of that is,records with current date will go to the same partition , that will cause contention.
→ if the records date got updated with the current date, records shuffling will happen which will affect the database performance.
→ If we choose a column which has too much cardinality , that will cause too many partitions.

Good partitioning key : points

Enough cardinality
Field should not require reshuffling
Frequency of values should be even

Azure partitioning strategies

Hash distributed tables : , a hash function on the key decides the partition that records will be placed. Used when data is greater than 2GB.
Round robin distribution: distributes the rows equally to all the partitions in a round robin fashion. This will be used if there is no good column to distribute.

Partitioning files in Azure Data Lake Gen2

Represent the folder structure in such a way that it represents partitions. Eg: Date wise partitions