Databases & DBMS

A database is a collection of data organized in a way that makes it easy to access, manage, and update.

Databases are fundamental components of software applications and come in many forms.

What is a Database?

A database stores and manages information in a structured way:

Traditional method: tables, columns, and rows
Typically structured data with a defined schema
Designed for transactional operations (OLTP — CRUD operations)

Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Amazon Aurora

Note: Traditional databases are designed for transactions, NOT for data analytics. That’s where Data Warehouses come in.

Database Management Systems (DBMS)

A DBMS is software that allows users to create, manage, and access databases. It provides:

Centralized data management
Efficient data storage and retrieval
Security and access control
Data integrity enforcement

ACID Properties

The ACID properties ensure database transactions are processed reliably:

Property	Description	Example
Atomicity	Transaction is all-or-nothing	If transfer fails mid-way, entire transaction rolls back
Consistency	Database moves from one valid state to another	A transaction cannot leave the database in an invalid state
Isolation	Concurrent transactions don’t interfere	Each transaction runs as if it’s the only one
Durability	Committed transactions survive system failures	Once committed, data persists even after a crash

Why ACID Matters

Banking: Transfer $100 from Account A to Account B must either complete fully or not at all
E-commerce: Order placement must update inventory and create order record consistently
Healthcare: Patient records must maintain integrity across updates

OLTP vs OLAP

Understanding the difference is critical for choosing the right architecture:

Aspect	OLTP (Transactional)	OLAP (Analytical)
Purpose	Day-to-day operations	Business intelligence & reporting
Operations	CRUD (Create, Read, Update, Delete)	Complex queries, aggregations
Data	Current, real-time	Historical, aggregated
Schema	Normalized (3NF)	Denormalized (Star/Snowflake)
Users	Many concurrent users	Fewer analysts
Query speed	Fast, simple queries	Complex, long-running queries
Examples	Order processing, CRM	Data warehouses, BI dashboards

Database Relationships

Primary Keys and Foreign Keys

Key Type	Purpose	Characteristics
Primary Key	Uniquely identifies each row	One per table, cannot be NULL
Foreign Key	References another table’s primary key	Creates relationships between tables
Surrogate Key	System-generated unique ID	Auto-increment, no business meaning
Natural Key	Real-world identifier	Business meaning (e.g., SSN, email)

Relationship Types

1:1 (One-to-One): One record in Table A relates to one record in Table B
1:N (One-to-Many): One record in Table A relates to many records in Table B
N:M (Many-to-Many): Many records in Table A relate to many records in Table B (requires junction table)

Indexes

Indexes improve query performance by providing fast data lookup:

How Indexes Work

An index stores a sorted copy of selected columns with references to original rows—like a book’s index.

Pros and Cons

Pros	Cons
Faster data retrieval	Consumes storage space
Speeds up WHERE, JOIN, ORDER BY	Slows down INSERT/UPDATE/DELETE
Enforces uniqueness	Requires maintenance

When to Create Indexes

Create indexes on columns that are:

Frequently used in WHERE clauses
Used in JOIN conditions
Used in ORDER BY or GROUP BY

Warning: Over-indexing tables with heavy write operations can degrade performance.

PostgreSQL as a Hybrid System

PostgreSQL can serve as both OLTP and OLAP with proper configuration:

Making PostgreSQL Analytical

Feature	How It Helps
Partitioning	Split large tables by date/region for faster scans
Materialized Views	Pre-compute complex queries for dashboards
Parallel Queries	Use multiple CPU cores for large queries
Extensions (Citus)	Distribute across multiple nodes
Columnar Storage	Extensions like `cstore_fdw` for analytics

When PostgreSQL Works as a DWH

Small to medium data volumes
Need cost-effective solution
Can create separate analytical schema with star model
Don’t need petabyte-scale processing

SQL vs. NoSQL: Choosing Your Foundation

Choosing between a relational (SQL) and non-relational (NoSQL) database is the first step in designing any data architecture.

SQL (RDBMS)

Structure: Rigid, predefined schemas (tables, columns, rows).
Scaling: Primarily vertical (adding power to one server).
Integrity: Strict ACID compliance.
Usage: Financial systems, ERPs, core business entities.
Examples: PostgreSQL, MariaDB, MySQL.

NoSQL (Not Only SQL)

Structure: Flexible, schema-less (JSON-like documents, key-value pairs).
Scaling: Horizontal (adding more servers/nodes).
Integrity: Often sacrifices strict consistency for availability (BASE model).
Usage: Web apps, mobile feeds, real-time analytics, IoT.
Examples: MongoDB, Redis, Cassandra.

The Self-Hostable Stack

For D&A professionals interested in Self-Hosting or Home Labs, these databases offer excellent Docker support and performance.

MySQL & MariaDB

MariaDB is a popular “drop-in” replacement for MySQL.

Self-Hosting Tip: MariaDB is often preferred for ARM64 (Raspberry Pi) projects as it provides stable official containers for low-power hardware.

PostgreSQL

The “Swiss Army Knife” of databases. As mentioned before, its hybrid nature makes it a favorite for small-to-medium data projects.

Special Purpose NoSQL

Type	Examples	Best For
Document	MongoDB	Handling semi-structured JSON data; content management.
Key-Value	Redis	In-memory caching, session management, real-time message brokering.
Time-Series	InfluxDB	Storing metric data from IoT sensors or server monitoring.
Wide-Column	Cassandra	Massive, distributed datasets requiring high availability.
Graph	Neo4j	Analyzing highly connected data (social networks, fraud detection).

Connecting with Python: The Ecosystem

Most D&A workflows involve connecting to these databases via Python. Here are the standard drivers you will encounter:

Database	Python Library
MySQL / MariaDB	`mysql-connector-python` (Official)
PostgreSQL	`psycopg2` or `psycopg3`
MongoDB	`pymongo`
Redis	`redis-py`
Cassandra	`cassandra-driver`
Kafka	`confluent-kafka`

[!TIP] Always use Connection Pooling in production environments to manage multiple database connections efficiently and avoid exhausting server resources.

Key Takeaways

Databases are for transactions (OLTP), data warehouses are for analytics (OLAP)
ACID properties ensure reliable transaction processing
Indexes speed up reads but slow down writes
Primary/Foreign keys establish relationships and data integrity
PostgreSQL can be hybrid with proper optimization, but has limits at scale