Databases & DBMS
A database is a collection of data organized in a way that makes it easy to access, manage, and update.
Databases are fundamental components of software applications and come in many forms.
What is a Database?
A database stores and manages information in a structured way:
- Traditional method: tables, columns, and rows
- Typically structured data with a defined schema
- Designed for transactional operations (OLTP — CRUD operations)
Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Amazon Aurora
Note: Traditional databases are designed for transactions, NOT for data analytics. That’s where Data Warehouses come in.
Database Management Systems (DBMS)
A DBMS is software that allows users to create, manage, and access databases. It provides:
- Centralized data management
- Efficient data storage and retrieval
- Security and access control
- Data integrity enforcement
ACID Properties
The ACID properties ensure database transactions are processed reliably:
| Property | Description | Example |
|---|---|---|
| Atomicity | Transaction is all-or-nothing | If transfer fails mid-way, entire transaction rolls back |
| Consistency | Database moves from one valid state to another | A transaction cannot leave the database in an invalid state |
| Isolation | Concurrent transactions don’t interfere | Each transaction runs as if it’s the only one |
| Durability | Committed transactions survive system failures | Once committed, data persists even after a crash |
Why ACID Matters
- Banking: Transfer $100 from Account A to Account B must either complete fully or not at all
- E-commerce: Order placement must update inventory and create order record consistently
- Healthcare: Patient records must maintain integrity across updates
OLTP vs OLAP
Understanding the difference is critical for choosing the right architecture:
| Aspect | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| Purpose | Day-to-day operations | Business intelligence & reporting |
| Operations | CRUD (Create, Read, Update, Delete) | Complex queries, aggregations |
| Data | Current, real-time | Historical, aggregated |
| Schema | Normalized (3NF) | Denormalized (Star/Snowflake) |
| Users | Many concurrent users | Fewer analysts |
| Query speed | Fast, simple queries | Complex, long-running queries |
| Examples | Order processing, CRM | Data warehouses, BI dashboards |
Database Relationships
Primary Keys and Foreign Keys
| Key Type | Purpose | Characteristics |
|---|---|---|
| Primary Key | Uniquely identifies each row | One per table, cannot be NULL |
| Foreign Key | References another table’s primary key | Creates relationships between tables |
| Surrogate Key | System-generated unique ID | Auto-increment, no business meaning |
| Natural Key | Real-world identifier | Business meaning (e.g., SSN, email) |
Relationship Types
- 1:1 (One-to-One): One record in Table A relates to one record in Table B
- 1:N (One-to-Many): One record in Table A relates to many records in Table B
- N:M (Many-to-Many): Many records in Table A relate to many records in Table B (requires junction table)
Indexes
Indexes improve query performance by providing fast data lookup:
How Indexes Work
An index stores a sorted copy of selected columns with references to original rows—like a book’s index.
Pros and Cons
| Pros | Cons |
|---|---|
| Faster data retrieval | Consumes storage space |
| Speeds up WHERE, JOIN, ORDER BY | Slows down INSERT/UPDATE/DELETE |
| Enforces uniqueness | Requires maintenance |
When to Create Indexes
Create indexes on columns that are:
- Frequently used in
WHEREclauses - Used in
JOINconditions - Used in
ORDER BYorGROUP BY
Warning: Over-indexing tables with heavy write operations can degrade performance.
PostgreSQL as a Hybrid System
PostgreSQL can serve as both OLTP and OLAP with proper configuration:
Making PostgreSQL Analytical
| Feature | How It Helps |
|---|---|
| Partitioning | Split large tables by date/region for faster scans |
| Materialized Views | Pre-compute complex queries for dashboards |
| Parallel Queries | Use multiple CPU cores for large queries |
| Extensions (Citus) | Distribute across multiple nodes |
| Columnar Storage | Extensions like cstore_fdw for analytics |
When PostgreSQL Works as a DWH
- Small to medium data volumes
- Need cost-effective solution
- Can create separate analytical schema with star model
- Don’t need petabyte-scale processing
SQL vs. NoSQL: Choosing Your Foundation
Choosing between a relational (SQL) and non-relational (NoSQL) database is the first step in designing any data architecture.
SQL (RDBMS)
- Structure: Rigid, predefined schemas (tables, columns, rows).
- Scaling: Primarily vertical (adding power to one server).
- Integrity: Strict ACID compliance.
- Usage: Financial systems, ERPs, core business entities.
- Examples: PostgreSQL, MariaDB, MySQL.
NoSQL (Not Only SQL)
- Structure: Flexible, schema-less (JSON-like documents, key-value pairs).
- Scaling: Horizontal (adding more servers/nodes).
- Integrity: Often sacrifices strict consistency for availability (BASE model).
- Usage: Web apps, mobile feeds, real-time analytics, IoT.
- Examples: MongoDB, Redis, Cassandra.
The Self-Hostable Stack
For D&A professionals interested in Self-Hosting or Home Labs, these databases offer excellent Docker support and performance.
MySQL & MariaDB
MariaDB is a popular “drop-in” replacement for MySQL.
- Self-Hosting Tip: MariaDB is often preferred for ARM64 (Raspberry Pi) projects as it provides stable official containers for low-power hardware.
PostgreSQL
The “Swiss Army Knife” of databases. As mentioned before, its hybrid nature makes it a favorite for small-to-medium data projects.
Special Purpose NoSQL
| Type | Examples | Best For |
|---|---|---|
| Document | MongoDB | Handling semi-structured JSON data; content management. |
| Key-Value | Redis | In-memory caching, session management, real-time message brokering. |
| Time-Series | InfluxDB | Storing metric data from IoT sensors or server monitoring. |
| Wide-Column | Cassandra | Massive, distributed datasets requiring high availability. |
| Graph | Neo4j | Analyzing highly connected data (social networks, fraud detection). |
Connecting with Python: The Ecosystem
Most D&A workflows involve connecting to these databases via Python. Here are the standard drivers you will encounter:
| Database | Python Library |
|---|---|
| MySQL / MariaDB | mysql-connector-python (Official) |
| PostgreSQL | psycopg2 or psycopg3 |
| MongoDB | pymongo |
| Redis | redis-py |
| Cassandra | cassandra-driver |
| Kafka | confluent-kafka |
[!TIP] Always use Connection Pooling in production environments to manage multiple database connections efficiently and avoid exhausting server resources.
Key Takeaways
- Databases are for transactions (OLTP), data warehouses are for analytics (OLAP)
- ACID properties ensure reliable transaction processing
- Indexes speed up reads but slow down writes
- Primary/Foreign keys establish relationships and data integrity
- PostgreSQL can be hybrid with proper optimization, but has limits at scale