JAlcocerTech E-books

Databases & DBMS

A database is a collection of data organized in a way that makes it easy to access, manage, and update.

Databases are fundamental components of software applications and come in many forms.


What is a Database?

A database stores and manages information in a structured way:

  • Traditional method: tables, columns, and rows
  • Typically structured data with a defined schema
  • Designed for transactional operations (OLTP — CRUD operations)

Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Amazon Aurora

Note: Traditional databases are designed for transactions, NOT for data analytics. That’s where Data Warehouses come in.


Database Management Systems (DBMS)

A DBMS is software that allows users to create, manage, and access databases. It provides:

  • Centralized data management
  • Efficient data storage and retrieval
  • Security and access control
  • Data integrity enforcement

ACID Properties

The ACID properties ensure database transactions are processed reliably:

PropertyDescriptionExample
AtomicityTransaction is all-or-nothingIf transfer fails mid-way, entire transaction rolls back
ConsistencyDatabase moves from one valid state to anotherA transaction cannot leave the database in an invalid state
IsolationConcurrent transactions don’t interfereEach transaction runs as if it’s the only one
DurabilityCommitted transactions survive system failuresOnce committed, data persists even after a crash

Why ACID Matters

  • Banking: Transfer $100 from Account A to Account B must either complete fully or not at all
  • E-commerce: Order placement must update inventory and create order record consistently
  • Healthcare: Patient records must maintain integrity across updates

OLTP vs OLAP

Understanding the difference is critical for choosing the right architecture:

AspectOLTP (Transactional)OLAP (Analytical)
PurposeDay-to-day operationsBusiness intelligence & reporting
OperationsCRUD (Create, Read, Update, Delete)Complex queries, aggregations
DataCurrent, real-timeHistorical, aggregated
SchemaNormalized (3NF)Denormalized (Star/Snowflake)
UsersMany concurrent usersFewer analysts
Query speedFast, simple queriesComplex, long-running queries
ExamplesOrder processing, CRMData warehouses, BI dashboards

Database Relationships

Primary Keys and Foreign Keys

Key TypePurposeCharacteristics
Primary KeyUniquely identifies each rowOne per table, cannot be NULL
Foreign KeyReferences another table’s primary keyCreates relationships between tables
Surrogate KeySystem-generated unique IDAuto-increment, no business meaning
Natural KeyReal-world identifierBusiness meaning (e.g., SSN, email)

Relationship Types

  • 1:1 (One-to-One): One record in Table A relates to one record in Table B
  • 1:N (One-to-Many): One record in Table A relates to many records in Table B
  • N:M (Many-to-Many): Many records in Table A relate to many records in Table B (requires junction table)

Indexes

Indexes improve query performance by providing fast data lookup:

How Indexes Work

An index stores a sorted copy of selected columns with references to original rows—like a book’s index.

Pros and Cons

ProsCons
Faster data retrievalConsumes storage space
Speeds up WHERE, JOIN, ORDER BYSlows down INSERT/UPDATE/DELETE
Enforces uniquenessRequires maintenance

When to Create Indexes

Create indexes on columns that are:

  • Frequently used in WHERE clauses
  • Used in JOIN conditions
  • Used in ORDER BY or GROUP BY

Warning: Over-indexing tables with heavy write operations can degrade performance.


PostgreSQL as a Hybrid System

PostgreSQL can serve as both OLTP and OLAP with proper configuration:

Making PostgreSQL Analytical

FeatureHow It Helps
PartitioningSplit large tables by date/region for faster scans
Materialized ViewsPre-compute complex queries for dashboards
Parallel QueriesUse multiple CPU cores for large queries
Extensions (Citus)Distribute across multiple nodes
Columnar StorageExtensions like cstore_fdw for analytics

When PostgreSQL Works as a DWH

  • Small to medium data volumes
  • Need cost-effective solution
  • Can create separate analytical schema with star model
  • Don’t need petabyte-scale processing

SQL vs. NoSQL: Choosing Your Foundation

Choosing between a relational (SQL) and non-relational (NoSQL) database is the first step in designing any data architecture.

SQL (RDBMS)

  • Structure: Rigid, predefined schemas (tables, columns, rows).
  • Scaling: Primarily vertical (adding power to one server).
  • Integrity: Strict ACID compliance.
  • Usage: Financial systems, ERPs, core business entities.
  • Examples: PostgreSQL, MariaDB, MySQL.

NoSQL (Not Only SQL)

  • Structure: Flexible, schema-less (JSON-like documents, key-value pairs).
  • Scaling: Horizontal (adding more servers/nodes).
  • Integrity: Often sacrifices strict consistency for availability (BASE model).
  • Usage: Web apps, mobile feeds, real-time analytics, IoT.
  • Examples: MongoDB, Redis, Cassandra.

The Self-Hostable Stack

For D&A professionals interested in Self-Hosting or Home Labs, these databases offer excellent Docker support and performance.

MySQL & MariaDB

MariaDB is a popular “drop-in” replacement for MySQL.

  • Self-Hosting Tip: MariaDB is often preferred for ARM64 (Raspberry Pi) projects as it provides stable official containers for low-power hardware.

PostgreSQL

The “Swiss Army Knife” of databases. As mentioned before, its hybrid nature makes it a favorite for small-to-medium data projects.


Special Purpose NoSQL

TypeExamplesBest For
DocumentMongoDBHandling semi-structured JSON data; content management.
Key-ValueRedisIn-memory caching, session management, real-time message brokering.
Time-SeriesInfluxDBStoring metric data from IoT sensors or server monitoring.
Wide-ColumnCassandraMassive, distributed datasets requiring high availability.
GraphNeo4jAnalyzing highly connected data (social networks, fraud detection).

Connecting with Python: The Ecosystem

Most D&A workflows involve connecting to these databases via Python. Here are the standard drivers you will encounter:

DatabasePython Library
MySQL / MariaDBmysql-connector-python (Official)
PostgreSQLpsycopg2 or psycopg3
MongoDBpymongo
Redisredis-py
Cassandracassandra-driver
Kafkaconfluent-kafka

[!TIP] Always use Connection Pooling in production environments to manage multiple database connections efficiently and avoid exhausting server resources.


Key Takeaways

  1. Databases are for transactions (OLTP), data warehouses are for analytics (OLAP)
  2. ACID properties ensure reliable transaction processing
  3. Indexes speed up reads but slow down writes
  4. Primary/Foreign keys establish relationships and data integrity
  5. PostgreSQL can be hybrid with proper optimization, but has limits at scale