JAlcocerTech E-books

Self-Hosted Databases for Data & Analytics

A quick recap of databases for Data & Analytics professionals interested in self-hosting.

This chapter covers both SQL and NoSQL databases that you can deploy on your own infrastructure using Docker.

SQL Databases

The structured databases you’ll find throughout Data & Analytics work.

MySQL

MySQL is one of the most popular open-source relational databases.

Key Features:

  • ACID compliance
  • Strong community support
  • Wide ecosystem of tools
  • Good performance for read-heavy workloads

Python Client Installation:

pip install mysql-connector-python
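
Python Connection:

With the connector installed, a connection can be sketched as follows. The host, credentials, and database name are placeholders, assuming a local MySQL server; the same connector also works against MariaDB.

```python
# Hypothetical credentials -- replace with your own deployment's values.
DB_CONFIG = {
    "host": "localhost",
    "port": 3306,
    "user": "analyst",
    "password": "analystpass",
    "database": "analytics",
}

try:
    import mysql.connector  # provided by mysql-connector-python

    conn = mysql.connector.connect(**DB_CONFIG)
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    print(f"Connected to MySQL {cursor.fetchone()[0]}")
    cursor.close()
    conn.close()
except Exception as e:
    # A missing driver (ImportError) and connection failures both land here
    print(f"Could not connect: {e}")
```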

MariaDB

MariaDB is a MySQL fork created by the original MySQL developers, offering improved performance and features.

Why MariaDB?

  • Drop-in replacement for MySQL
  • Better performance
  • More storage engines
  • Active development
  • ARM64 support (great for Raspberry Pi)

Docker Setup:

version: '3.8'
services:
  mariadb:
    image: mariadb:latest
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: rootpassword
      MYSQL_DATABASE: analytics
      MYSQL_USER: analyst
      MYSQL_PASSWORD: analystpass
    ports:
      - "3306:3306"
    volumes:
      - mariadb-data:/var/lib/mysql
    restart: unless-stopped

volumes:
  mariadb-data:

Use Cases:

  • Web applications
  • Data warehousing (small to medium scale)
  • Analytics workloads
  • IoT data storage

PostgreSQL

PostgreSQL is widely regarded as the most advanced open-source relational database.

Key Features:

  • Advanced SQL features
  • JSON/JSONB support
  • Full-text search
  • Geospatial data (PostGIS)
  • ACID compliance
  • Extensibility

Docker Setup:

version: '3.8'
services:
  postgres:
    image: postgres:15
    container_name: postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: securepassword
      POSTGRES_DB: analytics
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  postgres-data:

Python Client Installation:

pip install psycopg2-binary
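
Python Connection:

A minimal connection sketch, assuming the server from the compose file above is running locally; the credentials are the placeholder values from that file.

```python
DB_CONFIG = {
    "host": "localhost",
    "port": 5432,
    "dbname": "analytics",
    "user": "postgres",
    "password": "securepassword",  # matches the compose file above
}

try:
    import psycopg2  # provided by psycopg2-binary

    conn = psycopg2.connect(**DB_CONFIG)
    with conn.cursor() as cursor:
        cursor.execute("SELECT version();")
        print(cursor.fetchone()[0])
    conn.close()
except Exception as e:
    print(f"Could not connect: {e}")
```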

Use Cases:

  • Complex analytical queries
  • Data warehousing
  • Geospatial analysis
  • Time-series data (with TimescaleDB extension)
  • AI/ML data pipelines

Advanced Features:

TimescaleDB Extension:

TimescaleDB turns PostgreSQL into a time-series database:

Docker with TimescaleDB:

version: '3.8'
services:
  timescaledb:
    image: timescale/timescaledb:latest-pg15
    container_name: timescaledb
    environment:
      POSTGRES_PASSWORD: securepassword
    ports:
      - "5432:5432"
    volumes:
      - timescale-data:/var/lib/postgresql/data

volumes:
  timescale-data:
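
Creating a Hypertable:

Once the container is up, a plain table is converted into a time-partitioned hypertable with TimescaleDB's create_hypertable() function. A sketch using psycopg2; the table and column names are hypothetical.

```python
# Hypothetical table and column names for illustration.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    sensor_id   TEXT        NOT NULL,
    temperature DOUBLE PRECISION
);
"""

# create_hypertable() is TimescaleDB's function for partitioning by time.
CREATE_HYPERTABLE = (
    "SELECT create_hypertable('sensor_readings', 'time', if_not_exists => TRUE);"
)

try:
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                            user="postgres", password="securepassword")
    conn.autocommit = True
    with conn.cursor() as cursor:
        cursor.execute(CREATE_TABLE)
        cursor.execute(CREATE_HYPERTABLE)
    conn.close()
except Exception as e:
    print(f"Could not set up hypertable: {e}")
```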

NoSQL Databases

NoSQL databases offer flexible schemas and horizontal scalability, making them well suited to diverse data types and high-volume workloads.

Why NoSQL?

Advantages:

  • Flexible Schemas: Handle diverse data without rigid table structures
  • Horizontal Scalability: Add servers to handle more data
  • High Performance: Optimized for specific use cases
  • Agile Development: Adapt to changing requirements quickly

NoSQL Database Types:

Type              Examples              Best For
Document          MongoDB, Couchbase    Semi-structured data, JSON
Key-Value         Redis, DynamoDB       Caching, session management
Column-Oriented   Cassandra, HBase      Big data analytics
Graph             Neo4j                 Social networks, relationships

MongoDB

MongoDB is a popular document-oriented NoSQL database.

Key Features:

  • JSON-like documents (BSON)
  • Flexible schema
  • Rich query language
  • Aggregation framework
  • Horizontal scaling (sharding)
  • Replication for high availability

Docker Setup:

version: '3.8'
services:
  mongodb:
    image: mongo:latest
    container_name: mongodb
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: adminpass
    ports:
      - "27017:27017"
    volumes:
      - mongo-data:/data/db
    restart: unless-stopped

volumes:
  mongo-data:

Python Connection (PyMongo):

from pymongo import MongoClient

try:
    # Establish a connection
    client = MongoClient("mongodb://admin:adminpass@localhost:27017/")
    
    # Access a database
    db = client["mydatabase"]
    
    # Access a collection
    collection = db["mycollection"]
    
    # Insert a document
    document = {"name": "John Doe", "age": 30, "city": "New York"}
    result = collection.insert_one(document)
    print(f"Inserted document ID: {result.inserted_id}")
    
    # Query documents
    for doc in collection.find({"age": {"$gt": 25}}):
        print(doc)

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the connection
    if 'client' in locals() and client is not None:
        client.close()

Python Client Installation:

pip install pymongo

Use Cases:

  • Content management systems
  • Real-time analytics
  • IoT sensor data
  • User profiles and preferences
  • Catalog data

InfluxDB

InfluxDB is a time-series database optimized for time-stamped data.

Key Features:

  • Purpose-built for time-series data
  • High write throughput
  • Built-in retention policies
  • Downsampling and aggregation
  • Two query languages: Flux and the SQL-like InfluxQL

Docker Setup:

version: '3.8'
services:
  influxdb:
    image: influxdb:2.7
    container_name: influxdb
    ports:
      - "8086:8086"
    volumes:
      - influxdb-data:/var/lib/influxdb2
      - influxdb-config:/etc/influxdb2
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=adminpass
      - DOCKER_INFLUXDB_INIT_ORG=myorg
      - DOCKER_INFLUXDB_INIT_BUCKET=mybucket
    restart: unless-stopped

volumes:
  influxdb-data:
  influxdb-config:

Python Connection:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Initialize client
client = InfluxDBClient(
    url="http://localhost:8086",
    token="your-token",
    org="myorg"
)

# Write data
write_api = client.write_api(write_options=SYNCHRONOUS)

point = Point("temperature") \
    .tag("location", "room1") \
    .field("value", 23.5)

write_api.write(bucket="mybucket", record=point)

# Query data
query_api = client.query_api()
query = 'from(bucket:"mybucket") |> range(start: -1h)'
result = query_api.query(query=query)

for table in result:
    for record in table.records:
        print(record)

client.close()

Use Cases:

  • IoT sensor monitoring
  • Application metrics
  • Server monitoring
  • Financial data
  • Real-time analytics

Specialized Databases

Elasticsearch & Kibana

Elasticsearch is a distributed search and analytics engine.

Key Features:

  • Full-text search
  • Real-time indexing
  • Distributed architecture
  • RESTful API
  • JSON documents

Docker Setup (ELK Stack):

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es-data:
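
Python Connection:

Because Elasticsearch exposes a RESTful JSON API, the standard library is enough for a quick test. A sketch that indexes one document; the index name and document fields are hypothetical, assuming the single-node setup above with security disabled.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"

def index_request(index: str, doc: dict) -> urllib.request.Request:
    """Build a POST request that indexes one JSON document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = index_request("app-logs", {"level": "ERROR", "message": "disk full"})

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.loads(resp.read()))
except Exception as e:
    print(f"Could not reach Elasticsearch: {e}")
```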

Use Cases:

  • Log analysis
  • Application search
  • Security analytics (SIEM)
  • Business analytics

Redis

Redis is an in-memory data store used for caching and real-time applications.

Key Features:

  • In-memory storage (extremely fast)
  • Data structures (strings, hashes, lists, sets, sorted sets)
  • Pub/Sub messaging
  • Persistence options
  • Atomic operations

Docker Setup:

version: '3.8'
services:
  redis:
    image: redis:7-alpine
    container_name: redis
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

volumes:
  redis-data:

Python Client Installation:

pip install redis
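
Python Connection:

A minimal sketch with the redis-py client, assuming the container above is listening on the default port. The key name is a placeholder.

```python
REDIS_CONFIG = {"host": "localhost", "port": 6379, "decode_responses": True}

try:
    import redis

    r = redis.Redis(**REDIS_CONFIG)
    r.set("page:views", 0)
    r.incr("page:views")        # atomic increment -- safe under concurrency
    print(r.get("page:views"))
except Exception as e:
    print(f"Could not reach Redis: {e}")
```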

Use Cases:

  • Caching
  • Session management
  • Real-time analytics
  • Message queuing
  • Leaderboards

Cassandra

Apache Cassandra is a distributed NoSQL database for high availability.

Key Features:

  • Distributed architecture
  • No single point of failure
  • Linear scalability
  • Tunable consistency
  • High write throughput

Python Client Installation:

pip install cassandra-driver
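
Python Connection:

A connection sketch using the official driver, assuming a single Cassandra node on localhost; the keyspace, table, and replication settings are illustrative placeholders (SimpleStrategy with a replication factor of 1 is only suitable for a single-node test setup).

```python
# Hypothetical keyspace and table names for illustration.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS analytics
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
"""

try:
    from cassandra.cluster import Cluster  # provided by cassandra-driver

    cluster = Cluster(["localhost"])
    session = cluster.connect()
    session.execute(CREATE_KEYSPACE)
    session.execute("""
        CREATE TABLE IF NOT EXISTS analytics.events (
            device_id text,
            ts        timestamp,
            value     double,
            PRIMARY KEY (device_id, ts)
        );
    """)
    cluster.shutdown()
except Exception as e:
    print(f"Could not reach Cassandra: {e}")
```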

Use Cases:

  • Time-series data
  • IoT applications
  • Messaging systems
  • Product catalogs

Big Data & Streaming

Apache Kafka

Kafka is a distributed event streaming platform.

Key Features:

  • High-throughput message broker
  • Distributed architecture
  • Fault-tolerant
  • Real-time data pipelines
  • Stream processing

Use Cases:

  • Real-time data pipelines
  • Event sourcing
  • Log aggregation
  • Stream processing
  • Microservices communication
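
Producing Messages (Python):

A producer sketch, assuming the confluent-kafka package (pip install confluent-kafka) and a broker on localhost:9092; the topic name and message payload are hypothetical.

```python
PRODUCER_CONFIG = {"bootstrap.servers": "localhost:9092"}

try:
    from confluent_kafka import Producer  # pip install confluent-kafka

    producer = Producer(PRODUCER_CONFIG)
    # produce() is asynchronous: messages are buffered locally
    producer.produce("sensor-events", key="sensor-1", value='{"temp": 23.5}')
    # flush() blocks until buffered messages are delivered (or the timeout hits)
    producer.flush(5)
except Exception as e:
    print(f"Could not produce to Kafka: {e}")
```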

Apache Druid

Druid is a real-time analytics database.

Key Features:

  • Sub-second query latency
  • Real-time data ingestion
  • Columnar storage
  • Time-series optimization
  • Horizontal scalability

Use Cases:

  • Real-time dashboards
  • Clickstream analytics
  • Network telemetry
  • Application performance monitoring

Integration:

  • Works well with Kafka for ingestion
  • Superset for visualization
  • SQL interface for querying
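
Druid's SQL interface is served over HTTP, so a query can be sketched with the standard library alone. This assumes the router is listening on its default port 8888; the datasource name is hypothetical.

```python
import json
import urllib.request

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # router's SQL endpoint

# Hypothetical datasource name.
payload = json.dumps({"query": "SELECT COUNT(*) AS events FROM clickstream"})

req = urllib.request.Request(
    DRUID_SQL_URL,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.loads(resp.read()))
except Exception as e:
    print(f"Could not query Druid: {e}")
```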

Trino (formerly PrestoSQL)

Trino is a distributed SQL query engine.

Key Features:

  • Query data across multiple sources
  • Federated queries
  • High performance
  • ANSI SQL support
  • Connector-based architecture

Use Cases:

  • Data lake analytics
  • Cross-database queries
  • Ad-hoc analysis
  • Data virtualization
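
Python Connection:

A sketch using the trino client package (pip install trino), assuming a coordinator on localhost:8080; the catalog, schema, and table names are placeholders that depend entirely on which connectors you have configured.

```python
TRINO_CONFIG = {
    "host": "localhost",
    "port": 8080,
    "user": "analyst",        # Trino attributes queries to a user name
    "catalog": "postgresql",  # hypothetical connector name
    "schema": "public",
}

try:
    import trino  # pip install trino

    conn = trino.dbapi.connect(**TRINO_CONFIG)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM orders")  # hypothetical table
    print(cursor.fetchone())
except Exception as e:
    print(f"Could not query Trino: {e}")
```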

Comparison:

Tool    Purpose              Best For
Kafka   Event streaming      Data ingestion, real-time pipelines
Druid   Real-time analytics  Sub-second queries, time-series
Trino   Federated queries    Cross-source analytics

Typical Architecture:

Data Sources → Kafka → Druid → Visualization (Superset)
                         └→ Trino (for ad-hoc queries)

Database Comparison

SQL vs NoSQL

Feature      SQL              NoSQL
Schema       Fixed            Flexible
Scalability  Vertical         Horizontal
ACID         Yes              Varies
Joins        Yes              Limited
Best For     Structured data  Diverse data types

Choosing the Right Database

Use PostgreSQL when:

  • You need ACID compliance
  • Complex queries and joins
  • Relational data
  • Strong consistency

Use MongoDB when:

  • Flexible schema needed
  • Document-oriented data
  • Rapid development
  • Horizontal scaling

Use InfluxDB when:

  • Time-series data
  • IoT metrics
  • High write throughput
  • Retention policies

Use Redis when:

  • Caching required
  • Session storage
  • Real-time analytics
  • Sub-millisecond latency

Use Elasticsearch when:

  • Full-text search
  • Log analysis
  • Real-time indexing
  • Complex queries on unstructured data

Docker Best Practices

Data Persistence

Always use volumes for database data:

volumes:
  db-data:
    driver: local

Environment Variables

Use .env files for sensitive data:

# .env
POSTGRES_PASSWORD=securepassword
MONGO_ROOT_PASSWORD=mongopass

# docker-compose.yml
services:
  postgres:
    env_file:
      - .env

Networking

Create dedicated networks:

networks:
  db-network:
    driver: bridge

services:
  postgres:
    networks:
      - db-network
  app:
    networks:
      - db-network

Backups

PostgreSQL Backup:

docker exec postgres pg_dump -U postgres analytics > backup.sql

MongoDB Backup:

docker exec mongodb mongodump -u admin -p adminpass --authenticationDatabase admin --out /backup
docker cp mongodb:/backup ./mongo-backup

Restore:

docker exec -i postgres psql -U postgres analytics < backup.sql
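
Scripted Backups (Python):

The pg_dump command above can be wrapped in a small script that writes date-stamped files, which makes it easy to schedule with cron. A sketch, assuming the container and credentials from the PostgreSQL compose file; adjust the names to your deployment.

```python
import datetime
import subprocess

def backup_filename(database: str) -> str:
    """Date-stamped dump name, e.g. analytics-2025-01-31.sql."""
    return f"{database}-{datetime.date.today().isoformat()}.sql"

def backup_postgres(container: str = "postgres", user: str = "postgres",
                    database: str = "analytics") -> str:
    """Run pg_dump inside the container and write the dump locally."""
    result = subprocess.run(
        ["docker", "exec", container, "pg_dump", "-U", user, database],
        stdout=subprocess.PIPE, check=True,
    )
    outfile = backup_filename(database)
    with open(outfile, "wb") as f:
        f.write(result.stdout)
    return outfile

try:
    print(f"Wrote {backup_postgres()}")
except Exception as e:
    print(f"Backup failed: {e}")
```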

Resource Limits

services:
  postgres:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G

Conclusion

Self-hosting databases gives you:

  • Control: Full control over your data
  • Cost Savings: No managed-database fees (hardware and power are still on you)
  • Privacy: Data stays on your infrastructure
  • Learning: Deep understanding of database internals

Getting Started:

  1. Start Simple: Begin with PostgreSQL or MariaDB
  2. Understand Your Needs: Match database to use case
  3. Use Docker: Simplifies deployment
  4. Plan for Backups: Implement backup strategies
  5. Monitor Performance: Use appropriate tools

Next Steps:

  • Deploy a test database
  • Connect from your application
  • Implement backup procedures
  • Monitor performance
  • Scale as needed

Each database has its strengths—choose based on your data model, query patterns, and scalability requirements.