Self-Hosted Databases for Data & Analytics
A quick recap of databases for Data & Analytics professionals interested in self-hosting.
This chapter covers both SQL and NoSQL databases that you can deploy on your own infrastructure using Docker.
SQL Databases
The structured databases you’ll find throughout Data & Analytics work.
MySQL
MySQL is one of the most popular open-source relational databases.
Key Features:
- ACID compliance
- Strong community support
- Wide ecosystem of tools
- Good performance for read-heavy workloads
Installation:
```bash
pip install mysql-connector-python
```
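A minimal connection check using the connector. The host, credentials, and database name below are placeholders matching the MariaDB compose example in this chapter; adjust them to your own deployment. The import sits inside the `try` block so the snippet degrades gracefully if the connector or server is unavailable:

```python
# Connection settings for a local MySQL/MariaDB instance -- all values
# here are placeholders; adjust to your own setup.
config = {
    "host": "localhost",
    "port": 3306,
    "user": "analyst",
    "password": "analystpass",
    "database": "analytics",
}

try:
    import mysql.connector  # requires: pip install mysql-connector-python

    conn = mysql.connector.connect(**config)
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    print(f"Connected, server version: {cursor.fetchone()[0]}")
    cursor.close()
    conn.close()
except Exception as e:
    print(f"Could not connect: {e}")
```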
MariaDB
MariaDB is a MySQL fork created by the original MySQL developers, offering improved performance and features.
Why MariaDB?
- Drop-in replacement for MySQL
- Better performance
- More storage engines
- Active development
- ARM64 support (great for Raspberry Pi)
Docker Setup:
```yaml
version: '3.8'
services:
  mariadb:
    image: mariadb:latest
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: rootpassword
      MYSQL_DATABASE: analytics
      MYSQL_USER: analyst
      MYSQL_PASSWORD: analystpass
    ports:
      - "3306:3306"
    volumes:
      - mariadb-data:/var/lib/mysql
    restart: unless-stopped

volumes:
  mariadb-data:
```
Use Cases:
- Web applications
- Data warehousing (small to medium scale)
- Analytics workloads
- IoT data storage
PostgreSQL
PostgreSQL is widely regarded as the most advanced open-source relational database.
Key Features:
- Advanced SQL features
- JSON/JSONB support
- Full-text search
- Geospatial data (PostGIS)
- ACID compliance
- Extensibility
Docker Setup:
```yaml
version: '3.8'
services:
  postgres:
    image: postgres:15
    container_name: postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: securepassword
      POSTGRES_DB: analytics
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  postgres-data:
```
Installation:
```bash
pip install psycopg2-binary
```
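A minimal connection test with psycopg2, using the credentials from the compose example above (placeholders; adjust for your deployment). As with the other snippets, the import is inside the `try` block so the script is runnable even without the driver or a live server:

```python
# Connection parameters matching the compose example above -- placeholders.
params = {
    "host": "localhost",
    "port": 5432,
    "user": "postgres",
    "password": "securepassword",
    "dbname": "analytics",
}

try:
    import psycopg2  # requires: pip install psycopg2-binary

    conn = psycopg2.connect(**params)
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])
    conn.close()
except Exception as e:
    print(f"Could not connect: {e}")
```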
Use Cases:
- Complex analytical queries
- Data warehousing
- Geospatial analysis
- Time-series data (with TimescaleDB extension)
- AI/ML data pipelines
Advanced Features:
TimescaleDB Extension:
TimescaleDB turns PostgreSQL into a time-series database:
Docker with TimescaleDB:
```yaml
version: '3.8'
services:
  timescaledb:
    image: timescale/timescaledb:latest-pg15
    container_name: timescaledb
    environment:
      POSTGRES_PASSWORD: securepassword
    ports:
      - "5432:5432"
    volumes:
      - timescale-data:/var/lib/postgresql/data

volumes:
  timescale-data:
```
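As a sketch of the extension in use, the following creates a plain table and converts it into a hypertable with TimescaleDB's `create_hypertable()` function. Table and column names are illustrative, and the credentials match the compose example above:

```python
# DDL for a simple sensor-readings table; names are illustrative.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS readings (
    time      TIMESTAMPTZ NOT NULL,
    sensor_id TEXT NOT NULL,
    value     DOUBLE PRECISION
);
"""

# create_hypertable() partitions the table by the time column.
MAKE_HYPERTABLE = (
    "SELECT create_hypertable('readings', 'time', if_not_exists => TRUE);"
)

try:
    import psycopg2  # requires: pip install psycopg2-binary

    conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                            password="securepassword", dbname="postgres")
    with conn, conn.cursor() as cur:
        cur.execute(CREATE_TABLE)
        cur.execute(MAKE_HYPERTABLE)
    conn.close()
    print("Hypertable ready")
except Exception as e:
    print(f"Could not set up hypertable: {e}")
```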
NoSQL Databases
NoSQL databases offer flexible schemas and horizontal scalability, perfect for diverse data types and high-volume workloads.
Why NoSQL?
Advantages:
- Flexible Schemas: Handle diverse data without rigid table structures
- Horizontal Scalability: Add servers to handle more data
- High Performance: Optimized for specific use cases
- Agile Development: Adapt to changing requirements quickly
NoSQL Database Types:
| Type | Examples | Best For |
|---|---|---|
| Document | MongoDB, Couchbase | Semi-structured data, JSON |
| Key-Value | Redis, DynamoDB | Caching, session management |
| Wide-Column | Cassandra, HBase | Big data analytics |
| Graph | Neo4j | Social networks, relationships |
MongoDB
MongoDB is a popular document-oriented NoSQL database.
Key Features:
- JSON-like documents (BSON)
- Flexible schema
- Rich query language
- Aggregation framework
- Horizontal scaling (sharding)
- Replication for high availability
Docker Setup:
```yaml
version: '3.8'
services:
  mongodb:
    image: mongo:latest
    container_name: mongodb
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: adminpass
    ports:
      - "27017:27017"
    volumes:
      - mongo-data:/data/db
    restart: unless-stopped

volumes:
  mongo-data:
```
Python Connection (PyMongo):
```python
from pymongo import MongoClient

try:
    # Establish a connection
    client = MongoClient("mongodb://admin:adminpass@localhost:27017/")

    # Access a database
    db = client["mydatabase"]

    # Access a collection
    collection = db["mycollection"]

    # Insert a document
    document = {"name": "John Doe", "age": 30, "city": "New York"}
    result = collection.insert_one(document)
    print(f"Inserted document ID: {result.inserted_id}")

    # Query documents
    for doc in collection.find({"age": {"$gt": 25}}):
        print(doc)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the connection
    if 'client' in locals() and client is not None:
        client.close()
```
Installation:
```bash
pip install pymongo
```
Use Cases:
- Content management systems
- Real-time analytics
- IoT sensor data
- User profiles and preferences
- Catalog data
InfluxDB
InfluxDB is a time-series database optimized for time-stamped data.
Key Features:
- Purpose-built for time-series data
- High write throughput
- Built-in retention policies
- Downsampling and aggregation
- Two query languages: Flux and the SQL-like InfluxQL
Docker Setup:
```yaml
version: '3.8'
services:
  influxdb:
    image: influxdb:2.7
    container_name: influxdb
    ports:
      - "8086:8086"
    volumes:
      - influxdb-data:/var/lib/influxdb2
      - influxdb-config:/etc/influxdb2
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=adminpass
      - DOCKER_INFLUXDB_INIT_ORG=myorg
      - DOCKER_INFLUXDB_INIT_BUCKET=mybucket
    restart: unless-stopped

volumes:
  influxdb-data:
  influxdb-config:
```
Python Connection:
```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Initialize client
client = InfluxDBClient(
    url="http://localhost:8086",
    token="your-token",
    org="myorg"
)

# Write data
write_api = client.write_api(write_options=SYNCHRONOUS)
point = Point("temperature") \
    .tag("location", "room1") \
    .field("value", 23.5)
write_api.write(bucket="mybucket", record=point)

# Query data (Flux)
query_api = client.query_api()
query = 'from(bucket:"mybucket") |> range(start: -1h)'
result = query_api.query(query=query)
for table in result:
    for record in table.records:
        print(record)

client.close()
```
Use Cases:
- IoT sensor monitoring
- Application metrics
- Server monitoring
- Financial data
- Real-time analytics
Specialized Databases
Elasticsearch & Kibana
Elasticsearch is a distributed search and analytics engine.
Key Features:
- Full-text search
- Real-time indexing
- Distributed architecture
- RESTful API
- JSON documents
Docker Setup (ELK Stack):
```yaml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es-data:
```
Use Cases:
- Log analysis
- Application search
- Security analytics (SIEM)
- Business analytics
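Because Elasticsearch exposes a plain RESTful API, you can index and search with nothing but the standard library. The index name `logs` and the documents below are illustrative, and the URL matches the compose example above (security disabled):

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # matches the compose example above

# A document to index and a match query -- both are plain JSON bodies
# sent over HTTP, which is all the REST API requires.
doc = {"service": "api", "level": "error", "message": "timeout"}
query = {"query": {"match": {"level": "error"}}}

def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

try:
    print(post_json(f"{ES_URL}/logs/_doc", doc))        # index one document
    print(post_json(f"{ES_URL}/logs/_search", query))   # run the search
except Exception as e:
    print(f"Elasticsearch not reachable: {e}")
```

For heavier use, the official `elasticsearch` Python client wraps these same endpoints.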
Redis
Redis is an in-memory data store used for caching and real-time applications.
Key Features:
- In-memory storage (extremely fast)
- Data structures (strings, hashes, lists, sets, sorted sets)
- Pub/Sub messaging
- Persistence options
- Atomic operations
Docker Setup:
```yaml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    container_name: redis
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    restart: unless-stopped

volumes:
  redis-data:
```
Installation:
```bash
pip install redis
```
Use Cases:
- Caching
- Session management
- Real-time analytics
- Message queuing
- Leaderboards
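A small caching sketch with the `redis` package installed above. The key-naming scheme and the five-minute TTL are illustrative choices, not requirements:

```python
def cache_key(user_id):
    """Namespaced cache key, e.g. 'user:42:profile' -- an illustrative scheme."""
    return f"user:{user_id}:profile"

try:
    import redis  # requires: pip install redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    # Cache a value with a 300-second expiry; Redis evicts it automatically.
    r.set(cache_key(42), '{"name": "Ada"}', ex=300)
    print(r.get(cache_key(42)))
except Exception as e:
    print(f"Redis not reachable: {e}")
```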
Cassandra
Apache Cassandra is a distributed NoSQL database for high availability.
Key Features:
- Distributed architecture
- No single point of failure
- Linear scalability
- Tunable consistency
- High write throughput
Installation:
```bash
pip install cassandra-driver
```
Use Cases:
- Time-series data
- IoT applications
- Messaging systems
- Product catalogs
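A minimal sketch with the driver installed above. The keyspace, table, and single-node `SimpleStrategy` replication are illustrative settings suitable only for a local test cluster:

```python
# DDL is illustrative; SimpleStrategy with replication_factor 1 is
# only appropriate for a single-node test cluster.
KEYSPACE_DDL = (
    "CREATE KEYSPACE IF NOT EXISTS sensors "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
TABLE_DDL = (
    "CREATE TABLE IF NOT EXISTS sensors.readings ("
    "sensor_id text, ts timestamp, value double, "
    "PRIMARY KEY (sensor_id, ts))"
)

try:
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute(KEYSPACE_DDL)
    session.execute(TABLE_DDL)
    # Parameterized insert; the driver handles value encoding.
    session.execute(
        "INSERT INTO sensors.readings (sensor_id, ts, value) "
        "VALUES (%s, toTimestamp(now()), %s)",
        ("s1", 23.5),
    )
    cluster.shutdown()
    print("Row inserted")
except Exception as e:
    print(f"Cassandra not reachable: {e}")
```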
Big Data & Streaming
Apache Kafka
Kafka is a distributed event streaming platform.
Key Features:
- High-throughput message broker
- Distributed architecture
- Fault-tolerant
- Real-time data pipelines
- Stream processing
Use Cases:
- Real-time data pipelines
- Event sourcing
- Log aggregation
- Stream processing
- Microservices communication
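A producer sketch assuming the `kafka-python` package and a broker on `localhost:9092`; the topic name and JSON event format are illustrative. Kafka itself is payload-agnostic, so serialization is the client's job:

```python
import json

def encode_event(event):
    """Serialize an event dict as JSON bytes for the Kafka payload."""
    return json.dumps(event).encode("utf-8")

try:
    from kafka import KafkaProducer  # requires: pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=encode_event,  # applied to every sent value
    )
    producer.send("page-views", {"user": 42, "path": "/home"})
    producer.flush()  # block until buffered messages are delivered
    print("Event sent")
except Exception as e:
    print(f"Kafka not reachable: {e}")
```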
Apache Druid
Druid is a real-time analytics database.
Key Features:
- Sub-second query latency
- Real-time data ingestion
- Columnar storage
- Time-series optimization
- Horizontal scalability
Use Cases:
- Real-time dashboards
- Clickstream analytics
- Network telemetry
- Application performance monitoring
Integration:
- Works well with Kafka for ingestion
- Superset for visualization
- SQL interface for querying
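A sketch of the SQL interface mentioned above, queried over HTTP with the standard library. The broker port (8082) and the `clickstream` datasource are assumptions about your deployment; check your own setup:

```python
import json
import urllib.request

# Druid brokers expose a SQL endpoint over HTTP; port and datasource
# name below are assumptions -- adjust for your deployment.
DRUID_SQL = "http://localhost:8082/druid/v2/sql"
payload = {
    "query": (
        "SELECT COUNT(*) AS events FROM clickstream "
        "WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR"
    )
}

try:
    req = urllib.request.Request(
        DRUID_SQL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))
except Exception as e:
    print(f"Druid not reachable: {e}")
```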
Trino (formerly PrestoSQL)
Trino is a distributed SQL query engine.
Key Features:
- Query data across multiple sources
- Federated queries
- High performance
- ANSI SQL support
- Connector-based architecture
Use Cases:
- Data lake analytics
- Cross-database queries
- Ad-hoc analysis
- Data virtualization
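A federated-query sketch assuming the `trino` Python client and a coordinator on `localhost:8080`. The catalog and schema names are whatever you configured for your connected source; `postgresql`/`public` below are placeholders:

```python
# Query text is illustrative; information_schema works in any catalog.
QUERY = "SELECT table_name FROM information_schema.tables LIMIT 5"

try:
    import trino  # requires: pip install trino

    conn = trino.dbapi.connect(
        host="localhost",
        port=8080,
        user="analyst",
        catalog="postgresql",  # placeholder: name of a configured connector
        schema="public",
    )
    cur = conn.cursor()
    cur.execute(QUERY)
    for row in cur.fetchall():
        print(row)
except Exception as e:
    print(f"Trino not reachable: {e}")
```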
Comparison:
| Tool | Purpose | Best For |
|---|---|---|
| Kafka | Event streaming | Data ingestion, real-time pipelines |
| Druid | Real-time analytics | Sub-second queries, time-series |
| Trino | Federated queries | Cross-source analytics |
Typical Architecture:
```
Data Sources → Kafka → Druid → Visualization (Superset)
                 ↓
               Trino (for ad-hoc queries)
```
Database Comparison
SQL vs NoSQL
| Feature | SQL | NoSQL |
|---|---|---|
| Schema | Fixed | Flexible |
| Scalability | Primarily vertical | Horizontal |
| ACID | Yes | Varies |
| Joins | Yes | Limited |
| Best For | Structured data | Diverse data types |
Choosing the Right Database
Use PostgreSQL when:
- You need ACID compliance
- Complex queries and joins
- Relational data
- Strong consistency
Use MongoDB when:
- Flexible schema needed
- Document-oriented data
- Rapid development
- Horizontal scaling
Use InfluxDB when:
- Time-series data
- IoT metrics
- High write throughput
- Retention policies
Use Redis when:
- Caching required
- Session storage
- Real-time analytics
- Sub-millisecond latency
Use Elasticsearch when:
- Full-text search
- Log analysis
- Real-time indexing
- Complex queries on unstructured data
Docker Best Practices
Data Persistence
Always use volumes for database data:
```yaml
volumes:
  db-data:
    driver: local
```
Environment Variables
Use .env files for sensitive data:
```bash
# .env
POSTGRES_PASSWORD=securepassword
MONGO_ROOT_PASSWORD=mongopass
```

```yaml
services:
  postgres:
    env_file:
      - .env
```
Networking
Create dedicated networks:
```yaml
networks:
  db-network:
    driver: bridge

services:
  postgres:
    networks:
      - db-network
  app:
    networks:
      - db-network
```
Backups
PostgreSQL Backup:
```bash
docker exec postgres pg_dump -U postgres analytics > backup.sql
```
MongoDB Backup:
```bash
docker exec mongodb mongodump --out /backup
```
Restore:
```bash
docker exec -i postgres psql -U postgres analytics < backup.sql
```
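Backups are also easy to script. This sketch shells out to the same `pg_dump` command via `subprocess` and writes a timestamped dump file; container and database names match the examples above:

```python
import datetime
import subprocess

def backup_filename(prefix, now=None):
    """Timestamped dump name, e.g. analytics-20240101T120000.sql."""
    now = now or datetime.datetime.now()
    return f"{prefix}-{now:%Y%m%dT%H%M%S}.sql"

# Same pg_dump invocation as the shell one-liner above.
CMD = ["docker", "exec", "postgres", "pg_dump", "-U", "postgres", "analytics"]

if __name__ == "__main__":
    try:
        path = backup_filename("analytics")
        with open(path, "w") as f:
            subprocess.run(CMD, stdout=f, check=True)
        print(f"Backup written to {path}")
    except Exception as e:
        print(f"Backup failed: {e}")
```

Run it from cron (or a systemd timer) for regular snapshots.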
Resource Limits
```yaml
services:
  postgres:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G
```
Conclusion
Self-hosting databases gives you:
- Control: Full control over your data
- Cost Savings: No cloud database fees
- Privacy: Data stays on your infrastructure
- Learning: Deep understanding of database internals
Getting Started:
- Start Simple: Begin with PostgreSQL or MariaDB
- Understand Your Needs: Match database to use case
- Use Docker: Simplifies deployment
- Plan for Backups: Implement backup strategies
- Monitor Performance: Use appropriate tools
Next Steps:
- Deploy a test database
- Connect from your application
- Implement backup procedures
- Monitor performance
- Scale as needed
Each database has its strengths—choose based on your data model, query patterns, and scalability requirements.