Containers have changed the way applications are developed and deployed. The ease of packaging dependencies and the ability to freeze a container's contents make it possible to deploy repeatable environments. Developers can now confidently say, "What works on my machine will work everywhere." Containers go hand in hand with the DevOps process; in fact, they are a natural fit for it. With Infrastructure as Code (IaC), an entire application image can be built from code and deployed on the platform of choice. Since containers take far less time to start than virtual machines, deploying a new release or rolling one back can be done easily. But what about the state of an application? A majority of applications require state to be maintained to preserve user data or actions. Because of their ephemeral nature, containers are recommended for packaging application source code, but what about a database?
Before we try to answer that question, we must first understand:
- What are you really expecting from a data tier?
- Do you need it to be scalable?
- Do you need it to be highly available?
- Do you require non-production environments to quickly spin up for testing/development and destroy them without leaving any traces?
- Does your database provide a built-in feature to automatically synchronize nodes in a cluster, like CouchDB does?
- Does your database have poor sharding support?
- Is it really persistent data in your database, or does your application actually require a caching solution like Redis?
- Do you require consistent read/write performance from the database?
- What are your latency and throughput expectations?
- Do you expect occasional spikes? How well are they addressed in the overall architecture, or is the whole load sent to the database?
- Does your database rely on the underlying hardware to manage spikes?
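On the caching question above: a cache-aside layer in front of the database absorbs read spikes so the whole load is not sent to the database. Here is a minimal sketch in Python, where a plain dict stands in for Redis and `load_user_from_db` is a hypothetical query function, not a real API:

```python
# Cache-aside sketch: the dict stands in for Redis; load_user_from_db is hypothetical.
db_queries = 0

def load_user_from_db(user_id):
    """Pretend database lookup; counts how often the database is actually hit."""
    global db_queries
    db_queries += 1
    return {"id": user_id, "name": f"user-{user_id}"}

cache = {}

def get_user(user_id):
    """Serve from the cache when possible; fall back to the database and populate."""
    if user_id not in cache:
        cache[user_id] = load_user_from_db(user_id)
    return cache[user_id]

# A spike of 1,000 reads for the same user hits the database only once.
for _ in range(1000):
    get_user(42)
print(db_queries)  # -> 1
```

If your read traffic looks like this, the answer to the caching question may matter more than the choice of where the database itself runs.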
If you can answer most of the above questions, it will definitely help you decide whether containerizing a database is really worth it. It is tempting to deploy the application and the database using the same tools, but the release process for a database differs from that of an application. The ability to roll back or push database changes with every release must be well thought out. Changes made to a database cannot be easily rolled back; unlike code, you can't simply pull from a specific tag and deploy the database version you desire.
By now, we understand that the decision to containerize a database is not as easy as containerizing an application. Let's have a look at the broad-level requirements for a database.
Since containers are ephemeral, it is advised not to maintain any state that may be required after the container instance has been lost. The usual practice is to mount a volume from the host instance and make it available to the container. In Docker, this is usually achieved as follows:
docker run --name dev_db_sql -v /host/dir:/var/lib/mysql …
In the above example, the host directory is made available using the -v argument. But what if the host itself fails? This leads to another approach, in which data is stored on shared storage: even if the database host fails, the shared storage can be mapped to another instance. That's not all; there are a lot of plugins available for managing volumes. Have a look at Convoy and NetApp Trident.
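As a minimal sketch of the shared-storage approach (the service name, volume name, and driver are illustrative assumptions, and the Convoy plugin must be installed separately), a Compose file that keeps MySQL data on a named volume backed by an external driver might look like this:

```yaml
# docker-compose.yml (sketch; names, credentials, and driver are assumptions)
version: "3"
services:
  dev_db_sql:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example   # placeholder credential
    volumes:
      - db_data:/var/lib/mysql       # database files live on the named volume

volumes:
  db_data:
    driver: convoy                   # assumes the Convoy volume plugin is installed;
                                     # drop this line to use the default local driver
```

Because the volume outlives the container, `docker-compose down` followed by `docker-compose up` reattaches the same data, and with a shared-storage driver another host can pick up the volume after a failure.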
It is worth noting that the cloud has become an important factor in deciding the future of an organization's digital capabilities. Whether it is a private or public cloud, employing services to enable agility, time to market, and cost efficiency is a key factor in IT strategies. There are now many vendors offering Database as a Service (DBaaS), which takes away the burden of addressing non-functional requirements, backed by an SLA. Vendors offer Multi-AZ (availability zone) deployments to guarantee high availability, read replicas for scalability, or databases (mostly NoSQL) that auto-scale with spikes in demand. All of this is available at a very lucrative cost.
While all this may sound very interesting and like a perfect alternative to containerizing the database, there are a few caveats to consider. Consuming such services from remote applications adds extra latency. And moving data to the cloud proves that data is portable, but it doesn't make it any more persistent. Here is an interesting post from Joyent on the subject.
Do you think the data held by your database can be split into isolated models, each of which can act independently? Often, not all datasets need to reside in one monolithic database. Gone are the days when relational databases were the only option for storing all data. There are now many options: graph databases, document databases, columnar databases, key-value stores, and so on. By categorizing the data and looking at how it is accessed, you can identify the appropriate database solution for a given requirement. This also gives you the freedom to apply different scaling patterns to each system independently.
Container database security
Databases have always been a lucrative treasure for prying eyes. In fact, the security vendor Rapid7 published a report in 2016 about how MongoDB instances were compromised due to misconfiguration. Containers, for their part, have often been criticized as highly susceptible to security attacks. Have a look at this excerpt from a research report published in 2015:
Surprisingly, we found that more than 30% of official repositories contain images that are highly susceptible to a variety of security attacks (e.g., shellshock, heartbleed, poodle, etc.). For general images (images pushed by docker users, but not explicitly verified by any authority) this number jumps up to ~40% with a sampling error bound of 3%.
Those numbers seem like a big concern for adopting containers as a solution. The report goes on to explain some interesting facts about containers:
Containers provide a layer of isolation between applications in separate containers, thereby increasing security. Containers still need to communicate with other containers and systems, however, and thus they remain susceptible to remote exploits of security vulnerabilities baked into the container images, such as those uncovered by our analysis.
And here is what the industry analyst firm Gartner had to say about containers in July 2016:
Despite the challenges, Gartner believes that one of the biggest benefits of containers is security. Gartner asserts that applications deployed in containers are more secure than applications deployed on the bare OS and, arguably, on a VM. Although containers will not prevent applications from being compromised, they greatly limit the damage of a successful compromise because applications and users are isolated on a per-container basis so that they cannot compromise other containers or the host OS — as long as a kernel privilege escalation vulnerability does not exist on the host OS.
Having said all that, there are mixed reactions from the industry. Here are a few comments from Hacker News:
we currently use docker for some of our production databases, mainly for almost-idle services (mongodb for graylog, zookeeper for kafka), but I have had no problem using them for some moderately sized services with a couple thousands writes per second on redis/kafka (which is nothing for them).
We’re still using non-containerised versions of the databases that need dedicated bare metal servers mostly because I don’t see the risk-benefit being worth it, but I’d love to hear someone’s war stories about running larger scale databases in docker.
While I don’t consider myself to be a pure DBA, I do know Postgres quite well and manage quite a few both “classic” deploys in a VM and containerized instances. I was the one who created a default Postgres setup/image/config that our devs use, which when it’s used correctly and as documented when it is deployed to production, it is exactly the same as managing a normal instance.
For the devs it’s simple, their local env is a checkout of a sample env, copy that to their new project, docker-compose up, and they have a database running with pretty much the same config they would get in test, acceptance and production. No surprises, we both know what to expect.
Backups? Still the same. Patches? I tell my config management to a pull a new postgres image on the servers and restart the db images during a maintenance window. This makes it actually a lot easier than updating the non-containerized services.
> and the business suffers a massive loss of data
This scenario should be recoverable in the first place, and should be tested on regular basis. I’m actually setting up a process to automatically verify database recovery using containers, which makes stuff like this a lot easier and more convenient. Spin up container, restore backup to it, full vacuum analyze, pg_check, select counts from every table, select random records from every table, and if possible spin up a test instance of the application (again, very easy if that also runs in a container) where we can run unit tests against the restored database.
> who are they going to sue? A liability-free open source project?
So when have you last heard about someone suing MS or Oracle when they had data loss? I suggest you read your license agreements… Our entire business runs on such “liability-free” open-source projects. Linux, Postgres, GNU userland, Python, GCC, clang, Boost, Wildfly, Java, … and it worked out pretty well for us. We’re not some hipster startup with nodejs, angular and mongodb “cloud” apps, we provide some mission-critical services for clients that are banks, oil companies, governments, … with corresponding SLA’s. The attitude of our (very tech-focused) management is simple: we don’t need liability umbrella’s when we _own_ the technology and know what the hell we’re doing. If something does go wrong, this would mean that yes, we would be responsible, no point in hiding.
I hope this article has given you a clue about what you should really look at before considering a container as the right fit for your database.