Arch architecture. (source: Pixabay)

The idea of “rational” machines has always been part of the human imagination. As the field of artificial intelligence (AI) advanced, computational tools became more sophisticated, and specific applications of AI, such as machine learning, evolved.

Machine learning transforms business

Machine learning revolves around the idea that we should be able to give machines access to data and let them learn for themselves. It arose from the confluence of three trends: the emergence of big data, cheap and powerful computational processing, and more efficient data storage. From banking to health care to retail, machine learning is revolutionizing the way we do business. Whether it is used for detecting fraud, identifying patterns in trading, or recommending a new product based on real-time information processing, the potential for this burgeoning field is vast.

Many industries recognize that real-time insights into big data make them more efficient and differentiate them from their competitors. And ignoring the data carries a hefty price tag: PayPal reported losing $10 million a month to hackers until it implemented machine learning to detect fraudulent patterns. It is no surprise that machine learning is at the top of every IT department’s priority list for long-term investment.

IT departments quickly realized, however, that while machine learning as a field has exploded, the landscape of tools, technology, and infrastructure to power these applications is confusing and fragmented. It is not easy to manage all the servers and connect services in a way that can be scaled when needed.

The right tools for the job

Companies that want to deliver new services with data insights often find it difficult to capture and process their big data. For instance, the machine learning tools must integrate easily with the software platforms that support existing business processes, users, and diverse projects. The tools must also interface with many different data platforms and handle structured, semi-structured, and unstructured data. Lastly, the tools must integrate with the company’s preferred technology stack.

In the past few years alone, a plethora of tools has emerged to facilitate machine learning, including a broad set of container and big data technologies, such as distributed databases, message queues, and real-time analytics engines. Analysts might require access to Hadoop for batch processing analytics, Spark for processing data in real time, Kafka for near real-time messaging, and Cassandra as a fast, scalable data store for high-volume web applications.

Each of these systems and services is complicated in its own right. And within each category, there are many options: various solutions and features, each with its own merits and suited to a different purpose. Yet all of the technologies involved must be able to work together when needed. IT departments find themselves orchestrating data processing tools, data stores, integration, distributed computing primitives, cluster managers and task schedulers, deployment, configuration management, data analytics, and machine learning tools.

A reference architecture solution

What does a reference architecture for machine learning look like, given these challenges? When we speak of architecture, we have to consider the big questions beyond resource orchestration, such as security, scalability, flexibility, performance, pipeline tuning, fault tolerance, versioning, logging, and more.

The machine learning industry has been grappling with this problem, and solutions have emerged. Apache Mesos, Kubernetes (K8s), Fleet, and Docker Swarm have all offered some version of container management and cluster management. All the solutions above have their own merits, but Mesos is the only one (so far) that can run data services alongside containers, as it natively supports both stateful and stateless workloads.

The key design point that allows Apache Mesos to scale is its two-level scheduler architecture. Unlike a monolithic scheduler that schedules every task or virtual machine itself, the two-level scheduler delegates task scheduling to the frameworks. At the first level, the Mesos master decides which framework is offered resources, based on its allocation policy. At the second level, each framework decides which of its tasks to run on the resources it has been offered. This enables data services to run without resource contention with the other data services in the cluster, and it improves framework scheduling regardless of scale. It also allows the Mesos master to be a lightweight piece of code that is easy to scale as the size of the cluster grows.
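
The two-level split described above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the Mesos API: the `Master`, `Framework`, and `Task` classes, the `run` helper, and the CPU-only resource model are hypothetical names invented for illustration, and the allocation policy (offer first to the framework currently holding the fewest CPUs) is a simplified stand-in for a real allocation policy.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: float


@dataclass
class Framework:
    """Second level: decides which of its own tasks to run on an offer."""
    name: str
    pending: list = field(default_factory=list)
    allocated_cpus: float = 0.0

    def on_offer(self, offered_cpus):
        """Return the pending tasks this framework chooses to launch."""
        launched, remaining = [], offered_cpus
        for task in list(self.pending):  # iterate over a copy while removing
            if task.cpus <= remaining:
                launched.append(task)
                remaining -= task.cpus
                self.pending.remove(task)
        return launched


class Master:
    """First level: decides which framework is offered the free resources."""

    def __init__(self, free_cpus, frameworks):
        self.free_cpus = free_cpus
        self.frameworks = frameworks

    def offer_round(self):
        # Simplified policy (an assumption): frameworks holding the fewest
        # CPUs are offered resources first. A declined offer moves on to
        # the next framework in line.
        waiting = sorted(
            (f for f in self.frameworks if f.pending),
            key=lambda f: f.allocated_cpus,
        )
        for framework in waiting:
            launched = framework.on_offer(self.free_cpus)
            if launched:
                used = sum(t.cpus for t in launched)
                self.free_cpus -= used
                framework.allocated_cpus += used
                return framework.name, launched
        return None, []


def run(master):
    """Repeat offer rounds until no framework launches anything."""
    log = []
    while True:
        name, launched = master.offer_round()
        if not launched:
            return log
        log.extend((name, task.name) for task in launched)


if __name__ == "__main__":
    spark = Framework("spark", [Task("s1", 2.0), Task("s2", 3.0)])
    kafka = Framework("kafka", [Task("k1", 1.0)])
    print(run(Master(4.0, [spark, kafka])))
```

Note that the master in this sketch never inspects individual tasks. It only chooses who receives an offer, while each framework applies its own logic to decide what to launch, which is what keeps the first level small.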

Working with Apache Mesos, though, can be challenging in terms of building the framework and components. One company, Mesosphere, addressed this by building onto the Mesos core to create a full reference architecture with its DC/OS (datacenter operating system) platform. DC/OS is primarily open source, with an enterprise version that includes additional capabilities and commercial support. DC/OS allows enterprises to run relational databases, data warehouses, and big data platforms, and to manage enterprise applications and cloud-native applications within the same platform. This strategy addresses Mesos’ complexity with smaller solutions that allow you to deploy and scale through a single command-line interface and web user interface.


It’s too early in the evolution of machine learning for a single solution to dominate the field. Currently, Apache Mesos or Mesosphere software powers major web players such as Airbnb, eBay, HubSpot, Netflix, Yelp, and Twitter. This is an interesting space to watch as the field rapidly evolves and more solutions come into play.

This post is a collaboration between O'Reilly and Mesosphere. See our statement of editorial independence.
