Introduction

In this tutorial we will discuss advanced topics of cube and floe development with the help of a metaphor. This tutorial is for floe developers who are already familiar with the basic concepts of cubes and floes. If you need an introduction to these concepts, please first read the previous section of our documentation: Basic Cube and Floe Development.

We will use the example of a cookie-making factory to introduce some of the more difficult topics in cube and floe development.

Figure: A Cookie Factory

Our metaphorical factory consists of a number of assembly lines that are used to make different types of cookies, for example dry cookies and cookies with filling. Each assembly line in the factory is built from a sequence of work stations that are connected together. Each work station processes either the raw ingredients or the output of an upstream station and creates either an intermediate dough component or the final product. We assume that the individual stations of an assembly line are not necessarily located in a single building; this matters when we discuss the transport of cookie dough intermediates from one work station to another.

In this metaphor the work stations of the assembly lines represent cubes, while the assembly lines themselves are floes. The factory represents a floe package that contains functionally related cubes and floes. Finally, the ingredients represent the data we want to process, and the baked cookies represent the results that we want to calculate.
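The mapping above can be sketched in a few lines of plain Python. This is only an illustration of the metaphor and does not use the Orion or floe APIs; the station functions `mix`, `shape`, and `bake`, and the helper `assembly_line`, are hypothetical names chosen for this sketch.

```python
from functools import reduce


# Hypothetical work stations (the "cubes" of the metaphor).
# Each one takes the output of the previous station and returns its own output.
def mix(ingredients):
    return "dough(" + "+".join(ingredients) + ")"


def shape(dough):
    return "shaped(" + dough + ")"


def bake(shaped):
    return "baked(" + shaped + ")"


def assembly_line(*stations):
    """Connect stations in order (the "floe" of the metaphor)."""
    def run(item):
        # Feed the item through every station, first to last.
        return reduce(lambda value, station: station(value), stations, item)
    return run


# One assembly line built from reusable stations.
dry_cookie_line = assembly_line(mix, shape, bake)
cookie = dry_cookie_line(["flour", "sugar", "butter"])
print(cookie)
```

Because each station only depends on its input, the same `bake` station could be reused in a second assembly line without changes.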

Throughout this tutorial we will use these analogies to make it easier to relate the topics we discuss to a real-life scenario: optimizing the cookie-making process in a cookie factory. We will discuss individual topics based on the assembly lines (floes) of the factory. We will not discuss the setup and composition of the factory (floe package) itself.

Our goal is to design the best factory in terms of:

  • Efficiency of the individual assembly lines.

  • Best work practices.

Within our factory, the best practice is to group functionally related stations (cubes) and to make these stations as general and reusable as possible without sacrificing performance. The more general and reusable a station is, the less time and fewer resources are needed to maintain it and to develop new assembly lines (code maintenance). For example, the assembly lines making different cookies may have some work stations in common, such as those baking or packing the cookies. On the other hand, assembly lines may differ in other stations, such as those creating different cookie shapes.

We will start by discussing the most intuitive floe design for a cookie baking workfloe.

Figure: An Intuitive Approach to Producing Cookies

Intuitively, we may want to make cookies in a sequential procedure: we start by processing one ingredient, pass it to the next station, and (maybe) add another ingredient. This step is repeated until the cookies are done. In this workfloe, one station counts the total number of cookies produced and ensures that each cookie box holds the same number of cookies (given as an input parameter to this counting station).
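The counting station described above could look roughly like the following plain-Python sketch. It is not Orion or floe API code; the function name `counting_station` and the parameter name `cookies_per_box` are assumptions made for illustration, and how leftover cookies are handled is a choice of this sketch, not something the tutorial specifies.

```python
def counting_station(cookies, cookies_per_box):
    """Count all cookies and pack them into boxes of a fixed size.

    Returns (total, full_boxes, leftover), where leftover holds the
    cookies that did not fill a complete box.
    """
    if cookies_per_box < 1:
        raise ValueError("cookies_per_box must be at least 1")

    total = 0
    full_boxes = []
    current_box = []
    for cookie in cookies:
        total += 1
        current_box.append(cookie)
        # Close the box as soon as it holds the requested number of cookies.
        if len(current_box) == cookies_per_box:
            full_boxes.append(current_box)
            current_box = []
    return total, full_boxes, current_box


total, boxes, leftover = counting_station(range(7), cookies_per_box=3)
print(total, boxes, leftover)
```

Note that the station has to see every cookie one after another to keep its count, which is one reason a purely sequential line is easy to reason about but hard to speed up.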

The sequential approach described above is slow, although it may be sufficient for making a small number of cookies. If we want to make cookies on an industrial scale (analogous to processing large amounts of data in Orion®), an optimized workfloe may look like this:

Figure: An Optimized Process

The optimized assembly line takes advantage of the fact that wet and dry ingredients are processed differently and can be worked on concurrently. For more efficient production, it also introduces new features on the assembly line (such as putting stations in groups or batching items before sending them downstream). In this tutorial, we will describe the concepts and methods that are applied to transform the intuitive procedure into the optimized floe. Namely, we will discuss how to optimize:

  • Concurrency

  • Computational Resources

  • Parallelization (of cubes)

  • Data Traffic (between cubes)

  • Collections I/O
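As a first taste of the concurrency idea, the sketch below prepares the wet and dry ingredients at the same time and only combines them once both are ready. It uses Python's standard `concurrent.futures` module rather than the Orion or floe APIs, and the functions `prepare_dry`, `prepare_wet`, `combine`, and `make_dough` are hypothetical names for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_dry(ingredients):
    return "dry mix(" + ", ".join(ingredients) + ")"


def prepare_wet(ingredients):
    return "wet mix(" + ", ".join(ingredients) + ")"


def combine(dry, wet):
    return f"dough[{dry} + {wet}]"


def make_dough(dry_ingredients, wet_ingredients):
    # The two preparation steps do not depend on each other,
    # so they can be submitted together and run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dry_future = pool.submit(prepare_dry, dry_ingredients)
        wet_future = pool.submit(prepare_wet, wet_ingredients)
        # Combining has to wait until both branches have finished.
        return combine(dry_future.result(), wet_future.result())


dough = make_dough(["flour", "sugar"], ["eggs", "milk"])
print(dough)
```

The combining step is the only point where the two branches have to wait for each other; everything before it can proceed independently.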