
Federation Technologies

jbradbury edited this page Jul 24, 2018 · 8 revisions

Federation - What Does It Mean

In information technology, federation belongs principally to the networking domain and refers to a model for establishing a large-scale, diverse infrastructure for applications. This is mostly achieved by having different computing entities collectively adhere to a common standard of operations in order to facilitate communication.

Why Federation

Given this introductory definition, it is not surprising that an association with today's cloud-based infrastructures emerges quickly and instinctively. After all, the seamless interoperation of two distinct, formally disconnected networks, which may even have different internal structures, has always been both a necessity and a technical challenge.

But how exactly does this apply to cloud-computing applications, and why use federation? In short, federation makes it easy to manage either entire groups of clusters or individual resources, in particular when it comes to giving individuals or groups of users the ability to access, modify and transfer extremely large amounts of geographically distributed data for research purposes.

Generally speaking, there are two main approaches to achieving federation in cluster-based architectures. As touched upon above, they differ in how broad or specific the targeted resources are: (1) clusters federation, and (2) data federation (sometimes loosely referred to as data virtualization).

1. Clusters Federation

It can be achieved by providing two major building blocks:

I. Sync resources across clusters: Federation provides the ability to keep resources in multiple clusters in sync. For example, you can ensure that the same deployment exists in multiple clusters.

II. Cross cluster discovery: Federation provides the ability to auto-configure DNS servers and load balancers with backends from all clusters. For example, you can ensure that a global VIP or DNS record can be used to access backends from multiple clusters.
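The two building blocks above can be illustrated with a minimal, purely in-memory sketch. Nothing here talks to a real Kubernetes API: the `Cluster` class, `sync_deployment`, `global_dns_record`, the cluster names, IP addresses and the `web.example.org` hostname are all hypothetical and exist only to show the idea of keeping a resource in sync and publishing one record that covers the backends of every cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a member cluster (hypothetical, not a Kubernetes API)."""
    name: str
    deployments: dict = field(default_factory=dict)  # deployment name -> spec
    backends: list = field(default_factory=list)     # IPs serving traffic


def sync_deployment(clusters, name, spec):
    """Ensure every cluster holds the same spec; return the clusters that changed."""
    changed = []
    for cluster in clusters:
        if cluster.deployments.get(name) != spec:
            cluster.deployments[name] = dict(spec)
            changed.append(cluster.name)
    return changed


def global_dns_record(clusters, hostname):
    """Build one DNS-style record that lists the backends of all clusters."""
    return {hostname: [ip for cluster in clusters for ip in cluster.backends]}


clusters = [
    Cluster("eu-west", backends=["10.0.0.1"]),
    Cluster("us-east",
            deployments={"web": {"image": "web:1", "replicas": 2}},
            backends=["10.1.0.1", "10.1.0.2"]),
]
desired = {"image": "web:2", "replicas": 3}

changed = sync_deployment(clusters, "web", desired)       # both clusters updated
record = global_dns_record(clusters, "web.example.org")   # one name, all backends
```

Running `sync_deployment` a second time with the same spec changes nothing, which is the property a real federation control plane would also aim for.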

Some other use cases that federation enables are:

  • High Availability: By spreading load across clusters and auto configuring DNS servers and load balancers, federation minimizes the impact of cluster failure.

  • Avoiding provider lock-in: By making it easier to migrate applications across clusters, federation prevents cluster provider lock-in.

Federation is not helpful unless you have multiple clusters. Some of the reasons why you might want multiple clusters are:

  • Low latency: Having clusters in multiple regions minimizes latency by serving users from the cluster that is closest to them.

  • Fault isolation: It might be better to have multiple small clusters rather than a single large cluster for fault isolation (for example: multiple clusters in different availability zones of a cloud provider).

  • Scalability: There are scalability limits to a single Kubernetes cluster (although most users should not run into them).

  • Hybrid cloud: You can have multiple clusters on different cloud providers or on-premises data centers.

So far, the most relevant and popular project addressing the need of users and customers to tie together ("federate") multiple clusters in a sensible way, in order to enable the above use cases, has been "Kubernetes Cluster Federation" (a.k.a. "Ubernetes"; see https://github.com/kubernetes/kubernetes/blob/8813c955182e3c9daae68a8257365e02cd871c65/release-0.19.0/docs/proposals/federation.md).

2. Data Federation

The terms data virtualization, data federation, and data integration are used more and more often. As can be expected, this leads to confusing discussions. Some regard them as synonyms, others see them as overlapping concepts, and there are those who prefer to see them as opposites. Below we report some interesting proposed definitions for these three related terms.

Data Virtualization

Virtualization is not a new concept in the IT industry. Nowadays, almost everything can be virtualized, including processors, storage, networks, and operating systems. In general, virtualization means that applications can use a resource without concern for where it resides, what the technical interface is, how it has been implemented, the platform it uses, how large it is, and how much of it is available. Therefore, data virtualization can be seen as the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

Technically, data virtualization can be implemented in many different ways. The one most relevant to our case study is the following:

  • With a federation server, multiple data stores can be made to look like one. Applications see one large data store, while in fact the data is held in several data stores. This definition is explored further in the next section.

Data Federation

In most cases, when the term federation is used, it refers to combining autonomously operating objects. Hence, data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.

This definition seems to be based on the following concepts:

  • Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization.

  • Heterogeneous set of data stores: Data federation should make it possible to access, modify and bring together data from data stores that use different storage structures, different access languages, different types of database servers, files with different formats and different APIs.

  • Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.

  • One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.

  • On-demand integration: This refers to the moment at which the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, not in batch. Only when data consumers ask for data is it accessed and integrated. The data is therefore not stored in an integrated way, but remains in its original location and format.
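The concepts listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two stores (an in-memory SQLite table and a CSV text), their column names, the sample rows and the `federated_query` function are all made up for the example and do not correspond to any particular federation server.

```python
import csv
import io
import sqlite3

# Store 1: a relational database (in-memory SQLite, for the sake of the sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (sample_id TEXT, organism TEXT)")
db.executemany("INSERT INTO samples VALUES (?, ?)",
               [("S1", "human"), ("S2", "mouse")])

# Store 2: a CSV file with different column names and inconsistent casing.
CSV_TEXT = "id,species\nS3,Yeast\nS4,HUMAN\n"


def read_sql_store():
    """Expose the SQL rows in the shared record shape."""
    query = "SELECT sample_id, organism FROM samples ORDER BY sample_id"
    for sample_id, organism in db.execute(query):
        yield {"id": sample_id, "organism": organism.lower(), "source": "sqlite"}


def read_csv_store():
    """Expose the CSV rows in the same shared record shape."""
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        yield {"id": row["id"], "organism": row["species"].lower(), "source": "csv"}


def federated_query(organism):
    """Integrate on demand: the stores are only read when a consumer asks."""
    wanted = organism.lower()
    return [record
            for reader in (read_sql_store, read_csv_store)
            for record in reader()
            if record["organism"] == wanted]


human = federated_query("human")  # records S1 (sqlite) and S4 (csv)
```

Because nothing is copied into an integrated store, a row added to the SQLite table afterwards shows up in the next call to `federated_query`, while both stores remain usable on their own.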

Data Integration

In its broadest sense, integration means combining parts of something so that they work together or form a whole. Thus, if data from different data sources is brought together, we talk about data integration. In other words, data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data. Furthermore, it often involves joining data, transforming data values, enriching data, and cleansing data values.
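As a small illustration of the joining, transforming and cleansing steps mentioned above, the following sketch combines two made-up record sets into one unified view. The field names, the sample values and the `cleanse` and `integrate` helpers are hypothetical and only serve the example.

```python
# Two illustrative sources with inconsistent keys and units (made-up data).
patients = [
    {"patient_id": " p01 ", "weight_lb": "154"},
    {"patient_id": "P02", "weight_lb": ""},        # missing value
]
visits = [
    {"patient": "P01", "visit_date": "2018-07-24"},
    {"patient": "P02", "visit_date": "2018-07-25"},
]

LB_TO_KG = 0.45359237


def cleanse(record):
    """Cleansing: normalise the key and turn empty strings into None."""
    weight = record["weight_lb"].strip()
    return {"patient_id": record["patient_id"].strip().upper(),
            "weight_lb": float(weight) if weight else None}


def integrate(patients, visits):
    """Join both sources on the patient key and convert pounds to kilograms."""
    visits_by_id = {visit["patient"]: visit for visit in visits}
    unified = []
    for raw in patients:
        record = cleanse(raw)
        weight_lb = record["weight_lb"]
        # Transformation: unit conversion, skipped when the value is missing.
        weight_kg = None if weight_lb is None else round(weight_lb * LB_TO_KG, 1)
        # Join: look up the matching visit, if any.
        visit = visits_by_id.get(record["patient_id"], {})
        unified.append({"patient_id": record["patient_id"],
                        "weight_kg": weight_kg,
                        "visit_date": visit.get("visit_date")})
    return unified


unified = integrate(patients, visits)
```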

Data Federation Tools

When exploring the open-source solutions that have been gaining momentum in the cloud-computing marketplace, one cannot help noticing that the following projects (or combinations of them) are the most widely adopted and thus the most popular: iRODS, ONEDATA, Owncloud and its more recent fork Nextcloud.

iRODS

The integrated Rule-Oriented Data System (iRODS) is open source data management software used by research organizations and government agencies worldwide. iRODS is released as a production-level distribution aimed at deployment in mission critical environments. It virtualizes data storage resources, so users can take control of their data, regardless of where and on what device the data is stored. As data volumes grow and data services become more complex, iRODS is serving an increasingly important role in data management.

iRODS core competencies are:

  • data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.

  • data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the iRODS Zone.

  • data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the Zone.

  • secure collaboration, so users only need to log in to their home Zone to access data hosted on a remote Zone.

An extensive getting-started guide is available in the official iRODS documentation.

ONEDATA

Onedata is a global data management system that provides easy access to distributed storage resources. It supports a wide range of use cases, from personal data management to data-intensive scientific computations: users can perform heavy computations on huge datasets while accessing the data in a Dropbox-like fashion, regardless of its location. Furthermore, it gives users the option to publish and share their results with public or closed communities.

With Onedata, users can access, store, process and publish data using global data storage backed by computing centers and storage providers worldwide. At the same time, it focuses on instant, transparent access to distributed data sets, without unnecessary staging and migration, allowing the data to be accessed directly from a local computer or worker node.

Onedata is composed of several components:

  • Onezone - allows multiple storage providers to be connected into a larger distributed domain and provides users with a graphical user interface for typical data management tasks.

  • Oneprovider - the main data management component of Onedata, deployed at each storage provider site, responsible for unifying and controlling access to data over low level storage resources of the provider.

  • Oneclient - command-line tool that enables transparent access to users' data spaces through a FUSE virtual filesystem.

  • Onepanel - administration and configuration interface for Onezone and Oneprovider components.

  • LUMA - service that maps Onedata user accounts to local storage IDs; an example implementation of this service is available.

A getting-started guide is available in the official ONEDATA documentation.

Owncloud and Nextcloud

Both Owncloud and Nextcloud are suites of client–server software for creating file hosting services that allow access to, modification of and transfer of substantial amounts of data. They are functionally very similar to the widely used Dropbox, with the primary difference being that the server editions are free and open source, allowing anyone to install and operate them without charge on a private server. Furthermore, both provide integration of external storage and end-to-end encryption and, last but not least, enable full federation, which, simply put, means allowing users to share files and folders even when they use different server instances.

Both projects provide an official getting-started guide.

If you are struck and puzzled by the obvious similarities between the two projects, starting with their website layouts, their documentation and, last but not least, their features, rest assured that you are not alone.

In fact, the original Owncloud developer Frank Karlitschek forked Owncloud and created Nextcloud, which continues to be actively developed by Karlitschek and other members of the original Owncloud team. An interesting "whole story" of Owncloud vs Nextcloud has been published online and offers a good, detailed overview and summary.

Owncloud in PhenoMeNal

Our goal was to select a user-friendly, lightweight and reliable tool that can be easily deployed within a Kubernetes-based application.

Although iRODS is an impressive, mature and advanced project, it operates at a fairly low level: it requires more advanced technical skills and getting used to working without a GUI for management actions and configuration settings. On the other hand, when it comes to choosing an open-source data virtualization tool that is production-ready, reliable and aimed at deployment in mission-critical environments, iRODS is definitely the way to go. This is why it will be given strong consideration in any future development of the PhenoMeNal project.

A second, quite suitable candidate on the list is Onedata. This project has developed a tool that could potentially fill the gap between the advanced, low-level nature of iRODS and the need for a more web-based, user-friendly data management system that provides easy access to distributed storage resources. Although the project has what it takes to be the perfect candidate, we could not ignore the fact that deploying and configuring Onedata is somewhat convoluted and resource-hungry. On top of that, the available documentation is not often helpful and is sometimes quite inconsistent, leaving certain gaps to the interpretation and experience of the user. Furthermore, Onedata does not currently offer a Kubernetes-ready release (e.g. an official Helm chart), which we consider a basic requirement for the PhenoMeNal project. Therefore, despite its attractiveness, we regard Onedata as a still-young project with plenty of potential, and one to be considered for future integration should a stable Helm chart be released.

Finally, we come to Owncloud and Nextcloud as a third viable option. They offer essentially the same functionality: a suite of client-server tools for file sharing, with the added option of enabling full federation. So why did we choose Owncloud in the end? Both projects have a solid base and offer reliable support, growing communities and an increasing set of add-on apps. It is fair to say that Nextcloud appears to have more action and “buzz” at the moment, whereas ownCloud appears to be a bit more established, which is corroborated by an unofficial comment that "their approach is to make fewer, but larger commits." Since our goal was to opt for a user-friendly, lightweight and reliable tool to be deployed within Kubernetes, Owncloud won out over the other projects by supporting an official, stable Kubernetes Helm chart, to which we have also had the pleasure of contributing by integrating the Kubernetes concept of Ingress.

In conclusion, this Helm chart allows quick deployment and easy configuration (in particular for admin users) while providing a stable, secure, encrypted and thus reliable out-of-the-box file-sharing solution with the option of full federation.

How to Access the Owncloud Interface in PhenoMeNal

Once your PhenoMeNal cluster has been deployed, run the following command to get the whole list of enabled Ingresses:

kn kubectl get ingress --all-namespaces

Alternatively, you can first log in to the master via SSH with the command kn ssh and then run just kubectl get ingress --all-namespaces. The output should look similar to the following:

NAMESPACE     NAME                           HOSTS                                ADDRESS   PORTS     AGE
default       galaxy-stable-galaxy-ingress   galaxy.34.244.118.105.nip.io                   80        15m
default       jupyter-ingress                notebook.34.244.118.105.nip.io                 80        15m
default       luigi-ingress                  luigi.34.244.118.105.nip.io                    80        15m
kube-system   kubernetes-dashboard           dashboard.34.244.118.105.nip.io                80        15m
logmon        efk-kibana                     kibana.34.244.118.105.nip.io                   80        22s
logmon        grafana                        grafana.34.244.118.105.nip.io                  80        18s
logmon        prometheus-alertmanager        alertmanager.34.244.118.105.nip.io             80        20s
logmon        prometheus-pushgateway         pushgateway.34.244.118.105.nip.io              80        20s
logmon        prometheus-server              prometheus.34.244.118.105.nip.io               80        20s
federation    server-owncloud                owncloud.34.244.118.105.nip.io                 80        20s
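If the host needs to be picked out of such a listing by a script rather than by eye, a minimal sketch could look like the following. It assumes the listing has already been captured as text (for example from the command above) and that the columns are whitespace-separated as in the sample; the `ingress_host` helper is hypothetical, and the `http://` scheme is an assumption based on port 80 being shown.

```python
# Two rows of the sample output above, reused as test input.
SAMPLE_OUTPUT = """\
NAMESPACE     NAME                           HOSTS                                ADDRESS   PORTS     AGE
default       galaxy-stable-galaxy-ingress   galaxy.34.244.118.105.nip.io                   80        15m
federation    server-owncloud                owncloud.34.244.118.105.nip.io                 80        20s
"""


def ingress_host(listing, ingress_name):
    """Return the HOSTS value of the named Ingress, or None if it is absent."""
    for line in listing.splitlines()[1:]:   # skip the header row
        fields = line.split()
        # Columns are whitespace-separated: NAMESPACE, NAME, HOSTS, ...
        if len(fields) >= 3 and fields[1] == ingress_name:
            return fields[2]
    return None


host = ingress_host(SAMPLE_OUTPUT, "server-owncloud")
url = f"http://{host}"   # scheme assumed from the PORTS column (80)
print(url)
```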

Copy the host listed for the server-owncloud Ingress in your output and paste it into a browser window. You will be prompted with a pop-up dialogue box asking for credentials. You can find them in the config.tfvars file generated when you initialize your working directory with the command kn init provider my-deployment.

In that file, the passwords for the various dashboards are usually listed under the provision block (see the related documentation for details), specifically in the extra_vars field of an action block. Passwords can of course be customized. However, strong passwords are required: otherwise, during cluster initialization, a pre-init check will return an error reporting that the password(s) saved in the config.tfvars file are too weak.

Note: These secrets are encrypted before being ingested into the related Kubernetes namespaces.

How to Configure Federation Sharing in Owncloud

We recommend following the official Owncloud documentation.

How to Mount External Storage in Owncloud

Here too, we recommend following the official Owncloud documentation.
