
Manage multiple data sources for high-performance applications

Applications can pull from and work with data from multiple sources, as long as the app design incorporates these five fundamental data management and mapping techniques.
If you don't account for and carefully manage data sources during application design, there's a real risk that the application will fail to meet performance, resilience and elasticity expectations. This risk is especially acute in analytics applications that draw from multiple data sources.
However, there are five ways to address the problem of multiple data sources in an application architecture: Know what data you need to combine, use data visualization, add data blending tools, create abstracted virtual database services and determine where to host data sources. Let's look at what each of these tasks entails and why they make a difference in application data management.

1. Know what data to combine

The first thing to understand is what you should combine, both in terms of the data sources and the uses thereof. When it comes to multiple data sources, the right management tool depends on the storage formats involved and the goal you have in mind for this data.
For example, most business data lives in relational databases, which support standard functions such as queries that join data from multiple physical databases. Whatever the storage format, your queries need to align with the format of the data they address.
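As a minimal sketch of a query that spans physical databases, the snippet below uses SQLite's ATTACH to join tables stored in two separate database files. The file names and columns are hypothetical, not taken from the article.

```python
import sqlite3

# Open the primary database and attach a second physical database file.
# The file names (orders.db, customers.db) and schemas are hypothetical.
conn = sqlite3.connect("orders.db")
conn.execute("ATTACH DATABASE 'customers.db' AS crm")

# A single query can now join tables that live in different database files.
rows = conn.execute(
    """
    SELECT o.order_id, o.total, c.name
    FROM orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    """
).fetchall()

for order_id, total, name in rows:
    print(order_id, total, name)

conn.close()
```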
In terms of mission, narrow the focus according to whether the application uses:
  • real-time analytics inquiries that delve into a broad range of data;
  • structured queries using a query language such as SQL; or
  • direct application access via APIs.
The application's query approach determines whether you need a visualization tool, such as Google Data Studio, to manage your relational database management system (RDBMS), or whether you should program specific design patterns instead.
For instance, if you run real-time analytics, you're best off using something like Data Studio. If you work with structured queries in non-real time, pay particular attention to how RDBMS queries are permitted to join multiple databases or create their own databases. Finally, if you use an API, you can develop a facade or adapter design pattern to shield the APIs from direct use and impose restrictions on data relationships or database creation within that pattern.
With any choice, you have to manage the performance risks that can arise from improper use and placement of information.
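To make the API-shielding idea concrete, here is a small, hypothetical sketch of a facade that sits in front of two data-source clients and exposes only the read operation the application is allowed to perform. The class and method names are illustrative, not part of any specific library.

```python
from typing import Any

class OrdersApiClient:
    """Hypothetical client for an orders REST API."""
    def get(self, order_id: str) -> dict[str, Any]:
        return {"order_id": order_id, "total": 42.0, "customer_id": "c-1"}

class CustomerDbClient:
    """Hypothetical client for a customer database."""
    def lookup(self, customer_id: str) -> dict[str, Any]:
        return {"customer_id": customer_id, "name": "Acme Corp"}

class OrderFacade:
    """Shields the underlying APIs: callers see one read-only operation,
    and cross-source joins happen only inside this class."""

    def __init__(self, orders: OrdersApiClient, customers: CustomerDbClient):
        self._orders = orders
        self._customers = customers

    def order_summary(self, order_id: str) -> dict[str, Any]:
        order = self._orders.get(order_id)
        customer = self._customers.lookup(order["customer_id"])
        return {"order_id": order_id, "total": order["total"], "customer": customer["name"]}

# Usage: the application only ever talks to the facade.
facade = OrderFacade(OrdersApiClient(), CustomerDbClient())
print(facade.order_summary("o-123"))
```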

2. Use data visualization

Software managers trying to unify multiple data sources should use a data visualization dashboard to map them out. Data visualization provides value when the user plans to interactively analyze and query information. The approach helps architects get a clear view of their data, and it also helps them manage the relationship between data sources. There are a variety of visualization tools available from vendors such as Google, IBM and Oracle.
Consider data visualization a primary tool to manage multiple data sources in an application, even if you're not specifically looking at dynamic data visualization to gain business insight. Using a visualization tool is a great way to lay out data source relationships, test the value of cross-source analysis and even assess how database placement in the network or cloud will affect application performance. Anyone who does any form of query or analytics work should have a visualization tool.
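As an illustrative sketch (not a substitute for a dashboard tool such as Data Studio), the snippet below pulls rows from two hypothetical sources into pandas and plots them side by side to eyeball the cross-source relationship before committing to a design. The data and column names are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical extracts from two different data sources.
sales = pd.DataFrame({"region": ["east", "west", "north"], "revenue": [120, 95, 60]})
tickets = pd.DataFrame({"region": ["east", "west", "north"], "support_tickets": [30, 12, 25]})

# Join the sources on their shared key and visualize them together.
combined = sales.merge(tickets, on="region")

fig, ax = plt.subplots()
combined.plot(x="region", y=["revenue", "support_tickets"], kind="bar", ax=ax)
ax.set_title("Revenue vs. support tickets by region (two data sources)")
plt.tight_layout()
plt.show()
```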

3. Turn to data blending tools

Data blending tools are useful for analytics applications. These tools turn multiple data sources into a unified data source through a join clause that lets you define multisource data relationships and reuse them as necessary. Data blending is an increasingly common feature in visualization tools.

As with other tools selected to manage multiple data sources, data blending capabilities must align with the characteristics of the given data sources. For example, if you have information stored in a nonstandard database structure, this specific format might dictate which data blending tool you can use.
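As a rough sketch of what a blend amounts to under the hood, the following defines a reusable join relationship between two hypothetical sources with pandas. Real blending tools express the same idea through configuration rather than code, and the key names here are assumptions.

```python
import pandas as pd

# Hypothetical extracts from two data sources with different key names.
crm = pd.DataFrame({"account_id": [1, 2, 3], "account_name": ["A", "B", "C"]})
billing = pd.DataFrame({"acct": [1, 2, 2, 3], "invoice_total": [100, 50, 75, 20]})

# Define the cross-source relationship once, then reuse it for any analysis.
BLEND_KEYS = {"left_on": "account_id", "right_on": "acct"}

def blend(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Join two sources on the declared relationship (a left join here)."""
    return left.merge(right, how="left", **BLEND_KEYS)

blended = blend(crm, billing)
print(blended.groupby("account_name")["invoice_total"].sum())
```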

4. Create virtual database services through abstraction

It's surprising how many companies use visualization tools for interactive analytics and then make discrete API calls for the same sort of data when they write applications. Don't fall into this trap. Forcing an application to process too many database formats can cause performance to suffer. This practice can also create a scenario where data source correlation isn't consistent for every user.
Abstract database services facilitate application access to multiple data sources. Because these services define -- and hide -- the complex way that information can connect across sources, they encourage standardized information use and reduce development complexity. They also create a small number of services you can use to identify the specific data sources and determine what users are doing with them. For compliance purposes, this data abstraction is a critical function.
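A minimal sketch of such an abstraction layer, using hypothetical names: one service function hides which source backs each entity and logs every access, which is the kind of audit trail compliance teams look for.

```python
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-service")

# Hypothetical source-specific fetchers; in practice these would wrap an
# RDBMS driver, a REST client, etc.
def fetch_from_warehouse(entity_id: str) -> dict[str, Any]:
    return {"id": entity_id, "source": "warehouse"}

def fetch_from_crm_api(entity_id: str) -> dict[str, Any]:
    return {"id": entity_id, "source": "crm"}

# The abstraction: callers ask for an entity type, not a database.
ROUTES: dict[str, Callable[[str], dict[str, Any]]] = {
    "order": fetch_from_warehouse,
    "customer": fetch_from_crm_api,
}

def get_entity(entity_type: str, entity_id: str, user: str) -> dict[str, Any]:
    """Resolve the request to whichever source backs the entity,
    and record who accessed what for compliance purposes."""
    log.info("user=%s entity=%s id=%s", user, entity_type, entity_id)
    return ROUTES[entity_type](entity_id)

print(get_entity("customer", "c-42", user="analyst1"))
```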

5. Decide where to host data sources

Finally, consider where you will host data sources and whether the network connections to them are sufficient. Data sources are almost abstract in nature, because you access them through a logical name rather than a network or data center address. Because the information's location is typically hidden, it may not be obvious how accessing it from a specific data store will affect application performance.
Public cloud access to data sources is a prime example of the difficulties related to application performance optimization. When cloud applications access data in the data center or another cloud, traffic charges and network transit delays can mount up. Applications that access multiple data sources magnify this issue, and the abstraction performed in data blending can further exacerbate the problem by hiding the specifics of each data source. Therefore, when you design cloud applications, either host the data in the cloud, or abstract the database access to a service and run that service local to the data.
Application orchestration and deployment tools, such as Kubernetes for containerized applications, can connect database resources to application components, but it's still possible to misuse those data sources. Always take care to maximize the value and efficiency of data-centric applications in order to improve cost and performance.
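As one hedged example of how an orchestrator hands a database resource to an application component: Kubernetes commonly injects connection details through environment variables sourced from a ConfigMap or Secret, and the component reads them at startup. The variable names below are hypothetical.

```python
import os

# Connection details injected by the orchestrator (e.g. from a Kubernetes
# ConfigMap or Secret); the variable names are purely illustrative.
db_host = os.environ.get("ANALYTICS_DB_HOST", "localhost")
db_port = int(os.environ.get("ANALYTICS_DB_PORT", "5432"))
db_name = os.environ.get("ANALYTICS_DB_NAME", "analytics")

dsn = f"host={db_host} port={db_port} dbname={db_name}"
print(f"Connecting with: {dsn}")
# A real component would open the connection here, e.g. via a database driver.
```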
