A member of our team succinctly articulated the importance of data extraction, the act of pushing or pulling data from a data management system to a person or another system. He said:
“The growth, development, and success of a data management system within an organization is heavily dependent on how efficiently it can ingest and distribute data in equal measures – with distribution potentially being more important.”
With this in mind, why is it that some firms struggle to get data out of their data management systems efficiently? It’s easy, and in some cases fair, to point to the systems themselves and highlight deficiencies, but when solving for your data extraction needs it’s important to ask yourself a few questions:
- What data am I pulling?
- How much data am I pulling?
- When, or how often, am I pulling this data?
- To whom is it being delivered?
In this piece, we’ll outline the different types of data flows, the different methods of data extraction, and which extraction method best suits each flow.
Types of Data Flows
There are three main types of data flows:
- Continuous Transmission (Event Based) – This flow feeds data throughout the day, deploying a medium-low volume stream of data in consecutive messages to software on the other end. Think of this as your big enterprise data pipeline – any type of data can be on it, and once the message has been sent it can be picked up by any downstream systems that are listening.
- Batch – This is an often high-volume data set extracted on a set schedule. The schedule is based on a waterfall of dependencies: when the data will be available, when it will be actioned and extracted, service level agreements, and who the end user is.
- Ad Hoc – This flow is self-explanatory: someone in the firm needs some information, puts out a call or query, hits the API, and gets it. It’s usually a unique, one-time request for a low volume of data.
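The continuous (event-based) flow above can be sketched in miniature: a producer publishes messages onto a pipeline as events occur, and any downstream listener picks them up. This is a minimal in-process illustration using Python's standard library; the message names and fields are illustrative, not from any specific system.

```python
import queue
import threading

# Illustrative stand-in for an enterprise message pipeline.
pipeline = queue.Queue()

def producer(events):
    """Publish each event to the pipeline as it happens."""
    for event in events:
        pipeline.put(event)
    pipeline.put(None)  # sentinel: no more events

def listener(received):
    """A downstream system that consumes whatever appears on the pipeline."""
    while True:
        msg = pipeline.get()
        if msg is None:
            break
        received.append(msg)

# Hypothetical messages of the kind a pricing pipeline might carry.
events = [
    {"type": "price_update", "id": "BOND-123", "price": 99.7},
    {"type": "maturity_date", "id": "BOND-123", "date": "2030-06-01"},
]
received = []

t = threading.Thread(target=listener, args=(received,))
t.start()
producer(events)
t.join()
```

In a real deployment the queue would be a broker such as a message bus rather than an in-process object, but the shape is the same: the producer never needs to know which systems are listening.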
Types of Data Extraction
Currently, there are three main ways of extracting data:
- Flat files – these are basic file types such as .txt, .csv, .xml, and .json.
- Queues – data is delivered in a series of messages, which themselves could be flat files, to another system.
- APIs – “hooks” built into the system through which you can request data on demand, exposing the same functionality as the receiving software, often through a lightweight UI.
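A batch extraction to a flat file is the simplest of these to picture. This is a minimal sketch assuming the source system exposes its records as a list of dicts; the field names and records are hypothetical.

```python
import csv
import io

# Hypothetical batch of records pulled from a data management system.
records = [
    {"entity": "Acme Corp", "instrument": "BOND-123", "price": "99.70"},
    {"entity": "Acme Corp", "instrument": "BOND-456", "price": "101.25"},
]

def extract_to_csv(rows, fieldnames):
    """Render the batch as CSV text, ready to write to a file or drop on a share."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

csv_text = extract_to_csv(records, ["entity", "instrument", "price"])
```

The same batch could just as easily be serialized to .txt, .xml, or .json; CSV is shown only because it is the most common scheduled-delivery format.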
What is the best data extraction type for me?
To identify the best data extraction type for your needs, one thing to ask is who is the end user of this data? Is it a human or another system?
- System-to-person – In general, people extract data for their own use through queuing, and also use API tools for ad hoc, low-volume pulls that feed internal communication materials and/or analytics.
- System-to-system – when extracting data to another system, it’s best to use one of the more generic flat files, like XML or JSON. These file types are so simple and widely used that they smooth out the seams between systems by being platform neutral.
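The platform neutrality point can be seen in a few lines: system A serializes a record to JSON text, and system B, which may run on an entirely different stack, parses it back without any knowledge of system A's internals. The record below is a hypothetical example.

```python
import json

# Hypothetical entity record held by system A.
record = {"entity": "Acme Corp", "country": "US", "instrument": "BOND-123"}

# System A: serialize to JSON text (the contents of the flat file).
payload = json.dumps(record)

# System B: parse the same text; JSON parsers exist on every major platform.
received = json.loads(payload)
```

This round trip is why generic flat files remain the default seam between systems: neither side has to agree on anything beyond the file format itself.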
Now, looking at both the flow and extraction types, let’s explore which is best for which situation:
- Flat Files – best used for batch data extraction. Systems can easily handle the relatively low-maintenance file types despite the high volume. This is best managed in a system-to-system environment.
- Queue – works best for continuous transmission. The constant but low-volume stream of flat files is perfect for important data in a hurry, such as independent price verification or a bond maturity/expiry date. This extraction type also works for customer and entity data, as this type of data often connects to CRM systems which get called upon when needed. This works for both system-to-system and system-to-person extraction.
- API – APIs are great for managing ad hoc data extraction: the very low volume of data and the significant variance in request/reply calls, made more often than not by people, call for a lightweight UI with the same functionality as the receiving software. This generally works best in system-to-person situations.
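The ad hoc request/reply pattern above can be sketched without a network: a person asks a one-off question and gets a small JSON payload back. The query handler, endpoint name, and data set here are hypothetical stand-ins for a real system's API.

```python
import json

# Hypothetical data held by the source system.
DATA = [
    {"instrument": "BOND-123", "maturity": "2030-06-01"},
    {"instrument": "BOND-456", "maturity": "2027-11-15"},
]

def handle_query(instrument_id):
    """Answer a single ad hoc query with a JSON reply, or an error payload."""
    for row in DATA:
        if row["instrument"] == instrument_id:
            return json.dumps(row)
    return json.dumps({"error": "not found"})

# A person fires one request and gets one small reply.
reply = handle_query("BOND-456")
```

In practice the call would travel over HTTP to the system's API endpoint, but the contract is the same: one small question, one small answer, no standing schedule.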
Contact us if you’d like to learn more about best practices in data extraction.