There are many ways to process data; the right one depends on the data you're working with, how that data is collected, and what your goals are. If you work with collections of datasets on a regular basis, it's important to have a repeatable structure or system for processing them.
The method of batch processing is common, and essential, in many business practices.
Batch processing is a term that has been around for a long time. Put simply, it describes any kind of process that periodically manipulates or analyzes stored collections of data (a.k.a. batches). In many cases, batch processing can have tight timelines and require significant resources, especially if the batch processing job is processing big data. Perhaps it's best to start off with some examples of data collections and what you may be asked to do:
- Produce a daily report summarizing a list of sales in a retail store
- Clean and validate data in a spreadsheet in preparation for an annual audit
- Produce a sales report for the director of sales when requested
- Read data from a sensor network every 30 minutes and report any warnings
- Summarize 90 base datasets on a project area in 15 minutes (this actually happened!)
With the examples listed above, a few common patterns emerge.
The first is that the data has usually already been collected and stored over some period of time, perhaps an hour, a day, or longer; the data collection process is entirely separate.
The second pattern that emerges is the requirement to process data on a regular schedule or by request. Data is constantly being collected and stored, but analysis on the data is performed only at specific points in time. It’s sensible and straightforward to use a batch processing technique here where data can be pulled from data sources, analyzed, transformed, and output as desired.
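A batch job following this pull-analyze-transform-output pattern can be sketched in a few lines of Python. Everything here is hypothetical (the records, the field names, the report format); a real job would pull from your actual data source rather than a hard-coded list:

```python
from datetime import date

def extract():
    # Stand-in for pulling stored records (in practice: a database query or file read).
    return [
        {"item": "coffee", "amount": 4.50},
        {"item": "bagel", "amount": 2.25},
        {"item": "coffee", "amount": 4.50},
    ]

def transform(records):
    # Analyze: total revenue per item.
    totals = {}
    for r in records:
        totals[r["item"]] = totals.get(r["item"], 0.0) + r["amount"]
    return totals

def load(totals):
    # Output: one summary line per item for the daily report.
    return [f"{date.today()}: {item} -> ${total:.2f}"
            for item, total in sorted(totals.items())]

report = load(transform(extract()))
```

The three stages are deliberately separate functions: the same transform-and-load logic can later be pointed at a different extract step (a database, a file share, a web service) without rewriting the analysis.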
How Data Flows, and an Example
Let’s go into it in more detail here with the scenario about the retail store. A store makes sales throughout the day and each transaction is recorded by the point-of-sale (POS) system. At the end of the day, a batch process runs and produces a daily report summarizing the sales for that day.
The data in this example are the sales transactions. They flow into the system as each purchase is made and entered into the POS system. Since purchase transactions can be made at any time of the day and at a moment’s notice, you can consider the data collection at this point in time as real-time data. Though the sales transactions occur in real time, the POS system is only responsible for collecting and storing the records in a data storage system.
Let’s now look at the end of day when the batch processing script runs and produces the daily report. In terms of data flow, the script is responsible for fetching the data that it will process. For example, if the data storage system is a database, then the script may use SQL statements to query for the retail store’s transactions. Make note of the direction here: data is pulled from the data storage system in order to be processed, unlike the previous step where the transactional POS system pushes data into the system.
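Here's a minimal sketch of that pull step, using an in-memory SQLite database to stand in for the store's data storage system. The table name, columns, and sample rows are all hypothetical:

```python
import sqlite3

# An in-memory database stands in for the POS system's data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (sold_at TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("2024-06-01", "coffee", 4.50),
     ("2024-06-01", "bagel", 2.25),
     ("2024-05-31", "coffee", 4.50)],  # a previous day's sale, excluded below
)

# The batch script PULLS only the day it is reporting on.
rows = conn.execute(
    "SELECT item, SUM(amount) FROM transactions WHERE sold_at = ? GROUP BY item",
    ("2024-06-01",),
).fetchall()
```

Note that the filtering and summing happen in the query itself, so the script only pulls the slice of stored data it actually needs.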
When the script is complete, a daily sales report is generated and used by the retail store employees to make business decisions. This report can also be viewed as data for a possible next step in an analysis workflow. It can, for example, be pushed into another data storage system, where it can be used directly, or be the input to a future batch processing script. And, the cycle continues.
Implementing Batch Processing
Now, with a firm understanding of batch processing and how data flows, what's next? What do you need to do to be able to process your data in batches? You'll first need to look at the data, systems, and staff available to you. Then you can select from a couple of batch processing options that suit your situation.
Let's start with the data that you'll need to pull in for batch processing. Here are some questions to consider:
- How will you access your data? Is the data accessible as files on the network, via a connection to a database, through a web service, or something else?
- Is your data source fixed, or does it get updated and change over time?
- Do you maintain the data? If not, who on your team does?
The answers to these questions will influence how you build your batch processing workflow. Generally, there are two main approaches you can take:
Option 1: Writing a Script
Using scripting languages for your batch processing may be a viable option. For example, Python scripts are capable of extracting, transforming, and integrating your data.
The key barrier to this approach is that it requires someone with both programming skills in the scripting language and an understanding of the data within your systems. If you're not a developer yourself, you'll need to ask one in your organization for assistance or hire one for your team.
Option 2: Using a Software Tool
If, on the other hand, you don’t have access to developer expertise, then look for visual software tools that you can use to process data yourself. You’ve already identified where your data comes from, so select flexible tools, like FME, that can support those systems.
FME is equipped with a variety of built-in functionality that will allow you to extract the exact data that is required, transform it in the way that you need, and create the output that you’ll deliver.
Automating Batch Processing
The next step is to make the process repeatable so that you can run it again and again with little effort. A repeatable process naturally leads to consistent results.
Here are some things to consider when looking to make your workflow repeatable:
- Are you running it on a regular schedule, or running it manually when your process is needed? If on a schedule, what system will be in charge of running it reliably?
- Do the inputs to your process change each time your process is run? Examples of inputs may include things like datasets to consider or time periods to consider.
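One common way to handle inputs that change per run is to pass them as command-line arguments, so that either a person or a scheduler can supply them each time. A hedged Python sketch, with hypothetical argument names and defaults:

```python
import argparse
from datetime import date, timedelta

def parse_args(argv=None):
    # Inputs that vary per run (dataset, reporting date) become arguments,
    # so the same script serves both manual and scheduled runs.
    p = argparse.ArgumentParser(description="Daily sales report")
    p.add_argument("--input", default="sales.csv")
    # Default to yesterday: a nightly run reports on the day that just ended.
    p.add_argument("--date", default=str(date.today() - timedelta(days=1)))
    return p.parse_args(argv)

# A scheduler would invoke the script with no arguments and take the defaults;
# here we simulate an explicit manual run.
args = parse_args(["--input", "store42.csv", "--date", "2024-06-01"])
```

On a Unix system, a scheduler entry such as a crontab line (e.g. `0 1 * * * python report.py`) could then run the script with its defaults every night at 1 a.m.; dedicated tools handle the same job with more visibility and error handling.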
Since these considerations add layers on top of your original batch processing workflow, you may want to start with a manual process first. Once the manual process is working, adding repeatability is a good next step.
At some point, you’ll want to run your workflow often, reliably, and without worry. Here, automation is key. Look toward tools, like FME Server, with scheduling and automation features. Tools like this make it easy to run and manage batch processing workflows.
FME Server Schedules, for example, make it easy to run batch processing workflows on a recurring basis. Once you've published your workspace and configured a schedule, FME Server will run your batch processing workflow every night, week, or month.
What’s the Future for Batch Processing?
The type of batch processing workflow we've walked through is extremely common today, and you can see why: it's easy to understand, fairly easy to set up, and produces timely, predictable results. So batch processing will always be an easy-to-grab tool in your toolkit.
As the world around us grows more reliant on technology, the amount of (and demand for) data means data collection and processing will play an increased role. What if we want a quicker analysis of sales? Turnaround times of less than a day, or even real-time? Can batch processing help us here?
Yes, up to a point: the main lever we have is increasing the frequency at which batch processing workflows run and pull data, from daily, to hourly, or even down to the minute or less.
Eventually, though, there’s a limit. Instead of daily sales reports on retail store sales, imagine a dashboard of live eCommerce sales. If only our systems could react to each and every sales transaction in real-time…
The key here is the ability to work with a continuous inflow of data, or data streams. Reducing the time gap between data arrival and its analysis will help give you and your organization a competitive edge. In businesses where products have a short shelf life (e.g. a bakery), knowing in real time how things are selling could enable you to adjust prices on the fly to reduce waste.
While batch and stream processing both involve handling data, the key difference is when and how frequently the processing occurs. With batch processing, data has already been collected and stored, and is pulled for processing when requested. With stream processing, data is unbounded and continuously pushed through processing applications as it arrives in your system.
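The pull-versus-push distinction can be illustrated with a toy Python sketch: the same total computed once over a stored batch, versus updated incrementally as each record is pushed in. All names here are illustrative:

```python
stored_batch = [4.50, 2.25, 4.50]  # amounts pulled from storage at report time

# Batch: process the whole stored collection at once, on request.
batch_total = sum(stored_batch)

# Stream: state is updated per record, as each one is pushed in.
class RunningTotal:
    def __init__(self):
        self.total = 0.0

    def on_record(self, amount):
        # Called once per arriving transaction; no stored collection needed.
        self.total += amount
        return self.total

stream = RunningTotal()
for amount in stored_batch:  # simulating records arriving one at a time
    stream.on_record(amount)

# Both approaches arrive at the same answer;
# they differ in WHEN the work happens and WHO initiates it.
```

In the batch version the script initiates the work and reads everything at once; in the stream version the arriving data initiates the work and the result is always up to date.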
Try it Yourself
Are you ready to create your own workflow and implement batch processing in your workplace? Try FME for free to get started.
Here are some resources to help you learn more:
Stephen Wong
Stephen is a senior software developer and the back-end lead for FME Server. He's a regular at Safe's lunchtime soccer matches and enjoys a beer or two with teammates after work.