Debatching data is the process of turning one huge pile of data into many small piles of data. Why is it better to move one ton of data with two thousand one-pound shovelfuls instead of one big load from a huge power shovel? After all, large commercial databases and their attendant bulk loader or SQL Loader programs are designed to do exactly that: insert huge loads of data in a single shot. The bulk load approach works only under certain tightly constrained circumstances. They are as follows:
- The "bulk" data comes to you already matching the table structure of the destination system. Of course, this may mean that it was debatched before it reached your system.
- The destination system can tolerate a potentially significant error rate when individual rows fail to load.
- There are no updates or deletes, just inserts.
- Your destination system can handle bulk loads. Certain systems (for example, some legacy medical systems or other proprietary systems) cannot handle bulk operations.
As the vast majority of data transfer situations will not meet these criteria, we must consider various options. First, one must consider on which side of the database "event horizon" to perform these tasks. One could, for example, simply dump an entire large file into a staging table on SQL Server, and then debatch using SQL to move the data to the "permanent" tables. There are, of course, multiple tools for debatching large bulk data loads, including BizTalk Server and SQL Server Integration Services (SSIS). One can use such tools to break up large batches of data, manipulate them as needed, and send them on to their next reincarnation (for example, into an API, a relational database, or a text file). In this chapter, we will take a look at options for processing large data sets.
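The staging-table approach described above can be sketched in a few lines. The following is a minimal illustration only: it uses Python's built-in sqlite3 module to stand in for SQL Server, and the table names, column names, and de-duplication rule are all invented for the example.

```python
import sqlite3

# sqlite3 stands in for SQL Server here; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, payload TEXT)")
conn.execute("CREATE TABLE permanent (id INTEGER PRIMARY KEY, payload TEXT)")

# Step 1: dump the entire "bulk" file into the staging table in one shot.
bulk_rows = [(1, "alpha"), (2, "beta"), (2, "beta"), (3, "gamma")]
conn.executemany("INSERT INTO staging VALUES (?, ?)", bulk_rows)

# Step 2: debatch with SQL -- move only rows that pass validation
# (here, collapsing duplicate ids) into the permanent table.
conn.execute("""
    INSERT INTO permanent (id, payload)
    SELECT id, MIN(payload) FROM staging GROUP BY id
""")
conn.commit()

rows = conn.execute("SELECT id, payload FROM permanent ORDER BY id").fetchall()
print(rows)  # [(1, 'alpha'), (2, 'beta'), (3, 'gamma')]
```

The key design point is that the staging table accepts everything without constraints, so the bulk load never fails; the selective SQL in step 2 is where row-level rules (and error handling) live.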
Packt Enterprise books can be summed up through the tagline "Professional Expertise Distilled". They take in-the-trenches knowledge from experienced software professionals, and distil it into a single, easy-to-follow manuscript.