This Coding with Kip episode explains a very old approach to doing join processes, called a match-merge.
Match-merge required the data files to be joined be in sort order by the key. In a sense this is the oldest type of “relational” databases, where entire files were processed, but the records had known keys and defined uniqueness that were not enforced by the computer but rather by the application and developers. Putting the file in order might require some pre-process, but sort utilities historically were very, very efficient ways of doing this work.
>>> Related Post: Historical Sorting Processes <<<
Once the files are in sorted order, the program opens both files, and reads a single record from each. The key comparison has two possible outcomes:
1-Equal, which means that the two records should be joined
2-Not Equal, which means the programmer has to decide what should be done on a not found condition. This might mean skipping using that record, or using default values in some way.
If the files have many to one relationships, then the logic is built to advance one file to the next record and test again.
Match-merge can be done with more than 2 files, combining many, many files. This is possible because memory is very efficiently used, as only one record from each file at a time has to be in memory for the comparison.
Quite frequently when working with young programmer I find their default for doing this kind of work is to first open the files and load them all into memory. The ubiquity and lower costs of memory today makes this an easy answer–for smaller scale problems. But the match-merge approach to joining can be done on any size of data files. Yes, you read that correctly, any size of data to be joined. In fact, the more difficult process is sorting very large files, not the actual join processing. But because sort processes can break files into pieces to sort, and then do a match-merge process themselves to create larger and larger files, there is no limit to that ability either.
If one of the input files is designated a master file, and the other a transaction file, then the output from the process might produce an updated master file. This would be known as a special case of match-merge, called a posting process.
>>> Related Post: Historical Posting Processes <<<
The video has a second part where it shows the results of using this approach to join together 100 million records in about 3 minutes on a 2 core 3.1 Gig MacBook Pro. The data used in this demo was the transaction and vendor master files produced as part of the Posting Process POC written in Scala. The source code for the demo is available on GitHub VAPostAnalysisRSCDemo. (This code is not maintained and is intended for educational purposes only. It is released by IBM under Apache 2.0 License.)
This is another episode of the new Coding with Kip series, the technical sub-series of Conversations with Kip, the best financial system vlog there is. Literally learn more–about ledgers and financial systems–at FinancialSystemsEducation.com.
Watch the series in order at the Coding with Kip Playlist