In the next three weeks’ episodes of Conversations with Kip, I focus on Apache Spark, a technology that recognizes the power of scanning large transactional data sets to produce analytical outputs.

This week we focus on the input side of analytical processes. The more data we can scan quickly and efficiently, the greater our analytical potential. Yet the scan process can be separated from how the data is stored. If the data sits in a deeply encrypted format, then even though Spark may be faster than other methods of getting at the data, the overhead of unwinding that format reduces the amount of data we can process.
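To make that separation concrete, here is a minimal Spark sketch in Scala. The paths, the `accountId` and `amount` columns, and the balance roll-up are all hypothetical: the point is that the same scan logic runs unchanged whether the transactions sit in row-oriented CSV or columnar Parquet, while the storage choice governs how much data that scan can move per second.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object ScanVersusStorage {
  // The analytical scan: roll every transaction up to an account balance.
  // Nothing here depends on how the data is stored on disk.
  def scanBalances(txns: DataFrame): DataFrame =
    txns.groupBy("accountId").agg(sum("amount").as("balance"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ScanVersusStorage").getOrCreate()

    // Same logical data set, two different storage choices.
    val fromCsv     = spark.read.option("header", "true").csv("/data/txns_csv")
    val fromParquet = spark.read.parquet("/data/txns_parquet")

    // Identical scans; the Parquet scan reads only the two columns it
    // needs, so the same hardware works through far more transactions.
    scanBalances(fromCsv).show()
    scanBalances(fromParquet).show()

    spark.stop()
  }
}
```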

Storage choices for data are important. Often storage choices are based not upon the access patterns for the data, but upon the skills of the developers involved. Certainly data stored in a format no one can use is of no use, but more often the usage patterns are simply never evaluated. Indexed access methods, such as those behind SQL databases, are often useful in transaction capture, and perhaps in presenting an individual balance, but infrequently in posting processes, as the sketch below illustrates.
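A small sketch of the two access patterns, again with hypothetical paths and column names: the keyed filter answers "what is this one account's balance," which an indexed store handles well, while the posting run must touch every transaction exactly once, so a straight sequential scan is the better fit.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object AccessPatterns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AccessPatterns").getOrCreate()
    val txns  = spark.read.parquet("/data/transactions")

    // Transaction capture / single-balance presentation: a keyed lookup.
    // Indexed access shines here because only a few rows are touched.
    val oneBalance = txns
      .filter(col("accountId") === "ACCT-0042")
      .agg(sum("amount").as("balance"))

    // Posting: every transaction must be applied once, so index-driven
    // row-at-a-time access only adds overhead; scan the whole set.
    val postedBalances = txns
      .groupBy("accountId")
      .agg(sum("amount").as("balance"))

    oneBalance.show()
    postedBalances.show()
    spark.stop()
  }
}
```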

Poor or lazy data storage choices perpetuate poor data availability today as much as the lack of compute resources did in decades past.

Watch the 95th episode of Conversations with Kip, the best financial system vlog there is, here.