Apache Spark IO and Index Access

In the next three weeks’ episodes of Conversations with Kip, I focus on Apache Spark, a technology that recognizes the power of scanning large transactional data sets to produce analytical outputs.

This week we focus on the input processes that feed analytical work. The more data we can scan efficiently and quickly, the greater our potential analytics. Yet the scan process can be separated from the question of how the data is stored, and that storage choice still matters: scanning data held in a deeply encrypted form means that, although Spark may be faster than other methods of getting at the data, the amount of data we can process is reduced.
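
To make that separation concrete, here is a minimal Spark sketch. The paths, column names, and the choice of CSV versus Parquet are all assumptions for illustration; the point is simply that the analytical scan is written once, while the storage format underneath it determines how much data that scan can get through.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ScanVersusStorage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scan-versus-storage")
      .getOrCreate()

    // Row-oriented text input: every byte of every record must be read and parsed.
    val csvTxns = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/transactions_csv")         // hypothetical path

    // Columnar input: only the columns the scan touches are read from storage.
    val parquetTxns = spark.read
      .parquet("/data/transactions_parquet") // hypothetical path

    // The scan itself is identical either way; only its throughput differs.
    csvTxns.groupBy("account").agg(sum("amount")).show()
    parquetTxns.groupBy("account").agg(sum("amount")).show()

    spark.stop()
  }
}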

Storage choices for data are important. Too often, decisions about how data is stored are based not on the access patterns for that data, but on the skills of the developers involved. Certainly data stored in a format no one can use is of no use at all, but more often the usage patterns are simply never evaluated. Indexed access methods, such as those underlying SQL databases, are often useful for transaction capture, and perhaps for presenting an individual balance, but infrequently for posting processes.
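
A small sketch of that contrast, again with hypothetical table names, paths, and connection details: an indexed lookup through a JDBC source is a natural fit for presenting one account's balance, while a posting run that must touch every transaction gains nothing from the index and is better served by a straight scan.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object IndexedVersusScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("indexed-versus-scan")
      .getOrCreate()

    // Indexed access: present a single customer's balance.
    // (Connection details and credentials omitted; names are hypothetical.)
    val oneBalance = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost/ledger")
      .option("dbtable", "balances")
      .load()
      .filter("account_id = 'A-1001'")

    // Posting process: every transaction is touched anyway, so a full scan
    // and roll-up is the natural shape of the work.
    val postedBalances = spark.read
      .parquet("/data/transactions_parquet")  // hypothetical path
      .groupBy("account_id")
      .agg(sum("amount").as("posted_balance"))

    oneBalance.show()
    postedBalances.show()

    spark.stop()
  }
}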

Poor or lazy data storage choices perpetuate our poor data availability today as much as the lack of compute resources did in decades past.

Watch the 95th episode of Conversations with Kip, the best financial system vlog there is, here.
