Apache Spark Parallelism and Financial Analytics

This week we continue our look at the usefulness and limitations of Apache Spark for financial analytics. Last week we discussed input data formats. This week we focus on the consolidation of financial data.

Indexed access typically helps when searching for a single answer to a question, be it the URL for a search engine query, an individual financial transaction, or a balance. This compute pattern uses parallelism very effectively. But many types of financial analytics do not search for a single record; they create accumulations of many transactions.
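To make the contrast concrete, here is a minimal sketch in plain Python rather than Spark. The transaction records, account names, and function names are all made up for illustration. It only shows the difference in how much data each pattern has to touch.

```python
from collections import defaultdict

# Hypothetical transactions: (transaction_id, account, amount in cents).
transactions = [
    ("t1", "cash", 10000),
    ("t2", "revenue", -10000),
    ("t3", "cash", -2500),
    ("t4", "expense", 2500),
]

# Indexed access: build the index once, then each lookup touches one record.
index_by_id = {
    txn_id: (account, amount) for txn_id, account, amount in transactions
}


def find_transaction(txn_id):
    """Return the single record stored under txn_id."""
    return index_by_id[txn_id]


def accumulate_balances(txns):
    """Read every transaction and sum the amounts for each account."""
    balances = defaultdict(int)
    for _txn_id, account, amount in txns:
        balances[account] += amount
    return dict(balances)


single_record = find_transaction("t3")
all_balances = accumulate_balances(transactions)
```

The lookup cost stays flat as the transaction list grows, while the accumulation has to read every record before it can report any balance.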

Posting processes by their nature aggregate transactions to make balances. Many of our queries may be answered by an existing posting engine. But if not, we have to create that position ourselves, which requires bringing together the results of aggregations from potentially many transactions into a single result.
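One way to picture this step is as two stages: each slice of the transactions is posted into partial balances, and the partial balances are then merged into one position. The sketch below is plain Python, not Spark and not any particular posting engine; the partition layout, account names, and function names are assumptions made for illustration.

```python
from functools import reduce


def post_partition(partition):
    """Aggregate one partition's (account, amount) pairs into partial balances."""
    partial = {}
    for account, amount in partition:
        partial[account] = partial.get(account, 0) + amount
    return partial


def merge_balances(left, right):
    """Combine two sets of partial balances without mutating either input."""
    merged = dict(left)
    for account, amount in right.items():
        merged[account] = merged.get(account, 0) + amount
    return merged


def consolidate(partitions):
    """Post each partition independently, then merge into a single position."""
    # Each partition could be posted by a separate worker.
    partials = [post_partition(p) for p in partitions]
    # The merge is the step that brings everything back together.
    return reduce(merge_balances, partials, {})


# Hypothetical data: amounts in cents, split across three partitions.
partitions = [
    [("cash", 10000), ("revenue", -10000)],
    [("cash", -2500), ("expense", 2500)],
    [("cash", 4000), ("revenue", -4000)],
]
balances = consolidate(partitions)
```

The first stage runs independently per partition, so it parallelizes well. The merge is where the partial results have to meet, and that is the part of the work that does not spread out as easily.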

This process sits between the indexed access of creating or updating transactional data and finding the final aggregated position in a balance. Spark is effective at scanning large amounts of data looking for the “needle in the haystack”, but unless it is used carefully it handles this compute pattern of creating balances through a posting process poorly.

Watch episode 96 of Conversations with Kip, the best financial system vlog there is.