BIG DATA: THE RIVER AND THE LAKE
A river is a natural flowing watercourse, usually freshwater, flowing towards an ocean, sea, lake or another river. (source: Wikipedia)
A lake is an area filled with water, localized in a basin, surrounded by land, apart from any river or other outlet that serves to feed or drain the lake. (source: Wikipedia)
Is river a synonym for lake? I don't think so. Yes, they both consist of water, but that is where the similarity ends. They are fundamentally different. One flows, the other stands still.
Let's now look at the banks, and at how they see data as the fuel that will feed the engines addressing their two main challenges: compliance and digitalization. Are those two the same? Certainly not. Can both issues be addressed by simply creating a big lake of data and making sure that whatever or whoever needs the data gets access to it?
Let's have a look at compliance first.
We need to know our customers. We need to conduct transaction screening on every sort of transaction, and on combinations of different sorts of transactions (e.g. equity trades, derivative trades and payments). We need to report to the regulators, we need to guarantee data privacy, and on top of that we want to conduct all sorts of analytics.
Transactional data flows like a river, through systems and applications. It enters the bank, gets enriched with static data, triggers a number of actions and at some point leaves the organization in a different format (e.g. market data triggers a buy or sell order going to the market).
Data stored as the result of a transaction will perfectly serve the need for reporting and analytics. Is it appropriate for the obligation to screen transactions for potential market abuse, money laundering, etc.? No, it is not, because the data only becomes available after the transactional event. Screening and monitoring need to happen in the flow. Action needs to be taken before the transaction is closed. Only then does it become truly effective. If a transaction is malicious, we do not want it to happen at all. If we only get alerted after the event, the harm is done. We are sorry, not safe...
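To make the in-flow idea concrete, here is a minimal sketch of screening transactions before they are executed rather than scanning a store afterwards. The rule set, field names and threshold are hypothetical illustrations, not any bank's actual screening logic:

```python
# Minimal sketch of in-flow screening: every transaction is checked
# before it is committed, not after the fact. Rules, field names and
# the threshold below are hypothetical examples.

BLOCKED_COUNTERPARTIES = {"ACME_SHELL_CO"}  # hypothetical watchlist
AMOUNT_THRESHOLD = 1_000_000                # flag unusually large transfers


def screen(txn: dict) -> bool:
    """Return True if the transaction may proceed, False to block it."""
    if txn["counterparty"] in BLOCKED_COUNTERPARTIES:
        return False
    if txn["amount"] > AMOUNT_THRESHOLD:
        return False
    return True


def process(stream):
    """Screen transactions in the flow; only clean ones get executed."""
    executed, blocked = [], []
    for txn in stream:
        (executed if screen(txn) else blocked).append(txn)
    return executed, blocked


incoming = [
    {"id": 1, "counterparty": "REGULAR_CORP", "amount": 50_000},
    {"id": 2, "counterparty": "ACME_SHELL_CO", "amount": 10_000},
    {"id": 3, "counterparty": "REGULAR_CORP", "amount": 2_000_000},
]
ok, stopped = process(incoming)
```

The point of the sketch is the ordering: the check sits between arrival and execution, so a malicious transaction is stopped, not merely reported.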
Now let's take a sharper look at the digitalization of banking services. Digitalization is taking place with two specific goals: collecting as much information about the client as possible, and providing the client with excellent service: fast and easy transactions, plus a round-the-clock overview of the client's assets and debts. Data gathering and asset overviews are relatively static; transactions, though, are... well... transactional. The lake and the river...
Why, then, do most of the Chief Data Officers I talk to seem fully occupied either digging big data lakes (data swamps?) or building seemingly well-organized arrays of data pools (they call them warehouses), thinking that those will serve any purpose as long as it is related to data?
Fundamentally, the difference between the lake and the river is the difference between historical and real-time data. They need to be handled in totally different ways. I agree that in many cases an application or a user will require a combination of static and streaming data.
Well, let this issue now be the perfect use case for a data virtualization tool.
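As a rough illustration of what "one interface over static and streaming data" means, here is a sketch that joins a live stream of trades with static reference data on the fly. All names and records are hypothetical, and this is a toy stand-in for what a real data virtualization tool would expose, typically as federated SQL:

```python
# Minimal sketch: one access path over static reference data (the lake)
# and a live stream (the river). Data and field names are hypothetical.

STATIC_CUSTOMERS = {  # the lake: slowly changing reference data
    "C1": {"name": "Alice", "risk": "low"},
    "C2": {"name": "Bob", "risk": "high"},
}


def enrich(stream):
    """Join each streaming trade with static customer data as it flows by."""
    for trade in stream:
        ref = STATIC_CUSTOMERS.get(trade["customer"], {})
        yield {**trade, **ref}


trades = [{"trade_id": "T1", "customer": "C2", "amount": 500}]
enriched = list(enrich(trades))
```

The consumer sees one enriched record per event; the static store is consulted at the moment the event flows past, not copied into the stream beforehand.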
If a fisherman spots a splendid salmon swimming in the river, he won't wait until it ends up in the lake amongst dozens of other fish to catch it.