There is nothing wrong with those systems as long as they fulfil the business requirements. Any system that fulfils your business needs is a good system; if it is also simple, even better.
At this stage, there are multiple ways of doing data analysis:
- Simply submit queries to the OLTP database's replica node (not recommended).
- Enable CDC (Change Data Capture) on the OLTP database and ingest the change logs into the OLAP database. When it comes to the ingestion service for CDC logs, you can choose based on the OLAP database you have selected. For example, Flink data streaming with CDC connectors is one way to handle this, and many enterprise services come with their own suggested solution, e.g. Snowpipe for Snowflake. It is also recommended to read from a replica node so that the CPU/IO bandwidth of the primary node is preserved for online traffic.
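Whatever ingestion service you pick, the core of CDC replay is the same: row-level change events from the OLTP side are applied, in order, to an OLAP-side table. A minimal sketch, with a hypothetical event shape (real connectors such as Flink CDC emit richer envelopes):

```python
# OLAP-side table, keyed by primary key. In practice this would be a
# warehouse table; a dict keeps the sketch self-contained.
target = {}

def apply_cdc_event(event):
    """Replay one change event from the OLTP change log."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]  # upsert the latest row image
    elif op == "delete":
        target.pop(key, None)       # propagate the delete downstream

# Hypothetical change log captured from the OLTP database.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2},
]
for e in events:
    apply_cdc_event(e)
```

Note that ordering matters: replaying the same events out of order would leave the OLAP copy inconsistent, which is why CDC pipelines preserve per-key ordering.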
At this stage, ML workloads might still run in your local environment: set up a Jupyter notebook locally, load structured data from the OLAP database, and train your ML model on your machine.
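In notebook form, that workflow is just "one SQL query in, one model out". A minimal sketch, using sqlite3 as a stand-in for the OLAP database's own connector and a closed-form least-squares fit in place of a real ML library:

```python
import sqlite3

# Stand-in for the OLAP database; in a real notebook you would use the
# warehouse's own Python connector instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (x REAL, y REAL)")
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)])

# Load structured data exactly as the notebook would: one SQL query.
rows = conn.execute("SELECT x, y FROM features").fetchall()
xs, ys = zip(*rows)

# "Train" a tiny model locally: ordinary least squares for y = a*x + b.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in rows) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
```

This works well right up until the query result no longer fits on your laptop, which is exactly the pressure that motivates the challenges below.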
The potential challenges of this architecture include, but are not limited to:
- It is hard to manage unstructured or semi-structured data with an OLAP database.
- OLAP databases can show performance regressions when it comes to massive data processing (e.g. more than a terabyte of data for a single ETL task).
- Lack of support for various compute engines, e.g. Spark or Presto. Most compute engines do support connecting to an OLAP database through a JDBC endpoint, but parallel processing will be severely limited by the IO bottleneck of the JDBC endpoint itself.
- The cost of storing massive data volumes in an OLAP database is high.
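To see why the JDBC endpoint becomes the bottleneck, consider how engines like Spark try to work around it: they split the read on a numeric column into ranges and open one connection per range. A hypothetical sketch of just the range-splitting and fan-out logic (the SQL strings stand in for actual fetches):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_bounds(lower, upper, num_partitions):
    """Return (low, high) half-open ranges covering [lower, upper)."""
    step = (upper - lower) // num_partitions
    bounds = []
    for i in range(num_partitions):
        lo = lower + i * step
        hi = upper if i == num_partitions - 1 else lo + step
        bounds.append((lo, hi))
    return bounds

def fetch_slice(bounds):
    lo, hi = bounds
    # In reality each task opens its own JDBC connection and runs
    # this range query; here we only build the query string.
    return f"SELECT * FROM orders WHERE id >= {lo} AND id < {hi}"

# One worker per partition; every connection still lands on the same
# OLAP endpoint, which is exactly where the IO bottleneck appears.
with ThreadPoolExecutor(max_workers=4) as pool:
    queries = list(pool.map(fetch_slice, partition_bounds(0, 1000, 4)))
```

Even with this fan-out, all four connections funnel through the same endpoint, so parallelism is capped by the endpoint's IO capacity rather than by the compute cluster.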
You might already know the direction to solve this: build a data lake! Bringing in a data lake does not necessarily mean you need to completely sunset the OLAP database. It is still common to see companies keep both systems running side by side for different use cases.
A data lake allows you to persist unstructured and semi-structured data, and performs schema-on-read. It lets you reduce cost by storing large data volumes in a specialised storage solution and spinning up compute clusters on demand. It further allows you to manage TB/PB-scale datasets effortlessly by scaling out the compute clusters.
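Schema-on-read is the key difference from the OLAP path above: the lake accepts raw, heterogeneous records as-is, and a schema is only applied when a query runs. A minimal illustration, using JSON lines as a stand-in for files in object storage:

```python
import json

# What lands in the lake: raw records, not forced into one schema.
raw_objects = [
    '{"user": "a", "amount": 10, "coupon": "X1"}',
    '{"user": "b", "amount": 25}',  # missing field: still accepted
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: project each raw record onto the
    columns the query asks for, filling absent fields with None."""
    for line in lines:
        rec = json.loads(line)
        yield {col: rec.get(col) for col in schema}

rows = list(read_with_schema(raw_objects, ["user", "amount"]))
```

Under schema-on-write, the second record would have been rejected or coerced at ingestion time; under schema-on-read, ingestion is cheap and the interpretation cost is paid per query.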
Here is what your infrastructure might look like: