Lean practices mean non-functional requirements like business data quality, information security, and test automation get relatively little attention at first. Firms employing lean focus on what’s most important to customers today and de-emphasize planning and documentation. Challenges surface when growth and survival demand that those technical debts be repaid – and often repaid fast.
In a prior post, Minimal Viable Information Security, we observed how information risks can suddenly pop to the fore. Here are our observations on how data quality technical debt grows over time, and how to anticipate and mitigate it before your lean business data “breaks bad”.
Business Needs Drive Data Maturity
Selecting the right repository for business data depends on factors including, but not limited to:
- Who accesses the data (users and software entities, like web servers);
- How the data is accessed (via network protocols like HTTP, or direct access APIs);
- What operations are performed on the data (reads, writes, updates, calculations, etc.);
- How fast data transactions need to occur (user speed, reporting latency, etc.);
- How long the data needs to be retained.
Lean startups typically deploy an open source SQL database for all user and usage variations. Data volumes are low at first, and the SQL abstraction is very well known and works for pretty much any software storage need in any programming language.
With growth however, business users will begin to feel pain:
- Development resources will be focused on front-end users and their data, much less on back-end business users, and for simplicity teams will want to use only one SQL database for everyone;
- Front-end data access is usually localized and used in real-time within an app session. Access has to be fast, and is highly transactional;
- In comparison, back-end data retrieval covers non-localized spans of time, and may require complex calculations that cannot be performed easily in real-time;
- Front-end data is focused on the present, while back-end data needs may be entirely focused on the past.
- Front-end data access is all about a single user, while back-end business data cares deeply about who accessed or updated business data and why.
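The contrast above can be sketched with two queries against the same hypothetical table (table and column names are assumptions for illustration, with sqlite3 standing in for any SQL engine): the front-end needs a fast point lookup for one user right now, while the back-end needs an aggregate across all users over a span of time.

```python
import sqlite3

# Hypothetical single "payments" table serving both audiences.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "amount REAL, paid_at TEXT)"
)
conn.executemany(
    "INSERT INTO payments (user_id, amount, paid_at) VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-05"), (1, 20.0, "2024-02-03"), (2, 10.0, "2024-02-10")],
)

# Front-end: one user's latest data -- a point lookup that must be fast.
front_end = conn.execute(
    "SELECT amount FROM payments WHERE user_id = ? "
    "ORDER BY paid_at DESC LIMIT 1",
    (1,),
).fetchone()

# Back-end: revenue per month across all users -- a scan over history.
back_end = conn.execute(
    "SELECT substr(paid_at, 1, 7) AS month, SUM(amount) "
    "FROM payments GROUP BY month ORDER BY month"
).fetchall()

print(front_end)  # latest payment amount for user 1
print(back_end)   # monthly revenue totals
```

Both queries work on one small table; the pain arrives when the back-end scan competes for resources with thousands of concurrent front-end lookups.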
Migrating from a single SQL database architecture to one that natively supports both end users and business users needs to be done stepwise, or your plan to pay down technical debt can carry very high interest rates.
Business Data Evolution Begins With Copy
These conditions are so common that a standard architecture for business data maturity has emerged, where data is migrated from one store to another for different business needs. The following diagrams chart this architectural pattern.
Figure 1. SaaS application accessing a single front-end SQL database
This first evolution takes the front-end production data and copies it periodically to a second SQL database for use by business users. This simple move solves the immediate need to give business users their own copy of production data.
While this mitigates immediate needs, the front-end production schema is not designed for the calculations and reporting business users require. Business users at this stage are forced to copy data out into spreadsheets to perform necessary calculations.
Furthermore, any changes to the front-end data schema are immediately reflected in any back-end copies. These schema changes can cause failures or unexpected results in business reporting or auditing.
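The periodic copy itself can be very simple. A minimal sketch, using sqlite3's backup API as a stand-in for whatever dump/restore tooling your SQL engine provides (the "orders" table is an assumed example):

```python
import sqlite3

# Hypothetical nightly job: copy the front-end production database to a
# separate replica for business users.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
source.execute("INSERT INTO orders (total) VALUES (42.0)")
source.commit()

replica = sqlite3.connect(":memory:")
source.backup(replica)  # full copy; pg_dump/restore would play this role in production

# Business users query the replica without loading the production database.
summary = replica.execute("SELECT COUNT(*), SUM(total) FROM orders").fetchone()
print(summary)
```

Note that the replica inherits the production schema verbatim, which is exactly why upstream schema changes break downstream reporting.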
ETL into a Data Warehouse for BI
The next evolution adds an ETL (Extract, Transform and Load) step that moves business data into a star or snowflake schema designed for read-only business intelligence reporting with common calculations:
Figure 2. Front-end to back-end copy
Figure 3. Front-end to back-end data warehouse via ETL data flow programming
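A toy version of that ETL step, assuming a copied "orders" table as the source and a minimal star schema (one fact table, one customer dimension) as the target; all names here are illustrative, not a prescribed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source: the copied production table (schema assumed for illustration).
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, placed_at TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "acme", 100.0, "2024-03-01"),
    (2, "acme", 50.0, "2024-03-02"),
    (3, "globex", 75.0, "2024-03-01"),
])

# Target: a minimal star schema in the warehouse.
conn.execute(
    "CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE)"
)
conn.execute(
    "CREATE TABLE fact_sales (customer_key INTEGER, total REAL, placed_at TEXT)"
)

# Extract, transform (resolve dimension keys), and load in one pass.
for _, customer, total, placed_at in conn.execute("SELECT * FROM orders").fetchall():
    conn.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (customer,))
    key = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE name = ?", (customer,)
    ).fetchone()[0]
    conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", (key, total, placed_at))

# A common BI rollup the star schema makes cheap: revenue per customer.
rollup = conn.execute(
    "SELECT d.name, SUM(f.total) FROM fact_sales f "
    "JOIN dim_customer d USING (customer_key) "
    "GROUP BY d.name ORDER BY d.name"
).fetchall()
print(rollup)
```

Even at this scale the coupling is visible: rename a column in `orders` and the extract query, the transform, and the warehouse schema all need coordinated changes.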
There can be resistance to a data warehouse for two reasons. The obvious one is cost, since the firm now runs two different repositories and their supporting software.
But another is reduced agility: Now, any changes to the upstream database must consider the consequences to both the ETL software and data warehouse schema. Development teams must bifurcate, and start practicing new data change and access control rules.
The introduction of a data warehouse is the time when an enterprise data architect becomes critical to the organization. Most developers will now work with only the production database, while the enterprise architect tracks and anticipates needs across the entire firm.
Many companies can get by for quite some time at this FE/DW (Front End / Data Warehouse) stage, especially if there is rarely any need to change the DW data.
But transactional complexity in business processes can start to cause further problems: For example, an initial transaction in the front-end may need to be modified or evolve over time (e.g., payments are cancelled, business data proves to be in error, customer relations require updates or cancellations, etc.). The sheer volume of transactional data may begin to overwhelm.
If the DW is not the business’s “source of truth”, what is?
Adding an Operational Data Store
The next evolution adds an operational data store (ODS) between the front-end transaction database and the back-end DW.
Figure 4. Front-end to back-end replication with ODS, ETL layers and Data Warehouse
Any changes to transaction data for whatever reason are made in the ODS. The ODS becomes the transactional “source of truth” for all business BI reporting and analytics. An ODS can also be a safety valve for firms with extreme transaction volumes.
The ODS can be implemented as a SQL or NoSQL store (e.g., Hadoop), depending on the firm, data volume and its needs. With the ODS in place, downstream reporting is now almost exclusively read-only — a common financial audit requirement.
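The "source of truth" role is easiest to see with a correction. A minimal sketch of an ODS-side adjustment, assuming a payments table and an append-only corrections log (all names hypothetical), so downstream reporting stays read-only while the change remains auditable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ODS copy of transaction data, plus an append-only audit log of corrections.
conn.execute(
    "CREATE TABLE ods_payments (id INTEGER PRIMARY KEY, amount REAL, status TEXT)"
)
conn.execute("INSERT INTO ods_payments VALUES (7, 120.0, 'settled')")
conn.execute(
    "CREATE TABLE corrections (payment_id INTEGER, old_amount REAL, "
    "new_amount REAL, reason TEXT)"
)

def correct_payment(payment_id, new_amount, reason):
    """Apply a business correction in the ODS, preserving an audit trail."""
    old = conn.execute(
        "SELECT amount FROM ods_payments WHERE id = ?", (payment_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO corrections VALUES (?, ?, ?, ?)",
        (payment_id, old, new_amount, reason),
    )
    conn.execute(
        "UPDATE ods_payments SET amount = ? WHERE id = ?", (new_amount, payment_id)
    )

correct_payment(7, 100.0, "customer refund of overcharge")

# Downstream ETL reads the corrected row; the audit trail survives.
corrected = conn.execute("SELECT amount FROM ods_payments WHERE id = 7").fetchone()
audit = conn.execute("SELECT old_amount, new_amount FROM corrections").fetchall()
print(corrected, audit)
```

The front-end database never sees this correction; it serves live sessions, while the ODS carries the business's authoritative transactional history.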
Data Marts, Archives and NoSQL Optimizations
Even when an ODS is not called for, a fully realized business data architecture often includes further optimizations for archival and departmental business needs:
Figure 5. Front-end to back-end replication with ODS, ETL layers, Data Warehouse and Data Mart
Data archives are implemented purely for low-cost long-term storage and retrieval, typically due to regulatory or disaster recovery requirements. Archives are not necessarily on-line at all times.
Data Marts can be in many forms; popular ones include visual/interactive BI environments, departmental SQL warehouses, OLAP databases, NoSQL network databases, and machine learning environments.
NoSQL stores now exist to optimize almost any application, from web content servers to transactional caches to big data platforms like Hadoop, and on and on. Expect data mart technologies especially to proliferate.
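In its simplest SQL form, a departmental data mart is just a pre-aggregated, read-only rollup materialized from the warehouse. A sketch, assuming the illustrative fact table and mart names below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Warehouse fact table (schema assumed) feeding a departmental mart.
conn.execute("CREATE TABLE fact_sales (region TEXT, month TEXT, total REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("west", "2024-01", 100.0),
    ("west", "2024-02", 150.0),
    ("east", "2024-01", 80.0),
])

# Sales-department mart: materialize the rollup once so the team can
# query it cheaply without scanning the full warehouse.
conn.execute("""
    CREATE TABLE mart_sales_by_region AS
    SELECT region, SUM(total) AS revenue
    FROM fact_sales
    GROUP BY region
""")

mart = conn.execute(
    "SELECT region, revenue FROM mart_sales_by_region ORDER BY region"
).fetchall()
print(mart)
```

An OLAP cube, a BI extract, or a machine learning feature store plays the same structural role: a purpose-shaped, refreshable copy downstream of the warehouse.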
Some Business Data Life Lessons
The consultants at Telegraph Hill Software have developed lean business data products for over twenty years while helping grow dozens of startups. Some hard-won lessons:
- Because lean startups initially meet non-functional requirements through the platform stack alone, select a stack soon after founding that can take you as far into future growth as possible. And yes, that might mean paying up for an enterprise license.
- Complexity kills agility and adds cost. Introduce multiple data repositories only after other optimizations are no longer viable.
- When your stack no longer meets your firm’s non-functional requirements, development teams must address technical and organization debt quickly, or risk losing credibility.
- Developing software with multiple databases is complex, even with simple replication. Your software development processes must quickly mature, and your development teams must specialize.
- Selectively implement existing archetypes (DW, ODS, DM, etc.) for handling business data maturity. There’s usually no need for massive innovation; use the tried and true.
- Be wary of the usefulness and risks of brand new NoSQL products.
- Recruit and nurture your enterprise data architect — he or she is critical to success.