Data Quality

A few days ago, I came across a LinkedIn post featuring a BI environment where the author was interacting via AI prompts. Nothing new, except the data source was Business Central. Scrolling through, I noticed the database was named “BC-Replica”.

Interesting.

Replicate data from On-Premises

From a BI perspective, this is a “lucky” scenario 😅. The SQL database is directly accessible, and a replica might not even be necessary:

  • Direct database access
  • Log shipping (in standby mode) to a secondary server, so queries don’t impact primary server performance

Just a heads-up: we’re talking about reading raw BC table data, not building reports directly on the source. A “replica” is usually just the first step in feeding a Data Warehouse, which is where reports and dashboards should actually live.

Replicate data from SaaS

In SaaS, while everything else gets simpler, data access actually becomes more complicated. There’s no direct database access; the only way to extract data is via APIs. In this case, these are OData endpoints used to filter and read table contents.

That is, of course, assuming Microsoft or the developer has exposed the table via API and made the relevant fields visible.

For example, here is the link to Microsoft’s standard APIs for querying the “Item Ledger Entry” table: https://learn.microsoft.com/en-us/dynamics365/business-central/dev-itpro/api-reference/v2.0/resources/dynamics_itemledgerentry
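Reading one of these endpoints boils down to an HTTP GET with OData query options appended. Here is a minimal sketch of assembling such a request; the tenant, environment, and company id are placeholders, and the selected field names are illustrative:

```python
from urllib.parse import urlencode

# Hypothetical SaaS endpoint; {tenant}, {environment} and {company_id} are placeholders.
BASE = ("https://api.businesscentral.dynamics.com/v2.0/"
        "{tenant}/{environment}/api/v2.0/companies({company_id})/itemLedgerEntries")

def build_query(filter_expr=None, select=None, top=None):
    """Assemble the OData query options that keep the payload small."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr
    if select:
        params["$select"] = ",".join(select)
    if top is not None:
        params["$top"] = str(top)
    # safe="$,()'" keeps OData punctuation readable in the final URL
    return urlencode(params, safe="$,()'")

query = build_query(
    filter_expr="postingDate ge 2024-01-01",
    select=["entryNumber", "itemNumber", "quantity", "postingDate"],
    top=5000,
)
url = BASE + "?" + query
# The read-only data access intent is sent as an HTTP header, not a query option:
headers = {"Data-Access-Intent": "ReadOnly"}
```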

And… what about the performance? 🥶 Microsoft says:

When calling APIs and OData web services, there are many strategies that you can use to speed up your queries, like:

  • Limiting the set ($filter or $top) if you’re using an expensive $expand statement
  • Using OData transaction $batch
  • Using Data Access Intent Read-only with OData

They add:

When refreshing a query in a Power BI semantic model, you might experience an error such as DataSource.Error: OData: Unable to read data from the transport connection: existing connection was forcibly closed by the remote host.

This error can happen either when the query refresh was interrupted by a transient error (in this case, just implement a retry) or if the query can’t finish within the limit defined for web service calls on the Business Central server. In the latter case, the Business Central server will cancel the call.

If this happens consistently, you need to tune your query, either by applying data filters, by reducing the number of columns (do you use all of them?), or maybe by partitioning the table (using custom partitioning through the XMLA endpoint for the model).
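The “just implement a retry” advice can be as small as a backoff loop around the page fetch. A hedged sketch (the exception handling is simplified; real code would inspect the underlying HTTP/transport error):

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=2.0):
    """Call `fetch` and retry transient transport errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up: likely the server-side time limit, not a transient blip
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```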

Perfect, they’ve removed database access, so the only way to get data out is by dumping entire tables, maybe with a few filters. Brilliant! 😅

This might go unnoticed on small databases, but with high data volumes, it becomes a serious issue.

The problem gets even worse when you consider deletions. They aren’t tracked at all, so an incremental replica based only on timestamp can easily become inconsistent.
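One common workaround is to pair the timestamp-based delta with a periodic key sweep: pull only the primary keys of the live table and delete whatever the replica still holds that BC no longer does. A minimal in-memory sketch (the function names are mine, not from any BC API):

```python
def incremental_sync(replica, fetch_changed, fetch_all_keys, since):
    """Timestamp-based incremental load, plus a key sweep to catch deletions.

    replica: dict mapping primary key -> row
    fetch_changed(since): rows modified after `since` (what a timestamp filter gives you)
    fetch_all_keys(): full set of primary keys currently in BC -- needed because
                      deleted rows simply vanish and never appear in fetch_changed.
    """
    for row in fetch_changed(since):
        replica[row["key"]] = row          # insert or update
    live_keys = set(fetch_all_keys())      # cheap: keys only, not full rows
    for key in set(replica) - live_keys:   # rows we hold that BC no longer has
        del replica[key]
    return replica
```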

Over time, every partner has developed their own techniques to effectively replicate a SaaS database. I’ve shared our approach before (https://simonegiordano.it/synchronization-pattern) and there was even a Microsoft version once (now deprecated: https://github.com/microsoft/bc2adls).

Alternatively, if you like gambling with consistency, Microsoft suggests using webhooks to receive a callback whenever something changes. But if you’re unfortunately offline for too long, here is what happens:

If Business Central can’t reach the subscriber, several retries are attempted over the next 36 hours. The subscriber must respond with the following error codes: 408 – Request Timeout, 429 – Too Many Requests, or any error in the 500–599 range (5xx). If the subscriber responds with any code other than those listed, no retries are attempted and the subscription is deleted.
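In practice this means a webhook receiver must be careful about what it answers: acknowledge fast, and when it can’t, fail with one of the retryable codes. A toy sketch of that decision:

```python
# If we can't process a notification right now, we must answer 408/429/5xx so
# Business Central keeps retrying -- anything else silently kills the subscription.
RETRYABLE = {408, 429} | set(range(500, 600))

def handle_notification(payload, enqueue):
    """Return the HTTP status code to answer a BC webhook notification with."""
    try:
        enqueue(payload)      # hand off fast; do the real work asynchronously
        return 202            # accepted -- no retry needed
    except Exception:
        return 503            # in RETRYABLE, so BC will try again later
```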

Data Quality

Data Quality (DQ) is the iterative and repetitive process of ensuring your database contains accurate, consistent, and actionable data for business decision-making.

“Repetitive” means DQ isn’t a one-time task; countless events can later trigger inconsistencies, like a new setup, a new company, or simply a new colleague with a different posting style.

DQ requires a much rawer dataset than a standard BI project. This is why we focus on full replicas, especially including all fields.

A practical example: DQ must ensure that matching purchase receipts to vendor invoices is handled correctly. This relies on the “Completely Invoiced” field in the “Item Ledger Entry” table. However, this field is: 1) not exposed by standard APIs 2) a flag that can be updated outside of any incremental date-based filters. 😅
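Once the full replica exists, the check itself is trivial: scan for purchase entries that are still open past some cutoff. A sketch over plain dicts (the key names are illustrative, not the exact SQL column names):

```python
from datetime import date

def open_purchase_receipts(entries, cutoff):
    """Return purchase entries still not completely invoiced before `cutoff`.

    `entries` are dicts mirroring Item Ledger Entry rows from the replica.
    """
    return [
        e for e in entries
        if e["entry_type"] == "Purchase"
        and not e["completely_invoiced"]
        and e["posting_date"] < cutoff
    ]
```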

There are hundreds of similar cases where having a full replica of the database and all its fields is absolutely essential!

Apache Superset

Moving forward… let’s say we have a full replica of a BC database, whether On-Premises or SaaS. We can now query it using SQL to hunt for issues!

Why SQL? Because the goal is Data Quality, not BI. We need to select and cross-reference data to pinpoint problems, not present visuals or calculate complex formulas.

Once the SQL query is ready, we need to make the process repeatable and accessible to others, even power users.

Apache Superset is a fantastic open-source tool that lets you expose SQL queries via the web (with proper access controls) to generate tables, pivots, and dashboards.

https://superset.apache.org

Installation is very straightforward 😅

    pip install apache-superset

Queries can be enhanced with Jinja, which is ideal for multi-company contexts:

    {% for company in ['Contoso', 'Fabrikam', 'Cronus IT', 'Cronus UK'] %}
    
    SELECT 
      '{{ company }}' AS [Company], 
      CAST(v.[Item Ledger Entry Quantity] AS float) [Inventory qty.],
      CAST((v.[Cost Amount (Expected)] + v.[Cost Amount (Actual)]) AS float) [Inventory value],
      l.[Code] AS [Location code],
      l.[Name] AS [Location name],
      i.[No_] AS [Item No.],
      i.[Description] AS [Item description],
      CAST(i.[Unit Cost] AS float) [Unit cost],
      CASE WHEN 
        (e.[Positive] = 1) AND ((v.[Cost Amount (Expected)] + v.[Cost Amount (Actual)]) = 0) 
      THEN 1 ELSE 0 END [Has zero values]
    FROM 
      [{{ company }}$Value Entry$437dbf0e-84ff-417a-965d-ed2bb9650972] v INNER JOIN
      [{{ company }}$Item Ledger Entry$437dbf0e-84ff-417a-965d-ed2bb9650972] e ON
        v.[Item Ledger Entry No_] = e.[Entry No_] INNER JOIN  
      [{{ company }}$Location$437dbf0e-84ff-417a-965d-ed2bb9650972] l ON
        e.[Location Code] = l.[Code] INNER JOIN
      [{{ company }}$Item$437dbf0e-84ff-417a-965d-ed2bb9650972] i ON
        e.[Item No_] = i.[No_]
    
    {% if not loop.last %} UNION ALL {% endif %}
    {% endfor %}
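For readers unfamiliar with Jinja: the loop above simply expands into one SELECT per company, glued together with UNION ALL. The same expansion in plain Python, to make the mechanics explicit (the SELECT is abbreviated to the company column only):

```python
# What the Jinja template expands to: one SELECT per company, joined by UNION ALL.
# Table names follow the BC on-prem convention "Company$Table$AppId".
COMPANIES = ["Contoso", "Fabrikam", "Cronus IT", "Cronus UK"]
APP_ID = "437dbf0e-84ff-417a-965d-ed2bb9650972"

def union_all_query(companies, app_id):
    selects = [
        f"SELECT '{c}' AS [Company] FROM [{c}$Value Entry${app_id}]"
        for c in companies
    ]
    return "\nUNION ALL\n".join(selects)

sql = union_all_query(COMPANIES, APP_ID)
```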

Finally, the queries are published to the web and made accessible to users for self-service. And the result looks great!

Conclusion

We are in an era where users expect to ask the system: “Hey, tell me if there are any issues with inventory valuation” 😅 While certainly possible, the prerequisite is a consistent, error-free database. Periodic Data Quality checks allow for timely corrections and give everyone the confidence that the aggregated data they see is correct.

Apache Superset can be a powerful ally in this!