Let's walk through creating a standard Build VM
Part 17 - Over Using Computed Fields
What happens when you add lots of computed fields to a data entity? Let's take a look
In part 16, we talked about using computed columns. Next, let's take this to next logical conclusion - using them way too much. We're going to compare 4 different scenarios from a performance perspective when working with customers. We'll have the following data entities:
- CustCustomerV2Entity - we'll use this entity to establish a baseline
- AAXCustCustomerV3Entity - this is a copy of CustCustomerV2Entity with 1 additional computed field; "Sales Last Year"
- AAXCustCustomerV4Entity - just like AAXCustCustomerV3Entity but has 2 additional computed fields; "Sales Last Year" and "Sales This Year"
- AAXCustCustomerV5Entity - Just like AAXCustCustomerV3Entity but has 22 additional competed fields to get sales values from 2000 to 2022 plus the 2 extra fields from the V4 version.
Are Computed Fields Bad?
No - not specifically. But like most things in life, they should be used in moderation. In our scenario we're adding lots of SQL to get a customer's total sales by year right on the customer entity. This sort of data should be pre-calculated in batch and presented from that pre-calculated table. This is actually already partially done for us with entity "Customer Aged Balances" and "Customer Balance Statistics". However, when adding computed fields to an entity, there are 2 things to consider:
- Do I need to calculate this every time I call the entity or should I stage this data somewhere else in batch?
- Is where I am adding the computed field a good place to add it?
Calculate Now Vs Stage Earlier
When deciding how to present data to for an endpoint, regardless of how it's used, you'll want to consider what data you're summarizing or computing and what it represents. In our scenario, we're looking at prior years sales. I wouldn't expect that number to change hardly at all once we've moved into the year calendar year. Sales Figures for a customer in years 2000 - 2021 would remain largely fixed, save for a correcting entry of some kind into last year - which would be a rare and probably corner case. An example that we may want to calculate in real time is the number of orders on credit hold. That would change day to day - hour to hour potentially and we'd want our data consumers to be aware of those updates. For data that doesn't change very much over time, we can pre-calculate it and stage it for consumption, as needed, rather than crunch the data each time something else. For data that changes frequently and rapidly, we should considering including that in a computer column.
Next, we should consider where we are placing our computer columns. In our example, we're placing statistical data right in our customer entity. When interacting with the customer entity, you may like to see that statistical data but chances are most of your consumers don't need it. Unless they are using a SELECT(), they'll get it. Customer specifics is master data and customer balances is analytical data. In general, our data entities are build around 1 data type so we should keep that data entity as close to that same type as we can.
To test the impact of the number of computed fields, we'll but fetching 100 customers at random from a development environment that has Contoso data loaded for each of the entities listed above.
Results below are in milliseconds.
From right to left, we have our OOTB entity and our entity with 1 extra computed column fairing about the same along with our entity that had 2 extra computed columns. Lastly, our entity that had 22 extra did show that it was slower than the other entities. Since the entities were alike in all regards except the computed columns, we can assume the difference is from the computed columns.
This didn't show the huge performance his that could happen when using computed fields in the way I wanted to show it. When using Contoso data, you have a very limited set of data to work with the Microsoft has curated and upgraded throughout the years. The most important aspect of the dataset I'm testing with is that it is very small. I think if we were to drastically change the data layout of the orders by customer, we'd start to see some different results. I'll be using ATL in my next article to create, pick/pack and ship lots of sales orders to see if we can get a different result from our testing.
Blog Main Tag: