What is data completeness?
Data completeness is an essential data quality metric, alongside consistency, validity, uniqueness, timeliness, integrity, and accuracy. Data is complete when it has no missing information and no duplicates.
Why is data completeness required?
Resolution Intelligence Cloud supports data completeness in Behaviour Analytics to help keep predictions accurate and to improve the overall performance of the ML models.
With data completeness, you gain greater transparency into your Behaviour Analytics models and can detect data ingestion problems throughout the data pipelines.
How do you measure data completeness in Behaviour Analytics?
Measuring data completeness depends on two factors:
- Model aggregation interval
- Data load type (incremental or initial load)
Three practical methods of measuring data completeness are implemented in Behaviour Analytics as follows:
Daily analysis
When the aggregation level is set to daily, hourly, or a specific hour of the day, and the data load is incremental, the model considers the data for one day and calculates completeness using the following formula:
Completeness = (Ceiling hour of the last event / 24) * 100
The model considers only non-empty columns, finds the last event recorded on the specific day, and takes the ceiling value of the hour at which that event occurred.
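The daily formula can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `daily_completeness` is hypothetical, and it assumes that "ceiling hour" means rounding any partial hour of the last event's timestamp up to the next whole hour.

```python
import math
from datetime import datetime


def daily_completeness(last_event: datetime) -> float:
    """Return the completeness percentage for the day of `last_event`."""
    # Fractional hour of the day at which the last event occurred.
    hour_of_day = (
        last_event.hour
        + last_event.minute / 60
        + last_event.second / 3600
        + last_event.microsecond / 3_600_000_000
    )
    # Assumption: a partial hour is rounded up to the next whole hour.
    ceiling_hour = math.ceil(hour_of_day)
    return ceiling_hour / 24 * 100


# Last event at 17:20 gives a ceiling hour of 18, so 18 / 24 * 100.
print(daily_completeness(datetime(2024, 3, 10, 17, 20)))  # 75.0
```

Under this assumption, a day whose last event arrives in the final hour scores 100, and a day whose data stops at midday scores 50.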
Averaged analysis
For the initial load or initial hard filter load, with the aggregation level set to weekly, day of the week, or monthly, the model considers the data for each day and averages the daily data completeness values according to the aggregation level.
In addition to the daily analysis described above, the model averages the daily data completeness values across the selected interval to give you the final score:
Overall data completeness score = Sum of daily values / Number of days in the interval
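The averaging step can be sketched as below. The function name `overall_completeness` and its parameters are hypothetical. The number of interval days is passed explicitly because the text does not say how days without data are counted.

```python
from typing import Sequence


def overall_completeness(daily_values: Sequence[float], interval_days: int) -> float:
    """Average daily completeness percentages over an interval."""
    if interval_days <= 0:
        raise ValueError("interval_days must be a positive number of days")
    # Sum of daily values divided by the number of days in the interval.
    return sum(daily_values) / interval_days


# A weekly interval: five complete days, one at 80% and one at 50%.
week = [100.0, 100.0, 100.0, 100.0, 100.0, 80.0, 50.0]
print(overall_completeness(week, interval_days=7))  # 90.0
```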
Combined daily and averaged analysis
Outside of initial loads, initial hard filter loads, and aggregation levels set to a monthly interval, the model calculates the data completeness score from three distinct bucket intervals: the current bucket and the two preceding ones. How the previous buckets are determined depends on the model's aggregation interval.
- For daily, hourly, or hour_of_day aggregation: each bucket interval spans one day. The two previous buckets are the two days immediately preceding the current bucket.
- For weekly or day_of_week aggregation: each bucket interval spans seven days, representing a full week. The two previous buckets are the two weeks immediately preceding the current week.
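The bucket selection described above can be sketched as follows. The names `BUCKET_DAYS` and `completeness_buckets` are hypothetical, and the start date of the current bucket is taken as an input because the text does not specify how buckets are aligned. The sketch only derives the three date ranges. It does not combine them into a score, since that step is not described here.

```python
from datetime import date, timedelta

# Bucket length in days for each aggregation interval named in the text.
BUCKET_DAYS = {
    "daily": 1,
    "hourly": 1,
    "hour_of_day": 1,
    "weekly": 7,
    "day_of_week": 7,
}


def completeness_buckets(current_start: date, aggregation: str) -> list:
    """Return (start, end) dates for the current bucket and the two before it."""
    span = timedelta(days=BUCKET_DAYS[aggregation])
    buckets = []
    for offset in range(3):  # 0 = current bucket, 1 and 2 = preceding buckets
        start = current_start - offset * span
        end = start + span - timedelta(days=1)  # inclusive end date
        buckets.append((start, end))
    return buckets


for start, end in completeness_buckets(date(2024, 3, 11), "weekly"):
    print(start, "to", end)
```

For a weekly model whose current bucket starts on 2024-03-11, this yields the weeks starting 2024-03-11, 2024-03-04, and 2024-02-26.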