Making sure data is consistent
Here we ensure that any data being loaded does not contradict the information already within the data warehouse. In some cases the changes could be valid (for example, an update to the product hierarchy), but they need to be controlled carefully so that they do not render meaningless any of the information already in the data warehouse. For example, if we compared the existing customer list with a previous version already in the data warehouse, we would expect every inconsistency to be explained by a specific customer event. If the events don’t exist, we need to determine whether event or customer transactions are missing.
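The customer-list comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the snapshot layout (a dictionary keyed by customer ID), the event types "created", "updated" and "closed", and the function name unexplained_changes are all hypothetical and would be replaced by whatever the source systems actually provide.

```python
def unexplained_changes(previous, current, events):
    """Return {customer_id: change_type} for every difference between two
    customer snapshots that has no matching customer event.

    previous, current: dicts mapping customer ID to an attribute dict
                       (hypothetical layout).
    events: list of dicts with "customer_id" and "type" keys (hypothetical).
    """
    explained = {(e["customer_id"], e["type"]) for e in events}
    issues = {}
    for cid in previous.keys() | current.keys():
        if cid not in previous:
            change = "created"      # appears only in the new list
        elif cid not in current:
            change = "closed"       # disappeared from the new list
        elif previous[cid] != current[cid]:
            change = "updated"      # attributes differ between versions
        else:
            continue                # identical in both versions
        if (cid, change) not in explained:
            issues[cid] = change
    return issues


# Illustrative data only.
previous = {
    1: {"name": "Acme", "region": "North"},
    2: {"name": "Birch", "region": "South"},
    3: {"name": "Cedar", "region": "West"},
}
current = {
    1: {"name": "Acme", "region": "East"},   # changed, event recorded
    3: {"name": "Cedar", "region": "West"},  # unchanged
    4: {"name": "Dune", "region": "West"},   # new, event recorded
}
events = [
    {"customer_id": 1, "type": "updated"},
    {"customer_id": 4, "type": "created"},
]

print(unexplained_changes(previous, current, events))
```

In this sample, customer 2 has disappeared from the list without a corresponding event, so it is the one flagged for investigation: either the closing event or the customer record itself is missing from the load.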
Once the data has been cleaned, the next task within the clean-and-transform process is to convert the source data in the temporary data store into a structure that is designed to balance query performance and operational cost. A set of techniques is applied to convert the data into a form that is suitable for long-term storage. Every data warehouse solution will use these techniques to varying degrees; the precise mix depends on the specific business requirements. Information in the data warehouse must be structured to support the performance requirements of the business, and also to minimize the ongoing operational cost. These requirements essentially pull the design in opposite directions, prompting you to structure either to improve query performance or to minimize operational management.