Change Data/Schema capture in AWS Glue
In today’s world, organizations are collecting an unprecedented amount of data from all kinds of different data sources, such as transactional data stores, clickstreams, log data, IoT data, and more.
In many use cases, I have found that the data teams responsible for building the data pipeline don’t have any control over the source schema, so they need a way to identify changes in the source schema before they can build a process or automation around it. This brings challenges such as:
- Sending notifications of changes to the teams that depend on the source schema, and building an auditing solution to log all schema changes.
- Building an automation or change request process to propagate the change in the source schema to downstream applications such as an ETL tool or BI dashboard.
- Sometimes, to control the number of schema versions, you may want to delete the older version of the schema when there are no changes detected between it and the newer schema.
There are many more approaches and challenges when it comes to change data or schema capture.
For example, assume you’re receiving claim files from different external partners in the form of flat files, and you’ve built a solution to process claims based on these files. However, because these files were sent by external partners, you don’t have much control over the schema and data format. For example, columns such as customer_id and claim_id were changed to customerid and claimid by one partner, and another partner added new columns such as customer_age and earning and kept the rest of the columns the same. You need to identify such changes in advance so you can edit the ETL job to accommodate the changes, such as changing the column name or adding the new columns to process the claims.
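To make the partner example concrete, here is a minimal sketch of how such changes could be detected by comparing column lists. It is not a Glue feature, just plain Python. The helper names `normalize` and `diff_schemas`, the `claim_amount` column, and the rename heuristic (ignoring case and underscores) are assumptions made for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase and drop underscores so customer_id matches customerid."""
    return name.lower().replace("_", "")


def diff_schemas(old_columns, new_columns):
    """Compare two column-name lists and classify the differences."""
    old_set, new_set = set(old_columns), set(new_columns)
    removed = [c for c in old_columns if c not in new_set]
    added = [c for c in new_columns if c not in old_set]

    # Heuristic (an assumption): a removed and an added column that
    # normalize to the same string are treated as a rename.
    added_by_norm = {normalize(c): c for c in added}
    renamed = {}
    for col in removed:
        match = added_by_norm.get(normalize(col))
        if match is not None:
            renamed[col] = match

    return {
        "renamed": renamed,
        "added": [c for c in added if c not in renamed.values()],
        "removed": [c for c in removed if c not in renamed],
    }


# Hypothetical baseline schema plus the two partner variations from the text.
baseline = ["customer_id", "claim_id", "claim_amount"]
partner_a = ["customerid", "claimid", "claim_amount"]
partner_b = ["customer_id", "claim_id", "claim_amount", "customer_age", "earning"]

print(diff_schemas(baseline, partner_a))
print(diff_schemas(baseline, partner_b))
```

For the first partner the result reports two renames, and for the second it reports two added columns. In a real pipeline the column lists would likely come from the Glue Data Catalog rather than hard-coded lists.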
Glue is smart enough to handle all of these changes at the crawler level. So while defining the crawler for our source, we just need to adjust its schema change settings.
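As a rough sketch of what that crawler configuration can look like in code, the snippet below builds the arguments for boto3's `create_crawler` call, including its `SchemaChangePolicy`. The helper `build_crawler_request` and every name, ARN, and S3 path in it are placeholders I made up, and the chosen behaviors are only one possible setup, not necessarily the one used in this post.

```python
ALLOWED_UPDATE = {"LOG", "UPDATE_IN_DATABASE"}
ALLOWED_DELETE = {"LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"}


def build_crawler_request(name, role_arn, database, s3_path,
                          update_behavior="UPDATE_IN_DATABASE",
                          delete_behavior="DEPRECATE_IN_DATABASE"):
    """Build keyword arguments for the Glue create_crawler API call."""
    if update_behavior not in ALLOWED_UPDATE:
        raise ValueError(f"unsupported UpdateBehavior: {update_behavior}")
    if delete_behavior not in ALLOWED_DELETE:
        raise ValueError(f"unsupported DeleteBehavior: {delete_behavior}")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # What the crawler does when it finds a changed or deleted object.
        "SchemaChangePolicy": {
            "UpdateBehavior": update_behavior,
            "DeleteBehavior": delete_behavior,
        },
    }


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="claims-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="claims_db",
    s3_path="s3://example-claims-bucket/incoming/",
    update_behavior="LOG",  # only log schema changes, leave the table as-is
)
print(request["SchemaChangePolicy"])

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("glue").create_crawler(**request)
```

Choosing `LOG` keeps the catalog table untouched and only records the change, which suits the notification and auditing scenario. `UPDATE_IN_DATABASE` lets the crawler apply the new schema to the table directly.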
So the crawler gives us various options to choose from, depending on your requirements. I hope you have learned something new today. Enjoy Glueing to your data!
