This task aims at experimentation and at acquiring an understanding of warehouse data staging via Glue.
Items:
- Please follow the Glue local development setup - https://github.com/umccr/orcahouse/tree/main/infra/glue
- Understand the basics of the PySpark API and how it processes data: Spark DataFrames, the data parallelism concept, etc.
- Exercise the skel template to see whether you can reuse it, say, if you were to process and stage another data source for the warehouse
- Observe the Terraform in the deploy directory and how its Glue ETL scripts are deployed in the prod account (try invoking a job run, etc.; no worries, the ETL scripts are idempotent)
- Observe the existing Glue script for spreadsheet processing
- In the current Glue ETL script, before loading into the warehouse staging database, investigate whether we can also "datalake" the output data (processed, or unprocessed/source as-is, etc.) for historical, archival, and similar purposes
- Perhaps, better yet, an S3 Table bucket with Iceberg instead of a plain datalake?
- Discuss the pros/cons and use whichever approach fits the use case
- Choose Python or Scala for the activity