Skip to content

Commit 5b95335

Browse files
include proposal for migration procedure
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
1 parent 0163e07 commit 5b95335

1 file changed

Lines changed: 85 additions & 1 deletion

File tree

proposals/2078-optimization-ol-facets.md

Lines changed: 85 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Proposal: Optimize query performance for OpenLineage facets
22

3-
Author: Willy Lulciuc ([@wslulciuc](https://github.com/wslulciuc))
3+
Author: Willy Lulciuc ([@wslulciuc](https://github.com/wslulciuc)), Paweł Leszczyński ([@pawel-big-lebowski](https://github.com/pawel-big-lebowski))
44

55
Created: 2022-08-18
66

@@ -115,3 +115,87 @@ The implementation requires:
115115
1. Schema changes to create the facet tables outlined above.
116116
2. Using the facet tables instead the `lineage_events` table to query for facets.
117117
3. Lazy migration, the facet tables will be queried, and if no facets are returned, then the `lineage_events` table; this approach avoids a backfill, but one will still be needed.
118+
119+
## Migration procedure
120+
121+
Following challenges need to be addressed to provide a successful migration procedure:
122+
* Rewriting existing `lineage_events` row to lineage datasets can be expensive DB operation for large Marquez instances.
123+
* Migration procedure should minimize downtime for processing new OL events.
124+
* Users need to be able to revert in case of migration failure.
125+
126+
Based on that, a migration procedure will be split into two jumps. The first revertible step will:
127+
* create new tables: `dataset_facets`, `job_facets` and `run_facets`,
128+
* introduce code change that writes into two new tables and reads from the new ones,
129+
* provide procedure to migrate facets from `lineage_events` table into `dataset_facets`, `job_facets` and `run_facets`.
130+
131+
The second irreversible step will perform the cleanup of `lineage_events` table.
132+
The second migration step will be published a release after.
133+
134+
We distinguish two migration modes:
135+
* `BASIC` when there is up to 100K rows in `lineage_events`.
136+
* `PRO` for Marquez instances with more than 100K records.
137+
138+
Migration script will run `BASIC` version automatically if the condition is met.
139+
`PRO` mode will require extra manual steps.
140+
141+
### BASIC MODE < 100K records
142+
143+
A flyaway DB script to create tables and migrate data will be created. This will be recommended for users who
144+
have up to 100K records in `lineage_events` table assuming each event is ~20KB size.
145+
Performance test for such scenario should be run during implementation phase.
146+
Such users may experience a few minute long downtime and should be OK with that while being
147+
clearly informed on that in `CHANGELOG`.
148+
149+
### PRO MODE > 100K records
150+
151+
Flyway migration engine runs the migration in transactions.
152+
Updating whole `lineage_events` table in a single transaction could be dangerous.
153+
The upgrade procedure will look the same as `BASIC` mode except for data migration from `lineage_events` table to
154+
`dataset_facets`, `job_facets` and `run_facets` tables which will not be triggered automatically nor done
155+
in a single run. Migration procedure command will be introduced and require manual trigger.
156+
Migrating facets from `lineage_events` will be done in chunks and the chunk size will be configurable.
157+
API will be able to receive incoming OpenLineage events while the data migration script will be running.
158+
159+
An extra single-row `migration_lock` table will be introduced:
160+
161+
| **COLUMN** | **TYPE** |
162+
|------------|---------------|
163+
| run_uuid | `UUID` |
164+
| created_at | `TIMESTAMPTZ` |
165+
166+
Data migration will be run in chunks:
167+
```
168+
WITH events_chunk AS (
169+
SELECT e.* FROM lineage_events e
170+
JOIN migration_lock m
171+
WHERE e.created_at < b.created_at OR (e.created_at = b.created_at AND e.run_uuid < m.run_uuid)
172+
ORDER BY e.created_at, e.run_uuid DESC
173+
LIMIT :chunk_size
174+
),
175+
insert_datasets AS (
176+
INSERT INTO dataset_facets
177+
SELECT ... FROM events_chunk
178+
),
179+
insert_runs AS (
180+
INSERT INTO run_facets
181+
SELECT ... FROM events_chunk
182+
),
183+
insert_jobs AS (
184+
INSERT INTO job_facets
185+
SELECT ... FROM events_chunk
186+
),
187+
TRUNCATE TABLE migration_lock,
188+
INSERT INTO migration_lock
189+
SELECT events_chunk.created_at, event_chunk_run_uuid
190+
FROM events_chunk
191+
ORDER BY created_at, run_uuid DESC
192+
LIMIT 1
193+
```
194+
Such a query will be run until `created_at` and `created_at` in `migration_lock` will equal:
195+
```
196+
SELECT run_uuid, created_at FROM lineage_events ORDER BY created_at, run_uuid DESC LIMIT 1;
197+
```
198+
The second migration step will not start unless the condition is met.
199+
For users, who attempt to run two migration steps in a single run,
200+
the second step will fail and ask to manually run data migration command and retry migration after
201+
the command runs successfully. Table `migration_lock` will be dropped at the end of second migration step.

0 commit comments

Comments
 (0)