optimizing current runs query for lineage api #2211
Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).
6ca59e1 to bc2349d
fea1168 to b39fcc5
  runAssert
      .extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class))
-     .hasSize(1);
+     .hasSize(0);
I think this is a little misleading since we're not asking the database for this information anymore. Should we just remove this assertion or change the type definition?
I agree. I had initially removed this completely but then added it back, because getInputVersions also had an assertion. I guess I should remove both of them now.
runAssert
    .extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class))
    .hasSize(1);
We are asserting on the Lineage object returned from the lineageService.lineage call, which is the response payload of the GET lineage API. Does this mean that the API response is being changed?
Yeah, I don't like all the information being returned here, but breaking API compatibility is not good. If we want a lighter-weight version of the lineage API, I think it's better to either include an optional parameter to exclude the superfluous data or to create a new API and deprecate the old one.
I have updated the PR and added a withRunFacets flag to get runs with all the superfluous data. This flag is always true in the Marquez APIs, so there is no effect on API compatibility.
601278b to 52ebaa3
  }

- public Lineage lineage(NodeId nodeId, int depth) {
+ public Lineage lineage(NodeId nodeId, int depth, boolean withRunFacets) {
This is a good change, but I worry that we'll want to add more options to this method (e.g., include job facets? dataset facets? exclude runs altogether?). I don't think we should take this on now, but let's add a TODO to make the input parameters here more easily extendable so that we can add those other options later on.
Couldn't agree more. I had a small aversion to adding a flag to make this work, but there was no better option. I also thought that in the future, if more changes like this come along that alter the API significantly, we could add these as options to the API query parameters, or create more broken-down APIs to get specific data.
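One way the "more easily extendable" input parameters could look is an options object with a builder, so that later flags do not change the method signature again. This is only a sketch: LineageOptions, its builder, and the default depth are hypothetical names and values chosen for illustration and are not part of Marquez; only the withRunFacets flag comes from this PR.

```java
// Hypothetical sketch; LineageOptions is not a Marquez class.
final class LineageOptions {
  private final int depth;
  private final boolean withRunFacets;

  private LineageOptions(Builder builder) {
    this.depth = builder.depth;
    this.withRunFacets = builder.withRunFacets;
  }

  static Builder builder() {
    return new Builder();
  }

  int depth() {
    return depth;
  }

  boolean withRunFacets() {
    return withRunFacets;
  }

  static final class Builder {
    // Illustrative default only; the real default depth is not stated in this thread.
    private int depth = 20;
    // Defaulting to true keeps the existing response shape, as discussed above.
    private boolean withRunFacets = true;

    Builder depth(int depth) {
      if (depth < 0) {
        throw new IllegalArgumentException("depth must be >= 0");
      }
      this.depth = depth;
      return this;
    }

    Builder withRunFacets(boolean withRunFacets) {
      this.withRunFacets = withRunFacets;
      return this;
    }

    LineageOptions build() {
      return new LineageOptions(this);
    }
  }
}
```

With something like this, a future "include job facets" or "exclude runs" option would become one more builder method with a backwards-compatible default, and existing callers would not need to change.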
@SqlQuery(
    "SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version\n"
        + " FROM runs_view r\n"
        + " INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid\n"
        + " INNER JOIN jobs_view j ON j.uuid=jv.job_uuid\n"
        + " WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)\n"
        + " ORDER BY r.job_name, r.namespace_name, created_at DESC\n")
I think we should use the more readable syntax variant as we update our queries.
"""
SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version
FROM runs_view r
INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid
INNER JOIN jobs_view j ON j.uuid=jv.job_uuid
WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)
ORDER BY r.job_name, r.namespace_name, created_at DESC
"""
Updated the syntax as you asked.
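For readers less familiar with PostgreSQL's DISTINCT ON, the query above keeps one row per (job_name, namespace_name) pair, and the ORDER BY ... created_at DESC makes that row the most recently created run. The following in-memory sketch mirrors that behaviour in plain Java. RunRow and latestPerJob are hypothetical names used only for illustration; RunRow is a simplified stand-in for a runs_view row, not the Marquez Run model.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LatestRunPerJob {
  /** Simplified, hypothetical stand-in for one row of runs_view. */
  static final class RunRow {
    final String namespaceName;
    final String jobName;
    final String runId;
    final Instant createdAt;

    RunRow(String namespaceName, String jobName, String runId, Instant createdAt) {
      this.namespaceName = namespaceName;
      this.jobName = jobName;
      this.runId = runId;
      this.createdAt = createdAt;
    }
  }

  /** Keeps the newest row for each (jobName, namespaceName) pair. */
  static List<RunRow> latestPerJob(List<RunRow> rows) {
    // Same ordering as: ORDER BY job_name, namespace_name, created_at DESC
    List<RunRow> sorted = new ArrayList<>(rows);
    sorted.sort(
        Comparator.comparing((RunRow r) -> r.jobName)
            .thenComparing(r -> r.namespaceName)
            .thenComparing(r -> r.createdAt, Comparator.<Instant>reverseOrder()));

    // DISTINCT ON keeps the first row of each group in that ordering.
    Map<List<String>, RunRow> firstPerKey = new LinkedHashMap<>();
    for (RunRow row : sorted) {
      firstPerKey.putIfAbsent(Arrays.asList(row.jobName, row.namespaceName), row);
    }
    return new ArrayList<>(firstPerKey.values());
  }
}
```

Doing this in the database rather than in application code is the point of the query: only one run per job has to be transferred and mapped.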
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
4662e51 to a330beb
Great job! Congrats on your first merged pull request in the Marquez project!
* optimizing current runs query for lineage api
  Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
* removing assert
  Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
* add new lightweight current runs query
  Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
* addressing review comments
  Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
Problem
Introduce a simpler alternate getCurrentRuns query that fetches only the basic runs from the DB, without the additional data from tables such as run_args, job_context, facets, and input/output versions, which required the extra table joins in the old getCurrentRuns query. This new getCurrentRuns DAO is NOT being used in Marquez as of now.
Closes: #4425
Solution
* The getCurrentRuns DAO is renamed to getCurrentRunsWithFacets without any change to the SQL query.
* The new getCurrentRuns is NOT called from the /lineage API as of now, so NO change is required to the /lineage API response spec.
* withRunFacets is also introduced as a parameter to the lineage API. It is always set to true to call getCurrentRunsWithFacets, so the /lineage API, and hence all the tests, still call the old DAO.
Checklist
* CHANGELOG.md updated with details about your change under the "Unreleased" section (if relevant; depending on the change, this may not be necessary)
* .sql database schema migration versioned according to Flyway's naming convention (if relevant)
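The dispatch described under Solution can be sketched roughly as follows. The method names getCurrentRuns and getCurrentRunsWithFacets and the withRunFacets flag come from this PR; the RunDao interface shape, the String placeholders for runs and job UUIDs, and the CurrentRunsLookup helper are assumptions made so the example is self-contained, and they do not reflect the actual Marquez signatures.

```java
import java.util.List;

final class CurrentRunsLookup {
  /** Hypothetical DAO shape; the real Marquez DAO returns Run objects and binds UUIDs. */
  interface RunDao {
    // New lightweight query: basic run rows only, no facet or dataset-version joins.
    List<String> getCurrentRuns(List<String> jobUuids);

    // Renamed original query: runs with facets, args, and input/output versions.
    List<String> getCurrentRunsWithFacets(List<String> jobUuids);
  }

  /** Picks the DAO method based on the flag; true preserves the existing behaviour. */
  static List<String> currentRuns(RunDao dao, List<String> jobUuids, boolean withRunFacets) {
    return withRunFacets
        ? dao.getCurrentRunsWithFacets(jobUuids)
        : dao.getCurrentRuns(jobUuids);
  }
}
```

Because callers pass true today, the /lineage response stays unchanged; the lightweight path only becomes reachable once some caller passes false.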