Conversation
Sometimes we need to run two or more `data4es` processes with different configurations (say, regular updates every hour, reverse update for data repair, weekly update to get updated data from original sources where the timestamp was not updated properly). To make this possible, we need at least to be able to set "where to read the configuration from" without changing code. Now it is done via the shell environment (the DATA4ES_CONFIG_PATH variable): `data4es-start` looks for config files by name in the given directory (or directories). Multiple directories are to be separated with a colon (`:`).
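For illustration, the lookup could work like this (a sketch in Python; the function name is made up, only the variable name and the colon convention come from the text above):

```python
import os

def find_config(name):
    """Look for a config file by name in DATA4ES_CONFIG_PATH directories.

    Directories in the variable are separated with a colon (':').
    """
    path = os.environ.get('DATA4ES_CONFIG_PATH', '.')
    for directory in path.split(':'):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```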
DF/09: bug fix (failure to output a log message).
To be able to run two or more different processes for the `data4es` dataflow, we need to separate their home directories (where a process stores temporary files, including the pidfile).
data4es: individual home directory for a process.
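A possible arrangement (a hypothetical sketch; the directory layout and names are not taken from the actual code):

```python
import os

def process_home(base_dir, instance_name):
    """Return a per-process home directory, creating it if needed."""
    home = os.path.join(base_dir, instance_name)
    if not os.path.isdir(home):
        os.makedirs(home)
    return home

# The "hourly" instance keeps its pidfile apart from other instances:
# pidfile = os.path.join(process_home('/var/run/data4es', 'hourly'),
#                        'data4es.pid')
```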
Stage 093 used to get the data format from the fifth field of the dataset name. The ATLAS dataset nomenclature (https://dune.bnl.gov/w/images/9/9e/Gen-int-2007-001_%28NOMENCLATURE%29.pdf) states that the data format is the last but one field, and that the number of fields may vary from project to project. E.g., for "user" datasets there may be more fields (eight in this example):
"user.cgottard.D17p3388.00336852.physics_Main.f871_m1879_p3388_output.root.168308563"
However, sometimes there seems to be no data format field at all:
"user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg.168254771"
Now we have decided to get the data format from:
* for 'user' and 'group' datasets:
  * the 7th field, or
  * None (if there are fewer than 7 fields);
* for other datasets:
  * the 5th field (as it used to be).
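The new rule as a sketch (the function name is illustrative; the actual Stage 093 code may differ in details):

```python
def get_data_format(dsn):
    """Extract the data format field from a dataset name."""
    fields = dsn.split('.')
    if fields[0] in ('user', 'group'):
        # 7th field, or None if there are fewer than 7 fields
        return fields[6] if len(fields) >= 7 else None
    # 5th field, as it used to be
    return fields[4] if len(fields) >= 5 else None

# get_data_format("user.cgottard.D17p3388.00336852.physics_Main."
#                 "f871_m1879_p3388_output.root.168308563")  -> 'root'
# get_data_format("user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg."
#                 "168254771")                               -> None
```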
Stage 019, used in the data4es process, never took a config file from the command line, and thus the ES settings were the same for every data4es process.
As there are some changes coming, we need to have an actual sample of the output data (to ensure that the changes in code have broken nothing).
The previous version could return information about only one attribute at a time; if we need more attributes, it would send a number of identical requests to Rucio, which is not good.
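A sketch of the batched approach, assuming Rucio's `DIDClient.get_metadata()` (which returns all metadata of a DID in one call); the wrapper itself is made up:

```python
from rucio.client.didclient import DIDClient

def get_attributes(scope, name, attributes):
    """Fetch several metadata attributes with a single Rucio request."""
    client = DIDClient()
    metadata = client.get_metadata(scope, name)  # one request for everything
    return dict((attr, metadata.get(attr)) for attr in attributes)
```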
It sounds better to keep all metadata adjustments (like replacing `'null'` with `None` or `bytes` handling) in one place, so they are wrapped into a separate function.
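A sketch of what such a function might look like (the name and the exact set of adjustments are illustrative):

```python
def adjust_metadata(mdata):
    """Apply all metadata adjustments in one place."""
    for key, val in mdata.items():
        if val == 'null':
            # the source may return the string 'null' instead of a real null
            mdata[key] = None
        elif isinstance(val, bytes):
            mdata[key] = val.decode('utf-8')
    return mdata
```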
The mapping for these indices is essentially the same (as we store the same data from the same sources there); thus there is no need to have two identical mapping templates when we can use the same pattern for both indices. Having two files with the mapping is not good: when changing the mapping, we would need to change it twice. ES prior to 6.0 does not allow "OR" in a template: there is no way to say "prodsys_rucio_ami OR analysis". Thus we rename the indices accordingly, to "tasks_production" and "tasks_analysis", and set the index template pattern to "tasks_*" for the mapping.
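For illustration, setting such a template via the Python ES client could look like this (a sketch for ES 5.x, where the `template` key is used instead of `index_patterns`; the mapping body itself is omitted):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['ES_HOST:ES_PORT'])
es.indices.put_template(name='tasks', body={
    # one pattern matches both tasks_production and tasks_analysis
    'template': 'tasks_*',
    'mappings': {}  # the shared mapping goes here
})
```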
We need two almost identical queries for production and analysis tasks;
they differ only in the condition on the `pr_id` field.
To avoid duplicate files in the repository (which may lead to mistakes
like "changed one file and forgot about another"), the query must be
parametrized in some way. As we already have a config file for this stage,
it seems reasonable to put the query parameters into it.
Parametric pieces of queries must be specified as
`%(<parameter_name>)s`, with values specified in the `queries.params` section
of the config file: `<parameter_name>: <parameter_value>`.
When parameters are used, one must make sure that no percent signs in the
query will produce an error, escaping them if needed ('%%' instead of '%').
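This is the usual Python %-style formatting with a dictionary; a minimal illustration (the query text and parameter values are made up):

```python
query_template = (
    "SELECT taskid, taskname FROM tasks "
    "WHERE pr_id > %(pr_id)s "
    "AND taskname LIKE '%%.%(format)s.%%'"  # literal '%' doubled to '%%'
)
params = {'pr_id': 300, 'format': 'DAOD'}
query = query_template % params
# -> "... WHERE pr_id > 300 AND taskname LIKE '%.DAOD.%'"
```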
DF/91: refactoring
`data4es` process for analysis tasks:
* changed default ES index name;
* Stage 009 now can work with parametric SQL queries;
* Stage 093 patched to work with analysis dataset names (slightly different nomenclature).
Query parameters: project name, AMI tag, output dataset format.
A result example is given in the <query-name>-r.json file.
To run the query, use the command:
curl -X GET -H 'Content-Type: application/json' -d @deriv-ratio-q.json \
'http://ES_HOST:ES_PORT'
Jobs were missed because they were started and finished before the moment specified as the task 'start_time' and fell into the previous day's index. For this not to happen, the previous index should always be added to the indices list of the query.
If the interval between start and end time is too long, there will be too many indices in the request, which leads to the TooLongFrame exception: TransportError(400, u'too_long_frame_exception', u'An HTTP line is larger than 4096 bytes.') For example: task 6793089. To reduce the number of indices, instead of the YYYY-MM-DD* suffix we can use YYYY-MM* (or even YYYY*, if necessary; but that is not implemented yet).
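A sketch of building month-level index patterns, also including the previous month as required by the fix above (the index prefix and function name are illustrative):

```python
from datetime import date

def month_suffixes(start, end, prefix='jobs-'):
    """Index patterns 'jobs-YYYY-MM*' covering [start, end].

    The month before `start` is included so that jobs finished just
    before the task 'start_time' are not missed.
    """
    year, month = start.year, start.month
    # step one month back to catch the previous index
    if month == 1:
        year, month = year - 1, 12
    else:
        month -= 1
    suffixes = []
    while (year, month) <= (end.year, end.month):
        suffixes.append('%s%04d-%02d*' % (prefix, year, month))
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1
    return suffixes

# month_suffixes(date(2018, 12, 15), date(2019, 2, 1))
# -> ['jobs-2018-11*', 'jobs-2018-12*', 'jobs-2019-01*', 'jobs-2019-02*']
```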
Using a fixed term both in error messages and in the config example should make it easier for the user to find a problem with query parameters when it occurs.
Look for the '%(' construction instead of a non-doubled '%'. This construction is
mandatory in parametric queries and less likely to occur in non-parametric
ones.
Clarify that the warning should only be taken into account when a parametric query is used.
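A minimal sketch of such a check (the function name and the message text are assumed):

```python
import sys

def check_percent_signs(query):
    """Warn about a bare '%' that would break a parametric query.

    '%(' is taken as the sign of a parametric query; in such a query
    any remaining single '%' must be escaped as '%%'.
    """
    if '%' in query.replace('%%', '').replace('%(', ''):
        sys.stderr.write("WARNING: single '%' found; if the query is "
                         "parametric, escape it as '%%'.\n")
```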
Stage 009: Improve query saving procedure.
Stage 025 update (total HS06)
Conflicts:
	Utils/Dataflow/091_datasetsRucio/datasets_processing.py
	Utils/Dataflow/pyDKB/common/custom_readline.py
Force-pushed from 132b720 to fb7e780.
NOTE: the only thing done here is `make clean; make all`. To make the documentation fully consistent with the code, we should go through the docs thoroughly; that is to be done during the pyDKB documentation revision.
Force-pushed from ae41ed3 to 9e9fc54.
As there were quite a lot of changes, both in
The bug was introduced in the following commit: ec0725f "DF: remove error handling from stages (016_task2es)."
The bug appeared due to the merge of two long-living branches:
* the submodule was moved in the `pyDKB` branch (5e3003e), and
* used for the second time in `master` (a0226ff).
It tells us, by the way, that a better idea would be to import this module at the beginning, since it is used more than once...
Force-pushed from 6685fbd to f415e34.
No conflict here, as 009 does not utilize
Force-pushed from f415e34 to abd340a.
I tested the data4es (with stages 025 and 069 disabled) and some separate stages; they seem to be working. I also skimmed through the changes and failed to find anything of interest.
Thank you!
Just a note to clarify why disabling the stages was fine for testing:
* Stage 069 shouldn't be affected or reveal any issues (as long as 019 produces the same output as before).
* Stage 025 *had* been affected, in fact (as it was added in `master` and was not adapted to the new version of `pyDKB` along with the others), but it was fixed in the last commits.
anastasiakaida left a comment:
Didn't find anything strange, so here is my approval.
@anastasiakaida, @Evildoor, thank you both for checking this PR. Will merge it now.
Background:
This PR is aimed at bringing documentation stuff to the `pyDKB` branch, to make it possible to revise and improve documentation directly in the latest version of the library. Documentation was requested by @Evildoor for `pyDKB` unit tests development.
Unit tests were requested by @mgolosova to simplify further development of the library (as development without proper testing may sometimes lead to situations when things are broken and nobody knows about it for a long time; early fixes are much easier than late ones).
ToDo:
* `pyDKB.common.custom_readline`;
* `--config` parameter in pyDKB and 009.

Can not be merged to `pyDKB` unless `pyDKB` itself is fixed (see PR #170). After the merge, the `pyDKB` branch should be merged here again.

Everything that could be fixed was fixed.