
pyDKB: merge 'master' branch#165

Merged
mgolosova merged 185 commits into pyDKB from pyDKB-master-merge on Jun 10, 2019
Conversation

@mgolosova (Collaborator) commented Aug 31, 2018

Background:

  • This PR is aimed at bringing the documentation work into the pyDKB branch, to make it possible to revise and improve the documentation directly in the latest version of the library.

  • Documentation was requested by @Evildoor for pyDKB unit test development.

  • Unit tests were requested by @mgolosova to simplify further development of the library (development without proper testing can leave things broken without anyone noticing for a long time, and early fixes are much easier than late ones).


ToDo:

  • check conflict resolution:
    • Stage 091;
    • pyDKB.common.custom_readline;
  • update pyDKB docs according to the code in the destination branch;
  • check --config parameter in pyDKB and 009.

Cannot be merged into pyDKB until pyDKB itself is fixed (see PR #170).
After that merge, the pyDKB branch should be merged here again.

Everything that could be fixed was fixed.

mgolosova added 30 commits June 7, 2018 13:34
Sometimes we need to run two or more `data4es` processes with different
configurations (say, regular updates every hour, a reverse update for data
repair, a weekly update to get updated data from original sources where the
timestamp was not updated properly). To make this possible, we need at
least to be able to set "where to read the configuration from" without
changing code.

Now it is done via the shell environment (the DATA4ES_CONFIG_PATH variable).
`data4es-start` looks for config files by name in the given directory (or
directories); multiple directories should be separated with a colon (`:`).
DF/09: bug fix (failure to output a log message).
To be able to run two or more different processes for the `data4es`
dataflow, we need to separate their home directories (where a process
stores temp files, including the pidfile).
data4es: individual home directory for a process.
Stage 093 used to get the data format from the fifth field of the dataset
name. The ATLAS dataset nomenclature
(https://dune.bnl.gov/w/images/9/9e/Gen-int-2007-001_%28NOMENCLATURE%29.pdf)
states that the data format is the last-but-one field and that the number
of fields may vary from project to project.
E.g., for "user" datasets, there may be more fields:
"user.cgottard.D17p3388.00336852.physics_Main.f871_m1879_p3388_output.root.168308563"
 1    2        3        4        5            6                       7    8

However, sometimes there seems to be no data format field at all:
"user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg.168254771"

Now we have decided to get the data format as follows (see the sketch below):
* for 'user' and 'group' datasets:
  * the 7th field, or
  * None (if there are fewer than 7 fields);
* for other datasets:
  * the 5th field (as it used to be).
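
A sketch of the resulting rule (hypothetical function name, not the actual
Stage 093 code):

```python
def get_data_format(dsname):
    """Extract the data format field from a dataset name.

    Field numbering is 1-based in the rule above, 0-based here.
    """
    fields = dsname.split('.')
    if fields[0] in ('user', 'group'):
        return fields[6] if len(fields) >= 7 else None
    return fields[4] if len(fields) >= 5 else None

# get_data_format("user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg.168254771")
# -> None (only 4 fields)
```
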
Stage 19, used in the data4es process, never took a config file from the
command line, and thus all ES settings were the same for every data4es process.
As there are some changes coming, we need an up-to-date sample of the
output data (to ensure that the changes in code have broken nothing).
The previous version could return information about one attribute at a time;
if we need more attributes, it sends a number of identical requests to
Rucio, which is not good.
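
A sketch of fetching several attributes with a single request; the helper is
hypothetical, assuming the standard Rucio client's `get_metadata()` call:

```python
from rucio.client import Client  # assumes a configured Rucio client

def get_dataset_metadata(scope, name, attrs):
    """Fetch several metadata attributes with a single Rucio call."""
    mdata = Client().get_metadata(scope=scope, name=name)
    return {a: mdata.get(a) for a in attrs}
```
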
It sounds better to keep all metadata adjustments (like replacing
`'null'` with `None` or `bytes` handling) in one place, so they are wrapped
into a separate function.
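
An illustration of such a wrapper (the name and exact rules are assumptions):

```python
def adjust_metadata(mdata):
    """Apply all common metadata adjustments in one place."""
    adjusted = {}
    for key, val in mdata.items():
        if val == 'null':
            # Rucio may return the string 'null' instead of a missing value
            val = None
        elif isinstance(val, bytes):
            val = val.decode('utf-8')
        adjusted[key] = val
    return adjusted
```
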
The mapping for these indices is essentially the same (as we store the
same data from the same sources); thus there is no need to have two
identical mapping templates when we can use one pattern for both
indices.

Having two files with the mapping is not good: when changing the mapping,
we need to change it twice.

ES prior to 6.0 does not allow an "OR" in the template pattern: there is
no way to say "prodsys_rucio_ami OR analysis". Thus we rename the indices
to "tasks_production" and "tasks_analysis" and set the index template
pattern to "tasks_*" for the mapping.
We need two almost identical queries for production and analysis tasks;
they differ only in the condition on the `pr_id` field.
To avoid duplicate files in the repository (which may lead to mistakes
like "changed one file and forgot about another"), the query must be
parametrized in some way. As we already have a config file for this stage,
it seems reasonable to put the query parameters into it.

Parametric pieces of queries must be specified as
`%(<parameter_name>)s`, with values given in the `queries.params` section
of the config file: `<parameter_name>: <parameter_value>`.

When parameters are used, one must make sure that no percent signs in the
query will produce an error, and escape them if needed ('%%' instead of '%').
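
An illustration of the placeholder mechanics with Python's `%` formatting
(the query text and parameter name are made up for the example):

```python
query_template = ("SELECT taskid FROM tasks "
                  "WHERE taskname LIKE 'mc16%%' "   # literal '%' doubled
                  "AND pr_id %(pr_id_condition)s")
params = {'pr_id_condition': '>= 300'}  # from the queries.params section
print(query_template % params)
# SELECT taskid FROM tasks WHERE taskname LIKE 'mc16%' AND pr_id >= 300
```
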
`data4es` process for analysis tasks:
* changed default ES index name;
* Stage 009 can now work with parametric SQL queries;
* Stage 093 patched to work with analysis dataset names
  (slightly different nomenclature).
Query parameters: project name, AMI tag, output dataset format.
An example result is given in the <query-name>-r.json file.

To run the query, use the command:

  curl -X GET -H 'Content-Type: application/json' -d @deriv-ratio-q.json \
       'http://ES_HOST:ES_PORT'
mgolosova and others added 11 commits February 15, 2019 08:14
Jobs were missed because they were started and finished before the
moment specified as the task 'start_time' and fell into the previous day's
index. To prevent this, the previous index should always be added
to the list of indices in the query.
If the interval between the start and end time is too long, there will be
too many indices in the request, which leads to a TooLongFrame exception:

  TransportError(400, u'too_long_frame_exception', u'An HTTP line is
  larger than 4096 bytes.')

For example: task 6793089.

To reduce the number of indices, instead of the YYYY-MM-DD* suffix we can
use YYYY-MM* (or even YYYY*, if necessary; but that has not been implemented yet).
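
A sketch of building month-level index patterns for a date range
(hypothetical helper; the index prefix is illustrative):

```python
from datetime import date

def month_indices(prefix, start, end):
    """Yield month-level index patterns covering [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield '%s%04d-%02d*' % (prefix, year, month)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

print(list(month_indices('jobs_archive_', date(2019, 1, 20), date(2019, 3, 5))))
# ['jobs_archive_2019-01*', 'jobs_archive_2019-02*', 'jobs_archive_2019-03*']
```
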
Using the same fixed term both in error messages and in the config example
should make it easier for the user to find a problem with query parameters
when one occurs.
Look for the '%(' construction instead of a non-doubled '%'. This
construction is mandatory in parametric queries and less likely to occur in
non-parametric ones.
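
A sketch of the two heuristics side by side (hypothetical function names):

```python
import re

def looks_parametric(query):
    """New heuristic: '%(' is mandatory in %(name)s placeholders."""
    return '%(' in query

def has_single_percent(query):
    """Old heuristic: any '%' that is not doubled."""
    return re.search(r'(?<!%)%(?!%)', query) is not None
```
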
Clarify that the warning should only be taken into account when a
parametric query is used.
Stage 009: Improve query saving procedure.
Conflicts:
  Utils/Dataflow/091_datasetsRucio/datasets_processing.py
  Utils/Dataflow/pyDKB/common/custom_readline.py
mgolosova force-pushed the pyDKB-master-merge branch from 132b720 to fb7e780 on March 28, 2019 11:31
NOTE: the only thing done here is `make clean; make all`. To make
documentation fully consistent with the code, we should go through the
docs thoroughly, and this is to be done during the pyDKB documentation
revision.
mgolosova force-pushed the pyDKB-master-merge branch from ae41ed3 to 9e9fc54 on March 28, 2019 11:48
@mgolosova (Collaborator, Author)

As there were quite a lot of changes, both in master and pyDKB, the branch was reset to the new version of pyDKB before merging master into it.
Stage 091 conflict resolution followed the same scenario as last time, so I have checked off this item in the PR description.
But there was another conflict -- in pyDKB.common.custom_readline (docstrings).

mgolosova changed the title from "[pending] pyDKB: merge 'master' branch" to "pyDKB: merge 'master' branch" on Mar 28, 2019
mgolosova added 3 commits June 6, 2019 12:18
The bug was introduced in the following commit:

ec0725f "DF: remove error handling from stages (016_task2es)."
The bug appeared due to the merge of two long-lived branches:
* the submodule was moved in the `pyDKB` branch (5e3003e), and
* it was used for a second time in `master` (a0226ff).

This suggests, by the way, that it would be better to import this
module at the beginning, since it is used more than once...
mgolosova force-pushed the pyDKB-master-merge branch from 6685fbd to f415e34 on June 6, 2019 12:03
@mgolosova (Collaborator, Author)

  • check conflict resolution: pyDKB.common.custom_readline;

/Utils/Dataflow/test/pyDKB/test.sh revealed no errors, so I believe everything is fine.

@mgolosova (Collaborator, Author)

  • check --config parameter in pyDKB and 009.

No conflict here, as 009 does not utilize pyDKB.dataflow.stage functionality yet.

mgolosova force-pushed the pyDKB-master-merge branch from f415e34 to abd340a on June 6, 2019 12:14
@Evildoor (Contributor) commented Jun 7, 2019

I tested data4es (with stages 025 and 069 disabled) and some separate stages; they seem to be working. I also skimmed through the changes and found nothing of concern.

@mgolosova (Collaborator, Author) commented Jun 8, 2019 via email

anastasiakaida self-requested a review on June 10, 2019 10:21
@anastasiakaida (Contributor) left a comment

Didn't find anything strange, so here is my approval.

@mgolosova (Collaborator, Author)

@anastasiakaida, @Evildoor, thank you both for checking this PR.

Will merge it now.

mgolosova merged commit 82c9eac into pyDKB on Jun 10, 2019