Conversation
Sometimes we need to run two or more `data4es` processes with different configurations (say, regular updates every hour, reverse update for data repair, weekly update to get updated data from original sources where the timestamp was not updated properly). To make this possible, we need at least to be able to set "where to read the configuration from" without changing code. Now it is done via the shell environment (the DATA4ES_CONFIG_PATH variable): `data4es-start` looks for config files by name in the given directory (or directories). Multiple directories are to be separated with a colon (`:`).
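For illustration, the lookup could work like this (a sketch in Python; the function name is made up, only the variable name and the colon convention come from the text above):

```python
import os

def find_config(name):
    """Look for a config file by name in DATA4ES_CONFIG_PATH directories.

    Directories in the variable are separated with a colon (':').
    """
    path = os.environ.get('DATA4ES_CONFIG_PATH', '.')
    for directory in path.split(':'):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```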
DF/09: bug fix (failure to output a log message).
To be able to run two or more different processes for the `data4es` dataflow, we need to separate their home directories (where a process stores temporary files, including the pidfile).
data4es: individual home directory for a process.
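A possible arrangement (a hypothetical sketch; the directory layout and names are not taken from the actual code):

```python
import os

def process_home(base_dir, instance_name):
    """Return a per-process home directory, creating it if needed."""
    home = os.path.join(base_dir, instance_name)
    if not os.path.isdir(home):
        os.makedirs(home)
    return home

# The "hourly" instance keeps its pidfile apart from other instances:
# pidfile = os.path.join(process_home('/var/run/data4es', 'hourly'),
#                        'data4es.pid')
```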
Stage 093 used to get the data format from the fifth field of the dataset name. The ATLAS dataset nomenclature (https://dune.bnl.gov/w/images/9/9e/Gen-int-2007-001_%28NOMENCLATURE%29.pdf) states that the data format is the last but one field, and that the number of fields may vary from project to project. E.g., for "user" datasets there may be more fields (eight in this example):
"user.cgottard.D17p3388.00336852.physics_Main.f871_m1879_p3388_output.root.168308563"
However, sometimes there seems to be no data format field at all:
"user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg.168254771"
Now we have decided to get the data format from:
* for 'user' and 'group' datasets:
  * the 7th field, or
  * None (if there are fewer than 7 fields);
* for other datasets:
  * the 5th field (as it used to be).
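The new rule as a sketch (the function name is illustrative; the actual Stage 093 code may differ in details):

```python
def get_data_format(dsn):
    """Extract the data format field from a dataset name."""
    fields = dsn.split('.')
    if fields[0] in ('user', 'group'):
        # 7th field, or None if there are fewer than 7 fields
        return fields[6] if len(fields) >= 7 else None
    # 5th field, as it used to be
    return fields[4] if len(fields) >= 5 else None

# get_data_format("user.cgottard.D17p3388.00336852.physics_Main."
#                 "f871_m1879_p3388_output.root.168308563")  -> 'root'
# get_data_format("user.ffedotov.HNL_NEOSIG_11_00309440_PromptBkg."
#                 "168254771")                               -> None
```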
Stage 019, used in the data4es process, never took a config file from the command line, and thus the ES settings were the same for every data4es process.
As there are some changes coming, we need to have an actual sample of the output data (to ensure that the changes in code have broken nothing).
The previous version could return information about only one attribute at a time; if we need more attributes, it would send a number of identical requests to Rucio, which is not good.
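A sketch of the batched approach, assuming Rucio's `DIDClient.get_metadata()` (which returns all metadata of a DID in one call); the wrapper itself is made up:

```python
from rucio.client.didclient import DIDClient

def get_attributes(scope, name, attributes):
    """Fetch several metadata attributes with a single Rucio request."""
    client = DIDClient()
    metadata = client.get_metadata(scope, name)  # one request for everything
    return dict((attr, metadata.get(attr)) for attr in attributes)
```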
It sounds better to keep all metadata adjustments (like replacing `'null'` with `None` or `bytes` handling) in one place, so they are wrapped into a separate function.
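A sketch of what such a function might look like (the name and the exact set of adjustments are illustrative):

```python
def adjust_metadata(mdata):
    """Apply all metadata adjustments in one place."""
    for key, val in mdata.items():
        if val == 'null':
            # the source may return the string 'null' instead of a real null
            mdata[key] = None
        elif isinstance(val, bytes):
            mdata[key] = val.decode('utf-8')
    return mdata
```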
The mapping for these indices is essentially the same (as we store the same data from the same sources there); thus there is no need to have two identical mapping templates when we can use the same pattern for both indices. Having two files with the mapping is not good: when changing the mapping, we would need to change it twice. ES prior to 6.0 does not allow "OR" in a template: there is no way to say "prodsys_rucio_ami OR analysis". Thus we rename the indices accordingly, to "tasks_production" and "tasks_analysis", and set the index template pattern to "tasks_*" for the mapping.
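For illustration, setting such a template via the Python ES client could look like this (a sketch for ES 5.x, where the `template` key is used instead of `index_patterns`; the mapping body itself is omitted):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['ES_HOST:ES_PORT'])
es.indices.put_template(name='tasks', body={
    # one pattern matches both tasks_production and tasks_analysis
    'template': 'tasks_*',
    'mappings': {}  # the shared mapping goes here
})
```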
We need two almost identical queries for production and analysis tasks;
they differ only in the condition on the `pr_id` field.
To avoid duplicate files in the repository (which may lead to mistakes
like "changed one file and forgot about another"), the query must be
parametrized in some way. As we already have a config file for this stage,
it seems reasonable to put the query parameters into it.
Parametric pieces of queries must be specified as
`%(<parameter_name>)s`, with values specified in the `queries.params` section
of the config file: `<parameter_name>: <parameter_value>`.
When parameters are used, one must make sure that no percent signs in the
query will produce an error, escaping them if needed ('%%' instead of '%').
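This is the usual Python %-style formatting with a dictionary; a minimal illustration (the query text and parameter values are made up):

```python
query_template = (
    "SELECT taskid, taskname FROM tasks "
    "WHERE pr_id > %(pr_id)s "
    "AND taskname LIKE '%%.%(format)s.%%'"  # literal '%' doubled to '%%'
)
params = {'pr_id': 300, 'format': 'DAOD'}
query = query_template % params
# -> "... WHERE pr_id > 300 AND taskname LIKE '%.DAOD.%'"
```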
DF/91: refactoring
`data4es` process for analysis tasks:
* changed default ES index name;
* Stage 009 now can work with parametric SQL queries;
* Stage 093 patched to work with analysis dataset names (slightly different nomenclature).
Query parameters: project name, AMI tag, output dataset format.
A result example is given in the <query-name>-r.json file.
To run the query, use the command:
curl -X GET -H 'Content-Type: application/json' -d @deriv-ratio-q.json \
'http://ES_HOST:ES_PORT'
Jobs were missed because they were started and finished before the moment specified as the task 'start_time' and fell into the previous day's index. For this not to happen, the previous index should always be added to the indices list of the query.
If the interval between start and end time is too long, there will be too many indices in the request, which leads to the TooLongFrame exception: TransportError(400, u'too_long_frame_exception', u'An HTTP line is larger than 4096 bytes.') For example: task 6793089. To reduce the number of indices, instead of the YYYY-MM-DD* suffix we can use YYYY-MM* (or even YYYY*, if necessary; but that is not implemented yet).
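A sketch of building month-level index patterns, also including the previous month as required by the fix above (the index prefix and function name are illustrative):

```python
from datetime import date

def month_suffixes(start, end, prefix='jobs-'):
    """Index patterns 'jobs-YYYY-MM*' covering [start, end].

    The month before `start` is included so that jobs finished just
    before the task 'start_time' are not missed.
    """
    year, month = start.year, start.month
    # step one month back to catch the previous index
    if month == 1:
        year, month = year - 1, 12
    else:
        month -= 1
    suffixes = []
    while (year, month) <= (end.year, end.month):
        suffixes.append('%s%04d-%02d*' % (prefix, year, month))
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1
    return suffixes

# month_suffixes(date(2018, 12, 15), date(2019, 2, 1))
# -> ['jobs-2018-11*', 'jobs-2018-12*', 'jobs-2019-01*', 'jobs-2019-02*']
```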
Using a fixed term both in error messages and in the config example should make it easier for the user to find a problem with query parameters when it occurs.
Look for the '%(' construction instead of a non-doubled '%'. This construction is
mandatory in parametric queries and less likely to occur in non-parametric
ones.
Clarify that the warning should only be taken into account when a parametric query is used.
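A minimal sketch of such a check (the function name and the message text are assumed):

```python
import sys

def check_percent_signs(query):
    """Warn about a bare '%' that would break a parametric query.

    '%(' is taken as the sign of a parametric query; in such a query
    any remaining single '%' must be escaped as '%%'.
    """
    if '%' in query.replace('%%', '').replace('%(', ''):
        sys.stderr.write("WARNING: single '%' found; if the query is "
                         "parametric, escape it as '%%'.\n")
```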
Stage 009: Improve query saving procedure.
Stage 025 update (total HS06)
Conflicts:
	Utils/Dataflow/091_datasetsRucio/datasets_processing.py
	Utils/Dataflow/pyDKB/common/custom_readline.py
Force-pushed from 132b720 to fb7e780.
NOTE: the only thing done here is `make clean; make all`. To make the documentation fully consistent with the code, we should go through the docs thoroughly; that is to be done during the pyDKB documentation revision.
Force-pushed from ae41ed3 to 9e9fc54.
As there were quite a lot of changes, both in
The bug was introduced in the following commit: ec0725f "DF: remove error handling from stages (016_task2es)."
The bug appeared due to the merge of two long-living branches:
* the submodule was moved in the `pyDKB` branch (5e3003e), and
* used for the second time in `master` (a0226ff).
It tells us, by the way, that a better idea would be to import this module at the beginning, since it is used more than once...
Force-pushed from 6685fbd to f415e34.
No conflict here, as 009 does not utilize
Force-pushed from f415e34 to abd340a.
I tested the data4es (with stages 025 and 069 disabled) and some separate stages; they seem to be working. I also skimmed through the changes and failed to find anything of interest.
Thank you!
Just a note to clarify why disabling the stages was fine for testing:
* Stage 069 shouldn't be affected or reveal any issues (as long as 019 produces the same output as before).
* Stage 025 *had* been affected, in fact (as it was added in `master` and was not adapted to the new version of `pyDKB` along with the others), but it was fixed in the last commits.
anastasiakaida left a comment:
Didn't find anything strange, so here is my approval.
@anastasiakaida, @Evildoor, thank you both for checking this PR. Will merge it now.
Background:
This PR is aimed at bringing documentation stuff to the `pyDKB` branch, to make it possible to revise and improve documentation directly in the latest version of the library. Documentation was requested by @Evildoor for `pyDKB` unit tests development.
Unit tests were requested by @mgolosova to simplify further development of the library (as development without proper testing may sometimes lead to situations when things are broken and nobody knows about it for a long time; early fixes are much easier than late ones).
ToDo:
* `pyDKB.common.custom_readline`;
* `--config` parameter in pyDKB and 009.

Can not be merged to `pyDKB` unless `pyDKB` itself is fixed (see PR #170). After the merge, the `pyDKB` branch should be merged here again.

Everything that could be fixed was fixed.