Distributed API #885

Merged
jesuslinares merged 127 commits into 3.7 from dev-3.2-distributed-api on Sep 20, 2018

Conversation

@mgmacias95
Contributor

Hello team,

This PR fixes #670 and adds framework support for wazuh/wazuh-api#126.

Best regards,
Marta

Marta Gómez Macías and others added 30 commits May 28, 2018 12:23
The request to get agents in the cluster has been replaced with the distributed api request so it's no longer necessary
@mgmacias95 mgmacias95 force-pushed the dev-3.2-distributed-api branch from 3e2263a to b998caf Compare September 7, 2018 11:03
@jesuslinares jesuslinares removed the request for review from Lifka September 17, 2018 08:32
@mgmacias95
Contributor Author

mgmacias95 commented Sep 17, 2018

Testing with 1K agents and 3 nodes. Found errors:

  • Error when filtering by node name in cluster control:

    # /var/ossec/bin/cluster_control -a -fn wazuh-cluster-client1 -d
    ERROR: 'list' object has no attribute 'split'
    Traceback (most recent call last):
    File "/var/ossec/bin/cluster_control", line 356, in <module>
        print_agents(args.filter_status, args.filter_node, is_master)
    File "/var/ossec/bin/cluster_control", line 250, in print_agents
        agents = get_agents(filter_status, filter_node, is_master)
    File "/var/ossec/bin/../framework/wazuh/cluster/control.py", line 102, in get_agents
        select={'fields':['id','ip','name','status','node_name']})
    File "/var/ossec/bin/../framework/wazuh/agent.py", line 802, in get_agents_overview
        data = db_query.run()
    File "/var/ossec/bin/../framework/wazuh/utils.py", line 810, in run
        self._add_filters_to_query()
    File "/var/ossec/bin/../framework/wazuh/utils.py", line 756, in _add_filters_to_query
        self._parse_filters()
    File "/var/ossec/bin/../framework/wazuh/utils.py", line 727, in _parse_filters
        self._parse_legacy_filters()
    File "/var/ossec/bin/../framework/wazuh/agent.py", line 137, in _parse_legacy_filters
        WazuhDBQuery._parse_legacy_filters(self)
    File "/var/ossec/bin/../framework/wazuh/utils.py", line 719, in _parse_legacy_filters
        for name, value in self.legacy_filters.items() for subvalue in value.split(',') if not self._pass_filter(subvalue)]
    AttributeError: 'list' object has no attribute 'split'
  • Error when registering an agent with authd enabled:

    # curl -u foo:bar "localhost:55000/agents/test_agent1?pretty" -XPUT
    {
    "error": 1014,
    "message": "Error communicating with socket: /var/ossec/queue/ossec/auth"
    }
  • The syscollector API call times out, and every API call made after it times out as well:

    # curl -u foo:bar "localhost:55000/experimental/syscollector/packages?pretty"
    {
    "error": 3009,
    "message": "Error executing request to internal socket: Timeout expired while waiting for a response"
    }
  • After sending a request to restart all agents, only one of the workers receives it:

    2018-09-17 11:25:36,744 DEBUG   : [Worker] [Request-R  ]: 'dapi'.
    2018-09-17 11:25:36,744 DEBUG   : [DistributedAPI] Receiving request: None 3829636576 {"function": "PUT/agents/restart", "ossec_path": "/var/ossec", "from_cluster": true, "arguments": {"restart_all": "True"}}
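
The first traceback above comes down to `value.split(',')` being called on a filter value that is already a list. A minimal sketch of a normalization step that accepts both forms, assuming Python 3; `split_filter_value` is a hypothetical helper, not part of the Wazuh framework, and the actual fix may look different:

```python
def split_filter_value(value):
    """Return filter values as a flat list of strings.

    Hypothetical helper: accepts either a comma-separated string
    or a list/tuple of such strings.
    """
    if isinstance(value, str):
        # 'a,b' -> ['a', 'b']; empty fragments are dropped.
        return [v for v in value.split(',') if v]
    if isinstance(value, (list, tuple)):
        # Flatten, so ['a,b', 'c'] and 'a,b,c' give the same result.
        return [v for item in value for v in split_filter_value(item)]
    raise TypeError('unsupported filter value: {!r}'.format(value))
```

With something like this, a call such as `split_filter_value(['wazuh-cluster-client1'])` would no longer raise `AttributeError`.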
    

@mgmacias95
Contributor Author

Yesterday's errors summary:

  • The error using cluster_control has been fixed in commits da5b655 f35904e 2a93cc6 d808f6e.
  • The error using authd has been fixed in Fix the communication between manage_agents and ossec-authd #1272.
  • The error where a restart request was forwarded to only one manager has been fixed in commits 649b44b 50e39ab.
  • The cause of the error that blocked the API after an experimental syscollector request has been found. Execution blocks at line 261:
    def run(self):
        while not self.stopper.is_set() and self.running:
            name, id, request = self.request_queue.get(block=True).split(' ', 2)
            result = distribute_function(json.loads(request), from_master=True)
            try:
                self.send_string(result, id, name)
            except Exception as e:
                self.send_request(command='err-is', data=str(e), id=id, name=name)

    More precisely, it blocks when it executes the function locally. Since it has to retrieve information from 1000 different agents, the function takes a long time to complete, which keeps the API request queue blocked. This error has not been fixed yet.
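
The blocking described above is what happens with any single consumer that runs each request to completion before reading the next one. A small self-contained sketch of the effect, assuming Python 3; the names `consumer` and `run_demo` and the sleep standing in for the slow call are illustrative only and are not taken from `dapi.py`:

```python
import queue
import threading
import time

def consumer(requests, finished, start):
    # Single consumer: one request runs to completion before the next is read.
    while True:
        item = requests.get(block=True)
        if item is None:  # sentinel used only to stop this demo
            break
        name, seconds = item
        time.sleep(seconds)  # stands in for a slow framework call
        finished[name] = time.monotonic() - start

def run_demo(slow_seconds=0.2):
    requests, finished = queue.Queue(), {}
    worker = threading.Thread(target=consumer,
                              args=(requests, finished, time.monotonic()))
    worker.start()
    requests.put(('slow', slow_seconds))
    requests.put(('fast', 0))
    requests.put(None)
    worker.join()
    return finished  # seconds from start until each request completed
```

In this sketch the 'fast' request cannot complete before the 'slow' one has finished, even though it needs no time itself.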

Comment thread framework/wazuh/cluster/dapi/dapi.py
Comment thread framework/wazuh/cluster/dapi/dapi.py Outdated
Comment thread framework/wazuh/cluster/dapi/dapi.py Outdated
old_limit = common.database_limit if 'limit' not in input_json['arguments'] else input_json['arguments']['limit']
if old_offset > 0:
    input_json['arguments']['offset'] = 0
    input_json['arguments']['limit'] = None
Contributor

Why do you need to get all the information? You can forward the current limit and offset, then in the merge function, you will need to do it again, right?
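
One consideration for this question: if results from several nodes are merged and then paginated, forwarding the caller's offset unchanged can skip items, because each node drops its own first `offset` rows. The sketch below uses made-up data and assumes results are sorted and merged by a single key; `merge_page` is a hypothetical helper, not the merge function in this PR. It also suggests that each node only needs to return its first `offset + limit` rows rather than everything:

```python
import heapq

def merge_page(node_results, offset, limit):
    """Merge per-node sorted lists and return one global page."""
    merged = list(heapq.merge(*node_results))
    return merged[offset:offset + limit]

node_a = [1, 4, 7, 10, 13, 16]
node_b = [2, 3, 8, 9, 14, 15]
offset, limit = 2, 2

# Reference: every node returns everything, pagination happens after merging.
expected = merge_page([node_a, node_b], offset, limit)

# Forwarding offset/limit as-is: each node drops its own first two rows.
naive = merge_page([node_a[offset:offset + limit],
                    node_b[offset:offset + limit]], 0, limit)

# Asking each node for its first offset + limit rows is enough.
bounded = merge_page([node_a[:offset + limit],
                      node_b[:offset + limit]], offset, limit)
```

Here `expected` and `bounded` agree, while `naive` returns a different page.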

Comment thread framework/wazuh/cluster/dapi/dapi.py Outdated
else:
    if node_name == 'unknown' or node_name == '':
        raise WazuhException(3017)
    command = 'dapi_forward {}'.format(node_name) if node_name != master_name else 'dapi'
Contributor

Duplicated code? Is it in 'forward_list'?

Comment thread framework/wazuh/cluster/dapi/dapi.py Outdated

def get_solver_node(input_json, master_name):
    """
    Gets the node that can solve a request.
Contributor

Improve description.

import wazuh.ciscat as ciscat
import wazuh.active_response as active_response


Contributor

Comment: Describe every type.

response = self.get_worker_info(c_name)['handler'].execute(command, data)
yield c_name, response

class AbstractWorker(Handler):
Contributor

It should be "AbstractClient". Then, use Worker for Worker nodes, and Client for InternalSocket.

return self.my_connected


class WorkerHandler(AbstractWorker):
Contributor

Change AbstractWorker

time.sleep(2)


class InternalSocketWorker(communication.AbstractWorker):
Contributor

It should be AbstractClient

Comment thread framework/wazuh/cluster/dapi/dapi.py Outdated
def run(self):
    while not self.stopper.is_set() and self.running:
        name, id, request = self.request_queue.get(block=True).split(' ', 2)
        result = distribute_function(json.loads(request), from_master=True)
Contributor

The queue will be blocked until this function ends. What happens if it never ends (or takes a long time)?

  • Adding a timeout would require changing every framework function to add an "exit condition" in every part of the code that could block (I/O wait, DB, etc.).
  • The queue could create a thread for every request:
    • After a time, continue with the next request (returning a timeout error).
    • Log a warning.
    • Problem: the previous thread will not be killed. What will happen when it ends?
    • Set a limit of 20 "zombie threads" -> Warning and block queue? -> Next requests: "queue is blocked".
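
The thread-per-request option listed above could be sketched roughly as follows, assuming Python 3. `TimedExecutor`, its error dictionaries and the default of 20 are illustrative names and values only; nothing here is taken from the Wazuh code base, and the thread that times out is deliberately left running because Python offers no way to kill it:

```python
import threading

MAX_ZOMBIES = 20  # limit suggested in the comment above

class TimedExecutor:
    def __init__(self, timeout, max_zombies=MAX_ZOMBIES):
        self.timeout = timeout
        self.max_zombies = max_zombies
        self.zombies = []  # threads that timed out and may still be running

    def execute(self, func, *args):
        # Forget timed-out threads that have finished in the meantime.
        self.zombies = [t for t in self.zombies if t.is_alive()]
        if len(self.zombies) >= self.max_zombies:
            return {'error': 'queue is blocked'}

        result = {}

        def target():
            try:
                result['data'] = func(*args)
            except Exception as e:
                result['error'] = str(e)

        thread = threading.Thread(target=target, daemon=True)
        thread.start()
        thread.join(self.timeout)
        if thread.is_alive():
            # Still running: keep track of it and move on to the next request.
            self.zombies.append(thread)
            return {'error': 'timeout'}
        return result
```

When a late thread eventually ends, its result is simply discarded in this sketch, since the caller has already received the timeout error.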

Contributor Author

The same API call (PUT/agents/restart):

  • With threads: 1.0136449337 seconds
  • Without threads: 0.634363174438 seconds

Is this really worth it? I don't think so.
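
How the two figures above were measured is not shown in this thread. As a rough sketch of one way to compare a direct call with a call wrapped in a thread, assuming Python 3; `time_direct` and `time_threaded` are hypothetical helpers, and they only measure thread start/join overhead for a given callable, not a full API round trip:

```python
import threading
import time

def time_direct(func, repeats=100):
    """Average seconds per call when func is called directly."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats

def time_threaded(func, repeats=100):
    """Average seconds per call when each call runs in its own thread."""
    start = time.perf_counter()
    for _ in range(repeats):
        thread = threading.Thread(target=func)
        thread.start()
        thread.join()
    return (time.perf_counter() - start) / repeats
```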

Marta Gómez Macías added 4 commits September 20, 2018 08:59
…ndler

Worker is a word reserved to refer to a type of cluster node. AbstractWorker and WorkerHandler classes represented a client in the cluster protocol, not a worker node.
@jesuslinares jesuslinares merged commit 1d6e675 into 3.7 Sep 20, 2018
@jesuslinares jesuslinares deleted the dev-3.2-distributed-api branch September 20, 2018 18:10