Distributed API by mgmacias95 · Pull Request #885 · wazuh/wazuh

mgmacias95 · 2018-07-02T19:41:55Z

Hello team,

This PR fixes #670 and adds framework support for wazuh/wazuh-api#126.

Best regards,
Marta

mgmacias95 · 2018-09-17T11:27:36Z

Testing with 1K agents and 3 nodes. Found errors:

Error when filtering by node name in cluster control:

# /var/ossec/bin/cluster_control -a -fn wazuh-cluster-client1 -d
ERROR: 'list' object has no attribute 'split'
Traceback (most recent call last):
  File "/var/ossec/bin/cluster_control", line 356, in <module>
    print_agents(args.filter_status, args.filter_node, is_master)
  File "/var/ossec/bin/cluster_control", line 250, in print_agents
    agents = get_agents(filter_status, filter_node, is_master)
  File "/var/ossec/bin/../framework/wazuh/cluster/control.py", line 102, in get_agents
    select={'fields':['id','ip','name','status','node_name']})
  File "/var/ossec/bin/../framework/wazuh/agent.py", line 802, in get_agents_overview
    data = db_query.run()
  File "/var/ossec/bin/../framework/wazuh/utils.py", line 810, in run
    self._add_filters_to_query()
  File "/var/ossec/bin/../framework/wazuh/utils.py", line 756, in _add_filters_to_query
    self._parse_filters()
  File "/var/ossec/bin/../framework/wazuh/utils.py", line 727, in _parse_filters
    self._parse_legacy_filters()
  File "/var/ossec/bin/../framework/wazuh/agent.py", line 137, in _parse_legacy_filters
    WazuhDBQuery._parse_legacy_filters(self)
  File "/var/ossec/bin/../framework/wazuh/utils.py", line 719, in _parse_legacy_filters
    for name, value in self.legacy_filters.items() for subvalue in value.split(',') if not self._pass_filter(subvalue)]
AttributeError: 'list' object has no attribute 'split'
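
The traceback shows `value.split(',')` being called on a filter value that is already a list. The following is a minimal Python 3 sketch of that failure and of one possible guard. `split_filter_values` and the sample `legacy_filters` dictionary are hypothetical names used only for illustration; this is not necessarily how the commits in this PR fixed it.

```python
def split_filter_values(value):
    """Return a flat list of sub-values from a str or a list of str.

    Hypothetical helper: accepts 'a,b' as well as ['a', 'b,c'].
    """
    if isinstance(value, str):
        return value.split(',')
    # Assumption: any other input is an iterable of comma-separated strings.
    return [sub for item in value for sub in item.split(',')]


# Sample data (assumed): one filter arrives as a list, the other as a string.
legacy_filters = {'node_name': ['wazuh-cluster-client1'], 'status': 'active,pending'}

# Reproduce the failure: calling split() directly on a list raises AttributeError.
try:
    legacy_filters['node_name'].split(',')
except AttributeError as e:
    error_message = str(e)

# With the guard, both input shapes are normalized to a list of sub-values.
parsed = {name: split_filter_values(value) for name, value in legacy_filters.items()}
```

Normalizing the value before splitting keeps the comprehension in `_parse_legacy_filters` unchanged for string input while tolerating a filter that was passed more than once.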

Error when registering an agent with authd enabled:

# curl -u foo:bar "localhost:55000/agents/test_agent1?pretty" -XPUT
{
"error": 1014,
"message": "Error communicating with socket: /var/ossec/queue/ossec/auth"
}

The syscollector API call times out, and every API call made after it times out as well:

# curl -u foo:bar "localhost:55000/experimental/syscollector/packages?pretty"
{
"error": 3009,
"message": "Error executing request to internal socket: Timeout expired while waiting for a response"
}

After doing a request to restart all agents, only one of the workers receives the request:

2018-09-17 11:25:36,744 DEBUG   : [Worker] [Request-R  ]: 'dapi'.
2018-09-17 11:25:36,744 DEBUG   : [DistributedAPI] Receiving request: None 3829636576 {"function": "PUT/agents/restart", "ossec_path": "/var/ossec", "from_cluster": true, "arguments": {"restart_all": "True"}}

mgmacias95 · 2018-09-18T13:28:25Z

Yesterday's errors summary:

The error using cluster_control has been fixed in commits da5b655 f35904e 2a93cc6 d808f6e.
The error using authd has been fixed in "Fix the communication between manage_agents and ossec-authd" (#1272).
The error where a restart request was forwarded to only one manager has been fixed in commits 649b44b 50e39ab.

The error that blocked the API after doing an experimental syscollector request has been found. It gets blocked in line 261:

wazuh/framework/wazuh/cluster/dapi/dapi.py

Lines 258 to 265 in d808f6e

    def run(self):
        while not self.stopper.is_set() and self.running:
            name, id, request = self.request_queue.get(block=True).split(' ', 2)
            result = distribute_function(json.loads(request), from_master=True)
            try:
                self.send_string(result, id, name)
            except Exception as e:
                self.send_request(command='err-is', data=str(e), id=id, name=name)

More precisely, it gets blocked when it executes the function locally. Since the function has to retrieve information from 1000 different agents, it takes a long time to complete, which keeps the API request queue blocked. This error has not been fixed yet.
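
The blocking behaviour can be reproduced outside Wazuh with a single consumer thread that processes a queue sequentially. This is only an illustrative sketch: `slow_function`, `consumer` and the `None` sentinel are hypothetical and stand in for the framework call and the `run()` loop quoted above.

```python
import queue
import threading
import time


def slow_function(seconds):
    """Stand-in for a framework call that takes a long time (hypothetical)."""
    time.sleep(seconds)
    return "done after {}s".format(seconds)


def consumer(request_queue, finished):
    """Process requests one by one, like the run() loop quoted above."""
    while True:
        request = request_queue.get(block=True)
        if request is None:  # sentinel used only by this sketch to stop the loop
            break
        name, seconds = request
        slow_function(seconds)
        finished.append((name, time.monotonic()))


request_queue = queue.Queue()
finished = []
start = time.monotonic()
request_queue.put(("slow", 0.5))  # long-running request queued first
request_queue.put(("fast", 0.0))  # cheap request queued right behind it
request_queue.put(None)

worker = threading.Thread(target=consumer, args=(request_queue, finished))
worker.start()
worker.join()

# The cheap request cannot be answered until the slow one has finished.
fast_latency = dict(finished)["fast"] - start
```

Because only one request is taken from the queue at a time, the latency of the cheap request includes the full duration of the slow one, which matches the symptom of later API calls timing out.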

jesuslinares · 2018-09-19T08:29:32Z

+        old_limit = common.database_limit if 'limit' not in input_json['arguments'] else input_json['arguments']['limit']
+        if old_offset > 0:
+            input_json['arguments']['offset'] = 0
+            input_json['arguments']['limit'] = None


Why do you need to get all the information? You can forward the current limit and offset, then in the merge function, you will need to do it again, right?
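
To illustrate the trade-off behind this question, here is a small sketch with made-up data (the node names, `node_query` and `merge_page` are hypothetical, not functions of the framework). It shows that forwarding the user's offset unchanged to every node can lose items from the requested page, and that asking each node for its first `offset + limit` items is sufficient, assuming results are sorted by the same key on every node.

```python
import heapq


def node_query(items, offset, limit):
    """What a single node would return for a sorted, paginated query (hypothetical)."""
    end = None if limit is None else offset + limit
    return sorted(items)[offset:end]


def merge_page(per_node_results, offset, limit):
    """Merge sorted per-node results, then apply the user's offset and limit once."""
    merged = list(heapq.merge(*per_node_results))
    return merged[offset:offset + limit]


# Made-up data: each node holds part of the full, globally sorted result.
nodes = {"master": [1, 4, 7, 10], "worker1": [2, 5, 8], "worker2": [3, 6, 9]}
offset, limit = 3, 3
expected = sorted(x for items in nodes.values() for x in items)[offset:offset + limit]

# Forwarding the user's offset as-is drops items that belong to the page.
naive = merge_page([node_query(i, offset, limit) for i in nodes.values()], 0, limit)

# Asking every node for its first offset + limit items is enough to build the page.
bounded = merge_page([node_query(i, 0, offset + limit) for i in nodes.values()], offset, limit)
```

Setting `offset` to 0 and `limit` to `None`, as in the diff above, is always correct but transfers every item; the bounded variant is an assumption about what could be forwarded instead, and it still requires applying the offset and limit again in the merge step.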

jesuslinares · 2018-09-19T08:29:59Z

+    else:
+        if node_name == 'unknown' or node_name == '':
+            raise WazuhException(3017)
+        command = 'dapi_forward {}'.format(node_name) if node_name != master_name else 'dapi'


Duplicated code? Is it in 'forward_list'?

jesuslinares · 2018-09-19T08:30:19Z

+
+def get_solver_node(input_json, master_name):
+    """
+    Gets the node that can solve a request.


Improve description.

jesuslinares · 2018-09-19T08:36:11Z

+import wazuh.ciscat as ciscat
+import wazuh.active_response as active_response
+
+


Comment: Describe every type.

jesuslinares · 2018-09-19T08:39:20Z

-            response = self.get_worker_info(c_name)['handler'].execute(command, data)
-            yield c_name, response
-
+class AbstractWorker(Handler):


It should be "AbstractClient". Then, use Worker for Worker nodes, and Client for InternalSocket.

jesuslinares · 2018-09-19T08:39:50Z

        return self.my_connected


+class WorkerHandler(AbstractWorker):


Change AbstractWorker

jesuslinares · 2018-09-19T08:40:14Z

+            time.sleep(2)
+
+
+class InternalSocketWorker(communication.AbstractWorker):


It should be AbstractClient

jesuslinares · 2018-09-19T09:03:16Z

+    def run(self):
+        while not self.stopper.is_set() and self.running:
+            name, id, request = self.request_queue.get(block=True).split(' ', 2)
+            result = distribute_function(json.loads(request), from_master=True)


The queue will be blocked until this function ends. What happens if it never ends, or takes a long time?

Adding a timeout would require changing every framework function to add an "exit condition" in every part of the code that could block (I/O wait, DB, etc.).

The queue could create a thread for every request:

After a given time, continue with the next request (returning a timeout error).

Log a warning.

Problem: the previous thread will not be killed. What will happen when it ends?

Set a limit of 20 "zombie threads" -> Warning and block queue? -> Next requests: "queue is blocked".

The same API call (PUT/agents/restart):

With threads: 1.0136449337 seconds

Without threads: 0.634363174438 seconds

Is this really worth it? I don't think so.
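
For reference, the thread-per-request idea discussed in this comment could look roughly like the sketch below. It is not code from this PR: `TimedExecutor`, `slow` and the error strings are hypothetical, and the timings quoted above were not measured with it.

```python
import logging
import threading
import time

MAX_ZOMBIES = 20  # limit suggested in the comment above


class TimedExecutor:
    """Run each request in its own thread and stop waiting after a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.zombies = []  # threads that timed out and may still be running

    def execute(self, function, *args):
        # Forget timed-out threads that have finished in the meantime.
        self.zombies = [t for t in self.zombies if t.is_alive()]
        if len(self.zombies) >= MAX_ZOMBIES:
            logging.warning("Too many unfinished requests")
            return "error: queue is blocked"

        result = {}

        def target():
            try:
                result["value"] = function(*args)
            except Exception as e:
                result["value"] = "error: {}".format(e)

        thread = threading.Thread(target=target, daemon=True)
        thread.start()
        thread.join(self.timeout)
        if thread.is_alive():
            # The thread cannot be killed; it keeps running and its result is discarded.
            logging.warning("Request timed out after %s seconds", self.timeout)
            self.zombies.append(thread)
            return "error: timeout"
        return result["value"]


def slow(seconds):
    """Stand-in for a long-running framework function (hypothetical)."""
    time.sleep(seconds)
    return "ok"


executor = TimedExecutor(timeout=0.1)
first = executor.execute(slow, 0.5)   # exceeds the timeout
second = executor.execute(slow, 0.0)  # served without waiting for the first one
```

The sketch makes the drawback raised above visible: the timed-out thread stays alive in `zombies` and keeps consuming resources until it finishes on its own, and every request pays the cost of creating and joining a thread.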

Marta Gómez Macías and others added 30 commits May 28, 2018 12:23

Refactor process_request function in client's internal socket

8e2d75d

Refactor internal_socket code

079fd7b

Add dapi request to master's internal socket

2991973

Add dapi request to client's internal socket

16f4867

Add argument support in distributed api

fa9b28a

Add support for type of API calls local_master and local_any

3681162

Refactor GET/agent/:agent_id API request

3de99ce

Fix bug in GET/manager/info API call

d68c778

Fix import error in cluster_control

51f7a53

Add support for master_distributed requests from master

1ffc7a1

Refactor InternalSocket server code to use communication.py Server class

dee7af7

Fix typo in create_socket parameters

ae7c1b1

Refactor internal socket client and improve dapi protocol

57dafe7

Merge branch '3.2' into dev-3.2-distributed-api

a387a5a

Cluster: send_string request (#689)

5ce1f6c

Fix bug in cluster_control -i

3b438f3

Add filter by ID to GET/agents function

0c3d93c

Hide INFO messages in cluster_control in non debug mode

fa0e0a6

Distribute request with multiple agent_ids defined

2953032

Fix bug when non existent ids are included in the API request

868fdf0

Distribute request that don't have agent_id parameter

02f1f24

Parallelize API call distribution using a pool of threads

0e49b5e

Remove unused request

4e7aa55

The request to get agents in the cluster has been replaced with the distributed api request so it's no longer necessary

Add tags in abstract classes to improve log readability

e446376

Fix bug executing forwarding requests from client nodes

02c6aec

Add a daemon thread in master node to receive and process API requests

500c343

Use send_string function to propagate distributed API requests results

e63efa3

Define distributed API types for each API call

7a89c7f

Merge branch '3.3' into dev-3.2-distributed-api

39ad00d

Fix typo

583bdc0

Prevent cluster to block when waiting for a message

b998caf

mgmacias95 force-pushed the dev-3.2-distributed-api branch from 3e2263a to b998caf on September 7, 2018 11:03

Marta Gómez Macías added 8 commits September 7, 2018 16:42

Fix bug in syscollector when no results were returned

e1553df

Fix error using limit and offset in syscollector API calls

b408553

Fix bug when receiving an error from cluster

1abf89e

Fix bug when a cluster master received multiple GET/agents/stats/distinct API requests

b7aca47

Increase timeout to 20s to support distributed requests

3b1bdef

Merge branch '3.7' into dev-3.2-distributed-api

3ba0481

Add PUT/active-response/:agent_id API call to distributed API

5d1f5f1

Merge branch '3.7' into dev-3.2-distributed-api

9966c9e

jesuslinares assigned mgmacias95 Sep 17, 2018

jesuslinares removed the request for review from Lifka September 17, 2018 08:32

Add multigroups and stats API calls

4761e5e

Marta Gómez Macías added 7 commits September 17, 2018 15:31

Fix bug in agent_control with -fn parameter

da5b655

Fix bug when the same filter was used multiple times in legacy filters

f35904e

Fix error when there are >500 agents in the cluster

649b44b

API calls weren't correctly forwarded to nodes that have agents with IDs higher than 500.

Prevent distributed api from sending request to "unknown" node

fb34901

Fix bug in cluster_control to filter using multiple statuses

2a93cc6

Fix bug when a WazuhException was raised in a distributed_master API call

50e39ab

Fix bug when requesting agents from worker nodes

d808f6e

jesuslinares suggested changes Sep 19, 2018

Marta Gómez Macías added 4 commits September 20, 2018 08:59

Refactor forward_request function in dapi

1d4ee41

Improve comments

0b9af22

Rename AbstractWorker -> AbstractClient and WorkerHandler -> ClientHandler

2ce4f3b

Worker is a word reserved to refer to a type of cluster node. The AbstractWorker and WorkerHandler classes represented a client in the cluster protocol, not a worker node.

Fix sorting bug when merging responses in distributed API

c30a6b9

jesuslinares approved these changes Sep 20, 2018

jesuslinares merged commit 1d6e675 into 3.7 Sep 20, 2018

jesuslinares deleted the dev-3.2-distributed-api branch September 20, 2018 18:10

jesuslinares merged 127 commits into 3.7 from dev-3.2-distributed-api

5 participants
