Skip to content

Commit 7f77d29

Browse files
committed
Security disclosure.
1 parent a1f9c11 commit 7f77d29

File tree

3 files changed

+38
-0
lines changed

3 files changed

+38
-0
lines changed

doc/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,5 +32,6 @@ Contents
3232
Julia Package <julia>
3333
C Package <c>
3434
C++ Interface <c++>
35+
security
3536
contrib/index
3637
changes/index

doc/security.rst

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
###################
2+
Security disclosure
3+
###################
4+
5+
********************
6+
Use of Python pickle
7+
********************
8+
9+
We use ``pickle`` and ``cloudpickle`` in several places, including a convenient helper function for the ``broadcast`` collective operation to share a Python object. The method is not used internally during training but is here to assist with implementing custom metrics. Also, a distributed interface like PySpark might use pickle to load Python objects, like the callback functions. The security reports state the pickle is unsafe, which is true.
10+
11+
However, from our perspective, if someone else can control your network environment tamper with the data sent between XGBoost workers or the Spark executors, XGBoost should not be the place to provide security mitigation. As for all Python pickles in general, read the warning in the `pickle document <https://docs.python.org/3/library/pickle.html>`__.
12+
13+
Suggestion:
14+
15+
* Don’t train distributed XGBoost models in a public network, or at least don’t do it without a VPN.
16+
17+
***********************************************************
18+
The lack of authentication in the collective implementation
19+
***********************************************************
20+
21+
XGBoost uses TCP sockets for communication between workers during distributed model training. XGBoost is a numeric computation library; the collective module provides high-performance numeric operations (allreduce, allgather, etc.). We will NOT add TLS authentication or encryption into the collective implementation.
22+
23+
Suggestion:
24+
25+
* Don’t train distributed XGBoost models in a public network, or at least don’t do it without a VPN.
26+
27+
***************************************************
28+
The lack of sanitizing for inputs, including models
29+
***************************************************
30+
31+
If someone can manipulate XGBoost inputs, whether with an incorrect model or an altered numpy array, XGBoost will crash due to a memory read error (out-of-bounds access). The reports we received describe manipulating the JSON files to mislead XGBoost into reading overbound values or using conflicting tree indices. We acknowledge that we can add stronger sanitization to the JSON parser when loading from a file. But it’s impractical to check potential issue in the model file at the moment. We rely on mitigation provided by Modern OSs.
32+
33+
Suggestions:
34+
35+
* For most users, this should not cause a significant issue. Your Python program might crash when loading a manipulated JSON model file.
36+
* For hosting XGBoost in a public network environment, follow the usual security practice for loading unknown data.

src/gbm/gbtree_model.cc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,7 @@ void GBTreeModel::LoadModel(Json const& in) {
9797

9898
common::ParallelFor(param.num_trees, ctx_->Threads(), [&](auto t) {
9999
auto tree_id = get<Integer const>(trees_json[t]["id"]);
100+
CHECK_EQ(tree_id, t);
100101
trees.at(tree_id).reset(new RegTree{});
101102
trees[tree_id]->LoadModel(trees_json[t]);
102103
});

0 commit comments

Comments
 (0)