MultisetEncoder.py
import numpy as np
from typing import Any


class MultisetEncoder:
    """
    Base class for objects which encode length-n sets of k-vectors.

    Encoders are not necessarily embedders, since encoders do not need to be
    injective. All embedders are encoders, however.

    Strictly speaking we encode "multisets" rather than "sets", since the
    containers can hold repeated objects and retain knowledge of the number
    of repeats. However, we are sometimes guilty of abbreviating "multiset"
    to just "set".

    The set to be encoded should be input as a 2D numpy array with shape
    (n, k). The order of the vectors within the numpy array can be
    arbitrary. E.g. to encode a multiset containing the 2-vectors (2,2),
    (4,5) and (1,2) one could call

        encode(np.asarray([[2, 2], [4, 5], [1, 2]]))

    or

        encode(np.asarray([[4, 5], [2, 2], [1, 2]]))

    and both should have the same output -- at least up to numerical
    precision. This leeway (permission to have small deviations on account
    of floating-point precision, rather than demanding bit-for-bit identical
    embeddings) is granted to implementations in order to allow them to be
    faster (sometimes) than would be the case if they were all required to
    canonicalise their input sets. Someone wanting bit-for-bit identical
    output under permutations of input vectors could easily sort their
    vectors (in any way) prior to using any encoder.

    All encoders return a type containing:

        (1) a one-dimensional array of real floats,
        (2) the size (n, k) of the encoded data, and
        (3) some metadata about the encoding method, or None.

    In principle, a given encoder can embed sets of different sizes n and/or
    k. However, some encoders might wish to restrict themselves to certain
    fixed n or k at initialisation (e.g. if an embedder were to need a
    significant amount of n- and k-dependent set-up cost that it wished to
    incur only once). Thus all encoders are expected to be able to tell
    callers what sizes of input they can and cannot encode, and how long the
    resulting encodings will be. Derived classes do this by implementing the
    method

        size_from_n_k(n: int, k: int) -> int

    for which:

        * a return value >= 0 is the number of reals in an encoding if sets
          of size (n, k) are encodable, and
        * a return value of -1 indicates that encoding for that n and k is
          impossible.
    """

    def encode(self, data: np.ndarray, debug=False) -> tuple[np.ndarray, tuple[int, int], Any]:
        raise NotImplementedError()

    def size_from_array(self, data: np.ndarray) -> int:
        """
        Return the number of reals that the encoding would contain if the
        set represented by "data" were to be encoded. -1 is returned if data
        of the supplied shape is not encodable by this encoder.
        """
        n, k = data.shape
        return self.size_from_n_k(n, k)

    def size_from_n_k(self, n: int, k: int) -> int:
        """
        Return the number of reals that the encoding would contain if a set
        containing n "k-vectors" were to be encoded. Derived classes
        implementing this method should return -1 if they are not able to
        encode sets for that n and k.
        """
        raise NotImplementedError()
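To illustrate the contract described in the docstring, here is a minimal sketch of a concrete encoder. The class name `PowerSumEncoder` is hypothetical (it does not appear in the original file); it encodes a multiset via per-coordinate power sums, which are permutation-invariant by construction, so reordering the input vectors leaves the output unchanged. It is written standalone (inheritance from `MultisetEncoder` omitted) so the sketch is self-contained, but it follows the same `encode` / `size_from_n_k` interface.

```python
import numpy as np
from typing import Any


class PowerSumEncoder:
    """Hypothetical sketch of a MultisetEncoder implementation.

    Encodes a multiset of n k-vectors as the first n power sums of each
    of the k coordinates: p_m[j] = sum_i data[i, j] ** m for m = 1..n.
    Sums over the vectors are order-independent, so the encoding is the
    same for any permutation of the input rows.
    """

    def encode(self, data: np.ndarray, debug=False) -> tuple[np.ndarray, tuple[int, int], Any]:
        n, k = data.shape
        # Row m-1 holds the m-th power sums of all k coordinates; shape (n, k).
        powers = np.stack(
            [np.sum(data.astype(float) ** m, axis=0) for m in range(1, n + 1)]
        )
        # Flatten to the required 1D array of reals; no extra metadata.
        return powers.ravel(), (n, k), None

    def size_from_n_k(self, n: int, k: int) -> int:
        # One real per (moment, coordinate) pair; every (n, k) is encodable.
        return n * k
```

For example, encoding `[[2, 2], [4, 5], [1, 2]]` and `[[4, 5], [2, 2], [1, 2]]` yields the same length-6 vector, consistent with `size_from_n_k(3, 2) == 6`.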