Commit d67517a

Add regex index extractors
These three files/scripts are three implementations (python, regex, regex-filtered) of the same thing: given a set of regexes and a bunch of needles, find the first matching regex for each needle and output its index (0-indexed), or `-` if nothing matches. This is the core loop of ua-parser, and it allows validating that regex-filtered matches a more naive version of the same process.

Happily I couldn't find any divergence, although that means I did a fair amount of useless work. Also the python version is really slow compared to even the regex one, so probably don't use that...

`paste` can combine the index extractions of multiple domains, as well as the original needles, into TSV documents if that's of use.

This could also be expanded to multi-index extraction if anyone needs that, and it should be checked more extensively.

Note that only the python version supports stdin input at this point. I couldn't be arsed to do that for the Rust ones, but process substitution ought to work fine anyway: the needles are read on the go, so they don't need to be an actual file.

This may not be in a state fit for performance checking, as the output loop of the Rust versions is the worst (no buffering, no stdout locking).
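For anyone who would rather not shell out to `paste`, the column-combining step can be sketched in a few lines of Python. This is only an illustration: the function name `paste_columns` and the idea of passing one stream per domain followed by the needles are assumptions, not part of this commit. Like the extractors, it reads its inputs lazily, one line at a time.

```python
def paste_columns(*streams, sep="\t"):
    """Yield one TSV row per input line, taking one field from each stream.

    Hypothetical helper mimicking the basic behaviour of `paste`: the
    columns appear in the order the streams are given. Unlike `paste`,
    iteration stops at the shortest stream instead of padding.
    """
    for fields in zip(*streams):
        # Each field is a raw line; drop its trailing newline before joining.
        yield sep.join(field.rstrip("\n") for field in fields)
```

Given one index stream per domain plus the needles file, each yielded row holds the indices followed by the original needle, which is the TSV layout described above.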
1 parent 1a769dd commit d67517a

3 files changed

Lines changed: 110 additions & 0 deletions

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+use clap::Parser;
+use std::io::BufRead;
+
+#[derive(Parser)]
+struct Args {
+    regexes: String,
+    useragents: String,
+}
+
+fn main() {
+    let Args {
+        regexes,
+        useragents,
+    } = Args::parse();
+    let regexes: Vec<_> = std::io::BufReader::new(std::fs::File::open(regexes).unwrap())
+        .lines()
+        .map(|l| regex::Regex::new(&l.unwrap()).unwrap())
+        .collect();
+
+    let mut uas = std::io::BufReader::new(std::fs::File::open(useragents).unwrap());
+    let mut line = String::with_capacity(150);
+    while let Ok(n) = uas.read_line(&mut line) {
+        if n == 0 {
+            break;
+        }
+        let line_ = line.strip_suffix("\n").unwrap_or(&line);
+        let m = regexes
+            .iter()
+            .enumerate()
+            .find(|(_, regex)| regex.is_match(line_));
+        if let Some((i, _)) = m {
+            println!("{i}");
+        } else {
+            println!("-");
+        }
+        line.clear();
+    }
+}
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+use clap::Parser;
+use std::io::BufRead;
+
+#[derive(Parser)]
+struct Args {
+    regexes: String,
+    useragents: String,
+}
+
+fn main() {
+    let Args {
+        regexes,
+        useragents,
+    } = Args::parse();
+    let regexes = regex_filtered::Builder::new()
+        .push_all(
+            std::io::BufReader::new(std::fs::File::open(regexes).unwrap())
+                .lines()
+                .map(Result::unwrap),
+        )
+        .unwrap()
+        .build()
+        .unwrap();
+
+    let mut uas = std::io::BufReader::new(std::fs::File::open(useragents).unwrap());
+    let mut line = String::with_capacity(150);
+    while let Ok(n) = uas.read_line(&mut line) {
+        if n == 0 {
+            break;
+        }
+        let line_ = line.strip_suffix("\n").unwrap_or(&line);
+        let m = regexes.matching(line_).next();
+        if let Some((i, _)) = m {
+            println!("{i}");
+        } else {
+            println!("-");
+        }
+        line.clear();
+    }
+}
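Both Rust programs print one index (or `-`) per needle, so checking regex-filtered against the naive version comes down to comparing two line streams. A minimal sketch of such a check, in Python, might look like the following. The helper name `divergences` is hypothetical and the commit does not say how the comparison was actually done; a plain `diff` of the two outputs would serve equally well.

```python
from itertools import zip_longest


def divergences(expected, actual):
    """Yield (line_number, expected, actual) wherever two index streams differ.

    Hypothetical helper: `expected` and `actual` are iterables of lines, for
    example the outputs of the naive and the regex-filtered extractors. A
    stream that ends early is reported with None in place of its value.
    """
    pairs = zip_longest(expected, actual)
    for number, (exp, act) in enumerate(pairs, start=1):
        # Normalise trailing newlines so only the index itself is compared.
        exp = None if exp is None else exp.rstrip("\n")
        act = None if act is None else act.rstrip("\n")
        if exp != act:
            yield number, exp, act
```

An empty result means the two implementations agreed on every needle. The reported line number is 1-based and points at the offending needle in the input.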

scripts/matchindex

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+#!/usr/bin/env python
+
+import argparse
+import re
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    'regexes',
+    help="regexes to try on the user agents",
+)
+parser.add_argument(
+    'useragents',
+    type=argparse.FileType(),
+    help="user agents to parse, `-` for stdin",
+)
+args = parser.parse_args()
+
+with open(args.regexes) as r:
+    regexes = [
+        re.compile(pattern.rstrip('\n'))
+        for pattern in r
+    ]
+
+with args.useragents as r:
+    for u in r:
+        u = u.rstrip('\n')
+        for i, p in enumerate(regexes):
+            if p.search(u):
+                print(i)
+                break
+        else:
+            print('-')
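The multi-index extraction mentioned in the commit message could be sketched as a small variation on the script above: collect every matching index instead of stopping at the first. The function names and the comma-separated output format are assumptions for illustration, not something this commit defines.

```python
import re


def matching_indices(patterns, needle):
    """Return the 0-based indices of all compiled patterns matching needle."""
    return [i for i, p in enumerate(patterns) if p.search(needle)]


def format_indices(indices):
    """Render indices as one output field, keeping `-` for 'no match'.

    The comma separator is an assumed convention; it keeps the field
    free of tabs so it can still be combined into TSV.
    """
    return ",".join(map(str, indices)) if indices else "-"
```

The first element of the returned list is the index the single-index extractors print, so the two outputs can still be cross-checked against each other.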