Conversation
@SKalt wanna take this for a spin?
Wow, thank you for this nice PR! 🚀
One more thing that could be debated is if
To make it faster, something like an inverted index: https://swtch.com/~rsc/regexp/regexp4.html. Then, to search for a string, you can just go through each word inside the search string, in turn, and look at its indices and check they're contiguous. But I don't know if we'll see text fragments often enough for preprocessing to make sense. If they're rare, matching on the fly as you go through the HTML cursor could be fine. I also agree that the two include fragment options should be folded together.
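For illustration, the word-position variant of that inverted-index idea could look roughly like this (a hedged sketch; `build_index` and `phrase_matches` are hypothetical names, not lychee code):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: map each word of the document to the positions where
/// it occurs. Built once per document.
fn build_index(words: &[&str]) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (pos, word) in words.iter().enumerate() {
        // Positions are pushed in increasing order, so each Vec stays sorted.
        index.entry((*word).to_owned()).or_default().push(pos);
    }
    index
}

/// A phrase matches iff some occurrence of its first word is followed by the
/// remaining words at consecutive positions (the "contiguous indices" check).
fn phrase_matches(index: &HashMap<String, Vec<usize>>, phrase: &[&str]) -> bool {
    let Some((first, rest)) = phrase.split_first() else {
        return false;
    };
    let Some(starts) = index.get(*first) else {
        return false;
    };
    starts.iter().any(|&start| {
        rest.iter().enumerate().all(|(i, word)| {
            index
                .get(*word)
                .is_some_and(|positions| positions.binary_search(&(start + 1 + i)).is_ok())
        })
    })
}

fn main() {
    let doc: Vec<&str> = "this is an example text fragment".split_whitespace().collect();
    let index = build_index(&doc);
    assert!(phrase_matches(&index, &["an", "example", "text"]));
    assert!(!phrase_matches(&index, &["example", "an"]));
}
```

Whether the one-time indexing pays off still depends on how many text fragments a document is checked against, as noted above.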
@thomas-zahner and @katrinafyi Thank you for your feedback!
Do I understand correctly that you want to also see an implementation based on
This sounds like a fascinating idea, but I agree that text fragments might be too rare for this to make sense.
Can I challenge this a bit? In my experience, text fragments are less stable than anchor fragments. I could imagine users wanting to perform anchor fragment checking (as anchors are more stable and can be seen as an "API" of the webpage), but not text fragment checking (as text fragments might be less stable and are definitely not an "API" of the webpage). Not sure this is a good enough reason to separate the two checks, but I felt a need to point it out. 😄 I'm fine with whatever is decided.
True, that's a really good point, thanks for bringing it up. To be honest, I've never used text fragments myself. How about we combine it with `--include-fragments` but make it take an argument string that's `none`, `anchor`, `text`, or `all`? I think that's possible with clap, and I think joining them together makes sense and makes the features more discoverable. You should also be able to write `--include-fragments` without a value and it would default to "anchor", probably.
Text fragments are also supposed to get/become an API at some point. But it's true that it's a newer feature and maybe still not fully mature.
Yes, that would be a middle ground. Something like the following?

```rust
enum FragmentMode {
    /// Don't do any fragment checking
    #[default]
    None,
    /// Check text anchors but not text fragments
    Simple,
    /// Check both text anchors and text fragments
    Full,
}
```

I'm not really sure/happy with the above variant names, but …
```rust
return status;
}

let document = normalize_whitespace(&extract_visible_text(content));
```
It seems unnecessarily slow to split and then join the entire document. Can we just avoid emitting redundant whitespace in the callback?
Alternatively, the document could emit into a `Vec` of whitespace-separated words and you'd just have to split the fragment the same way. This would avoid all the normalising we currently do.
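The word-vector idea might look like this (illustrative names only; note it compares whole words, which is a simplification of the real text-fragment matching rules):

```rust
/// Sketch: the document's visible text is collected into whitespace-separated
/// words once; the fragment is split the same way and searched for as a
/// contiguous run of words. Hypothetical helper, not lychee's actual API.
fn fragment_occurs(document_words: &[&str], fragment: &str) -> bool {
    let needle: Vec<&str> = fragment.split_whitespace().collect();
    if needle.is_empty() {
        return false;
    }
    document_words
        .windows(needle.len())
        .any(|window| window == needle.as_slice())
}

fn main() {
    // split_whitespace already collapses runs of spaces, tabs, and newlines,
    // so no separate normalization pass is needed.
    let words: Vec<&str> = "An  example\n text   fragment".split_whitespace().collect();
    assert!(fragment_occurs(&words, "example text"));
    assert!(!fragment_occurs(&words, "text example"));
}
```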
```rust
};

let [start] = parts.as_slice() else {
    let [start, end] = parts.as_slice() else {
```
Chained `let`/`else` can just be a `match` statement, I believe.
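Something along these lines (a sketch with a hypothetical helper, extracting only the start/end selection):

```rust
// Hypothetical helper showing the shape: one `match` on the slice replaces
// the two chained `let … else` blocks.
fn start_and_end(parts: &[&str]) -> Option<(String, Option<String>)> {
    match parts {
        [start] => Some(((*start).to_owned(), None)),
        [start, end] => Some(((*start).to_owned(), Some((*end).to_owned()))),
        _ => None,
    }
}

fn main() {
    assert_eq!(start_and_end(&["a"]), Some(("a".to_owned(), None)));
    assert_eq!(
        start_and_end(&["a", "b"]),
        Some(("a".to_owned(), Some("b".to_owned())))
    );
    assert_eq!(start_and_end(&[]), None);
}
```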
```rust
fn extract_visible_text(input: &str) -> String {
    #[derive(Default)]
    struct TextExtractor {
```
I think the inner structure is too big. Also, I think using `Default` for an initial value is a misuse; just use `pub fn new()` :)
This feature is kinda interesting outside of lychee, so if possible it would be nice to clean up the API a little bit (even if we don't export it).
```rust
let tag = String::from_utf8_lossy(name).into_owned();
if TextExtractor::is_hidden_tag(&tag) {
    self.hidden_stack.push(tag);
}
```
I was looking at the spec and a JS implementation, and a "block" tag should terminate (prevent) any matches across its boundary. This is kinda tricky, so I'm fine with the lychee implementation accepting too much. I just thought it was worth mentioning and maybe deserves a comment in the code.
In general, I'd like to see some references to the spec in the comments :) Is the matching algorithm based on some reference implementation?
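One way the block-boundary rule could be approximated (purely a hypothetical sketch, not lychee's extractor) is to emit a separator that can never appear in a fragment whenever a block-level element closes, so a search on the flattened text cannot match across the boundary:

```rust
/// Hypothetical sketch: append visible text, inserting a single space between
/// adjacent text chunks.
fn append_text(buffer: &mut String, text: &str) {
    if !buffer.is_empty() && !buffer.ends_with(['\u{0}', ' ']) {
        buffer.push(' ');
    }
    buffer.push_str(text);
}

/// Emit a NUL "wall" at a block-element boundary; NUL never occurs in a
/// percent-decoded text directive, so no fragment can straddle it.
fn close_block(buffer: &mut String) {
    buffer.push('\u{0}');
}

fn main() {
    let mut doc = String::new();
    append_text(&mut doc, "hello");
    close_block(&mut doc); // e.g. a closing </p>
    append_text(&mut doc, "world");
    // "hello world" must not match across the block boundary:
    assert!(!doc.contains("hello world"));
    assert!(doc.contains("hello"));
    assert!(doc.contains("world"));
}
```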
```rust
}

#[cfg(test)]
mod tests {
```
You seem to support it in your code, so it would be nice to have a test where the start and end could be overlapping:
`start=aa, end=aa` should match in `aaaa` but not in `aaa` or `aa`.
Also one where the start is part of a word:
`start=aa, end=b` should match in `aab` and `aaab`.
Actually, does the spec say anything about whether a fragment match can start in the middle of a word?
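To pin those cases down, a minimal stand-in matcher (a hypothetical type, not lychee's actual `TextDirective`; it skips prefix/suffix and searches every `start` candidate) would be expected to behave as:

```rust
/// Minimal stand-in for the matcher under review, searching each occurrence
/// of `start` so overlapping start/end ranges are handled.
struct TextDirective {
    start: String,
    end: Option<String>,
}

impl TextDirective {
    fn matches(&self, document: &str) -> bool {
        if self.start.is_empty() {
            return false;
        }
        let mut start_offset = 0;
        while let Some(relative_start) = document[start_offset..].find(&self.start) {
            let match_start = start_offset + relative_start;
            let match_end = match_start + self.start.len();
            match &self.end {
                Some(end) => {
                    if document[match_end..].contains(end.as_str()) {
                        return true;
                    }
                    // No `end` after this `start`; try the next occurrence.
                    start_offset = match_start + 1;
                }
                None => return true,
            }
        }
        false
    }
}

fn main() {
    let aa_aa = TextDirective { start: "aa".into(), end: Some("aa".into()) };
    assert!(aa_aa.matches("aaaa"));
    assert!(!aa_aa.matches("aaa"));
    assert!(!aa_aa.matches("aa"));

    let aa_b = TextDirective { start: "aa".into(), end: Some("b".into()) };
    assert!(aa_b.matches("aab"));
    assert!(aa_b.matches("aaab"));
}
```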
```rust
    };
};

let Some((element, text_directive)) = fragment.split_once(":~:text=") else {
```
Duplicated with code in `text_fragment.rs`?
Also, I'd use "anchor" instead of `element`; it makes me think of HTML elements.
> duplicated with code in text_fragment.rs?

Not really. This code only splits the anchor from the text fragment. The code in `text_fragment.rs` actually parses the text fragment into prefix, start, end and suffix. I thought to keep this separation, as fully parsing the text directive at this point is not needed.

> also I'd use "anchor" instead of element, it makes me think of html elements

Fully agree, nice catch!
```rust
/// Whether to check the existence of text fragments in the response HTML files.
///
/// Will be disabled if the request method is `HEAD`.
include_text_fragments: bool,
```
This is only in the website checker at the moment. As far as I can tell, it should be in `FileChecker` too. Is there any way of doing this nicely?
Honestly, `fragment_checker` should be an `Option` field rather than carrying the adjacent `include_fragments` flag, and maybe the text fragments can be hooked into `FragmentChecker` somehow.
I love this! I'll produce a PR as soon as I have time.
Sorry for not being clearer. What I meant with the "API" analogy is the following: as a website author, if I want to encourage people to link to specific parts of a page, I'd ask them to use anchors. I'd do my best to keep the anchors unchanged, even if the section title has changed. So website authors are very likely to include anchors in their stable interface promise. In contrast, as a website author, I would definitely not want people to expect me to keep text stable so that their text fragments don't break.
I don't know how hard it is to add html5gum support, but I would personally very much appreciate that. The reason is that we'd like to maintain feature parity between both engines. They each have their advantages and disadvantages. html5gum is faster and allows more direct/raw HTML access, while html5ever is very mature and provides some helpful high-level functionality.
```rust
fn parse(input: &str) -> Option<Self> {
    let mut parts: Vec<&str> = input.split(',').collect();
    if parts.is_empty() {
        return None;
    }

    let prefix = if parts.first().is_some_and(|part| part.ends_with('-')) && parts.len() >= 2 {
        let prefix = parts.remove(0);
        Some(prefix[..prefix.len() - 1].to_owned())
    } else {
        None
    };

    let suffix = if parts.last().is_some_and(|part| part.starts_with('-')) && parts.len() >= 2 {
        let suffix = parts.pop().expect("checked length above");
        Some(suffix[1..].to_owned())
    } else {
        None
    };

    let [start] = parts.as_slice() else {
        let [start, end] = parts.as_slice() else {
            return None;
        };

        return Some(Self {
            prefix,
            start: normalize_whitespace(start),
            end: Some(normalize_whitespace(end)),
            suffix: suffix.map(|s| normalize_whitespace(&s)),
        });
    };

    Some(Self {
        prefix: prefix.map(|s| normalize_whitespace(&s)),
        start: normalize_whitespace(start),
        end: None,
        suffix: suffix.map(|s| normalize_whitespace(&s)),
    })
}
```
I personally found this a bit harder to read than it could be.
Tried my hand at a refactoring and added a few comments. Also turned the separator and markers into constants. I feel like some examples are also helpful.
```rust
/// Parses a text directive value (the part after `text=`) into its components.
///
/// The format is: `[prefix-,] start [,end] [,-suffix]`
///
/// # Examples
///
/// ```
/// // Exact match
/// let d = TextDirective::parse("an example").unwrap();
/// assert_eq!(d.start, "an example");
/// assert!(d.prefix.is_none() && d.end.is_none() && d.suffix.is_none());
///
/// // Range match
/// let d = TextDirective::parse("an example,text fragment").unwrap();
/// assert_eq!(d.start, "an example");
/// assert_eq!(d.end.unwrap(), "text fragment");
///
/// // Prefix and suffix context terms
/// let d = TextDirective::parse("this is-,an example,-text fragment").unwrap();
/// assert_eq!(d.prefix.unwrap(), "this is");
/// assert_eq!(d.start, "an example");
/// assert_eq!(d.suffix.unwrap(), "text fragment");
///
/// // Invalid: no start term
/// assert!(TextDirective::parse("").is_none());
/// ```
fn parse(input: &str) -> Option<Self> {
    // The parameter separator splits the directive into its components:
    // [prefix-,] start [,end] [,-suffix]
    const PARAMETER_SEPARATOR: char = ',';
    // The context term marker distinguishes prefix/suffix from start/end terms.
    // e.g. "prefix-,start,-suffix"
    const CONTEXT_TERM_MARKER: char = '-';

    // `split` always yields at least one token, so reject an empty input
    // (no start term) directly.
    if input.is_empty() {
        return None;
    }
    // Literal commas in the text are percent-encoded upstream, so splitting on ',' is safe.
    let mut parts: Vec<&str> = input.split(PARAMETER_SEPARATOR).collect();

    // A prefix is indicated by a trailing context term marker on the first token
    let prefix = if parts.first().is_some_and(|part| part.ends_with(CONTEXT_TERM_MARKER)) && parts.len() >= 2 {
        let prefix = parts.remove(0);
        Some(prefix.strip_suffix(CONTEXT_TERM_MARKER).unwrap_or(prefix).to_owned())
    } else {
        None
    };

    // A suffix is indicated by a leading context term marker on the last token
    let suffix = if parts.last().is_some_and(|part| part.starts_with(CONTEXT_TERM_MARKER)) && parts.len() >= 2 {
        let suffix = parts.pop()?;
        Some(suffix.strip_prefix(CONTEXT_TERM_MARKER).unwrap_or(suffix).to_owned())
    } else {
        None
    };

    // After removing prefix/suffix, exactly 1 or 2 tokens must remain (start [,end])
    let (start, end) = match parts.as_slice() {
        [start] => (start, None),
        [start, end] => (start, Some(normalize_whitespace(end))),
        _ => return None,
    };

    Some(Self {
        prefix: prefix.map(|s| normalize_whitespace(&s)),
        start: normalize_whitespace(start),
        end,
        suffix: suffix.map(|s| normalize_whitespace(&s)),
    })
}
```

```rust
fn matches(&self, document: &str) -> bool {
    if self.start.is_empty() {
        return false;
    }

    let mut start_offset = 0;
    while let Some(relative_start) = document[start_offset..].find(&self.start) {
        let match_start = start_offset + relative_start;
        let mut match_end = match_start + self.start.len();

        if let Some(end) = &self.end {
            let Some(relative_end) = document[match_end..].find(end) else {
                return false;
            };
            match_end += relative_end + end.len();
        }

        let prefix_matches = self
            .prefix
            .as_ref()
            .is_none_or(|prefix| document[..match_start].trim_end().ends_with(prefix));
        let suffix_matches = self
            .suffix
            .as_ref()
            .is_none_or(|suffix| document[match_end..].trim_start().starts_with(suffix));

        if prefix_matches && suffix_matches {
            return true;
        }

        start_offset = match_start + 1;
    }

    false
}
}
```
I think there's a bug. When end is not found, it immediately returns false, but it should continue to the next candidate match of start instead, since a later occurrence of start might have a valid end after it.
```rust
fn matches(&self, document: &str) -> bool {
    let mut start_offset = 0;
    // Find each occurrence of `start` and check whether the surrounding
    // context (prefix, end, suffix) also matches, per the spec's algorithm.
    while let Some(relative_start) = document[start_offset..].find(&self.start) {
        let match_start = start_offset + relative_start;
        let mut match_end = match_start + self.start.len();

        // For range matches, find the first occurrence of `end` after `start`
        if let Some(end) = &self.end {
            let Some(relative_end) = document[match_end..].find(end.as_str()) else {
                // No `end` found after this `start`; try the next occurrence of `start`
                start_offset = match_start + 1;
                continue;
            };
            match_end += relative_end + end.len();
        }

        let prefix_matches = self
            .prefix
            .as_ref()
            .is_none_or(|prefix| document[..match_start].trim_end().ends_with(prefix.as_str()));
        let suffix_matches = self
            .suffix
            .as_ref()
            .is_none_or(|suffix| document[match_end..].trim_start().starts_with(suffix.as_str()));

        if prefix_matches && suffix_matches {
            return true;
        }

        start_offset = match_start + 1;
    }

    false
}
```

```rust
fn parse_text_directives(url: &Url) -> Vec<TextDirective> {
    let Some(fragment) = url.fragment() else {
        return Vec::new();
    };
    let Some((_, directive)) = fragment.split_once(":~:") else {
        return Vec::new();
    };

    url::form_urlencoded::parse(directive.as_bytes())
        .filter(|(key, _)| key == "text")
        .filter_map(|(_, value)| TextDirective::parse(value.as_ref()))
        .collect()
}
```
Can we make `:~:` a constant? Maybe also the text directive type?
```rust
fn parse_text_directives(url: &Url) -> Vec<TextDirective> {
    // The fragment directive delimiter separates the element fragment from the directive.
    // e.g. "https://example.com#section:~:text=foo"
    const FRAGMENT_DIRECTIVE_DELIMITER: &str = ":~:";
    // The only directive type defined by the spec
    const TEXT_DIRECTIVE_PREFIX: &str = "text";

    let Some(fragment) = url.fragment() else {
        return Vec::new();
    };
    let Some((_, directive)) = fragment.split_once(FRAGMENT_DIRECTIVE_DELIMITER) else {
        return Vec::new();
    };

    url::form_urlencoded::parse(directive.as_bytes())
        .filter(|(key, _)| key == TEXT_DIRECTIVE_PREFIX)
        .filter_map(|(_, value)| TextDirective::parse(value.as_ref()))
        .collect()
}
```

Of course, we could move all constants to the beginning of the file to group them together, but I find reading code with meaningful constants easier in general.
```rust
}

fn normalize_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
```
Soon 😉 (This would avoid the extra allocations, if only it were stable already.)
```rust
return ParsedFragment {
    anchor_fragment: None,
    text_directive: None,
};
```
Could be slightly cleaner if `ParsedFragment` implemented `Default`:

```rust
ParsedFragment::default()
```

```rust
    };
};

let Some((element, text_directive)) = fragment.split_once(":~:text=") else {
```
I like constants. 😉
```rust
pub(crate) const TEXT_DIRECTIVE_SEPARATOR: &str = ":~:text=";
```

```rust
async fn check_default(&self, request: Request) -> Status {
    let method = request.method().clone();
    let request_url = request.url().clone();
    let parsed_fragment = Self::parse_fragment(&request_url);
```
This should technically be `parsed_fragments`. I was confused for a moment while reading this; I thought it returned a single value and then you did `parsed_fragment.anchor_fragment`. The plural would make it clearer what's going on.
```rust
    && parsed_fragment.anchor_fragment.is_some();
let check_text_fragments = self.include_text_fragments
    && method == Method::GET
    && parsed_fragment.text_directive.is_some();
```
The naming is a little off: here you use `text_directive`, and two lines above it's `include_text_fragments`. Can we align that naming? E.g. maybe `text_fragment`?
But I could also see how the naming is just off because of the spec and lychee's CLI params. Just wanted to raise that concern.
```rust
let check_text_fragments = self.include_text_fragments
    && method == Method::GET
    && parsed_fragment.text_directive.is_some();
let needs_body = check_anchor_fragments || check_text_fragments;
```
How about?

```diff
-let needs_body = check_anchor_fragments || check_text_fragments;
+let fetch_body = check_anchor_fragments || check_text_fragments;
```
Fixes #1545

This PR adds `--include-text-fragments`, which checks text fragments. `--include-fragments`'s behavior is changed to only check the anchor part. Hence, for a fragment like `#anchor:~:text=some-text`, the user may choose between three behaviors:

- `--include-fragments` checks only if `#anchor` is a valid anchor.
- `--include-text-fragments` checks only if `some-text` is a valid text.
- `--include-fragments --include-text-fragments` checks both.

Decisions to be challenged:

- The implementation is based on `html5ever`. I found no good reason to also include an implementation based on `html5gum`.
- The tags treated as hidden are `head`, `script`, `style` and `template`. I believe this is a good start; in particular, `head` also filters out `title` and `meta`. However, I see this more like a starting point than an academically correct response. I'm hoping for more user feedback here.