Add support for Text fragment feature (#1545) #1600
thiru-appitap wants to merge 13 commits into lycheeverse:master
Conversation
|
I forgot to run clippy across the test modules. The related lint failures are now fixed and the PR is ready for review!
lychee-lib/src/utils/url.rs
Outdated
| ); | ||
| match url { | ||
| Ok(url) => { | ||
| eprintln!( |
Modified the tests to use asserts (instead of the earlier manual checks).
lychee-lib/src/checker/website.rs
Outdated
| let mut status = Status::new(&response, self.accepted.clone()); | ||
| if self.validate_text_fragments && has_fragment_directive { | ||
| if let Ok(res) = response.text().await { | ||
| info!("checking fragment directive..."); | ||
| if let Some(fd) = req_url.fragment_directive() { | ||
| info!("directive: {:?}", fd.text_directives); | ||
| match fd.check(&res) { | ||
| Ok(stat) => { | ||
| status = stat; | ||
| } | ||
| Err(e) => { | ||
| return e.into(); | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| status |
Maybe we can move that into a function/method?
some tests for that part would also be nice
The code is now modified to move this into a separate function. In addition, the bulk of the logic is abstracted into the textfrag crate itself, so the website checker only deals with error responses from the text fragment checker function.
lychee-lib/src/client.rs
Outdated
| assert!(res.status().is_success()); | ||
|
|
||
| // start with suffix | ||
| println!("\ntesting start with suffix..."); |
you'll probably remove the println!s right?
Yes, the println!s are removed now.
|
|
||
| use crate::types::TextDirective; | ||
|
|
||
| const BLOCK_ELEMENTS: &[&str] = &[ |
This is a long list. Does that mean we'd have to maintain the HTML keywords here? Maybe we can avoid that as it would be an uphill battle.
I have yet to explore an alternative approach to this and am open to suggestions. For now, this list is retained (reserving this change for a near-future commit).
I like that this functionality is isolated in its own module. But it's a looot of code. 😅 Not sure what to do here, but at least the ratio of code/tests could be improved.
Perhaps we could move it into a separate crate or use an upstream crate for that? I think it would be a nice library to maintain individually as more applications could profit from it
The text fragment code is now moved into its own crate, textfrag, within the main lychee tree. As I am new to the ecosystem, I'll need help moving this into a separate crate. Please suggest!
| let mut all_directives_found = false; | ||
| let directive = td.directive.borrow(); | ||
|
|
||
| 'directive_loop: while !all_directives_found { |
The labels make it quite hard to read. Have you considered any alternatives?
Yes, the code logic was cumbersome and I've rewritten it now. Please share your comments!
| pub(crate) const FRAGMENT_DIRECTIVE_DELIMITER: &str = ":~:"; | ||
| pub(crate) const TEXT_DIRECTIVE_DELIMITER: &str = "text="; | ||
|
|
||
| pub(crate) const TEXT_DIRECTIVE_REGEX: &str = r"(?s)^text=(?:\s*(?P<prefix>[^,&-]*)-\s*[,$]?\s*)?(?:\s*(?P<start>[^-&,]*)\s*)(?:\s*,\s*(?P<end>[^,&-]*)\s*)?(?:\s*,\s*-(?P<suffix>[^,&-]*)\s*)?$"; |
Where does that regex come from? Did you write it yourself? If there's an "official" regex for those text fragments, we could perhaps add a link to the reference.
I built this regex myself. I searched for an equivalent when I started this implementation but couldn't find one. I am open to revisiting this.
In fact, I want to replace the regex with a simple parser for this requirement. The specification is not particular about the order of the directives, whereas the regex assumes a fixed order. With a parser, we might be able to get away from the ordering constraints.
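For illustration, the parser idea could look roughly like the sketch below. It assumes the positional form `text=[prefix-,]start[,end][,-suffix]`, does no percent-decoding, and uses a hypothetical `ParsedDirective` struct and `parse_text_directive` function that are not taken from the PR.

```rust
/// Parsed form of one `text=` directive (hypothetical stand-in type).
#[derive(Debug, Default, PartialEq)]
pub struct ParsedDirective {
    pub prefix: Option<String>,
    pub start: String,
    pub end: Option<String>,
    pub suffix: Option<String>,
}

/// Parses `text=[prefix-,]start[,end][,-suffix]` without a regex.
/// Returns `None` if the input does not have that shape.
/// Terms are returned as-is (still percent-encoded).
pub fn parse_text_directive(input: &str) -> Option<ParsedDirective> {
    let body = input.strip_prefix("text=")?;
    let mut parts: Vec<&str> = body.split(',').map(str::trim).collect();
    let mut parsed = ParsedDirective::default();

    // A leading part that ends in '-' is the prefix.
    if parts.len() > 1 {
        let first = parts[0];
        if let Some(prefix) = first.strip_suffix('-') {
            parsed.prefix = Some(prefix.to_string());
            parts.remove(0);
        }
    }

    // A trailing part that starts with '-' is the suffix.
    if parts.len() > 1 {
        let last = parts[parts.len() - 1];
        if let Some(suffix) = last.strip_prefix('-') {
            parsed.suffix = Some(suffix.to_string());
            parts.pop();
        }
    }

    // What remains must be `start` or `start,end`.
    match parts.as_slice() {
        [start] if !start.is_empty() => parsed.start = start.to_string(),
        [start, end] if !start.is_empty() && !end.is_empty() => {
            parsed.start = start.to_string();
            parsed.end = Some(end.to_string());
        }
        _ => return None,
    }
    Some(parsed)
}
```

Because the prefix and suffix are recognised by their `-` marker rather than by a capture-group position, the same function handles `start`, `start,end`, `prefix-,start` and `start,-suffix` without separate patterns.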
lychee-lib/src/utils/url.rs
Outdated
| } | ||
|
|
||
| /// Fragment Directive feature trait | ||
| /// we will use the extension trait pattern to extend the Url to support Text Fragment feature |
Thank you - this has now moved into the textfrag crate
lychee-lib/src/utils/url.rs
Outdated
| } | ||
|
|
||
| /// Fragment Directive feature trait | ||
| /// we will use the extension trait pattern to extend the Url to support Text Fragment feature |
| /// we will use the extension trait pattern to extend the Url to support Text Fragment feature | |
| /// We will use the extension trait pattern to extend [`url::Url`] to support the text fragment feature |
Incorporated this comment into the textfrag::utils::url file.
lychee-lib/src/utils/url.rs
Outdated
| /// Fragment Directive feature trait | ||
| /// we will use the extension trait pattern to extend the Url to support Text Fragment feature | ||
| pub(crate) trait UrlExt { | ||
| /// Checks if the url has a fragment and if the fragment is has the fragment directive delimiter embedded |
| /// Checks if the url has a fragment and if the fragment is has the fragment directive delimiter embedded | |
| /// Checks if the url has a fragment and if the fragment has the fragment directive delimiter embedded |
Updated the comments section (with slight rewording) in the textfrag crate
lychee-lib/src/utils/url.rs
Outdated
| } | ||
|
|
||
| impl UrlExt for Url { | ||
| /// Returns whether the URL has fragment directive or not |
| /// Returns whether the URL has fragment directive or not | |
| /// Checks whether the URL has fragment directive or not |
Addressed this in textfrag::utils::url[:16].
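As a rough illustration of the extension trait pattern discussed in these threads, the sketch below adds fragment-directive helpers to an existing type. The PR implements the trait for `url::Url`; here `str` stands in so the example has no dependencies, and returning the raw directive as `&str` is an assumption made for brevity (the PR returns its own directive type).

```rust
/// Delimiter that introduces the fragment directive inside a URL fragment.
const FRAGMENT_DIRECTIVE_DELIMITER: &str = ":~:";

/// Extension trait adding fragment-directive helpers to an existing type.
/// Assumption: implemented for `str` here instead of `url::Url`.
pub trait UrlExt {
    /// Checks whether the URL has a fragment directive or not.
    fn has_fragment_directive(&self) -> bool;

    /// Returns the raw text after `:~:` in the fragment, if present.
    fn fragment_directive(&self) -> Option<&str>;
}

impl UrlExt for str {
    fn has_fragment_directive(&self) -> bool {
        self.fragment_directive().is_some()
    }

    fn fragment_directive(&self) -> Option<&str> {
        // Everything after the first '#' is the fragment.
        let (_, fragment) = self.split_once('#')?;
        // Everything after ":~:" is the fragment directive.
        let (_, directive) = fragment.split_once(FRAGMENT_DIRECTIVE_DELIMITER)?;
        Some(directive)
    }
}
```

With the trait in scope, the new methods read as if they belonged to the original type, for example `"https://example.com/#:~:text=foo".has_fragment_directive()`.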
mre
left a comment
Left some comments. I like the overall structure. Good work so far!
@mre |
|
@thiru, any updates? Let me know in case you need any help. 😃 |
|
Thanks, sounds good! |
@mre my apologies for the delay. While refactoring into a separate crate, I encountered a few corner-case issues and it took more time than planned. I'm running the tests now and hope to re-raise the pull request by tomorrow!
|
I am committing the changes into my fork and will initiate a pull-request shortly! |
- lib's website checker continues to have the logic to validate text fragments
- clean-up of the tests was done
- review feedback incorporated: addressed structural, logic, test and documentation comments
|
@mre requesting your help in addressing the CI / publish-check failure. Please let me know if I have to make any changes on my end. Thank you!
lychee-lib/src/checker/website.rs
Outdated
|
|
||
| fn check_text_fragments(site_data: &str, url: &Url, mut status: Status) -> Status { | ||
| let res = check_text_fragments(site_data, url); | ||
| if res.is_err() { |
You don't need the if condition here. You're matching on res.err() below, and in the case where res.err() is not set, it will be None. This case is already covered in your _res match arm since _res is just a placeholder, which also includes the value being None.
Yes - I've updated the code
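The review point above can be sketched as follows: matching directly on the `Result` already covers the success case, so no preceding `is_err()` check is needed. `Status`, `TextFragmentError`, `find_text` and `apply_fragment_check` are simplified, hypothetical stand-ins and not the types or functions used in lychee.

```rust
/// Simplified stand-in for the checker's status type (assumption).
#[derive(Debug, PartialEq)]
pub enum Status {
    Ok,
    Error(String),
}

/// Simplified stand-in for the text fragment error type (assumption).
#[derive(Debug, PartialEq)]
pub enum TextFragmentError {
    TextDirectiveNotFound(String),
}

/// Stand-in checker: succeeds if `needle` occurs in the page text.
fn find_text(site_data: &str, needle: &str) -> Result<(), TextFragmentError> {
    if site_data.contains(needle) {
        Ok(())
    } else {
        Err(TextFragmentError::TextDirectiveNotFound(needle.to_string()))
    }
}

/// Keeps the incoming status on success and replaces it on failure.
/// A single `match` handles both cases; no `if res.is_err()` guard is required.
pub fn apply_fragment_check(site_data: &str, needle: &str, status: Status) -> Status {
    match find_text(site_data, needle) {
        Ok(()) => status,
        Err(TextFragmentError::TextDirectiveNotFound(text)) => {
            Status::Error(format!("text fragment not found: {text}"))
        }
    }
}
```

Keeping this in its own function also makes it straightforward to unit-test the success and failure paths without issuing a network request.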
| @@ -0,0 +1,78 @@ | |||
| /// Defines the status of the Text Fragment search and extraction/search operation status | |||
| /// Defines the status of the Text Fragment search and extraction/search operation status | |
| //! Defines the status of the Text Fragment search and extraction/search operation status |
The refactored code takes care of this now!
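For context on the suggestion above: `//!` is an inner doc comment and documents the enclosing module, while `///` is an outer doc comment and documents the item that follows it. A minimal sketch, using an inline module and hypothetical enum variants (only the `TextDirectiveStatus` name appears in the PR):

```rust
pub mod status {
    //! Defines the status of the text fragment search operation.
    //!
    //! This inner doc comment documents the `status` module itself.

    /// This outer doc comment documents the enum that follows it.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum TextDirectiveStatus {
        /// The directive text was located in the page (hypothetical variant).
        Found,
        /// The directive text was not located in the page (hypothetical variant).
        NotFound,
    }
}
```

At the top of a file, a `///` comment would attach to whatever item comes next (often a `use` line) rather than to the module, which is why `//!` is the usual choice for file-level documentation.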
textfrag/src/types/url.rs
Outdated
|
|
||
| use crate::types::{FragmentDirective, FRAGMENT_DIRECTIVE_DELIMITER}; | ||
|
|
||
| /// Fragment Directive feature trait |
| /// Fragment Directive feature trait | |
| /// Fragment Directive extension trait |
textfrag/src/utils/mod.rs
Outdated
Maybe we don't need that file and can move the code closer to where it's used? (Assuming it's only used in one place.)
removed the folder as part of the refactoring exercise
This module is so large, I'd assume a relatively large doc-block as well.
In there, I'd answer a few questions:
- What does the module do?
- How is it supposed to be used (including an example)?
- What were the design tradeoffs?
- What are the possible error conditions?
working on it - will commit this documentation for review!
|
|
||
| use crate::types::{TextDirective, TextDirectiveKind, TextDirectiveStatus}; | ||
|
|
||
| const BLOCK_ELEMENTS: &[&str] = &[ |
I'm still not too happy that we have to maintain this list.
Looked around for alternatives, and I found https://github.com/servo/html5ever/blob/main/html5ever/src/tree_builder/tag_sets.rs.
Not sure if it can be used, but I wanted to mention it.
If it's too much hassle to integrate, we can keep the current implementation.
Unfortunately, html5ever does not expose these macros for us to consume, so I couldn't use it. I don't want to keep a copy of this file inside the lychee repository, so I am ruling that out. I did search for alternative approaches and found that we either need a headless browser to identify the element type (as block) or stay with the current approach.
My recommendation, for now, is to keep this code. In future, if html5ever makes these macros public, we can potentially be freed from maintaining this list.
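A minimal sketch of the retained approach, assuming the list is only consulted to decide whether a tag name is block-level. The entries below are an abbreviated, illustrative subset and `is_block_element` is a hypothetical helper name; neither is copied from the PR.

```rust
/// Abbreviated, illustrative subset of HTML block-level element names.
const BLOCK_ELEMENTS: &[&str] = &[
    "address", "article", "blockquote", "div", "h1", "li", "p", "pre", "section", "table", "ul",
];

/// Returns true if `tag` names a block-level element in the list above.
/// The comparison ignores ASCII case, since HTML tag names are case-insensitive.
pub fn is_block_element(tag: &str) -> bool {
    BLOCK_ELEMENTS
        .iter()
        .any(|name| name.eq_ignore_ascii_case(tag))
}
```

Keeping the lookup behind one function would limit the maintenance cost to a single constant if the list is later replaced by an upstream source.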
|
I've added a few more comments. As for the error, it's currently failing because there is no
I'd vote for option 3, which is the easiest right now and allows us to keep maintaining the code inside lychee. It's still reasonably well encapsulated in its own module and we can always make it a separate crate later once the code is mature enough. The downside is longer compile-times because it would be in the same codegen unit as the rest of the library code.
Detailed instructions for option 3
To merge the
This approach has several benefits:
Let me know what you think. |
@mre I understand and agree with your recommendation. I am addressing this, as well as the other review comments, and will republish for review. Thanks for your patience!
|
@thiru-appitap, I saw that you did some work. Can you close the resolved conversations already? It becomes a bit hard to keep track of the open TODOs. 😉 |
- moved (back) the textfrag as a module into the lychee-lib crate
- added documentation and ran doctests to verify it works
- added a cli test to validate the text fragment functionality
@mre I was juggling a couple of priorities, along with local travel, and so couldn't resolve and commit the changes earlier. From now on I should be able to get back with a much faster turnaround on the feedback.
|
This branch kindly asks for a rebase. Running Would be nice to see this arrive. |
I have rebased the branch onto the latest master but am facing lint failures. I'm working on a fix and will recommit by this weekend.
|
I guess you can just do another rebase. We fixed the lint errors in |
|
@thiru-appitap, would you like to continue working on this? Alternatively, we can close the PR. I still believe that it would be great to have this feature, but I'm not sure if anyone has the bandwidth to look into it at the moment. |
fcdf77c to e0912ab
|
Thanks for working on this feature. It would be a great addition to Lychee.
The truth is that no one is working on this right now. 😞 We are looking for someone to pick up the work and fix the remaining issues mentioned in the comments of this PR. |
|
Well, I'm sorry to hear that, and sorry that I'm not a developer able to help here. But thanks anyway to the people who started the work, and to the future person* with enough free time and interest who gets the functionality to the finish line. I have faith in the FLOSS community. PS: And of course thanks to everyone involved in this project.
|
Out of curiosity, what's your use-case for that feature? |
Hi, we have started using it on this GH Action [1] and this
Each feature should have a link to an explanation. And we had lots of errors and redirects before implementing Lychee 🫶 [5]
The idea is to point to the most precise part of each page to make the tag/feature easy to understand, given that it's a bit difficult for non-technical people. Example errors on [6]
[1] https://github.com/Podcastindex-org/web-ui/blob/master/.github/workflows/links.yml
|
@Marzal Thank you for letting us know the details. This really helps to understand the use-case and this way we already have test cases to cover. In summary (for people working on this PR, or my future self) with
The workaround is the following .lycheeignore content:
Affected links
I might actually find the time in the next few months to try and push this over the finish line, but no guarantees.
|
Happy to help in any way I can like testing. |
Text Fragment feature implementation pull request. The feature follows the published URL Fragment Text Directives specification (https://wicg.github.io/scroll-to-text-fragment/).
If the fragment directive is not found, a TextDirectiveNotFound error will be returned.
The following changes are completed:
- fancy-regex
- include-text-fragments is added to support the feature
- UrlExt trait is implemented to enhance Url to support Fragment Directive
#:~:text=linked%20URL,-'s%20format&text=Deprecated-,attributes,attribute)
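To make the example fragment above concrete, here is a small sketch of how a fragment directive with several `text=` entries could be split up and how an individual term could be percent-decoded. The function names and the `example.com` host used when calling them are assumptions for illustration; the PR's own implementation may differ.

```rust
/// Returns the raw (still percent-encoded) value of every `text=` directive
/// found after `:~:` in the URL's fragment.
pub fn text_directives(url: &str) -> Vec<&str> {
    let Some((_, fragment)) = url.split_once('#') else {
        return Vec::new();
    };
    let Some((_, directive)) = fragment.split_once(":~:") else {
        return Vec::new();
    };
    // Directives are separated by '&'; only `text=` ones are kept.
    directive
        .split('&')
        .filter_map(|part| part.strip_prefix("text="))
        .collect()
}

/// Decodes `%XX` escapes in a single term; malformed escapes are kept verbatim.
/// Decoding is done per term, after splitting on ',', so that an encoded
/// comma (`%2C`) is not mistaken for a separator.
pub fn percent_decode(term: &str) -> String {
    let bytes = term.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(h), Some(l)) = (hi, lo) {
                out.push((h * 16 + l) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

For the fragment shown above, the first directive carries a start term and a suffix, and the second carries a prefix, a start term and an end term; each term would be decoded individually before searching the page text.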