Skip to content

Ignore low-confidence CharlockHolmes guesses when parsing link cards#9510

Merged
Gargron merged 3 commits intomastodon:masterfrom
ClearlyClaire:fixes/fetch-link-card-charset
Dec 17, 2018
Merged

Ignore low-confidence CharlockHolmes guesses when parsing link cards#9510
Gargron merged 3 commits intomastodon:masterfrom
ClearlyClaire:fixes/fetch-link-card-charset

Conversation

@ClearlyClaire
Copy link
Copy Markdown
Contributor

Fixes #9466 by ignoring low-confidence CharlockHolmes charset guesses.

High-confidence guesses will still override server or document-specified charset, and low confidence guesses won't be used at all even if there is no server or document-specified charset, so this might not be the best way to proceed.

Note also that the “hint” given to CharlockHolmes as very little importance, and feeding it the correct charset may not prevent it from outputing an incorrect, low-confidence guess.

@ClearlyClaire ClearlyClaire force-pushed the fixes/fetch-link-card-charset branch from 0e527e7 to d823ceb Compare December 12, 2018 13:45
Comment thread app/services/fetch_link_card_service.rb Outdated

guess = detector.detect(@html, @html_charset)
page = Nokogiri::HTML(@html, nil, guess&.fetch(:encoding, nil))
encoding = guess&.fetch(:confidence, 0) > 60 ? guess&.fetch(:encoding, nil) : nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think nil > 60 will throw no method error

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! Addressed.

@Gargron Gargron merged commit e709b8d into mastodon:master Dec 17, 2018
@jomo
Copy link
Copy Markdown
Contributor

jomo commented Dec 17, 2018

Thank you, @ThibG!

@ClearlyClaire ClearlyClaire deleted the fixes/fetch-link-card-charset branch March 14, 2019 15:42
hiyuki2578 pushed a commit to ProjectMyosotis/mastodon that referenced this pull request Oct 2, 2019
…mastodon#9510)

* Add failing test for windows-1251 link cards

* Ignore low-confidence CharlockHolmes guesses

Fixes mastodon#9466

* Fix no method error when charlock holmes cannot detect charset
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants