Commit 3c83389

Consider #hash links in link density (#646)
* don't count #hash links to link density
* detect hash links correctly
* update existing test cases
* use coefficient for link density as it better plays with `cleanConditionally`
* improve isList detection
1 parent fc78270 commit 3c83389
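
The weighting described in the commit message can be sketched in isolation. This is a minimal, DOM-free illustration, not the Readability.js implementation: the function name `weightedLinkDensity` and the `{ href, text }` input shape are hypothetical, while the `/^#.+/` pattern and the 0.3 coefficient are taken from this commit's diff.

```javascript
// Hypothetical stand-alone sketch; the real code walks DOM <a> nodes.
// Same pattern as the `hashUrl` entry added to REGEXPS in this commit.
const HASH_URL = /^#.+/;

// Weight applied to in-page (#hash) links, as in the diff.
const HASH_LINK_COEFFICIENT = 0.3;

/**
 * links: array of { href: string | null, text: string } (assumed shape)
 * textLength: length of all text inside the element being scored
 */
function weightedLinkDensity(links, textLength) {
  // Guard added for this sketch so an empty element scores 0.
  if (textLength === 0) return 0;

  let linkLength = 0;
  for (const { href, text } of links) {
    // A bare "#" does not match /^#.+/ and keeps the full weight of 1.
    const coefficient = href && HASH_URL.test(href) ? HASH_LINK_COEFFICIENT : 1;
    linkLength += text.length * coefficient;
  }
  return linkLength / textLength;
}
```

Under these assumptions, a block whose text consists entirely of `#hash` links scores about 0.3 rather than 1, which is likely why tables of contents such as the one in `test/test-pages/mercurial/expected.html` are now kept.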

13 files changed, 3022 additions and 8 deletions

Readability.js

Lines changed: 13 additions & 4 deletions
@@ -124,7 +124,7 @@ Readability.prototype = {
     okMaybeItsACandidate: /and|article|body|column|content|main|shadow/i,

     positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
-    negative: /hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
+    negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
     extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i,
     byline: /byline|author|dateline|writtenby|p-author/i,
     replaceFonts: /<(\/?)font[^>]*>/gi,
@@ -135,6 +135,7 @@ Readability.prototype = {
     prevLink: /(prev|earl|old|new|<|«)/i,
     whitespace: /^\s*$/,
     hasContent: /\S$/,
+    hashUrl: /^#.+/,
     srcsetUrl: /(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))/g,
     b64DataUrl: /^data:\s*([^\s;,]+)\s*;\s*base64\s*,/i,
     // See: https://schema.org/Article
@@ -1745,7 +1746,9 @@ Readability.prototype = {

     // XXX implement _reduceNodeList?
     this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
-      linkLength += this._getInnerText(linkNode).length;
+      var href = linkNode.getAttribute("href");
+      var coefficient = href && this.REGEXPS.hashUrl.test(href) ? 0.3 : 1;
+      linkLength += this._getInnerText(linkNode).length * coefficient;
     });

     return linkLength / textLength;
@@ -2007,8 +2010,6 @@ Readability.prototype = {
     if (!this._flagIsActive(this.FLAG_CLEAN_CONDITIONALLY))
       return;

-    var isList = tag === "ul" || tag === "ol";
-
     // Gather counts for other typical elements embedded within.
     // Traverse backwards so we can remove nodes at the same time
     // without effecting the traversal.
@@ -2020,6 +2021,14 @@ Readability.prototype = {
         return t._readabilityDataTable;
       };

+      var isList = tag === "ul" || tag === "ol";
+      if (!isList) {
+        var listLength = 0;
+        var listNodes = this._getAllNodesWithTag(node, ["ul", "ol"]);
+        this._forEachNode(listNodes, (list) => listLength += this._getInnerText(list).length);
+        isList = listLength / this._getInnerText(node).length > 0.9;
+      }
+
       if (tag === "table" && isDataTable(node)) {
         return false;
       }
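
The broadened `isList` check in the last hunk can be illustrated without a DOM. The sketch below is an assumption-laden restatement, not the library's code: `looksLikeList` and its parameters are hypothetical names, and the caller is assumed to have already measured the text length of each descendant `ul`/`ol`. Only the 0.9 threshold comes from the diff.

```javascript
// Hypothetical stand-alone sketch of the ratio test from the diff.
// Threshold taken from the commit: lists must hold more than 90% of the text.
const LIST_TEXT_RATIO = 0.9;

/**
 * tag: lowercase tag name of the node being cleaned
 * listTextLengths: text length of each descendant <ul>/<ol> (assumed input)
 * totalTextLength: text length of the whole node
 */
function looksLikeList(tag, listTextLengths, totalTextLength) {
  // A node that is itself a list always qualifies, as before the commit.
  if (tag === "ul" || tag === "ol") return true;

  // Guard added for this sketch; an empty node is not treated as a list.
  if (totalTextLength === 0) return false;

  const listLength = listTextLengths.reduce((sum, len) => sum + len, 0);
  return listLength / totalTextLength > LIST_TEXT_RATIO;
}
```

For example, a `div` with 100 characters of text, 95 of them inside a `ul`, would be treated as a list under this sketch, while one with only 50 characters inside lists would not.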

test/test-pages/bug-1255978/expected.html

Lines changed: 0 additions & 3 deletions
@@ -42,9 +42,6 @@
 <p>Zeev Sharon said that the old rule of thumb is that for every $1000 invested in a room, the hotel should charge $1 in average daily rate. So a room that cost $300,000 to build, should sell on average for $300/night.</p>
 <h3>5. Beware the wall-mounted hairdryer</h3>
 <p>It contains the most germs of anything in the room. Other studies have said the TV remote and bedside lamp switches are the most unhygienic. “Perhaps because it's something that's easy for the housekeepers to forget to check or to squirt down with disinfectant,” Forrest Jones said.</p>
-<div data-scald-gallery="3739501">
-<h2><span></span>Business news in pictures</h2>
-</div>
 <h3>6. Mini bars almost always lose money</h3>
 <p>Despite the snacks in the minibar seeming like the most overpriced food you have ever seen, hotel owners are still struggling to make a profit from those snacks. "Minibars almost always lose money, even when they charge $10 for a Diet Coke,” Sharon said.</p>
 <div>

test/test-pages/mercurial/expected-metadata.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
   "title": "Shared Mutable History — evolve extension for Mercurial",
   "byline": null,
   "dir": null,
-  "excerpt": "Once you have mastered the art of mutable history in a single repository (see the user guide), you can move up to the next level: shared mutable history. evolve lets you push and pull draft changesets between repositories along with their obsolescence markers. This opens up a number of interesting possibilities.",
+  "excerpt": "Contents",
   "siteName": null,
   "readerable": true
 }

test/test-pages/mercurial/expected.html

Lines changed: 61 additions & 0 deletions
@@ -1,5 +1,66 @@
 <div id="readability-page-1" class="page">
 <div id="evolve-shared-mutable-history">
+<div id="contents">
+<p> Contents </p>
+<ul>
+<li>
+<a href="#evolve-shared-mutable-history" id="id4">Evolve: Shared Mutable History</a>
+<ul>
+<li>
+<a href="#sharing-with-a-single-developer" id="id5">Sharing with a single developer</a>
+<ul>
+<li>
+<a href="#publishing-and-non-publishing-repositories" id="id6">Publishing and non-publishing repositories</a>
+</li>
+<li>
+<a href="#setting-up" id="id7">Setting up</a>
+</li>
+<li>
+<a href="#example-1-amend-a-shared-changeset" id="id8">Example 1: Amend a shared changeset</a>
+</li>
+<li>
+<a href="#example-2-amend-again-locally" id="id9">Example 2: Amend again, locally</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#sharing-with-multiple-developers-code-review" id="id10">Sharing with multiple developers: code review</a>
+<ul>
+<li>
+<a href="#id2" id="id11">Setting up</a>
+</li>
+<li>
+<a href="#example-3-alice-commits-and-amends-a-draft-fix" id="id12">Example 3: Alice commits and amends a draft fix</a>
+</li>
+<li>
+<a href="#example-4-bob-implements-and-publishes-a-new-feature" id="id13">Example 4: Bob implements and publishes a new feature</a>
+</li>
+<li>
+<a href="#example-5-alice-integrates-and-publishes" id="id14">Example 5: Alice integrates and publishes</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#getting-into-trouble-with-shared-mutable-history" id="id15">Getting into trouble with shared mutable history</a>
+<ul>
+<li>
+<a href="#id3" id="id16">Setting up</a>
+</li>
+<li>
+<a href="#example-6-divergent-changesets" id="id17">Example 6: Divergent changesets</a>
+</li>
+<li>
+<a href="#phase-divergence-when-a-rewritten-changeset-is-made-public" id="id18">Phase-divergence: when a rewritten changeset is made public</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#conclusion" id="id19">Conclusion</a>
+</li>
+</ul>
+</li>
+</ul>
+</div>
 <p> Once you have mastered the art of mutable history in a single repository (see the <a href="http://fakehost/test/user-guide.html">user guide</a>), you can move up to the next level: <em>shared</em> mutable history. <tt><span>evolve</span></tt> lets you push and pull draft changesets between repositories along with their obsolescence markers. This opens up a number of interesting possibilities. </p>
 <p> The simplest scenario is a single developer working across two computers. Say you’re working on code that must be tested on a remote test server, probably in a rack somewhere, only accessible by SSH, and running an “enterprise-grade” (out-of-date) OS. But you probably prefer to write code locally: everything is setup the way you like it, and you can use your preferred editor, IDE, merge/diff tools, etc. </p>
 <p> Traditionally, your options are limited: either </p>

test/test-pages/mozilla-1/expected.html

Lines changed: 11 additions & 0 deletions
@@ -17,6 +17,17 @@ <h2>Designed to <br />be redesigned</h2>
 <p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/animations/flexible-bottom-fallback.cafd48a3d0a4.png" alt="" /></p>
 </div>
 </div>
+<div id="customize" data-ga-label="More ways to customize">
+<h2>More ways to customize</h2>
+<ul id="customizer-list" role="tablist">
+<li> <a id="customize-themes" href="#themes"> Themes </a>
+</li>
+<li> <a id="customize-addons" href="#add-ons"> Add-ons </a>
+</li>
+<li> <a id="customize-awesomebar" href="#awesome-bar"> Awesome Bar </a>
+</li>
+</ul>
+</div>
 <div id="customizers-wrapper">
 <div id="themes" role="tabpanel" aria-labelledby="customize-themes">
 <div>

test/test-pages/nytimes-2/expected.html

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,9 @@
 <p><a href="#story-continues-1">Continue reading the main story</a>
 </p>
 </div>
+<div id="story-continues-1">
+<p><a href="#story-continues-2">Continue reading the main story</a></p>
+</div>
 <div>
 <p data-para-count="602" data-total-count="1935" id="story-continues-2">In the second step, at the closing, <a href="https://www.sec.gov/Archives/edgar/data/1011006/000119312516656036/d178500dex22.htm">Yahoo will sell the stock</a> in the single subsidiary to Verizon. At that point, Yahoo will change its name to something without “Yahoo” in it. My favorite is simply Remain Co., the name Yahoo executives are using. Remain Co. will become a holding company for the Alibaba and Yahoo Japan stock. Included will also be $10 billion in cash, plus the Excalibur patent portfolio and a number of minority investments including Snapchat. Ahh, if only Yahoo had bought Snapchat instead of Tumblr (indeed, if only Yahoo had bought Google or Facebook when it had the chance).</p>
 <p data-para-count="262" data-total-count="2197" id="story-continues-3">Because it is a sale of a subsidiary, the $4.8 billion will be paid to Yahoo. Its shareholders will not receive any money unless Yahoo pays it out in a dividend (after paying taxes). Instead, Yahoo shareholders will be left holding shares in the renamed company.</p>

test/test-pages/nytimes-4/expected.html

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,15 @@
 </figcaption>
 </figure>
 </div>
+<div>
+<ul>
+<li>
+<time datetime="2018-09-25">Sept. 25, 2018</time>
+</li>
+<li>
+</li>
+</ul>
+</div>
 </header>
 <section name="articleBody" itemprop="articleBody">
 <div>
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "title": "Simple Anomaly Detection Using Plain SQL",
+  "byline": "Haki Benita",
+  "dir": null,
+  "excerpt": "Many developers think that having a critical bug in their code is the worse thing that can happen. Well, there is something much worst than that: Having a critical bug in your code and not knowing about it! Using some high school level statistics and a fair knowledge of SQL, I implemented a very simple anomaly detection system.",
+  "siteName": "Haki Benita",
+  "readerable": true
+}
