Commit f2bb3b0

docs: expand README (#2384)
Document all parser events, options, and common workflows (searching, modifying, serializing the DOM).

Closes #1765
1 parent 9008dfd commit f2bb3b0

1 file changed

Lines changed: 112 additions & 21 deletions

README.md

@@ -83,8 +83,37 @@ JS! Hooray!
 That's it?!
 ```
 
-This example only shows three of the possible events.
-Read more about the parser, its events and options in the [wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options).
+### Parser events
+
+All callbacks are optional. The handler object you pass to `Parser` may implement any subset of these:
+
+| Event | Description |
+| -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `onopentag(name, attribs, isImplied)` | Opening tag. `attribs` is an object mapping attribute names to values. `isImplied` is `true` when the tag was opened implicitly (HTML mode only). |
+| `onopentagname(name)` | Emitted for the tag name as soon as it is available (before attributes are parsed). |
+| `onattribute(name, value, quote)` | Attribute. `quote` is `"` / `'` / `null` (unquoted) / `undefined` (no value, e.g. `disabled`). |
+| `onclosetag(name, isImplied)` | Closing tag. `isImplied` is `true` when the tag was closed implicitly (HTML mode only). |
+| `ontext(data)` | Text content. May fire multiple times for a single text node. |
+| `oncomment(data)` | Comment (content between `<!--` and `-->`). |
+| `oncdatastart()` | Opening of a CDATA section (`<![CDATA[`). |
+| `oncdataend()` | End of a CDATA section (`]]>`). |
+| `onprocessinginstruction(name, data)` | Processing instruction (e.g. `<?xml ...?>`). |
+| `oncommentend()` | Fires after a comment has ended. |
+| `onparserinit(parser)` | Fires when the parser is initialized or reset. |
+| `onreset()` | Fires when `parser.reset()` is called. |
+| `onend()` | Fires when parsing is complete. |
+| `onerror(error)` | Fires on error. |
+
+### Parser options
+
+| Option | Type | Default | Description |
+| ------------------------ | --------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `xmlMode` | `boolean` | `false` | Treat the document as XML. This affects entity decoding, self-closing tags, CDATA handling, and more. Set this to `true` for XML, RSS, Atom and RDF feeds. |
+| `decodeEntities` | `boolean` | `true` | Decode HTML entities (e.g. `&amp;` -> `&`). |
+| `lowerCaseTags` | `boolean` | `!xmlMode` | Lowercase tag names. |
+| `lowerCaseAttributeNames`| `boolean` | `!xmlMode` | Lowercase attribute names. |
+| `recognizeSelfClosing` | `boolean` | `xmlMode` | Recognize self-closing tags (e.g. `<br/>`). Always enabled in `xmlMode`. |
+| `recognizeCDATA` | `boolean` | `xmlMode` | Recognize CDATA sections as text. Always enabled in `xmlMode`. |
 
 ### Usage with streams
 
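Note on the events table in the hunk above: because `ontext` may fire several times for a single text node, a handler that wants whole text nodes has to buffer the pieces itself. The following is a minimal sketch of one way to do that, not code from this commit. The `createTextCollector` helper and its `texts` property are hypothetical names, and the events are triggered by hand here instead of by a real `Parser`, so the snippet runs without `htmlparser2` installed.

```javascript
// Hypothetical helper (not part of htmlparser2): a handler that implements
// only a subset of the documented events and joins split text chunks.
function createTextCollector() {
    const chunks = []; // pieces of the text node currently being read
    const texts = []; // completed text nodes

    // Join whatever has been buffered into one completed text node.
    const flush = () => {
        if (chunks.length > 0) {
            texts.push(chunks.join(""));
            chunks.length = 0;
        }
    };

    return {
        texts,
        // `ontext` may fire more than once per text node, so only buffer here.
        ontext(data) {
            chunks.push(data);
        },
        // A tag boundary or the end of input closes the current text node.
        onopentag() {
            flush();
        },
        onclosetag() {
            flush();
        },
        onend() {
            flush();
        },
    };
}

// Simulated event sequence for `<p>Hello</p>` with the text split in two.
// No parser is involved; the calls are made by hand for illustration.
const handler = createTextCollector();
handler.onopentag("p", {}, false);
handler.ontext("Hel");
handler.ontext("lo");
handler.onclosetag("p", false);
handler.onend();

console.log(handler.texts); // a single entry: "Hello"
```

In real use, the same object would be passed to `Parser` as its handler. Events the sketch does not implement are simply skipped, since all callbacks are optional.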
@@ -106,25 +135,100 @@ htmlStream.pipe(parserStream).on("finish", () => console.log("done"));
 
 ## Getting a DOM
 
-The `DomHandler` produces a DOM (document object model) that can be manipulated using the [`DomUtils`](https://github.com/fb55/DomUtils) helper.
+The `parseDocument` helper parses a string and returns a DOM tree (a [`Document`](https://github.com/fb55/domhandler) node).
 
 ```js
 import * as htmlparser2 from "htmlparser2";
 
-const dom = htmlparser2.parseDocument(htmlString);
+const dom = htmlparser2.parseDocument(
+    `<ul id="fruits">
+        <li class="apple">Apple</li>
+        <li class="orange">Orange</li>
+    </ul>`,
+);
+```
+
+`parseDocument` accepts an optional second argument with both parser and [DOM handler options](https://github.com/fb55/domhandler):
+
+```js
+const dom = htmlparser2.parseDocument(data, {
+    // Parser options
+    xmlMode: true,
+
+    // domhandler options
+    withStartIndices: true, // Add `startIndex` to each node
+    withEndIndices: true, // Add `endIndex` to each node
+});
+```
+
+### Searching the DOM
+
+The [`DomUtils`](https://github.com/fb55/domutils) module (re-exported on the main `htmlparser2` export) provides helpers for finding nodes:
+
+```js
+import * as htmlparser2 from "htmlparser2";
+
+const dom = htmlparser2.parseDocument(`<div><p id="greeting">Hello</p></div>`);
+
+// Find elements by ID, tag name, or class
+const greeting = htmlparser2.DomUtils.getElementById("greeting", dom);
+const paragraphs = htmlparser2.DomUtils.getElementsByTagName("p", dom);
+
+// Find elements with custom test functions
+const all = htmlparser2.DomUtils.findAll(
+    (el) => el.attribs?.class === "active",
+    dom,
+);
+
+// Get text content
+htmlparser2.DomUtils.textContent(greeting); // "Hello"
+```
+
+For CSS selector queries, use [`css-select`](https://github.com/fb55/css-select):
+
+```js
+import { selectAll, selectOne } from "css-select";
+
+const results = selectAll("ul#fruits > li", dom);
+const first = selectOne("li.apple", dom);
 ```
 
-The `DomHandler`, while still bundled with this module, was moved to its [own module](https://github.com/fb55/domhandler).
-Have a look at that for further information.
+Or, if you'd prefer a jQuery-like API, use [`cheerio`](https://github.com/cheeriojs/cheerio).
+
+### Modifying and serializing the DOM
 
-## Parsing Feeds
+Use `DomUtils` to modify the tree, and [`dom-serializer`](https://github.com/cheeriojs/dom-serializer) (also available as `DomUtils.getOuterHTML`) to serialize it back to HTML:
+
+```js
+import * as htmlparser2 from "htmlparser2";
+
+const dom = htmlparser2.parseDocument(
+    `<ul><li>Apple</li><li>Orange</li></ul>`,
+);
+
+// Remove the first <li>
+const items = htmlparser2.DomUtils.getElementsByTagName("li", dom);
+htmlparser2.DomUtils.removeElement(items[0]);
+
+// Serialize back to HTML
+const html = htmlparser2.DomUtils.getOuterHTML(dom);
+// "<ul><li>Orange</li></ul>"
+```
+
+Other manipulation helpers include `appendChild`, `prependChild`, `append`, `prepend`, and `replaceElement` -- see the [`domutils` docs](https://github.com/fb55/domutils) for the full API.
+
+## Parsing feeds
 
 `htmlparser2` makes it easy to parse RSS, RDF and Atom feeds, by providing a `parseFeed` method:
 
 ```javascript
-const feed = htmlparser2.parseFeed(content, options);
+const feed = htmlparser2.parseFeed(content);
 ```
 
+This returns an object with `type`, `title`, `link`, `description`, `updated`, `author`, and `items` (an array of feed entries), or `null` if the document isn't a recognized feed format.
+
+The `xmlMode` option is enabled by default for `parseFeed`. If you pass custom options, make sure to include `xmlMode: true`.
+
 ## Performance
 
 After having some artificial benchmarks for some time, **@AndreasMadsen** published his [`htmlparser-benchmark`](https://github.com/AndreasMadsen/htmlparser-benchmark), which benchmarks HTML parsers based on real-world websites.
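Note on the `parseFeed` paragraph added in the hunk above: since the function may return `null`, callers need to check the result before reading fields from it. The following is a minimal sketch of such a guard, not code from this commit. The `summarizeFeed` helper is a hypothetical name, the per-entry `title` field is an assumption (the README text only says `items` is an array of feed entries), and a hand-written stand-in object replaces a real `parseFeed` result so the snippet runs without `htmlparser2` installed.

```javascript
// Hypothetical helper (not part of htmlparser2): turns a `parseFeed` result
// into a one-line summary, handling the `null` "not a feed" case.
function summarizeFeed(feed) {
    if (feed === null) {
        return "Not a recognized feed";
    }

    // `items` is documented as an array of feed entries.
    const items = feed.items ?? [];

    // Assumption: each entry may carry a `title`; fall back if it is missing.
    const titles = items.map((item) => item.title ?? "(untitled)");

    return `${feed.type}: ${feed.title ?? "(no title)"} [${titles.join(", ")}]`;
}

// Stand-in for a parsed feed, written by hand for illustration.
const sampleFeed = {
    type: "rss",
    title: "Example",
    items: [{ title: "First" }, {}],
};

console.log(summarizeFeed(sampleFeed)); // "rss: Example [First, (untitled)]"
console.log(summarizeFeed(null)); // "Not a recognized feed"
```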
@@ -147,20 +251,7 @@ saxes : 45.7921 ms/file ± 128.691
 html5 : 120.844 ms/file ± 153.944
 ```
 
-## How does this module differ from [node-htmlparser](https://github.com/tautologistics/node-htmlparser)?
-
-In 2011, this module started as a fork of the `htmlparser` module.
-`htmlparser2` was rewritten multiple times and, while it maintains an API that's mostly compatible with `htmlparser`, the projects don't share any code anymore.
-
-The parser now provides a callback interface inspired by [sax.js](https://github.com/isaacs/sax-js) (originally targeted at [readabilitySAX](https://github.com/fb55/readabilitysax)).
-As a result, old handlers won't work anymore.
-
-The `DefaultHandler` was renamed to clarify its purpose (to `DomHandler`). The old name is still available when requiring `htmlparser2` and your code should work as expected.
-
-The `RssHandler` was replaced with a `getFeed` function that takes a `DomHandler` DOM and returns a feed object. There is a `parseFeed` helper function that can be used to parse a feed from a string.
-
 ## Security contact information
 
 To report a security vulnerability, please use the [Tidelift security contact](https://tidelift.com/security).
 Tidelift will coordinate the fix and disclosure.
-