Raw HTML is usable by AI, but contains huge amount of token noise, implodes costs and latency.
One of the approaches is to just strip all the HTML tags and leave only the text, which still leaves some useless information like style
and script
tags.
The other and actually really good approach is to use HTML -> Markdown converter. But we can still gain a bit of performance and reduce tokens as AI does not really care about proper markdown, spaces etc.
This library is experimental and in personal tests showed the same quality results as plain HTML or Markdown, but much less tokens and a bit faster.
There is also a native version of the library, which is faster and smaller in size but in beta stage. html-ai-ready-native
# yarn
yarn add @mrspartak/html-ai-ready
# npm
npm i @mrspartak/html-ai-ready
# pnpm
pnpm add @mrspartak/html-ai-ready
# bun
bun add @mrspartak/html-ai-ready
import { htmlToAiReady, PRESET_QUALITY } from "@mrspartak/html-ai-ready";
const html = "<p>Hello, world!</p>";
const aiReady = htmlToAiReady(html, PRESET_QUALITY);
console.log(aiReady);
The main point of this package is to be fast, give the smallest result in terms of token size and also still maintain the context to answer questions. It is compared to a couple of other methods I saw so far.
pnpm benchmark
When comparing the output size across all tested pages (average percentage of original HTML size):
Method | Average Size (% of original) |
---|---|
HTML_TO_AI_FAST | 24.69% |
HTML_TO_AI_QUALITY | 7.76% |
HTML_TO_AI_NATIVE | 12.19% |
NODE_HTML_MARKDOWN | 13.88% |
CHEERIO_QUALITY_PARSED | 19.31% |
Performance comparison across all pages combined:
Method | Operations/sec | Mean time (ms) | Comparison |
---|---|---|---|
htmlToAiReady NATIVE | 73.31 | 13.63 | Fastest |
htmlToAiReady FAST | 28.76 | 34.77 | 2.55x slower than NATIVE |
htmlToAiReady QUALITY | 15.36 | 65.09 | 4.83x slower than NATIVE |
cheerioParse | 7.31 | 136.76 | 9.93x slower than NATIVE |
node-html-markdown | 6.30 | 158.77 | 11.54x slower than NATIVE |
# don't forget to add OPENAI_API_KEY to .env file first
pnpm aiq
To test real-world effectiveness, we used 3 HTML pages as context for AI and asked deterministic questions. The results show accuracy rates, token usage, and AI response times:
Method | Accuracy | Avg Tokens | Avg Response Time |
---|---|---|---|
htmlToAiReadyTextQuality | 15/20 (75.00%) | 10,759 | 618.75ms |
cherioText | 15/20 (75.00%) | 12,931 | 5,377.55ms |
nodeHtmlMarkdownText | 14/20 (70.00%) | 27,389 | 2,099.20ms |
As shown in the benchmarks, the QUALITY preset not only maintains the same accuracy as Cheerio while using fewer tokens, but it also delivers responses significantly faster. The FAST preset offers the best performance while the QUALITY preset provides the smallest output size with excellent accuracy, giving you options depending on your priority.
To determing tags that I would like to strip, first of course I gathered tags that would not make any context for AI. Those are style, head, iframe etc. But stripping the tags is costly operation, so I wanted to actually know if stripping them makes any difference. So I gathered a list of ~800 random websites, crawled and parsed them. Here are some details:
Metric | Value |
---|---|
Minimum | 242 bytes |
Maximum | 11,647,892 bytes |
Average | 517,804 bytes |
Median | 346,929 bytes |
Metric | Value |
---|---|
Average | 3,202 ms |
Median | 2,504 ms |
Ordered by total size across all pages:
Element | Average Size (bytes) | % of Page | % of Body | Total Size (bytes) |
---|---|---|---|---|
body | 406,357 | 76.11% | - | 349,061,078 |
head | 110,629 | 23.55% | 73.40% | 95,030,079 |
links | 98,256 | 19.18% | 24.71% | 84,401,659 |
svgs | 83,426 | 12.76% | 15.38% | 71,662,569 |
nav | 103,673 | 12.17% | 15.00% | 89,055,316 |
script | 73,849 | 10.84% | 14.09% | 63,435,956 |
images | 31,274 | 6.18% | 8.17% | 26,864,083 |
footer | 17,350 | 4.21% | 5.81% | 14,903,361 |
style | 16,031 | 3.82% | 4.80% | 13,770,735 |
forms | 14,827 | 3.29% | 4.33% | 12,736,770 |
button | 12,395 | 2.54% | 3.47% | 10,647,352 |
comments | 7,163 | 1.67% | 2.28% | 6,152,769 |
aside | 6,539 | 1.04% | 1.19% | 5,617,229 |
noscript | 1,663 | 0.32% | 0.44% | 1,428,938 |
iframe | 902 | 0.27% | 0.40% | 775,052 |
video | 74 | 0.03% | 0.03% | 63,815 |
canvas | 4 | 0.00% | 0.00% | 3,149 |
This project wouldn't be possible without the valuable contributions and support from:
I welcome contributions from the community! Whether it's improving the documentation, adding new features, or reporting bugs, please feel free to make a pull request or open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.