Jsoup

A powerful BoxLang module that provides HTML parsing and cleaning capabilities using [Jsoup](https://jsoup.org/).

This module enables developers to safely parse, manipulate, and clean HTML content with ease.

Features

  • HTML Parsing: Parse HTML strings into manipulable document objects

  • HTML Cleaning: Sanitize HTML content with customizable safety levels

  • XSS Protection: Built-in protection against Cross-Site Scripting attacks

  • Flexible Safelists: Multiple predefined safety levels from strict to relaxed

  • CSS Selectors: Extract elements using familiar CSS selector syntax

  • Relative Link Handling: Control how relative links are processed during cleaning

Installation

This module can be installed using CommandBox or the BoxLang installer scripts:

# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup

Available BIFs (Built-in Functions)

htmlParse( html )

Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.

Parameters:

  • html (string, required): The HTML string to parse

Returns: A BoxDocument object with methods for HTML manipulation

Example:

// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );

// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"

// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'

// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags

// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );

// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON

// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation

Standard Jsoup Document Methods:

  • title() – Get the contents of the <title> tag

  • select(selector) – Find elements using CSS selectors

  • text() – Get the combined text of the entire document

  • outerHtml() – Get the HTML of the entire document

  • body() – Get the <body> element

  • head() – Get the <head> element

  • getElementById(id) – Find an element by its ID attribute

  • getElementsByTag(tagName) – Get all elements with the given tag

  • getElementsByClass(className) – Get all elements with the given class

  • getElementsByAttribute(attrName) – Get elements that have the specified attribute

  • html() – Get the inner HTML of the document (use body().html() for the contents of <body> only)

  • createElement(tagName) – Create a new element with the given tag
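
The lookup methods listed above can be combined as in the following minimal sketch. The markup and variable names are made up for illustration, and the sketch assumes the standard Jsoup methods behave the same when called on a BoxDocument:

```boxlang
// Illustrative markup only; not taken from the module's documentation
htmlContent = "<div id='main'><p class='note'>First</p><p>Second</p><a href='/about'>About</a></div>";
doc = htmlParse( htmlContent );

// Look up a single element by its id attribute
mainDiv = doc.getElementById( "main" );
mainTag = mainDiv.tagName(); // "div"

// Collect elements by tag name, class name, or attribute presence
paragraphs = doc.getElementsByTag( "p" );
notes = doc.getElementsByClass( "note" );
links = doc.getElementsByAttribute( "href" );

// Read text and attribute values from the matched elements
noteText = notes.text(); // "First"
firstHref = links.first().attr( "href" ); // "/about"
```

getElementById returns a single element, while the other lookups return a Jsoup Elements collection, which is why first() is called before attr() in the last line.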

Enhanced BoxDocument Methods:

  • toJSON() – Convert the document to a compact JSON representation

  • toJSON(prettyPrint) – Convert to JSON with optional pretty-printing

  • toXML() – Convert the document to a compact XML representation

  • toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation

Enhanced Methods Examples:

// Sample HTML for examples
htmlContent = `
<div class="product" id="item-1">
    <h2>Product Name</h2>
    <p class="description">Product description here</p>
    <span class="price">$19.99</span>
</div>
`;
doc = htmlParse( htmlContent );

// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}

// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation

// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>

// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
//     <head></head>
//     <body>
//         <div class="product" id="item-1">
//             <h2>Product Name</h2>
//             <p class="description">Product description here</p>
//             <span class="price">$19.99</span>
//         </div>
//     </body>
// </html>

htmlClean( html, safeList, preserveRelativeLinks, baseUri )

Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.

Parameters:

  • html (string, required): The HTML string to clean

  • safeList (string, optional): The safety level to apply (default: "relaxed")

  • preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)

  • baseUri (string, optional): Base URI for resolving relative links (default: "")

Returns: A cleaned HTML string

Safelist Options:

  • none: Maximum cleaning, removes all tags and returns plain text only

  • simpletext: Allows only simple inline formatting tags such as <b>, <i>, <em>, <strong>, and <u>

  • basic: Allows a core set of text tags such as <a>, <b>, <blockquote>, <p>, and <ul>, but no images

  • basicwithimages: Basic cleaning but allows images

  • relaxed: More lenient cleaning, allows more tags (default)

Examples:

// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"

// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basic"
);
// Result: "<p>Hello World!</p>"

// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"

// Plain text only with "none" safelist
cleanHtml = htmlClean(
    html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
    safeList: "none"
);
// Result: "Bold text and italic text"

// Preserve relative links
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"

// Convert relative links to absolute
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    baseUri: "https://example.com/",
    preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"

Use Cases

Content Management Systems

Clean user-generated content before storing or displaying:

userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
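
As a rough sketch of how the two BIFs can work together, the cleaned output can be parsed again to confirm that no script elements remain before it is stored. The variable names here are hypothetical, and the check is an extra safeguard added for illustration rather than something the module requires:

```boxlang
// Hypothetical input; any user-submitted markup would go here
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );

// Re-parse the cleaned markup and look for any surviving <script> elements
remainingScripts = htmlParse( safeContent ).select( "script" );

// isEmpty() is true when the selector matched nothing
isSafe = remainingScripts.isEmpty();

if ( isSafe ) {
    // No script elements survived cleaning; keep the sanitized markup
    storedContent = safeContent;
}
```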

Web Scraping

Parse and extract data from HTML content:

scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
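
When a page holds several matching blocks, each match can be processed in turn. The sketch below uses made-up product data and assumes a BoxLang for-in loop can iterate the Jsoup Elements collection returned by select():

```boxlang
// Illustrative markup with two product blocks
scrapedHtml = "<div class='product'><h2>Widget</h2><span class='price'>$19.99</span></div>"
    & "<div class='product'><h2>Gadget</h2><span class='price'>$24.50</span></div>";
doc = htmlParse( scrapedHtml );

// Build one struct per product by running selectors scoped to each block
products = [];
for ( productNode in doc.select( ".product" ) ) {
    products.append( {
        "name": productNode.select( "h2" ).text(),
        "price": productNode.select( ".price" ).text()
    } );
}
```

Scoping the inner select() calls to productNode keeps each name paired with its own price, which a document-wide select( ".price" ) would not guarantee.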

Data Transformation

Convert HTML to different formats using BoxDocument's enhanced methods:

// Parse HTML content
htmlContent = `
<article>
    <header>
        <h1>Article Title</h1>
        <meta name="author" content="John Doe">
    </header>
    <section class="content">
        <p>First paragraph of the article.</p>
        <p>Second paragraph with <em>emphasis</em>.</p>
    </section>
</article>
`;
doc = htmlParse( htmlContent );

// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing

// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations

// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search

Email Template Processing

Clean HTML emails before sending:

emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely

GitHub Repository and Reporting Issues

Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.
