Jsoup
A powerful BoxLang module that provides HTML parsing and cleaning capabilities using Jsoup. This module enables developers to safely parse, manipulate, and clean HTML content with ease.
Features
- HTML Parsing: Parse HTML strings into manipulable document objects 
- HTML Cleaning: Sanitize HTML content with customizable safety levels 
- XSS Protection: Built-in protection against Cross-Site Scripting attacks 
- Flexible Safelists: Multiple predefined safety levels from strict to relaxed 
- CSS Selectors: Extract elements using familiar CSS selector syntax 
- Relative Link Handling: Control how relative links are processed during cleaning 
Installation
This module can be installed using CommandBox or the BoxLang Installer Scripts:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
Available BIFs (Built-in Functions)
htmlParse( html )
Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.
Parameters:
- html (string, required): The HTML string to parse
Returns: A BoxDocument object with methods for HTML manipulation
Example:
// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );
// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"
// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'
// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags
// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );
// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON
// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation
Standard Jsoup Document Methods:
- title() – Get the contents of the <title> tag
- select(selector) – Find elements using CSS selectors
- text() – Get the combined text of the entire document
- outerHtml() – Get the HTML of the entire document
- body() – Get the <body> element
- head() – Get the <head> element
- getElementById(id) – Find an element by its ID attribute
- getElementsByTag(tagName) – Get all elements with the given tag
- getElementsByClass(className) – Get all elements with the given class
- getElementsByAttribute(attrName) – Get elements that have the specified attribute
- html() – Get the inner HTML of the document
- createElement(tagName) – Create a new element with the given tag
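The lookup and construction methods listed above can be combined as in the following minimal sketch. It assumes BoxDocument exposes the inherited Jsoup methods through ordinary BoxLang Java interop; the sample markup and variable names are illustrative only.

```boxlang
// Hypothetical sample markup, used only for this sketch
htmlContent = "<html><head><title>Catalog</title></head><body><div id='main'><p class='note'>First</p><p class='note' data-id='42'>Second</p></div></body></html>";
doc = htmlParse( htmlContent );

// Look up a single element by its id attribute
mainDiv = doc.getElementById( "main" );

// Collect elements by tag name, class name, or attribute presence
paragraphs = doc.getElementsByTag( "p" );
notes = doc.getElementsByClass( "note" );
tagged = doc.getElementsByAttribute( "data-id" );

println( paragraphs.size() );                  // number of <p> elements found
println( notes.text() );                       // combined text of the matched elements
println( tagged.first().attr( "data-id" ) );   // attribute value of the first match

// Build a new element and attach it to the existing tree
footer = doc.createElement( "footer" );
footer.text( "Generated for this example" );
mainDiv.appendChild( footer );

// Inner HTML of the div, now including the appended <footer>
println( mainDiv.html() );
```

The collection-returning methods hand back Jsoup Elements objects, so calls such as size(), first(), and text() are Jsoup methods rather than BoxLang array functions.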
Enhanced BoxDocument Methods:
- toJSON() – Convert the document to a compact JSON representation
- toJSON(prettyPrint) – Convert to JSON with optional pretty-printing
- toXML() – Convert the document to a compact XML representation
- toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation
Enhanced Methods Examples:
// Sample HTML for examples
htmlContent = `
<div class="product" id="item-1">
    <h2>Product Name</h2>
    <p class="description">Product description here</p>
    <span class="price">$19.99</span>
</div>
`;
doc = htmlParse( htmlContent );
// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}
// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation
// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>
// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
//     <head></head>
//     <body>
//         <div class="product" id="item-1">
//             <h2>Product Name</h2>
//             <p class="description">Product description here</p>
//             <span class="price">$19.99</span>
//         </div>
//     </body>
// </html>
htmlClean( html, safeList, preserveRelativeLinks, baseUri )
Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.
Parameters:
- html (string, required): The HTML string to clean
- safeList (string, optional): The safety level to apply (default: "relaxed")
- preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)
- baseUri (string, optional): Base URI for resolving relative links (default: "")
Returns: A cleaned HTML string
Safelist Options:
- none: Maximum cleaning, removes all tags and returns plain text only
- simpletext: Allows very limited inline formatting tags like <b>, <i>, <br>
- basic: Basic cleaning, removes all tags except for a few safe ones
- basicwithimages: Basic cleaning but allows images
- relaxed: More lenient cleaning, allows more tags (default)
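One rough way to compare these levels is to run the same input through each of them. The sketch below uses made-up sample markup and the named-argument call style shown elsewhere on this page. The exact output depends on the safelist definitions bundled with the module, so no results are shown.

```boxlang
// Hypothetical input mixing formatting, a link with an inline handler, and a script
dirtyHtml = "<p><b>Bold</b> <a href='https://example.com' onclick='steal()'>link</a></p><script>alert('XSS')</script>";

// The five safelist names documented above, from strictest to most lenient
levels = [ "none", "simpletext", "basic", "basicwithimages", "relaxed" ];

for ( level in levels ) {
    cleaned = htmlClean( html: dirtyHtml, safeList: level );
    println( level & ": " & cleaned );
}
```

Printing the results side by side makes it easier to pick the strictest level that still keeps the markup an application needs.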
Examples:
// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"
// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basic"
);
// Result: "<p>Hello World!</p>"
// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"
// Plain text only with "none" safelist
cleanHtml = htmlClean(
    html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
    safeList: "none"
);
// Result: "Bold text and italic text"
// Preserve relative links
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"
// Convert relative links to absolute
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    baseUri: "https://example.com/",
    preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"
Use Cases
Content Management Systems
Clean user-generated content before storing or displaying:
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
Web Scraping
Parse and extract data from HTML content:
scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
Data Transformation
Convert HTML to different formats using BoxDocument's enhanced methods:
// Parse HTML content
htmlContent = `
<article>
    <header>
        <h1>Article Title</h1>
        <meta name="author" content="John Doe">
    </header>
    <section class="content">
        <p>First paragraph of the article.</p>
        <p>Second paragraph with <em>emphasis</em>.</p>
    </section>
</article>
`;
doc = htmlParse( htmlContent );
// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing
// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations
// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search
Email Template Processing
Clean HTML emails before sending:
emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely
GitHub Repository and Reporting Issues
Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.
