Jsoup
A powerful BoxLang module that provides HTML parsing and cleaning capabilities using [Jsoup](https://jsoup.org/). This module enables developers to safely parse, manipulate, and clean HTML content with ease.
Features
HTML Parsing: Parse HTML strings into manipulable document objects
HTML Cleaning: Sanitize HTML content with customizable safety levels
XSS Protection: Built-in protection against Cross-Site Scripting attacks
Flexible Safelists: Multiple predefined safety levels from strict to relaxed
CSS Selectors: Extract elements using familiar CSS selector syntax
Relative Link Handling: Control how relative links are processed during cleaning
Installation
Install this module with either the BoxLang Installer Scripts or CommandBox:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
Available BIFs (Built-in Functions)
htmlParse( html )
Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.
Parameters:
html (string, required): The HTML string to parse
Returns: A BoxDocument object with methods for HTML manipulation
Example:
// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );
// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"
// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'
// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags
// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );
// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON
// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation
Standard Jsoup Document Methods:
title(): Get the contents of the <title> tag
select(selector): Find elements using CSS selectors
text(): Get the combined text of the entire document
outerHtml(): Get the HTML of the entire document
body(): Get the <body> element
head(): Get the <head> element
getElementById(id): Find an element by its ID attribute
getElementsByTag(tagName): Get all elements with the given tag
getElementsByClass(className): Get all elements with the given class
getElementsByAttribute(attrName): Get elements that have the specified attribute
html(): Get the inner HTML of the document body
createElement(tagName): Create a new element with the given tag
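As a rough sketch of how a few of the lookup methods above might be combined, the snippet below parses a small fragment and queries it by ID, tag, class, and attribute. The markup, the ID and class names, and the variable names are invented for illustration. It assumes the standard Jsoup calls (getElementById, getElementsByTag, getElementsByClass, getElementsByAttribute, plus size() and attr() on the returned collection) are reachable from BoxLang in the same way as the methods listed above.

```boxlang
// Hypothetical markup for this sketch; the id and class names are made up
sampleHtml = "<div id='main'><p class='note'>First</p><p class='note'>Second</p><a href='/docs'>Docs</a></div>";
doc = htmlParse( sampleHtml );

// Look up a single element by its id attribute and read its combined text
container = doc.getElementById( "main" );
containerText = container.text();

// Collect elements by tag name, class name, or attribute presence
paragraphs = doc.getElementsByTag( "p" );
notes = doc.getElementsByClass( "note" );
links = doc.getElementsByAttribute( "href" );

// The returned collection is a Jsoup Elements object: size() counts the matches,
// and attr() reads the attribute from the first match that has it
noteCount = notes.size();
linkTarget = links.attr( "href" );
```

The ID lookup returns a single element, while the tag, class, and attribute lookups each return a collection, so the two kinds of result are handled differently.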
Enhanced BoxDocument Methods:
toJSON(): Convert the document to a compact JSON representation
toJSON(prettyPrint): Convert to JSON with optional pretty-printing
toXML(): Convert the document to a compact XML representation
toXML(prettyPrint, indentFactor): Convert to XML with optional pretty-printing and custom indentation
Enhanced Methods Examples:
// Sample HTML for examples
htmlContent = `
    <div class="product" id="item-1">
        <h2>Product Name</h2>
        <p class="description">Product description here</p>
        <span class="price">$19.99</span>
    </div>
`;
doc = htmlParse( htmlContent );
// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}
// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation
// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>
// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
//     <head></head>
//     <body>
//         <div class="product" id="item-1">
//             <h2>Product Name</h2>
//             <p class="description">Product description here</p>
//             <span class="price">$19.99</span>
//         </div>
//     </body>
// </html>
htmlClean( html, safeList, preserveRelativeLinks, baseUri )
Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.
Parameters:
html (string, required): The HTML string to clean
safeList (string, optional): The safety level to apply (default: "relaxed")
preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)
baseUri (string, optional): Base URI for resolving relative links (default: "")
Returns: A cleaned HTML string
Safelist Options:
none: Maximum cleaning, removes all tags and returns plain text only
simpletext: Allows very limited inline formatting tags like <b>, <i>, <br>
basic: Basic cleaning, removes all tags except for a few safe ones
basicwithimages: Basic cleaning but allows images
relaxed: More lenient cleaning, allows more tags (default)
Examples:
// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"
// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basic"
);
// Result: "<p>Hello World!</p>"
// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"
// Plain text only with "none" safelist
cleanHtml = htmlClean(
    html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
    safeList: "none"
);
// Result: "Bold text and italic text"
// Preserve relative links
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"
// Convert relative links to absolute
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    baseUri: "https://example.com/",
    preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"
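The "simpletext" safelist is not shown in the examples above. The following is a minimal sketch, assuming that "simpletext" corresponds to Jsoup's Safelist.simpleText(). That is an assumption about the module's internals and is not confirmed here. The input markup is made up. Under that assumption, the inline formatting tag should be kept, while the paragraph and link tags are stripped and their text is preserved.

```boxlang
// Hypothetical input mixing inline formatting with block and link markup
mixedHtml = "<p><b>Bold</b> and <a href='https://example.com/'>a link</a></p>";

// Assumption: "simpletext" keeps only simple inline formatting tags,
// so <p> and <a> are dropped while their text content is retained
cleanHtml = htmlClean(
    html: mixedHtml,
    safeList: "simpletext"
);
```

This level may suit short user-supplied text such as comments or display names, where emphasis is acceptable but links and layout tags are not.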
Use Cases
Content Management Systems
Clean user-generated content before storing or displaying:
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
Web Scraping
Parse and extract data from HTML content:
scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
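When a selector matches several elements, Jsoup's Elements collection can return one value per match instead of a single combined string. The following is a small sketch, assuming the standard Jsoup methods eachText() and eachAttr() are callable on the result of select() in the same way text() is above. The markup, class name, and variable names are illustrative only.

```boxlang
// Hypothetical listing markup with two product links
listingHtml = "<ul><li><a class='product' href='/p/1'>Widget</a></li><li><a class='product' href='/p/2'>Gadget</a></li></ul>";
doc = htmlParse( listingHtml );

// Select every anchor carrying the (made-up) "product" class
productLinks = doc.select( "a.product" );

// eachText() yields the text of each match; eachAttr() yields the named attribute of each match
names = productLinks.eachText();
urls = productLinks.eachAttr( "href" );
```

Both lists follow document order, so the entry at a given position in names lines up with the entry at the same position in urls.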
Data Transformation
Convert HTML to different formats using BoxDocument's enhanced methods:
// Parse HTML content
htmlContent = `
    <article>
        <header>
            <h1>Article Title</h1>
            <meta name="author" content="John Doe">
        </header>
        <section class="content">
            <p>First paragraph of the article.</p>
            <p>Second paragraph with <em>emphasis</em>.</p>
        </section>
    </article>
`;
doc = htmlParse( htmlContent );
// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing
// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations
// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search
Email Template Processing
Clean HTML emails before sending:
emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely
GitHub Repository and Reporting Issues
Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.