Jsoup
A powerful BoxLang module that provides HTML parsing and cleaning capabilities using Jsoup. This module enables developers to safely parse, manipulate, and clean HTML content with ease.
Features
HTML Parsing: Parse HTML strings into manipulable document objects
HTML Cleaning: Sanitize HTML content with customizable safety levels
XSS Protection: Built-in protection against Cross-Site Scripting attacks
Flexible Safelists: Multiple predefined safety levels from strict to relaxed
CSS Selectors: Extract elements using familiar CSS selector syntax
Relative Link Handling: Control how relative links are processed during cleaning
Installation
This module can be installed using CommandBox or the BoxLang Installer Scripts:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
Available BIFs (Built-in Functions)
htmlParse( html )
Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.
Parameters:
html (string, required): The HTML string to parse
Returns: A BoxDocument object with methods for HTML manipulation
Example:
// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );
// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"
// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'
// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags
// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );
// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON
// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation
Standard Jsoup Document Methods:
title() – Get the contents of the <title> tag
select(selector) – Find elements using CSS selectors
text() – Get the combined text of the entire document
outerHtml() – Get the HTML of the entire document
body() – Get the <body> element
head() – Get the <head> element
getElementById(id) – Find an element by its ID attribute
getElementsByTag(tagName) – Get all elements with the given tag
getElementsByClass(className) – Get all elements with the given class
getElementsByAttribute(attrName) – Get elements that have the specified attribute
html() – Get the inner HTML of the document body
createElement(tagName) – Create a new element with the given tag
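The lookup methods in the list above have no example elsewhere on this page, so here is a minimal sketch of how they might be used. The sample markup and variable names are hypothetical. It assumes the standard Jsoup methods are callable on the parsed document in the same way as the select() calls shown elsewhere on this page.

```boxlang
// Hypothetical sample markup, used only for illustration
htmlContent = "<div id='main'><p class='note' data-id='1'>First</p><p>Second</p></div>";
doc = htmlParse( htmlContent );

// Look up a single element by its id attribute
mainText = doc.getElementById( "main" ).text(); // "First Second"

// Collect elements by tag name, class name, or attribute presence
paragraphText = doc.getElementsByTag( "p" ).text();
noteText = doc.getElementsByClass( "note" ).text(); // "First"
taggedText = doc.getElementsByAttribute( "data-id" ).text();
```

Each getElementsBy... call returns a collection of elements, so text() here returns the combined text of every match, just as it does for the result of select().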
Enhanced BoxDocument Methods:
toJSON() – Convert the document to a compact JSON representation
toJSON(prettyPrint) – Convert to JSON with optional pretty-printing
toXML() – Convert the document to a compact XML representation
toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation
Enhanced Methods Examples:
// Sample HTML for examples
htmlContent = `
<div class="product" id="item-1">
<h2>Product Name</h2>
<p class="description">Product description here</p>
<span class="price">$19.99</span>
</div>
`;
doc = htmlParse( htmlContent );
// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}
// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation
// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>
// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
//     <head></head>
//     <body>
//         <div class="product" id="item-1">
//             <h2>Product Name</h2>
//             <p class="description">Product description here</p>
//             <span class="price">$19.99</span>
//         </div>
//     </body>
// </html>
htmlClean( html, safeList, preserveRelativeLinks, baseUri )
Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.
Parameters:
html (string, required): The HTML string to clean
safeList (string, optional): The safety level to apply (default: "relaxed")
preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)
baseUri (string, optional): Base URI for resolving relative links (default: "")
Returns: A cleaned HTML string
Safelist Options:
none: Maximum cleaning, removes all tags and returns plain text only
simpletext: Allows very limited inline formatting tags like <b>, <i>, <br>
basic: Basic cleaning, removes all tags except for a few safe ones
basicwithimages: Basic cleaning but allows images
relaxed: More lenient cleaning, allows more tags (default)
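To see how the levels differ, one option is to run the same input through each safelist and compare the output. The following is a sketch only: the sample markup is hypothetical, and the exact output (attribute quoting, added attributes, whitespace) depends on the underlying Jsoup version, so no results are shown.

```boxlang
// Hypothetical input that mixes inline formatting, a link, an image, and a script
sample = "<p>Hello <b>bold</b> <a href='https://example.com/'>link</a> <img src='https://example.com/a.png'><script>alert('XSS')</script></p>";

// Safelist names as listed above, from strictest to most lenient
levels = [ "none", "simpletext", "basic", "basicwithimages", "relaxed" ];

// Clean the same input at every level and print the results for comparison
for ( level in levels ) {
    println( level & ": " & htmlClean( html: sample, safeList: level ) );
}
```

Comparing the printed lines side by side is a quick way to choose the strictest level that still keeps the markup an application needs.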
Examples:
// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"
// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basic"
);
// Result: "<p>Hello World!</p>"
// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"
// Plain text only with "none" safelist
cleanHtml = htmlClean(
    html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
    safeList: "none"
);
// Result: "Bold text and italic text"
// Preserve relative links
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"
// Convert relative links to absolute
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    baseUri: "https://example.com/",
    preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"
Use Cases
Content Management Systems
Clean user-generated content before storing or displaying:
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
Web Scraping
Parse and extract data from HTML content:
scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
Data Transformation
Convert HTML to different formats using BoxDocument's enhanced methods:
// Parse HTML content
htmlContent = `
<article>
<header>
<h1>Article Title</h1>
<meta name="author" content="John Doe">
</header>
<section class="content">
<p>First paragraph of the article.</p>
<p>Second paragraph with <em>emphasis</em>.</p>
</section>
</article>
`;
doc = htmlParse( htmlContent );
// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing
// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations
// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search
Email Template Processing
Clean HTML emails before sending:
emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely
GitHub Repository and Reporting Issues
Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.
