Jsoup
A powerful BoxLang module that provides HTML parsing and cleaning capabilities using Jsoup. This module enables developers to safely parse, manipulate, and clean HTML content with ease.
Features
HTML Parsing: Parse HTML strings into manipulable document objects
HTML Cleaning: Sanitize HTML content with customizable safety levels
XSS Protection: Built-in protection against Cross-Site Scripting attacks
Flexible Safelists: Multiple predefined safety levels from strict to relaxed
CSS Selectors: Extract elements using familiar CSS selector syntax
Relative Link Handling: Control how relative links are processed during cleaning
Installation
This module can be installed using CommandBox or the BoxLang Installer Scripts:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
Available BIFs (Built-in Functions)
htmlParse( html )
Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.
Parameters:
html (string, required): The HTML string to parse
Returns: A BoxDocument object with methods for HTML manipulation
Example:
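A minimal sketch of parsing a document and reading from it, assuming only the htmlParse() BIF and the Jsoup Document methods listed in this section. The sample markup and variable names are illustrative, not taken from the module's own documentation.

```boxlang
// Illustrative markup; any HTML string works here
html = '<html><head><title>My Page</title></head><body><h1 class="headline">Hello</h1><p>First paragraph.</p></body></html>';

// Parse the string into a BoxDocument
doc = htmlParse( html );

// Read the contents of the <title> tag
println( doc.title() );

// Select elements with a CSS selector and read their combined text
println( doc.select( "h1.headline" ).text() );

// Read the combined text of the entire document
println( doc.text() );
```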
Standard Jsoup Document Methods:
title() – Get the contents of the <title> tag
select(selector) – Find elements using CSS selectors
text() – Get the combined text of the entire document
outerHtml() – Get the HTML of the entire document
body() – Get the <body> element
head() – Get the <head> element
getElementById(id) – Find an element by its ID attribute
getElementsByTag(tagName) – Get all elements with the given tag
getElementsByClass(className) – Get all elements with the given class
getElementsByAttribute(attrName) – Get elements that have the specified attribute
html() – Get the inner HTML of the document body
createElement(tagName) – Create a new element with the given tag
Enhanced BoxDocument Methods:
toJSON() – Convert the document to a compact JSON representation
toJSON(prettyPrint) – Convert to JSON with optional pretty-printing
toXML() – Convert the document to a compact XML representation
toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation
Enhanced Methods Examples:
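The four signatures above might be exercised as in the sketch below. The sample markup is illustrative, and the indentFactor value of 4 is an assumption about a reasonable indent width rather than a documented default.

```boxlang
doc = htmlParse( '<html><head><title>Report</title></head><body><p>Hello</p></body></html>' );

// Compact JSON (no arguments)
compactJson = doc.toJSON();

// Pretty-printed JSON
prettyJson = doc.toJSON( true );

// Compact XML (no arguments)
compactXml = doc.toXML();

// Pretty-printed XML; 4 is an illustrative indentFactor
prettyXml = doc.toXML( true, 4 );

println( prettyJson );
println( prettyXml );
```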
htmlClean( html, safeList, preserveRelativeLinks, baseUri )
Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.
Parameters:
html (string, required): The HTML string to clean
safeList (string, optional): The safety level to apply (default: "relaxed")
preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)
baseUri (string, optional): Base URI for resolving relative links (default: "")
Returns: A cleaned HTML string
Safelist Options:
none: Maximum cleaning, removes all tags and returns plain text only
simpletext: Allows very limited inline formatting tags like <b>, <i>, <br>
basic: Basic cleaning, removes all tags except for a few safe ones
basicwithimages: Basic cleaning but allows images
relaxed: More lenient cleaning, allows more tags (default)
Examples:
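A minimal sketch of calling htmlClean() with different safelists, passing arguments positionally in the order given in the signature above. The input markup and the https://example.com base URI are illustrative assumptions. The exact output formatting depends on Jsoup, so none is shown.

```boxlang
// Illustrative untrusted input: a script tag and an inline event handler
dirty = '<p>Hello <b>world</b></p><script>alert( "xss" )</script><a href="/about" onclick="steal()">About</a>';

// Default safelist ("relaxed")
println( htmlClean( dirty ) );

// Stricter safelist
println( htmlClean( dirty, "basic" ) );

// Strip all tags and keep plain text only
println( htmlClean( dirty, "none" ) );

// Preserve relative links, with a base URI for resolving them
println( htmlClean( dirty, "basic", true, "https://example.com" ) );
```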
Use Cases
Content Management Systems
Clean user-generated content before storing or displaying:
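One way this could look is sketched below. The sanitizeComment() helper is hypothetical and not part of the module, and the choice of the "basic" safelist is an assumption about what a comment field should allow.

```boxlang
// Hypothetical helper: sanitize a user comment before it is stored or rendered
function sanitizeComment( required string rawComment ) {
    // "basic" is an illustrative choice; pick the safelist that fits your content
    return htmlClean( arguments.rawComment, "basic" );
}

// Illustrative untrusted input
userInput = '<p>Great post!</p><img src="x" onerror="alert(1)">';

safeComment = sanitizeComment( userInput );
println( safeComment );
```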
Web Scraping
Parse and extract data from HTML content:
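The sketch below collects link text and URLs from a fragment. It assumes that a BoxLang for-in loop can iterate the collection returned by select(), and it uses Jsoup's Element text() and attr() methods. The markup and variable names are illustrative.

```boxlang
// Illustrative markup standing in for a fetched page
html = '<ul><li><a href="https://example.com/a">First</a></li><li><a href="https://example.com/b">Second</a></li></ul>';
doc = htmlParse( html );

// Collect the text and href of every anchor that has an href attribute
links = [];
for ( link in doc.select( "a[href]" ) ) {
    links.append( { "text" : link.text(), "url" : link.attr( "href" ) } );
}

println( links );
```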
Data Transformation
Convert HTML to different formats using BoxDocument's enhanced methods:
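A short sketch, assuming only the toJSON() and toXML() methods described earlier. The markup and the suggested destinations for each payload are illustrative.

```boxlang
doc = htmlParse( '<html><body><h1>Quarterly Report</h1><p>Revenue grew.</p></body></html>' );

// Pretty-printed JSON, for example for an API response
jsonPayload = doc.toJSON( true );

// Compact XML, for example for a feed or downstream system
xmlPayload = doc.toXML();

println( jsonPayload );
println( xmlPayload );
```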
Email Template Processing
Clean HTML emails before sending:
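A minimal sketch of cleaning an email body while keeping its relative links. The template markup and the https://example.com base URI are assumptions for illustration, and the sending step is left out because it is outside this module.

```boxlang
// Illustrative email template containing a script tag and a relative link
emailBody = '<h1>Welcome</h1><p>Thanks for signing up.</p><a href="/welcome">Get started</a><script>track()</script>';

// Use the "relaxed" safelist, preserve relative links, and supply a base URI
cleanBody = htmlClean( emailBody, "relaxed", true, "https://example.com" );

println( cleanBody );
```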
GitHub Repository and Reporting Issues
Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.
