Jsoup

A powerful BoxLang module that provides HTML parsing and cleaning capabilities using Jsoup. This module enables developers to safely parse, manipulate, and clean HTML content with ease.

Features

  • HTML Parsing: Parse HTML strings into manipulable document objects

  • HTML Cleaning: Sanitize HTML content with customizable safety levels

  • XSS Protection: Built-in protection against Cross-Site Scripting attacks

  • Flexible Safelists: Multiple predefined safety levels from strict to relaxed

  • CSS Selectors: Extract elements using familiar CSS selector syntax

  • Relative Link Handling: Control how relative links are processed during cleaning

Installation

This module can be installed with either CommandBox or the BoxLang installer scripts:

# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup

Available BIFs (Built-in Functions)

htmlParse( html )

Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.

Parameters:

  • html (string, required): The HTML string to parse

Returns: A BoxDocument object with methods for HTML manipulation

Example:
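The snippet that belongs here is missing, so what follows is a minimal sketch of a typical call rather than the module's official example. It assumes BoxLang script syntax and the `println()` BIF, and the sample markup is made up for illustration.

```boxlang
// Sample markup, invented for this sketch
html = '<html><head><title>My Page</title></head><body><h1>Hello World</h1><p class="intro">Welcome.</p></body></html>';

// htmlParse() returns a BoxDocument
doc = htmlParse( html );

println( doc.title() );                     // My Page
println( doc.select( "h1" ).text() );       // Hello World
println( doc.select( "p.intro" ).text() );  // Welcome.
```

Single-quoted string literals are used so the HTML attributes can keep their double quotes.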

Standard Jsoup Document Methods:

  • title() – Get the contents of the <title> tag

  • select(selector) – Find elements using CSS selectors

  • text() – Get the combined text of the entire document

  • outerHtml() – Get the HTML of the entire document

  • body() – Get the <body> element

  • head() – Get the <head> element

  • getElementById(id) – Find an element by its ID attribute

  • getElementsByTag(tagName) – Get all elements with the given tag

  • getElementsByClass(className) – Get all elements with the given class

  • getElementsByAttribute(attrName) – Get elements that have the specified attribute

  • html() – Get the document's inner HTML; call body().html() to get only the body content

  • createElement(tagName) – Create a new element with the given tag
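As a rough illustration of the lookup and creation methods listed above, the sketch below finds elements by ID, class, and tag, then appends a newly created element. The markup and variable names are hypothetical. `tagName()`, `size()`, `first()`, and `appendChild()` are standard Jsoup methods assumed to be reachable through BoxDocument.

```boxlang
doc = htmlParse( '<html><body><ul id="menu"><li class="item">Home</li><li class="item">Docs</li></ul></body></html>' );

menu  = doc.getElementById( "menu" );
items = doc.getElementsByClass( "item" );

println( menu.tagName() );                               // ul
println( items.size() );                                 // 2
println( doc.getElementsByTag( "li" ).first().text() );  // Home

// Build a new element and attach it to the body
footer = doc.createElement( "footer" ).text( "Generated by BoxLang" );
doc.body().appendChild( footer );

println( doc.body().html() );
```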

Enhanced BoxDocument Methods:

  • toJSON() – Convert the document to a compact JSON representation

  • toJSON(prettyPrint) – Convert to JSON with optional pretty-printing

  • toXML() – Convert the document to a compact XML representation

  • toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation

Enhanced Methods Examples:
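The original examples are not shown here. This sketch calls the four documented signatures, assuming that `prettyPrint` is a boolean and `indentFactor` is a number of spaces. The exact shape of the generated JSON and XML is not documented in this section, so no output is shown.

```boxlang
doc = htmlParse( "<html><head><title>Report</title></head><body><p>Hello</p></body></html>" );

// Compact representations
compactJson = doc.toJSON();
compactXml  = doc.toXML();

// Assumed: true enables pretty-printing
prettyJson = doc.toJSON( true );

// Assumed: pretty-print with an indent of 4
prettyXml = doc.toXML( true, 4 );

println( prettyJson );
println( prettyXml );
```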

htmlClean( html, safeList, preserveRelativeLinks, baseUri )

Cleans and sanitizes an HTML string to prevent XSS attacks and make it safe to render.

Parameters:

  • html (string, required): The HTML string to clean

  • safeList (string, optional): The safety level to apply (default: "relaxed")

  • preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)

  • baseUri (string, optional): Base URI for resolving relative links (default: "")

Returns: A cleaned HTML string

Safelist Options:

  • none: Maximum cleaning, removes all tags and returns plain text only

  • simpletext: Allows only simple inline formatting tags such as <b>, <i>, <em>, <strong>, and <u>

  • basic: Basic cleaning, removes all tags except for a few safe ones

  • basicwithimages: Basic cleaning but allows images

  • relaxed: More lenient cleaning, allows more tags (default)

Examples:
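The original examples are missing, so the calls below are a hedged sketch. The BIF name `htmlClean` and the positional argument order are assumptions based on the parameter list above, and the input string is invented. Exact output depends on the safelist and on Jsoup's serializer, so none is shown.

```boxlang
dirty = '<p>Hello <b>world</b></p><script>alert( "xss" )</script><a href="/docs" onclick="steal()">Docs</a>';

// Default safelist ("relaxed"): the script element and the onclick attribute should be dropped
println( htmlClean( dirty ) );

// Stricter levels
println( htmlClean( dirty, "basic" ) );
println( htmlClean( dirty, "none" ) );

// Keep relative links as written; a base URI is still passed so Jsoup can validate the link
println( htmlClean( dirty, "basic", true, "https://example.com" ) );

// Do not preserve relative links; the base URI is used to resolve them
println( htmlClean( dirty, "basic", false, "https://example.com" ) );
```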

Use Cases

Content Management Systems

Clean user-generated content before storing or displaying:
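A minimal sketch of this use case, assuming the cleaning BIF is named `htmlClean`. `sanitizeComment` is a hypothetical helper written for this illustration and is not part of the module.

```boxlang
// Hypothetical helper: sanitize a comment body before it is saved
function sanitizeComment( required string body ) {
    // "basic" permits only a small set of safe tags
    return htmlClean( arguments.body, "basic" );
}

submitted = '<p>Nice post!</p><img src="x" onerror="alert(1)">';
safeBody  = sanitizeComment( submitted );

println( safeBody );
```

Cleaning on write keeps stored content safe no matter which template renders it later. Cleaning again on output is a reasonable extra layer.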

Web Scraping

Parse and extract data from HTML content:
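One way this might look, using made-up markup. `eachText()` and `eachAttr()` are methods on Jsoup's `Elements` class that return lists, and they are assumed to be callable directly from BoxLang.

```boxlang
page = htmlParse( '<html><body><h2 class="headline">First story</h2><a href="https://example.com/a">A</a><h2 class="headline">Second story</h2><a href="https://example.com/b">B</a></body></html>' );

// Collect the text of every matching headline
headlines = page.select( "h2.headline" ).eachText();

// Collect the href value of every link that has one
links = page.select( "a[href]" ).eachAttr( "href" );

println( headlines );
println( links );
```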

Data Transformation

Convert HTML to different formats using BoxDocument's enhanced methods:
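A sketch under the same assumptions about `toJSON()` and `toXML()` as above: `prettyPrint` is a boolean and `indentFactor` is a number. `exportHtml` is a hypothetical helper, not a module function.

```boxlang
// Hypothetical helper: export parsed HTML in the requested format
function exportHtml( required string html, string format = "json" ) {
    var doc = htmlParse( arguments.html );

    if ( arguments.format == "xml" ) {
        // Assumed: pretty-printed XML with an indent of 2
        return doc.toXML( true, 2 );
    }

    // Assumed: pretty-printed JSON
    return doc.toJSON( true );
}

println( exportHtml( "<html><body><p>Hello</p></body></html>", "xml" ) );
```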

Email Template Processing

Clean HTML emails before sending:
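A minimal sketch with an invented template, again assuming the BIF name `htmlClean`. The "relaxed" safelist is chosen because Jsoup's relaxed list permits table markup, which email layouts commonly rely on.

```boxlang
template = '<table><tr><td><h1>Welcome</h1><p>Thanks for signing up.</p><script>track()</script></td></tr></table>';

// "relaxed" keeps structural tags such as table, tr, and td while dropping script
emailBody = htmlClean( template, "relaxed" );

println( emailBody );
```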

GitHub Repository and Reporting Issues

Visit the GitHub repository at https://github.com/ortus-boxlang/bx-jsoup for release notes. You can also file a bug report or improvement suggestion via Jira.
