# Jsoup

A BoxLang module that parses and sanitizes HTML using [Jsoup](https://jsoup.org/). It lets you parse, manipulate, and clean HTML content safely.

### Features

* **HTML Parsing**: Parse HTML strings into manipulable document objects
* **HTML Cleaning**: Sanitize HTML content with customizable safety levels
* **XSS Protection**: Built-in protection against Cross-Site Scripting attacks
* **Flexible Safelists**: Multiple predefined safety levels from strict to relaxed
* **CSS Selectors**: Extract elements using familiar CSS selector syntax
* **Relative Link Handling**: Control how relative links are processed during cleaning

### Installation

This module can be installed using CommandBox or the BoxLang installer script:

```bash
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
```

### Available BIFs (Built-in Functions)

#### htmlParse( html )

Parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.

**Parameters:**

* `html` (string, required): The HTML string to parse

**Returns:** A BoxDocument object with methods for HTML manipulation

**Example:**

```javascript
// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );

// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"

// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'

// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags

// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );

// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON

// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation
```

**Standard Jsoup Document Methods:**

* `title()` – Get the contents of the `<title>` tag
* `select(selector)` – Find elements using CSS selectors
* `text()` – Get the combined text of the entire document
* `outerHtml()` – Get the HTML of the entire document
* `body()` – Get the `<body>` element
* `head()` – Get the `<head>` element
* `getElementById(id)` – Find an element by its ID attribute
* `getElementsByTag(tagName)` – Get all elements with the given tag
* `getElementsByClass(className)` – Get all elements with the given class
* `getElementsByAttribute(attrName)` – Get elements that have the specified attribute
* `html()` – Get the inner HTML of the document
* `createElement(tagName)` – Create a new element with the given tag
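
The lookup and creation methods in this list are not exercised by the earlier example, so here is a minimal sketch of them. It assumes BoxLang passes these calls straight through to the underlying Jsoup objects; `first()`, `attr()`, and `appendChild()` are standard Jsoup methods, and the sample markup and variable names are made up for illustration.

```javascript
// Hypothetical sample markup
htmlContent = "<html><head><title>Docs</title></head><body><div id='main'><p class='lead'>Intro</p><a href='/about'>About</a></div></body></html>";
doc = htmlParse( htmlContent );

// Look up a single element by its id attribute
mainDiv = doc.getElementById( "main" );
mainText = mainDiv.text(); // Combined text of the element and its children

// Collect elements by tag, class, or attribute presence
paragraphs = doc.getElementsByTag( "p" );
leads = doc.getElementsByClass( "lead" );
links = doc.getElementsByAttribute( "href" );

// Read an attribute from the first matching element
firstHref = links.first().attr( "href" );

// Create a new element, set its text, and append it to the body
note = doc.createElement( "p" ).text( "Added at runtime" );
doc.body().appendChild( note );

// Inner HTML of the body, now including the appended paragraph
updatedHtml = doc.body().html();
```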

**Enhanced BoxDocument Methods:**

* `toJSON()` – Convert the document to a compact JSON representation
* `toJSON(prettyPrint)` – Convert to JSON with optional pretty-printing
* `toXML()` – Convert the document to a compact XML representation
* `toXML(prettyPrint, indentFactor)` – Convert to XML with optional pretty-printing and custom indentation

**Enhanced Methods Examples:**

```javascript
// Sample HTML for examples
htmlContent = `
<div class="product" id="item-1">
    <h2>Product Name</h2>
    <p class="description">Product description here</p>
    <span class="price">$19.99</span>
</div>
`;
doc = htmlParse( htmlContent );

// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}

// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation

// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>

// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
//     <head></head>
//     <body>
//         <div class="product" id="item-1">
//             <h2>Product Name</h2>
//             <p class="description">Product description here</p>
//             <span class="price">$19.99</span>
//         </div>
//     </body>
// </html>
```
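
Because `toJSON()` returns a string, one way to work with the result as native data is to deserialize it. The sketch below assumes the JSON has the shape shown in the sample output above (a root object with a `tag` key) and uses the BoxLang `jsonDeserialize()` BIF; treat the key names as an assumption rather than a documented contract.

```javascript
doc = htmlParse( "<p class='note'>Hello</p>" );

// Turn the JSON string into native BoxLang structs and arrays
data = jsonDeserialize( doc.toJSON() );

// Assumption: the root node carries a "tag" key, as in the sample output above
rootTag = data.tag;
```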

#### htmlClean( html, safeList, preserveRelativeLinks, baseUri )

Cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.

**Parameters:**

* `html` (string, required): The HTML string to clean
* `safeList` (string, optional): The safety level to apply (default: "relaxed")
* `preserveRelativeLinks` (boolean, optional): Whether to keep relative links as written instead of resolving them against `baseUri` (default: false)
* `baseUri` (string, optional): Base URI used to resolve relative links into absolute ones (default: "")

**Returns:** A cleaned HTML string

**Safelist Options:**

* `none`: Maximum cleaning, removes all tags and returns plain text only
* `simpletext`: Allows only simple inline formatting tags: `<b>`, `<em>`, `<i>`, `<strong>`, `<u>`
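* `basic`: Allows common text and list tags such as `<a>`, `<p>`, `<ul>`, and `<blockquote>`; images are removed
* `basicwithimages`: Same as `basic`, but also allows `<img>` tags
* `relaxed`: Most lenient level; also allows structural tags such as headings, `<div>`, tables, and images (default)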

**Examples:**

```javascript
// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"

// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basic"
);
// Result: "<p>Hello World!</p>"

// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
    html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
    safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"

// Plain text only with "none" safelist
cleanHtml = htmlClean(
    html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
    safeList: "none"
);
// Result: "Bold text and italic text"

// Preserve relative links
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"

// Convert relative links to absolute
cleanHtml = htmlClean(
    html: "<a href='page.html'>Link</a>",
    baseUri: "https://example.com/",
    preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"
```
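
To compare the safelist levels side by side, it can help to run one input through each of them. This is a minimal sketch using only the `htmlClean()` arguments documented above; the sample markup and the `results` struct are illustrative, and no output is shown because the exact serialization depends on Jsoup.

```javascript
// Hypothetical input mixing structural, inline, and link tags
sample = "<h1>Title</h1><p>Some <b>bold</b> text and a <a href='https://example.com'>link</a>.</p>";

// Clean the same input with every documented safelist level
results = {};
for ( level in [ "none", "simpletext", "basic", "basicwithimages", "relaxed" ] ) {
    results[ level ] = htmlClean( html: sample, safeList: level );
}

// results now maps each safelist name to its cleaned output
```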

### Use Cases

#### Content Management Systems

Clean user-generated content before storing or displaying:

```javascript
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
```

#### Web Scraping

Parse and extract data from HTML content:

```javascript
scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
```
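
When a page contains several matching elements, you can loop over the selection and query each element in turn. The sketch below assumes that BoxLang can iterate the Jsoup `Elements` collection returned by `select()` with a `for ... in` loop; the markup and struct keys are made up for illustration.

```javascript
// Hypothetical listing with two products
scrapedHtml = "<div class='product'><h2>Widget</h2><a href='/widget'>Details</a></div>"
    & "<div class='product'><h2>Gadget</h2><a href='/gadget'>Details</a></div>";
doc = htmlParse( scrapedHtml );

products = [];
for ( product in doc.select( ".product" ) ) {
    // Run selectors relative to the current product element
    products.append( {
        name: product.select( "h2" ).text(),
        url: product.select( "a" ).attr( "href" )
    } );
}

// products now holds one struct per .product element
```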

#### Data Transformation

Convert HTML to different formats using BoxDocument's enhanced methods:

```javascript
// Parse HTML content
htmlContent = `
<article>
    <header>
        <h1>Article Title</h1>
        <meta name="author" content="John Doe">
    </header>
    <section class="content">
        <p>First paragraph of the article.</p>
        <p>Second paragraph with <em>emphasis</em>.</p>
    </section>
</article>
`;
doc = htmlParse( htmlContent );

// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing

// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations

// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search
```

#### Email Template Processing

Clean HTML emails before sending:

```javascript
emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely
```

### GitHub Repository and Reporting Issues

Visit the GitHub repository at <https://github.com/ortus-boxlang/bx-jsoup> for release notes. You can also file a bug report or improvement suggestion via [Jira](https://ortussolutions.atlassian.net/secure/CreateIssueDetails!init.jspa?pid=13359&issuetype=1).

