htmlToBlocks - A Custom Implementation in C#
Great question about empty blocks in Portable Text! Understanding when and why these get created during HTML parsing is important for implementing this correctly in C#.
When Empty Blocks Are Created
During HTML to Portable Text conversion, empty blocks are generated based on specific parsing rules. The htmlToBlocks function (note: @sanity/block-tools has been deprecated in favor of @portabletext/block-tools, but the core logic is similar) creates empty blocks in these scenarios:
1. Block-level HTML elements with no text content:
- <p></p>, <h1></h1>, <div></div>, etc.
- Elements containing only whitespace: <p> </p> or <p>\n</p>
2. Self-closing or void elements that represent block boundaries:
- Multiple consecutive <br> tags that create paragraph breaks
- Empty container elements that establish document structure
3. Elements with only formatting/annotation children but no actual text:
- <p><strong></strong></p> - has a decorator but no text spans
- <p><a href="..."></a></p> - has an annotation but no content
4. Structural preservation during initial parsing: The parser creates a block object for every block-level HTML element it encounters, regardless of content, to maintain a 1:1 structural mapping during the initial parse phase. This ensures the HTML's document structure is faithfully represented before any cleanup.
The Parsing Algorithm's Behavior
The htmlToBlocks deserializer follows this general pattern:
- Parse HTML into DOM nodes (using browser DOMParser or JSDOM in Node.js)
- Traverse the DOM tree and for each block-level element:
  - Create a new block object with _type: 'block'
  - Set default values: markDefs: [], style: 'normal', children: []
  - Process child nodes to populate the children array
- For text nodes, create span objects with text content
- For inline elements (strong, em, code, etc.), add marks to spans
- Result: Every block-level element becomes a block, even if its children array ends up empty
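The steps above can be sketched in C# roughly as follows. This is a minimal illustration and not the block-tools implementation: the Node class is a hypothetical stand-in for whatever your HTML parser returns, the tag-to-decorator table is an assumed mapping you should adjust to your schema, and _key generation and annotations (markDefs) are left out for brevity.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal DOM node; replace with your parser's node type.
public sealed class Node
{
    public string Tag = "";   // empty string marks a text node
    public string Text = "";  // only meaningful for text nodes
    public List<Node> Children = new List<Node>();

    public bool IsText => Tag.Length == 0;

    public static Node TextNode(string text) => new Node { Text = text };

    public static Node Element(string tag, params Node[] children) =>
        new Node { Tag = tag, Children = children.ToList() };
}

public sealed class Span
{
    public string _type = "span";
    public string text = "";
    public List<string> marks = new List<string>();
}

public sealed class Block
{
    public string _type = "block";
    public string style = "normal";
    public List<object> markDefs = new List<object>();
    public List<Span> children = new List<Span>();
}

public static class BlockSketch
{
    // Assumed mapping from inline tags to decorator names.
    private static readonly Dictionary<string, string> Decorators =
        new Dictionary<string, string>
        {
            ["strong"] = "strong", ["b"] = "strong",
            ["em"] = "em", ["i"] = "em",
            ["code"] = "code",
        };

    // Headings and blockquotes keep their tag name; everything else is "normal".
    public static string MapStyleFromTag(string tag)
    {
        switch (tag)
        {
            case "h1": case "h2": case "h3":
            case "h4": case "h5": case "h6":
            case "blockquote":
                return tag;
            default:
                return "normal";
        }
    }

    // Always returns a block, even when no span is produced.
    public static Block ToBlock(Node element)
    {
        var block = new Block { style = MapStyleFromTag(element.Tag) };
        foreach (var child in element.Children)
            Collect(child, new List<string>(), block.children);
        return block;
    }

    // Walks inline content; only text nodes emit spans, inline tags add marks.
    private static void Collect(Node node, List<string> activeMarks, List<Span> output)
    {
        if (node.IsText)
        {
            // Whitespace handling is left to the later cleanup pass.
            output.Add(new Span { text = node.Text, marks = new List<string>(activeMarks) });
            return;
        }

        var marks = new List<string>(activeMarks);
        if (Decorators.TryGetValue(node.Tag, out var mark) && !marks.Contains(mark))
            marks.Add(mark);

        foreach (var child in node.Children)
            Collect(child, marks, output);
    }
}
```

With this sketch, calling BlockSketch.ToBlock on the equivalent of <p><strong></strong></p> yields a block whose children list is empty, because only text nodes produce spans. That is how a decorator-only paragraph ends up as an empty block.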
Requirements for Your C# Implementation
When implementing this in C#, you should:
During Parsing (create empty blocks):
// Pseudo-code for C# implementation
var blocks = new List<Block>();
foreach (var element in blockLevelElements)
{
    var block = new Block
    {
        _type = "block",
        _key = GenerateKey(),
        style = MapStyleFromTag(element.TagName), // h1 -> "h1", p -> "normal"
        markDefs = new List<MarkDef>(),
        children = new List<Span>()
    };
    // Process children - this might result in an empty children array.
    // Only direct text nodes are handled here; inline elements (strong, em, a, ...)
    // need their own handling to produce spans with marks.
    foreach (var child in element.ChildNodes)
    {
        if (child.NodeType == NodeType.Text && !string.IsNullOrWhiteSpace(child.TextContent))
        {
            block.children.Add(CreateSpan(child));
        }
    }
    blocks.Add(block); // Add even if children is empty
}
After Parsing (filter empty blocks):
// Post-processing cleanup
blocks = blocks.Where(block =>
{
    // Keep blocks that have no children property at all (images, custom types)
    if (block.children == null)
        return true;
    // Drop text blocks whose children list is empty
    if (block.children.Count == 0)
        return false;
    // Drop text blocks whose spans contain only whitespace
    var allText = string.Concat(block.children.Select(c => c.text ?? ""));
    return !string.IsNullOrWhiteSpace(allText);
}).ToList();
Important: Keep blocks that don't have a children property at all (like image blocks or other custom block types) - only filter blocks where children exists but is empty or contains only whitespace.
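As a self-contained illustration of that rule, here is a small sketch that can be compiled on its own. The PtBlock and PtSpan types and the RemoveEmptyTextBlocks name are hypothetical, and a null children list is assumed to stand for "this block type has no children property".

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class PtSpan
{
    public string text = "";
}

public sealed class PtBlock
{
    public string _type = "block";
    // Assumption: null means the block type has no children property (e.g. an image).
    public List<PtSpan> children;
}

public static class EmptyBlockFilter
{
    public static List<PtBlock> RemoveEmptyTextBlocks(IEnumerable<PtBlock> blocks)
    {
        return blocks.Where(block =>
        {
            // Blocks without a children property are never filtered.
            if (block.children == null)
                return true;

            // Keep a text block only if at least one span has non-whitespace text.
            return block.children.Any(span => !string.IsNullOrWhiteSpace(span.text));
        }).ToList();
    }
}
```

Given an image block with no children, a block with an empty children list, a block whose only span is a single space, and a block containing "Hello", the filter keeps the image block and the "Hello" block and drops the other two.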
Why This Two-Phase Approach?
The empty blocks exist as an intermediate artifact because:
- It simplifies the parsing logic - every block element gets a block object
- It maintains structural fidelity during conversion
- It allows custom rules to inspect the full structure before cleanup
- Post-processing can apply consistent cleanup logic across all blocks
The blocks with "children": [], "markDefs": [], and "style": "normal" are valid Portable Text structures - they represent empty paragraphs in the source HTML. However, they're typically undesirable in the final output since they'd render as empty paragraphs.
Regarding Your C# Port
For your C# implementation, I'd recommend:
- Use a library like HtmlAgilityPack or AngleSharp for HTML parsing
- Create blocks for all block-level elements during the initial pass
- Apply the empty-block filter (empty or whitespace-only children) as a separate post-processing step
- Consider joining the Sanity Community Slack to discuss the C# port - there may be other .NET developers interested in collaborating!
The key insight is that empty blocks are a byproduct of the structural parsing algorithm, not a bug, but they should generally be filtered out before returning the final Portable Text array.