Understanding Empty Blocks in HTML to Portable Text Conversion
Great question about empty blocks in Portable Text! Understanding when and why these get created during HTML parsing is important for implementing this correctly in C#.
When Empty Blocks Are Created
During HTML to Portable Text conversion, empty blocks are generated based on specific parsing rules. The htmlToBlocks function (note: @sanity/block-tools has been deprecated in favor of @portabletext/block-tools, but the core logic is similar) creates empty blocks in these scenarios:
1. Block-level HTML elements with no text content:
- <p></p>, <h1></h1>, <div></div>, etc.
- Elements containing only whitespace: <p> </p> or <p>\n</p>
2. Self-closing or void elements that represent block boundaries:
- Multiple consecutive <br> tags that create paragraph breaks
- Empty container elements that establish document structure
3. Elements with only formatting/annotation children but no actual text:
- <p><strong></strong></p> - has a decorator but no text spans
- <p><a href="..."></a></p> - has an annotation but no content
4. Structural preservation during initial parsing: The parser creates a block object for every block-level HTML element it encounters, regardless of content, to maintain a 1:1 structural mapping during the initial parse phase. This ensures the HTML's document structure is faithfully represented before any cleanup.
The Parsing Algorithm's Behavior
The htmlToBlocks deserializer follows this general pattern:
- Parse HTML into DOM nodes (using browser DOMParser or JSDOM in Node.js)
- Traverse the DOM tree and for each block-level element:
  - Create a new block object with _type: 'block'
  - Set default values: markDefs: [], style: 'normal', children: []
  - Process child nodes to populate the children array
- For text nodes, create span objects with text content
- For inline elements (strong, em, code, etc.), add marks to spans
- Result: Every block-level element becomes a block, even if its children array ends up empty
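The traversal above can be sketched in C# roughly as follows. This is a minimal illustration and not the actual htmlToBlocks code: Node, TextSpan, Block, BlockSketch, and the tag-to-decorator table are all hypothetical names invented here, and the tiny in-memory Node type stands in for whatever DOM your HTML parser provides. Keys, markDefs, annotations, and <br> handling are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal DOM node: Tag is null for text nodes.
public sealed record Node(string? Tag, string? Text, List<Node> Children)
{
    public static Node TextNode(string text) => new(null, text, new List<Node>());
    public static Node Element(string tag, params Node[] children) => new(tag, null, children.ToList());
}

// Hypothetical span and block shapes (no _key or markDefs in this sketch).
public sealed record TextSpan(string Text, List<string> Marks);
public sealed record Block(string Style, List<TextSpan> Children);

public static class BlockSketch
{
    // Assumed mappings; a real port would derive these from its schema.
    private static readonly HashSet<string> StyleTags =
        new() { "h1", "h2", "h3", "h4", "h5", "h6", "blockquote" };

    private static readonly Dictionary<string, string> Decorators = new()
    {
        ["strong"] = "strong", ["b"] = "strong",
        ["em"] = "em", ["i"] = "em",
        ["code"] = "code",
    };

    // Every block-level element yields a block, even when no spans are collected.
    public static Block ToBlock(Node element)
    {
        var style = element.Tag != null && StyleTags.Contains(element.Tag) ? element.Tag : "normal";
        var block = new Block(style, new List<TextSpan>());
        foreach (var child in element.Children)
            Collect(child, new List<string>(), block.Children);
        return block;
    }

    // Text nodes become spans carrying the marks of their inline ancestors.
    private static void Collect(Node node, List<string> marks, List<TextSpan> spans)
    {
        if (node.Tag is null)
        {
            if (!string.IsNullOrEmpty(node.Text))
                spans.Add(new TextSpan(node.Text, new List<string>(marks)));
            return;
        }

        var next = new List<string>(marks);
        if (Decorators.TryGetValue(node.Tag, out var mark) && !next.Contains(mark))
            next.Add(mark);
        foreach (var child in node.Children)
            Collect(child, next, spans);
    }
}
```

With this sketch, BlockSketch.ToBlock(Node.Element("p", Node.Element("strong"))) returns a block whose Children list is empty, because the decorator element contains no text node to turn into a span. Whitespace-only text is deliberately kept as a span here, so deciding whether such a block is "empty" is left to a later cleanup step.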
Requirements for Your C# Implementation
When implementing this in C#, you should:
During Parsing (create empty blocks):
// Pseudo-code for C# implementation
foreach (var element in blockLevelElements)
{
    var block = new Block
    {
        _type = "block",
        _key = GenerateKey(),
        style = MapStyleFromTag(element.TagName), // h1 -> "h1", p -> "normal"
        markDefs = new List<MarkDef>(),
        children = new List<Span>()
    };
    // Process children - this might result in an empty children array.
    // Only direct text nodes are handled here; inline elements (strong, em, a, ...)
    // need a recursive pass that adds marks to the spans they produce.
    foreach (var child in element.ChildNodes)
    {
        if (child.NodeType == NodeType.Text && !string.IsNullOrWhiteSpace(child.TextContent))
        {
            block.children.Add(CreateSpan(child));
        }
    }
    blocks.Add(block); // Add even if children is empty
}
After Parsing (filter empty blocks):
// Post-processing cleanup (requires: using System.Linq;)
blocks = blocks.Where(block =>
{
    // No children property at all (image or other custom block types): keep.
    if (block.children == null)
        return true;
    // Text block with no spans: drop.
    if (block.children.Count == 0)
        return false;
    // Drop the block if all of its spans contain only whitespace.
    var allText = string.Join("", block.children.Select(c => c.text ?? ""));
    return !string.IsNullOrWhiteSpace(allText);
}).ToList();
Important: Keep blocks that don't have a children property at all (like image blocks or other custom block types). Only filter blocks where children exists but is empty or contains only whitespace, which is why a null children value must be kept rather than dropped.
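To try the cleanup rule in isolation, here is a small self-contained sketch. PtSpan, PtBlock, and EmptyBlockFilter are hypothetical names made up for this example, not types from any Sanity or Portable Text library; the only assumption is that a block without a children property is modelled as Children == null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model: Children is null for non-text blocks such as images.
public sealed class PtSpan
{
    public string? Text { get; init; }
}

public sealed class PtBlock
{
    public string Type { get; init; } = "block";
    public List<PtSpan>? Children { get; init; }
}

public static class EmptyBlockFilter
{
    // Returns a new list without text blocks that are empty or whitespace-only.
    public static List<PtBlock> RemoveEmpty(IEnumerable<PtBlock> blocks) =>
        blocks.Where(IsWorthKeeping).ToList();

    private static bool IsWorthKeeping(PtBlock block)
    {
        // No children property at all: not a text block, so always keep it.
        if (block.Children is null)
            return true;

        // Keep only if at least one span has non-whitespace text.
        // An empty Children list has no such span and is therefore dropped.
        return block.Children.Any(span => !string.IsNullOrWhiteSpace(span.Text));
    }
}
```

Passing in an image block (no children), a block with an empty children list, a block whose only span is " \n", and a block containing "Hello" should leave just the image block and the "Hello" block. Whether whitespace-only paragraphs should really be dropped depends on your content, so treat the predicate as a starting point.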
Why This Two-Phase Approach?
The empty blocks exist as an intermediate artifact because:
- It simplifies the parsing logic - every block element gets a block object
- It maintains structural fidelity during conversion
- It allows custom rules to inspect the full structure before cleanup
- Post-processing can apply consistent cleanup logic across all blocks
The blocks with "children": [], "markDefs": [], and "style": "normal" are valid Portable Text structures - they represent empty paragraphs in the source HTML. However, they're typically undesirable in the final output since they'd render as empty paragraphs.
Regarding Your C# Port
For your C# implementation, I'd recommend:
- Use a library like HtmlAgilityPack or AngleSharp for HTML parsing
- Create blocks for all block-level elements during the initial pass
- Apply the whitespace-trimming filter as a separate post-processing step
- Consider joining the Sanity Community Slack to discuss the C# port - there may be other .NET developers interested in collaborating!
The key insight is that empty blocks are a byproduct of the structural parsing algorithm, not a bug, but they should generally be filtered out before returning the final Portable Text array.