Understanding Empty Blocks in HTML to Portable Text Conversion

39 replies
Last updated: Jan 28, 2026
Hi guys! At our company, we are trying to adopt the Portable Text format and move away from plain HTML. Our backend is in C#, so unfortunately we cannot (and would rather not) use the https://www.npmjs.com/package/@sanity/block-tools?activeTab=readme#htmltoblockshtml-blockcontenttype-options-html-deserializer package directly, but we are using it as a reference for a custom implementation in C#. I am acting as an intermediary here and may have more questions about this in the future. Our backend developer has a question about the reasoning behind:

{
    "_type": "block",
    "markDefs": [],
    "style": "normal",
    "children": []
},
These empty blocks. What are the requirements for these to be inserted when parsing html to Portable Text?

PS: Anyone here who would be interested in porting (or helping to port) block-tools's htmlToBlocks to C#? ❤️
AI Update

Great question about empty blocks in Portable Text! Understanding when and why these get created during HTML parsing is important for implementing this correctly in C#.

When Empty Blocks Are Created

During HTML to Portable Text conversion, the htmlToBlocks function (from @sanity/block-tools, which has since been deprecated in favor of @portabletext/block-tools; the core logic is similar) creates empty blocks in these scenarios:

1. Block-level HTML elements with no text content:

  • <p></p>, <h1></h1>, <div></div> etc.
  • Elements containing only whitespace: <p> </p> or <p>\n</p>

2. Line breaks and empty containers that act as block boundaries:

  • Multiple consecutive <br> tags that create paragraph breaks
  • Empty container elements that establish document structure

3. Elements with only formatting/annotation children but no actual text:

  • <p><strong></strong></p> - has a decorator but no text spans
  • <p><a href="..."></a></p> - has an annotation but no content

4. Structural preservation during initial parsing: The parser creates a block object for every block-level HTML element it encounters, regardless of content, to maintain a 1:1 structural mapping during the initial parse phase. This ensures the HTML's document structure is faithfully represented before any cleanup.

The Parsing Algorithm's Behavior

The htmlToBlocks deserializer follows this general pattern:

  1. Parse HTML into DOM nodes (using browser DOMParser or JSDOM in Node.js)
  2. Traverse the DOM tree and for each block-level element:
    • Create a new block object with _type: 'block'
    • Set default values: markDefs: [], style: 'normal', children: []
    • Process child nodes to populate children array
  3. For text nodes, create span objects with text content
  4. For inline elements (strong, em, code, etc.), add marks to spans
  5. Result: Every block-level element becomes a block, even if its children array ends up empty
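
The pattern above can be sketched in C# as follows. This is a minimal illustration, not the block-tools implementation: it uses System.Xml.Linq as a stand-in for a real HTML parser (so it only accepts well-formed markup), and the names HtmlToBlocksSketch, Convert, BlockStyles and Decorators, as well as the tag tables themselves, are assumptions made for the example. Keys and annotations (links, markDefs) are left out, and whitespace-only text nodes are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Simplified Portable Text shapes for this sketch (no _key, no annotations).
public record Span(string _type, string text, List<string> marks);
public record Block(string _type, string style, List<object> markDefs, List<Span> children);

public static class HtmlToBlocksSketch
{
    // Hypothetical tag tables; a real port would derive these from its schema.
    static readonly Dictionary<string, string> BlockStyles = new()
    {
        ["p"] = "normal", ["h1"] = "h1", ["h2"] = "h2", ["blockquote"] = "blockquote",
    };

    static readonly Dictionary<string, string> Decorators = new()
    {
        ["strong"] = "strong", ["b"] = "strong", ["em"] = "em", ["i"] = "em", ["code"] = "code",
    };

    public static List<Block> Convert(string wellFormedHtml)
    {
        // Wrap the fragment in a root element so several top-level tags parse.
        var root = XElement.Parse("<root>" + wellFormedHtml + "</root>");
        var blocks = new List<Block>();

        foreach (var element in root.Elements())
        {
            var tag = element.Name.LocalName.ToLowerInvariant();
            if (!BlockStyles.TryGetValue(tag, out var style))
                continue; // not a block-level tag known to this sketch

            var block = new Block("block", style, new List<object>(), new List<Span>());
            CollectSpans(element, new List<string>(), block.children);
            blocks.Add(block); // added even when children stayed empty
        }
        return blocks;
    }

    // Walks child nodes recursively: text nodes become spans, and known
    // inline tags add a mark to every span found beneath them.
    static void CollectSpans(XElement parent, List<string> marks, List<Span> spans)
    {
        foreach (var node in parent.Nodes())
        {
            if (node is XText text)
            {
                if (!string.IsNullOrWhiteSpace(text.Value))
                    spans.Add(new Span("span", text.Value, new List<string>(marks)));
            }
            else if (node is XElement child)
            {
                var childMarks = new List<string>(marks);
                var tag = child.Name.LocalName.ToLowerInvariant();
                if (Decorators.TryGetValue(tag, out var mark))
                    childMarks.Add(mark);
                CollectSpans(child, childMarks, spans);
            }
        }
    }
}
```

Calling HtmlToBlocksSketch.Convert("<p>Hello <strong>world</strong></p><p><strong></strong></p>") on this sketch yields two blocks: the first has two spans (the second carrying the strong mark), and the second has an empty children list, which is exactly the kind of block the question is about.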

Requirements for Your C# Implementation

When implementing this in C#, you should:

During Parsing (create empty blocks):

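// Pseudo-code for a C# implementation. Block, Span, MarkDef, GenerateKey,
// MapStyleFromTag and CreateSpan are placeholders you would define yourself;
// the DOM members (TagName, ChildNodes, NodeType, TextContent) follow an
// AngleSharp-style API.
var blocks = new List<Block>();
foreach (var element in blockLevelElements)
{
    var block = new Block
    {
        _type = "block",
        _key = GenerateKey(),
        // h1 -> "h1", p -> "normal" (match tag names case-insensitively)
        style = MapStyleFromTag(element.TagName),
        markDefs = new List<MarkDef>(),
        children = new List<Span>()
    };

    // Process direct text children only; inline elements (strong, em, a, ...)
    // would need a recursive walk. This can leave the children list empty.
    foreach (var child in element.ChildNodes)
    {
        if (child.NodeType == NodeType.Text && !string.IsNullOrWhiteSpace(child.TextContent))
        {
            block.children.Add(CreateSpan(child));
        }
    }

    blocks.Add(block); // Add even if children is empty
}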

After Parsing (filter empty blocks):

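// Post-processing cleanup (requires: using System.Linq;)
blocks = blocks.Where(block =>
{
    // No children property at all (e.g. image or other custom block types): keep
    if (block.children == null)
        return true;

    // children exists but is empty: drop
    if (block.children.Count == 0)
        return false;

    // Drop blocks whose children contain only whitespace text
    var allText = string.Concat(block.children.Select(c => c.text ?? ""));
    return !string.IsNullOrWhiteSpace(allText);
}).ToList();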

Important: Keep blocks that don't have a children property at all (like image blocks or other custom block types) - only filter blocks where children exists but is empty or contains only whitespace.
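
To make that rule concrete at the JSON level, here is a small sketch that filters a serialized Portable Text array with System.Text.Json.Nodes (.NET 6 or later). The class and method names (EmptyBlockFilter, RemoveEmptyTextBlocks, ShouldKeep) are made up for this example, and it assumes every child is a JSON object whose text, if present, is a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

public static class EmptyBlockFilter
{
    // Returns the array items that survive cleanup, in their original order.
    public static List<JsonObject> RemoveEmptyTextBlocks(string portableTextJson)
    {
        var array = JsonNode.Parse(portableTextJson)!.AsArray();
        return array.OfType<JsonObject>().Where(ShouldKeep).ToList();
    }

    static bool ShouldKeep(JsonObject node)
    {
        // No children property at all (image, custom block types): keep.
        if (node["children"] is not JsonArray children)
            return true;

        // children exists: keep only if some child carries non-whitespace text.
        var text = string.Concat(
            children.Select(c => (string?)(c as JsonObject)?["text"] ?? ""));
        return !string.IsNullOrWhiteSpace(text);
    }
}
```

With this sketch, an input containing an empty block, a whitespace-only block, an image object without children, and a block with real text keeps only the last two. Note that a block whose only children are inline objects without a text field would also be dropped here, so a real port may want a stricter check than this one.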

Why This Two-Phase Approach?

The empty blocks exist as an intermediate artifact because:

  1. It simplifies the parsing logic - every block element gets a block object
  2. It maintains structural fidelity during conversion
  3. It allows custom rules to inspect the full structure before cleanup
  4. Post-processing can apply consistent cleanup logic across all blocks

The blocks with "children": [], "markDefs": [], and "style": "normal" are valid Portable Text structures - they represent empty paragraphs in the source HTML. However, they're typically undesirable in the final output since they'd render as empty paragraphs.

Regarding Your C# Port

For your C# implementation, I'd recommend:

  1. Use a library like HtmlAgilityPack or AngleSharp for HTML parsing
  2. Create blocks for all block-level elements during the initial pass
  3. Apply the whitespace-trimming filter as a separate post-processing step
  4. Consider joining the Sanity Community Slack to discuss the C# port - there may be other .NET developers interested in collaborating!

The key insight is that empty blocks are a byproduct of the structural parsing algorithm, not a bug, but they should generally be filtered out before returning the final Portable Text array.
