htmlToBlocks - A Custom Implementation in C#

39 replies
Last updated: Apr 16, 2020
Hi guys! At our company, we are trying to adopt the Portable Text format and move away from plain HTML. Our backend is in C#, so unfortunately we cannot (and would rather not) use the https://www.npmjs.com/package/@sanity/block-tools?activeTab=readme#htmltoblockshtml-blockcontenttype-options-html-deserializer package directly, but we are using it as the reference for a custom implementation in C#. I am acting as an intermediary here and may have related questions in the future. Our backend developer has a question about the reasoning behind:

{
    "_type": "block",
    "markDefs": [],
    "style": "normal",
    "children": []
},
What are these empty blocks for? Under what conditions are they inserted when parsing HTML to Portable Text?

PS: Is anyone here interested in porting (or helping to port) block-tools's htmlToBlocks to C#? ❤️
AI Update

Great question about empty blocks in Portable Text! Understanding when and why these get created during HTML parsing is important for implementing this correctly in C#.

When Empty Blocks Are Created

During HTML to Portable Text conversion, empty blocks are generated based on specific parsing rules. The htmlToBlocks function (note: @sanity/block-tools has been deprecated in favor of @portabletext/block-tools, but the core logic is similar) creates empty blocks in these scenarios:

1. Block-level HTML elements with no text content:

  • <p></p>, <h1></h1>, <div></div> etc.
  • Elements containing only whitespace: <p> </p> or <p>\n</p>

2. Self-closing or void elements that represent block boundaries:

  • Multiple consecutive <br> tags that create paragraph breaks
  • Empty container elements that establish document structure

3. Elements with only formatting/annotation children but no actual text:

  • <p><strong></strong></p> - has a decorator but no text spans
  • <p><a href="..."></a></p> - has an annotation but no content

4. Structural preservation during initial parsing: The parser creates a block object for every block-level HTML element it encounters, regardless of content, to maintain a 1:1 structural mapping during the initial parse phase. This ensures the HTML's document structure is faithfully represented before any cleanup.

The Parsing Algorithm's Behavior

The htmlToBlocks deserializer follows this general pattern:

  1. Parse HTML into DOM nodes (using browser DOMParser or JSDOM in Node.js)
  2. Traverse the DOM tree and for each block-level element:
    • Create a new block object with _type: 'block'
    • Set default values: markDefs: [], style: 'normal', children: []
    • Process child nodes to populate children array
  3. For text nodes, create span objects with text content
  4. For inline elements (strong, em, code, etc.), add marks to spans
  5. Result: Every block-level element becomes a block, even if its children array ends up empty
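
The steps above can be sketched roughly as follows. This is only an illustration of the described pattern, not the actual block-tools source: the node shape (plain objects with tag, children, and text), the lookup tables, and the function names toBlocks and collectSpans are all assumptions made for the example.

```javascript
// Hypothetical minimal node shape (not the DOM API):
//   elements:   { tag: 'p', children: [...] }
//   text nodes: { text: '...' }
const BLOCK_STYLES = { p: 'normal', div: 'normal', h1: 'h1', h2: 'h2', h3: 'h3', h4: 'h4' };
const DECORATORS = { strong: 'strong', b: 'strong', em: 'em', i: 'em', code: 'code' };

// Collect spans from inline content, carrying the active marks down the tree.
function collectSpans(node, marks, spans) {
  if (node.text !== undefined) {
    // Whitespace-only text nodes produce no span in this sketch.
    if (node.text.trim() !== '') {
      spans.push({ _type: 'span', marks: [...marks], text: node.text });
    }
    return;
  }
  const mark = DECORATORS[node.tag];
  const nextMarks = mark ? [...marks, mark] : marks;
  for (const child of node.children || []) {
    collectSpans(child, nextMarks, spans);
  }
}

// Turn a list of block-level nodes into blocks; children may end up empty.
function toBlocks(nodes) {
  return nodes.map((node) => {
    const children = [];
    for (const child of node.children || []) {
      collectSpans(child, [], children);
    }
    return {
      _type: 'block',
      markDefs: [],
      style: BLOCK_STYLES[node.tag] || 'normal',
      children,
    };
  });
}
```

With this shape, a paragraph that contains only an empty strong element yields a block whose children array is empty, which is the situation the original question asks about.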

Requirements for Your C# Implementation

When implementing this in C#, you should:

During Parsing (create empty blocks):

// Pseudo-code for C# implementation
foreach (var element in blockLevelElements) 
{
    var block = new Block 
    {
        _type = "block",
        _key = GenerateKey(),
        style = MapStyleFromTag(element.TagName), // h1 -> "h1", p -> "normal"
        markDefs = new List<MarkDef>(),
        children = new List<Span>()
    };
    
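    // Process children. This sketch only handles direct text nodes; inline elements
    // such as <strong> or <a> need a recursive walk. The children list may stay empty.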
    foreach (var child in element.ChildNodes) 
    {
        if (child.NodeType == NodeType.Text && !string.IsNullOrWhiteSpace(child.TextContent))
        {
            block.children.Add(CreateSpan(child));
        }
    }
    
    blocks.Add(block); // Add even if children is empty
}

After Parsing (filter empty blocks):

// Post-processing cleanup
blocks = blocks.Where(block =>
{
    // Keep non-text blocks (images, custom types) that have no children property
    if (block._type != "block" || block.children == null)
        return true;

    // Drop text blocks whose children are empty or contain only whitespace
    var allText = string.Join("", block.children.Select(c => c.text ?? ""));
    return !string.IsNullOrWhiteSpace(allText);
}).ToList();

Important: Keep blocks that don't have a children property at all (like image blocks or other custom block types) - only filter blocks where children exists but is empty or contains only whitespace.

Why This Two-Phase Approach?

The empty blocks exist as an intermediate artifact because:

  1. It simplifies the parsing logic - every block element gets a block object
  2. It maintains structural fidelity during conversion
  3. It allows custom rules to inspect the full structure before cleanup
  4. Post-processing can apply consistent cleanup logic across all blocks

Blocks with "children": [], "markDefs": [], and "style": "normal" correspond to empty paragraphs or stray whitespace in the source HTML. As the replies below point out, a block with an empty children array is not considered valid output, so the end result should be normalized: either give each such block a single empty span or remove it.

Regarding Your C# Port

For your C# implementation, I'd recommend:

  1. Use a library like HtmlAgilityPack or AngleSharp for HTML parsing
  2. Create blocks for all block-level elements during the initial pass
  3. Apply the whitespace-trimming filter as a separate post-processing step
  4. Consider joining the Sanity Community Slack to discuss the C# port - there may be other .NET developers interested in collaborating!

The key insight is that empty block creation is a byproduct of the structural parsing algorithm, not a bug - but they should generally be filtered out before returning the final Portable Text array.

Hi, those blocks aren't valid. You probably need to normalize the end result. Either make them include an empty span, or just remove them.
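
The two options mentioned here (an empty span, or removal) could look something like the sketch below. The function name normalizeBlocks and its mode parameter are made up for illustration; this is not the normalization code that block-tools itself runs, and the same logic should translate directly to C#.

```javascript
// Sketch: normalize text blocks whose children array is empty.
// mode 'fill' inserts one empty span; mode 'remove' drops the block.
function normalizeBlocks(blocks, mode = 'fill') {
  const result = [];
  for (const block of blocks) {
    const isEmptyTextBlock =
      block._type === 'block' &&
      Array.isArray(block.children) &&
      block.children.length === 0;
    if (!isEmptyTextBlock) {
      // Non-empty blocks and custom types (e.g. images) pass through untouched.
      result.push(block);
    } else if (mode === 'fill') {
      result.push({ ...block, children: [{ _type: 'span', marks: [], text: '' }] });
    }
    // mode 'remove': the empty block is skipped entirely
  }
  return result;
}
```

Calling normalizeBlocks(blocks) keeps the block count unchanged, while normalizeBlocks(blocks, 'remove') drops the empty blocks from the result.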
Did you get those from C# or from the JS function?
Used the linked JS block-tools package, without custom rules.
How did you call it?
function convertHTMLtoPortableText (HTMLDoc) {
  return blockTools.htmlToBlocks(HTMLDoc, blockContentType, {
    // rules,
    parseHtml: html => new JSDOM(html).window.document
  })
}
Hmm..that's strange. Because that function should normalize it already.
HTML:
<div>
  <h3>Some text</h3>
  <div>
    <div>
      <h4>Some other text</h4>
    </div>
  </div>
</div>
output:

[
  {
    "_type": "block",
    "markDefs": [],
    "style": "h3",
    "children": [
      {
        "_type": "span",
        "marks": [],
        "text": "Some text"
      }
    ]
  },
  {
    "_type": "block",
    "markDefs": [],
    "style": "normal",
    "children": []
  },
  {
    "_type": "block",
    "markDefs": [],
    "style": "h4",
    "children": [
      {
        "_type": "span",
        "marks": [],
        "text": "Some other text"
      }
    ]
  }
]
Yes, this is the expected output, except that the middle block should have been normalized with an empty span as its only child.
Also the same for
<div>
  <h3>Some text</h3>
  <h4>Some other text</h4>
</div>
What is the result of that?
the same as above
Hmm..I'm not getting those results...that's weird.
[
  {
    "_key": "randomKey0",
    "_type": "block",
    "children": [{"_key": "randomKey00", "_type": "span", "marks": [], "text": "Some text"}],
    "markDefs": [],
    "style": "h3"
  },
  {
    "_key": "randomKey1",
    "_type": "block",
    "children": [{"_key": "randomKey10", "_type": "span", "marks": [], "text": ""}],
    "markDefs": [],
    "style": "normal"
  },
  {
    "_key": "randomKey2",
    "_type": "block",
    "children": [{"_key": "randomKey20", "_type": "span", "marks": [], "text": "Some other text"}],
    "markDefs": [],
    "style": "h4"
  }
]

Maybe something with JSDOM?
I'm using "jsdom": "^12.0.0",
Or no...that's so weird. It should just normalize it anyway.
using 15.2.1 here.
Which version of block-tools btw?
It's really strange that it doesn't normalize, because the exported function should do that.
I think I found it out!


const data = fs.readFileSync(path.join(__dirname,"/data/test.html"), {encoding: "utf-8"})
blockTools.htmlToBlocks(data, blockContentType, {
  parseHtml: html => new JSDOM(html).window.document
})

If there are line breaks in the input file, empty blocks will be added.
<div><h3>Some text</h3><h4>Some other text</h4></div>
Meaning this input does not insert empty blocks.
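
Based on that observation, one possible workaround is to strip the whitespace between tags before handing the HTML to htmlToBlocks. The helper below is a hypothetical sketch (stripInterTagWhitespace is not part of block-tools), and whether it fully avoids the empty blocks in every case has not been verified here.

```javascript
// Sketch: drop whitespace-only text between adjacent tags before conversion.
// Caveat: this also removes a meaningful space between inline elements,
// e.g. "<em>a</em> <strong>b</strong>", and alters <pre> content, so it is
// only safe for markup where inter-tag whitespace is pure formatting.
function stripInterTagWhitespace(html) {
  return html.replace(/>\s+</g, '><');
}
```

The cleaned string can then be passed as the first argument to htmlToBlocks in place of the raw file contents.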
Right, but it should still be normalized, so that's weird.
By the way, do you have any suggestions for helping someone port this to C#? I showed them the https://github.com/portabletext/portabletext specs, but that may not be enough for implementing it from scratch. Do you know people with C# experience (maybe at Sanity) who would be willing to help out?
Sorry, I don't know.
Maybe there’s something here that can help? https://github.com/oslofjord/sanity-linq
Yeah, I linked that too. As I understand it, you can use it to reverse engineer things, but it is not straightforward. Would it be possible to provide a better starting point, not specifically for C#, for those who would like to implement a converter in other languages? 🙂 I think it could broaden the range of companies considering migrating to Sanity.
I find this to be the biggest pain point in the whole process. It is a breeze to write schemas and generate block content if you are already in Sanity, but getting there might be hard, especially if your current CMS only delivers HTML. 😕
Absolutely! Better tooling and docs around portable text is on the list.
I really like Sanity, and lobby for it at my company, but this is a turning point for us.
Right now we cannot let our old CMS go yet, so the current workflow is to listen to changes, convert the HTML to a more "sane" structure, save it in another database, and use a GraphQL endpoint to fetch that data.
The "converter" is written in C#, and we think it would be cumbersome to add JS to the backend as well, since the backend developers (understandably) prefer a single-language codebase.
I could do it with the given npm package in an additional step, but that would complicate the publishing pipeline, and I would have the sole responsibility for the correct data transformation even though I am supposed (not strictly though) to only work with the frontend
I can see that – I guess we're a bit biased towards JS since much of our stuff is written in it. Then again, the “logics” behind serialization and deserialization of Portable Text should be pretty similar in any language.
So it's something we could take a closer look at.
If nothing else, it would make it easier for the community to contribute tooling in their favorite languages
(:javascript: 🤘)
I guess if we can help somehow, I can ask my boss and the back-end developer if we could contribute. (We are from Trondheim, by the way.)
That would be awesome! (I was actually born there, you know)
will ask around then, and get back to you soon
