Validating recursive JSON using JSON Schema in Python

Edit: Please note that it's much easier and faster to do this sort of validation directly using pydantic. If, however, you must use jsonschema, this post should be helpful.
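For the record, the direct-pydantic route mentioned in the edit can be sketched roughly like this (the model mirrors the node structure used throughout this post):

```python
from pydantic import BaseModel, ValidationError


class TreeDataNode(BaseModel):
    title: str
    locked: bool
    children: list["TreeDataNode"]


# Valid data parses into model instances...
node = TreeDataNode.model_validate(
    {"title": "Node 1", "locked": False, "children": []}
)

# ...and invalid data raises pydantic's ValidationError.
try:
    TreeDataNode.model_validate(
        {"title": "Node 1", "locked": "nope", "children": []}
    )
except ValidationError as e:
    print(f"{e.error_count()} validation error(s)")
```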

Hey there!

Recently, while working on a Django app, I implemented validation for a recursive data model that would periodically be synced between the frontend and the backend through an API. It wasn't particularly easy to find a solution online, so I thought I'd post my findings here.

Here is a visual representation of an example of the data model in question:

The same in JSON:

[
  {
    "title": "Node 1",
    "locked": false,
    "children": [
      {
        "title": "Node 1.1",
        "locked": false,
        "children": []
      },
      {
        "title": "Node 1.2",
        "locked": false,
        "children": [
          {
            "title": "Node 1.2.1",
            "locked": false,
            "children": []
          }
        ]
      },
      {
        "title": "Node 1.3",
        "locked": false,
        "children": []
      }
    ]
  },
  {
    "title": "Node 2",
    "locked": false,
    "children": []
  }
]

An individual node:

{
  "title": "Node 1",
  "locked": false,
  "children": []
}

There doesn't seem to be a way to define this structure in a Django DB model, so instead of validating it at the model level, it can be done during serialization. JSON Schema is something of an industry standard for this kind of thing, and it has a Python package, jsonschema! JSON Schema facilitates many things, one of which is validation.

>>> from jsonschema import validate

>>> # A sample schema, like what we'd get from json.load()
>>> schema = {
...     "type" : "object",
...     "properties" : {
...         "price" : {"type" : "number"},
...         "name" : {"type" : "string"},
...     },
... }

>>> # If no exception is raised by validate(), the instance is valid.
>>> validate(instance={"name" : "Eggs", "price" : 34.99}, schema=schema)

A small problem here is that constructing the schema manually for a complex model is not ideal. That's where pydantic comes in handy. Pydantic allows you to, among other things, easily define models as Python classes. Here's how I defined my model:

from pydantic import BaseModel


class TreeDataNode(BaseModel):
    title: str
    locked: bool
    children: list["TreeDataNode"]


class TreeData(BaseModel):
    tree_data: list[TreeDataNode]

It's very intuitive. The field tree_data is a list of TreeDataNode, and the children field of TreeDataNode is also a list of TreeDataNode, which is how pydantic knows the model is recursive. The convention for referencing objects is to use their name exactly as defined, and this goes for pydantic models as well. It doesn't work in recursive models, though, because in Python a name can't be referenced before it's assigned, and inside the class body TreeDataNode doesn't exist yet. That's why TreeDataNode is enclosed in quotes in the definition of the children field: it's a forward reference, and pydantic resolves it behind the curtain.
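As a quick sanity check (a standalone sketch, not from the serializer code later in this post), the forward reference resolves and nested children come back as real TreeDataNode instances:

```python
from pydantic import BaseModel


class TreeDataNode(BaseModel):
    title: str
    locked: bool
    children: list["TreeDataNode"]  # string forward reference to the class itself


class TreeData(BaseModel):
    tree_data: list[TreeDataNode]


# Parse a nested structure; children are validated recursively.
tree = TreeData.model_validate(
    {
        "tree_data": [
            {
                "title": "Node 1",
                "locked": False,
                "children": [
                    {"title": "Node 1.1", "locked": False, "children": []}
                ],
            }
        ]
    }
)
print(type(tree.tree_data[0].children[0]).__name__)  # a real TreeDataNode
```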

All we need now to generate a JSON Schema is to call pydantic's .model_json_schema() method on our TreeData model..

print(TreeData.model_json_schema())

..aaand voilà!

{
    "$defs": {
        "TreeDataNode": {
            "properties": {
                "title": {"title": "Title", "type": "string"},
                "locked": {"title": "Locked", "type": "boolean"},
                "children": {
                    "items": {"$ref": "#/$defs/TreeDataNode"},
                    "title": "Children",
                    "type": "array",
                },
            },
            "required": ["title", "locked", "children"],
            "title": "TreeDataNode",
            "type": "object",
        }
    },
    "properties": {
        "tree_data": {
            "items": {"$ref": "#/$defs/TreeDataNode"},
            "title": "Tree Data",
            "type": "array",
        }
    },
    "required": ["tree_data"],
    "title": "TreeData",
    "type": "object",
}

The above is a Python dictionary representation of the resulting JSON Schema. It can be thought of as:

{
    "tree_data": [TreeDataNode]
}
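Outside of Django, the whole pipeline can be exercised in a few lines. This is a sketch, with the schema generated once up front and reused for every check:

```python
import jsonschema
from pydantic import BaseModel


class TreeDataNode(BaseModel):
    title: str
    locked: bool
    children: list["TreeDataNode"]


class TreeData(BaseModel):
    tree_data: list[TreeDataNode]


# Generate the schema once, then reuse it for every validation.
tree_data_json_schema = TreeData.model_json_schema()

good = {"tree_data": [{"title": "Node 1", "locked": False, "children": []}]}
jsonschema.validate(instance=good, schema=tree_data_json_schema)  # no exception

bad = {"tree_data": [{"title": "Node 1", "locked": False}]}  # missing "children"
try:
    jsonschema.validate(instance=bad, schema=tree_data_json_schema)
except jsonschema.exceptions.ValidationError as e:
    print("rejected:", e.message)
```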

Now, for the actual validation, I use jsonschema.validate within my serializer to validate the incoming JSON from the frontend against the schema pydantic generated (not sure if the error handling here is sound :D):

    def validate_tree_data(self, data):
        try:
            jsonschema.validate(
                instance={"tree_data": data}, schema=tree_data_json_schema
            )
        except jsonschema.exceptions.ValidationError as e:
            raise serializers.ValidationError(e)

        return data

At the moment, the schema at hand only validates the structure of the JSON and the types of the fields. JSON Schema also supports per-field constraints (lengths, patterns, ranges, and so on; see the documentation). In my use case, since I was going to validate the size of the JSON anyway, I decided field-level validations were unnecessary. Surely this would also save me some performance overhead, right?
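For what it's worth, if you do want field-level rules later, pydantic's Field constraints carry over into the generated JSON Schema. The length limits below are made-up examples:

```python
from pydantic import BaseModel, Field


class TreeDataNode(BaseModel):
    # min_length / max_length become minLength / maxLength in the schema.
    title: str = Field(min_length=1, max_length=100)
    locked: bool
    children: list["TreeDataNode"]


class TreeData(BaseModel):
    tree_data: list[TreeDataNode]


schema = TreeData.model_json_schema()
print(schema["$defs"]["TreeDataNode"]["properties"]["title"])
```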

    def validate_tree_data(self, data):
        total_size = 0
        for item in data:
            total_size += len(str(item).encode("utf-8"))
            if total_size > 50 * 1000:  # 50 KB max
                raise serializers.ValidationError("The tree data is too large.")

        try:
            jsonschema.validate(
                instance={"tree_data": data}, schema=tree_data_json_schema
            )
        except jsonschema.exceptions.ValidationError as e:
            raise serializers.ValidationError(e)

        return data

I have no idea whether this is an OK implementation of checking the max size :D (note that str(item) measures Python's repr rather than the serialized JSON, so the count is only approximate). An LLM told me it wouldn't affect performance much at all, so here's hoping it works out.
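If you want a byte count that matches what would actually be sent over the wire, one alternative is to serialize once with json.dumps and measure that instead (a sketch; the function name and limit constant are mine):

```python
import json

MAX_TREE_DATA_BYTES = 50 * 1000  # 50 KB, matching the limit above


def tree_data_size_ok(data: list) -> bool:
    """Measure the serialized JSON size rather than Python's repr()."""
    payload = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return len(payload) <= MAX_TREE_DATA_BYTES


print(tree_data_size_ok([{"title": "Node 1", "locked": False, "children": []}]))
```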

That's all!