In recent posts, I explored how LLMs can be used to generate structured output from unstructured input. I decided to use this ability to automatically generate tags for the posts on this blog with an LLM. I let the LLM read every post and return a list of predefined tags. With PydanticAI, we can easily constrain this list to only contain the tags we have defined. Without that constraint, models tend to invent all sorts of overly specific tags, but we only want broad categories.
We use a local model once again, Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS. This model is rather large compared to what we used previously, so I cannot fit a lot of context into my 16 GB of GPU VRAM, but I managed to squeeze in 24000 tokens by quantizing the KV cache with the options `--cache-type-k q8_0` and `--cache-type-v q8_0` for the `llama-server` of llama.cpp. Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS works better here than Qwen-2.5-coder-7b-instruct-Q8_0, which we had previously used to great success. The smaller Qwen model can also handle this task, but it tends to miss applicable tags, while Mistral finds more tags that match the post. Even so, we combine three Mistral calls to get more tags per post.
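For reference, the server launch looks roughly like this; only the KV-cache flags are the ones mentioned above, the model path, context size, and GPU-offload setting are assumptions:

```sh
# Hypothetical llama-server invocation. The two --cache-type flags are taken
# from the text above; the model file name, -c (context size), and -ngl
# (layers offloaded to the GPU) are assumed values.
llama-server \
    -m Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS.gguf \
    -c 24000 \
    -ngl 99 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
```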
We save the tags as JSON in a file that Quarto (the software that generates this blog) can include to generate the categories for the posts shown on the website. If you want to see how that works, go to the git repository of this blog and look into the file `preprocess.py`.
Since generating the tags takes a few seconds per post, we only process new posts. To re-tag an old post, its entry has to be deleted from the file.
```python
from pathlib import Path
from pydantic_ai import Agent, ModelSettings, capture_run_messages
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.models.openai import OpenAIChatModel
from rich import print
import json
import nbformat
from typing import Literal

model = OpenAIChatModel(
    "",
    provider=OpenAIProvider(
        base_url="http://localhost:8080/v1",
    ),
    settings=ModelSettings(temperature=0.5, max_tokens=1000),
)

valid_tags_raw = """
physics: Post is related to physics, especially particle physics.
science: Post is about science other than physics.
programming: The post is primarily about programming, discussing language features or libraries.
high performance computing: Post is about running software efficiently and fast, typically dealing with benchmarks.
statistics: Post is related to statistics.
llm: Post is related to LLMs (Large Language Models) or uses LLMs, for example through agents.
philosophy: Post touches philosophy.
engineering: Post is about engineering.
opinion: Post expresses opinions.
data analysis: Post is about data analysis.
visualization: Post is primarily about data visualization.
graphical design: Post is about graphical design.
parsing: Post deals with parsing input.
bootstrap: Post is about the bootstrap method in statistics.
uncertainty analysis: Post is about the statistical problems of error estimation, confidence interval estimation, or error propagation.
sWeights: Posts about sWeights or COWs (custom orthogonal weight functions).
symbolic computation: Post is about symbolic computation, e.g. with sympy.
simulation: Post is about simulation of statistical or other processes.
neural networks: Post is about (deep) neural networks.
machine learning: Post is about machine learning other than with neural networks.
prompt engineering: Post is about prompt engineering.
web scraping: Post is about web scraping.
environment: Post is about energy consumption and other topics that affect Earth's environment.
"""

valid_tags = {
    v[0]: v[1] for v in (v.split(":") for v in valid_tags_raw.strip().split("\n"))
}

AllowedTags = Literal[*valid_tags]

tag_agent = Agent(
    model,
    output_type=list[AllowedTags],
    system_prompt="Extract broad tags that match the provided post.",
    instructions=f"""Respond with a list of all tags that match the post.
All valid tags:
{"- ".join(f"{k}: {v}" for (k, v) in valid_tags.items())}
You must use one of these tags, you cannot invent new ones.""",
)

fn_tag_db = Path("../tag_db.json")
if fn_tag_db.exists():
    with fn_tag_db.open(encoding="utf-8") as f:
        tag_db = json.load(f)
else:
    tag_db = {}

input_files = [Path(fn) for fn in Path().rglob("*.*")]

for fn in input_files:
    if fn.suffix not in (".ipynb", ".md"):
        continue

    # skip files that have been processed already
    if fn.name in tag_db:
        continue

    with open(fn, encoding="utf-8") as f:
        if fn.suffix == ".ipynb":
            # We clean the notebook before passing it to the LLM
            nb = nbformat.read(f, as_version=4)
            nb.metadata = {}
            for cell in nb.cells:
                if cell.cell_type == "code":
                    cell.outputs = []
                    cell.execution_count = None
                    cell.metadata = {}
            doc = nbformat.writes(nb)
        elif fn.suffix == ".md":
            doc = f.read()

    tag_input = f"{fn!s}:\n\n{doc}"

    tags = set()
    for _ in range(3):
        # To get a more complete set of tags, we iterate the call.
        with capture_run_messages() as messages:
            try:
                result = await tag_agent.run(tag_input)
                tags |= set(result.output)
            except Exception as e:
                print(e)
                # If there is an error (typically a schema validation error),
                # print the messages for debugging.
                print(messages)
                break

    print(fn.name, tags)

    # A sorted list is easier to diff if we update tags.
    tag_db[fn.name] = list(sorted(tags))

    # save after every change, in case something breaks
    with fn_tag_db.open("w", encoding="utf-8") as f:
        json.dump(tag_db, f, indent=2)
```
```
Wilson Score Interval with Weighted Histograms.ipynb
['bootstrap', 'data analysis', 'simulation', 'statistics']
```
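With `json.dump(tag_db, f, indent=2)`, the corresponding entry in `tag_db.json` ends up looking roughly like this (entries for the other posts omitted):

```json
{
  "Wilson Score Interval with Weighted Histograms.ipynb": [
    "bootstrap",
    "data analysis",
    "simulation",
    "statistics"
  ]
}
```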
The results are nice and allow me to show a nifty word cloud with the tags on the blog. It nicely shows my main interests and allows me to find similar articles.
## Outlook
By asking the LLM for tags three times and merging the results, the list of tags is now rather complete. We now err a little on the side of assigning too many tags; for example, the LLM assigns these tags to this post:
I would say that this post is not about ‘high performance computing’. To weed out this and similar mistakes, we have several options:
- We could use a reasoning model to rate how well each tag matches the post on a scale from 1 to 10 (a rough sketch follows below).
- We could compute embeddings of the tags and the posts and derive a score from the embedding similarity.
We could then apply a cut to remove poor matches. The approach with the reasoning model is more costly but expected to perform better, because reasoning models can develop an understanding of the text as a whole. Computing embeddings is very fast and cheap, but embedding models understand only very little context and mostly match terms that semantically belong together.
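Here is a rough sketch of the first option, reusing the PydanticAI setup from above. It is only meant to illustrate the idea: the rating prompt, the helper name `filter_tags`, and the cut-off of 6 are placeholder choices, and the local Mistral model would have to be swapped for an actual reasoning model.

```python
# Hypothetical rating agent; reuses `model` and `valid_tags` defined above.
# The instructions, the cut-off, and the helper name are illustrative only.
rating_agent = Agent(
    model,
    output_type=int,
    instructions=(
        "Rate on a scale from 1 to 10 how well the given tag matches the post. "
        "Respond only with the rating."
    ),
)


async def filter_tags(doc: str, tags: set[str], cut: int = 6) -> list[str]:
    # Keep only tags whose rating reaches the cut-off.
    kept = []
    for tag in tags:
        prompt = f"Tag: {tag}: {valid_tags[tag]}\n\nPost:\n\n{doc}"
        result = await rating_agent.run(prompt)
        if result.output >= cut:
            kept.append(tag)
    return sorted(kept)
```

This costs one extra model call per candidate tag and post, which is part of why this filter is more expensive than the embedding-based one.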