Tagging posts with an LLM

high performance computing
llm
programming
Published October 19, 2025

In recent posts, I explored how LLMs can be used to generate structured output from unstructured input. I decided to use this ability to automatically generate tags for the posts on this blog with an LLM. I let the LLM read every post and return a list of predefined tags. With PydanticAI, we can easily constrain this list to only contain the tags we have defined. If we don’t do that, models tend to invent all sorts of overly specific tags, but we only want broad categories.

We once again use a local model, Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS. This model is rather large compared to what we used previously, so I cannot fit a lot of context into my 16 GB of GPU VRAM, but I managed to squeeze in 24000 tokens by quantizing the KV cache with the options --cache-type-k q8_0 and --cache-type-v q8_0 for the llama-server of llama.cpp. Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS works better here than Qwen-2.5-coder-7b-instruct-Q8_0, which we had previously used to great success. The smaller Qwen model can also handle this task, but it tends to miss applicable tags, while Mistral finds more tags that match the post. Even so, we combine three Mistral calls to get more tags per post.
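
For reference, the server invocation looks roughly like this; the model path and context-size flag are assumptions based on a standard llama.cpp setup, only the cache-type options are the ones mentioned above:

llama-server -m Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS.gguf \
    -c 24000 \
    --cache-type-k q8_0 --cache-type-v q8_0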

We save the tags as JSON in a file that Quarto (the software that generates this blog) can include to generate the categories for the posts shown on the website. If you want to see how that works, go to the git repository of this blog and look at the file preprocess.py.

Since generating the tags takes a few seconds per post, we only process new posts. To re-tag an old post, its entry has to be deleted from the file.
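
The file simply maps each post’s filename to its sorted list of tags, so an entry looks like this (the first entry from the output below):

{
  "all_KK_decays.ipynb": ["physics", "programming", "visualization"]
}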

from pathlib import Path
from pydantic_ai import Agent, ModelSettings, capture_run_messages
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.models.openai import OpenAIChatModel
from rich import print
import json
import nbformat
from typing import Literal


# The model name can stay empty, because the llama.cpp server serves a
# single model and does not require a specific name.
model = OpenAIChatModel(
    "",
    provider=OpenAIProvider(
        base_url="http://localhost:8080/v1",
    ),
    settings=ModelSettings(temperature=0.5, max_tokens=1000),
)

valid_tags_raw = """
physics: Post is related to physics, especially particle physics.
science: Post is about science other than physics.
programming: The post is primarily about programming, discussing language features or libraries.
high performance computing: Post is about running software efficiently and fast, typically dealing with benchmarks.
statistics: Post is related to statistics.
llm: Post is related to LLMs (Large Language Models) or uses LLMs, for example through agents.
philosophy: Post touches on philosophy.
engineering: Post is about engineering.
opinion: Post expresses opinions.
data analysis: Post is about data analysis.
visualization: Post is primarily about data visualization.
graphical design: Post is about graphical design.
parsing: Post deals with parsing input.
bootstrap: Post is about the bootstrap method in statistics.
uncertainty analysis: Post is about the statistical problems of error estimation, confidence interval estimation, or error propagation.
sWeights: Posts about sWeights or COWs (custom orthogonal weight functions).
symbolic computation: Post is about symbolic computation, e.g. with sympy.
simulation: Post is about simulation of statistical or other processes.
neural networks: Post is about (deep) neural networks.
machine learning: Post is about machine learning other than with neural networks.
prompt engineering: Post is about prompt engineering.
web scraping: Post is about web scraping.
environment: Post is about energy consumption and other topics that affect Earth's environment.
"""

# Split each line only on the first colon, so descriptions may contain
# colons, and strip surrounding whitespace.
valid_tags = {
    k.strip(): v.strip()
    for k, v in (line.split(":", 1) for line in valid_tags_raw.strip().split("\n"))
}


# Build a Literal type from the tag names (requires Python 3.11+), which
# constrains the model output to exactly these tags.
AllowedTags = Literal[*valid_tags]


# Render the valid tags as a bulleted list for the prompt.
tag_list = "\n".join(f"- {k}: {v}" for (k, v) in valid_tags.items())

tag_agent = Agent(
    model,
    output_type=list[AllowedTags],
    system_prompt="Extract broad tags that match the provided post.",
    instructions=f"""
Respond with a list of all tags that match the post.

All valid tags:

{tag_list}

You must use only these tags; you cannot invent new ones.
""",
)


fn_tag_db = Path("../tag_db.json")

if fn_tag_db.exists():
    with fn_tag_db.open(encoding="utf-8") as f:
        tag_db = json.load(f)
else:
    tag_db = {}

input_files = list(Path().rglob("*.*"))

for fn in input_files:
    if fn.suffix not in (".ipynb", ".md"):
        continue

    # skip files that have been processed already
    if fn.name in tag_db:
        continue

    with open(fn, encoding="utf-8") as f:
        if fn.suffix == ".ipynb":
            # We clean the notebook before passing it to the LLM
            nb = nbformat.read(f, as_version=4)
            nb.metadata = {}
            for cell in nb.cells:
                if cell.cell_type == "code":
                    cell.outputs = []
                    cell.execution_count = None
                    cell.metadata = {}
            doc = nbformat.writes(nb)
        elif fn.suffix == ".md":
            doc = f.read()

    tag_input = f"{fn!s}:\n\n{doc}"

    tags = set()
    # To get a more complete set of tags, we call the agent three times
    # and merge the results.
    for _ in range(3):
        with capture_run_messages() as messages:
            try:
                result = await tag_agent.run(tag_input)
                tags |= set(result.output)
            except Exception as e:
                print(e)
                # If there is an error (typically a schema validation error),
                # print the messages for debugging.
                print(messages)
                break
    print(fn.name, tags)
    # A sorted list is easier to diff if we update tags.
    tag_db[fn.name] = sorted(tags)

    # save after every change, in case something breaks
    with fn_tag_db.open("w", encoding="utf-8") as f:
        json.dump(tag_db, f, indent=2)
all_KK_decays.ipynb
['physics', 'programming', 'visualization']
asymmetric errors.ipynb
['physics', 'programming', 'statistics', 'uncertainty analysis']
benchmark of building an array with numba.ipynb
['high performance computing', 'programming']
chance_of_deviations_in_random_splits.ipynb
['data analysis', 'programming', 'simulation', 'statistics']
comparison_chisquare_test_statistics.ipynb
['simulation', 'statistics', 'symbolic computation', 'visualization']
Correlation.ipynb
['data analysis', 'high performance computing', 'physics', 'programming', 'simulation', 'visualization']
cows.ipynb
['programming', 'sWeights', 'statistics', 'visualization']
cross_section_extrapolation_error.ipynb
['data analysis', 'physics', 'statistics', 'uncertainty analysis', 'visualization']
cross_section_pip_vs_pp.ipynb
['data analysis', 'physics', 'visualization']
duffing.ipynb
['physics', 'programming', 'simulation', 'visualization']
error_propagation_ratio.ipynb
['simulation', 'statistics', 'uncertainty analysis']
error_propagation_with_sympy.ipynb
['programming', 'statistics', 'symbolic computation', 'uncertainty analysis']
exceptions.md
['high performance computing', 'programming']
factorization_test.ipynb
['data analysis', 'programming', 'sWeights', 'statistics', 'visualization']
fast_deep_set.ipynb
['high performance computing', 'machine learning', 'neural networks', 'physics', 'programming']
Fit weighted histograms with SPD method.ipynb
['data analysis', 'physics', 'programming', 'science', 'statistics', 'visualization']
From fixed target to sqrt_s_nn and back.ipynb
['physics', 'symbolic computation']
gof_test_statistic.ipynb
['data analysis', 'science', 'simulation', 'statistics']
HESSE_vs_MINOS_TwoGauss.ipynb
['data analysis', 'simulation', 'statistics', 'uncertainty analysis', 'visualization']
hyperon_feed_down.ipynb
['data analysis', 'physics', 'programming', 'visualization']
Interactive plotting in Jupyter with matplotlib.ipynb
['programming', 'visualization']
invariant mass combinatorial background.ipynb
['physics', 'simulation', 'statistics', 'symbolic computation']
Leave-one-out cross-validation.ipynb
['data analysis', 'statistics']
llama_index_rag.ipynb
['llm', 'programming', 'web scraping']
llm_tag_posts.ipynb
['high performance computing', 'llm', 'programming']
logaddexp.ipynb
['high performance computing', 'programming', 'statistics']
look_elsewhere_effect.ipynb
[
    'bootstrap',
    'data analysis',
    'physics',
    'programming',
    'simulation',
    'statistics',
    'uncertainty analysis',
    'visualization'
]
MCMC.ipynb
['programming', 'simulation', 'statistics', 'visualization']
naive_calibration_bias.ipynb
['data analysis', 'physics', 'programming', 'simulation', 'statistics']
New iminuit displays.ipynb
['programming']
Numerically stable calculation of invariant mass.ipynb
['high performance computing', 'physics', 'programming']
p-value conversion.ipynb
['data analysis', 'science', 'statistics', 'visualization']
parse_latex.ipynb
['parsing', 'programming']
parsing_webpages_with_llm.ipynb
['llm', 'parsing', 'prompt engineering', 'web scraping']
parsing_webpages_with_llm_revisited.ipynb
['llm', 'parsing', 'programming', 'prompt engineering', 'web scraping']
plots_with_hyperlinks.ipynb
['programming', 'visualization']
ratio bias.ipynb
['data analysis', 'statistics', 'visualization']
regression.ipynb
['machine learning', 'neural networks', 'programming']
render_latex_to_svg.ipynb
['programming', 'visualization']
resample and numba.ipynb
['high performance computing', 'programming', 'statistics', 'visualization']
RooFit.ipynb
['physics', 'programming', 'simulation', 'statistics']
Simple parallelization in Jupyter Notebooks.ipynb
['high performance computing', 'programming']
Sleep mode power consumption.ipynb
['data analysis', 'environment']
SPD with negative weights.ipynb
['bootstrap', 'programming', 'simulation', 'statistics', 'uncertainty analysis']
struct_of_arrays_vs_arrays_of_structs.ipynb
['high performance computing', 'programming']
template_with_distortion.ipynb
['bootstrap', 'programming', 'simulation', 'statistics', 'uncertainty analysis', 'visualization']
tracking_efficiency.ipynb
['physics', 'programming', 'statistics', 'uncertainty analysis']
UnbiasedEML.ipynb
['science', 'simulation', 'statistics', 'symbolic computation']
Uncertainty of efficiency computed from fitted yields.ipynb
['data analysis', 'simulation', 'statistics', 'symbolic computation', 'uncertainty analysis', 'visualization']
visual_cross_section.ipynb
['physics', 'programming', 'simulation']
Wilson Score Interval with Weighted Histograms.ipynb
['bootstrap', 'data analysis', 'simulation', 'statistics']

The results are nice and allow me to show a nifty word cloud with the tags on the blog. It nicely shows my main interests and allows me to discover similar articles.
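
As an aside, such a cloud is easy to build from the tag frequencies; here is a minimal sketch, assuming the wordcloud package, which is not part of the code above:

from collections import Counter
from wordcloud import WordCloud

# count how often each tag occurs across all posts
counts = Counter(tag for tags in tag_db.values() for tag in tags)

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(counts)
cloud.to_file("tag_cloud.png")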

Outlook

By asking the LLM for tags three times and merging the results, the list of tags is now rather complete. We now err a little on the side of assigning too many tags; for example, the LLM assigns these tags to this post:

['high performance computing', 'llm', 'programming']

I would say that this post is not about ‘high performance computing’. To weed out this and similar mistakes, we have several options:

  • We could use a reasoning model to rate each tag on how well it matches the post on a scale from 1 to 10 (see the sketch below).
  • We could compute embeddings of the tags and the posts and compute a score based on embedding similarity.

We could then apply a cut to remove poor matches. The approach with the reasoning model is more costly but expected to perform better, because reasoning models can develop an understanding of the text as a whole. Computing embeddings is very fast and cheap, but embedding models understand only very little context and mostly match terms that semantically belong together.
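
To illustrate the first option, here is a minimal sketch of such a rating step with PydanticAI; the TagRating schema and the cut at 5 are my assumptions, and a reasoning model would be plugged in instead of the model used above:

from pydantic import BaseModel, Field
from pydantic_ai import Agent

class TagRating(BaseModel):
    rating: int = Field(ge=1, le=10, description="How well the tag matches the post")

rating_agent = Agent(
    model,  # replace with a reasoning model
    output_type=TagRating,
    instructions="Rate on a scale from 1 to 10 how well the given tag matches the post.",
)

# keep only the tags that survive the cut
kept_tags = []
for tag in tags:
    result = await rating_agent.run(f"Tag: {tag} ({valid_tags[tag]})\n\nPost:\n{doc}")
    if result.output.rating > 5:
        kept_tags.append(tag)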