From unstructured to structured: Parsing webpages with a Large Language Model (LLM)

In a recent article, I showed how to set up a simple RAG system based on a locally run Large Language Model. I already praised the ollama library there, which makes it very easy to run LLMs locally. In this post, I will let the LLM parse a web page to extract data and return it in a structured format. More specifically, I will read a couple of web pages from InspireHEP about a few scientific papers on which I am a co-author and then convert these into a simple Markdown list with these references. That is an extract-transform-load (ETL) task in data engineering. Normally, one would write a rigid parser to solve this task, but with LLMs we can skip that and just describe the task in human language.

And yes, there are easier ways to solve this particular task: InspireHEP allows one to download information about papers in machine readable format (BibTeX and others). The point is that this solution can be used also for other pages that do not offer access to their data in machine-readable format.

I will use the llama3-chatqa model with 8b parameters, which is supposed to be good at this task. I got good results from this model after tweaking the prompt a lot. Larger models need less prompt engineering. A better but larger model for this task is command-r with 32b parameters, which I cannot run efficiently on a GPU with 8 GB RAM.

I don’t use llama-index in this post:

llama-index does not provide a locally-run reader for dynamic web pages that run JavaScript. The ReadabilityWebReader should work, but does not.
There are dependency conflicts between the llama-index and ollama.

llama-index (like all the other libraries out there) offer lots of paid cloud-based services, but the support for locally running solutions is modest. More resources are allocated to support paid services, for obvious reasons.

Fortunately, we don’t need any functionality in llama-index, we can do it with a few lines of code from scratch. I use playwright to render the HTML and markdownify to convert to Markdown, and then I use ollama to generate my responses. You need to install these libraries to run the notebook

ollama
playwright
markdownify

Converting dynamic web pages to Markdown

The code for this part was written by ChatGPT. At least on Windows, the Playwright code cannot be run inside a Jupyter notebook, so I had to use a script. The script first produces the dynamically generated HTML via a remote controlled browser, and then converts it to Markdown and saves the Markdown files in the subdirectory scraped.

import subprocess
from pathlib import Path

script = r"""
from playwright.sync_api import sync_playwright
from markdownify import markdownify as md
from pathlib import Path

urls = '''
https://inspirehep.net/literature/1889335
https://inspirehep.net/literature/2512593
https://inspirehep.net/literature/2017107
https://inspirehep.net/literature/2687746
https://inspirehep.net/literature/2727838
'''

urls = [x.strip() for x in urls.split("\n") if x and not x.isspace()]


def scrape_to_markdown(urls, output_dir):
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True, parents=True)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)

        for url in urls:
            output_fn = (
                url.replace("://", "_").replace("/", "_").replace(".", "_") + ".md"
            )
            ofile = output_dir / output_fn
            page = browser.new_page()

            page.goto(url)

            # Wait for JavaScript-rendered content to load
            page.wait_for_load_state("networkidle")

            rendered_html = page.content()

            page.close()

            markdown_content = md(rendered_html)

            with open(ofile, "w", encoding="utf-8") as file:
                file.write(markdown_content)

            print(f"Saved {ofile!r}")

        browser.close()


scrape_to_markdown(urls, "scraped")
"""

if not Path("scraped").exists():
    with open("scrape.py", "w", encoding="utf-8") as f:
        f.write(script)

    subprocess.run(["python", "scrape.py"])

The content of an example files looks like this:

When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle - INSPIREYou need to enable JavaScript to run this app.From 21 Dec 2023 to 06 Jan 2024 the INSPIRE team works on a reduced schedule and it will take a bit longer than usual to address your requests. Best wishes for the season![INSPIRE Logo](/)literature

* Help
* Submit
* [Login](/user/login)

[Literature](/literature)[Authors](/authors)[Jobs](/jobs)[Seminars](/seminars)[Conferences](/conferences) More... 

When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle
--------------------------------------------------------------------------------------

* [Tanguy Pierog](/authors/1025036)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897)),
* [Sebastian Baur](/authors/1370189)(
  + [Brussels U., IIHE](/institutions/902696)),
* [Hans Dembinski](/authors/1035720)(
  + [Tech. U., Dortmund (main)](/institutions/1088595)),
* [Matías Perlin](/authors/1590214)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897) and
  + [CNEA, Buenos Aires](/institutions/902700)),
* [Ralf Ulrich](/authors/1035845)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897))

Show All(6)20218 pagesPublished in: 

* *PoS* ICRC2021 (2021) 469
Contribution to: 

* [+ ICRC 2021](/conferences/1776906), 469

* Published: 2021
DOI: 

* [10.22323/1.395.0469](//doi.org/10.22323/1.395.0469)
View in: 

* [HAL Science Ouverte](https://hal.science/hal-03373282)
[pdf](https://inspirehep.net/files/294f575f9507555012249728090abc50)citeclaim[reference search](/literature?q=citedby:recid:1928162)[9 citations](/literature?q=refersto:recid:1928162)
### Citations per year

202120222023202432Abstract: (SISSA)The deficit of muons in the simulation of extensive air showers is a long-standing problem and the origin of large uncertainties in the reconstruction of the mass of the high energy primary cosmic rays. Hadronic interaction models, re-tuned after early LHC data, have a more consistent description of the muon content among them but still disagree with data. Collective hadronization due to the formation of a quark gluon plasma (QGP) has already been studied as a possible cause for a larger production of muons under extreme conditions (rare, very central nuclear interactions), but without real success. However, in the view of the most recent LHC data, a collective hadronization phase might not only be limited to such extreme conditions. And because of its different ratio of electromagnetic to hadronic energy, a QGP may have the properties to solve the muon puzzle. This hypothesis is demonstrated using a theoretical approach and tested in a proper way by the modification of hadronic model spectra in CONEX to mimic the production of a QGP also in less extreme conditions with a possible large impact on air shower physics.

* showers: atmosphere
* cosmic radiation: primary
* energy: hadronic
* interaction: model
* energy: high
* muon: production
* nucleus: interaction
* model: hadronic
* quark gluon: plasma
* collective
Show all (16)References(40)Figures(0)

* [1]
  #### [Air Shower Simulation with a New Generation of post-LHC Hadronic Interaction Models in CORSIKA](/literature/1687010)
  
  + [Tanguy Pierog](/authors/1025036)(
    - [KIT, Karlsruhe](/institutions/911469))
  + - *PoS* ICRC2017 (2018) 1100 •
  + DOI: 
    - [10.22323/1.301.1100](//doi.org/10.22323/1.301.1100)edit

[...]

The web page also contains all the references cited by the paper. I skipped that part here, which is not of interest for us. In fact, we can and should cut that part away in order to help the model focus on the relevant text piece and to not overload its context window.

The converted Markdown contains mistakes, where the conversion process garbled up the structure of the document. Let’s see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI.

Extracting data from raw text with a LLM

We need to write a good prompt for our model, that explains what we want, ideally without ambiguity. Just like humans, LLMs are better at inferring a rule from concrete examples, than from an abstract description. In my prompt, I describe the general format and then illustrate it with concrete examples, which seems to work well.

Also just like humans, LLMs pay more attention to context that is nearby. For transformer architectures, there is no fundamental reason for that, it just something the model learns during training. Therefore we put the instructions after the data, to make sure that our format is followed.

You can read more about prompt engineering elsewhere. I find it amusing that prompt engineering rediscovers good practices from education, which is no surprise: LLMs learned how to ‘think’ from texts written by humans, so they pick up our cognitive biases, too.

Even so, the model sometimes generates garbage, because it is so small. To compensate, I run it three times for each document and pick the good results by hand.

import ollama
from pathlib import Path

input_dir = Path("scraped")

documents = [fn.open(encoding="utf-8").read() for fn in input_dir.glob("*.md")]

prompt_template = """
Extract a reference with authors, title, journal info, and DOI from the text in <input> tags.

<input>
{text}
</input>

Return the reference in this structure and nothing else:

First and last name of first author, First and last name of second author, ..., journal reference, "The title", [DOI](DOI URL)

Examples of valid references:

Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
Flavia Gesualdi et al., PoS ICRC2021 (2021) 473, "On the muon scale of air showers and its application to the AGASA data", [10.22323/1.395.0473](https://doi.org/10.22323/1.395.0473)

Requirements:

- If there are more than four authors, use `<first author> et al.` instead of listing all authors. Replace <first author> with the actual name of the author.
- The list of authors must be author names only separated by commas.
- Remove any *italic* or **bold** emphasis from the reference.
- Convert any LaTeX code in the title with an equivalent plain text description.

The extracted reference:
"""

for idoc, doc in enumerate(documents):
    # strip the bibliography block
    d = doc[:doc.index("###")]
    prompt = prompt_template.format(text=d)
    for trial in range(3):
        # a low temperate seems to make the output more reliable
        response = ollama.generate(model='llama3-chatqa', prompt=prompt, options={"temperature": 0.3, "seed": trial})
        # tiny bit of post-processing: replace newlines with spaces, trim whitespace
        text = response.response.replace('\n', '').strip()
        print(f"{idoc}.{trial}: {text}")

0.0: Roel Aaij, JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)Note: The DOI URL is automatically generated from the DOI string and may not match the provided example exactly.
0.1: Roel Aaij, JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).
0.2: Roel Aaij, Nikhef, Amsterdam et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).
1.0: Albrecht, Cazon, Dembinski, Fedynitch, Kampert, Astrophys.Space Sci. 367 (2022) 3, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5).
1.1: Albrecht et al., Astrophys. Space Sci. 367 (2022) 3, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5).
1.2: Johannes Albrecht, Lorenzo Cazon, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci. 367 (2022) 3, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5).
2.0: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
2.1: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur. Phys. J. C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
2.2: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur. Phys. J. C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
3.0: L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
3.1: 1. L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
3.2: L. Cazon, H.P. Dembinski, G. Parente, F. Riehn, A.A. Watson, PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
4.0: Hans Dembinski, Michael Schmelling, arXiv:2110.00294 [stat.AP], "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).
4.1: Hans Dembinski, Michael Schmelling, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", arXiv:2110.00294 [stat.AP], [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).
4.2: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166).

Discussion

The output is pretty good for such a small model, also considering the modest quality of the input. The LLM correctly identified the relevant pieces: authors, journal information, and the paper title. Which is which is not really obvious from the raw input, the pieces are not clearly labelled. The model uses its knowledge how a name looks like, how a journal reference looks like and so on.

One should note, however, that the output quality varies considerably from run to run, and to get to this quality, several iterations on the prompt were necessary. Reducing the temperature for the model to make it ‘less creative’ seemed to have a positive effect. In a professional context, one would need to find the optimum prompt and temperature by running the model on a large number of test cases.

Mistake analysis

The model ignored the instruction to shorten long author lists, but larger models are able to do that. While the model was able to count authors in isolation, it fails to comprehend rules such as “shorten the author list if it contains more than X names”. It will even claim nonsense like that “three names are more than four”.
The model failed to convert the LaTeX in the title of the first paper correctly, $\sqrt{s} = ...$ just became s = ..., while it should be sqrt(s) = ....
The model rarely makes spelling mistakes when text is copied verbatim. It rather makes higher-level mistakes, like omitting an author completely, than misspelling them. Spelling mistakes do occur, however, see e.g. case 1.1, where “Johanes” is misspelled.
The last reference is special, this is an unpublished preprint. It does not have a DOI, for example. The model nevertheless produced a reasonable entry sometimes, but in other cases it hallucinates DOIs that do not exist.

Failed attempts

One technique to improve the output of a flawed model is to let the model critique its previous output and suggest improvements. I tried this, but this small model is unable to critique itself, it always accepts its own answer even if it does not adhere to the requested format.

Update: Using a distilled DeepSeek-R1 model

DeepSeek-R1 is a new Open Source reasoning model which uses test-time compute to improve its reasoning, like OpenAI’s o1 model. Basically, it has the chain-of-thought prompting technique hard-wired and will generate a thinking process before answering. In several benchmarks it reaches the same performance as o1.

While the full DeepSeek-R1 model with over 600b parameters is too large to be run locally, the authors provide distilled smaller models. I use one of these based on Llama-3.1 with 8b parameters, the same architecture that llama3-chatqa uses. I also tried another version based on Qwen2 with 14b parameters, but it is consistently crashing the Ollama server after a while.

To use the output of this model for our task, we need to skip over the thinking process and keep only the final output. That is easy to do, because the model is trained to always put its thinking in <think> tags.

We expect the model to be smarter in following our requirements, especially the rule about shortening long lists of authors, which requires reasoning.

KEYWORD = "</think>"

for idoc, doc in enumerate(documents):
    # strip the bibliography block
    d = doc[:doc.index("###")]
    prompt = prompt_template.format(text=d)
    for trial in range(3):
        # a low temperate is recommended by the authors
        response = ollama.generate(model='deepseek-r1:8b', prompt=prompt, options={"temperature": 0.3, "seed": trial})
        # skip the thinking part
        raw = response.response
        text = raw[raw.index(KEYWORD) + len(KEYWORD):].strip()    
        print(f"{idoc}.{trial}: {text}")

0.0: Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
0.1: Roel Aaij et al., *JHEP* 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
0.2: Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at √s = 13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
1.0: Johannes Albrecht et al., Astrophys. Space Sci. 367 (2022) 3, 27, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
1.1: Johannes Albrecht et al., Astrophys. Space Sci. 367 (2022) 3, 27, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
1.2: Johannes Albrecht et al., *Astrophys.Space Sci.* 367 (2022) 3, 27, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
2.0: Hans Peter Dembinski and Ahmed Abdelmotteleb, *Eur.Phys.J.C* 82 (2022) 11, 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z)
2.1: Hans Peter Dembinski, Ahmed Abdelmotteleb, *Eur. Phys. J. C* 82 (2022) 11, 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z)
2.2: Hans Peter Dembinski and Ahmed Abdelmotteleb, *Eur. Phys. J. C* 82 (2022) 11, 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z)
3.0: L. Cazon et al., PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431)
3.1: L. Cazon et al., "The muon measurements of Haverah Park and their connection to the muon puzzle," [PoS ICRC2023 (2023) 431](https://doi.org/10.22323/1.444.0431), [10.22323/1.444.0431](//doi.org/10.22323/1.444.0431)
3.2: L. Cazon et al., PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431)
4.0: Dembinski H. and Schmelling M., JHEP 11 (2021) 123, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [10.1007/JHEP11(2021)123](https://doi.org/10.1007/JHEP11(2021)123)
4.1: Hans Dembinski and Michael Schmelling, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [arXiv:2110.00294](https://arxiv.org/abs/2110.00294)
4.2: Hans Dembinski and Michael Schmelling, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [arXiv:2110.00294](https://doi.org/arXiv:2110.00294)

The model is indeed better at counting the authors and appropriately replacing long author lists with the “et al.” form. It even managed once to follow the instruction to convert LaTeX code into a proper text description by replacing \sqrt{s} with √s, which is impressive. Overall, the output is more consistent than with llama3-chatqa, but the model still makes a few minor and major mistakes.

It occasionally produces an invalid URL, omitting the https: prefix. llama3-chatqa never makes that mistake.
It does not always follow the instruction to remove all emphasis markup from the reference. That can be rectified by post-processing in this case.
In one case, it swapped the order of title and journal.
Similar to llama3-chatqa, it usually hallucinates an invalid journal and DOI for the last paper, because it was only released on arXiv, but occasionally it got it right, too.

In conclusion, the distilled deepseek-r1 model performs slightly better at this task, although it is using the same architecture as llama3-chatqa, at the cost of using 6x more compute. It would be interesting to see whether a version that is less quantized performs better, in which case the errors could be attributed to “noise” in the reasoning process, or whether the main issue is the limited attention capability of 8b models, which have fewer attention blocks compared to larger models.