From unstructured to structured: Parsing webpages with a Large Language Model (LLM)

Published

January 7, 2025

In a recent article, I showed how to set up a simple RAG system based on a locally run Large Language Model. I already praised the ollama library there, which makes it very easy to run LLMs locally. In this post, I will let the LLM parse a web page to extract data and return it in a structured format. More specifically, I will read a couple of web pages from InspireHEP about a few scientific papers on which I am a co-author and convert them into a simple Markdown list of references. That is an extract-transform-load (ETL) task in data engineering. Normally, one would write a rigid parser to solve it, but with an LLM we can skip that and just describe the task in human language.

And yes, there are easier ways to solve this particular task: InspireHEP allows one to download information about papers in a machine-readable format (BibTeX and others). The point is that this solution can also be used for other pages that do not offer their data in machine-readable form.
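Just for reference, here is a minimal sketch of how one could fetch BibTeX directly from the InspireHEP REST API. The format=bibtex query parameter is my recollection of the API documentation, so treat this as untested:

import urllib.request

# fetch the BibTeX record for one of the papers used below; the record id
# is the number at the end of the InspireHEP literature URL
recid = "1889335"
url = f"https://inspirehep.net/api/literature/{recid}?format=bibtex"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))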

I will use the llama3-chatqa model with 8b parameters, which is supposed to be good at this task. I got good results from this model after tweaking the prompt a lot. Larger models need less prompt engineering: a better model for this task is command-r with 32b parameters, but I cannot run it efficiently on a GPU with 8 GB of RAM.
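If the model is not yet present locally, you can download it once through the ollama client (this assumes the ollama server is already running):

import ollama

# one-time download of the model weights; later calls reuse the local copy
ollama.pull("llama3-chatqa")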

I don’t use llama-index in this post:

llama-index (like all the other libraries out there) offers lots of paid cloud-based services, but the support for locally running solutions is modest. More resources are allocated to supporting the paid services, for obvious reasons.

Fortunately, we don’t need any functionality from llama-index; we can do this with a few lines of code from scratch. I use playwright to render the HTML, markdownify to convert it to Markdown, and then ollama to generate my responses. You need to install these libraries to run the notebook.
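A minimal setup sketch that installs the three packages from within Python; Playwright additionally needs a browser binary, which its install command downloads:

import subprocess
import sys

# install the libraries used in this post
for pkg in ["ollama", "playwright", "markdownify"]:
    subprocess.run([sys.executable, "-m", "pip", "install", pkg], check=True)

# Playwright drives a real browser, which must be downloaded separately
subprocess.run([sys.executable, "-m", "playwright", "install", "chromium"], check=True)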

Converting dynamic web pages to Markdown

The code for this part was written by ChatGPT. At least on Windows, the Playwright code cannot be run inside a Jupyter notebook, so I had to use a script. The script first produces the dynamically generated HTML via a remote-controlled browser, then converts it to Markdown and saves the Markdown files in the subdirectory scraped.

import subprocess
from pathlib import Path

script = r"""
from playwright.sync_api import sync_playwright
from markdownify import markdownify as md
from pathlib import Path

urls = '''
https://inspirehep.net/literature/1889335
https://inspirehep.net/literature/2512593
https://inspirehep.net/literature/2017107
https://inspirehep.net/literature/2687746
https://inspirehep.net/literature/2727838
'''

urls = [x.strip() for x in urls.split("\n") if x and not x.isspace()]


def scrape_to_markdown(urls, output_dir):
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True, parents=True)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)

        for url in urls:
            # derive a filesystem-safe filename from the URL
            output_fn = (
                url.replace("://", "_").replace("/", "_").replace(".", "_") + ".md"
            )
            ofile = output_dir / output_fn
            page = browser.new_page()

            page.goto(url)

            # Wait for JavaScript-rendered content to load
            page.wait_for_load_state("networkidle")

            rendered_html = page.content()

            page.close()

            markdown_content = md(rendered_html)

            with open(ofile, "w", encoding="utf-8") as file:
                file.write(markdown_content)

            print(f"Saved {ofile!r}")

        browser.close()


scrape_to_markdown(urls, "scraped")
"""

# At least on Windows, Playwright cannot run inside the notebook, so the
# script is written to disk and executed as a separate process; this is
# skipped if the output directory already exists.
if not Path("scraped").exists():
    with open("scrape.py", "w", encoding="utf-8") as f:
        f.write(script)

    subprocess.run(["python", "scrape.py"], check=True)

The content of an example file looks like this:

When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle - INSPIREYou need to enable JavaScript to run this app.From 21 Dec 2023 to 06 Jan 2024 the INSPIRE team works on a reduced schedule and it will take a bit longer than usual to address your requests. Best wishes for the season![INSPIRE Logo](/)literature

* Help
* Submit
* [Login](/user/login)

[Literature](/literature)[Authors](/authors)[Jobs](/jobs)[Seminars](/seminars)[Conferences](/conferences) More... 

When heavy ions meet cosmic rays: potential impact of QGP formation on the muon puzzle
--------------------------------------------------------------------------------------

* [Tanguy Pierog](/authors/1025036)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897)),
* [Sebastian Baur](/authors/1370189)(
  + [Brussels U., IIHE](/institutions/902696)),
* [Hans Dembinski](/authors/1035720)(
  + [Tech. U., Dortmund (main)](/institutions/1088595)),
* [Matías Perlin](/authors/1590214)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897) and
  + [CNEA, Buenos Aires](/institutions/902700)),
* [Ralf Ulrich](/authors/1035845)(
  + [KIT, Karlsruhe, IAP](/institutions/1856897))

Show All(6)20218 pagesPublished in: 

* *PoS* ICRC2021 (2021) 469
Contribution to: 

* [+ ICRC 2021](/conferences/1776906), 469

* Published: 2021
DOI: 

* [10.22323/1.395.0469](//doi.org/10.22323/1.395.0469)
View in: 

* [HAL Science Ouverte](https://hal.science/hal-03373282)
[pdf](https://inspirehep.net/files/294f575f9507555012249728090abc50)citeclaim[reference search](/literature?q=citedby:recid:1928162)[9 citations](/literature?q=refersto:recid:1928162)
### Citations per year

202120222023202432Abstract: (SISSA)The deficit of muons in the simulation of extensive air showers is a long-standing problem and the origin of large uncertainties in the reconstruction of the mass of the high energy primary cosmic rays. Hadronic interaction models, re-tuned after early LHC data, have a more consistent description of the muon content among them but still disagree with data. Collective hadronization due to the formation of a quark gluon plasma (QGP) has already been studied as a possible cause for a larger production of muons under extreme conditions (rare, very central nuclear interactions), but without real success. However, in the view of the most recent LHC data, a collective hadronization phase might not only be limited to such extreme conditions. And because of its different ratio of electromagnetic to hadronic energy, a QGP may have the properties to solve the muon puzzle. This hypothesis is demonstrated using a theoretical approach and tested in a proper way by the modification of hadronic model spectra in CONEX to mimic the production of a QGP also in less extreme conditions with a possible large impact on air shower physics.

* showers: atmosphere
* cosmic radiation: primary
* energy: hadronic
* interaction: model
* energy: high
* muon: production
* nucleus: interaction
* model: hadronic
* quark gluon: plasma
* collective
Show all (16)References(40)Figures(0)

* [1]
  #### [Air Shower Simulation with a New Generation of post-LHC Hadronic Interaction Models in CORSIKA](/literature/1687010)
  
  + [Tanguy Pierog](/authors/1025036)(
    - [KIT, Karlsruhe](/institutions/911469))
  + - *PoS* ICRC2017 (2018) 1100 •
  + DOI: 
    - [10.22323/1.301.1100](//doi.org/10.22323/1.301.1100)edit

[...]

The web page also contains all the references cited by the paper, which are not of interest to us; I skipped that part here. In fact, we can and should cut it away to help the model focus on the relevant piece of text and to not overload its context window.

The converted Markdown contains mistakes where the conversion process garbled the structure of the document. Let’s see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI.

Extracting data from raw text with an LLM

We need to write a good prompt for our model that explains what we want, ideally without ambiguity. Just like humans, LLMs are better at inferring a rule from concrete examples than from an abstract description. In my prompt, I describe the general format and then illustrate it with concrete examples, which seems to work well.

Also just like humans, LLMs pay more attention to context that is nearby. For transformer architectures, there is no fundamental reason for this; it is just something the model learns during training. Therefore, we put the instructions after the data to make sure that our format is followed.

You can read more about prompt engineering elsewhere. I find it amusing that prompt engineering rediscovers good practices from education, which is no surprise: LLMs learned how to ‘think’ from texts written by humans, so they pick up our cognitive biases, too.

Even so, the model sometimes generates garbage because it is so small. To compensate, I run it three times for each document and pick the good results by hand.

import ollama
from pathlib import Path

input_dir = Path("scraped")

documents = [fn.open(encoding="utf-8").read() for fn in input_dir.glob("*.md")]
prompt_template = """
Extract the authors, title, journal info, and DOI from the text in <input> tags.

<input>
{text}
</input>

Return the result in Markdown format in this format:

First and last name of first author, First and last name of second author, ..., journal reference, "The title", [DOI](DOI URL)

Requirements:
- If there are more than four authors, use `First author et al.` instead of listing all authors.
- The journal reference must not contain *italic* or **bold** emphasis.
- The list of authors must be author names only separated by commas.
- Convert LaTeX formulas into equivalent plain text.

Examples that pass the check:
- Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
- Flavia Gesualdi et al., PoS ICRC2021 (2021) 473, "On the muon scale of air showers and its application to the AGASA data", [10.22323/1.395.0473](https://doi.org/10.22323/1.395.0473)

The extracted reference:
"""

for idoc, doc in enumerate(documents):
    # strip the bibliography block; the first "###" heading in the
    # scraped Markdown marks the start of the reference list
    d = doc[:doc.index("###")]
    prompt = prompt_template.format(text=d)
    for trial in range(3):
        # a low temperature seems to make the output more reliable
        response = ollama.generate(model='llama3-chatqa', prompt=prompt, options={"temperature": 0.3, "seed": trial})
        # tiny bit of post-processing: replace newlines with spaces, trim whitespace
        text = response.response.replace('\n', ' ').strip()
        print(f"{idoc}.{trial}: {text}")
0.0: Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
0.1: Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
0.2: Roel Aaij et al., JHEP 01 (2022) 166, "Measurement of prompt charged-particle production in pp collisions at s=13 TeV", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
1.0: Johannes Albrecht, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astropart.Space Sci. 367 (2022) 3, 27, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
1.1: Johanes Albrecht, Lorenzo Cazon, Hans Dembinski, Anatoli Fedynitch, Karl-Heinz Kampert, Astrophys.Space Sci., "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
1.2: Johannes Albrecht et al., Astrophys.Space Sci. 367 (2022) 3, "The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider", [10.1007/s10509-022-04054-5](https://doi.org/10.1007/s10509-022-04054-5)
2.0: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
2.1: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
2.2: Hans Peter Dembinski, Ahmed Abdelmotteleb, Eur.Phys.J.C 82 (2022) 1043, "A new maximum-likelihood method for template fits", [10.1140/epjc/s10052-022-11019-z](https://doi.org/10.1140/epjc/s10052-022-11019-z).
3.0: L. Cazon et al., PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
3.1: L. Cazon et al., PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
3.2: L. Cazon et al., PoS ICRC2023 (2023) 431, "The muon measurements of Haverah Park and their connection to the muon puzzle", [10.22323/1.444.0431](https://doi.org/10.22323/1.444.0431).
4.0: Hans Dembinski, Michael Schmelling, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [arXiv:2110.00294](https://arxiv.org/abs/2110.00294)
4.1: Hans Dembinski, Michael Schmelling, arXiv:2110.00294, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", [10.1007/JHEP01(2022)166](https://doi.org/10.1007/JHEP01(2022)166)
4.2: Hans Dembinski, Michael Schmelling, "Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments", arXiv:2110.00294

Discussion

The output is pretty good for such a small model, especially considering the modest quality of the input. The LLM correctly identified the relevant pieces: authors, journal information, and the paper title. Which is which is not obvious from the raw input, since the pieces are not clearly labelled. The model uses its knowledge of what a name looks like, what a journal reference looks like, and so on.

One should note, however, that the output quality varies considerably from run to run, and several iterations on the prompt were necessary to reach this quality. Reducing the temperature of the model to make it ‘less creative’ seemed to have a positive effect. In a professional context, one would need to find the optimal prompt and temperature by running the model on a large number of test cases.
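As a rough illustration, here is a minimal sketch of such a scan, reusing documents and prompt_template from above. The format check and the temperature grid are my own choices, not something I actually ran:

import re

import ollama

# accept outputs of the form: authors, journal ref, "title", [DOI](DOI URL);
# this crude regular expression is only a stand-in for a proper test suite
REFERENCE_FORMAT = re.compile(r'.+, .+, ".+", \[10\..+\]\(https://doi\.org/10\..+\)')


def success_rate(documents, prompt_template, temperature, n_trials=3):
    n_good = 0
    for doc in documents:
        prompt = prompt_template.format(text=doc[:doc.index("###")])
        for seed in range(n_trials):
            response = ollama.generate(
                model="llama3-chatqa",
                prompt=prompt,
                options={"temperature": temperature, "seed": seed},
            )
            text = response.response.replace("\n", " ").strip()
            if REFERENCE_FORMAT.match(text):
                n_good += 1
    return n_good / (len(documents) * n_trials)


for temperature in (0.0, 0.3, 0.7):
    print(temperature, success_rate(documents, prompt_template, temperature))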

Mistake analysis

  • The model ignored the instruction to shorten long author lists; larger models are able to follow it. While the model was able to count authors in isolation, it fails to comprehend rules such as “shorten the author list if it contains more than X names”. It will even claim nonsense like “three names are more than four”. A deterministic fix is sketched after this list.
  • The model failed to convert the LaTeX in the title of the first paper correctly: $\sqrt{s} = ...$ just became s = ..., while it should be sqrt(s) = ....
  • The model rarely makes spelling mistakes when text is copied verbatim. It is more likely to make higher-level mistakes, like omitting an author completely, than to misspell a name. Spelling mistakes do occur, however; see e.g. case 1.1, where “Johanes” is misspelled.
  • The last reference is special: it is an unpublished preprint and, for example, does not have a DOI. The model nevertheless sometimes produced a reasonable entry, but in other cases it hallucinated DOIs that do not exist; a programmatic check for this is sketched below.
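The author-list rule and the DOI hallucinations can both be handled with deterministic post-processing instead of better prompting. A minimal sketch, assuming the comma-separated output format from above; I use the doi.org handle API for validation, which returns HTTP 404 for unregistered DOIs:

import urllib.error
import urllib.request


def shorten_authors(authors, max_authors=4):
    # apply the "First author et al." rule that the model cannot follow
    if len(authors) > max_authors:
        return f"{authors[0]} et al."
    return ", ".join(authors)


def doi_exists(doi):
    # ask the doi.org handle API whether the DOI is registered
    url = f"https://doi.org/api/handles/{doi}"
    try:
        with urllib.request.urlopen(url):
            return True
    except urllib.error.HTTPError:
        return False


print(shorten_authors(["Tanguy Pierog", "Sebastian Baur", "Hans Dembinski",
                       "Matias Perlin", "Ralf Ulrich"]))
print(doi_exists("10.22323/1.395.0469"))  # DOI from the example page above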

Failed attempts

One technique to improve the output of a flawed model is to let the model critique its previous output and suggest improvements. I tried this, but this small model is unable to critique itself: it always accepts its own answer, even if it does not adhere to the requested format.
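For completeness, this is roughly what that attempt looked like. The critique prompt below is paraphrased from memory, and the snippet reuses the prompt variable from the extraction loop above:

import ollama

critique_template = """
Check whether the reference in <reference> tags follows these rules:
- at most four authors are listed, otherwise `First author et al.` is used
- the title is enclosed in double quotes
- the DOI is formatted as a Markdown link

<reference>
{reference}
</reference>

Answer only PASS or FAIL.
"""

response = ollama.generate(model="llama3-chatqa", prompt=prompt,
                           options={"temperature": 0.3})
reference = response.response.strip()

verdict = ollama.generate(model="llama3-chatqa",
                          prompt=critique_template.format(reference=reference))
# in my tests, the model always answered PASS, even for malformed references
print(verdict.response.strip())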