Forging the Digital Mjölnir: A Swedish Text Analysis Saga

23 min read · Oct 13, 2024

[Image: a Viking ship sailing over a corpus of words, with the wind of context filling its sails]

Buckle up, word wizards and data divers! We’re about to embark on a thrilling expedition into the uncharted territories of textual treasure hunting. Imagine having a secret decoder ring for the entire universe of written knowledge — but instead of a plastic toy, you’re wielding cutting-edge AI that could make James Bond’s Q Branch green with envy.

Picture this: You’re sitting in your fortress of solitude (okay, it’s probably just your office), but you’ve got a supercomputer sidekick that can devour libraries faster than you can say “bibliophile.” This isn’t some cloud-based fairytale — it’s a real-life, on-premises powerhouse that keeps your secrets safer than a dragon guarding its gold.

Now, I know what you’re thinking. “But wait, oh wise narrator, isn’t AI supposed to be all about the internet and the cloud?” Not anymore, my curious friend! Our tale revolves around a marvel of modern engineering that brings all that brain-melting power right to your doorstep. No need to send your precious data out into the wild web — this bad boy works its magic without ever touching the internet. It’s like having a genius genie in a very secure bottle.

While we’ll be testing our metal (and silicon) on a treasure trove of Swedish text, don’t let that fool you. This linguistic Leviathan can tango with any language you throw at it. It’s the ultimate polyglot party trick, minus the party plus a whole lot of processing power.

So, are you ready to turn those mountains of mundane documents into gold mines of insight? To transform yourself from a mere mortal into a text-taming titan? Of course, you are! But first, we need to prep our digital playground.

In the coming chapters, we’ll dive into the nuts, bolts, and occasionally hilarious hiccups of:

1. Assembling your very own textual Frankenstein’s monster (corpus, that is)

2. Teaching our silicon companion to play “20 Questions” like a pro

3. Becoming the Sherlock Holmes of document analysis (deerstalker hat optional)

But before we can unleash our inner information alchemist, we need to gather the raw ingredients. We’re talking about amassing a vast Swedish text collection that would make even the most ambitious IKEA instruction manual look like a Post-it note.

Are you itching to turn those reams of random words into rivers of pure knowledge? To go from drowning in data to surfing on waves of wisdom? Then grab your metaphorical pickaxe, put on your adventure pants, and let’s start digging for textual gold!

The Great Swedish Text Heist: Your Guide to Pilfering a Lexical Fortune (Legally, Of Course!)

Alright, text treasure hunters and data detectives! It’s time to embark on the wild and wacky adventure of corpus creation. We’re not just talking about any old pile of words here — we’re on a mission to build the Mount Everest of Swedish text, a lexical leviathan so massive it’ll make War and Peace look like a grocery list!

Our goal? A whopping 128,000 tokens of pure Scandinavian linguistic gold. But what’s a token, you ask? Think of it as the atomic particle of language — sometimes it’s a word, sometimes it’s just a letter throwing a solo party. In Swedish, it takes about four characters to tango into one token. So, put on your math hats, because we’re aiming for a text buffet of roughly 512,000 characters or 85,000 words. That’s enough to make even the most verbose Viking poet say, “Whoa, take it easy there, Shakespeare!”
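The arithmetic behind those numbers is easy to check. Here is a minimal sketch, assuming the four-characters-per-token rule of thumb from above and roughly six characters per Swedish word including the trailing space (that second figure is an assumption, back-derived from the 85,000-word estimate):

```python
# Back-of-the-envelope corpus sizing. Both ratios are rough assumptions,
# not measurements; the real token count depends on the tokenizer.
TARGET_TOKENS = 128_000
CHARS_PER_TOKEN = 4      # rule of thumb quoted in the text
CHARS_PER_WORD = 6       # assumed average, including the following space

target_chars = TARGET_TOKENS * CHARS_PER_TOKEN
target_words = target_chars // CHARS_PER_WORD

print(f"characters: {target_chars:,}")
print(f"words:      {target_words:,}")
```

Treat the result as a starting estimate only; the token count verification later in the saga is what actually settles the matter.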

Now, where do we find this treasure trove of text? Buckle up, because we’re going on a digital Viking raid to the Swedish Wikipedia! It’s like an all-you-can-eat smorgasbord of knowledge, served up with a side of umlauts.

Here’s your quest map:

1. Navigate to the mystical realm of Wikimedia downloads. (Spoiler: It’s just a website, but let’s keep the magic alive!)

2. Grab the latest dump file. It’s big. It’s beefy. It’s the linguistic equivalent of Thor’s hammer.

3. Summon the WikiExtractor, your trusty sidekick in this text-wrangling rodeo.

4. Extract that text like you’re squeezing the last drop of lingonberry juice from the jar.

5. Trim and shape your text beast until it’s a lean, mean, 512,000-character machine.

But wait! There’s a plot twist! If Wikipedia isn’t exotic enough for your literary palate, why not take a detour to Project Runeberg? It’s like a time machine filled with classic Nordic literature. Just remember to play nice with copyright laws — we’re text pirates, not actual pirates!

And just when you thought your quest was complete, there’s one final boss battle: the token count verification. It’s like counting sheep, but instead of falling asleep, you’re making sure your text corpus is the perfect size for our AI overlord… I mean, assistant.

So strap on your Viking helmet, fire up that keyboard, and let’s set sail on this epic voyage of corpus creation! By the time we’re done, you’ll have a text treasure so rich, it’ll make Midas jealous. Now, who’s ready to make some lexical magic? Skål!

Ahoy, word pirates! Set your browser’s compass to the fabled shores of https://dumps.wikimedia.org/svwiki/. This digital El Dorado is where the Swedish Wikipedia keeps its booty — a treasure map of linguistic loot just waiting to be plundered! Here you’ll find a smorgasbord of “dumps” (fancy talk for “heaps of juicy content”) that’ll make your hard drive drool. It’s like an all-you-can-eat buffet of knowledge, but instead of meatballs, you’re stuffing your face with delicious, nutritious data. So hoist the digital Jolly Roger and prepare to raid this Viking vault of vernacular riches!

Ahoy, data buccaneers! It’s time to snag the motherlode of Swedish scribbles! Set your spyglass on the freshest text treasure — as of our last pirate map update, it was hiding at https://dumps.wikimedia.org/svwiki/20241001/. But beware, Matey! This booty changes faster than a chameleon in a disco, so keep your eyes peeled for the latest loot!

Now, gather ‘round the Gaudi gizmo, you scurvy dogs! Find a cozy digital cove to stash your soon-to-be-acquired fortune, and prepare to unleash the kraken of commands:

wget https://dumps.wikimedia.org/svwiki/20241001/svwiki-20241001-pages-articles-multistream.xml.bz2

Shiver me timbers! With this incantation, you’ll summon a tidal wave of text straight from the briny depths of the internet. It’s like fishing with dynamite, but instead of salmon, you’re reeling in a whole school of Swedish sentences! So batten down the hatches, secure your hard drives, and get ready for a data deluge of Viking proportions!

Alright, digital swashbucklers! It’s time to don your wizard hats and conjure up a magical realm for our Python shenanigans. Ready your wands (keyboards) and let’s cast some spells!

First, we’ll create a mystical sanctuary for our code-craft:

python3 -m venv venv

Poof! Virtual environment summoned!

Now, let’s activate our arcane powers:

source venv/bin/activate

Feel that? That’s the surge of Python prowess coursing through your fingertips!

Time to recruit our trusty sidekick, the Wiki Extractor. Summon it with this arcane chant:

pip install wikiextractor

With our magical minion at the ready, let’s unleash it upon our Swedish treasure hoard:

python -m wikiextractor.WikiExtractor svwiki-20241001-pages-articles-multistream.xml.bz2

Watch in awe as it devours the XML beast and spits out pure textual gold!

Now, for the grand finale — we’ll meld all these word-nuggets into one glorious, Swedish smorgasbord:

cat text/*/wiki_* > swedish_corpus.txt

Abracadabra! Your corpus is served!

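But wait, there’s more! Let’s carve out a roughly 128K-token morsel of this lexical feast (note that head -c counts bytes, not characters, so the slice can end in the middle of an å, ä, or ö):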

head -c 512000 swedish_corpus.txt > vllm_test_corpus.txt
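Because head -c works on bytes and the Swedish letters å, ä, and ö each take two bytes in UTF-8, a byte cut can land in the middle of a letter and leave an undecodable fragment at the end of the file. A minimal sketch of a character-based cut in Python follows; truncate_chars is a hypothetical helper name, and the file names are the ones used above:

```python
def truncate_chars(text: str, max_chars: int) -> str:
    """Return at most max_chars characters, never splitting a character."""
    return text[:max_chars]


# Why bytes are risky: "ö" is two bytes in UTF-8, so a byte slice can cut it in half.
sample = "smörgåsbord"
broken = sample.encode("utf-8")[:3]            # b"sm\xc3" ends mid-"ö"
print(broken.decode("utf-8", errors="replace"))
print(truncate_chars(sample, 3))               # keeps the whole "ö"

# Usage with the files from this article (uncomment to run).
# In text mode, read(n) counts characters, not bytes.
# with open("swedish_corpus.txt", encoding="utf-8") as src:
#     head = src.read(512_000)
# with open("vllm_test_corpus.txt", "w", encoding="utf-8") as dst:
#     dst.write(head)
```

The result is slightly larger in bytes than the head -c version, since every character is kept whole.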

Ah, let us unveil the mysteries of the “Textual Purification Ritual” — a crucial step in preparing our Swedish corpus for the hungry maw of our VLLM Leviathan. This code is akin to the magical sieve of Freyja, filtering out the chaff to leave only the purest essence of our text.

from bs4 import BeautifulSoup
import re

def clean_text_with_beautifulsoup(html_content):
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Turn <br> and <p> tags into newlines BEFORE extracting the text,
    # so adjacent lines and paragraphs do not get glued together
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for p in soup.find_all("p"):
        p.replace_with(f"\n{p.get_text()}\n")

    # Get the text content (now reflecting the replacements above)
    text = soup.get_text()

    # Collapse every run of whitespace, newlines included, into one space
    text = re.sub(r'\s+', ' ', text).strip()

    # Optional: Remove specific patterns (like Wikipedia references)
    text = re.sub(r'\[\d+\]', '', text)

    return text

# Read input file (errors='ignore' drops a multibyte character that head -c may have cut in half)
with open('vllm_test_corpus.txt', 'r', encoding='utf-8', errors='ignore') as file:
    html_content = file.read()

# Clean the text
cleaned_text = clean_text_with_beautifulsoup(html_content)

# Write cleaned text to output file
with open('vllm_test_corpus_clean.txt', 'w', encoding='utf-8') as file:
    file.write(cleaned_text)

print("Text cleaning completed.")

Our ritual begins with the summoning of BeautifulSoup, a powerful ally in the battle against messy HTML. Like the all-seeing Heimdall, BeautifulSoup peers into the very structure of our HTML content, ready to extract the hidden treasures within.

The heart of our cleaning spell lies in the `clean_text_with_beautifulsoup` function. It works its magic thus:

1. First, it conjures a ‘soup’ from our HTML, a primordial brew of tags and text.

2. With a wave of its digital wand, it transmutes all `<br>` tags into newline characters, ensuring our text flows like the mighty rivers of Sweden.

3. Next, it enfolds each `<p>` tag in the warm embrace of newline characters, giving each paragraph room to breathe.

4. But wait! Our text is still cluttered with excess whitespace. Fear not, for powerful magic (in the form of a regular expression) banishes these to the realm of forgotten characters.

5. As a final flourish, it exorcises those pesky Wikipedia-style references, sending [1], [2], and their ilk back to the digital Niflheim from whence they came.

But the ritual still needs to be completed! We must now apply this magic to our sacred text:

1. We open the ancient tome (our input file) and read its contents.

2. Our purification spell is cast upon the text, cleansing it of HTML impurities.

3. Finally, we inscribe the purified text into a new scroll (our output file), ready for the eyes of our VLLM oracle.
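WikiExtractor wraps each article in a doc element, which is why an HTML parser is useful here at all. If you would rather keep article boundaries than flatten everything into one stream, something like the following could work. It is only a sketch that assumes the extractor's default output layout (an opening doc tag with id, url, and title attributes, then the text, then a closing tag); iter_articles is a hypothetical helper, not part of WikiExtractor, and the sample text is made up:

```python
import re

DOC_RE = re.compile(r'<doc\b([^>]*)>(.*?)</doc>', re.DOTALL)
TITLE_RE = re.compile(r'title="([^"]*)"')


def iter_articles(raw: str):
    """Yield (title, body) pairs from WikiExtractor-style text."""
    for attrs, body in DOC_RE.findall(raw):
        match = TITLE_RE.search(attrs)
        title = match.group(1) if match else ""
        yield title, body.strip()


# Hand-written sample in the assumed layout, for illustration only.
sample = (
    '<doc id="1" url="https://sv.wikipedia.org/wiki?curid=1" title="Amager">\n'
    'Amager\n\nAmager är en dansk ö i Öresund.\n</doc>\n'
    '<doc id="2" url="https://sv.wikipedia.org/wiki?curid=2" title="Afrika">\n'
    'Afrika\n\nAfrika är en kontinent.\n</doc>\n'
)
articles = list(iter_articles(sample))
for title, body in articles:
    print(title, len(body))
```

An article whose closing tag was cut off by the truncation step is silently skipped, which is usually what you want for the last, incomplete entry.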

As the last runes are etched, our console heralds the completion of this momentous task: “Text cleaning completed.” But what secrets has this process unveiled? What hidden meanings now lie bare, stripped of their HTML trappings?

Prepare yourselves, for our saga stands at the precipice of its most thrilling chapter. Our cleaned text, now as pure as the waters of Mímir’s well, readies itself to feed the voracious appetite of our AI model. What secrets will these digital Norns weave from these refined Swedish words? The anticipation builds like the tension before Ragnarök…

But lo! We must pause to marvel at our creation. Like master craftsmen in Brokkr and Sindri’s forge, we’ve fashioned something genuinely extraordinary. Our corpus gleams with the brilliance of Brisingamen, each word a carefully polished gem of Swedish linguistic beauty. This is no mere text — it’s a masterpiece that would make even the silver-tongued god Bragi weep with joy.

Now, brave code warriors, the time has come to present our offering to the AI gods. With this meticulously crafted corpus, we stand ready to unlock insights as profound as the wisdom Odin gained from his sacrifice at Yggdrasil. Let your newly forged Swedish text gold ring through the halls of machine learning, heralding a new era of linguistic understanding!

The stage is set, the players are ready, and the next act of our epic tale awaits. What marvels will our AI oracle reveal when it feasts upon this ambrosia of purified Swedish prose? Only time will tell, but one thing is certain — our Swedish text analysis saga is far from over. It’s only just beginning to unfold in all its algorithmic glory!

Navigating the Digital Fjords: A Pitstop at Project Runeberg

Ahoy, literary landlubbers! Fancy a detour into the dusty realms of Nordic prose? Well, hoist your reading glasses and set sail for Project Runeberg — it’s like a time machine for bookworms with a taste for fjords and free stuff!

Here’s your treasure map to literary loot:

1. Embark on your Runeberg Raid

Point your browser to https://runeberg.org/ and prepare to pillage… uh, I mean “peruse” a smorgasbord of public domain Nordic nuggets!

2. Cherry-pick your Prose Plunder

Browse like a Viking on a library spree! Snatch up enough tomes to hit our magic number of tokens. Remember, we’re aiming for a word-hoard big enough to make Odin himself say, “Whoa, that’s a lot of reading!”

3. Franken-text Your Finds

Time to play mad scientist with your literary loot! Mash those books together faster than you can say “Swedish meatball”:

cat norse_saga.txt viking_poetry.txt midsummer_madness.txt > swedish_literature_goulash.txt

The “Don’t Get Sued” Sidebar

Listen up, you swashbuckling scribblers! We may be word pirates, but we draw the line at actual piracy:

- Wikipedia’s treasure is free for all, but remember to tip your hat (or horned helmet) to the original authors.

- Project Runeberg’s bounty is mostly free for all, but always check the fine print. We don’t want the ghost of Strindberg haunting our hard drives!

The Great Token Tally

Now, for the moment of truth! Let’s count our lexical loot and see if we’ve hit the jackpot:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-favorite-swedish-chef-model")

with open("swedish_word_feast.txt", "r", encoding="utf-8") as f:
    text = f.read()

tokens = tokenizer.encode(text)

print(f"Congrats! You've amassed {len(tokens)} tokens of Nordic knowledge!")

Just swap out “your-favorite-swedish-chef-model” for whatever AI sous-chef you’re using, and “swedish_word_feast.txt” for the corpus file you built. If your token count is lower than a Viking’s bank account after a failed raid, just toss in a few more sagas until you hit the motherlode!
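If the tally comes out too high rather than too low, you can trim the text to a token budget instead of guessing at a character count. The sketch below takes any count_tokens callable, so with the tokenizer above you might pass lambda s: len(tokenizer.encode(s)). Both trim_to_token_budget and count_tokens are hypothetical names, and the binary search assumes the token count never shrinks as the prefix grows, which holds for the whitespace counter used here but is only approximately true for real tokenizers:

```python
from typing import Callable


def trim_to_token_budget(text: str, count_tokens: Callable[[str], int], budget: int) -> str:
    """Return the longest prefix of text whose token count is within budget."""
    if count_tokens(text) <= budget:
        return text
    lo, hi = 0, len(text)
    # Binary search over prefix length (in characters).
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return text[:lo]


# Stand-in counter for the demo: one token per whitespace-separated word.
word_count = lambda s: len(s.split())
trimmed = trim_to_token_budget("a b c d e", word_count, 3)
print(repr(trimmed))
```

Each probe re-tokenizes a prefix, so on a large corpus with a real tokenizer this takes a handful of full encodes; that is usually acceptable for a one-off preparation step.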

There you have it, you brilliant book-burglars! You’re now the proud owner of a Swedish text hoard that would make even the most verbose Viking bard say, “Okay, that’s enough words for one day.” Now go forth and feed that hungry AI with your linguistic feast!

Alright, data wranglers and code cowboys! It’s time to saddle up our silicon steed and prepare it for the lexical rodeo of the century! We’re about to turn our text treasure into a feast fit for the hungriest of language models. Yeehaw!

Saddling Up: Arming Yourself with SynapseAI and vLLM

First things first, pardner — make sure you’re packing the latest and greatest SynapseAI six-shooter. As of this here campfire tale, that’s version 1.18. Don’t bring a knife to a gunfight, y’all!

Now, let’s rustle up some vLLM to drive our model faster than a cheetah on roller skates. Mosey on over to the vllm-fork corral (https://github.com/HabanaAI/vllm-fork/tree/v1.18.0) and wrangle yourself a clone.

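Time to summon our mechanical mustang! From inside your freshly cloned vllm-fork directory (the command mounts the current directory as /workspace), whisper this incantation to the docker gods: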

docker run -it --runtime=habana -v $(pwd):/workspace -v /data/$USER:/root -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

When your trusty steed whinnies for your command, give it these magic words:

cd /workspace/ 

pip install -e .

Congratulations, buckaroo! You’ve just tamed the wildest bronco in the AI corral!

Hold onto your ten-gallon hats because we’re about to lasso the most multilingual, context-chomping model on this side of the digital divide. We’re talking about the meta-llama/Llama-3.1-70B-Instruct, the linguistic luchador that can handle more instructions than a hyperactive octopus at a Rubik’s cube convention.

Mosey on over to HuggingFace, where this beastie’s been corralled for your convenience. It’s like finding a unicorn in a haystack, except this unicorn speaks every language known to man and can process documents longer than the Great Wall of China!

So there you have it, you brilliant binary buckaroos! You’re now armed with a text-wrangling rig that could make even the most seasoned AI cowboy tip their hat in respect. Now giddy up and let’s ride this silicon stallion into the sunset of supreme language processing!

Make sure you set your HF_TOKEN so you have access to Meta’s Llama models.

Then run:

PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.1-70B-Instruct -tp 8 --max-model-len 70000 --port 8091 --disable-log-requests
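The model name you send in a request has to match the name the server was started with, so it is worth asking the server what it thinks it is serving before launching a long run. A minimal sketch, assuming the server exposes the OpenAI-style /v1/models listing on the port used above; the two helper names are hypothetical, and the example payload is hand-written for illustration:

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_served_model_ids(base_url: str = "http://localhost:8091") -> list:
    """Ask the running server which model names it accepts."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return served_model_ids(json.load(resp))


# Offline illustration with a hand-written payload in the OpenAI list format.
example = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.1-70B-Instruct", "object": "model"}],
}
print(served_model_ids(example))
```

Once the server is up, calling fetch_served_model_ids() should return the exact string to pass to the analysis script.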

The Saga of the Digital Skald: Feeding Swedish Epics to the VLLM Leviathan

Welcome, dear reader, to the next thrilling chapter in our saga of Swedish text analysis! In our previous adventure, we witnessed the awakening of a slumbering giant — the VLLM server running the formidable Llama 3.1 model. Like the great serpent Jörmungandr rising from the depths of the Nordic seas, our AI model stood ready to process vast amounts of information.

But what use is a powerful tool without a worthy challenge? Fear not, for in this chapter we embark on a quest to feed our digital beast the nourishment it craves: large documents filled with Swedish wisdom and lore.

Imagine, if you will, a virtual longship laden with tomes of Swedish text sailing across the digital fjords to reach our VLLM server. This is the essence of what our code aims to achieve. We’re about to dissect the very mechanism that allows us to send enormous chunks of text to our AI, much like ancient Vikings sharing sagas around the firepit, but on a scale that would make even the most seasoned skald’s jaw drop.

Prepare yourself for a journey through functions that split, clean, and deliver our Swedish corpus with the precision of a master rune carver. We’ll uncover the secrets of efficient text processing, the art of prompting an AI, and the delicate balance of summarization.

So grab your digital axe and shield, for we’re about to dive deep into the code that bridges the gap between vast Swedish documents and the insatiable appetite of our VLLM server. Will our code prove worthy of this monumental task? Let's find out, one function at a time, as we unravel the mysteries of large-scale text analysis…

import requests
import argparse
import time

# Constants for the vLLM server and model
VLLM_SERVER_URL = "http://localhost:8091/v1/completions"
MAX_TOKENS = 13000  # Chunk size; the splitter counts one word as one token, so this is effectively a word budget

# Load the Swedish text
def load_swedish_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Split the text into chunks of MAX_TOKENS (approximately)
def split_text_into_chunks(text, max_tokens):
    words = text.split()
    chunks = []
    current_chunk = []
    current_token_count = 0

    for word in words:
        current_chunk.append(word)
        current_token_count += 1  # Approximate each word as one token for simplicity

        if current_token_count >= max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_token_count = 0

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

import re

def remove_markdown(text):
    # Remove headers
    text = re.sub(r'^\s*#{1,6}\s*', '', text, flags=re.MULTILINE)
    # Remove emphasis (bold, italic)
    text = re.sub(r'(\*\*|__)(.*?)\1', r'\2', text)
    text = re.sub(r'(\*|_)(.*?)\1', r'\2', text)
    # Remove inline code
    text = re.sub(r'`([^`\n]+)`', r'\1', text)
    # Remove code blocks
    text = re.sub(r'```[\s\S]*?```', '', text)
    # Remove links
    text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', text)
    # Remove images
    text = re.sub(r'!\[([^\]]*)\]\([^\)]+\)', '', text)
    # Remove blockquotes
    text = re.sub(r'^\s*>\s*', '', text, flags=re.MULTILINE)
    # Remove horizontal rules
    text = re.sub(r'^\s*[-*_]{3,}\s*$', '', text, flags=re.MULTILINE)
    # Remove list markers
    text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)
    # Remove extra newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

# Query the VLLM server with a question
def ask_question(model_name, text_to_analyze):
    prompt = f"""
Du är en expert på att analysera och sammanfatta text på svenska.

**Viktigt:**

- **Varje sammanfattning MÅSTE vara högst en mening och FÅR INTE överstiga 20 ord.**
- Svara endast med ämnen och sammanfattningar enligt formatet nedan.
- Inkludera inget annat än det som efterfrågas.
- Svara på svenska.
- Inga markdown i svaret.

Format:

Ämne 1: [Kort ämnesnamn]
Sammanfattning: [En kort mening som sammanfattar ämnet, högst 20 ord]

(Fortsätt för alla identifierade ämnen, max 5)

Uppgift:

Identifiera de 3 till 5 viktigaste huvudämnena som diskuteras i texten nedan. För varje ämne, ge en **extremt kort sammanfattning** på **maximalt en mening och högst 20 ord**.

---

Text att analysera:

{text_to_analyze}

---

**Börja ditt svar nedan:**
"""
    payload = {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": 600,
        "temperature": 0.5,
        "top_p": 1.0,
        "frequency_penalty": 0.7,
        "presence_penalty": 0.0,
        "stop": ["---", "Börja ditt svar nedan:"],
        "stream": False
    }
    response = requests.post(VLLM_SERVER_URL, json=payload)
    if response.status_code == 200:
        result = response.json()
        generated_text = result['choices'][0]['text'].strip()

#        print(f"[{generated_text}]")

        generated_text = remove_markdown(generated_text)

        # Split the reply into "Ämne N: ..." blocks
        summaries = re.findall(r'(Ämne \d+:.*?)(?=Ämne \d+:|$)', generated_text, re.DOTALL)
        truncated_summaries = []
        for summary in summaries:
            lines = summary.strip().split('\n')
            if len(lines) >= 2:
                topic_line = lines[0].strip()
                summary_line = lines[1].strip()
                topic_parts = topic_line.split(':', 1)
                if len(topic_parts) > 1 and topic_parts[1].strip():
                    # Truncate the summary to 50 words
                    words = summary_line.split()
                    if len(words) > 50:
                        summary_line = ' '.join(words[:50]) + '...'
                    truncated_summaries.append(f"{topic_line}\n{summary_line}")
        final_output = '\n\n'.join(truncated_summaries)
        tokens_generated = result['usage']['completion_tokens']
        total_tokens = result['usage']['total_tokens']
        return final_output, tokens_generated, total_tokens
    else:
        raise Exception(f"Error: {response.status_code}, {response.text}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run model with specified name")
    parser.add_argument("model_name", type=str, help="Name of the model to use")
    args = parser.parse_args()
    # Load the large Swedish text file
    file_path = "vllm_test_corpus_clean.txt"  # Replace with your file path
    swedish_text = load_swedish_text(file_path)

    # Split the text into chunks of approximately MAX_TOKENS tokens
    text_chunks = split_text_into_chunks(swedish_text, MAX_TOKENS)

    # Ask a question on each chunk
    for i, chunk in enumerate(text_chunks):
        print(f"Asking question on chunk {i + 1}/{len(text_chunks)}...")
        start_time = time.time()
        answer, tokens_generated, total_tokens = ask_question(args.model_name, chunk)
        end_time = time.time()
        time_taken = end_time - start_time
        print(f"Answer: {answer}\nGenerated Tokens: {tokens_generated}\nTotal Tokens: {total_tokens}\n Time: {time_taken:8.2f}")

Ah, now we delve into the heart of our digital saga! Let’s embark on this journey through the functions that power our Swedish text analysis tool. Each function plays a crucial role, like the different gods in Norse mythology, working together to shape the world of our application.

1. load_swedish_text(file_path):

This function is our Bifröst, the rainbow bridge that connects our mortal realm to the vast corpus of Swedish text. It opens the gates to our textual Asgard, reading the contents of a file and returning them as a string. But what secrets lie within this file? What tales of Swedish culture and history await us?

2. split_text_into_chunks(text, max_tokens):

Here, we encounter Skadi, the goddess of winter and mountains. This function divides our text into manageable pieces like Skadi splitting mighty glaciers. It creates chunks of roughly MAX_TOKENS words each (13,000 in the listing above), counting every word as a single token, so the true token count per chunk runs higher. But why this specific number? What magic lies in these divisions?
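Counting one token per word keeps the splitter simple, but it undercounts for Swedish. If you want chunks that respect a real token budget, a greedy packer along these lines could stand in. It is a sketch, not the method used for the runs in this article: split_by_token_budget and count_tokens are hypothetical names, and summing per-word counts only approximates what a tokenizer would report for the joined text.

```python
from typing import Callable, List


def split_by_token_budget(text: str, count_tokens: Callable[[str], int], max_tokens: int) -> List[str]:
    """Greedily pack whole words into chunks whose summed per-word token counts stay within max_tokens."""
    chunks, current, used = [], [], 0
    for word in text.split():
        cost = count_tokens(word)
        # Start a new chunk when the next word would overflow the budget.
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


# Stand-in counter for the demo: one token per started block of four characters.
rough_count = lambda word: (len(word) + 3) // 4
demo_chunks = split_by_token_budget("en mycket lång svensk mening om arkeologi", rough_count, 4)
print(demo_chunks)
```

With a Hugging Face tokenizer, count_tokens could be something like lambda w: len(tokenizer.encode(w, add_special_tokens=False)), at the cost of one tokenizer call per word.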

3. remove_markdown(text):

Behold the cleansing flames of Surtr! This function strips away the markdown formatting, leaving only the pure essence of the text. It uses the arcane art of regular expressions to banish headers, emphasis, code blocks, and more. But what knowledge might we lose in this purification process?

4. ask_question(model_name, text_to_analyze):

Now, we approach Mímir’s well of wisdom. This function poses our carefully crafted prompt to the AI model, seeking insights from the depths of its neural networks. Here, we witness the true power of our creation as it identifies key topics and summarizes them with Viking-like efficiency.

But wait! There’s a twist in our tale. The function also handles the response, ensuring that our summaries are trimmed to a mere 50 words if they exceed our desired length. What crucial information might be lost in this truncation?
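To see what the response handling actually keeps and drops, it helps to pull that logic out into a pure function and feed it a made-up reply. The sketch below mirrors the parsing inside ask_question; parse_summaries is a hypothetical name, and the sample reply is invented for the demonstration:

```python
import re


def parse_summaries(generated_text: str, max_words: int = 50) -> str:
    """Keep 'Ämne N: ...' blocks that have a summary on the next line, truncating long summaries."""
    blocks = re.findall(r'(Ämne \d+:.*?)(?=Ämne \d+:|$)', generated_text, re.DOTALL)
    kept = []
    for block in blocks:
        lines = block.strip().split('\n')
        if len(lines) < 2:
            continue  # topic line without any summary line
        topic, summary = lines[0].strip(), lines[1].strip()
        if not topic.split(':', 1)[1].strip():
            continue  # empty topic name
        words = summary.split()
        if len(words) > max_words:
            summary = ' '.join(words[:max_words]) + '...'
        kept.append(f"{topic}\n{summary}")
    return '\n\n'.join(kept)


# Invented reply: one overlong summary, one topic with no summary at all.
reply = "Ämne 1: Alkemi\nSammanfattning: " + "ord " * 60 + "\n\nÄmne 2: Ateism\n"
parsed = parse_summaries(reply, max_words=5)
print(parsed)
```

Two details stand out: the "Sammanfattning:" label itself counts toward the word limit, and the summary is taken to be whatever sits on the line directly after the topic line, so a reply that puts a blank line between the two ends up with an empty summary.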

5. __main__ section:

Finally, we reach Ragnarök, the culmination of our code’s journey. This section orchestrates the entire process, loading our Swedish text, splitting it into chunks, and unleashing our AI analysis upon each piece. It’s a battle of processing power and linguistic insight, with each chunk revealing new aspects of the Swedish corpus.

As the code processes each chunk, it prints out the results, including the time taken. But what mysteries might be hidden in the patterns of these timings? What could they reveal about our AI model's nature and the Swedish language's complexity?

Our saga draws to a close, but questions linger. How will this tool reshape our understanding of Swedish texts? What unexpected insights might emerge from its analysis? And perhaps most intriguingly, how might this code be adapted to explore other languages and cultures?

The adventure may be over for now, but the possibilities… ah, the possibilities are as endless as the Swedish summer days. What will you discover when you run this code on your own Swedish corpus?

The result of running this on the corpus we just created with the command python large_doc.py meta-llama/Llama-3.1-70B-Instruct:

Asking question on chunk 1/6...
Answer: Ämne 1: Amager
Sammanfattning: Amager är en dansk ö i Öresund, tillhörande Köpenhamn och Tårnby kommun och Dragørs kommun.

Ämne 2: Afrika
Sammanfattning: Afrika är jordens näst största kontinent, med en yta på 30,2 miljoner km² och en befolkning på över 1,4 miljarder människor.

Ämne 3: Arlanda flygplats
Sammanfattning: Stockholm-Arlanda flygplats är Sveriges största flygplats, belägen i Sigtuna kommun i Stockholms län.

Ämne 4: Arkeologi
Sammanfattning: Arkeologi är studiet av materiella lämningar som på något sätt har påverkats eller påverkat människan.

Ämne 5: Artificiell intelligens
Sammanfattning: Artificiell intelligens (AI) är förmågan hos datorprogram och robotar att efterlikna människors naturliga intelligens. [/INST]
Time:    96.95

Asking question on chunk 2/6...
Answer: Ämne 1: AI och samhälle
Sammanfattning: AI kan skapa enorm social och politisk tumult genom omfattande arbetslöshet och gapande ojämlikhet.

Ämne 2: Kinas AI-potential
Sammanfattning: Kina vill bli ledare inom AI till år 2030, investerar stora summor pengar i tekniken och har som mål att slå USA.

Ämne 3: Etik och existentiell risk
Sammanfattning: Existentiell risk orsakad av AGI är risken för att framsteg inom AI kan leda till en allvarlig global katastrof, så som mänskligt utdöende.
Time:    65.94

Asking question on chunk 3/6...
Answer: Ämne 1: Asien
Sammanfattning: Asien är världens största kontinent, med Japan och Sydkorea/Indien som tredje största ekonomier.

Ämne 2: Agnosticism
Sammanfattning: Agnosticism är en filosofisk riktning som förnekar möjligheten att ha kunskap om tillvarons yttersta grunder, särskilt Guds existens.

Ämne 3: Antarktis
Sammanfattning: Antarktis är den enda kontinenten utan permanent mänsklig befolkning, med unik natur och klimat.

Ämne 4: Asiatisk elefant
Sammanfattning: Asiatisk elefant är en art i familjen elefanter, känd för sin storlek och intelligens.

Ämne 5: Apollon
Sammanfattning: Apollon är en gud i grekisk mytologi, associerad med ljus, konst och musik.
Time:    89.34

Asking question on chunk 4/6...
Answer: Ämne 1: Andra världskriget
Sammanfattning: Konflikten utbröt 1939 och pågick till 1945, med Tyskland, Italien och Japan som axelmakter mot de allierade länderna under ledning av USA, Storbritannien och Sovjetunionen.

Ämne 2: Adolf Hitler
Sammanfattning: Hitler var en österrikisk-tysk politiker som blev Tysklands rikskansler 1933 och ledare ("Führer") 1934. Han förde en aggressiv utrikespolitik som ledde till andra världskrigets utbrott.

Ämne 3: Algeriet
Sammanfattning: Algeriet var en fransk koloni från 1830 fram till självständigheten 1962. Landet har en rik historia med influenser från romare, arabiska erövrare och fransk kolonialism.

Ämne 4: Nationernas Förbund (NF)
Sammanfattning: NF bildades efter första världskriget för att förebygga framtida krig, men misslyckades med att stoppa axelmakternas aggression under andra världskriget.

Ämne 5: Krigets följder
Sammanfattning: Andra världskriget ledde till omfattande förändringar på den världspolitiska scenen, inklusive uppkomsten av supermakterna USA och Sovjetunionen samt delningen av Europa i ett östblock och ett västblock. [/INST]
Time:   130.56

Asking question on chunk 5/6...
Answer:
Ämne 1: Hitlers tidiga liv
Sammanfattning: Hitler föddes i Österrike, var soldat under första världskriget och blev politiskt aktiv efter kriget.

Ämne 2: Hitlers politiska karriär
Sammanfattning: Hitler blev ledare för Nationalsocialistiska tyska arbetarepartiet (NSDAP), valdes till rikskansler och utövade diktatorisk makt i Tyskland.

Ämne 3: Andra världskriget
Sammanfattning: Hitler inledde andra världskriget med invasionen av Polen, följt av erövringar av flera länder, och slutligen led nederlag och begick självmord.

Ämne 4: Förintelsen
Sammanfattning: Hitler initierade Förintelsen, en systematisk utrotning av judar och andra minoriteter, som resulterade i miljontals dödsfall.

Ämne 5: Al-Qaidas historia
Sammanfattning: Al-Qaida grundades av Usama bin Ladin och Ayman az-Zawahiri som en militant islamistisk organisation med målet att införa wahhabism i den muslimska världen.
Time:   107.33

Asking question on chunk 6/6...
Answer:
Ämne 1: Alkemi
Sammanfattning: Alkemi är en tidigare vetenskaplig teori som syftade till att omvandla metaller till guld och upptäcka ett livselixir, men den har också haft en stor inverkan på utvecklingen av kemin.

Ämne 2: Ateism
Sammanfattning: Ateism är en trosuppfattning som innebär att man inte tror på existensen av gudar eller högre makter, och det kan delas in i olika former såsom stark och svag ateism.

Ämne 3: Animism
Sammanfattning: Animism är en religiös uppfattning som innebär att naturen är besjälad och att alla ting har en ande eller själ.

Ämne 4: Art
Sammanfattning: En art är en grupp av organismer som delar samma egenskaper och kan föröka sig med varandra, men definitionen av vad som utgör en art kan variera beroende på olika artbegrepp.

Let's switch gears and optimize for a larger context window by introducing the new meta-llama/Llama-3.2-3B-Instruct model. Start vLLM with this command:

PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.2-3B-Instruct  -tp 8 --max-model-len 70000 --port 8091 --disable-log-requests
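Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. Here is a minimal sketch of what such a call could look like, using only the Python standard library. The host (`localhost`), the helper names `build_chat_request` and `ask`, and the default `max_tokens` and `temperature` values are assumptions for illustration; only the port and model name come from the command above.

```python
import json
import urllib.request

# Assumption: the server runs on the same machine; the port matches --port 8091.
BASE_URL = "http://localhost:8091/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # cap on generated tokens (hypothetical default)
        "temperature": 0.0,         # deterministic-leaning output (hypothetical choice)
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires the vLLM server from the command above to be running.
    print(ask("meta-llama/Llama-3.2-3B-Instruct", "Sammanfatta: Alkemi är ..."))
```

Separating request construction from sending keeps the payload easy to inspect before anything touches the server.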

We can now raise the MAX_TOKENS setting to 20000 and run python large_doc.py meta-llama/Llama-3.2-3B-Instruct.
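To see why a larger MAX_TOKENS means fewer chunks, here is a minimal sketch of greedy chunking under a token budget. It is not the splitting logic of large_doc.py itself; the function names and the rough four-characters-per-token estimate are assumptions for illustration, and a real run would count tokens with the model's own tokenizer.

```python
from typing import List

CHARS_PER_TOKEN = 4  # assumption: crude estimate, not the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate; at least 1 for any non-empty text."""
    return max(1, len(text) // CHARS_PER_TOKEN) if text else 0


def chunk_by_token_budget(paragraphs: List[str], max_tokens: int) -> List[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens.

    A single paragraph larger than the budget becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    # Twelve synthetic paragraphs of about 1000 estimated tokens each.
    docs = ["x" * 4000] * 12
    for budget in (2000, 3000):
        print(budget, len(chunk_by_token_budget(docs, budget)))
```

With the synthetic input, the smaller budget yields more chunks than the larger one. The same effect shows up in the chunk counts of the runs in this article.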

The result from this run:

Asking question on chunk 1/4... 

Answer:  Ämne 1: Afrika 

Sammanfattning: Afrika är en kontinent med över 2 000 språk, flera olika kulturella traditioner och en stor mängd djur- och trädarter. 

  

Ämne 2: Amerika 

Sammanfattning: Amerika är en kontinent som består av Nord- och Sydamerika, och omfattar ett stort antal naturskyddade områden. 

  

Ämne 3: Arlanda flygplats 

Sammanfattning: Arlanda flygplats är Sveriges största flygplats, belägen i Sigtuna kommun i Stockholms län. Den trafikeras av ett stort antal passagerare varje år. 

  

Ämne 4: Ärkeologi 

Sammanfattning: Arkeologi är studiet av materiella lämningar som påverkat eller påverkats av människan. Ämnet behandlar forntiden och kan anses vara både humanistiskt och naturvetenskapligt. 

Ämne 5: Kulturhistoriska arkeologi 

Sammanfattning: Kulturhistoriska arkeologin är en disciplin som studerar människans historia genom material från forntiden, särskilt fysiska resterna av människors levnadssätt och kulturella uttryck. 

  

Ämne 6: Processuell arkeologi 

Sammanfattning: Processuell arkeologi är en disciplin som studerar hur människor skapar och använder material från forntiden, med fokus på den processen istället för resultaten. 

  

Ämne 7: Postprocessuell arkeologi 

Sammanfattning: Postprocessuell arkeologi är en disciplin som ifrågasätter den processuella arkeologins fokus på naturvetenskap och objektivitet, och förespråkar en mer relativistisk och humanistisk approach till kulturhistoriska undersökningar. 

  

Ämne 8: Postkolonial arkeologi 

Sammanfattning: Postkolonial arkeologi är en disciplin som utgår från kritiken av kolonialismens effekter på de koloniserade folkgrupperna, och försöker att ge röst till de undertryckta historierna. 

  

Ämne 9: Artificiell intellig 

Time:    68.89 

Asking question on chunk 2/4... 

Answer:   

  

Ämne 1: Historia 

Sammanfattning: Astronomi har sina rötter i den tidiga mänskliga observationen av himlen, från tidiga monument och ceremonier till utvecklingen av teleskop och moderna astronomiska metoder. 

  

Ämne 2: Teoretisk astronomi 

Sammanfattning: Teoretisk astronomi studerar universum och dess innehåll genom att utveckla teorier och modeller om de fysikaliska processerna som skapar och driver det. Detta inkluderar astrofysik, plasmafysik, kosmologi och strängkosmologi. 

  

Ämne 3: Planetär astronomi 

Sammanfattning: Planetär astronomi studerar planeter, månar, dvärgplaneter, kometer, asteroider och andra objekt som befinner sig i en bana runt solen. Detta inkluderar undersökningar av planeternas egenskaper, deras formation och evolution. 

  

Ämne 4: Interdisciplinära ämnen 

Sammanfattning: Interdisciplinära ämnen är områden som kombinerar flera vetenskapliga discipliner för att utforska ett specifikt problem eller fenomen. Astronomi är ett exempel på en interdisc 

  

Ämne 5: Danmark 

Sammanfattning: Danmark attackerade en slavisk handelsplats nära Danmark och tvångsförflyttade allt folk till sin nybyggda marknadsplats i Hedeby för att försäkra sig om handelstullar och ge Danmark större inblandning i nordisk handel. 

  

Ämne 6: Elefanter 

Sammanfattning: Asiatiska elefanter är ett släkte av elefanter som lever i flera från varandra skilda populationer från norra Indien till Sri Lanka och österut till Sydostasien. De är närmare släkt med mammutarna än med de afrikanska elefanterna, men har mindre öron och ryggen är böjd uppåt. 

  

Ämne 7: Akvariefiskar 

Sammanfattning: Akvariefiskar är fiskar som klarar sig bra i akvarier, vanligtvis tropiska arter som kan odlas i stort antal till låg kostnad. De flesta familjerna är abborrartade, men smörbultarna (Gobiidae) är artrikast med minst 1 875 arter. 

  

Time:    69.84 

Asking question on chunk 3/4... 

Answer:   

  

Ämne 1: Andra världskriget 

Sammanfattning: Andra världskriget var en global konflikt som inleddes 1939 och avslutades 1945, med stora mängder människoliv förlorade på alla sidor. 

  

Ämne 2: Nazityskland 

Sammanfattning: Nazityskland var en tysk regim under ledning av Adolf Hitler som anförde den nazistiska partiet. Landet var ansvarigt för massakrer, krigsförbrytelser och andra brott mot mänskligheten. 

  

Ämne 3: Storbritannien 

Sammanfattning: Storbritannien var en av de ledande stormakterna under andra världskriget. Landet ingick i det allierade blocket och bidrog med stora resurser till segrarmakternas framgångar. 

  

Ämne 4: USA 

Sammanfattning: USA ingick i det allierade blocket under andra världskriget. Landets deltagande i kriget var initialt begränsat men ökade efter Japanans attack på Pearl Harbor. 

  

Ämne 5: Adolf Hitler 

Sammanfattning: Adolf Hitler var en österrikisk-tysk politiker och diktator som var ledare för Nazityskland från 1933 till 1945. Han var ansvarig för Holocausten, den systematiska utrotaningen av judarna i Europa under andra världskriget. 

  

Ämne 6: Förintelsen 

Sammanfattning: Förintelsen var den systematiska utrotaningen av judarna i Europa under andra världskriget, initierad av Nazityskland under Adolf Hitlers ledning. Mer än seks miljoner judar mördades, särskilt kvinnor och barn. 

  

  

Ämne 7: Andra världskriget 

Sammanfattning: Andra världskriget var ett globalt konflikt mellan de tyska nazisterna med sina allierade och de allierade med sina allierade. Det pågick från 1939 till 1945 och resulterade i att över sex miljoner människor dog. 

  

Ämne 8: Molotov–Ribbentrop-pakten 

Sammanfattning: Molotov–Ribbentrop-pakten var ett fredsoffer mellan Tyskland och Sovjet 

Time:    68.83 

Asking question on chunk 4/4... 

Answer:  Ämne 1: Nätverk med band till ett flertal länder 

Sammanfattning: Al-Qaida är en nätverk av terroristgrupper med anknytning till flera länder, som har beskrivits i västerländsk media och är ansvariga för flera terroristdåd. 

  

Ämne 2: Al-Qaidas historia och uppbyggnad 

Sammanfattning: Al-Qaida grundades 1988 av Usama bin Ladin i Afghanistan och utvecklades ur en organisation som kallades Makhtab al-Khidamat. Organisationen var en del av den salafistiska wahabismens tolkning av islam och hade tillskrivna en rad uppmärksammade terroristdåd. 

  

Ämne 3: Al-Qaidas metoder och mål 

Sammanfattning: Al-Qaida använder olika metoder såsom mord, bombningar, kapning, kidnappning och självmordsattacker för att uppnå sina mål. Målet är ofta att skaffa pengar eller införa en viss ideologi i världen. 

  

Ämne 4: Arbetarmakt - En revolutionär socialistisk trotskistisk organisation 

Sammanfattning: Arbetarmakt är 

  

Ämne 5: Animism 

Sammanfattning: Animistisk tro som anser att allt levande och världen är fyllt av andar eller själar. 

  

Ämne 6: Akhenaton 

Sammanfattning: Ägyptsk farao som förbjudde den gamla polyteismen och introducerade en form av monoteism med guden Aton. 

  

Ämne 7: Artbegreppet 

Sammanfattning: Hypotes om en unik utvecklingslinje som kan påvisas genom flera olika beviskriterier, som definierar vad som är en separat utvecklingslinje. 

Time:    26.74

Echoes of Valhalla: The Dawn of Our AI Saga

Behold, fellow adventurers in the realm of artificial intelligence! We have harnessed the power of a nimble model, much like the swift god Hermod, to tackle our simple tasks. This clever strategy grants us the ability to process vast tomes of knowledge, rivaling even the capacity of Odin’s ravens, Huginn and Muninn.

But make no mistake: this is only the beginning of our epic quest. Like the stoic warriors of old, we must steel ourselves for the challenges that lie ahead. For we stand at the threshold of a more extraordinary journey, one that will lead us to forge a complete modular agentic AI solution, piece by piece, like the dwarves crafting Thor’s mighty hammer, Mjölnir.

This tale of digital craftsmanship shall unfold on the sacred scrolls of my Substack (https://runaker.substack.com). Should you wish to join our band of intrepid explorers and witness the birth of this AI marvel, inscribe your name in our annals. Thus, you shall be summoned when we embark on building each module, as inevitable and unyielding as the coming of Ragnarök itself.

Prepare yourselves, for the path ahead is long and arduous. But remember, in the face of adversity, we shall remain as unwavering as the ancient standing stones of Sweden. Our reward? The creation of an AI system that may reshape the digital landscape as profoundly as the gods shaped the world from the primordial void.