Interim blog setup, rough notes being dumped here at the moment - I'll tidy once I get to a milestone. And sorry about the layout, that'll probably take even longer to get around to... #TODO

Blog

llama_index SPARQL Notes 24

Published on 2023-09-24 by @danja

 


llama_index SPARQL Notes 22

Published on 2023-09-23 by @danja

cd ~/AI/LIBS-under-dev/llama_index

pytest tests/graph_stores/test_sparql.py > ../testy.md

 


llama_index SPARQL Notes 23

Published on 2023-09-23 by @danja

 


llama_index SPARQL Notes 21

Published on 2023-09-20 by @danja

I've been fixing some of the things GitHub complained about when I did a pull request. Luckily things that ChatGPT can help with.

I've added type annotations to sparql.py, started writing some tests.

added sparqlwrapper to setup.py

ok, ChatGPT got the shape of the returned values wrong (so much for type annotations!) but it was easy to fix.

Next is the more challenging one, test_sparql.py.

Give ChatGPT a zip. Include paths.txt so it gets the imports right.

"I wish to create a pytest test. I will now upload a zip of files. Look at the functions in test_simple.py and their relation to functions in simple.py, remember these. I will then give you further instructions"

"I would like a test file, test_sparql.py that will test all the functions in sparql.py. Much of this will be very similar to test_simple.py as they both follow the same interface, types.py. I would like you to go through every function in sparql.py one at a time and create at test for it. Don't show me any results until this is done. Combine them into test_sparqly.py and give me that. Keep explanation to a minimum. Do not ask me for confirmation, just continue until you have a full response."

much time, many prompts later...

9 errors. I noticed "EndPointNotFound"

 


llama_index SPARQL Notes 20

Published on 2023-09-19 by @danja

I did a pull request on sparql.py etc. last night.

GitHub gave me a message that there weren't any type annotations.

I'll probably do those tomorrow.

I started making up some diagrams to use in docs/slides.

Occurred to me, I should have used / not # in the URIs, so they'll be directly retrievable.

HTML links

https://www.w3.org/TR/rdfa-syntax/

 


llama_index SPARQL Notes 19

Published on 2023-09-18 by @danja

git checkout main
git merge sparql-01

https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging

path issues.

pip install -U llama-index

original llama_index has changes...

git checkout main
git merge sparql-01

bit of fiddling, pushed

export PYTHONPATH=/home/danny/AI/LIBS-under-dev/llama_index

export PYTHONPATH=/home/hkms-apps/llama_index

pip install -e .

python docs/examples/graph_stores/graph-rag-sparql-mini.py

 


llama_index SPARQL Notes 18

Published on 2023-09-17 by @danja

I spent the past few days mostly on admin stuff (and stacking logs).

Prior to making a pull request on llama_index I want to :

I'm going to try to be a little more careful with git this time after a screw-up the other day.

So for a clean slate, I backed up and then deleted my fork of llama_index. Then :

I think the only necessary core files are :

and docs/demos :

 


llama_index SPARQL Notes 17

Published on 2023-09-13 by @danja

I've got a bit distracted trying to set up good Jupyter Notebook hosting on my server. After going round in circles a lot I have settled on using JupyterHub, specifically :

https://github.com/jupyterhub/the-littlest-jupyterhub

It's now running at http://hyperdata.it:8484/user/admin/lab

This provides a lot more than I need for this (it's multi-user), but doesn't appear to make much demand on resources (assuming very few users).

Looking around this took me into auth territory a bit. I do want this set up for other services on my site, so I'm biting the bullet a bit.

LDAP seems the best base for me, JupyterHub has a plugin. More generally, an LDAP-based identity provider would be nice to have.

But before that - how to add the path of my in-progress LlamaIndex bits so the Notebook can see it. PYTHONPATH isn't working. Virtual envs were mentioned in the (quick) install procedure, so probe time -

import sys
import os

print('sys.prefix = '+sys.prefix)
print('\nos.environ = '+str(os.environ))
print('\nVIRTUAL_ENV = '+str(os.environ.get('VIRTUAL_ENV')))
print('\nsys.executable = '+sys.executable)
print('\nsys.path = ')
print(sys.path)

gives

sys.prefix = /opt/tljh/user

os.environ = environ({'LANG': 'en_US.UTF-8', 'PATH': '/opt/tljh/user/bin:/opt/tljh/hub/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', '_WSREP_START_POSITION': '', 'HOME': '/home/jupyter-admin', 'LOGNAME': 'jupyter-admin', 'USER': 'jupyter-admin', 'SHELL': '/bin/bash', 'INVOCATION_ID': '0f6f2c87ddd84b6580a092738a5f2828', 'JOURNAL_STREAM': '8:13724686', 'RUNTIME_DIRECTORY': '/run/jupyter-admin', 'SYSTEMD_EXEC_PID': '89780', 'JPY_API_TOKEN': 'a0b308207fff4b1ba0d634c490dd3f33', 'JUPYTERHUB_ACTIVITY_URL': 'http://127.0.0.1:15001/hub/api/users/admin/activity', 'JUPYTERHUB_API_TOKEN': 'a0b308207fff4b1ba0d634c490dd3f33', 'JUPYTERHUB_API_URL': 'http://127.0.0.1:15001/hub/api', 'JUPYTERHUB_BASE_URL': '/', 'JUPYTERHUB_CLIENT_ID': 'jupyterhub-user-admin', 'JUPYTERHUB_DEFAULT_URL': '/lab', 'JUPYTERHUB_HOST': '', 'JUPYTERHUB_OAUTH_ACCESS_SCOPES': '["access:servers!server=admin/", "access:servers!user=admin"]', 'JUPYTERHUB_OAUTH_CALLBACK_URL': '/user/admin/oauth_callback', 'JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES': '[]', 'JUPYTERHUB_OAUTH_SCOPES': '["access:servers!server=admin/", "access:servers!user=admin"]', 'JUPYTERHUB_SERVER_NAME': '', 'JUPYTERHUB_SERVICE_PREFIX': '/user/admin/', 'JUPYTERHUB_SERVICE_URL': 'http://127.0.0.1:35051/user/admin/', 'JUPYTERHUB_USER': 'admin', 'PYDEVD_USE_FRAME_EVAL': 'NO', 'JPY_SESSION_NAME': '/home/jupyter-admin/graph-rag-sparql-mini.ipynb', 'JPY_PARENT_PID': '89780', 'TERM': 'xterm-color', 'CLICOLOR': '1', 'FORCE_COLOR': '1', 'CLICOLOR_FORCE': '1', 'PAGER': 'cat', 'GIT_PAGER': 'cat', 'MPLBACKEND': 'module://matplotlib_inline.backend_inline'})

VIRTUAL_ENV = None

sys.executable = /opt/tljh/user/bin/python

sys.path =
['/home/jupyter-admin', '/opt/tljh/user/lib/python310.zip', '/opt/tljh/user/lib/python3.10', '/opt/tljh/user/lib/python3.10/lib-dynload', '', '/opt/tljh/user/lib/python3.10/site-packages']

root@localhost:/opt/tljh# grep -r sys.path

...

/opt/tljh/hub/lib/python3.10/site-packages/tljh/jupyterhub_config.py contains

# Load arbitrary .py config files if they exist.
# This is our escape hatch
extra_configs = sorted(glob(os.path.join(CONFIG_DIR, "jupyterhub_config.d", "*.py")))
for ec in extra_configs:
    load_subconfig(ec)

so...

nano /opt/tljh/config/jupyterhub_config.d/extra-path.py

# temp until I've figured out how JupyterHub does sys.path
import sys
sys.path.append("/home/hkms-apps/llama_index")

systemctl restart jupyterhub

No change.

Oh, rights. chmod on that ^, no change.

Runs at another point?

I need to check that everything starts ok on reboot on this server anyway, so reboot time.

14066 triples in Guardians graph

All started ok, no change to sys.path.

Hmm. It doesn't appear to be running in a venv, so better retry

root@localhost:~# export PYTHONPATH=/home/hkms-apps/llama_index
root@localhost:~# echo $PYTHONPATH
/home/hkms-apps/llama_index

systemctl restart jupyterhub


WARNING: The script openai is installed in '/home/jupyter-admin/.local/bin' which is not on PATH.

Just remembered the actual code is under /home/hkms-apps/llama_index/llama_index

 


llama_index SPARQL Notes 16

Published on 2023-09-12 by @danja

/opt/tljh/hub/lib/python3.10/site-packages/tljh/jupyterhub_config.py

is called from

jupyterhub.service

# Template file for JupyterHub systemd service
# Uses simple string.format() for 'templating'
[Unit]
# Traefik must have successfully started *before* we launch JupyterHub
Requires=traefik.service
After=traefik.service

[Service]
User=root
Restart=always
WorkingDirectory=/opt/tljh/state
# Protect bits that are normally shared across the system
PrivateTmp=yes
PrivateDevices=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
Environment=TLJH_INSTALL_PREFIX=/opt/tljh
Environment=PATH=/opt/tljh/hub/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Run upgrade-db before starting, in case Hub version has changed
# This is a no-op when no db exists or no upgrades are needed
ExecStart=/opt/tljh/hub/bin/python3 -m jupyterhub.app -f /opt/tljh/hub/lib/python3.10/site-packages/tljh/jupyterhub_config.py

[Install]
# Start service when system boots
WantedBy=multi-user.target
 


llama_index SPARQL Notes 15

Published on 2023-09-11 by @danja

Add OpenAI keys.

Endpoint : https://fuseki.hyperdata.it/#/dataset/llama_index_sparql-test/

DROP GRAPH <http://purl.org/stuff/guardians>

sudo /usr/local/nebula/scripts/nebula.service start all
cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python graph-rag-sparql-minimal.py

On server for demo.

export PYTHONPATH=$PYTHONPATH:/home/hkms-apps/llama_index

/home/hkms-apps/llama_index

add to nano /etc/profile

 


llama_index SPARQL Notes 14

Published on 2023-09-10 by @danja

Add OpenAI keys.

Endpoint : https://fuseki.hyperdata.it/#/dataset/llama_index_sparql-test/

DROP GRAPH <http://purl.org/stuff/guardians>

sudo /usr/local/nebula/scripts/nebula.service start all
cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python graph-rag-sparql-minimal.py

Fool Danny!

I'd made rels() as a placeholder for get_rel_map() while testing. Forgot to wire it in when running end-to-end.

Connecting that up (and a few little tweaks) got rid of the recursion issue.

Yay!! I might be missing something obvious but it now appears to work!

Now to tidy up, then set up a demo Notebook.

Ew, I got the recursion/depth error again.

Ah, interesting!

I was giving it 2 questions :

The first one the LLM should be able to answer but the RAG data would be no use; for the second, vice versa.

When I removed the cat query, no recursion error. So I guess maybe the LLM was returning TMI.

So I'll leave cats out for now.

Back to setting up a demo.


Grrr. Accidentally stuck a big file in a repo, got in a tangle with github. Headache, time to stop.

 


llama_index SPARQL Notes 13

Published on 2023-09-08 by @danja

So, eliminating loops...

For now at least I reckon it'd be best to prevent loops before triplets go to the store. The simplest thing that should work is when presented with a triplet (subj, rel, obj), check that obj hasn't already been used as a subj.
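
A minimal sketch of that check, assuming a graph_store with the usual upsert_triplet(subj, rel, obj) method; the helper name and the module-level set are just for illustration:

# Sketch of the check described above (not the actual sparql.py code):
# refuse a triplet whose object has already appeared as a subject.
seen_subjects = set()

def upsert_if_no_loop(graph_store, subj: str, rel: str, obj: str) -> bool:
    """Return True if the triplet was stored, False if it would close a loop."""
    if obj in seen_subjects:
        return False
    graph_store.upsert_triplet(subj, rel, obj)
    seen_subjects.add(subj)
    return True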

 


llama_index SPARQL Questions

Published on 2023-09-07 by @danja

Immediate goal is to recreate Wey's Graph RAG demo but using a SPARQL store rather than NebulaGraph.

I'm running it as graph-rag-sparql-minimal.py, with the in-progress sparql.py (rels() will be called from get_rel_map()).

I've got data going to the store something like it should using graph_store.upsert(). Data coming back from calls to graph_store.get_rel_map() that appears to be more or less the right shape. But an issue with each I would appreciate advice on.

  1. Batching data

    Right now I've set up INSERT queries to add triplets (as RDF triples) one at a time to the store. It's really inefficient because each has to be wrapped in a query skeleton and POSTed over HTTP. I imagine some kind of batch handling is already implemented somewhere, but I couldn't find anything I understood. It isn't essential right now, but if there is something, pointers? (A sketch of the kind of batching I mean follows this list.)

  2. Loops in graph data

    This is a breaking issue for me right now. I'm sure I can sort it out, but a little more knowledge should make that quicker :) So at some point the data is fed through llama_index/llama_index/response_synthesizers/tree_summarize.py. I get RecursionError: maximum recursion depth exceeded in comparison, so seems likely there's a loop somewhere. Short term, some kind of check+delete should fix that (algorithm hunt time). But more generally, what is the situation with graphs & loops? Is this an underlying feature of the way Graph RAG operates, or is it just an implementation issue?
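
For what it's worth, the kind of batching meant in question 1 might look something like this: collect many already-serialised triple blocks and wrap them in a single INSERT DATA. The function name, endpoint handling and the er:/BASE prefixes are just assumptions from my own setup, not anything in llama_index:

from SPARQLWrapper import SPARQLWrapper, POST

def insert_batch(endpoint_url: str, graph_uri: str, turtle_blocks: list) -> None:
    """Wrap many already-serialised triple blocks in one INSERT DATA update."""
    body = '\n'.join(turtle_blocks)
    query = (
        'PREFIX er: <http://purl.org/stuff/er#>\n'
        'BASE <http://purl.org/stuff/data>\n'
        f'INSERT DATA {{ GRAPH <{graph_uri}> {{\n{body}\n}} }}'
    )
    client = SPARQLWrapper(endpoint_url)   # the update endpoint
    client.setMethod(POST)
    client.setQuery(query)
    client.query()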


 


llama_index SPARQL Notes 12

Published on 2023-09-07 by @danja

cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python /home/danny/AI/LIBS-under-dev/llama_index/tests/storage/graph_stores/test_sparql.py

That took a while but I now have results like :

{Peter Quill: [
                    'Peter Quill, -[was abducted from Earth]->, as a child',
                    'Peter Quill, -[is leader of]->, Guardians of the Galaxy',<-[are attacked by]-, Adam',

Ah, but as a string. Need to add a loop on the subjs and tweak data accumulation to conform to:

    def get_rel_map(
        self, subjs: Optional[List[str]] = None, depth: int = 2
    ) -> Dict[str, List[List[str]]]:

Hmm, I don't really understand that return shape. And I've overwritten the log that had it...

sudo /usr/local/nebula/scripts/nebula.service start all
cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python graph-rag-nebulagraph-minimal.py

In nebulagraph.py, get_rel_map() :

        # We put rels in a long list for depth>= 1, this is different from
        # SimpleGraphStore.get_rel_map() though.
        # But this makes more sense for multi-hop relation path.

the results in the log look very like what's produced from SPARQL above. Ok, time to try it.

First, at https://fuseki.hyperdata.it/#/dataset/llama_index_sparql-test/query

DROP GRAPH <http://purl.org/stuff/guardians>

little char problem, ChatGPT time

Ah, SPARQL encoding different from Python on Spanish n with wibble

It might now be putting what it should in the store (took a few minutes), but it crashes with :

  File "/home/danny/AI/LIBS-under-dev/llama_index/llama_index/response_synthesizers/tree_summarize.py", line 142, in get_response
    return self.get_response(
           ^^^^^^^^^^^^^^^^^^
  [Previous line repeated 980 more times]
  File "/home/danny/AI/LIBS-under-dev/llama_index/llama_index/response_synthesizers/tree_summarize.py", line 96, in get_response
    summary_template = self._summary_template.partial_format(query_str=query_str)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danny/AI/LIBS-under-dev/llama_index/llama_index/prompts/base.py", line 206, in partial_format
    return SelectorPromptTemplate(
           ^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded in comparison
...
 File "pydantic/class_validators.py", line 337, in pydantic.class_validators._generic_validator_basic.lambda13
 ...

Hmm. A loop or just too big?


PREFIX er: <http://purl.org/stuff/er#>
BASE <http://purl.org/stuff/data>

SELECT DISTINCT ?rel1 ?obj1 ?rel2 ?obj2 WHERE {

GRAPH <http://purl.org/stuff/guardians> {
    ?triplet a er:Triplet ;
        er:subject ?subject ;
        er:property ?property ;
        er:object ?object .

    ?subject er:value "Peter Quill"  .
    ?property er:value ?rel1 .
    ?object er:value ?obj1 .

OPTIONAL {
    ?triplet2 a er:Triplet ;
        er:subject ?object ;
        er:property ?property2 ;
        er:object ?object2 .

    ?property2 er:value ?rel2 .
    ?object2 er:value ?obj2 .
}}}

LIMIT 10

 


llama_index SPARQL Notes 11

Published on 2023-09-06 by @danja

WARNING:llama_index.graph_stores.nebulagraph:s =Peter Quill
WARNING:llama_index.graph_stores.nebulagraph:rel_map =

Ok, I want rel_map to take the subject, Peter Quill, call the SPARQL store and return something in this format :

{'Peter Quill': [ 'Peter Quill, -[would return to the MCU]->, May 2021, <-[would return to the MCU]-, Peter Quill', 'Peter Quill, -[would return to the MCU]->, May 2021', 'Peter Quill, -[was raised by]->, a group of alien thieves and smugglers', 'Peter Quill, -[is leader of]->, Guardians of the Galaxy', 'Peter Quill, -[would return to the MCU]->, May 2021, <-[Gunn reaffirmed]-, Guardians of the Galaxy Vol. 3', ...


Hmm, it takes a list :

def get_rel_map(
    self, subjs: Optional[List[str]] = None, depth: int = 2
) -> Dict[str, List[List[str]]]:

Looping through the list to build the query should work, but there might be a more elegant way. Whatever, start with a single subject.
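
One possibly more elegant option than looping: swap the hard-coded "Peter Quill" literal for ?subj_string and inject a VALUES block, so one query covers the whole list. A tiny sketch (assumes the subject strings are already escaped):

def values_block(subjs):
    # e.g. VALUES ?subj_string { "Peter Quill" "Rocket" }
    quoted = ' '.join(f'"{s}"' for s in subjs)
    return f'VALUES ?subj_string {{ {quoted} }}'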

If I build this up in :

llama_index/tests/storage/graph_stores/test_sparql.py

It'll make a good start to the test.

Probably unnecessary but I've added an `unescape_from_rdf` helper to `sparql.py` to revert the quote escaping that Turtle needed.

cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python /home/danny/AI/LIBS-under-dev/llama_index/tests/storage/graph_stores/test_sparql.py


> urllib.error.HTTPError: HTTP Error 502: Proxy Error

Oops. Too many results? Check server...

That took me a long time, bit fiddly. But now :

results = graph_store.select_triplets('Peter Quill', 10)

is returning :

{'rel': {'type': 'literal', 'value': 'is leader of'}, 'obj': {'type': 'literal', 'value': 'Guardians of the Galaxy'}}
{'rel': {'type': 'literal', 'value': 'is half-human'}, 'obj': {'type': 'literal', 'value': 'half-Celestial'}}
{'rel': {'type': 'literal', 'value': 'was abducted from Earth'}, 'obj': {'type': 'literal', 'value': 'as a child'}}
{'rel': {'type': 'literal', 'value': 'was raised by'}, 'obj': {'type': 'literal', 'value': 'a group of alien thieves and smugglers'}}


Ok, so now I reckon I need SPARQL UNION (and possibly BIND) to get some <-[backwards]- bits.

Break time.

Hmm, I was playing around with the SPARQL, looks like this dataset (populated from `sparql.py`) is missing a few triples.
For now go with https://fuseki.hyperdata.it/#/dataset/llama_index-test/query which came from NebulaGraph.

Ok, this returns some things of the right shape, will do for now :

PREFIX er: <http://purl.org/stuff/er#>

BASE <http://purl.org/stuff/data>

SELECT DISTINCT ?subj ?rel ?obj ?rel2 ?obj2 WHERE {

GRAPH <http://purl.org/stuff/guardians> {
    ?triplet a er:Triplet ;
        er:subject ?subject ;
        er:property ?property ;
        er:object ?object .

    ?subject er:value "Peter Quill"  .
    ?property er:value ?rel .
    ?object er:value ?obj .
OPTIONAL {
        ?triplet2 a er:Triplet ;
        er:subject ?subject2 ;
        er:property ?property2 ;
        er:object ?object2 .

    ?subject2 er:value ?obj .
    ?property2 er:value ?rel2 .
    ?object2 er:value ?obj2 .
}
}

}


**Property paths!** D'oh! I'd forgotten about them. Probably useful here. https://www.w3.org/TR/sparql11-query/#propertypaths

But for now, get suitable output of `rel_map` from results of the above.

**ChatGPT**
Given the following example :

subj = 'Peter Quill'
rels = {'rel': {'type': 'literal', 'value': 'is leader of'}, 'obj': {'type': 'literal', 'value': 'Guardians of the Galaxy'}, 'rel2': {'type': 'literal', 'value': 'cannot heal'}, 'obj2': {'type': 'literal', 'value': 'Rocket'}}
arp = to_arrows(subj, rels)

write the function to_arrows so this will be the value of string arp :

'Peter Quill, -[would return to the MCU]->, May 2021, <-[Gunn reaffirmed]-, Guardians of the Galaxy Vol. 3'
**didn't really help**

Started doing it manually, now too tired. Night night.
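
A rough sketch of what to_arrows could look like for the result shape above, sticking to forward arrows only (untested, my own guess rather than anything ChatGPT produced):

def to_arrows(subj: str, rels: dict) -> str:
    """Build the arrow-notation string from one SPARQL result row (forward hops only)."""
    out = f"{subj}, -[{rels['rel']['value']}]->, {rels['obj']['value']}"
    if 'rel2' in rels and 'obj2' in rels:
        out += f", -[{rels['rel2']['value']}]->, {rels['obj2']['value']}"
    return out

# e.g. with the example above this gives :
# 'Peter Quill, -[is leader of]->, Guardians of the Galaxy, -[cannot heal]->, Rocket'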

---
I've used this (and almost identical in Java etc) _so often_, but have managed to forget :

> Logger.setLevel() specifies the lowest-severity log message a logger will handle, where debug is the lowest built-in severity level and critical is the highest built-in severity. For example, if the severity level is INFO, the logger will handle only INFO, WARNING, ERROR, and CRITICAL messages and will ignore DEBUG messages.

`:cat AI`
`:tag SPARQL`
`:tag LlamaIndex`
 


llama_index SPARQL Notes 10

Published on 2023-09-05 by @danja

Got some funds so I paid some API credit. But have spent the day so far on admin and bits & pieces that were niggling me. And now, 16:47, it's dogwalk time. But I need a bit of fresh air, clear head for sparql.py.

Back from dogwalk.

Credit balance $29.87

sudo /usr/local/nebula/scripts/nebula.service start all
cd ~/AI/nlp/GraphRAG/src
export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index
python graph-rag-nebulagraph-minimal.py

5 or 6 runs with little tweaks in between. OpenAI credit still says $29.87.

No idea why, this returned a rather than text. So pulled that out.

It also ends on an exception from calling -

    def __del__(self) -> None:
        """Close NebulaGraph session pool."""
        self._session_pool.close()

But, before that it says :

Peter Quill is the half-human, half-Celestial leader of the Guardians of the Galaxy...

The log isn't any more useful -

INFO:llama_index.graph_stores.nebulagraph:nebulagraph HERE
INFO:llama_index.graph_stores.simple:simple HERE
INFO:llama_index.graph_stores.sparql:sparql HERE
INFO:__main__:graph-rag-nebulagraph-minimal HERE
INFO:__main__:#### 1.2
INFO:__main__:#### skip 2
INFO:__main__:#### 4
INFO:llama_index.indices.loading:Loading all indices.
INFO:__main__:#### 6.2
INFO:__main__:

Peter Quill is the half-human, half-Celestial leader of the Guardians of the Galaxy...

Hmm, log level..?

Tweaked a bit, noticed it's loading index from ./storage_graph/docstore.json

Oops, I pulled out the code for generating the kg when I was looking at the upload bits without OpenAI

TypeError: can only concatenate str (not "list") to str

ok,

for s in subjs:
    logger.warning('s =' + str(s))

a couple more tweaks later it ran through. Much more interesting logs!

First thing of note :

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:openai:api_version=None data='{"prompt": "Some text is provided below. Given the text, extract up to 10 knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.\\n---------------------\\nExample:Text: Alice is Bob\'s mother.Triplets:\\n(Alice, is mother of, Bob)\\nText: Philz is a coffee shop founded in Berkeley in 1982.\\nTriplets:\\n(Philz, is, coffee shop)\\n(Philz, founded in, Berkeley)\\n(Philz, founded in, 1982)\\n---------------------\\nText: The Guardians travel to Orgocorp\'s headquarters to find the switch\'s override code and save Rocket\'s life.As Rocket lies unconscious, he recalls his past.As a baby raccoon, he was experimented on by the High Evolutionary, who sought to enhance and anthropomorphize animal lifeforms to create an ideal society called Counter-Earth.Rocket befriended his fellow Batch 89 test subjects: the otter Lylla, the walrus Teefs, and the rabbit Floor.The High Evolutionary was impressed by Rocket\'s rapidly growing intelligence but became furious once it exceeded his own.The High Evolutionary used Rocket to perfect his Humanimal creations, then planned to harvest Rocket\'s brain for further research and exterminate the obsolete Batch 89.Rocket freed his friends, but the High Evolutionary killed Lylla.Enraged, Rocket mauled the High Evolutionary, but his henchmen killed Teefs and Floor during a firefight with Rocket, before the latter fled Counter-Earth in a spaceship.In the present, the Ravagers, including an alternate version of Gamora, help the Guardians infiltrate Orgocorp.They retrieve Rocket\'s file but discover that the code was removed, with the likely culprit being Theel, one of the High Evolutionary\'s advisors.The Guardians, along with Gamora, depart for Counter-Earth to find him.They are followed by Ayesha and Adam after the High Evolutionary, their race\'s creator, threatened to wipe out the Sovereign if they fail to retrieve Rocket.The Guardians reach Counter-Earth and are guided to the Ar\\u00eate Laboratories complex.Drax and Mantis remain with Gamora and Rocket as Peter Quill, Groot, and Nebula travel to Ar\\u00eate.Nebula is forced to wait outside by guards as Quill and Groot enter Ar\\u00eate, while Drax tricks Mantis into pursuing Quill\'s group.Gamora saves Rocket from being captured by Adam and the High Evolutionary\'s guard War Pig.Questioned by Quill, the High Evolutionary admits disillusionment with the Animen\'s imperfect society.He destroys Counter-Earth, killing the Humanimals and Ayesha.Ar\\u00eate departs as a spaceship, with Nebula, Drax and Mantis boarding to rescue Quill and Groot, who instead escape Ar\\u00eate with Theel, retrieving the code from his corpse before being rescued by Gamora in their ship.\\nTriplets:\\n", "stream": false, "model": "text-davinci-002", "temperature": 0.0, "max_tokens": 3480}' message='Post details'
DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None

Lots and lots of upserts, up to around line 1000 in nebby.log

line 1020 #### nebulagraph get_rel_map called

WARNING:llama_index.graph_stores.nebulagraph:
#### nebulagraph get_rel_map called
WARNING:llama_index.graph_stores.nebulagraph:s =Peter Quill
WARNING:llama_index.graph_stores.nebulagraph:#### nebulagraph get_flat_rel_map
WARNING:llama_index.graph_stores.nebulagraph:subs = ['Peter Quill']
WARNING:llama_index.graph_stores.nebulagraph:get_flat_rel_map() subjs_param: ['Peter Quill'], query: WITH map{`true`: '-[', `false`: '<-['} AS arrow_l,     map{`true`: ']->', `false`: ']-'} AS arrow_r MATCH (s)-[e:`relationship`*..2]-()   WHERE id(s) IN $subjs WITH id(s) AS subj,[rel IN e |   [  arrow_l[tostring(typeid(rel) > 0)] +      rel.`relationship`+  arrow_r[tostring(typeid(rel) > 0)],  CASE typeid(rel) > 0    WHEN true THEN dst(rel)    WHEN false THEN src(rel)  END  ]] AS rels WITH   subj,  REDUCE(acc = collect(NULL), l in rels | acc + l)    AS flattened_rels RETURN  subj,  REDUCE(acc = subj, l in flattened_rels | acc + ', ' + l )    AS flattened_rels
WARNING:llama_index.graph_stores.nebulagraph:rel_map =
WARNING:llama_index.graph_stores.nebulagraph:s =Peter Quill
WARNING:llama_index.graph_stores.nebulagraph:rel_map =

right, this looks like where the triplets for a subj are got & returned

raw output snipped

I saved that chunk as rel_map.json and let VSCode try to format it - much clearer. Kinda...

PS. formatted a bit by hand

{'Peter Quill': [
    'Peter Quill, -[would return to the MCU]->, May 2021, <-[would return to the MCU]-, Peter Quill',
    'Peter Quill, -[would return to the MCU]->, May 2021',
    'Peter Quill, -[was raised by]->, a group of alien thieves and smugglers',
    'Peter Quill, -[is leader of]->, Guardians of the Galaxy',
    'Peter Quill, -[would return to the MCU]->, May 2021, <-[Gunn reaffirmed]-, Guardians of the Galaxy Vol. 3',
    ...

Ok, a format in a format. I don't know why there are backwards arrows and apparently repetition, but the basic stuff will be straightforward from SPARQL results. I reckon I'll start with simple forward-arrow triples, see if that communicates enough.

oh wait, line 1124 :

'Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:openai:api_version=None data='{"prompt": "Context information from multiple sources is below.\\n---------------------\\nThe following are knowledge sequence in max depth 2 in the form of directed graph like:\\n`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...
...

Break time. Food & flop.

:cat AI :tag SPARQL :tag LlamaIndex

 


Things to Revisit

Published on 2023-09-05 by @danja

reminders


Kingsley

thread -

https://twitter.com/kidehen/status/1677095027246243840

Woot!!!

#SPARQL plugin for #ChatGPT working, and it was so darn easy to make and deploy too, courtesy of our #VirtuosoRDBMS (designed to bring #Web magic to #DataConnectivity & #DBMS realms).

Session Transcript: https://chat.openai.com/share/ff7ec94e-0b0b-4505-b379-d6be7a91d7bf

#SemanticWeb #AI #LinkedData #AGI


https://twitter.com/mark_l_watson

Wowza! I asked ChatGPT with Diagrams Plugin to generate a UML sequence diagram for a SPARQL query to DBPedia: A+ grade

LLMs are far, far, far! from AGI but very useful. ChatGPT and CoPilot are fixtures of my Emacs setup, and web versions including Bard, Midjourney, and Bing+GPT are solidly in my workflow.

 


Neural Net Nuts and Bolts Speculation

Published on 2023-09-05 by @danja

braindump for future me

Chain of Thought

I've had dataflows on my mind this week, tracing through some code and seeing some coincidentally related material, eg. Tweet from Mark Watson :

Wowza! I asked ChatGPT with Diagrams Plugin to generate a UML sequence diagram for a SPARQL query to DBPedia: A+ grade

The other day I had a skim through some AI books I got in the 1990s, mostly curious what ideas were around then that have been forgotten, could maybe be revitalized. One notable takeaway was how ReLU (rectified linear unit, a diode) has since unseated tanh/sigmoid as the activation function of choice.

Seed of an Idea

Looking at a vanilla neural network, single numeric values flow through, getting modified along the way (and are backpropagated, changing weights). For tracking flows and highlighting things, medical diagnosis can use radioactive tracers.

Could something comparable be used with NNs?

At the input layer a value x is given to a node, subsequently each node receives a bunch of values from connected nodes in the previous layer. What if instead a pair of values were passed (x, C), where C is a marker constant, what should come out of a node, what about side effects?

First pass, how about this : the treatment of the x values stays exactly the same as in the vanilla case - but a C is hitching a ride. A selection function at the node picks the C from the input pairs with the highest value from its inputs. This is the value passed along from this node to nodes in the next layer.

The side effect I have in mind is similar to the way weights are adjusted in backprop, that the node takes on a value of C. This could also occur on the backprop phase, so each node holds (Cf, Cb).

Are there any implementation issues I haven't seen? Might this be any use for anything?

To investigate, I guess trying it in toy network code would be the next step.
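
To make that concrete for future me, a toy sketch of one node's forward pass with (x, C) pairs (everything here is made up for illustration; ReLU picked since it came up above, and the backprop side effect of the node taking on C is left out):

def node_forward(inputs, weights, bias=0.0):
    """inputs: list of (x, C) pairs from the previous layer.
    The x values are treated exactly as in a vanilla network;
    the outgoing C is the marker riding on the largest input x."""
    z = sum(w * x for (x, _), w in zip(inputs, weights)) + bias
    y = max(z, 0.0)                                   # ReLU on the x side
    _, c_out = max(inputs, key=lambda pair: pair[0])  # marker from highest-valued input
    return (y, c_out)

# e.g. node_forward([(0.2, 'A'), (0.9, 'B')], [0.5, 0.5]) -> (0.55, 'B')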

There's a kind of bigger picture/generalization aspect to this. What if the values passed around, rather than usual single values or strange pairs as above, are arbitrary data structures? The transfer functions are arbitrary? I'm pretty sure there'll be net designs which pass matrices of real numbers around, I've a feeling there might be calculation performance optimization potential somewhere around there. But I haven't a clue what activation functions would be appropriate...

On the function question, usually differentiability is canon. But in a view from 1km, this is a special case where optimization is done through convex optimization, hill-climbing over the number spaces. Other optimization techniques exist, eg. combinatorial, integer programming.

I've not read much about optimization techniques, apart from bits in papers along the lines of "We found 5 layers gave accuracy within 2%, 6 or more only gave fractional improvement." The relative benefits of different activation functions were looked at a lot in the early days of Deep Learning. But nowadays experiments (I've seen) tend to be looking more at large-scale topologies, with the subunits chosen from known good black boxes (from a box of black boxes?).

I don't know, but perhaps the space of possible network functionality could be explored with a bit of meta-programming, trying different setups as above. It feels like it should be automatable.

To borrow from the Outer Limits :

There is nothing wrong with your neural network. We will control the datatypes. We will control the activation functions.

 


llama_index SPARQL Notes 09

Published on 2023-09-04 by @danja

I lost internet connectivity last night, woke up this morning with no electricity. The electric came on pretty soon - I had been warned ENEL were messing with the village wiring. The net took a trip into town to sort out (billing error, and then they took ages to switch me back on). The money I was expecting hasn't yet arrived at the bank, so still no OpenAI credit. So I spent most of the day staring at the wall/Twitter.

Ok, it seems like I was looking at the wrong place to put the query. Not query() but :

    def get(self, subj: str) -> List[List[str]]:
        """Get triplets."""
        ...

The implementation in nebulagraph.py is scary!

Fool Danny, I forgot to put a logger bit in to check what gets returned there.

In simple.py :

    def get(self, subj: str) -> List[List[str]]:
        """Get triplets."""
        return self._data.graph_dict.get(subj, [])

Wey mentions 'flattened' around get_rel_map(), where the subject is given with a list of associated rel, obj. But this looks less flat.

I'll assume for now, if data =

"one", "two", "three"
"one", "two", "four"
"two", "two", "four"

get('one') will return :

[['one', 'two', 'three'], ['one', 'two', 'four']]

Now I'm confused. I did put a logger into get(), list of lists, lol. Nothing came out.

Ok, I'll make a helper method that acts as above, see where it goes when I can see the behaviour again.

@prefix er: <http://purl.org/stuff/er#> .
@base <http://purl.org/stuff/data> .

Query like :

PREFIX er:  <http://purl.org/stuff/er#>
BASE <http://purl.org/stuff/data>

SELECT DISTINCT ?rel ?obj WHERE {
    GRAPH <http://purl.org/stuff/guardians> {
        ?triplet a er:Triplet ;
            er:subject ?subject ;
            er:property ?property ;
            er:object ?object .

        ?subject er:value ?subj_string .
        ?property er:value ?rel .
        ?object er:value ?obj .
    }
}

Ok, that worked via Fuseki's UI. For code, replace ?subj_string with {subj}. Put it in def sparql_query(self, query_string):

sparqlwrapper will no doubt need help producing what I want, need to RTFM there, results.response somewhere - if it's available as a list of lists that'd be nice.

I'm using GET method, I don't like POST unless really necessary (it's floppy in the spec).
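
A sketch of how get() might look with sparqlwrapper, following the assumed [[subj, rel, obj], ...] shape above; the endpoint URL is a placeholder and this isn't the actual sparql.py code:

from SPARQLWrapper import SPARQLWrapper, GET, JSON

QUERY_ENDPOINT = "https://fuseki.hyperdata.it/llama_index_sparql-test/query"  # placeholder

def get(subj: str) -> list:
    """Return [[subj, rel, obj], ...] for one subject, per the assumed shape above."""
    query = f"""
    PREFIX er: <http://purl.org/stuff/er#>
    BASE <http://purl.org/stuff/data>
    SELECT DISTINCT ?rel ?obj WHERE {{
      GRAPH <http://purl.org/stuff/guardians> {{
        ?triplet a er:Triplet ;
            er:subject ?subject ;
            er:property ?property ;
            er:object ?object .
        ?subject er:value "{subj}" .
        ?property er:value ?rel .
        ?object er:value ?obj .
      }}
    }}"""
    client = SPARQLWrapper(QUERY_ENDPOINT)
    client.setMethod(GET)
    client.setReturnFormat(JSON)
    client.setQuery(query)
    results = client.query().convert()
    return [[subj, b['rel']['value'], b['obj']['value']]
            for b in results['results']['bindings']]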

I must make an endpoint-focused test for this

Wired in to get(), that'll do for today.

 


llama_index SPARQL Notes 08

Published on 2023-09-03 by @danja

Continuing from yesterday, log-probing nebulagraph.py

While adding logger probes, having another look at the functions in nebulagraph.py that I should implement in sparql.py. They don't look complicated in what they do, but the shape of the data structures returned is confusing.

    def client(self) -> Any:
        self._session_pool

Not really sure there, the sparqlwrapper clients aren't reusable, maybe return a generator function? Need to look at the other graph_store implementations.

    def get(self, subj: str) -> List[List[str]]:

I suspect that'll be like SELECT DISTINCT ?p ?o WHERE { <subj> ?p ?o }

https://adamj.eu/tech/2021/07/06/python-type-hints-how-to-use-typing-cast/
    def get_rel_map(
        self, subjs: Optional[List[str]] = None, depth: int = 2
    ) -> Dict[str, List[List[str]]]:
        """Get rel map."""
        # We put rels in a long list for depth>= 1, this is different from
        # SimpleGraphStore.get_rel_map() though.
        # But this makes more sense for multi-hop relation path.

Confusing. But this calls get_flat_rel_map which has a handy comment :

        # The flat means for multi-hop relation path, we could get
        # knowledge like: subj -rel-> obj -rel-> obj <-rel- obj.
        # This type of knowledge is useful for some tasks.
        # +-------------+------------------------------------+
        # | subj        | flattened_rels                     |
        # +-------------+------------------------------------+
        # | "player101" | [95, "player125", 2002, "team204"] |
        # | "player100" | [1997, "team204"]                  |
        # ...
        # +-------------+------------------------------------+
    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:

I think the implementation for NebulaGraph makes this look trickier than it is. A templated SPARQL INSERT block is probably what's needed.
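
A sketch of that templated INSERT idea for upsert_triplet, reusing the er: model and borrowing the ID-generator and escaping ideas from the Notes 05 entry further down this page; the endpoint URL is a placeholder and none of this is the real sparql.py:

import random
import string
from SPARQLWrapper import SPARQLWrapper, POST

ENDPOINT = "https://fuseki.hyperdata.it/llama_index_sparql-test/update"  # placeholder
GRAPH = "http://purl.org/stuff/guardians"

def _new_id(prefix: str) -> str:
    # fragment IDs like E1A2B, same idea as the generator in the Notes 05 entry
    return prefix + ''.join(random.choices(string.ascii_uppercase + string.digits, k=4))

def _escape(s: str) -> str:
    return s.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def upsert_triplet(subj: str, rel: str, obj: str) -> None:
    t, s, r, o = _new_id('T'), _new_id('E'), _new_id('R'), _new_id('E')
    triples = (
        f'<#{t}> a er:Triplet ; er:subject <#{s}> ; er:property <#{r}> ; er:object <#{o}> .\n'
        f'<#{s}> a er:Entity ; er:value "{_escape(subj)}" .\n'
        f'<#{r}> a er:Relationship ; er:value "{_escape(rel)}" .\n'
        f'<#{o}> a er:Entity ; er:value "{_escape(obj)}" .\n'
    )
    update = (
        'PREFIX er: <http://purl.org/stuff/er#>\n'
        'BASE <http://purl.org/stuff/data>\n'
        'INSERT DATA { GRAPH <' + GRAPH + '> {\n' + triples + '} }'
    )
    client = SPARQLWrapper(ENDPOINT)
    client.setMethod(POST)
    client.setQuery(update)
    client.query()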

    def delete(self, subj: str, rel: str, obj: str) -> None:

Same approach as upsert_triplet?

    def query(self, query: str, param_map: Optional[Dict[str, Any]] = {}) -> Any:

Hmm, the param_map looks painful. This function calls :

    def execute(self, query: str, param_map: Optional[Dict[str, Any]] = {}) -> Any:

Not really sure how that operates...

I'm not sure how useful the log statements I've added will be with the data structures, I've only got str(structure), some may well need more unrolling.

Ok, try that (without any calls to OpenAI API) :

Runs ok, nice :

INFO:llama_index.graph_stores.nebulagraph:get_schema() schema:
Node properties: [{'tag': 'entity', 'properties': [('name', 'string')]}]
Edge properties: [{'edge': 'relationship', 'properties': [('relationship', 'string')]}]
Relationships: ['(:entity)-[:relationship]->(:entity)']

Hmm. How best to express the schema in RDF? RDFS is the obvious choice. This current version absolutely won't be a general solution so the schema should be associated with the (guardians) graph. OWL has ontology descriptions..? TODO re-read specs.

Uncommented some of the calls in graph-rag-nebulagraph-minimal.py that depend on OpenAI API, added a few log statements. Added API key. I've probably not included all the blocks of code needed for this to work end-to-end, but give it a go.

OpenAI API credit balance $0.21

Running...

 raise openai.error.AuthenticationError(
openai.error.AuthenticationError: No API key provided. You can set your API key in code using 'openai.api_key = <API-KEY>', or you can set the environment variable OPENAI_API_KEY=<API-KEY>). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = <PATH>'. You can generate API keys in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details.

Strange, I had os.environ["OPENAI_API_KEY"] = "...". Added openai.api_key = "..."

Oops, NameError: name 'openai' is not defined. Did you mean: 'OpenAI'?

import openai

Also uncommented text-to-NebulaGraphCypher bits.

Ok, failed on :

  File "/home/danny/AI/LIBS-under-dev/llama_index/llama_index/graph_stores/nebulagraph.py", line 632, in query
    logger.info('param_map = '+param_map)
                ~~~~~~~~~~~~~~^~~~~~~~~~
TypeError: can only concatenate str (not "dict") to str

log got as far as :

#### nebulagraph query called
INFO:llama_index.graph_stores.nebulagraph:query =
MATCH (p:`entity`)-[:relationship]->(m:`entity`) WHERE p.`entity`.`name` == 'Peter Quill'
RETURN p.`entity`.`name`;

tweaked -

        # logger.info('param_map = '+param_map)
        logger.info('param_map = ')
        for key, value in param_map.items():
            logger.info(key + ' = '+value)

Good-good, at command line got :

python graph-rag-nebulagraph-minimal.py
Graph Store Query:

MATCH (p:`entity`)-[:relationship]->(m:`entity`) WHERE p.`entity`.`name` == 'Peter Quill'
RETURN p.`entity`.`name`;
Graph Store Response:
{'p.entity.name': ['Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill', 'Peter Quill']}
Final Response:

Peter Quill is a character in the Marvel Universe. He is the son of Meredith Quill and Ego, and the half-brother of Gamora. He was raised by his mother on Earth until he was abducted by aliens at the age of ten. He was raised by the Ravagers, a group of space pirates, and eventually became their leader. He is also known as Star-Lord.

Only a little more in log :

#### nebulagraph get_schema called
INFO:llama_index.graph_stores.nebulagraph:get_schema() schema:
Node properties: [{'tag': 'entity', 'properties': [('name', 'string')]}]
Edge properties: [{'edge': 'relationship', 'properties': [('relationship', 'string')]}]
Relationships: ['(:entity)-[:relationship]->(:entity)']

INFO:__main__:#### 5.2
INFO:__main__:#### 5.3
INFO:__main__:#### 6.1
INFO:llama_index.graph_stores.nebulagraph:
#### nebulagraph query called
INFO:llama_index.graph_stores.nebulagraph:query =
MATCH (p:`entity`)-[:relationship]->(m:`entity`) WHERE p.`entity`.`name` == 'Peter Quill'
RETURN p.`entity`.`name`;
INFO:llama_index.graph_stores.nebulagraph:param_map =
INFO:__main__:#### 6.2

Looks like it's first querying for all the subjects, then using those to get all property, objects.

Credit balance $0.20

Boo! VSCode hung when I was trying to rename a file. For a clean slate I rebooted. Browser opened with :

Credit balance $0.15

Commented out the text-to-NebulaGraphCypher bits again.

So next, attacking sparql.py again (last time was with ChatGPT helping, only served to confuse me) :

The bits that use sparqlwrapper I'll put in helper functions so connection can be tested independently.

Had a long break, back to desk, noticed the schema in nebulagraphstore.py appears to be NebulaGraph-specific. Not implemented in simple.py, nothing to suggest it's essential elsewhere (as long as the DB doesn't care).

Should probably do anyhow, just for self-documenting purposes. Couple of relevant terms :

http://www.w3.org/2002/07/owl#Ontology http://www.w3.org/2004/02/skos/core#inScheme

(playing a little fast & loose with domain & range, but I think it's valid)
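
Very roughly, the sort of header I have in mind, as a Turtle string to go in with the graph (just a guess at how those two terms might be used):

# A guess at minimal self-documentation for the graph, using the two terms above.
SCHEMA_HEADER = """
@prefix er:   <http://purl.org/stuff/er#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://purl.org/stuff/guardians> a owl:Ontology .
er:Triplet skos:inScheme <http://purl.org/stuff/guardians> .
er:Entity skos:inScheme <http://purl.org/stuff/guardians> .
er:Relationship skos:inScheme <http://purl.org/stuff/guardians> .
"""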

Started filling out sparql.py. I thought I'd got to the point where it would INSERT DATA to store, instead:

python graph-rag-sparql-minimal.py
[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
<IPython.core.display.Markdown object>

Yay!!!

It appears to be passing triples to SPARQL store ok.

There was a char issue, but I've added escaping. But when I ran it :

openai.error.RateLimitError: You exceeded your current quota, please check your plan and billing details.

I was ready to call it a day anyway. Some money is due in my bank tomorrow, fingers crossed.

The next bit, the query part, should be straightforward, starting with :

SELECT ?s ?p ?o WHERE {
    GRAPH <http://purl.org/stuff/guardians> {
            ?s ?p ?o
        }
}
 


llama_index SPARQL Notes 07

Published on 2023-09-02 by @danja

Ok, time to attack the bits listed at the end of yesterday's post.

Break time. Made some progress, but got a bit distracted starting a master plan. Pretty much the same as I'd already put in the GraphRAG doc, but expanded a bit, more background, hopefully a better explanation for folks that occupy other Venn diagrams.

export PYTHONPATH=$PYTHONPATH:/home/danny/AI/LIBS-under-dev/llama_index

Added extra logging to nebulagraph.py, just markers showing when functions defined in types.py are being called -

logging.basicConfig(filename='loggy.log', filemode='w', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info('nebulagraph HERE')
...
logger.info('#### nebulagraph client(self) called')

After running python graph-rag-nebulagraph-minimal.py, what was in loggy.log (in the dir I ran from) :

INFO:llama_index.graph_stores.nebulagraph:HERE
INFO:llama_index.indices.loading:Loading all indices.
INFO:llama_index.indices.loading:Loading all indices.
INFO:llama_index.graph_stores.nebulagraph:#### get_schema called

Ok, that's a start. For a bit more coverage, I'll do the same to simple.py and the skeletal sparql.py.

INFO:llama_index.graph_stores.nebulagraph:nebulagraph HERE
INFO:llama_index.graph_stores.simple:simple HERE
INFO:llama_index.indices.loading:Loading all indices.
INFO:llama_index.indices.loading:Loading all indices.
INFO:llama_index.graph_stores.nebulagraph:#### nebulagraph get_schema called
 


llama_index SPARQL Notes 06

Published on 2023-09-01 by @danja

grr, forgot again,

sudo /usr/local/nebula/scripts/nebula.service start all

So far nebula-rdf-dump.py is making quasi-Turtle triples from the NebulaGraph data. If I provide some namespace prefixes that'll make it proper: er: for my minimal Entity-Relationship vocab, and the base to complete the instance data URIs.

Easy one, prepend it with :

@prefix er: <http://purl.org/stuff/er#> .
@base <http://purl.org/stuff/data> .

Using f-string formatting in the code, run...
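
i.e. something like (the triples string here is a stand-in for the dumped quasi-Turtle):

triples = '<#E123> a er:Entity ;\n    er:value "one" .\n'  # stand-in for the dumped data

prefixes = (
    '@prefix er: <http://purl.org/stuff/er#> .\n'
    '@base <http://purl.org/stuff/data> .\n'
)
turtle = f'{prefixes}\n{triples}'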

Check :

rapper: Parsing URI file:///home/danny/AI/nlp/GraphRAG/nebula-sparql-utils/guardians.ttl with parser turtle
rapper: Parsing returned 2380 triples

Good-o.

Now to figure out the SPARQL to push the stuff to the store.

Different prefix syntax, https://www.w3.org/TR/sparql11-query/#syntaxTerms

BASE <http://purl.org/stuff/data>
PREFIX er:  <http://purl.org/stuff/er#>

What queries are needed?

https://www.w3.org/TR/sparql11-update

CREATE ( SILENT )? GRAPH IRIref

DROP  ( SILENT )? (GRAPH IRIref | DEFAULT | NAMED | ALL )

INSERT DATA  QuadData

where QuadData are formed by TriplesTemplates, i.e., sets of triple patterns, optionally wrapped into a GRAPH block.

Ok, so here I guess an update call for :

CREATE GRAPH <http://purl.org/stuff/guardians>
{prefixes}
INSERT DATA {
    GRAPH <http://purl.org/stuff/guardians>
       { {triples} } }

ew, need to escape {} ... nah, less thought needed to concatenate with +

Good-oh, sparqlwrapper has an update example https://sparqlwrapper.readthedocs.io/en/latest/main.html#sparql-update-example (with auth)
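
Trimmed to what's needed here, the shape of that example is roughly this; the update endpoint URL and credentials are placeholders, and whether Fuseki wants auth on this path is an assumption:

from SPARQLWrapper import SPARQLWrapper, POST, BASIC

update_string = 'CREATE GRAPH <http://purl.org/stuff/guardians>'  # or the INSERT DATA built above

client = SPARQLWrapper("https://fuseki.hyperdata.it/llama_index-test/update")  # placeholder URL
client.setHTTPAuth(BASIC)
client.setCredentials("admin", "password")  # placeholders
client.setMethod(POST)
client.setQuery(update_string)
results = client.query()
print(results.response.read())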

danny@danny-desktop:~/AI/nlp/GraphRAG/nebula-sparql-utils$ python nebula-rdf-dump.py
Update succeeded

Update succeeded

Looking good so far...check data at

https://fuseki.hyperdata.it/#/dataset/llama_index-test/query

SELECT ?s ?p ?o WHERE {
    GRAPH <http://purl.org/stuff/guardians> {
            ?s ?p ?o
        }
}
LIMIT 50

Nothing!!!

Tried the SPARQL as dumped to file, then SELECT as above - looks fine.

DROP GRAPH <http://purl.org/stuff/guardians>

check code around sparqlwrapper...

Hmm, can you reuse clients (create graph & insert)? Apparently not. Added code to make a new one - it worked!

milestone reached!

Tired, but looking at next steps -

from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore

...

%pip install nebula3-python ipython-ngql

os.environ['NEBULA_USER'] = "root"
os.environ['NEBULA_PASSWORD'] = "nebula" # default password
os.environ['NEBULA_ADDRESS'] = "127.0.0.1:9669" # assumed we have NebulaGraph installed locally

space_name = "guardians"
edge_types, rel_prop_names = ["relationship"], ["relationship"] # default, could be omit if create from an empty kg
tags = ["entity"] # default, could be omit if create from an empty kg

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

Looks like most of what StorageContext does is done through the graph store implementations, eg. SimpleGraphStore


Ok, I reckon next, going back to putting tracer logging in NebulaGraphStore

  1. pull out a bit of code from Notebook that uses nebulagraphstore.py but not the OpenAI API
  2. make sure it works!
  3. point python at my dev llama_index
  4. add logging points to dev llama_index nebulagraphstore.py

currently at :

~/.local/lib/python3.11/site-packages/llama_index

noo...that has a single subdir readers

Ok, this may or may not work - a symlink

 


llama_index SPARQL Notes 05

Published on 2023-08-31 by @danja

Grr...next admin session I must set up systemd

sudo /usr/local/nebula/scripts/nebula.service start all

for ChatGPT :

I'd like a function to remove duplicates from a Python json structure. For example, give the following :
[
{'s': 'production on Vol.3', 'p': 'was put on hold until', 'o': 'February 2021'
},
{'s': 'production on Vol.3', 'p': 'put on hold until', 'o': 'February 2021'
},
{'s': 'production on Vol.3', 'p': 'was put on hold until', 'o': 'February 2021'
}
]
the function should return :
[
{'s': 'production on Vol.3', 'p': 'was put on hold until', 'o': 'February 2021'
},
{'s': 'production on Vol.3', 'p': 'put on hold until', 'o': 'February 2021'
}
]

It got it right first time!
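
Presumably something along these lines (my reconstruction, order-preserving, treating dicts with identical contents as duplicates):

import json

def remove_duplicates(items):
    """Remove duplicate dicts from a list, keeping first occurrences in order."""
    seen = set()
    out = []
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out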

So, next to flip this JSON into RDF/SPARQL.

Not 100% sure, but I don't think I actually need the extract_entities function, anything useful will also appear in extract_relationships and the types/roles are implied by the triplets.

sanitize

def escape_for_rdf(input_str) - thank you ChatGPT.
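
Roughly this kind of thing (my reconstruction, not the exact function ChatGPT gave):

def escape_for_rdf(input_str: str) -> str:
    """Escape a string for use as a Turtle/SPARQL double-quoted literal."""
    return (input_str.replace('\\', '\\\\')
                     .replace('"', '\\"')
                     .replace('\n', '\\n')
                     .replace('\r', '\\r')
                     .replace('\t', '\\t'))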

Given the following JSON, how would I retrieve the values of s, p, o? [ {'s': "Industrial Light & Magic's StageCraft", 'p': 'was developed for', 'o': 'Disney+ Star Wars series The Mandalorian' } ]

took a little backwards & forwards, but we got there.

I'm really not sure what level of granularity will eventually be needed, for big lumps of data something else from the SPARQL Protocol would probably be better (like a full-on HTTP PUT). Per-triple DELETE/INSERT might be desirable elsewhere. But here it shouldn't get too big, the INSERT approach should work, and would also work per-triple.

The NebulaGraph notion of a Space appears very similar to RDF/SPARQL Named Graphs, so I'll go with that (more information, and things can get messy doing everything in the default graph).

This is what I came up with the other day :

# Simple Entity-Relation

@base <http://purl.org/stuff/data> .
@prefix er: <http://purl.org/stuff/er> .

<#T123> a er:Triplet ;
er:id "#T123" ;
er:subject <#E123> ;
er:property <#R456> ;
er:object <#E567> .

<#E123> a er:Entity ;
er:value "one" .

<#R456> a er:Relationship ;
er:value "two" .

<#E567> a er:Entity ;
er:value "three" .

I opted for URL-named resources rather than blank nodes or other IRIs because soon it may help with sanity-checking, and further down the line the potential for HTTP retrieval is nice to have. There are a lot of annotation/meta triples that could be added, but I reckon this is about the minimum necessary to fulfil the Wey Gu Notebook-with-SPARQL use case.

I guess I need a fragment ID generator. Suboptimal but so it's not too hard on the eyes for now, I'll go with alpha(capitals)-numeric. How many chars?

Please write a function to generate a random 4-character string using only numeric characters and capital letters.

Ok.
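
i.e. something like:

import random
import string

def random_fragment(length: int = 4) -> str:
    """Random ID from capital letters and digits, e.g. 'K8WH'."""
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))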

Need to keep identity of entities/relationships. I'm not sure what should happen if the same string is pulled out more than once as entity and/or relationship from different contexts. Should they have different IDs (URLs)? I don't fancy a deep dive into llama_index structures right now, I'll wait for someone to tell me. Whatever, giving identical strings the same URL shouldn't make a huge difference either way.

Strings appearing both in the role of Relationship & Entity might also be worth thinking about. But again, for now, no big deal.

Ran current version -

                        <#THKOE> a er:Triplet ;
                                er:subject <#EK8WH> ;
                                er:property <#RJSJV> ;
                                er:object <#ELD8T> .

                        <#EK8WH> a er:Entity ;
                                er:value "Vol.3" .

                        <#RJSJV> a er:Relationship ;
                                er:value "is the first MCU film to feature" .

                        <#ELD8T> a er:Entity ;
                                er:value "the word  fuck  uncensored" .

Progress!

 


llama_index SPARQL Notes 04

Published on 2023-08-30 by @danja

Continuing from yesterday on getting data from a NebulaGraph store and putting it into a SPARQL store.

I was able to get a condensed JSON rendition of the Entity data from NebulaGraph, now to do the same with Relationships

USE guardians;
MATCH (src:entity)-[e:relationship]->(dst:entity)
RETURN src, e, dst
LIMIT 10;

ChatGPT time -

The task will be to create a function to extend an existing Python file. The purpose of this is to take data of a specific shape from a NebulaGraph DB via the nebula3 API and save it in a condensed form as JSON text. I will first give you the nGQL statements that were used to create the NebulaGraph space, then upload a the file, then describe the function I would like to add to it. Here are the nGQL statements :

CREATE SPACE guardians(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
:sleep 10;
USE guardians;
CREATE TAG entity(name string);
CREATE EDGE relationship(relationship string);
:sleep 10;
CREATE TAG INDEX entity_index ON entity(name(256));

In the file there is the API call : resp = client.execute_json('MATCH (v:entity) RETURN v') I will add : resp_rel = client.execute_json('MATCH (src:entity)-[e:relationship]->(dst:entity) RETURN src, e, dst') json_rel_str = resp_rel.decode('utf-8') Please write a function extract_relationships(json_rel_str) following the pattern of extract_entities(json_data)

What it gave me looks close, but the output was :

    {'src': 'N/A', 'relationship': 'N/A', 'dst': 'N/A'
    },
    {'src': 'N/A', 'relationship': 'N/A', 'dst': 'N/A'
    },
...

I will upload a file containing a single result from the database. I would like the data extracted from this by the function extract_relationships to be: [{"src": "production on Vol.3", "rel": "relationship", "dst": "February 2021"}] please modify the function accordingly

ChatGPT: It appears that I forgot to request the file upload. Could you please upload the file containing a single result from the database? This will help me modify the extract_relationships function to meet your requirements.

We went around in circles a lot. ChatGPT's mapping was repeatedly out, I hadn't looked closely enough at the NebulaGraph results.

But finally I ran with :

resp_rel = client.execute_json(
    'MATCH (src:entity)-[e:relationship]->(dst:entity) RETURN src, e, dst LIMIT 1')

effectively making the output like :

[{"s": "production on Vol.3", "p": "was put on hold until", "o": "February 2021"}]

Yes! There are a lot of duplicates in the output but I'll just hack that for now. But I think I've done the unfamiliar bits now, the RDF/SPARQL side should be straightforward from here.

Enough for today.

This would probably have been much easier with JSON-LD. But since I started down this path...

 


ejabberd Notes

Published on 2023-08-30 by @danja

danbri mentioned XMPP (for ChatGPT) yesterday, it appears I have ejabberd installed. Is it running?

Browser-based client :

https://conversejs.org/docs/html/quickstart.html

Pro-tip, if you just want to quickly test things locally, you can run make serve inside a checkout of the Converse repo. Converse is then hosted at http://localhost:8000

} from '../../../node_modules/lit/html.js'; // } from './html.js'; danny

cp node_modules/lit/html.js src/headless/utils/

grr, hateful thing, given up.

Installed gajim (via synaptic) instead

no immediate joy (where to config servers?)

Installed pidgin

isn't connecting - the Enabled box for the account unchecks itself

Active: activating (start) since Wed 2023-08-30 11:13:37 UTC; 3min 22s ago
...
Aug 30 11:15:27 localhost sh[443214]: 2023-08-30 11:15:27.334892+00:00 [error] Cannot ping watchdog: {badmatch,{error,eacces}}

looks like the install is old, I think done from a .deb. So,

Unpacking ejabberd (23.04-1) over (21.12-1) ...
...
Could not execute systemctl:  at /usr/bin/deb-systemd-invoke line 142.

Oops!

Grrr, Cannot ping watchdog again.

Google...

https://askubuntu.com/questions/1411679/ubuntu-22-04-ejabberd-apparmour-profile-broken

[Unit]
Description=A distributed, fault-tolerant Jabber/XMPP server
Documentation=https://www.process-one.net/en/ejabberd/docs/
After=epmd.service network.target
Requires=epmd.service

[Service]
Type=forking
User=ejabberd
Group=ejabberd
LimitNOFILE=65536
Restart=on-failure
RestartSec=5
ExecStart=/bin/sh -c '/usr/sbin/ejabberdctl start && /usr/sbin/ejabberdctl started'
ExecStop=/bin/sh -c '/usr/sbin/ejabberdctl stop && /usr/sbin/ejabberdctl stopped'
ExecReload=/bin/sh -c '/usr/sbin/ejabberdctl reload_config'
PrivateTmp=true
ProtectHome=true
ProtectSystem=full
TimeoutSec=300

[Install]
WantedBy=multi-user.target

Active: active (running) since Wed 2023-08-30 11:47:14 UTC; 33s ago

Yay!

From pidgin :

danja@hyperdata.it/ disconnected
Host Unknown

Port problem? Whatever, enough of that for now.


Hah! The second I returned to Lemmy setup, I saw my notes from when I installed ejabberd last time :

2022-10-27

installed ejabberd

Creating config file /etc/default/ejabberd with new version

Creating config file /etc/ejabberd/ejabberd.yml with new version
Created symlink /etc/systemd/system/multi-user.target.wants/ejabberd.service → /lib/systemd/system/ejabberd.service.

admin@hyperdata.it

/etc/ejabberd/ejabberd.yml

ejabberdctl reload_config

ejabberdctl register admin hyperdata.it [password]

https://hyperdata.it:5280/admin/

ejabberdctl register danny hyperdata.it [password]
ejabberdctl register device hyperdata.it [password]
ejabberdctl register marinella hyperdata.it [password]

locally
mosquitto_sub -u danny@hyperdata.it -P  [password] -t "test/1" -d -v
new terminal
mosquitto_pub -u device@hyperdata.it -P  [password] -t "test/1" -d -m "ABC"

...Whatever, enough of that for now.

 


llama_index SPARQL Notes 03

Published on 2023-08-29 by @danja

Today

I'm down to $0.23 OpenAI API credit, so until I next have $ I need to look at things that don't need it. sparql.py doesn't in itself need the OpenAI API, but a SPARQLy version of Wey Gu's Notebook is my target functional E2E test.

I still have NebulaGraph data generated from Wey's Notebook. I can use nebula-python to pull out data from there, RDFLib to build RDF, sparqlwrapper to push to store. The SPARQL needed will be essentially the same as for sparql.py. Also NebulaGraph <=> RDF utils would be nice to have (may already exist, but I'll pretend I didn't consider that, need to inform myself).
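For the RDF/SPARQL end, something along these lines is what I have in mind - a minimal sketch, assuming a writable SPARQL update endpoint; the endpoint URL and er: URIs here are placeholders, and the er: model is flattened (literals directly on the triplet rather than separate er:Entity nodes) just to show the mechanics :

from rdflib import Graph, Literal, Namespace, URIRef
from SPARQLWrapper import SPARQLWrapper, POST

ER = Namespace('http://purl.org/stuff/er')               # placeholder namespace
DATA = 'http://purl.org/stuff/data#'
ENDPOINT = 'http://localhost:3030/llama_index/update'    # hypothetical Fuseki endpoint

def push_triplet(triplet_id, subj, rel, obj):
    # Build one er:Triplet-shaped set of statements with RDFLib,
    # then POST it to the store as a SPARQL INSERT DATA update.
    g = Graph()
    t = URIRef(DATA + triplet_id)
    g.add((t, ER['subject'], Literal(subj)))
    g.add((t, ER['property'], Literal(rel)))
    g.add((t, ER['object'], Literal(obj)))
    ntriples = g.serialize(format='nt')   # str in rdflib 6+, bytes in older versions
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setMethod(POST)
    sparql.setQuery('INSERT DATA { %s }' % ntriples)
    sparql.query()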

Then -


MOVE PATH FROM llama_index INSTALLED TO DEV TREE

sudo /usr/local/nebula/scripts/nebula.service start all

Started with :

resp = client.execute_json('MATCH (v:entity) RETURN v')

After converting this to string there were character encoding errors. Should really be investigated properly, but for now I'll just go with some quick & dirty sanitization (valid RDF literals will probably need some escaping anyhow).

json_str = resp.decode('utf-8') appears to have fixed the encoding issues.
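The quick & dirty sanitization will probably end up something like this - just a placeholder (the function name is mine), not a proper N-Triples escaper :

def clean_literal(value: str) -> str:
    # Drop non-printable characters, then escape the two characters
    # Turtle/SPARQL string literals care most about.
    value = ''.join(ch for ch in value if ch.isprintable())
    return value.replace('\\', '\\\\').replace('"', '\\"')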

 


Lemmy

Published on 2023-08-29 by @danja

https://github.com/LemmyNet

recommends Ansible

I opted for the full version -

ew, initial bits painful, found the docs confusing, maybe they assume more prior knowledge than I have (I've only used Docker a bit, never used Ansible). More worrying, I haven't a clue what it's stuck on my server.

2023-08-30

The Ansible thing did look like it'd mostly run ok, but things definitely fell apart around nginx. The very likely cause is a host:port clash, as I already have an Apache server in front on hyperdata.it/127.0.0.1 :80 and :443

https://www.geeksforgeeks.org/changing-nginx-port-in-linux/

I changed the default config yesterday, but just looked again, there's actually :

/etc/nginx/sites-available/default
/etc/nginx/sites-available/hyperdata.it.conf
/etc/nginx/sites-enabled/default
/etc/nginx/sites-enabled/hyperdata.it.conf

So I've deleted the /etc/nginx/sites-enabled/default symlink and changed hyperdata.it.conf to use port 8421

I'm pretty sure I don't need https/443 on nginx - I've got Apache doing that, proxying to Fuseki etc. on plain http on different ports. So I've deleted that section from hyperdata.it.conf

Active: active (running) since Wed 2023-08-30 09:27:02 UTC; 12s ago

Yay!

Hmm. http://hyperdata.it:8421 just redirects to https://hyperdata.it/

On server,

Currently reorganising the server. I'll put my FOAF/Solid here soon, for now my contact details should be at : https://hyperdata.it/ note to self : this is /opt/lampp/htdocs/profile/index.html

Aha! That's my placeholder for https://danny.ayers.name/, configured in Apache.

Where are my notes on Apache setup..?

~/HKMS/server-setup/ apparently - TODO make that easier to find

distraction danbri mentioned XMPP (for ChatGPT) yesterday, it appears I have ejabberd installed. Is it running? see ejabberd Notes

https://lemmy.hyperdata.it

http://lemmy.hyperdata.it http://lemmy.hyperdata.it:8421 https://lemmy.hyperdata.it:8421 http://hyperdata.it:8421 https://hyperdata.it:8421

lynx http://127.0.0.1:8421

root@localhost:~/temp# wget http://127.0.0.1:8421
--2023-09-01 11:19:35--  http://127.0.0.1:8421/
Connecting to 127.0.0.1:8421... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://127.0.0.1/ [following]
--2023-09-01 11:19:35--  https://127.0.0.1/
Connecting to 127.0.0.1:443... connected.
    ERROR: certificate common name ‘danny.ayers.name’ doesn't match requested host name ‘127.0.0.1’.
To connect to 127.0.0.1 insecurely, use `--no-check-certificate'.

The only 301 that looked relevant was in :

nano /etc/nginx/sites-available/hyperdata.it.conf

so I removed that block

systemctl restart nginx.service

https://lemmy.hyperdata.it/ seems to redirect to http://lemmy.hyperdata.it/, nginx starter page.

That redirect was needed!

put it back, random longshot, removed (shouldn't be needed after Apache..?):

    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
    }

in /var/log/nginx/access.log

127.0.0.1 - - [01/Sep/2023:11:31:41 +0000] "GET / HTTP/1.1" 301 162 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"

It looks like it's getting that far, but the lemmy bit isn't doing anything

https://join-lemmy.org/docs/administration/troubleshooting.html

root@localhost:/home/hmks-apps/lemmy/hyperdata.it# docker-compose logs -f lemmy

...
lemmy_1     | 	Is the server running on host "postgres" (172.18.0.3) and accepting
lemmy_1     | 	TCP/IP connections on port 5432?
...

I was in the middle of typing a message to ask for help on r/lemmy, when something occurred to me. Maybe Ansible doesn't do the external services. It's plausible that I'd already installed nginx, Ansible just added a config. But postgres, pictrs (?), the SMTP service..?

sudo apt install postgresql

The postgresql service is an "umbrella" service whose purpose is to start or stop the services named postgresql@version-instance which are the actual services that you are interested in. To get the statuses of these, run sudo systemctl status 'postgresql*'

a PostgreSQL Cluster is running...

https://www3.ntu.edu.sg/home/ehchua/programming/sql/PostgreSQL_GetStarted.html

initial client access :

there is a DB postgres, owner postgres

wait, https://github.com/LemmyNet/lemmy/issues/1847

postgres was already running under docker :

making sure it's using the same password as in lemmy.hjson

VkNyXOFMCgb0sGjraJUz

root@localhost:/home/hmks-apps/lemmy/hyperdata.it# docker ps -a
CONTAINER ID   IMAGE                        COMMAND                  CREATED      STATUS      PORTS                                                 NAMES
e519719e228e   nginx:1-alpine               "/docker-entrypoint.…"   2 days ago   Up 2 days   80/tcp, 0.0.0.0:20140->8536/tcp, :::20140->8536/tcp   hyperdatait_proxy_1
58f4af628105   dessalines/lemmy-ui:0.18.4   "docker-entrypoint.s…"   2 days ago   Up 2 days   1234/tcp                                              hyperdatait_lemmy-ui_1
9da7a413d90b   dessalines/lemmy:0.18.4      "/app/lemmy"             2 days ago   Up 2 days                                                         hyperdatait_lemmy_1
3394683cf319   postgres:15-alpine           "docker-entrypoint.s…"   2 days ago   Up 2 days   5432/tcp                                              hyperdatait_postgres_1
3792edf55801   asonix/pictrs:0.4.0          "/sbin/tini -- /usr/…"   2 days ago   Up 2 days   6669/tcp, 8080/tcp                                    hyperdatait_pictrs_1
157491f6495f   mwader/postfix-relay         "/root/run"              2 days ago   Up 2 days   25/tcp

ew, nginx & postgres

docker stop e519719e228e (nginx)
docker restart 3394683cf319
docker restart 3792edf55801
docker restart 157491f6495f
docker restart 9da7a413d90b
docker restart 58f4af628105

no change

systemctl stop nginx
docker start e519719e228e

slight change

https://lemmy.hyperdata.it/ -> http://lemmy.hyperdata.it/

503 Service Unavailable

that's a change. Next need to check the logs. But enough for today.

 


llama_index SPARQL Notes 02

Published on 2023-08-28 by @danja

Aaargh! OpenAI API credit $0.21, and I won't have any money to top it up for another week or so

What doesn't need the API?

FOR TEMP CHANGES, UNMODIFIED FILES ARE IN ../original (I should probably make a branch in git, but then I'm likely to forget...)

My attention's been skipping around, spent a lot of time trying to engage with ChatGPT. There are bits of prompts I tried in llama_index SPARQL Notes 01.

Time to get back to the code proper (continuing from llama_index SPARQL Notes 00).

I've roughed out the shape of RDF I think I need :

# Simple Entity-Relation

@base <http://purl.org/stuff/data> .
@prefix er: <http://purl.org/stuff/er> .

<#T123> a er:Triplet ;
er:id "#T123" ;
er:subject <#E123> ;
er:property <#R456> ;
er:object <#E567> .

<#E123> a er:Entity ;
er:value "one" .

<#R456> a er:Relationship ;
er:value "two" .

<#E567> a er:Entity ;
er:value "three" .

But before writing the SPARQL I want to go back to nebulagraph.py, add some logging calls to see what it's actually passing around. Then go back to the tests, then forward...

$0.21 mode

Is there still data in my local NebulaGraph? I could populate a SPARQL store with that.

wait, there's the GUI, NebulaGraph Studio -

http://localhost:7001/login

USE guardians;

-- Fetch 10 vertices with the 'entity' tag
MATCH (v:entity)
RETURN v
LIMIT 10;

Yay!

One results column labelled v

("$118.4 million" :entity{name: "$118.4 million"})
...
-- Fetch 10 edges with the 'relationship' type
MATCH (src:entity)-[e:relationship]->(dst:entity)
RETURN src, e, dst
LIMIT 10;

Results table column labels are src, e, dst

("production on Vol.3" :entity{name: "production on Vol.3"})	[:relationship "production on Vol.3"->"February 2021" @-8998665471782897487 {relationship: "was put on hold until"}]	("February 2021" :entity{name: "February 2021"})

Ok, enough for today.

Tomorrow)

 


llama_index SPARQL Notes 01

Published on 2023-08-28 by @danja

FOR TEMP CHANGES, UNMODIFIED FILES ARE IN ../original

Today's task : as yesterday.

It's occurred to me that it would be helpful to look at the data that nebulagraph.py actually looks at - pop a few logging calls in there.
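Nothing fancy needed for that - a sketch of the sort of thing I mean (the log file name is arbitrary, and the methods to instrument are whatever nebulagraph.py actually defines) :

import logging

logging.basicConfig(filename='nebulagraph-calls.log', level=logging.INFO)
logger = logging.getLogger(__name__)

# then, inside the methods of interest (e.g. upsert_triplet), lines like :
#     logger.info('upsert_triplet subj=%r rel=%r obj=%r', subj, rel, obj)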

(Maybe also probe - https://github.com/vesoft-inc/nebula-python )

Before that, need to have python looking at my llama_index tree. PYTHONPATH is empty so -

>>> import sys
>>> print(sys.path)
['', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '/home/danny/.local/lib/python3.11/site-packages', '/usr/local/lib/python3.11/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.11/dist-packages']

Ok, found it at :

~/.local/lib/python3.11/site-packages/llama_index

Hmm, which of Wey Gu's demos is the most minimal that uses NebulaGraph?

Custom instructions

Act as an expert in Python and graph data structures, in particular the RDF model and SPARQL. Be prepared for questions relating to Large Language Models, Retrieval-Augmented Generation and the llama_index libraries. Follow best practices everywhere, eg. including short comments as necessary in code. Outside of code, keep all responses and descriptions very concise and matter-of-fact. Whenever you think it's possible I have further information or code files to upload that will help in the tasks, please ask.
My name is Danny.

~~Act as a expert in Python and graph data structures as implemented in NebulaGraph and RDF and SPARQL. ~~

Existing code in llama_index includes communication with a NebulaGraph database via the nebula-python library. The goal for us today is to develop code in a file sparql.py having the same functionality as nebulagraph.py, but instead of NebulaGraph it will connect to a SPARQL store via the sparqlwrapper library from RDFLib. There are several subtasks to do first, which I will describe after uploading some relevant code.

Here a zip of llama_index/llama_index/graph_stores

Before writing Python, we will look at the data model. The following are the nGQL statements used to prepare the database for nebulagraph.py. How would an equivalent schema be expressed in RDF, and what would an example of instance data look like?

CREATE SPACE guardians(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
:sleep 10;
USE guardians;
CREATE TAG entity(name string);
CREATE EDGE relationship(relationship string);
:sleep 10;
CREATE TAG INDEX entity_index ON entity(name(256));

ChatGPT returns :

In Nebula Graph, you're defining a space guardians with vertex IDs as fixed-length strings. You also define a tag entity with a property name of type string, and an edge type relationship with a property relationship of type string. Finally, you create an index on the name property of the entity tag.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

# Define the Entity class

ex:Entity rdf:type rdfs:Class ;
rdfs:label "Entity" ;
rdfs:comment "An entity with a name" .

# Define the name property

ex:name rdf:type rdf:Property ;
rdfs:label "name" ;
rdfs:comment "The name of the entity" ;
rdfs:domain ex:Entity ;
rdfs:range xsd:string .

# Define the Relationship class

ex:Relationship rdf:type rdfs:Class ;
rdfs:label "Relationship" ;
rdfs:comment "A relationship between entities" .

# Define the relationship property

ex:relationship rdf:type rdf:Property ;
rdfs:label "relationship" ;
rdfs:comment "The type of relationship" ;
rdfs:domain ex:Relationship ;
rdfs:range xsd:string .

and

@prefix ex: <http://example.org/> .

# Entities

ex:Alice rdf:type ex:Entity ;
ex:name "Alice" .

ex:Bob rdf:type ex:Entity ;
ex:name "Bob" .

# Relationship

ex:Alice ex:relationship ex:Bob ;
rdf:type ex:Relationship ;
ex:relationship "friend" .

using existing files under llama_index/llama_index/graph_stores as a guide, in particular the interface defined


ex:Relationship rdf:type rdfs:Class ; rdfs:label "Relationship" ; rdfs:comment "A relationship between entities" .


Ok, do it by hand ...

a er:Triplet ; er:id "123" ; er:subject "one" ; er:property "two" ; er:object "three" .

But what/where are the IDs needed? ok, maybe better :

# Simple Entity-Relation
@base <http://purl.org/stuff/data> .
@prefix er: <http://purl.org/stuff/er> .

<#T123> a er:Triplet ;
er:id "#T123" ;
er:subject <#E123> ;
er:property <#R456> ;
er:object <#E567> .

<#E123> a er:Entity ;
er:value "one" .

<#R456> a er:Relationship ;
er:value "two" .

<#E567> a er:Entity ;
er:value "three" .

RDFS something like -

@prefix er: <http://purl.org/stuff/er> .

er:Entity a rdfs:Class ;
rdfs:label "Entity" ;
rdfs:comment "An entity..." .

er:Relationship a rdfs:Class ;
rdfs:label "Relationship" ;
rdfs:comment "A relationship between entities" .

er:Triplet a rdfs:Class ;
rdfs:label "Triplet" ;
rdfs:comment "A 3-tuple expressing a relationship between entities" .

er:subject a rdf:Property ;
rdfs:label "subject" ;
rdfs:domain er:Triplet ;
rdfs:range er:Entity .

er:property a rdf:Property ;
rdfs:label "property" ;
rdfs:domain er:Triplet ;
rdfs:range er:Relationship .

er:object a rdf:Property ;
rdfs:label "object" ;
rdfs:domain er:Triplet ;
rdfs:range er:Entity .

Probably not needed.

Time to move onto another doc

 


Graph of Thoughts, initial thoughts

Published on 2023-08-27 by @danja

Work in progress, scrappy notes & thoughts to pull together. What sprang to mind, not thought through, likely much rubbish.

So I finally got around to reading the recent Graph of Thoughts : Solving Elaborate Problems with Large Language Models paper. "Finally" in the sense that it only appeared on Arxiv just over a week or so ago; I printed it a couple of days ago. Things are moving sooo fast...

I do need to re-read it a few more times, also check some of the refs - there are bound to be papers that address the bits mentioned below (several are graph-related). So, initial thoughts. First is meta, the most important, subsuming everything else: it's really well written! Research only really becomes useful when it's communicated. The text is as plain English as you can get with such material. Loads of acronyms, but that's unavoidable, and where they are significant, they're expanded and explained. Only as much maths as necessary, lovely - so often in this field, complicated sums muddy the water. Lots of clear-enough diagrams. Content-wise : Strong Accept.

Regarding content, in no particular order :

There are really tasty chunks, I really like the general approach.

For starters, identification of concrete goals : How best to aggregate thoughts to maximise accuracy and minimise cost? Some of the terms used (I've no idea which originated here or elsewhere; it's irrelevant), like Graph of Operations (GoO) and Graph Reasoning State (GRS), are great markers for particular concepts. The metric Volume the authors provide looks like it could be incredibly useful. It's stated in a couple of places, the plain English version is :

We define volume - for a given thought t - as the number of preceding LLM thoughts that could have impacted t

The evaluation seems broad enough to be reliable, applied in a way that makes sense. The use cases are seriously uninspiring, but I'd say that adds weight to this thing having potential - I think it would do better on harder problems.

Frankly I am very surprised they got results as good as they did. Ok, I personally think approaches along these lines can offer huge benefits over current Tree-of-Thought etc. techniques. But I'd have given it another year or so of architecture trial and error before expecting any noticeable gain. One of the use cases is keyword counting - from what I've seen, ChatGPT's ability with arithmetic is below that of a 5-year-old, only marginally above that of a certain former US president.

Description Logics as a generalization

it's all quasi-procedural logic I think, things like applying tableaux :

https://en.wikipedia.org/wiki/Method_of_analytic_tableaux

IBIS

to ping @DrIBIS @doriantaylor

https://web.archive.org/web/20030418021226/http://ideagraph.net/xmlns/ibis/

https://vocab.methodandstructure.com/ibis

I hadn't thought through how to do it (see this paper!), but the Graph of Thoughts notion was a very visible next step after Chain... and Tree.... I discussed it with ChatGPT a little while ago. This reinforced my feeling that it should be a productive avenue, so at the end of our conversation I asked ChatGPT for advertizing copy to promote it :

"Unleash the power of graph-of-thought in LLM dialogues! Represent ideas as nodes, relationships as edges, and explore a web of knowledge. Navigate non-linearly, uncover intricate connections, and foster context-aware discussions."

https://www.siwei.io/en/posts/

Conceptual Maps (Sowa) - his site? Concept Maps, Semantic Web/Linked Data

Layer Cake (bis)! LLM stack https://medium.com/@brian_90925/llms-and-the-emerging-ml-tech-stack-6fa66ee4561a https://outerbounds.com/blog/llm-infrastructure-stack/

WebLM? WebML? Intelligent Web

Agents

 


Post One

Published on 2023-08-26 by @danja

This is a test

Let's see if it works

Note to self : check layout of original makesite

hmm...

 


llama_index SPARQL Notes 00

Published on 2018-08-27 by @danja

This is a continuation of notes from GraphRAG. I'm currently writing a SPARQL connector for llama_index, and jotting notes as I go along. My Wordpress install recently broke, but because a SPARQL-backed blog engine was on my TODO list I let it ride. But it occurred to me that an easy stopgap would be to use a minimal static blog/site builder as an interim fix (makesite.py fit the bill), use it for the GraphRAG notes too. Nice layout with bells & whistles to follow!

Today

Implementing a naive SPARQL connector, just enough to replicate the functionality of Wey Gu's original demo with a SPARQL store rather than NebulaGraph.

I've read the books, but still my inclination on something like this would usually be to go straight to the implementation, only making tests when it (predictably) doesn't work. But my Python is weak and this codebase is new to me, so test-driven it is.

There don't appear to be any tests around the graph stores yet, so I'll start against the existing implementations. To build :

  1. llama_index/tests/graph_stores/test_simple.py
  2. llama_index/tests/graph_stores/test_nebulagraph.py - see how far I get, I probably won't understand what the helpers do, Wey Gu should be happy to improve
  3. llama_index/tests/graph_stores/test_sparql.py
  4. llama_index/graph_stores/sparql.py
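The shape of a minimal graph store test should be something like this - a sketch, assuming the interface in types.py exposes upsert_triplet(subj, rel, obj) and get(subj) the way simple.py does :

from llama_index.graph_stores import SimpleGraphStore

def test_upsert_and_get():
    store = SimpleGraphStore()
    store.upsert_triplet('Alice', 'knows', 'Bob')
    # get() is assumed here to return the [rel, obj] pairs for a subject
    assert store.get('Alice') == [['knows', 'Bob']]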

for reference :

Trying existing tests

pytest tests

missing dependency -

try again, 4 FAILED, all from what I roughed out yesterday :)

5 ERROR, eg.

ERROR tests/llm_predictor/vellum/test_predictor.py::test_predict__basic - ModuleNotFoundError: No module named 'vellum'

dependency?

9 failed, 247 passed, 79 skipped, 31 warnings in 2.75s

not-me failures around vellum

FAILED tests/llm_predictor/vellum/test_prompt_registry.py::test_from_prompt__new - ModuleNotFoundError: No module named 'vellum.core'

https://pypi.org/search/?q=vellum

try -

Lucky! 4 failed, 252 passed, 79 skipped, 31 warnings in 3.40s

Those 4 are my rubbish.

How to run individual test files? https://stackoverflow.com/questions/36456920/specify-which-pytest-tests-to-run-from-a-file

collected 0 items

test naming conventions https://stackoverflow.com/questions/37353960/why-pytest-is-not-collecting-tests-collected-0-items

Ooops - I used the wrong path.

5 failed, 1 passed in 0.25s

1 failed in 0.17s

I'd seen the annotation @pytest.fixture elsewhere, but for pytest the test_ naming convention was what it needed :

def test_one_false():
    assert 1 == 0

ok, time to actually write code...

Ask the assistant first

I hadn't really explored ChatGPT's Code Interpreter. Had seen that you could upload source files, best as a zip.

So I thought, see how far it gets with llama_index/graph_stores/

 


Post Two

Published on 2001-01-01 by @danja

This is a test 2

Let's see if it works

Check layout of original makesite

 


Blog TODO

Published on 2000-01-01 by @danja

far in the future

I feel very stupid.

I've written several blog engines in various languages. Never once started from scratch on design. What I mean is that I've always taken as my requirements the bits I liked about other blog engines, prioritised a bit, followed that to-do list. Never started with an absolutely minimal reverse-chrono

I rather like this mode of blogging, type markdown into VSCode, makesite.py converts/formats it, use git (in a script) to transfer to server.

First, add a bit to p (publish) to ssh into server, then :

cd /home/hkms-apps/hyperdata-static
git pull

Soonish :

I reckon I'll split makesite.py up, modularise a bit. Even if I move to all SPARQL-backed for the blog, it's bound to be useful somewhere.

Things for later :

Turtle plugin for markdown

similar to https://github.com/ozekik/markdown-ld

https://raw.githubusercontent.com/ozekik/markdown-ld/master/examples/foaf.md

Requirements :

  1. independent of, but compatible with existing markdown tools
  2. extensible, reasonably modular
  3. block identifier & delimiters
  4. useful defaults, easily overridden

For most of 1 & 2, make it easy to put in pipeline, something like :

markdown, other = mx(markdown_with_extensions)

for 3, maybe :

mx:turtle '''

<this> a x:That.

For 4, for starters it'd probably be good to have some standard prefixes as default, added by the processor, which could be overridden :

mx:turtle mx:prefixes '''

prefix ...

<this> a x:That.

'''

>>> pair = (1, 2)
>>> pair
(1, 2)
>>> x, y = pair
>>> x
1
>>> y
2

>>> pair[0]
1
>>> pair[

Markdown Extension Vocab, mx:

 


RDF-Ext : Tutorial

Published on 1970-01-01 by @danja

Step One : GitPod

I don't remember using it before, but I may well have done. First thing that happened when I followed the link (using GitHub credentials) was that it said I was upgraded to an organisation. Appears to be 50 hours free, then pay-as-you-go.

I got to an in-browser (I'm running Brave on Ubuntu) VS Code.

At the terminal it appears to have run :

...
  vite build --outDir=../../dist examples/browser

Dependencies installed ok, dist built ok. Somewhere I briefly saw the option to launch it in a local VS Code. Try that later...

 


HKMS Desktop

Published on 1970-01-01 by @danja

An immediate irritation I have is that I'm coding in VSCode and also writing notes in it. Navigating between the two activities is a pain.

The problem is broader still because during the coding I have to do other bits of context-switch to look at/edit data, project management, bookmarks...

I've got most of these things at various stages of dev in HKMS, but accessing these via a browser bypasses my local FS, access there is desirable.

So I've been looking at Pulsar, a fork of the discontinued Atom editor. Plan is to tweak it to fit my needs.

I did like Atom as an editor, and hopefully as it's built on Electron it'll make a convenient local host for the HKMS apps.

Set up a repo : hkms-desktop

I'm having a few teething problems with Pulsar.

The ppm package manager it uses is in a different repo. The install instructions didn't work for me, so playing it by ear.

hmm, lost the paths

export ATOM_HOME=/home/danny/.pulsar
export APM_PATH=/home/danny/HKMS/hkms-desktop/ppm/bin/apm
export ATOM_ELECTRON_VERSION=12.2.3

ok, now packages seem to work.

Added https://github.com/denieler/save-workspace-atom-plugin - not sure it's gonna be useful...

Added PlatformIO IDE Terminal

https://www.electronjs.org/docs/latest/tutorial/using-native-node-modules

 


Schema Editor Notes

Published on 1970-01-01 by @danja

I wanted an RDF Schema for the LlamaIndex sparql.py. It isn't necessary at this point in time, but nice to have for documentation & completeness.

Just links for now.

http://purl.org/stuff/er

Oops! LODE docs are meant to be generated live, so that's missing bits. TODO fix

https://hyperdata.it/xmlns/er/er-schema.ttl

https://hyperdata.it/xmlns/

https://hyperdata.it/schema-editor/

https://github.com/danja/schema-editor

https://essepuntato.it/lode/

https://fuseki.hyperdata.it/#/dataset/schema-test/query

https://fuseki.hyperdata.it/#/dataset/vocabs/query

Interesting

https://howarddierking.com/2016/12/02/generating-documentation-sites-for-linked-data-vocabularies/

https://github.com/howarddierking/metalsmith-ldschema

https://metalsmith.io/

Sources

https://answers.knowledgegraph.tech/t/generating-documentation-from-rdfs-and-or-owl-vocabularies/4829

https://www.reddit.com/r/semanticweb/comments/tr1v5o/tools_to_generate_documentation_site_for_classes/

https://www.w3.org/2011/prov/wiki/Generating_HTML_documentation_of_OWL

https://protege.stanford.edu/software.php

https://www.w3.org/DesignIssues/TopTen.html

 


Turtle Markdown Extensions

Published on 1970-01-01 by @danja

A bit of forward-planning for blog engine stuff. This went on my todo list the other day, since then I've had a think, thought I'd better get it down before I forget.

The implementation for makesite.py should be straightforward, but I've other things to do. But I can start using these extensions in posts today, no matter that they won't do anything yet. When I've got a demo implementation and tried it out a bit I'll give it a repo on GitHub.

How to express RDF statements in Markdown?

Uses Cases

  1. make statements about the md doc
  2. extract a block of arbitrary Turtle from md doc

General Requirements

  1. simple to use, simple to implement
  2. independent of, but compatible with existing markdown tools
  3. extensible, reasonably modular
  4. block identifier & delimiters
  5. useful defaults, easily overridden

Note re. (2) : the markup syntax used will be interpreted as a processing instruction, so while Turtle creation/extraction is the immediate goal, it makes sense for extensibility to consider other possible uses.

0. General Syntax

` :term fur`

``` :term fur```

TODO express in BNF TODO provide regexes

1. Statements about Current Markdown Document

` :tag fur`

TODO result

In this example fur is one word, a simple string delimited by spaces. Alternatives will include quoting of literals "as in Turtle" for the object as well as the use of URIs using standard Turtle syntax.

TODO longer example

Useful Terms

TODO fill out as needed, find standard vocab equivalents

2. Arbitrary Turtle in Markdown Document

Where a block of Turtle should be extracted, the term mx:x should be used, e.g.

```:x
@base <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .

<#green-goblin>
    rel:enemyOf <#spiderman> ;
    a foaf:Person ;    # in the context of the Marvel universe
    foaf:name "Green Goblin" .
```

3. Interpretation Rules

TODO

for eg. mx:tags - provide a simple list syntax

Terms MAY be interpreted as those in the mx namespace and/or well-known equivalents

How to say what should be passed to standard markdown processor, what should be cut?

Implementation Notes

eg. :

contents = mx(markdown_with_extensions)

markdown = contents['markdown']
turtle = contents['turtle']

html = to_html(markdown)
store.add(turtle)
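A first cut of the mx splitter could be as dumb as this - a sketch covering only use case 2, assuming the ```:x ... ``` fencing above (function and variable names are mine) :

import re

# Pull ```:x ... ``` blocks out as Turtle, pass everything else through as markdown.
MX_TURTLE = re.compile(r'```:x\s+(.*?)```', re.DOTALL)

def mx(markdown_with_extensions):
    turtle_blocks = MX_TURTLE.findall(markdown_with_extensions)
    markdown = MX_TURTLE.sub('', markdown_with_extensions)
    return {'markdown': markdown, 'turtle': '\n'.join(turtle_blocks)}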
 


Impedance Matching LLMs and Linked Data

Published on 1970-01-01 by @danja

If you see this message, this post is only half-done. I know what's needed so it should be done in a couple of hours. I've published prematurely to check image linking

Unifying Large Language Models and Knowledge Graphs: A Roadmap

An Observation

Large Language Models are sizeable knowledgebases which, at least in part, encapsulate sentence-oriented data structures derived from human language. The Web is a massive knowledgebase which at a structural level, has embedded sentence-like data (clearly apparent when viewed from a Linked Data perspective). There isn't an obvious direct mapping between these systems, but they both feature shapes that look very similar from 1,000ft. However you look at it, the future potential of a combined system is...TBD. We are in a position to take (long-legged) baby steps in that direction.

A Problem

For the purposes here, have a loose, back-of-envelope working definition of 'knowledge' :

A collection of structured data that represents information, together with a means of navigating that information.

Navigation isn't usually something highlighted in these parts. But applied in a very broad sense, I reckon it's useful, as I hope to show here. I'll leave notions of agency to one side to avoid the bigger tarpit around intelligence, biological or artificial.

A Particular Characterization of LLMs

Deep Learning systems somehow embody knowledge derived from their training data. Forget their internals for now, consider them as black boxes with external interfaces, communication protocols.

A Particular Characterization of the Web

I believe 'Semantic Web' hit the Peak (of Inflated Expectations) in the Gartner Hype Cycle around 2001, the time of a certain Scientific American article (PDF). But a lot has happened since then. Masses of work has been done by people working directly in the field. There's been significant deployment by people from every imaginable field using the associated technologies for practical applications. Most web developers will have seen something related in their peripheral vision, quite possibly used such things in their day job without realising it. But for various historical reasons the big picture isn't that widely known.

First, a wormhole-speed trip from the

LLM

The cat sat on the mat. A lot of mats are blue.

Q & A

That there's some common topology between these systems shouldn't be a surprise. Both are representations of knowledge with humans as the immediate source.

But there are low-level hacks that might offer approaches good enough for many practical applications.

A Potential Path to a Partial Solution

relevance Similarity overlays on the web

 


RDF-Ext : Starting

Published on 1970-01-01 by @danja

I reached a first little milestone with SPARQL GraphRAG, sparql.py for LlamaIndex in being able to get a demo running on a Colab notebook.

But I need to switch focus for a bit.

I'm about to start kicking the tyres/writing up RDF-Ext. It's a dev-friendly toolkit for using RDF in browser/node JS, bridging to the fairly opaque/formal RDFJS specs.

The Tutorial seems as good a place as any to start.


Links

RDF-Ext Home

 


This is a Blog Post

Published on 1970-01-01 by @danja

work in progress

Why Blog?

Retro Perspective

The Web's original purpose was to simplify information access and exchange [quote/link timbl proposal]. A subsystem started emerging in the early 2000s that demonstrated one way in which this potential could be fulfilled : the blogosphere. This didn't require any novel technologies, all components were simple evolutions of those needed for the first generation Web Home Page.

Underneath, there's a relatively simple protocol and document format (HTTP and HTML), which supports the Web's defining feature, the affordance of the hyperlink. The development of content management systems greatly simplified the process of content creation. This enabled a significant democratization of the facilities, giving access to people without the specific technical skills. It's also worth noting this reduced the friction of authoring for those in the development community. Less time fiddling with markup, more time expressing

Previously, content/site structure was typically hierarchical and static. At a first approximation, this corresponded to the textbook in traditional media. But this was a poor match for the requirements of news sites. [Slashdot]

reverse-chrono maps intuitively to diary/journal, elements of the episodic style of TV series etc.

Dogfood

It's no coincidence that many of the early adopters of the blog format were exactly the people pioneering tech developments, eg. Shelley, Winer, Tim Bray. As an aside, this also demonstrated how robust the blogosphere was as an open forum. I didn't choose those three blogger-developers at random. Each had great ideas to offer this realm. But each was approaching from a different direction, often with strongly-held views on particular issues. Such opinions were sometimes diametrically opposed to each other. Boiling-blood slips from civil discourse weren't exactly rare, but a first real take-away is that there was useful exchange of information. A second, probably more important, these arguments weren't taking place in a void. Anyone could chip in and offer their views, either in short-form as a comment on the relevant blog post, or long-form on their own blog.

This elegantly showed how the philosophy of design for serendipity [timbl], while nebulous at design time can lead to concrete utility later.

What went Wrong?

rise of big social media

voluntary tolerance of theft of marketable content backdoor theft of identity

smartphone impact

unwebby local apps attached to silos

low-level tech factors

eg. 'danny', google & no-follow;

but blogosphere not dead!

A little gem of history is that on top of everything else, Tim Berners-Lee was prescient in creating the first proto-blog back in the early 1990s : his real-ish time documentation of the development of the Web.

Recommendations


Thinking aloud.

A Little

A little learning is a dangerous thing;
drink deep, or taste not the Pierian spring:
there shallow draughts intoxicate the brain,
and drinking largely sobers us again.

https://en.wikipedia.org/wiki/Pierian_Spring

The first line is commonly used with 'knowledge' in place of 'learning', but with the same intent.

It sometimes works.

corollary.

Moderate ignorance is a powerful thing.
 


Blog

Published on 1970-01-01 by @danja

 


Blog

Published on 1970-01-01 by @danja

using standard linux tools, how do I modify a text file to remove the spaces from any strings in it of the following form : '<mbox:dylan_beadle -at- yahoo.com>'. Match the < > characters, I'd like the result 'mbox:dylan_beadle-at-yahoo.com'

(some fails with sed)

perl -pe 's/<mbox:([^>]+)>/my $x = $&; $x =~ s| -at- |-at-|g; $x =~ s| ||g; $x/ge' main_backup.n3 > main_backup-cleaner.n3

Could you now extend the previous Perl so that : <mbox:dylan_beadle -at- yahoo.com> becomes mbox:dylan_beadle@yahoo.com mbox:thaynes{at}openlinksw.com becomes mbox:thaynes@openlinksw.com

perl -pe 's/<mbox:([^>]+)>/my $x = $&; $x =~ s|\s*-at-\s*|@|g; $x =~ s|{at}|@|g; $x =~ s|\s||g; $x/ge' main_backup.n3 > main_backup-cleaner.n3

perl -pe 's/<mbox:([^>]+)>/my $x = $1; $x =~ s|\s*-at-\s*|@|g; $x =~ s|{at}|@|g; $x =~ s|\s||g; "mbox:$x"/ge' main_backup.n3 > main_backup-cleaner.n3

 


Blog

Published on 1970-01-01 by @danja

The description here will only cover the SPARQL-specific details, for how the system as a whole works see the original Notebook and the LlamaIndex Documentation.

 


RSS