public:: true blog-date:: 2023-06-12

briefly

Inspired by the super popular #[[schrodinger logseq plugin]], I wrote up something in #python today, to publish to #Hugo like https://github.com/sawhney17/logseq-schrodinger but also with support for block embeds.

In particular, I was really inspired by [[Bas Grolleman]]’s concept here of using #interstitial-journaling and then nicely coalescing selected sources, through their block embeds, into the target logseq concept they all refer to.

Full code from this blog post

By the way, this code is still just at the “first stab” / #proof-of-concept stage, but it is here, https://github.com/namoopsoo/logseq_utils

And usage is just below

Say you have a page “blogpost/2023-06-12-name-of-your-logseq-page”, where you happen to use embeds like,

{{embed ((64864e08-de92-4127-9162-8b5b946b021b))}}

then to create a markdown file like “content/post/2023-06-12-name-of-your-logseq-page.md” (with “content/post/” being, say, your Hugo post location), use,

from pathlib import Path
import logseq_utils as lu

page = "blogpost/2023-06-12-name-of-your-logseq-page"
filename = page.split("/")[1] + ".md"
target_dir = "content/post/"
target_loc = str(Path(target_dir) / filename)
print("target_loc", target_loc)
# target_loc content/post/2023-06-12-name-of-your-logseq-page.md

lu.build_markdown(page, target_loc)

The logseq REST API

Would have loved to help w/ logseq-schrodinger but

Ideally I would love to attempt a pull request on https://github.com/sawhney17/logseq-schrodinger, and I have left the block embed as a feature idea here, but my knowledge of the logseq dev setup, including ClojureScript and React, is smaller than my desire to first get something working to solve my immediate problem hah 😅.

But the logseq REST API looks great!

According to https://docs.logseq.com/#/page/local%20http%20server and https://plugins-doc.logseq.com, it looks like one just needs to add a local API token and then you are basically ready to interact with your logseq at “http://127.0.0.1:12315”.

Iterating

Using getPageBlocksTree gave the blocks on a page

Like this

import os
import requests

def get_page_blocks_tree(name, include_children=False):
    # LOGSEQ_TOKEN holds the API token configured for the local logseq server
    token = os.getenv("LOGSEQ_TOKEN")
    url = "http://127.0.0.1:12315/api"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    payload = {
        "method": "logseq.Editor.getPageBlocksTree",
        "args": [name, {"includeChildren": include_children}],
    }
    response = requests.post(url, json=payload, headers=headers)
    return response

A stab at a recursive call here

13:51 ok some simple stab at using this API then, to build #[[log-seq markdown hugo integration]] [[Hugo]]

14:27 ok, wrote some code in logseq_utils.py, and running,

import requests
import time

def build_markdown_from_page_blocks(blocks):
    print("DEBUG", [x["level"] for x in blocks])
    time.sleep(1)

    stuff = []
    for block in blocks:
        stuff.append({"level": block["level"], "content": block["content"]})
        if block["children"]:
            stuff.extend(build_markdown_from_page_blocks(block["children"]))

    return stuff

# Then, with the function above saved in logseq_utils.py
import logseq_utils as lu

page = "blogpost/2023-06-11-semantic-code-search-first-stab"
response = lu.get_page_blocks_tree(page)
response.status_code
blocks = response.json()
len(blocks)  # 4

stuff = lu.build_markdown_from_page_blocks(blocks)

ok nice haha worked on first try as intended #moment/satisfaction 😀

In [57]: stuff = lu.build_markdown_from_page_blocks(blocks)
DEBUG [1, 1, 1, 1]
DEBUG [2, 2]
DEBUG [3, 3, 3]
DEBUG [2, 2, 2, 2, 2, 2, 2]
DEBUG [3, 3]


[{'level': 1, 'content': 'public:: true'},
 {'level': 1, 'content': ''},
 {'level': 1, 'content': '# Initial Learnings'},
 {'level': 2,
  'content': '## Learned about [[symmetric vs asymmetric semantic search]]'},
 {'level': 3,
  'content': 'This distinction refers to #[[sentence similarity task]] where the query is of the same size or asymmetrically, smaller size, such as a one or two word query.'},
 {'level': 3, 'content': 'And wow that is exactly what. Iwas looking for !'},
 {'level': 3, 'content': 'So apparently this includes the "msmarco" models.'},
 {'level': 2, 'content': ''},
 {'level': 1, 'content': '# Final test run'},
 {'level': 2,
  'content': 'Idea is to build a corpus with, hey why not, the code from the #sentence-transformers repo.'},
 {'level': 2,
  'content': 'Had a few test runs today, iterating on the approach, using queries from files other than the ones I built a corpus for , #moment/doh haha . Also weirdly the msmarco model documented as the v3 that should be used is MIA somehow, but the v2 seems fine. And "msmarco-MiniLM-L-6-v3" is fine too. \n\nBut here is the last run for today.'},
 {'level': 2, 'content': '{{embed ((64864e08-de92-4127-9162-8b5b946b021b))}}'},
 {'level': 2, 'content': '## And code for building that corpus'},
 {'level': 3, 'content': '{{embed ((648728d8-5958-4df6-95d4-b81de6665974))}}'},
 {'level': 3, 'content': ''},
 {'level': 2, 'content': ''},
 {'level': 2, 'content': ''},
 {'level': 2, 'content': ''}]

And then filling out the embeds

14:47 ok just need to fill the embeds with the block embed outputs then,

14:56,

import re
block = {"content": "{{embed ((648728d8-5958-4df6-95d4-b81de6665974))}}"}
if match := re.match(
  r"^{{embed \(\(([a-zA-Z0-9-]+)\)\)}}$",
  block["content"]
):
    print("yes", match.groups()[0])

# yes 648728d8-5958-4df6-95d4-b81de6665974

hmm so for instance if we have the block below, which is at level 3,

{'properties': {},
      'unordered': True,
      'parent': {'id': 28919},
      'children': [],
      'id': 29214,
      'pathRefs': [{'id': 28906}, {'id': 29113}, {'id': 29220}],
      'level': 3,
      'uuid': '64875997-b4fd-4079-9a06-19b1606d5f33',
      'content': '{{embed ((648728d8-5958-4df6-95d4-b81de6665974))}}',
      'journal?': False,
      'macros': [{'id': 29221}],
      'page': {'id': 28906},
      'left': {'id': 28919},
      'format': 'markdown',
      'refs': [{'id': 29113}, {'id': 29220}]}

And we fetch the block “648728d8-5958-4df6-95d4-b81de6665974”,

In [59]: block_uuid = "648728d8-5958-4df6-95d4-b81de6665974"

In [60]: response = lu.get_block(block_uuid)

In [61]: response.json()
Out[61]: 
{'properties': {'id': '648728d8-5958-4df6-95d4-b81de6665974'},
 'parent': {'id': 28846},
 'children': [{'properties': {},
   'parent': {'id': 29113},
   'children': [],
   'id': 29219,
   'pathRefs': [{'id': 28},
    {'id': 1967},
    {'id': 5812},
    {'id': 27790},
    {'id': 27795},
    {'id': 28383},
    {'id': 28504},
    {'id': 28556},
    {'id': 28906},
    {'id': 28907}],
   'level': 1,
   'uuid': '64875a31-5142-4f50-a542-5d5c00021cb4',
   'content': 'Put the following into a file `code_search.py`\n```python\nfrom pathlib import Path\nfrom itertools import chain\n\n\ndef build_texts_from_repository(repo_dir):\n    """Return a dataset of the code\n    """\n    dataset = []\n    file_types = []\n    for path in chain(\n        Path(repo_dir).glob("**/*.py"),\n        Path(repo_dir).glob("**/*.md"),\n    ):\n        assert path.is_file() and path.suffix\n        lines = path.read_text().splitlines()\n        \n        dataset.extend(\n            [{"line_number": i,\n               "line": line,\n               "path": str(path.relative_to(repo_dir))}\n        for i, line in enumerate(lines)\n         ]\n        )\n    return dataset\n\n```',
   'page': {'journalDay': 20230611,
    'name': 'jun 11th, 2023',
    'originalName': 'Jun 11th, 2023',
    'id': 28504},
   'left': {'id': 29113},
   'format': 'markdown'},
  {'properties': {},
   'parent': {'id': 29113},
   'children': [],
   'id': 29218,
   'pathRefs': [{'id': 28},
    {'id': 1967},
    {'id': 5812},
    {'id': 27790},
    {'id': 27795},
    {'id': 28383},
    {'id': 28504},
    {'id': 28556},
    {'id': 28906},
    {'id': 28907}],
   'level': 1,
   'uuid': '64875a2e-7b2a-47fb-891b-419cd3347643',
   'content': '```python\nimport os\nimport code_search as cs\nfrom pathlib import Path\nrepos_dir = os.getenv("REPOS_DIR")\ntarget_dir = Path(repos_dir) / "sentence-transformers"\ndataset = cs.build_texts_from_repository(target_dir)\n\n```\ndouble checking , \n```python\n\nIn [12]: dataset[:10]\nOut[12]: \n[{\'line_number\': 0,\n  \'line\': \'from setuptools import setup, find_packages\',\n  \'path\': \'setup.py\'},\n {\'line_number\': 1, \'line\': \'\', \'path\': \'setup.py\'},\n {\'line_number\': 2,\n  \'line\': \'with open("README.md", mode="r", encoding="utf-8") as readme_file:\',\n  \'path\': \'setup.py\'},\n {\'line_number\': 3,\n  \'line\': \'    readme = readme_file.read()\',\n  \'path\': \'setup.py\'},\n {\'line_number\': 4, \'line\': \'\', \'path\': \'setup.py\'},\n {\'line_number\': 5, \'line\': \'\', \'path\': \'setup.py\'},\n {\'line_number\': 6, \'line\': \'\', \'path\': \'setup.py\'},\n {\'line_number\': 7, \'line\': \'setup(\', \'path\': \'setup.py\'},\n {\'line_number\': 8,\n  \'line\': \'    name="sentence-transformers",\',\n  \'path\': \'setup.py\'},\n {\'line_number\': 9, \'line\': \'    version="2.2.2",\', \'path\': \'setup.py\'}]\n\nIn [13]: set([Path(x["path"]).suffix for x in dataset])\nOut[13]: {\'.md\', \'.py\'}\n```',
   'page': {'journalDay': 20230611,
    'name': 'jun 11th, 2023',
    'originalName': 'Jun 11th, 2023',
    'id': 28504},
   'left': {'id': 29219},
   'format': 'markdown'}],
 'id': 29113,
 'pathRefs': [{'id': 28},
  {'id': 1967},
  {'id': 5812},
  {'id': 27790},
  {'id': 27795},
  {'id': 28383},
  {'id': 28504},
  {'id': 28556},
  {'id': 28906},
  {'id': 28907}],
 'propertiesTextValues': {'id': '648728d8-5958-4df6-95d4-b81de6665974'},
 'uuid': '648728d8-5958-4df6-95d4-b81de6665974',
 'content': 'ok cool, let me just focus on markdown and python\nid:: 648728d8-5958-4df6-95d4-b81de6665974',
 'page': {'journalDay': 20230611,
  'name': 'jun 11th, 2023',
  'originalName': 'Jun 11th, 2023',
  'id': 28504},
 'left': {'id': 29215},
 'format': 'markdown',
 'refs': [{'id': 29214}]}
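
The `lu.get_block` helper used above is not shown in this post. Here is a minimal sketch of what it might look like, assuming it mirrors `get_page_blocks_tree` and calls `logseq.Editor.getBlock` with `includeChildren` turned on (the response above does include children). The name `build_get_block_request` is made up for illustration.

```python
import os

API_URL = "http://127.0.0.1:12315/api"


def build_get_block_request(block_uuid, include_children=True):
    """Return (headers, payload) for a logseq.Editor.getBlock call.

    Hypothetical helper: split out so the request shape can be inspected
    without a running logseq.
    """
    token = os.getenv("LOGSEQ_TOKEN")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    payload = {
        "method": "logseq.Editor.getBlock",
        "args": [block_uuid, {"includeChildren": include_children}],
    }
    return headers, payload


def get_block(block_uuid, include_children=True):
    """POST the getBlock request to the local logseq HTTP server."""
    # Imported here so the request-building part works without requests installed.
    import requests

    headers, payload = build_get_block_request(block_uuid, include_children)
    return requests.post(API_URL, json=payload, headers=headers)
```

With logseq running, `get_block("648728d8-5958-4df6-95d4-b81de6665974").json()` would then be the call that produced the output above, if the assumptions hold.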

15:12 hmm strangely the fetched block does not have its own level, and its children start from level 1, so maybe the levels are meant to be relative to the block you fetch, sort of.

15:26 ok added some kind of offset then

def build_markdown_from_page_blocks(blocks, level_offset=0):
    print("DEBUG", [x["level"] for x in blocks])

    stuff = []
    for block in blocks:

        # Replace embed
        if match := re.match(
            r"^{{embed \(\(([a-zA-Z0-9-]+)\)\)}}$",
            block["content"]
        ):
            block_uuid = match.groups()[0]
            print("yes", block_uuid)

            response = get_block(block_uuid)
            assert response.status_code == 200

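            new_block = response.json()
            # The fetched block has no level of its own, so place it at the
            # level of the embedding block (shifted by any enclosing embed).
            new_block["level"] = block["level"] + level_offset

            stuff.append({"level": new_block["level"], "content": new_block["content"]})
            if new_block["children"]:
                stuff.extend(
                    build_markdown_from_page_blocks(
                        new_block["children"],
                        # children of a fetched block start again at level 1
                        level_offset=new_block["level"]
                    ))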

        else:
            stuff.append(
                {"level": block["level"] + level_offset,
                 "content": block["content"]})
            if block["children"]:
                stuff.extend(
                    build_markdown_from_page_blocks(
                        block["children"], level_offset=level_offset)
                )

    return stuff
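
To see the offset arithmetic without a running logseq, here is a small self-contained sketch of the same idea. `flatten`, `fetch_block` and `FAKE_BLOCKS` are made-up names, and the fake data only imitates the shape of the API responses above.

```python
import re

EMBED_RE = re.compile(r"^\{\{embed \(\(([a-zA-Z0-9-]+)\)\)\}\}$")

# Hypothetical stand-in for get_block(uuid).json(): a fetched block has no
# "level" of its own, and its children start again at level 1.
FAKE_BLOCKS = {
    "aaaa-1111": {
        "content": "embedded parent",
        "children": [
            {"level": 1, "content": "embedded child", "children": []},
        ],
    },
}


def flatten(blocks, fetch_block, level_offset=0):
    """Flatten a block tree, swapping embeds for the blocks they point to."""
    rows = []
    for block in blocks:
        level = block["level"] + level_offset
        content = block["content"]
        children = block["children"]
        child_offset = level_offset
        if match := EMBED_RE.match(content):
            embedded = fetch_block(match.group(1))
            content = embedded["content"]
            children = embedded["children"]
            # Embedded children are numbered from 1, so shift them to sit
            # below the level of the block that holds the embed.
            child_offset = level
        rows.append({"level": level, "content": content})
        rows.extend(flatten(children, fetch_block, child_offset))
    return rows


page = [
    {"level": 1, "content": "# Heading", "children": [
        {"level": 2, "content": "{{embed ((aaaa-1111))}}", "children": []},
    ]},
]
rows = flatten(page, FAKE_BLOCKS.__getitem__)
print([(r["level"], r["content"]) for r in rows])
# [(1, '# Heading'), (2, 'embedded parent'), (3, 'embedded child')]
```

Passing the fetch function in as an argument is just a convenience for trying this offline; the real code calls the API directly.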

And the markdown output

15:26 ok then now, to generate the markdown. Did this in a super simple way,

def build_markdown(page_name, target_loc):
    response = get_page(page_name)
    assert response.status_code == 200 and response.json()

    response = get_page_blocks_tree(page_name)
    assert response.status_code == 200 and response.json()
    blocks = response.json()

    blog_date = blocks[0]["properties"]["blogDate"]

    stuff = build_markdown_from_page_blocks(blocks)

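    page_title = page_name.split("/")[1]
    # Fall back to the page's blog-date property when the title has no date prefix,
    # so date_from_title is always defined below.
    date_from_title = blog_date
    if match := re.match(r"(\d{4}-\d{2}-\d{2})-(.*)", page_title):
        date_from_title, page_title = match.groups()
    print("page_title", page_title)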

    page_title = page_title.replace("-", " ")
    print("page_title", page_title)

    text = [
        "---",
        f"date: {date_from_title}",
        f"title: {page_title}",
        "---",
    ] + [x["content"] for x in stuff]
    path = Path(target_loc)
    assert path.parent.is_dir()

    path.write_text("\n".join(text))
    ...
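
The title and date handling above can also be pulled out into a small pure function, which makes it easy to check without logseq running. This is only a sketch: `front_matter` and its `fallback_date` parameter are made-up names, and it assumes page names shaped like "namespace/YYYY-MM-DD-some-title".

```python
import re


def front_matter(page_name, fallback_date=None):
    """Build Hugo front matter lines from a logseq page name.

    Hypothetical helper; fallback_date is used only when the page name
    carries no YYYY-MM-DD prefix.
    """
    slug = page_name.split("/")[-1]
    date, title = fallback_date, slug
    if match := re.match(r"(\d{4}-\d{2}-\d{2})-(.*)", slug):
        date, title = match.groups()
    if date is None:
        raise ValueError(f"no date found for page {page_name!r}")
    return [
        "---",
        f"date: {date}",
        f"title: {title.replace('-', ' ')}",
        "---",
    ]


print("\n".join(front_matter("blogpost/2023-06-11-semantic-code-search-first-stab")))
```

Failing loudly when no date is available is a design choice here; silently writing a post without a date seemed worse.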

15:50 ok let's try this,

target_dir = "content/post/"
# 2023-06-12-spinoza-vs-descartes.md
page = "blogpost/2023-06-11-semantic-code-search-first-stab"
filename = page.split("/")[1] + ".md"
target_loc = str(Path(target_dir) / filename)
print("target_loc", target_loc)
# 'content/post/2023-06-11-semantic-code-search-first-stab.md'
lu.build_markdown(page, target_loc)

So per above, the first page built ended up being this page. And the second page was actually this one.

Some tips

no hyphens in your API token

Oddly enough I was getting

Out[12]: {'statusCode': 401, 'error': 'Unauthorized', 'message': 'Access Denied!'}

and then switched to a token without hyphens and it was fine.

The right REST endpoint

Initially I was getting a 404 'Route POST:/ not found' when hitting

url = "http://127.0.0.1:12315"

as opposed to

url = "http://127.0.0.1:12315/api"

getPage returned no children

Oddly enough, calling getPage (https://plugins-doc.logseq.com/logseq/Editor/getPage) on a valid page, even with includeChildren set to true, I was getting a 200 but an empty response payload. However, I found that getPageBlocksTree (https://plugins-doc.logseq.com/logseq/Editor/getPageBlocksTree) gave me all the child blocks for a page.

Maybe that is the intended behavior, but I’m not sure.

Retrieving a page with a literal slash worked but not with the “%2F”

12:52 let me try getting a page too,

import os
import requests

def get_page(name, include_children=False):
    # name can be a page name or, as tried further down, a numeric page id
    token = os.getenv("LOGSEQ_TOKEN")
    url = "http://127.0.0.1:12315/api"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    payload = {
        "method": "logseq.Editor.getPage",
        "args": [name, {"includeChildren": include_children}],
    }
    response = requests.post(url, json=payload, headers=headers)
    return response

# First attempt, with the slash percent-encoded
response = get_page(
    "blogpost%2F2023-06-11-semantic-code-search-first-stab"
)
response.status_code

hmm weird, the status code says it got a page but response.json() is null. Ok, trying by id now, and below, this is working now,

response = get_page(28504)

In [12]: response.json()
Out[12]: 
{'updatedAt': 1686580509936,
 'journalDay': 20230611,
 'createdAt': 1686463272305,
 'id': 28504,
 'name': 'jun 11th, 2023',
 'uuid': '648728d8-8c86-4afd-8bb2-4cd519ec4c76',
 'journal?': True,
 'originalName': 'Jun 11th, 2023',
 'file': {'id': 28527},
 'format': 'markdown'}

ok but then how to get a page by name?

13:06 ok interesting, so I tried again, but this time using the literal “/” forward slash instead of the “%2F”, and now it worked!


In [13]: response = get_page(
    ...:     "blogpost/2023-06-11-semantic-code-search-first-stab"
    ...: )
    ...: response.status_code
Out[13]: 200

In [14]: response.json()
Out[14]: 
{'properties': {'public': True},
 'updatedAt': 1686526563885,
 'createdAt': 1686526225155,
 'id': 28906,
 'propertiesTextValues': {'public': 'true'},
 'name': 'blogpost/2023-06-11-semantic-code-search-first-stab',
 'uuid': '648728d8-d545-42ce-b487-986d6cac3d55',
 'journal?': False,
 'originalName': 'blogpost/2023-06-11-semantic-code-search-first-stab',
 'file': {'id': 28909},
 'namespace': {'id': 28907},
 'format': 'markdown'}

ok
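
One plausible explanation, and this is my assumption rather than something confirmed in the logseq docs, is that the page name travels inside the JSON body and not in the URL, so nothing ever decodes the “%2F” and logseq looks for a page whose name literally contains “%2F”. If page names might arrive percent-encoded (say, copied from a URL), a small sketch for decoding them before calling `get_page` could look like this. `normalize_page_name` is a made-up name.

```python
from urllib.parse import unquote


def normalize_page_name(name):
    """Decode percent-escapes such as %2F so the name matches the logseq page name."""
    return unquote(name)


encoded = "blogpost%2F2023-06-11-semantic-code-search-first-stab"
print(normalize_page_name(encoded))
# blogpost/2023-06-11-semantic-code-search-first-stab
```

A name that is already plain, like "blogpost/2023-06-11-semantic-code-search-first-stab", passes through unchanged.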