Dev Week 18

Another backdated blog post.

I've found some nirvana: AI + BookStack for a self-writing Knowledge Base.

I have tried wiki software for building knowledge bases, including MediaWiki (the software behind the Wikipedia family of websites) and Trac (Python-based), but they always fail my use cases. They're excellent for knowledge graphs, which have arbitrarily many connections, but not so great for a general knowledge base: you must manage the indexes and connections yourself, which becomes tedious very quickly.

Enter BookStack (GitHub). The simplest way to use BookStack is to have Books with pages. If you fall in love with it as much as I have, you'll also be using chapters and shelves.

Shelves have books; books have pages and chapters; chapters have pages. That gives you up to four levels of organization.

A book can appear on multiple shelves; a shelf is a lightweight grouping of some books into a logical unit. Your "Network Security" book could appear on both the Network shelf and the Security shelf.

There's a BookStack API integration on GitHub/PyPI, which is just a lightweight wrapper that dynamically loads the OpenAPI spec from your BookStack instance to learn what the API looks like (i.e., the API isn't baked in, so the library could likely be pointed at other APIs without substantial effort).

There was a bug: the URL the library constructed to fetch the OpenAPI spec ended with an extraneous trailing slash, which triggered a redirect, and the authentication failed to apply when the client followed the redirect to the URL without the slash. I submitted a PR and hopefully it will get merged in someday.

Using this BookStack API library, I wrote a script (in a Jupyter Notebook) to download the entire hierarchy of content from my BookStack, and a printer that accepts an optional depth. Here's an example snippet:

Systems and Infrastructure
  Gentoo
    📄 Miscellaneous
    📄 Portage Package Management
  Desktop Environments
    📗 Input Configuration
      📄 X11 Input Tools (xev, xinput)
      📄 Kernel Input Tools (evtest, showkey)
      📄 udev Configuration
      📄 libinput Setup
    📗 X11 System
      📄 Window Management with xdotool and X11 Tools
      📄 X Authentication
  Linux
    📗 Application Standards
      📄 XDG Base Directory Specification
      📄 Icon Themes and Resources
      📄 Desktop Entry Files
      📄 Application Autostart
      📄 Runtime Directories
  System Administration
    📄 Sudo Configuration
    📄 Borg and Frontends
    📄 Common CLI Tools Reference
    📄 Mount Information
    📗 System Logging
      📄 Logrotate Configuration
      📄 Rsyslog Setup
      📄 Systemd Journal Management
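
If you're curious, here's a rough standalone sketch of that fetch-and-print idea, using the plain BookStack REST API with requests instead of the wrapper library. The instance URL and token are placeholders, pagination is ignored, and the field names (book_id, chapter_id) are my reading of the API docs rather than code lifted from my actual notebook:

import requests

BASE_URL = "https://bookstack.example.com"                   # hypothetical instance URL
HEADERS = {"Authorization": "Token TOKEN_ID:TOKEN_SECRET"}   # placeholder API token


def fetch(endpoint):
    # Grab one page of results from a list endpoint (real code should handle pagination).
    resp = requests.get(f"{BASE_URL}/api/{endpoint}", headers=HEADERS, params={"count": 500})
    resp.raise_for_status()
    return resp.json()["data"]


def print_hierarchy(max_depth=3):
    # Shelves are left out for brevity; this prints books > chapters > pages.
    books, chapters, pages = fetch("books"), fetch("chapters"), fetch("pages")
    for book in books:
        print(book["name"])
        if max_depth < 2:
            continue
        for page in (p for p in pages if p["book_id"] == book["id"] and not p.get("chapter_id")):
            print(f"  📄 {page['name']}")
        for chapter in (c for c in chapters if c["book_id"] == book["id"]):
            print(f"  📗 {chapter['name']}")
            if max_depth >= 3:
                for page in (p for p in pages if p.get("chapter_id") == chapter["id"]):
                    print(f"    📄 {page['name']}")


print_hierarchy(max_depth=3)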

The benefit of this is that I can quickly and easily share my structure with [insert your friendly chatbot's name here] and ask for recommendations on the existing structure and how to restructure based on new content.

As an example, the XDG Base Directory Specification would have originally gone in the Desktop Environments book since it's related to X11, but Claude suggested that since it's a specification and more general to Linux and not tied to desktop environments, it makes more sense to put it in the Linux book. A big part of my development of this whole structure is understanding how the pieces relate to each other. For example, I don't know whether a Riesling wine is a type of grape or a company (I just looked it up, and it's a variety of grape). By codifying my knowledge into a knowledge base as I learn, it helps me solidify the hierarchy of knowledge and relationships between concepts.

Sometimes I expect a knowledge base article for a given topic to be pretty straightforward, in which case I just ask the AI, "please write a md kb article about common SSH tunneling commands." (md: markdown format; kb: Knowledge Base.) For concepts I'm less certain about, such as XAuth, I follow a line of questioning until I feel the conversation covers my previous lack of understanding reasonably well, and then ask, "would you please summarize our conversation as a md kb article?"

From there, if it's not clear where it fits in the hierarchy, I can include some context from the output of my program, such as the snippet above, remove anything clearly unrelated, and ask it to propose where the new KB article fits and to suggest any improvements to the overall structure and organization. Just a few days ago I did a major restructuring, because one of my books was originally "Linux and Gentoo", which was extremely restrictive and made it more difficult to add content for either one. Claude even included some of the 📗 and 📄 emojis in its output and followed the same indentation structure, making it very easy to verify and integrate the suggestions into my BookStack instance.

Overall, I'm elated with this combination. I haven't verified every piece of information in all the articles, and I'm certain there's a lot of information that isn't correct, but I feel like the happiest rat collecting all manner of shiny objects and gathering them back in my tunnel. It gives me a sense of satisfaction with building my own knowledge base of things I care about and am interested in. It also incentivizes me to try to understand things that I previously had no hope of remembering later. Now I have a hierarchy clearly laid out. Sometimes it's more important to know where to find information than it is to know the information itself. This is especially true today in the Information Age (AI Age? AI Era?), when it's impossible to know everything and vital to be able to figure anything out quickly, since that's the majority of my day-to-day life now.

Dev Week 17

This post is actually being written 2025-01-01, but I'm backdating some posts to make up for the weeks I missed. This post is about OpenAI's recent announcements of its o1 models, which makes it all the more puzzling that I'm backdating something to a time before it became known/relevant to the world.

The typical end-user AI that most people have access to is still pretty dumb, and I wouldn't count on it getting a whole lot smarter. I'm not saying it won't; I'm saying I'm keeping my expectations low. However, I always thought it was a bad idea to feed input directly from the user to an LLM and provide the output directly back to the user. Even with all the attempts at "guardrails", this makes a lot of assumptions of trust in both the user and the AI. And researchers have shown that models can be made to spill their secrets or act or speak against their safety directives.

I always felt that there needed to be some smaller, simpler AI watching the output as it comes out of the model to check basics such as: is the response actually addressing the request? Is its reasoning sound? Is it hallucinating? This seems pretty obvious to me, and apparently one of OpenAI's biggest recent breakthroughs is to have the model produce many (hundreds? thousands?) of candidate outputs and have another model, trained directly on assessing the output of other models, pick the best (or at least most sound?) among all the responses. My idea would have involved a more programmatic/software approach with more communication overhead, but I'm certain those smart people over there could figure out how to integrate it all into a single model architecture and reduce communication between the components if that's what they're going for.

Apparently their new stuff is blowing everything previous out of the water. Every time a new benchmark gets created with the intention of stumping the models for a while longer, it gets beaten within 1-2 months.

The foreshadowing to OpenAI's release of these models was Sam Altman's earlier unsupported claim that we could be hitting AGI soon. This originally sounded like a way to appease shareholders, but now we know what he was so excited about.

This raises a million possible next conversation points, from politics to capitalism to marginalization to World War 3, but I'll leave it at that and get to my next back-dated post.

Dev Week 16

I'm backfilling weeks I missed. December was rough.

I'm currently running Vikunja for my todos. I only started using it recently. It's pretty neat, but it's lacking in some of the basics, and the UI feels tacky. Specifically, when switching between projects on mobile, you must tap the same project twice before it will actually open. There's also no protection against double entry: if you hit enter twice after typing a todo or a new project name, it gets created twice.

I tried out StandardNotes as an alternative. It looks really cool and useful, but as a self-hoster, I couldn't get it working fully. I got the server running, and could see that I could make a network connection to it; I got the client working, and was able to create a note in the local browser store. But I couldn't log in and sync from the client to the server. They were communicating, but the login failed and there weren't enough helpful messages to diagnose it. From watching the server error messages, I saw "No cookies provided for cookie-based session token." One second, the client would say that I was signed in and my workspace had been synced; the next, it said that login credentials were required to continue.

I think StandardNotes has a really nice stance on privacy, TNO, and their business practices generally, but for software and an ecosystem that touts itself as privacy- and security-oriented from the ground up, the absolute minimum basics of connecting the client and the server must be fully working and stable before I can use it. The concern with this buggy, opaque behavior and insufficient error messages is that one day it could stop working and be nearly impossible to diagnose, losing all of your hard work.

I'm sure StandardNotes works better for someone who just says "heck it" and uses their client with their server. As long as you trust the channel you downloaded the client from and trust that the software hasn't been tampered with anywhere in the supply chain, I can see this being a beautiful piece of software.

However, I'm heavily focused on self-hosting. I carefully followed their instructions, and everything seemed to be almost working perfectly, but I hit a roadblock getting the client to talk to the server correctly.

Maybe I'll try again someday, but considering Vikunja is working for the moment, and the great risk of losing any data I enter into StandardNotes, I can't recommend StandardNotes as a self-hosted option for notes.

Posted in Dev

Dev Week 15

This week I worked on the podcast player's autoplay!

Previously, when you were listening to an episode and got near the end, you'd hit the next key to listen to the next episode... and the next episode would start loading, but it wouldn't play. You'd then have to manually hit the play button, which can be difficult, especially on small devices (think: trying to start the next episode on your mobile device while driving). Also, the playback speed wouldn't persist, so you'd have to click another tiny radio button to restore it. Now, whether the player would automatically switch to the next episode if you let it run all the way to the end without pushing the next button, I don't know. It was supposed to, but I didn't test it thoroughly because I always hit "next" manually.

Now, it automatically restores the playback speed and starts playing the episode! Single click to progress!

I thought I had seen a page on MDN (Mozilla Developer Network) about the audio tag's event lifecycle, but I couldn't find it after what felt like too much time spent. Instead, I added some logging scaffolding to watch every audio element event as it occurred. This code overrides console.log so that every log message gets appended to a div element on the page in addition to the console, enabling the viewing of logging even on mobile!

const originalLog = console.log;

const logToScreen = (...args: any[]) => {
  originalLog.apply(console, args);

  // Create debug element if it doesn't exist
  let debugDiv = document.getElementById('debug-log');
  if (!debugDiv) {
    debugDiv = document.createElement('div');
    debugDiv.id = 'debug-log';
    debugDiv.style.cssText = 'position: fixed; bottom: 0; left: 0; right: 0; background: rgba(0,0,0,0.8); color: white; padding: 10px; max-height: 200px; overflow-y: auto; font-family: monospace; font-size: 12px;';
    document.body.appendChild(debugDiv);
  }

  // Add new log entry
  const entry = document.createElement('div');
  entry.textContent = `${new Date().toISOString().split('T')[1].split('.')[0]} ${args.join(' ')}`;
  debugDiv.appendChild(entry);

  // Keep only the most recent 100 entries
  while (debugDiv.childNodes.length > 100) {
    debugDiv.removeChild(debugDiv.firstChild!);
  }
};

// console.log = logToScreen;

All you need to do is uncomment the last line (console.log = logToScreen;) to enable it, and ensure your code has some console.log calls.

This allowed me to realize what I really needed to do: record the speed and play state whenever they change; when the loadedmetadata event fires, trigger a play; when play is triggered, restore the values. There's some extra boolean logic to make sure it only restores at the correct times, but that's the basic gist of it.

This also reminds me that I'd like to style the UI so the playback speed radio buttons look like nice big clickable buttons, and increase the size of the forward/backward/previous/next buttons, to make the whole thing more mobile-friendly.

As if that's not enough, I also want to show the date of the episode on the page, but that requires a backend change and I've been super focused on what I can do by changing only frontend code. It gives me a greater appreciation for communication and teamwork: if you only know frontend or only backend and depend on someone else for vital functionality, then when they don't do it, it simply doesn't get done and you're stuck! That's certainly a nice thing about knowing both backend and frontend.

I've always been more of a backend gal, but I think I'm finally getting the hang of React, especially by using lots of custom hooks to keep functionality nice and self-contained. The code needs another refactor after working on the autoplay, but it's worlds apart from the raw HTML/JavaScript I wrote for my dashboard/weather app back in 2017. Unfortunately that code was a bit too tied in with my "personal stuff" and I never published it on GitHub. Thinking back, that "personal stuff" may be entirely irrelevant and non-sensitive today, so hopefully I'll get a burst of motivation someday to go find it, clean it up, and publish it.

Back on "As if that's not enough", I have a lot of functionality I want to implement, and I'm trying to avoid making a huge list and overwhelming myself, instead focusing on the most important usability functionality right now. The past two weeks I got playback to be really smooth, being able to play the audio as soon as you hit the page instead of waiting several seconds, and the ability to switch to playing the next episode at the same speed with a single click. I think these are massive usability improvements. I'm wanting to implement the improvement I mentioned on last week's post about the server loading segments from file in increasing sized groups, instead of one segment from the filesystem per request while running an additional backend job to load the entire file. While the algorithm is unclear, the idea is not.

As I was working on these two features, it really made me itch to start doing automated browser testing with Selenium. I've been exposed to Selenium before (I got to use it briefly for work once or twice), and it's not terribly scary. All my time spent manually sliding and scrolling and skipping and clicking would have been a lot faster (or at least less tedious on my part) with an automated testing suite. There's so much I want to do!!!
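
To give a flavor of what that could look like, here's a minimal Selenium sketch in Python. The URL and CSS selectors are made up for illustration; the real player's markup will differ:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://localhost:3000")  # hypothetical address for the player
    wait = WebDriverWait(driver, 10)

    # Hypothetical selectors for the play button and the 2x speed radio button.
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#play"))).click()
    driver.find_element(By.CSS_SELECTOR, "input[name='speed'][value='2']").click()

    # Skip to the next episode and confirm the audio element actually starts playing.
    driver.find_element(By.CSS_SELECTOR, "#next").click()
    wait.until(lambda d: d.execute_script(
        "const a = document.querySelector('audio'); return !!a && !a.paused;"))
    print("autoplay works!")
finally:
    driver.quit()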

In case anyone is wondering why the image of the console logs shows "Adding event listeners to videoRef number 1" and another message each twice: in development mode, React runs each useEffect twice in a row to help developers catch bugs, because useEffect is supposed to be idempotent, i.e. running it twice in a row should behave exactly the same as running it once. See the docs if you're interested.

I've made it up to SN episode 146. Steve mentioned Black Viper (https://www.blackviper.com/), who has an "In Loving Memory" page; I'm considering making one of my own. Additionally, I need to rename Dev Week to Tech Week.

In my home lab stuff, I struggled with Element/Synapse/Matrix a little bit. I needed to manually reset permissions on my Element client (which had gone from something like rwxr-xr-x to ------ [?!]), change the configs to stop automatically logging my dang clients out, fix router DNS settings (an old, incorrect record was overriding the correct one), and fix up the Nginx server to redirect the .well-known paths for server discovery. Phew! All that just to post the image above of the logs in the playlist app, but now my entire system is much nicer, with additional comments in all the config files I touched, setting up a highway I can drive on the next time I have to deal with something. (No, I'm not using Ansible or Terraform, which I'd love to use if I didn't already have so much stuff set up without them; maybe the next time I migrate servers, or maybe I'll experiment with a Raspberry Pi... if I weren't so busy building a podcast app.) I'm getting much better at commenting and leaving myself breadcrumbs so I can get back up to speed faster.

I interact with a lot of different FOSS technologies and I'm wondering about including them in my blog posts more, hence Element/Synapse/Matrix, Nginx, Ansible (Terraform isn't FOSS), and Raspberry Pi. I don't even know if I've mentioned anywhere on my site: I use Gentoo Linux with Xfce4 as my desktop environment. I love looking through GitHub repos and installing various FOSS software, setting up lots of Docker containers with different apps. I have around 27 apps installed as Docker stacks/Compose projects, and I'd like to mention at least 1-2 each week.

Thanks for joining us for this week's Dev Tech Week!

Posted in Dev

Dev Week 14

Oh no, I missed another blog post this week! (This is being written after the fact.) Fear not, though, for I have not stopped work on my podcast-/playlist-listening app!

This week I implemented a dramatically improved instant-listening feature. Previously, when switching to a new episode, there would be a nontrivial delay before the episode would start playing. Now, it starts playing instantly!

It's still not perfect as I've been playing with it, but it's much better. In particular, it doesn't handle switching episodes quickly, because it's still loading an entire episode in the background while it's feeding up short segments. I have ideas to write a nice little algorithm to load up chunks in a more intelligent fashion, however. To understand the improvement, we first need to understand how it works now after the updates:

  • When the first request for a given episode comes in for an audio segment, load only that segment and feed it up.
  • Start a background process that begins loading the entire episode.
  • If additional segment requests come in while the whole episode is still loading, serve them with additional single-segment reads.
  • Finally, when the whole episode is done loading into RAM, feed segments up from memory.

There are a few scenarios this doesn't handle well:

  • Switching to an episode and then scrubbing back and forth through it multiple times right away.
  • Switching back and forth between episodes.
  • Clients that request a large number of segments upfront.

In all these cases, it bogs down the server with requests for individual segments which each individually have to context switch to reading only a specific segment of the file, all stacked on top of each other. This is very painful to the end user as the server stops responding quickly.

The new process that I'll eventually write goes something like this (a rough sketch in code follows the list):

  • When a request for a given episode comes in, load that segment and feed it up immediately. Keep track of this request in a data structure but don't do any additional loading (note that at this point some clients such as iOS/Safari will have already sent in 5+ requests at a time).
  • When we see the second request for a following contiguous segment, we see that we just had a request for the previous segment. Instead of loading the entire episode or only one segment, load the following 2 segments into the cache and serve up the next 1 segment (the request for the following segment that we're loading from file is likely already in the pipeline, so at least now we'll already have that in RAM and don't need to issue another file read).
  • When the next request comes in for a segment just beyond what we just loaded, we see that they are more contiguous blocks. The first time, we loaded 1 segment; the second time, we loaded 2 segments; now load 4 segments.
  • Each time this process occurs, we load 2x the number of segments as we loaded the last time. The assumption is that if the client keeps requesting segments from the same episode then they will likely continue to do so, but by not loading the entire episode all at once, we leave the server ready to make requests for segments from other episodes.
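
Here's a rough sketch of that doubling bookkeeping. The exact counts differ slightly from the bullets above, and read_segment is a hypothetical placeholder for however a single segment actually gets pulled off disk:

from dataclasses import dataclass, field


def read_segment(episode_id: str, index: int) -> bytes:
    # Hypothetical placeholder: read exactly one segment of one episode from disk.
    raise NotImplementedError


@dataclass
class EpisodePrefetcher:
    episode_id: str
    cache: dict = field(default_factory=dict)  # segment index -> bytes
    last_index: int = -2                       # index of the last segment served
    window: int = 1                            # how many segments to prefetch next time

    def get_segment(self, index: int) -> bytes:
        if index not in self.cache:
            self.cache[index] = read_segment(self.episode_id, index)
        data = self.cache[index]

        if index == self.last_index + 1:
            # The client is marching forward contiguously: prefetch a doubling window
            # of upcoming segments so its next requests are served straight from RAM.
            for i in range(index + 1, index + 1 + self.window):
                if i not in self.cache:
                    self.cache[i] = read_segment(self.episode_id, i)
            self.window *= 2                   # 1, 2, 4, 8, ... segments
        else:
            self.window = 1                    # the client jumped around; start over

        self.last_index = index
        return data

The nice property is that a client jumping around only ever triggers single-segment reads, while a client listening straight through quickly ends up being served from RAM.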

This still isn't perfect, but it would be another huge improvement in client response speed.

I'm also considering talking about non-development concepts in my blog posts, and I already renamed it from Game Week to Dev Week, but I'm thinking I'll have to rename it again, to Tech Week. There are plenty of times when listening to Security Now that I think I might like to talk about a security/privacy issue instead of development, especially on slower development weeks.

Thanks for joining me for another Tech Week! 🙂

Posted in Dev

Dev Week 13

pixel art of violet's profile picture

This week was all about refactoring, and fixing that pesky problem where playback position wouldn't work on iOS/Safari!

As a side note, I've tried my hand at some pixel art (see featured image above).

I wanted to give iOS users a button to reload their last playback position when returning to the app. Due to the way browsers work, and iOS/Safari in particular, I had to create some workarounds to detect when the audio had loaded enough that it was possible to programmatically change the playback position; only then would the button become pressable. However, in the process, I found that the button was unnecessary, because if we programmatically change the playback position once this event is detected, it works on iOS too!

I also updated it so that it will show the episode information and picture on the playback info section! See image:

Right now it's hard-coded to "Security Now!" and their podcast logo as a proof of concept, but the episode information shows the current episode number and title. This should work on Android devices as well since it uses the browser-based API.

Sadly, iOS has other issues which I and many other people struggle with. Commonly, if you're listening to or watching something in Safari, background it by leaving the app, pause, wait a few seconds, and then try to play again, it will often completely forget what you were playing. To add insult to injury, your play button will now start random music from your (formerly iTunes) Music app! As a developer, it seems there's nothing that can be done about this. People complain about it as recently as iOS 17, the current or most recent version, and I'm experiencing the issue on iOS 16. If you can find an app which doesn't have this problem, please let me know and I'll try to figure out the secret sauce incantation they use to ask iOS pretty please to nicely not wipe your current playback information when you pause for two seconds.

This week I also did some refactoring into more custom hooks, making each part of the code self-contained and understandable. While it might not be apparent as an end-user, it certainly makes the application much more scalable for adding more features.

I also received my first request for additional podcasts!! This will probably become my primary next goal. I'm not quite sure what this will look like yet, but it will require some reworking to ensure that the application supports multiple podcasts. I will most likely also need to write another program (or make my current podcast generator program more generic) to produce more podcast metadata files.

Thank you for joining us this week! See you next week!

Posted in Dev

Dev Week 12

Screenshot of the podcast player web user interface.

The past two weeks, my main focus has been on my podcast/playlist app! It's received some serious improvements. The most important new functionality is that the audio plays directly in the webapp instead of on the server, so you can take it with you!

We could simply feed the full audio file up to the client, but this has several problems. If you had already listened to part of an episode and wanted to continue where you left off, and you were in a bandwidth-constrained environment (say, on mobile in a building with poor connection speed), you'd have to wait for the entire full-size file to download up to the point where you left off before you could continue listening.

There are two primary technologies that support mobile-friendly streaming: HLS and DASH. I implemented HLS, but in writing up this blog post I realized HLS is an Apple-developed standard (natively supported mainly by Safari). I'll be implementing support for DASH in a bit.

HLS and DASH both allow audio and video data to be streamed in chunks. This is what you see when you're watching a YouTube video and the little loading bar loads slightly past your current playback position rather than loading the entire video. This makes it easier to jump around the episode, and it also reduces bandwidth waste, especially if you end up not watching the whole video.

HLS and DASH also support streaming with an adaptive bitrate. This means that if your internet connection is very slow, the client (your phone or other device) can request the audio/video data at a lower bitrate, so it can continue streaming with low or no interruption, just at a lower quality.

Because the audio is streamed directly to the device, we were able to rip out all of the audio controls from both the client and the server. The server only has to serve up metadata and audio data, while the client has full control over playback and requesting the media chunks it needs as it needs them.

Another feature added is registration with your device's media controls, so you can use your OS-based player or your headphones' media keys such as play/pause, next/forward, and back/previous. This is called "Media Session" in web browser parlance.

A minor feature added is showing correct episode information on the page. The screenshot from the previous blog post (dev week 10) shows the episode's date and the url to the original media file; now it shows the episode number, actual title, and the description. Although m3u8 doesn't appear to directly support the concept of an "episode description", we're able to fit it into the metadata and then parse it out properly to include that data anyway. Now the date isn't showing, but we'll be adding that back in. That requires a few steps to get the date to propagate all the way through, but it's not difficult at all.

My main upcoming goal is to speed up the loading of audio. We already managed to make it much faster and less buggy with a few tricks, primarily ensuring that different parts of the UI load on the client at the right times and stay loaded when they need to be; however, it's still too slow for my liking. In particular, if you use the arrow keys to switch between episodes too quickly (i.e. try to switch episodes twice in a row before the first one loads), the whole app hangs while your queued requests to the server get processed one by one. Worse, the server loads the entire audio file of the switched-to episode, hanging both the server and the client.

There are a few main next steps to support this.

The first one is to make the server asynchronous, which should be pretty straightforward, and would allow the server to provide metadata for the next episode even if it's still loading audio data for another episode.

One of the main things slowing it down is that the pydub Python library, which we use for extracting the audio segments, loads the entire audio file into memory before it can break it into chunks and serve them to the client. Perhaps pydub has a lazy-loading capability; otherwise we'll have to figure out another way to pull chunks out of an audio file quickly. Maybe even a simple secondary ffmpeg command that's used only when pydub hasn't finished loading the file into memory yet, so the server can respond as quickly as possible, and then pydub for all subsequent requests once the audio is in memory. Either way, the goal is to make this very fast and user-friendly.
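
For the ffmpeg fallback, I imagine something like this: seek to the segment's start and emit just that slice as an MPEG-TS chunk on stdout. The segment length and codec settings here are assumptions, not the app's actual values:

import subprocess

SEGMENT_SECONDS = 10  # assumed segment length


def extract_segment(mp3_path: str, index: int) -> bytes:
    # Ask ffmpeg to seek to the segment start and emit just that slice,
    # without loading the rest of the file.
    start = index * SEGMENT_SECONDS
    cmd = [
        "ffmpeg",
        "-ss", str(start),             # seek to the segment's start time
        "-t", str(SEGMENT_SECONDS),    # take one segment's worth of audio
        "-i", mp3_path,
        "-c:a", "aac",                 # HLS .ts segments typically carry AAC audio
        "-f", "mpegts",                # MPEG-TS container, like a real .ts file
        "pipe:1",                      # write the result to stdout
    ]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout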

We have lots of other feature fantasies, for example a listing of every episode that you can scroll through and click to switch between them arbitrarily. Right now the user must manually type a number into the box, which is especially unfriendly because of instant React re-renders (debouncing would improve this but I haven't done it yet), or click the "next" button and wait for each episode to load individually before being able to skip ahead several episodes.

Right now this whole project is still a little bit toy-like, but for my slim use case, it's 100% usable and delightful. I'm excited to share that I've managed to reach episode 99 of Security Now!! I think I may have surpassed the point I had reached the last time I started listening from episode 1 around 2021, as I recently recognized some episode content but now it's less familiar. Also, right now, this week marks episode 999 and next week will be 1000!!!! Congratulations to Steve Gibson, Leo Laporte, TWiT, and the whole Security Now community!!

I'd love to hear feedback, especially if anyone has any feature requests or feature priority feedback. Feel free to post an issue on https://github.com/violet4/playlist_player/issues or go through my contact page and email me at the email address described there. Apologies to anyone who attempted to send me an email to the address that was listed there (which wasn't and still isn't valid); I had to delete that address a while back after receiving too much spam, and apparently never went in and replaced that address with a new working one. I also made a new PGP key and ensured all is working (I highly recommend kleopatra).

Thanks for joining us for Dev Week 12!

Posted in Dev

Dev Week 10

Screenshot of the podcast player web user interface.

This week I got a bit sidetracked making a new podcast/playlist player! We’ll take a pause on the machine learning / LLM / AI stuff since this has taken up most of my week and given me a source of focused motivation.

I’ve been listening to the Security Now podcast since December 2019. I started listening in reverse order from the end, which wasn’t too difficult at the time using the Podcast app on iOS, which would go to the previous episode when the current episode was finished. However, I wanted to start listening from the beginning (Episode 1), which I tried, but due to technical reasons, it wasn’t practical, easy, or possible to get the Podcast player to recognize the old episodes. At the time, I tried to create a workaround by making my own XML file that the Podcast app would recognize and allow me to play from, but for whatever reasons at the time, I couldn’t make it work robustly and eventually gave up.

Then I tried manually keeping track of what episode I was on, and the time within the episode, but my imperfect memory and record-keeping for this particular use case resulted in entirely forgetting where I was, especially if I hadn’t listened for a week. I'm still not certain how far I had listened when I started from episode 1 in 2021.

This week a work project required me to come up with a new coding problem to solve, so I decided to start working on a little podcast/playlist listener app. At first my ambitions were too high as I tried to build something complex on day one, but this forced me to consider, "What's the most impactful thing I can do right now to get something working?" I had already (within the prior week) created a script that converts Security Now episode metadata from Steve Gibson's website into an m3u8 file that I could play with VLC. (m3u is a playlist file format, a collection of pointers to other media; the 8 signifies UTF-8 encoding, which basically means the file's characters [numbers, letters, symbols, etc.] are stored in a widely compatible way.) However, VLC had no easy/friendly way of remembering my playback position.

I searched for an app that could solve this for me, but I wasn’t happy with any of the solutions I found and felt it wouldn’t be too difficult to make something myself.

The primary/first goals for something useful were:

  • Keep track of the current episode
  • Keep track of the current position within each episode, not just the current episode
  • Play/pause previous/next directly from the player so I wouldn't need to manually transcribe this data between the app and the source playing the episodes

I managed to reach that in just a couple days. At the moment it’s only a remote player; I can visit and control it from my mobile device, even away from home (thanks to Tailscale/Wireguard VPN), but it doesn’t stream the audio itself to the mobile device; it plays the audio directly on the machine that’s running the server.

The next goals are:

  • Stream the audio to mobile
  • Friendly to bandwidth-constrained environments (i.e. mobile)
  • Ability to control the media using media controls (headphone/keyboard/remote-control/OS/phone volume, play/pause, forward/backward buttons)

I already have self-contained code examples for each of those goals, but they need to be integrated together into the app.

To achieve the first two goals, we'll be using HLS, which is basically just a fancy way to split a media file up into small chunks, give the client (phone etc.) metadata so it knows about all the chunks, and then let the client request the right chunks at the right times so the audio/video plays seamlessly as if it were one media file on the client (this is what you see when you play a YouTube video and the little gray bar loads only a little bit ahead of the current playback position, while the majority of the video is not downloaded yet). I tried this on my phone and it was supported natively (iOS); on Firefox desktop I installed an addon that wraps a JavaScript library that does this; on Tera's phone it didn't seem to work. However, I'll be using the JavaScript library directly within the app so that it works seamlessly across clients without the client/user needing to install anything extra.

For the third goal, the ability to control the media using phone/headphone/keyboard/... controls, we hook into the web browser's media controls API. This worked directly in Firefox and on my phone, although browsers/clients tend not to allow this control capability unless it's tied to media playing directly on the device. Since I originally set this up as a remote control so I could get something working more quickly, there's no media playing on the client yet, so clients ignored the request for media controls and the buttons simply didn't trigger the commands. We have a proof of concept that fully worked in both Firefox and iOS when it serves up an actual media file, so it shouldn't be difficult to get this working once the previous goal (streaming the audio data to the client) is done.

For HLS/streaming, I originally got it working by using an ffmpeg command to split the audio file into an m3u8 playlist file (a metadata file with information about all the chunks) and lots of .ts media segment files, and then simply serving the entire folder statically, which the client was able to use to stream the audio. However, this splits the original media file into many small media files, which requires not only more space on the drive where the media is stored, but also managing all these extra files and making sure folders are set up. So we got a prototype that dynamically generates an m3u8 with filenames for files that don't actually exist; as the client requests those chunks, the server splits the mp3 file up in memory and serves the chunks as though they were coming from .ts files on disk. It's like magic!!!
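
Here's a rough sketch of the trick, assuming a FastAPI-style server and pydub; the real app's framework, routes, segment length, and file paths are all simplified here for illustration. The playlist lists segment URLs that don't exist on disk, and a second route slices the mp3 in memory whenever one of them is requested:

from io import BytesIO

from fastapi import FastAPI, Response
from pydub import AudioSegment

app = FastAPI()
SEGMENT_MS = 10_000                      # assumed 10-second segments
EPISODE_PATH = "episodes/sn-0001.mp3"    # hypothetical file path

# For the sketch, load the episode once at startup; the real server caches more carefully.
audio = AudioSegment.from_mp3(EPISODE_PATH)


@app.get("/playlist.m3u8")
def playlist() -> Response:
    # Build playlist entries for segment files that exist only virtually.
    n_segments = (len(audio) + SEGMENT_MS - 1) // SEGMENT_MS
    lines = ["#EXTM3U", "#EXT-X-VERSION:3", f"#EXT-X-TARGETDURATION:{SEGMENT_MS // 1000}"]
    for i in range(n_segments):
        lines.append(f"#EXTINF:{SEGMENT_MS / 1000:.1f},")
        lines.append(f"/segments/{i}.ts")
    lines.append("#EXT-X-ENDLIST")
    return Response("\n".join(lines), media_type="application/vnd.apple.mpegurl")


@app.get("/segments/{index}.ts")
def segment(index: int) -> Response:
    # Slice the requested chunk out of the in-memory audio and serve it as if it were a .ts file.
    chunk = audio[index * SEGMENT_MS:(index + 1) * SEGMENT_MS]
    buf = BytesIO()
    chunk.export(buf, format="mpegts", codec="aac")  # pydub hands these straight to ffmpeg
    return Response(buf.getvalue(), media_type="video/mp2t")

The point is that nothing named /segments/0.ts ever touches the disk; the bytes are generated on demand.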

Within just this past week, I've already managed to listen to Security Now episodes 1-25. Note, I listen at 2x speed most of the time, and although episodes are now upwards of 2 hours, they used to be 20-40 minutes back in 2005. It seamlessly keeps track of my current episode and my position within the episode!!!

We could take this in lots of different directions. I envision an episode viewer which lists all the episodes along with their titles and descriptions, allowing the user to click on one and start listening to it right away. Right now it’s also pointing directly to my Security Now podcast playlist m3u8 file, which I generated using a script that isn’t in the repository just yet (but it will be). This means that while it might be technically possible for someone else to use it, it’s not friendly to end-users or even experienced developers, yet, so I think focusing on usability for a general audience will probably be most prudent since I’m revealing this little project to the world (my tiny world at the moment, but still).

If you have any interest in this app, or any ideas on the direction it should take, feel free to comment here, or post an issue in the github, or follow instructions on my contact page, or consider any other means you have to get ahold of me.

Thank you for joining us for another Dev Week!

P.S. When I get around to it, I'll be renaming this to Dev Week. As much as I want to make a game, and have some amount of vision for how it should look, my attention is a bit too split. There are so many things I want to do!

P.P.S. Another feature of HLS is adaptive bitrate: if the client has poor network connectivity, the server can send lower-quality but smaller chunks of the media. The current setup doesn't support this yet, but we can use pydub (which we're already using to split the media into chunks) to downsample the audio before we send it to the client. However, the server needs to know how much to downsample, or whether downsampling is even required, and for that the client may need to be able to communicate its current network speed to the server. This will be another research topic and potential feature that I hope to get in, especially since I listen to episodes in grocery stores and other large stores where network bandwidth seems to mysteriously thin out or drop entirely.
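
A rough illustration of that downsampling idea with pydub (the rates and bitrates here are placeholder guesses, not decided values):

from io import BytesIO

from pydub import AudioSegment


def export_chunk(chunk: AudioSegment, slow_network: bool) -> bytes:
    # Export a chunk at lower quality when the client reports a slow connection.
    if slow_network:
        chunk = chunk.set_frame_rate(22050).set_channels(1)  # placeholder "low quality" settings
        bitrate = "48k"
    else:
        bitrate = "128k"
    buf = BytesIO()
    chunk.export(buf, format="mp3", bitrate=bitrate)
    return buf.getvalue()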

Posted in Dev

Dev Week 9

This week we'll talk a little bit about NLP, or Natural Language Processing.

ChatGPT took the world by storm, and now we have Llama, Claude, Gemini, whatever. What are they and where did they come from? I was obsessed with watching the progression of AI and Deep Learning around 2016 when Google first released TensorFlow, back when it was still only cool in the tech world, but now you'd probably have to hide under a rock to entirely avoid hearing about it.

Word2Vec

We'll start with Word2Vec since it's my first favorite machine learning algorithm. Let's borrow this image from the School of Informatics at the University of Edinburgh:

King - man + woman = queen: the hidden algebraic structure of words

Word2vec is a clever algorithm used to create what are called "word embeddings". We can start with some graph paper and write out some clusters of words that are related and unrelated. Since our graph paper surface is 2D (2 dimensions; logically, not physically), we can consider that each word has an x,y coordinate pair, and that related words are close to each other while less related words are far apart. We'll quickly notice that as we write out more than just a few words, we start having difficulty showing the far-ness of some pairs that ended up close together because of their shared relationship with another common word.

However, using only 2 dimensions is limiting, so we can use 3, 4, 5, or even 100 dimensions. Since we can't easily envision more than 3-4 dimensions (space and perhaps time), we can imagine that each of these dimensions is a different aspect of a word's meaning. Maybe one dimension is related to sweetness, one is related to respect, one or more are related to color. We don't actually know what they are, as they are automatically "learned" by the algorithm, but this helps us get some feeling for what a "word embedding" is. Elephants and horses might be close together in the "mammal" or "animal type" dimension but far apart in the "size" dimension.

The algorithm works by first assigning a semirandom vector to each word in the training data, where each number is pulled from a normal distribution centered around 0, with values approximately between -1 and 1. Imagine something like [0.12971, -0.2614, 0.08536, ...]. Since they are random, these vectors don't actually contain any semantic meaning yet. Next, the algorithm compares proximity relationships between words in the training data. This training data can be Shakespeare, emails, books, Wikipedia articles, or anything that contains lots of words and shows how those words are used and related. There are multiple ways to look at words that are close to each other, for example checking every pair of words, or every 3 words, or every 5 words, but let's imagine we're looking at every group of 4 consecutive words.

We'll assume our training text data from Wikipedia etc is already cleaned: special characters are removed and the text is broken up into individual words.

The algorithm starts at the beginning of the training text, looking at the first group of 4 words. If our first sentence is "1 2 3 4 5 6 7 8", then our first group is "1 2 3 4"; then we slide the window: our second group is "2 3 4 5", the third "3 4 5 6", and so on.

Every time we process a group of words like this, we grab the vectors corresponding to those words and perform some basic mathematical operations to move their vectors closer together. In our 2-dimensional example, imagine that we look at the coordinates of our 4 words on the graph, find the middle point of all of them, and then nudge each one slightly closer to that middle point. We only nudge them slightly, because if we nudge them too far, we'll "overfit" the training. We have lots of training data, so there's no need to forcefully snap words together the first time we encounter their pairing; if those words really belong close to each other, we'll certainly encounter them near each other in many sentences of our training data.

Nudge nudge.

There are all kinds of mathematical tricks for deciding how much to nudge the words together, how many times to train over the training data set, how to initialize the numbers in the vectors at the beginning, and so on, none of which we will concern ourselves with.
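
To make the nudging intuition concrete, here's a toy sketch of the process described above (a simplification for illustration, not the actual word2vec training objective):

import numpy as np


def toy_train(tokens, dims=50, window=4, nudge=0.01, epochs=5, seed=0):
    # Nudge each word's vector toward the middle point of every window it appears in.
    rng = np.random.default_rng(seed)
    vectors = {w: rng.normal(0, 0.3, dims) for w in set(tokens)}  # random start: no meaning yet
    for _ in range(epochs):
        for i in range(len(tokens) - window + 1):                 # slide a window across the text
            group = tokens[i:i + window]
            centroid = np.mean([vectors[w] for w in group], axis=0)
            for w in group:
                vectors[w] += nudge * (centroid - vectors[w])     # small step toward the middle
    return vectors


tokens = "the quick brown fox jumps over the lazy dog the quick red fox".split()
vectors = toy_train(tokens)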

Next, we can grab some interesting word out of a dictionary, such as "ocean". Then we can perform some calculations to figure out which other vectors are closest to our "ocean" vector, and we might find similar words that we'd expect to see in the thesaurus: sea, tide, waters. We might even find the names of oceans, things that would be found in the ocean, other large parts of our planet such as "sky" and "Earth". To play with a realtime demo and search, see https://projector.tensorflow.org for an interactive version of the following:

Finally, getting back to the graphic from earlier in this blog post:

King - man + woman = queen: the hidden algebraic structure of words

After all this training, if we grab the vectors for king, man, and woman, calculate king - man + woman with those vectors to get a new vector, and then find the real word vector that's closest to our calculated one, what do we find? Queen!!! No one explicitly told the Word2Vec algorithm how these words are related. It learned it through this rather simple process of nudging word vectors closer together based on where they appear in text. This "understanding" or meaning captured by our vectors is called "semantic meaning".
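
In code, the lookup might go something like this, given a dictionary of trained word vectors (cosine similarity, excluding the query words themselves):

import numpy as np


def analogy(vectors, a, b, c, topn=3):
    # Find the words whose vectors are closest to vector(a) - vector(b) + vector(c).
    target = vectors[a] - vectors[b] + vectors[c]

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    candidates = [(w, cosine(v, target)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


# With well-trained embeddings, "queen" should land at or near the top:
# print(analogy(vectors, "king", "man", "woman"))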

I want to work us up to understanding ChatGPT, and we're beginning with Word2Vec because I believe it's one of the most miraculous pieces of technology among the many steps it took to get to ChatGPT. Although it's primarily used for word embeddings, the same idea has been applied to whole documents and dubbed "Doc2vec", and we could also use it for Wikipedia topics themselves (think of how all the Wikipedia articles link to each other in a complex web), webpages, or anything else that has some sequential relationship. While ChatGPT may not use Word2Vec exactly, it similarly uses "embeddings". These embeddings, represented as vectors, are data that machine learning models can operate on quickly on a GPU, so they're an excellent way to imbue our machines with some kind of understanding of human language.

Thanks for joining us this week and we look forward to next week! We'll be continuing down this machine learning / LLM path for a while as we work our way up to using LLMs locally on your own machine (not in the ChatGPT/etc clouds).

Posted in Dev