Processing text with state machines

This article was also published on Hackernoon .

Processing text is a very common problem in computer science. Using state machines or finite automatons (FAs) is a great technique for processing text. This article provides a rough framework for tackling string processing problems.

Finite automatons#

An FA is a mathematical model of computation. An FA has one “start” state, one or more “final” states, and may have other intermediate states as well. The only information it stores is its current state. At any point of time during a computation, it does not know “how” it got to its current state. An FA has a transition function that maps the current state and input token to the next state. You can read more about FAs on Wikipedia .

An alternate perspective#

Implementations of FAs can be simple or complex based on the input string and the problem being solved. Devs often try to use regular expressions or recursive algorithms to tackle problems that would be much simpler with an FA. I have also come across devs who have rejected this technique, finding it “hacky” and “unreliable”. This happens because of a lack of prior knowledge about automata theory.
However, it is easy to write programs that use FAs without going deep into the theory. We can think of FAs as an alternate way to structure the program. The program goes through a set of states as it processes the input. Behaviour of the program on the next input token is defined by its current state. The FA “accepts” or “rejects” an input string based on its final state.

The sitemap parsing task#

An example of a string processing task is to parse a sitemap file and extract all the page links (the contents of the <loc> tags). This example is perfect because while the initial problem is simple, it becomes progressively harder as we move towards a more complete sitemap processor.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
   </url>
</urlset>

We will use the library htmlparser2 for parsing the XML. This library allows us to hook callbacks which are called with the tag name and attributes whenever an XML tag is opened/closed as the parser processes the input. This makes our task trivial. Think of it as a two-stage parser where the first converts a string into XML tokens and the second converts XML tokens into the required output.
The input for our FA will look like:

urlset
url
loc
/url
url
loc
/url
/urlset

Building an FA#

This FA can be used to extract all the page links. Our program should save the text content of a tag only when the current state is LOC_OPENED. FA for Sitemap 1 Another convincing argument in favour of FAs is that it is easy to add new states and transitions to it. If the specification of the input string changes, states can be added or removed from an FA to accommodate those changes. This FA handles the <lastmod> tags as well. FA for Sitemap 2

Implementation#

Our FA implementation does not require any extra libraries. A generic hashmap can be used to store the states.

// sitemapStates.js

module.exports = {
  START: {
    transitions: { urlset: 'URLSET_OPENED' },
  },
  URLSET_OPENED: {
    transitions: { url: 'URL_OPENED' },
  },
  URLSET_CLOSED: {
    final: true,
    transitions: {},
  },
  URL_OPENED: {
    transitions: { loc: 'LOC_OPENED' },
  },
  URL_CLOSED: {
    transitions: {
      '/urlset': 'URLSET_CLOSED',
      url: 'URL_OPENED',
    },
  },
  LOC_OPENED: {
    transitions: { '/loc': 'LOC_CLOSED' },
    extractText: true,
  },
  LOC_CLOSED: {
    transitions: {
      '/url': 'URL_CLOSED',
      lastmod: 'LASTMOD_OPENED',
    },
  },
  LASTMOD_OPENED: {
    transitions: {
      '/lastmod': 'LASTMOD_CLOSED',
    },
  },
  LASTMOD_CLOSED: {
    transitions: {
      '/url': 'URL_CLOSED',
      loc: 'LOC_OPENED',
    },
  },
};
// index.js
const fs = require('fs');
const htmlparser2 = require('htmlparser2');
const sitemapStates = require('./sitemapStates');

function parseSitemap(sitemapBuffer) {
  let state = sitemapStates.START;
  const pageLinks = [];

  function onOpenTag(name) {
    const nextStateName = state.transitions[name];
    if (nextStateName === undefined) throw new Error('Failed to process input');
    state = sitemapStates[nextStateName];
  }

  function onCloseTag(name) {
    const tagWithWithSlash = `/${name}`;
    const nextStateName = state.transitions[tagWithWithSlash];
    if (nextStateName === undefined) throw new Error('Failed to process input');
    state = sitemapStates[nextStateName];
  }

  function onText(text) {
    if (state.extractText) pageLinks.push(text);
  }

  const parser = new htmlparser2.Parser({
    onopentag: onOpenTag,
    onclosetag: onCloseTag,
    ontext: onText,
  });

  parser.write(sitemapBuffer);
  if (!state.final) throw new Error('Failed to process input');

  return pageLinks;
}

fs.promises.readFile('sitemap.xml')
  .then(parseSitemap)
  .then(console.log);

Running this will output:

[
  'http://www.example.com/',
  'http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii',
  'http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand'
]

Getting real#

A sitemap file has a lot more than what we’ve dealt with till now. One sitemap file can link to other sitemap files as well. You can read more about the sitemap protocol on sitemaps.org .
A sitemap file can either be a simple sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
   </url>
</urlset>

or a sitemap index file that links to other sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

FAs that can handle this can be constructed easily. FA for Sitemap 3 Please note that the sitemap spec defines two more tags <changefreq> and <priority> that we’ve omitted from our example.

Learning more#

I hope this article provides a gist of how FAs can be used. Remember that while finite automatons look simple, more powerful automatons also exist. These are required for more complex parsing tasks (for example: saving the text of the <loc> tags only when the date in <lastmod> is later than a given date).
This course on edX is a great resource for learning more about automata theory.