Apache HttpClient

For a hackathon I wanted to read some files from our BitBucket server. I knew the URLs of the files, but there were some complications. First, you need to be authenticated. According to the documentation, the preferred way of authentication is HTTP Basic Authentication when authenticating over SSL. We are using an SSL connection, but with self-signed certificates.

When working with SSL, Java uses keystores and truststores. The difference between the two is that keystores store private keys, which are used on the server side, while truststores contain the certificates of parties you trust, which are used on the client side. We have our own custom truststore, and we can tell the JVM to use that one by passing the following parameters:

-Djavax.net.ssl.trustStore=/truststore/location -Djavax.net.ssl.trustStorePassword=password

This works when you only want to access sites using your custom truststore. As soon as you want to make a connection to public sites, this fails. By default you can only use one truststore at a time. If you want more, you have to create some custom code.

Or you can use the Apache HttpClient.

HTTP Basic Authentication

There are a couple of ways to use Basic Authentication. For BitBucket we need to use Preemptive Basic Authentication, which means we need to configure an HttpClientContext.
The first thing we need to do is to set up the CredentialsProvider. This doesn’t need much explanation.

        // Configure CredentialsProvider
        final CredentialsProvider provider = new BasicCredentialsProvider();
        final UsernamePasswordCredentials credentials
                = new UsernamePasswordCredentials("username", "password");
        provider.setCredentials(AuthScope.ANY, credentials);

Next we need to configure an AuthCache. We’re going to cache the authentication for a specific host.

        // configure AuthCache
        final HttpHost targetHost = new HttpHost("host", PORT, "https");
        final AuthCache authCache = new BasicAuthCache();
        authCache.put(targetHost, new BasicScheme());

Last, we use the previous steps to configure our HttpClientContext.

        // configure HttpClientContext
        final HttpClientContext context = HttpClientContext.create();
        context.setCredentialsProvider(provider);
        context.setAuthCache(authCache);

HttpClient with custom SSL context

Now it’s time to configure our HttpClient. We’re going to load our truststore specifically for this client. This means that other clients and connections will still use the default Java truststore.

        SSLContext sslContext = new SSLContextBuilder()
                .loadTrustMaterial(
                        new File(configuration.getTruststoreLocation()),
                        configuration.getTruststorePassword().toCharArray()
                ).build();

        SSLConnectionSocketFactory sslSocketFactory = 
                new SSLConnectionSocketFactory(sslContext);
        return HttpClientBuilder.create()
                .setSSLSocketFactory(sslSocketFactory)
                .build();

Using the HttpClient to execute the request

Now that we have configured both the HttpClient and its context, executing a request also becomes easy. Note that we need to pass the context to the execute method.

        final HttpGet request = new HttpGet(link);
        HttpResponse response = httpClient.execute(request, context);
        InputStream connectionDataStream = response.getEntity().getContent();
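For reference, here’s how all the pieces fit together. This is a minimal sketch against HttpClient 4.x; the host, port, credentials, file URL and truststore location are placeholders that you’d replace with your own values.

import java.io.File;
import java.io.InputStream;

import javax.net.ssl.SSLContext;

import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.ssl.SSLContextBuilder;

public class BitbucketFileReader {

    public static void main(String[] args) throws Exception {
        // Preemptive Basic Authentication, as described above
        final HttpHost targetHost = new HttpHost("bitbucket.example.com", 443, "https");

        final CredentialsProvider provider = new BasicCredentialsProvider();
        provider.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("username", "password"));

        final AuthCache authCache = new BasicAuthCache();
        authCache.put(targetHost, new BasicScheme());

        final HttpClientContext context = HttpClientContext.create();
        context.setCredentialsProvider(provider);
        context.setAuthCache(authCache);

        // Custom SSL context with our own truststore
        final SSLContext sslContext = new SSLContextBuilder()
                .loadTrustMaterial(new File("/truststore/location"), "password".toCharArray())
                .build();

        try (CloseableHttpClient httpClient = HttpClientBuilder.create()
                .setSSLSocketFactory(new SSLConnectionSocketFactory(sslContext))
                .build()) {
            final HttpGet request = new HttpGet("https://bitbucket.example.com/path/to/file");
            final HttpResponse response = httpClient.execute(request, context);
            try (InputStream content = response.getEntity().getContent()) {
                // read the file contents from the stream
            }
        }
    }
}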

Linux: Spring Boot as a service on port 80

Well, that’s a mouthful. This blog post shows how to run your Spring Boot application on port 80 when your Linux server starts. We are going to use Ubuntu or Linux Mint for this, and we’re going to assume that Java is installed.

Setup Spring Boot

The first thing we need to do is tell Maven to make our Spring Boot application runnable. We’ll configure the spring-boot-maven-plugin to make the jar that’s going to be built executable.

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
				<configuration>
					<executable>true</executable>
				</configuration>
			</plugin>
		</plugins>
	</build>
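Because the jar is now executable, it can be started directly, without java -jar. This is also what allows systemd to start the jar later on. The exact artifact name depends on your project; something like:

./target/website-1.0.jar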

By default Spring Boot runs on port 8080. There’s no need to change that here; we’ll deal with it later. But if you want, you could start the application on a different port by configuring it in application.properties:

server.port=8079

Run Application as a Service

Set permissions

The first thing we need to do is to copy the application to its final destination, and create a user to run the app. By creating a separate user for this application, we limit the damage that can be done in case there is a security breach. I’m going to call the application “website” and I’m going to place it under /var/website, to make things easy. Once the application is in its place, change directory to /var/website, create a new user and transfer ownership to the new user. We’re going to allow the newly created user to read and execute the file, but not to write to the file. This, again, limits the damage that can be done when there’s a security breach.

sudo useradd website
sudo passwd website
sudo chown website:website website.jar
sudo chmod 500 website.jar

Setup systemd service

To manage starting and stopping our application, we’re going to turn it into a service maintained by systemd. We need to tell systemd how to run our service by creating a configuration file.

Create /etc/systemd/system/website.service with the following content:

[Unit]
Description=My spring-boot application

Wants=network.target
After=syslog.target

[Service]
User=website
Type=simple
ExecStart=/var/website/website.jar
Restart=on-failure
RestartSec=60
KillMode=mixed

[Install]
WantedBy=multi-user.target

This is what the options mean:

  • Description: A text description of what the service is for.
  • Wants: The services that should, but are not required to, be started before this service.
  • After: The services after which this service should be started. They are started first, if they are not already running.
  • User: The user that should be used to start this service.
  • Type: This indicates the way the service is started. We’ll set it to simple.
  • ExecStart: The command, with parameters, that needs to be executed to start this service.
  • Restart: When the service should be restarted.
  • RestartSec: If the service needs to be restarted, how many seconds systemd should wait.
  • KillMode: How the process will be stopped by systemd. We’ll set it to mixed, which sends a SIGTERM signal to our service, allowing it to shut itself down cleanly.
  • WantedBy: We want our service to be started when the system is up and running, and accepting connections from users.

For a full list of options, see the systemd.unit and systemd.service man pages.
KillMode is described in the systemd.kill man page.
For more info on WantedBy, see the [Install] section of systemd.unit.

For the security reasons mentioned above, we’ll change the permissions of this file:

sudo chmod 640 /etc/systemd/system/website.service

Enable service to start on boot

The next thing we want to do is to indicate that this service should be started when the server is starting.

sudo systemctl enable website
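To start the service right away and check that it’s running, use the standard systemctl commands:

sudo systemctl start website
sudo systemctl status website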

Setup nginx

Now we have our application installed as a service, and it’s started when the server starts up. But the application is running on port 8079, which is not what we want. Linux only allows the root user to open TCP ports below 1024. To fix this, we’re going to use nginx to forward port 80 to our application.

Create /etc/nginx/sites-enabled/website.conf with the following content:

server {
        listen  80;
        server_name     _;
        location /  {
            proxy_pass          http://localhost:8079/;
            proxy_redirect      off;

            proxy_set_header   Host             $host;
            proxy_set_header   X-Real-IP        $remote_addr;
            proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
        }
}

This tells nginx to listen on port 80 for incoming connections on any hostname, and forward them to our application.
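After saving this file, verify the configuration and reload nginx using the standard commands:

sudo nginx -t
sudo systemctl reload nginx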

Multiple Full-Screens in Java

Working with Java’s Full-Screen Exclusive Mode is a bit different from working with AWT, Swing or JavaFX. The official Java tutorial describes how, and why, this works. The information, however, is a bit spread out. It also doesn’t mention how to work with multiple full screens at the same time. Not that it’s very different from working with only one screen, but it’s at least fun to try out.

The Frame

The first part of this exercise is to create a Frame to be displayed. Since the OS manages the video memory, we have to guard against losing our drawing at any time: the OS can simply reclaim the memory. The draw method looks a bit complex because of this, with its double loop structure. But on the positive side, we can simply ignore any hint from the OS that the frame should be repainted.

package nl.ghyze.fullscreen;

import java.awt.Color;
import java.awt.Frame;
import java.awt.Graphics;
import java.awt.image.BufferStrategy;

public class TestFrame extends Frame {
    private final String id;

    public TestFrame(String id) {
        this.id = id;

        // ignore OS initiated paint events
        this.setIgnoreRepaint(true);
    }

    public void draw() {
        BufferStrategy strategy = this.getBufferStrategy();
        do {
            // The following loop ensures that the contents of the drawing buffer
            // are consistent in case the underlying surface was recreated
            do {
                // Get a new graphics context every time through the loop
                // to make sure the strategy is validated
                Graphics graphics = strategy.getDrawGraphics();

                int w = this.getWidth();
                int h = this.getHeight();

                // clear screen
                graphics.setColor(Color.black);
                graphics.fillRect(0,0,w,h);

                // draw screen
                graphics.setColor(Color.ORANGE);
                graphics.drawString("Screen: " + this.id, w / 2, h / 2);

                // Dispose the graphics
                graphics.dispose();

                // Repeat the rendering if the drawing buffer contents
                // were restored
            } while (strategy.contentsRestored());

            // Display the buffer
            strategy.show();

            // Repeat the rendering if the drawing buffer was lost
        } while (strategy.contentsLost());
    }
}

The ScreenFactory

This is just a simple utility class to figure out which screens are available, and what quality images we can show on these screens.

package nl.ghyze.fullscreen;

import java.awt.DisplayMode;
import java.awt.GraphicsDevice;
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class ScreenFactory {

    private final GraphicsDevice[] graphicsDevices;

    /**
     * Constructor. Finds all available GraphicsDevices.
     */
    public ScreenFactory(){
        GraphicsEnvironment localGraphicsEnvironment = GraphicsEnvironment.getLocalGraphicsEnvironment();
        graphicsDevices = localGraphicsEnvironment.getScreenDevices();
    }

    /**
     * Get a list of the IDs of all available GraphicsDevices
     * @return the list of the IDs of all available GraphicsDevices
     */
    public List<String> getGraphicsDeviceIds(){
        return Arrays.stream(graphicsDevices).map(GraphicsDevice::getIDstring).collect(Collectors.toList());
    }

    /**
     * Get a single GraphicsDevice, by ID. Return an empty optional if none is found.
     * @param graphicsDeviceId the ID of the requested GraphicsDevice
     * @return an optional which contains the GraphicsDevice, if found.
     */
    public Optional<GraphicsDevice> getGraphicsDevice(String graphicsDeviceId){
        return Arrays.stream(graphicsDevices)
                .filter(graphicsDevice -> graphicsDevice.getIDstring().equals(graphicsDeviceId))
                .findAny();
    }

    /**
     * Get all available DisplayModes for the selected GraphicsDevice
     * @param graphicsDeviceId the ID of the GraphicsDevice
     * @return a list of DisplayModes
     */
    public List<DisplayMode> getDisplayModes(String graphicsDeviceId){
        GraphicsDevice gd = Arrays.stream(graphicsDevices)
                .filter(graphicsDevice -> graphicsDevice.getIDstring().equals(graphicsDeviceId))
                .findFirst()
                .orElseThrow(IllegalArgumentException::new);

        return Arrays.stream(gd.getDisplayModes()).collect(Collectors.toList());
    }

    /**
     * Get the best DisplayMode for the selected GraphicsDevice.
     * Best is defined here as the most pixels, highest bit-depth and highest refresh-rate.
     * @param graphicsDeviceId the ID of the GraphicsDevice
     * @return the best DisplayMode for this GraphicsDevice
     */
    public DisplayMode getBestDisplayMode(String graphicsDeviceId){
        List<DisplayMode> displayModes = getDisplayModes(graphicsDeviceId);
        DisplayMode best = null;
        for (DisplayMode displayMode : displayModes){
            if (best == null || isBetterOrEqual(best, displayMode)){
                best = displayMode;
            }
        }
        return best;
    }

    /**
     * Compare two DisplayModes lexicographically: first screen size,
     * then bit depth, then refresh rate.
     */
    private boolean isBetterOrEqual(DisplayMode current, DisplayMode potential){
        if (potential.getWidth() * potential.getHeight()
                != current.getWidth() * current.getHeight()){
            return isScreensizeBetterOrEqual(current, potential);
        }
        if (potential.getBitDepth() != current.getBitDepth()){
            return isBitDepthBetterOrEqual(current, potential);
        }
        return isRefreshRateBetterOrEqual(current, potential);
    }

    private boolean isScreensizeBetterOrEqual(DisplayMode current, DisplayMode potential){
        return potential.getHeight() * potential.getWidth() >= current.getHeight() * current.getWidth();
    }

    private boolean isBitDepthBetterOrEqual(DisplayMode current, DisplayMode potential){
        if (current.getBitDepth() == DisplayMode.BIT_DEPTH_MULTI) {
            return false;
        } else if (potential.getBitDepth()  == DisplayMode.BIT_DEPTH_MULTI){
            return true;
        }
        return potential.getBitDepth() >= current.getBitDepth();
    }

    private boolean isRefreshRateBetterOrEqual(DisplayMode current, DisplayMode potential){
        if (current.getRefreshRate() == DisplayMode.REFRESH_RATE_UNKNOWN) {
            return false;
        } else if (potential.getRefreshRate()  == DisplayMode.REFRESH_RATE_UNKNOWN){
            return true;
        }
        return potential.getRefreshRate() >= current.getRefreshRate();
    }
}

Bringing it together

For every screen that we have, we’re going to make a Frame. If the screen supports a full-screen mode, we’re going to use it. Otherwise, we’re just going to use a maximized Frame. Once we’ve set up the screens, we’re going to loop over the frames and draw each of them. We’ll keep doing this in a loop until we reach some stop condition. In this case, we’re going to stop after two seconds, but you can implement it in any way you’d like. I’ve found that if you don’t implement a stop condition, and just use an infinite loop, it can be quite challenging to actually stop the program. When the program should shut down, we’re going to reset the screens to normal and dispose of the Frames. Once the Frames are disposed, the program stops.

package nl.ghyze.fullscreen;

import java.awt.DisplayMode;
import java.awt.GraphicsDevice;
import java.util.ArrayList;
import java.util.List;

public class MultiFullScreen {
    private final List<TestFrame> frames = new ArrayList<>();

    private final long startTime = System.currentTimeMillis();
    private final ScreenFactory screenFactory = new ScreenFactory();

    public MultiFullScreen(){
        try {
            for (String graphicsDeviceId : screenFactory.getGraphicsDeviceIds()) {
                GraphicsDevice graphicsDevice = screenFactory.getGraphicsDevice(graphicsDeviceId).orElseThrow(IllegalStateException::new);
                DisplayMode best = screenFactory.getBestDisplayMode(graphicsDeviceId);

                TestFrame tf = new TestFrame(graphicsDeviceId);
                // remove borders, if supported. Not needed, but looks better.
                tf.setUndecorated(graphicsDevice.isFullScreenSupported());

                // first set fullscreen window, then set display mode
                graphicsDevice.setFullScreenWindow(tf);
                graphicsDevice.setDisplayMode(best);

                // can only be called after it has been set as a FullScreenWindow
                tf.createBufferStrategy(2);

                frames.add(tf);
            }
            run();
        }catch (Exception e){
            e.printStackTrace();
        } finally {
            shutDown();
        }
    }

    private void shutDown() {
        // unset full screen windows
        for (String graphicsDeviceId : screenFactory.getGraphicsDeviceIds()) {
            try {
                GraphicsDevice graphicsDevice = screenFactory.getGraphicsDevice(graphicsDeviceId).orElseThrow(IllegalStateException::new);
                graphicsDevice.setFullScreenWindow(null);
            } catch (Exception e){
                e.printStackTrace();
            }
        }

        // Dispose frames, so the application can exit.
        for (TestFrame frame : frames){
            frame.dispose();
        }
    }

    public void run(){
        while(shouldRun()){
            for(TestFrame tf : frames){
                tf.draw();
            }
        }
    }

    private boolean shouldRun(){
        final long runningTime = System.currentTimeMillis() - startTime;
        return runningTime < 2000L;
    }

    public static void main(String[] args) {
        new MultiFullScreen();
    }
}

Jackson, JAXB, OpenAPI and Spring

Imagine that you have a service that enriches some data. The model for the data that needs to be enriched is generated from an XSD. The model for the enriching data (the wrapper) is generated from OpenAPI specs. Those two separate models need to be combined and sent out via Spring’s RestTemplate. The problem is that the generated models interfere with each other, and the combination of models doesn’t serialize correctly.

To solve the problem of interference, we’re going to transform the XSD-generated model to a generic JSON model (using the org.json library). That way we eliminate the XML annotations, and we can just send the OpenAPI model.

XML model to JSONObject

Our wrapper model accepts any Object as payload. We can set our XmlModel there, but we can also set a JSONObject. However, we need to transform our XML model to a JSONObject first. We can use Jackson for this, when we use the right AnnotationIntrospector. We’re not going to transform the model directly to a JSONObject; we use a String representation as an intermediate step. Once we have converted our XmlModel to a JSONObject, we can set that as our payload.

private JSONObject toJSONObject(XmlModel xmlModel) {
  try {
    ObjectMapper objectMapper = new ObjectMapper();
    objectMapper.setAnnotationIntrospector(
        new JaxbAnnotationIntrospector(TypeFactory.defaultInstance()));
    String content = objectMapper.writeValueAsString(xmlModel);
    return new JSONObject(content);
  } catch (JsonProcessingException e) {
    // Shouldn't happen.
    log.severe("Error translating to JSONObject");
    throw new IllegalStateException("Error translating to JSONObject", e);
  }
}

Configure RestTemplate

Now we have a model that can serialize, but our RestTemplate can’t handle it yet. Specifically, the ObjectMapper that the RestTemplate uses can’t handle the JSONObject. But we can configure a custom ObjectMapper and tell the RestTemplate to use that one. To serialize the JSONObject, we need to add the JsonOrgModule. And while we’re at it, we’re going to add the JavaTimeModule so we can serialize dates and times. We can’t set the ObjectMapper directly; we need to set it on a MessageConverter. Then we need to explicitly insert our MessageConverter as the first one, to be sure that it’s going to be used.

@Bean
public RestTemplate myRestTemplate() {
  // Create and configure the ObjectMapper
  ObjectMapper objectMapper = new ObjectMapper();
  objectMapper.registerModule(new JsonOrgModule());
  objectMapper.registerModule(new JavaTimeModule());

  // Configure the MessageConverter
  MappingJackson2HttpMessageConverter converter = new MappingJackson2HttpMessageConverter();
  converter.setObjectMapper(objectMapper);

  // Configure the RestTemplate; our converter goes first, so it takes precedence
  RestTemplate restTemplate = new RestTemplate();
  restTemplate.getMessageConverters().add(0, converter);

  return restTemplate;
}

Maven dependencies

The artifact jackson-module-jaxb-annotations contains the JaxbAnnotationIntrospector that we used to serialize the XML model. The jackson-datatype-json-org artifact contains the JsonOrgModule that we used to serialize the JSONObjects. These dependencies need to be added, in addition to the regular Jackson dependencies:

<properties>
  <jackson.version>2.12.1</jackson.version>
</properties>

<dependencies>
  <dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-jaxb-annotations</artifactId>
    <version>${jackson.version}</version>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.datatype</groupId>
    <artifactId>jackson-datatype-json-org</artifactId>
    <version>${jackson.version}</version>
  </dependency>
</dependencies>

Rich Domain vs. Structs & Services

When joining a project, you have to work with the code that you’re given. It’s only when you start to understand the project that you can start to change the way the problems are being solved. Most projects that I’ve joined use an object oriented language, but not all projects use object oriented principles. In order to be able to discuss this issue, I’ve started to call it “Rich Domain vs Structs & Services”. Both are ways of solving problems, or of implementing the functionality that is needed. Both have their advantages, but I think that the Rich Domain is usually preferable.

Structs & Services

Structs refer to (complex) data structures without any logic. The most common use is in C, which is a procedural language. The struct was a step towards object orientation, but it’s not a class, since it doesn’t contain any logic. Operations on the data are handled in procedures. These procedures are replicated in object oriented services with (usually static) functions.

Rich Domain

A Rich Domain uses classes with fields and methods that operate on those fields. All domain logic is captured in those methods. Services are only used to interact with other systems or dependencies. These services are not part of the core domain, since the core domain should not have any dependencies.

The discussion

In my opinion, using Structs & Services in an object oriented language is a mistake. Not a big mistake, because you can still create clear, working programs using that pattern. Separating data and logic is more intuitive, and therefore easier to build. But when the subject matter gets more complex, a robust and flexible domain pays off. According to Domain Driven Design, the insight that is encoded into the domain will help evolve the domain to be more valuable. Another argument against Structs & Services is that you end up implementing a single change in multiple locations: if a struct gets another field, every service that operates on the struct has to be updated to make use of that field. This makes things more complicated than needed.
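To make the contrast concrete, here’s a minimal sketch; the Account example is hypothetical, not taken from any particular project.

import java.math.BigDecimal;

// Structs & Services: the data is dumb, the rules live somewhere else
class AccountStruct {
    public BigDecimal balance = BigDecimal.ZERO;
}

class AccountService {
    // every service touching the struct has to know and re-check this rule
    public static void withdraw(AccountStruct account, BigDecimal amount) {
        if (account.balance.compareTo(amount) < 0) {
            throw new IllegalArgumentException("insufficient funds");
        }
        account.balance = account.balance.subtract(amount);
    }
}

// Rich Domain: the rule lives with the data it protects
class Account {
    private BigDecimal balance;

    Account(BigDecimal openingBalance) {
        this.balance = openingBalance;
    }

    void withdraw(BigDecimal amount) {
        if (balance.compareTo(amount) < 0) {
            throw new IllegalArgumentException("insufficient funds");
        }
        balance = balance.subtract(amount);
    }
}

public class RichDomainVsStructs {
    public static void main(String[] args) {
        // Structs & Services style: callers can also bypass the service entirely
        AccountStruct struct = new AccountStruct();
        struct.balance = new BigDecimal("100");
        AccountService.withdraw(struct, new BigDecimal("40"));

        // Rich Domain style: the invariant cannot be bypassed
        Account account = new Account(new BigDecimal("100"));
        account.withdraw(new BigDecimal("40"));
    }
}

When a new rule about withdrawals comes along, the rich Account has one obvious place for it; with the struct, you have to find every service that touches the balance.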

Boxes

When I started university, I told myself to save all my projects. That way I could create a library of useful bits that I could use in later projects.

That was nearly a quarter of a century ago. Since then I’ve collected a projects folder of about 20 GB. This contains lots of duplicates, special purpose projects, and code that just doesn’t run anymore. (Software rot is real!)

Over the past couple of weeks, I started to go through my projects folder. And I found this little game, apparently written in October/November 2012, which I had forgotten about until now. It’s written in Java 1.6, but still runs on Java 15.

The controls are simple: use the right and left arrow to move the blocks. Esc to quit. Don’t let the blocks pile up too high. You get points for three or more connecting blocks (diagonal doesn’t count).

Here’s the GitHub repository.

Java XML API

Who uses XML in 2021? We all use JSON these days, don’t we? Well, it turns out XML is still being used. These code fragments could help get you up to speed when you’re new to the Java XML API.

Create an empty XML document

To start from scratch, you’ll need to create an empty document:


Document doc = DocumentBuilderFactory
                .newInstance()
                .newDocumentBuilder()
                .newDocument();

Create document from existing file

To load an XML file, use the following fragment.


File source = new File("/location/of.xml");
Document doc = DocumentBuilderFactory
                    .newInstance()
                    .newDocumentBuilder()
                    .parse(source);

Add a new element

Now that we have the document, it’s time to add some elements and attributes. Note that Document and Element are both Nodes.


// "node" is an existing Node: the Document itself, or an Element within it
final Element newNode = doc.createElement("child");
newNode.setAttribute("attributeName", "attributeValue");

node.appendChild(newNode);

Get nodes matching XPath expression

XPath is a powerful way to search your XML document. Every XPath expression can match multiple nodes, or it can match none. Here’s how to use it in Java:


final String xPathExpression = "//node";
final XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList) xpath.evaluate(xPathExpression, 
        doc, 
        XPathConstants.NODESET);

JSON to XML

JSON is simpler and much more widely used these days. However, it has fewer features than XML: namespaces and attributes are missing, for example. Usually these aren’t needed, but they can be useful. To convert JSON to XML, you could use the org.json:json dependency. This will create an XML structure similar to the input JSON.

<properties>
	<org-json.version>20201115</org-json.version>
</properties>

<dependencies>
	<dependency>
		<groupId>org.json</groupId>
		<artifactId>json</artifactId>
		<version>${org-json.version}</version>
	</dependency>
</dependencies>

JSONObject content = new JSONObject(json);
String xmlFragment = XML.toString(content);
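For example, with some made-up JSON (element order in the output is not guaranteed, since JSON objects are unordered):

JSONObject example = new JSONObject("{\"person\":{\"name\":\"Jane\",\"age\":30}}");
String fragment = XML.toString(example);
// fragment: <person><name>Jane</name><age>30</age></person>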

Writing XML

When we’re done manipulating the DOM, it’s time to write the XML to a file or to a String. The following fragments do the trick:


private void writeToConsole(final Document doc) 
    throws TransformerException{
	final StringWriter writer = new StringWriter();
	writeToTarget(doc, new StreamResult(writer));
	System.out.println(writer.toString());
}

private void writeToFile(final Document doc, File target) 
    throws TransformerException, IOException{
	try (final FileWriter fileWriter = new FileWriter(target)) {
		final StreamResult streamResult = new StreamResult(fileWriter);
		writeToTarget(doc, streamResult);
	}
}

private void writeToTarget(final Document doc, final StreamResult target) 
    throws TransformerException {
	final Transformer xformer = TransformerFactory.newInstance()
              .newTransformer();
	xformer.transform(new DOMSource(doc), target);
}

Can you solve this puzzle?

Every once in a while, I come across this little puzzle. I have no idea who made it, or what the intended answer is.

Usually there are a lot of people giving the same answer. And although that answer is reasonable, I don’t think it’s correct. When we modify the calculation to (x * y) + x, we get answers matching the puzzle. But that requires modifying the calculation, and I don’t think that’s needed.

Another solution would be to take the previous answer, and add the current sum.


There’s an old joke among nerds that goes like this: “There are 10 kinds of people in the world: those who understand binary, and those who don’t.”

There’s a hint in that joke that points to a different answer to the puzzle. You can write numbers in different ways. In computer science we use a couple of different number systems, like binary, or Base 2, which deals with the digits 0 and 1. That’s not the only system we use. There’s also octal (Base 8, digits 0 to 7) and hexadecimal (Base 16, digits 0 to 9 and A to F). And of course we still use decimal.

That joke about 10 kinds of people? We can do better: “There are 10 kinds of people in the world: those who understand binary, those who don’t, and those who didn’t expect a base 3 joke.”


If we now go back to the puzzle, taking into account these number systems, we can arrive at a solution that doesn’t need to modify the calculation. We only need to modify the representation of the answer.
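Java can even do this conversion for us; Integer.toString takes a radix as its second argument:

// 13 written in base 3
System.out.println(Integer.toString(13, 3)); // prints 111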

And there we have it: the answer is 13, which is written in Base 3 as 111. There’s no need to modify the calculation. All we needed to do is change the representation of the answer.

Leveraging Lucene

Imagine a catalog of a few hundred thousand items. These items have been labeled into a few hundred categories. Each item can be linked to up to three categories. New categories need to be added in order to make things easier to find. However, categories without content are useless. So some content needs to be linked to the new categories. Luckily both the items and the categories have a description, and that makes things easier.

The idea is simple:

  • Put all items into a searchable index.
  • For each category, find out what the most important words are.
  • Create a search term using these most important words, by just sticking them together.
  • Search the index of items for best matches.

And we’re done, sort of. It’s a bit more complicated than that, but that’s mostly because of finetuning.

Building the searchable index

This is where Apache Lucene comes in. Lucene is an open source full text indexing and search library, supported by the Apache Software Foundation.
First released in 1999, it is still in active development.

To create an index, you need to create an IndexWriter, and use it to add Documents to the index.

public IndexWriter createIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.open(Paths.get(INDEX_DIR));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    return new IndexWriter(indexDir, iwc);
}

Only one writer is needed for adding items to the index. Note that IndexWriterConfig.OpenMode.CREATE will create a new index.
If there is anything already in the index, it will be removed. IndexWriterConfig.OpenMode.CREATE_OR_APPEND could be used if you want to add to an existing index.

Next up is actually adding things to the index. Each item in the index is called a Document. Documents have Fields that can be used for searching.

Creating and adding a document can be done like this:

try {
	Document document = new Document();
	document.add(new StringField("url", "http://ghyze.nl", Field.Store.YES));
	document.add(new TextField("title", "My awesome blog",  Field.Store.YES));
	document.add(new TextField("description", "Blogging about things",  Field.Store.YES));

	writer.addDocument(document);
} catch (IOException e){
	e.printStackTrace();
}

When we’re done with adding the documents, we need to close the IndexWriter:

writer.close();

Find important words

Let’s first define what an “important word” is. Important words are words that are the most relevant for each document in the collection.
There is a nice algorithm to determine the relevance of each word: tf-idf (short for term frequency–inverse document frequency).

The premise of this algorithm is that a word is more relevant for a document the more it appears in that document, and less relevant for a single document when it appears in more documents.

  • For each document, we count how many times each word appears and we divide that by the total number of words in this document. This is the term-frequency part.
  • For each word in every document, take the total number of documents and divide it by the number of documents containing this word. We don’t want this number to be too large, so we take the log of this. This is the inverse document frequency part.
  • Multiply these numbers to get the relative importance of each word for every document. A higher score means the word is more important (see the formulas below).
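In formula form, with $f_{t,d}$ the number of times word $t$ appears in document $d$, $N$ the total number of documents, and $n_t$ the number of documents containing $t$:

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad \text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$$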

The code for this algorithm consists of two classes and a value object.

The first class represents a single document in the collection. It is responsible for calculating the importance of the words it contains, in relation to the words of all other documents.

import lombok.Getter;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Document {

    /**
     * The identifier for this document
     */
    @Getter
    private final String id;

    /**
     * The complete text for this document
     */
    @Getter
    private final Collection<String> lines;

    /**
     * Every word in this document, with the number of times it appears
     */
    private Map<String, Integer> wordsInDocument;

    /**
     * Constructor
     */
    public Document(String id, Collection<String> lines){
        this.id = id;
        this.lines = lines;
    }

    /**
     * Get the map with the unique words in this document, and the number of times they appear
     */
    public Map<String, Integer> getWordsInDocument(){
        if (wordsInDocument == null){
            calculateWordMap();
        }
        return wordsInDocument;
    }

    /**
     * Calculate the number of times each unique word appears in this document
     */
    private void calculateWordMap(){
        wordsInDocument = new HashMap<>();
        for (String line : lines){
            String[] words = line.split("\\s");
            for (String word : words){
                if (word.trim().length() > 1) {
                    Integer wordCount = wordsInDocument.getOrDefault(word, Integer.valueOf(0));
                    wordsInDocument.put(word, wordCount+1);
                }
            }
        }
    }

    /**
     * Calculate the importance of each word, compared to other words in this document
     * and all other documents in the index
     * @param index The collection of documents that also contains this document.
     * @return An ordered list indicating the importance of each word in this document.
     */
    public List<WordImportance> calculateWordImportance(WordIndex index){
        List<WordImportance> wordImportance = new ArrayList<>();
        double totalWordsInDocument = getWordsInDocument().values().stream().mapToInt(Integer::intValue).sum();
        double totalNumberOfDocuments = index.getNumberOfDocuments();
        for (String word : getWordsInDocument().keySet()){
            double tf = ((double) getWordsInDocument().get(word)) / totalWordsInDocument;
            double idf = Math.log(totalNumberOfDocuments / ((double) index.getNumberOfDocumentsContaining(word)));
            wordImportance.add(new WordImportance(word, tf*idf));
        }

        // most important word first
        wordImportance.sort(Comparator.comparing(WordImportance::getImportance).reversed());

        return wordImportance;
    }
}

The next class represents the collection of all documents. It is responsible for calculating, for each word, the number of documents that contain it.

import lombok.Getter;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WordIndex {

    /**
     * The index, key is the identifier of the document
     */
    @Getter
    private Map<String, Document> index = new HashMap<>();

    /**
     * A map with all the words in all the documents, and the number of documents containing those words
     */
    private Map<String, Integer> documentCountForWords = null;

    /**
     * Constructor
     */
    public WordIndex(){
    }

    /**
     * Add a document to the index, overwriting when it already exists.
     * @param document the document to add
     */
    public void addDocument(Document document) {
        if (index.containsKey(document.getId())) {
            System.out.println("Overwriting document with ID "+ document.getId());
        }

        index.put(document.getId(), document);
    }

    /**
     * Get all words in all documents. If a word appears multiple times, it is only returned once.
     * @return All unique words in all documents
     */
    public Collection<String> getAllWords(){
        Set<String> allWords = new HashSet<>();

        index.values()
                .forEach(e -> allWords.addAll(e.getWordsInDocument().keySet()));

        return allWords;
    }

    /**
     * Get the map with number of documents per word
     * @return the map with number of documents per word
     */
    public Map<String, Integer> getDocumentCountForWords(){
        if (documentCountForWords == null){
            calculateDocumentCountForAllWords();
        }
        return documentCountForWords;
    }

    /**
     * Iterate over every word in every document, and count the number of documents that word appears in.
     */
    private void calculateDocumentCountForAllWords(){
        Collection<String> allWords = getAllWords();
        documentCountForWords = new HashMap<>();
        for (String word : allWords){
            for (String documentId : index.keySet()){
                Map<String, Integer> document = index.get(documentId).getWordsInDocument();
                if (document.containsKey(word)){
                    Integer count = documentCountForWords.getOrDefault(word, 0);
                    documentCountForWords.put(word, count+1);
                }
            }
        }
    }

    /**
     * Get the total number of documents
     * @return the total number of documents
     */
    public int getNumberOfDocuments(){
        return index.size();
    }

    /**
     * Get the number of documents this word appears in.
     * @param word The word we're interested in
     * @return The number of documents containing this word
     */
    public int getNumberOfDocumentsContaining(String word){
        Map<String, Integer> wordCount = getDocumentCountForWords();
        return wordCount.getOrDefault(word, 0);
    }
}

Then we have a value object that is used by the Document class to indicate the relative importance of each word.

import lombok.AllArgsConstructor;
import lombok.Value;

@Value
@AllArgsConstructor
public class WordImportance {
    String word;
    double importance;
}

The easy steps

Now it’s time to search for content for the new categories. To do this, we take the most important words of the new categories and string them together, separated by spaces. Then we use that search term to search the index, and extract the result.

This is what the code for performing the search would look like:

/** 
 * Prepare the search engine
 */
public void initializeSearch(){
	try {
		indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(BuildStudyIndex.INDEX_DIR)));
		searcher = new IndexSearcher(indexReader);
		analyzer = new StandardAnalyzer();

		queryParser = new MultiFieldQueryParser(new String[]{"title", "shortDescription", "longDescription"}, analyzer);
	} catch (IOException e){
		e.printStackTrace();
	}
}

/**
 * Perform search
 */
public TopDocs search(Collection<String> words) throws IOException, ParseException{
	String searchTerm = String.join(" ", words);
	Query query = queryParser.parse(searchTerm);
	TopDocs results = searcher.search(query, 200000);
	return results;
}

The code to extract the search results would be something like this:

TopDocs result = search(searchTerms);
for (ScoreDoc hit : result.scoreDocs){
	Document found = searcher.doc(hit.doc);
	double score = hit.score;
}

Conclusion

There are some steps that we needed to do that I haven’t mentioned. But those are mainly plumbing and finetuning.

We have seen how to use Apache Lucene as a custom search engine. First we’ve built a searchable index, and later we have searched that index for relevant items.
We have also seen how to implement an algorithm that determines the most relevant words in a specific document, compared to the other documents in a collection.

The reason this works is that words have meaning. I know, stating the obvious. Each word gives meaning to the text, and this meaning has varying degrees of relevance to that text. The words that are most relevant to the text distinguish the meaning of the text from the other texts in the collection. We don’t need to know the actual meaning of the text, we just need to separate it from all other texts. Then, through the magic of search engines, we can match the texts that have the most similar meanings.

Be agile, don’t do “Agile”

Over the last decade I’ve been involved in several projects that were doing Agile. Most of them were using Scrum, and most of them didn’t really deliver what was promised. The promise of Scrum is that the team delivers more value, but what it actually delivers is more reporting tools and ceremonies to keep managers busy. Functionality is also delivered, but I’ve found that the process is more of a burden than a help.
Our team at Studyportals uses something inspired by GrowthBan, which is a variation on Kanban aimed at teams that are concerned with growth. I’m not really a fan of naming things, because that leads to cargo-culting. “Hey, look! That team uses GrowthBan, and it works for them! Let’s use that too!”. And then you implement the rituals and stuff, and you find out that it may not be working for you at all.

Why does this work for us?

Team setup

First of all, we have a multi-disciplinary team with access to anybody in our organisation that might be able to help us. Our team consists of the following roles:

  • A Product Owner who decides what we should work on
  • A Marketeer who handles the marketing side of our product, like partnerships
  • A Growth Hacker who performs experiments to increase the KPIs of our product
  • A Team Lead who makes sure that the team performs optimally
  • Two frontend engineers, one of whom also has another role.
  • A backend engineer
  • Since we’re not doing scrum, we don’t have a scrummaster. We invented the role of Mastermaster instead.

Since our team is not a pure engineering team, we have more knowledge about what can be done and why we do it. This helps us to deliver the right things faster. Each of the team members has his own area of expertise, and is trusted to do the best he can.

Autonomy

What I’m used to in Scrum is that the product owner, together with business analysts and architects, decides what needs to be done. They create an epic, divide it up in stories, and describe what they would like to see as the outcome. In this way, Scrum is a bit like mini-waterfall. Engineers only have a say in how these stories are built, not in what we should work on.
Since our team setup is a bit different, we have actual experts in the domains that we are working on. Everybody can, and should, create stories for work that they think is important. In the end it’s the product owner that decides on the priority, but he can be influenced.

Rhythm

While on projects that use Scrum, I was used to a rhythm of two to four weeks. We now have a rhythm of one week: we have a week planning on Monday morning, and a short retrospective on Thursday. Fridays are always a strange day here, since we have hack-days. During our week planning session on Monday, we look back on the previous week, and determine the focus for the upcoming week. What we don’t do is commit to work that must be finished by the end of the week. If we’re finished early, we start on new stories. If stories are not finished (because unfortunate accidents happen), we continue next week. Every day we have a normal standup meeting, where we look at the tasks at hand. I’ve done this before, and I prefer it over the format where you tell what you did, what you’re going to do, and whether you need help. At the end of the standup meeting, we take a short look at our KPIs to see if we’re still on track.

Looking back

Retrospective is the time when you look back on the past sprint. Most of the time, this session is spent trying to make the process better and more effective. We do this too.
However, we also look at the effect of our work. Did we improve our Google ranking? Did we increase the speed of our website? Did Google pick up our content changes? Did we make a difference? Success is celebrated.
We also make a big deal about kicking off and finishing stories. Each story that we start to work on is kicked off, and the team is informed about that. Each story that is finished is burned, and celebrated.

Agile

In the end, it looks like we’re finally doing what the agile manifesto told us to do. Working together is more important than our processes (though we have them), we deliver working software (though we also document), we track what our users are doing to see if they like what we did, and when we need to change course, we do. We still have plans and things that we would like to work on.