Lucene 8: Query Time Joining with JoinUtil

January 15, 2020 / January 24, 2020

An example of joining at query time

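This is an example of performing a JOIN at query time in Lucene. Unlike other examples found on the web, this one lets you specify search conditions on both of the tables being joined (I use RDB terminology here because it is easier to follow).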

The example below uses 8.4.1, the latest version at the time of writing, but it behaved exactly the same on 8.2.0 (which is what I currently use).

  api group: 'org.apache.lucene', name: 'lucene-analyzers-common', version: '8.4.1'
  api group: 'org.apache.lucene', name: 'lucene-core', version: '8.4.1'
  api group: 'org.apache.lucene', name: 'lucene-join', version: '8.4.1'

import static java.util.stream.Collectors.*;
import static org.junit.Assert.*;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.join.*;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;
import org.apache.lucene.util.packed.*;
import org.junit.*;

public class QueryTimeJoinTest {

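  // Tokenizer that accepts every character as a token character,
  // so a field value is not split at whitespace or punctuation.
  public static class NoTokenizer extends CharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
      return true;
    }
  }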

  public static class NoAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new NoTokenizer();      
      return new TokenStreamComponents(tokenizer);
    }
  }

  @Test
  public void test() throws Exception {

    boolean forceSingleSegment = false;

    // Any name will do, but products and attributes must of course use the same one.
    final String joinField = "id" + "productId";

    Directory dir = new RAMDirectory();

    IndexWriterConfig config = new IndexWriterConfig(
        new NoAnalyzer()
    );
    IndexWriter indexWriter = new IndexWriter(dir, config);

    // Product definitions (カメラ = camera, バッグ = bag)
    Document doc = new Document();
    doc.add(new TextField("id", "1", Field.Store.YES));
    doc.add(new TextField("name", "カメラ", Field.Store.YES));
    // This value "1" can apparently be the same as the product ID
    doc.add(new SortedDocValuesField(joinField, new BytesRef("1")));
    indexWriter.addDocument(doc);

    doc = new Document();
    doc.add(new TextField("id", "2", Field.Store.YES));
    doc.add(new TextField("name", "バッグ", Field.Store.YES));
    doc.add(new SortedDocValuesField(joinField, new BytesRef("2")));
    indexWriter.addDocument(doc);
    indexWriter.commit();

    // Attributes of the camera (価格 = price)
    doc = new Document();
    doc.add(new TextField("productId", "1", Field.Store.YES));
    doc.add(new TextField("type", "価格", Field.Store.YES));
    doc.add(new TextField("value", "10", Field.Store.YES));
    doc.add(new SortedDocValuesField(joinField, new BytesRef("1")));
    indexWriter.addDocument(doc);

    doc = new Document();
    doc.add(new TextField("productId", "1", Field.Store.YES));
    doc.add(new TextField("type", "価格", Field.Store.YES));
    doc.add(new TextField("value", "20", Field.Store.YES));
    doc.add(new SortedDocValuesField(joinField, new BytesRef("1")));
    indexWriter.addDocument(doc);


    // Attributes of the bag
    doc = new Document();
    doc.add(new TextField("productId", "2", Field.Store.YES));
    doc.add(new TextField("type", "価格", Field.Store.YES));
    doc.add(new TextField("value", "30", Field.Store.YES));
    doc.add(new SortedDocValuesField(joinField, new BytesRef("2")));
    indexWriter.addDocument(doc);

    doc = new Document();
    doc.add(new TextField("productId", "2", Field.Store.YES));
    doc.add(new TextField("type", "価格", Field.Store.YES));
    doc.add(new TextField("value", "40", Field.Store.YES));
    doc.add(new SortedDocValuesField(joinField, new BytesRef("2")));
    indexWriter.addDocument(doc);

    // Note: after this there is only one segment, so the OrdinalMap can be null.
    if (forceSingleSegment) {
      indexWriter.forceMerge(1);
    }

    indexWriter.close();

    IndexReader indexReader = DirectoryReader.open(dir);    
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    OrdinalMap ordinalMap = null;
    if (!forceSingleSegment) {

      // Note: the following is required when indexWriter.forceMerge(1) is not called.
      // Note: when indexWriter.forceMerge(1) is called, ordinalMap can stay null.
      SortedDocValues[] sortedDocValues = indexReader.leaves()
        .stream()
        .map(leafReaderContext->leafReaderContext.reader())
        .map(leafReader-> {
          try {
            return DocValues.getSorted(leafReader, joinField);
          } catch (IOException ex) { throw new RuntimeException(ex); }
        })
        .toArray(SortedDocValues[]::new);
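      // acceptableOverheadRatio is set to PackedInts.DEFAULT (0.25).
      // Is this something like the loadFactor of a HashMap?
      // (It is the memory overhead the packed-integer encoding may spend in exchange for speed.)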
      ordinalMap = OrdinalMap.build(null, sortedDocValues, PackedInts.DEFAULT);
    }


    // Get the price attributes of the bag
    {
      Query fromQuery = new TermQuery(new Term("name", "バッグ"));
      Query toQuery = new TermQuery(new Term("type", "価格"));

      // Search for product and return prices
      Query joinQuery = JoinUtil.createJoinQuery(
          joinField, fromQuery, toQuery, indexSearcher, ScoreMode.None,
          ordinalMap);
      TopDocs result = indexSearcher.search(joinQuery, 10);
      assertEquals(
        "[id:null,name:null,productId:2,value:30],[id:null,name:null,productId:2,value:40]",
        hitString(indexSearcher, result)
      );
    }

    // Get the price attributes of the camera
    {
      Query fromQuery = new TermQuery(new Term("name", "カメラ"));
      Query toQuery = new TermQuery(new Term("type", "価格"));
      Query joinQuery = JoinUtil.createJoinQuery(joinField, fromQuery, toQuery, indexSearcher, ScoreMode.None,
          ordinalMap);
      TopDocs result = indexSearcher.search(joinQuery, 10);
      assertEquals(
        "[id:null,name:null,productId:1,value:10],[id:null,name:null,productId:1,value:20]",
        hitString(indexSearcher, result));
    }

    // Get the camera that has a price attribute of 20
    {
      // Search for prices and return products
      Query fromQuery = new TermQuery(new Term("value", "20"));
      Query toQuery = new TermQuery(new Term("name", "カメラ"));
      Query joinQuery = JoinUtil.createJoinQuery(joinField, fromQuery, toQuery, indexSearcher, ScoreMode.None,
          ordinalMap);
      TopDocs result = indexSearcher.search(joinQuery, 10);
      assertEquals(
        "[id:1,name:カメラ,productId:null,value:null]",
        hitString(indexSearcher, result));
    }

    indexSearcher.getIndexReader().close();

    dir.close();
  }

  /**
   * Note that values can be retrieved only from Field.Store.YES fields on the "to" side.
   */
  String hitString(IndexSearcher indexSearcher, TopDocs topDocs) throws Exception  {
    return Arrays.stream(topDocs.scoreDocs)
      .map(scoreDoc-> {
        try {
          return indexSearcher.doc(scoreDoc.doc);
        } catch (Exception ex) { throw new RuntimeException(ex); }
      })
      .map(doc->
        "[id:" + doc.get("id") + ",name:" + doc.get("name") + 
          ",productId:" + doc.get("productId") + ",value:" + doc.get("value") + "]"
      )
      .sorted()
      .collect(joining(","));        
  }
}

Explanation

There are several ways to perform a JOIN in Lucene.

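1. Do not join. Merge all the required tables into a single table in advance (again, using RDB terminology), so that only a pre-joined table is ever handled. Naturally, this strategy can be very costly.
2. Use Index Time Joining. I have not investigated this much, but this strategy appears to be similar to the "do not join" one. The tables stay separate, but at index time their records have to be written "bundled together".
3. Use Query Time Joining. This is explained below.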

Query Time Joining 1

Use the following methods, which are described in the documentation for Class JoinUtil.

static Query    createJoinQuery(String fromField, boolean multipleValuesPerDocument, String toField, Class<? extends Number> numericType, Query fromQuery, IndexSearcher fromSearcher, ScoreMode scoreMode)
---- Method for query time joining for numeric fields.

static Query    createJoinQuery(String fromField, boolean multipleValuesPerDocument, String toField, Query fromQuery, IndexSearcher fromSearcher, ScoreMode scoreMode)
---- Method for query time joining.

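However, with these methods a search condition can be specified for only one of the tables, the from side. Once the from records are found, the matching to records are simply fetched.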
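For comparison, here is a minimal sketch of the second of these overloads (the non-numeric one). The class name, the helper methods, and the English sample values are my own assumptions for illustration, and the field layout is only one way to set it up. As far as I understand, the from field is read from doc values and the to field is matched through its indexed terms, so the sketch indexes "id" both ways. It needs lucene-core, lucene-analyzers-common, and lucene-join on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

final class SimpleQueryTimeJoin {

  // Product ("from" side): "id" is indexed and also stored as doc values,
  // because the join key of the from side is read from doc values.
  static Document product(String id, String name) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new SortedDocValuesField("id", new BytesRef(id)));
    doc.add(new StringField("name", name, Field.Store.YES));
    return doc;
  }

  // Price attribute ("to" side): "productId" only needs to be indexed.
  static Document price(String productId, String value) {
    Document doc = new Document();
    doc.add(new StringField("productId", productId, Field.Store.YES));
    doc.add(new StringField("value", value, Field.Store.YES));
    return doc;
  }

  // Builds a small in-memory index and returns the sorted price values
  // of the product with the given name.
  static List<String> pricesOf(String productName) {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer =
          new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
        writer.addDocument(product("1", "camera"));
        writer.addDocument(product("2", "bag"));
        writer.addDocument(price("1", "10"));
        writer.addDocument(price("1", "20"));
        writer.addDocument(price("2", "30"));
        writer.addDocument(price("2", "40"));
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Only the from side takes a query; there is no toQuery parameter.
        Query fromQuery = new TermQuery(new Term("name", productName));
        Query joinQuery = JoinUtil.createJoinQuery(
            "id", false, "productId", fromQuery, searcher, ScoreMode.None);
        List<String> values = new ArrayList<>();
        for (ScoreDoc hit : searcher.search(joinQuery, 10).scoreDocs) {
          values.add(searcher.doc(hit.doc).get("value"));
        }
        Collections.sort(values);
        return values;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(pricesOf("bag"));
  }
}
```

Note that no OrdinalMap is involved here, and the two sides use different field names ("id" and "productId") instead of one shared join field.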

Query Time Joining 2

The example shown in this article uses the following methods, which are also described in the documentation for Class JoinUtil.

static Query    createJoinQuery(String joinField, Query fromQuery, Query toQuery, IndexSearcher searcher, ScoreMode scoreMode, OrdinalMap ordinalMap)
---- Delegates to createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, OrdinalMap, int, int), but disables the min and max filtering.

static Query    createJoinQuery(String joinField, Query fromQuery, Query toQuery, IndexSearcher searcher, ScoreMode scoreMode, OrdinalMap ordinalMap, int min, int max)
---- A query time join using global ordinals over a dedicated join field.

With these methods, search conditions can be specified for both tables. Even so, only values on the to side can be retrieved. Even if a from-side field is marked Field.Store.YES, its value cannot be retrieved.
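One way to picture this behavior is the following plain-Java model. It is only a conceptual sketch and not how JoinUtil is implemented. The Doc class, the join method, and the English sample values are hypothetical names I made up for illustration. The from query contributes nothing but a set of join keys, so the documents that come back are always to-side documents.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class JoinModel {

  /** Toy document: one join key plus its stored fields. */
  static final class Doc {
    final String joinKey;
    final Map<String, String> fields = new LinkedHashMap<>();

    Doc(String joinKey, String... keyValues) {
      this.joinKey = joinKey;
      for (int i = 0; i + 1 < keyValues.length; i += 2) {
        fields.put(keyValues[i], keyValues[i + 1]);
      }
    }
  }

  static List<Doc> join(List<Doc> index, Predicate<Doc> fromQuery, Predicate<Doc> toQuery) {
    // Phase 1: run the from query and keep only the join keys of its hits.
    Set<String> keys = index.stream()
        .filter(fromQuery)
        .map(doc -> doc.joinKey)
        .collect(Collectors.toSet());
    // Phase 2: run the to query and keep the hits whose join key was collected.
    return index.stream()
        .filter(toQuery)
        .filter(doc -> keys.contains(doc.joinKey))
        .collect(Collectors.toList());
  }

  // Same shape as the sample data: two products and two prices for each.
  static List<Doc> sampleIndex() {
    return Arrays.asList(
        new Doc("1", "id", "1", "name", "camera"),
        new Doc("2", "id", "2", "name", "bag"),
        new Doc("1", "productId", "1", "type", "price", "value", "10"),
        new Doc("1", "productId", "1", "type", "price", "value", "20"),
        new Doc("2", "productId", "2", "type", "price", "value", "30"),
        new Doc("2", "productId", "2", "type", "price", "value", "40"));
  }

  public static void main(String[] args) {
    List<Doc> hits = join(
        sampleIndex(),
        doc -> "bag".equals(doc.fields.get("name")),
        doc -> "price".equals(doc.fields.get("type")));
    for (Doc hit : hits) {
      System.out.println(hit.fields);
    }
  }
}
```

The hits carry productId, type, and value, but no name: the product document that matched the from query is discarded once its join key has been collected.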

Likewise, only to-side fields can be used for sorting. Specifying a from-side field seems to be simply ignored.

Caveats

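As shown by the forceSingleSegment flag in the sample code, an OrdinalMap has to be built when the Lucene index is split into multiple segments. This is reportedly very expensive (apparently it has to process all of the segments), so, for example, it should not be rebuilt for every search.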
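The reason a map is needed at all is that each segment numbers the values of the join field with its own ordinals, so the same value can have a different ordinal in each segment. The following plain-Java sketch illustrates the idea of translating per-segment ordinals into one global ordinal space. It is a toy model under my own assumptions, not Lucene's implementation (which uses compact packed-integer structures), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

final class GlobalOrdinals {

  /**
   * Each inner list is the sorted dictionary of distinct values in one segment;
   * a value's position in that list is its segment ordinal.
   * Returns, for each segment, an array mapping segment ordinal to global ordinal.
   */
  static int[][] build(List<List<String>> segmentDictionaries) {
    // Merge the values of every segment into one sorted, de-duplicated dictionary.
    // This pass over all segments is what makes building the map expensive.
    TreeSet<String> merged = new TreeSet<>();
    for (List<String> dictionary : segmentDictionaries) {
      merged.addAll(dictionary);
    }
    List<String> global = new ArrayList<>(merged);

    int[][] segmentToGlobal = new int[segmentDictionaries.size()][];
    for (int seg = 0; seg < segmentDictionaries.size(); seg++) {
      List<String> dictionary = segmentDictionaries.get(seg);
      segmentToGlobal[seg] = new int[dictionary.size()];
      for (int ord = 0; ord < dictionary.size(); ord++) {
        // The global ordinal is the position of the value in the merged dictionary.
        segmentToGlobal[seg][ord] = Collections.binarySearch(global, dictionary.get(ord));
      }
    }
    return segmentToGlobal;
  }

  public static void main(String[] args) {
    // Segment 0 contains the join values "1" and "2"; segment 1 contains only "2".
    int[][] map = build(Arrays.asList(
        Arrays.asList("1", "2"),
        Arrays.asList("2")));
    System.out.println(Arrays.deepToString(map)); // prints [[0, 1], [1]]
  }
}
```

In this toy data the value "2" has ordinal 1 in segment 0 but ordinal 0 in segment 1, and both map to global ordinal 1. With a single segment the segment ordinals already are global, which is consistent with passing null for the OrdinalMap after forceMerge(1).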

In contrast, a single segment needs no OrdinalMap, so it may seem sufficient to always call indexWriter.forceMerge(1), but that is presumably an expensive operation as well.

Which one to choose depends on weighing the various trade-offs.


Posted by ysugimura