Max Maischein

TO DO

Name

* Come up with a good name

* YODA (from the image in the talk)?

* "Answers you seek?"

Pages

Grid / Layout

The pages should use Dancer::Layout::Bootstrap, or at least an upgraded version of that.

HTTP Manifest

To speed up "loading" of the assets even more, the assets should be stored on the client via an HTTP / HTML manifest.

robots.txt

The app should keep itself from being spidered by always serving an automatically generated robots.txt that allows only /about, and maybe / without any parameters, to be spidered.

This is somewhat ironic, as the scrapers currently don't respect any robots.txt.
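
A minimal sketch of such a generated robots.txt, in Python for brevity. It assumes crawlers honour the "$" end-of-path anchor described in RFC 9309; older crawlers may ignore it. The function name robots_txt is hypothetical.

```python
# Sketch only: allow /about and the bare / (no parameters), deny the rest.
# Assumption: the crawler understands "$" (end of URL) as in RFC 9309.
ROBOTS_LINES = [
    "User-agent: *",
    "Allow: /about",
    "Allow: /$",      # matches "/" exactly, not "/?q=..."
    "Disallow: /",
]

def robots_txt():
    """Return the body to serve for GET /robots.txt."""
    return "\n".join(ROBOTS_LINES) + "\n"
```

Note that "Allow: /about" is a prefix match, so /about?foo=bar would be allowed as well.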

Search page

* Autocomplete/recommend

http://twitter.github.io/typeahead.js/

Simple (HTML) Results page

* Search images

* Strip out or silence even more HTML tags (H1 etc.)

Result fragment / document rendering

Come up with a concept to render different mime types differently.

Ideally, this would avoid the hardcoding we use for audio/mpeg currently.

This also entails information about things that are not files. Ideally, we can render information about a "person" using a different template as well, even though a "person" does not have a mime type associated with it.
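
One possible concept is a lookup table from a document's type to a template, with a fallback chain. The sketch below is in Python; all template names and the "kind" field for non-file entities are assumptions for illustration, not existing code.

```python
# Sketch only: template names and the "kind" field are hypothetical.
TEMPLATES = {
    "audio/mpeg": "audio.tt",
    "audio/*":    "audio.tt",
    "text/html":  "html.tt",
    "person":     "person.tt",   # an entity kind, not a mime type
}
DEFAULT_TEMPLATE = "generic.tt"

def pick_template(doc):
    """Return the template name for a result document (a dict)."""
    # Entities such as people carry a "kind" instead of a mime type.
    key = doc.get("kind") or doc.get("mime_type") or ""
    if key in TEMPLATES:
        return TEMPLATES[key]
    # Fall back to the major type, e.g. "audio/*" for "audio/ogg".
    major = key.split("/", 1)[0] + "/*"
    return TEMPLATES.get(major, DEFAULT_TEMPLATE)
```

New types then only need a table entry and a template, not another hardcoded branch.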

Customization

Auto-session

Refinement using the last search, if the last search was recent

Basically, add the new term to the last terms instead of doing a new search based only on the new term. Usually, boost the new term, maybe by factor 2 over the old terms. Provide a link to only search for the new term instead.
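
A rough Python sketch of the refinement, producing a string for Elasticsearch's query_string syntax, where "term^2" boosts a term. The five-minute window for "recent" and the function name are assumptions.

```python
import time

RECENT_SECONDS = 300  # assumption: "recent" means within five minutes
BOOST = 2             # boost factor for the new term, as suggested above

def refine_query(new_term, last_terms, last_time, now=None):
    """Combine the new term with the last terms if the last search is recent."""
    now = time.time() if now is None else now
    if not last_terms or now - last_time > RECENT_SECONDS:
        return new_term
    # Group multi-word input so the boost applies to all of it.
    boosted = "(%s)" % new_term if " " in new_term else new_term
    return " ".join(list(last_terms) + ["%s^%d" % (boosted, BOOST)])
```

The "search only for the new term" link would simply pass new_term through unchanged.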

Plack

* Plack-hook/example for /search to tie the search application into arbitrary websites

Dancer

* Elasticsearch plugin / configuration through YAML

* Upgrade to Dancer2

Mojolicious

* Elasticsearch plugin / configuration through YAML

Search multiple indices

Having different Elasticsearch clusters available (or not) should be recognized and the search results should be combined. For example, a work cluster should be searched in addition to the local cluster, if the work network is available.

This calls for using the asynchronous API not only for searching but also for progressively enhancing the results page as new results become available.
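
The idea can be sketched with Python's asyncio: query all clusters concurrently, skip those that fail or time out, and hand the merged list to a callback each time a cluster answers. The backend callables, the on_results callback and the "score" field are assumptions; scores from different clusters are not necessarily comparable.

```python
import asyncio

async def search_all(backends, query, on_results, timeout=2.0):
    """backends: dict of name -> async callable(query) returning a list of hits."""
    async def run(name, search):
        try:
            return name, await asyncio.wait_for(search(query), timeout)
        except Exception:
            # Cluster unreachable or too slow: treat it as "not available".
            return name, None

    combined = []
    tasks = [asyncio.ensure_future(run(n, s)) for n, s in backends.items()]
    for future in asyncio.as_completed(tasks):
        name, hits = await future
        if hits is None:
            continue
        combined.extend(hits)
        combined.sort(key=lambda h: h["score"], reverse=True)
        on_results(name, list(combined))  # e.g. push an update to the page
    return combined
```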

Recognizing new versions of old documents

How can we/Elasticsearch recognize similarity between two documents?

If two documents live in the same directory, the newest one should take precedence and fold the similar documents below it.

Java ES plugins

Currently better written in Perl

ES Analyzers

FS scanner

* Don't rescan/reanalyze elements that already exist in Elasticsearch

* Delete entries that don't exist in the filesystem anymore
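
Both points amount to comparing two listings. A small Python sketch, assuming we can get path and modification time from the filesystem and path and index time from Elasticsearch; the function name is hypothetical.

```python
def plan_sync(on_disk, indexed):
    """Both arguments: dict of path -> timestamp (epoch seconds).

    Returns (paths to (re)index, paths to delete from the index).
    """
    # Index what is new, or changed since it was last indexed.
    to_index = sorted(p for p, mtime in on_disk.items()
                      if p not in indexed or indexed[p] < mtime)
    # Delete what is in the index but gone from the filesystem.
    to_delete = sorted(p for p in indexed if p not in on_disk)
    return to_index, to_delete
```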

Video data

Which module provides interesting video metadata?

Use Video::Subtitle::SRT for reading subtitle files

How can we find where / on what line search results were found? If we include a magic marker (HTML comment?) at the end/start of a line, we could hide it when displaying the results to the user while still using it to orient ourselves in the document.
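
A Python sketch of the marker idea: append an HTML comment with the line number to each line before indexing, then read and strip the markers from a returned fragment. The marker format is an assumption, and whether the markers survive analysis and highlighting in Elasticsearch is untested.

```python
import re

MARKER = re.compile(r"<!--L(\d+)-->")  # hypothetical marker format

def add_markers(text):
    """Append an invisible line-number marker to every line."""
    return "\n".join("%s<!--L%d-->" % (line, n)
                     for n, line in enumerate(text.split("\n"), 1))

def locate(fragment):
    """Return (line numbers found in fragment, fragment without markers)."""
    lines = [int(n) for n in MARKER.findall(fragment)]
    return lines, MARKER.sub("", fragment)
```

A fragment that ends mid-line carries no marker for that last line; its number is the last found number plus one.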

Audio data

* MP3s get imported but could use a nicer body rendering.

* Playback duration should be calculated

* Also import audio lyrics - how could these be linked to their MP3s?

Playlist data

Playlists should get custom rendering (album art etc.)

Playlists should ideally also hotlink their contents

Test data

* Consider importing a Wikipedia dump

* Some other larger, mixed corpus, like http://eur-lex.europa.eu/

Synonyms

Find out which one(s) we want:

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

At first glance, we might want Simple Expansion, but Genre Expansion also seems interesting.

We want to treat some synonyms as identical though, like 'MMSR' and its German translation 'Geldmarktstatistik'.

User Introduction

Videos

Create screencasts using http://www.openshot.org/videos/

First Start Experience

The first start should be as configuration-free as possible.

Site walk through

Use one of the fancy JavaScript walk-through implementations to offer an optional walk-through of the search page and results page.

Code structure

Crawlers

File system crawler

Don't import hidden files by default

Have a file .search or .index which contains options, like no-index or ignore for this folder and its subfolders.
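
A sketch of how a crawler could honour both rules, in Python. The control file format (one option per line) is an assumption, since the format is not decided yet.

```python
import os

CONTROL_FILES = (".search", ".index")
SKIP_OPTIONS = {"no-index", "ignore"}

def crawl(root):
    """Yield the paths of files to index below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        options = set()
        for name in CONTROL_FILES:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                with open(path) as fh:
                    # Assumed format: one option per line.
                    options.update(l.strip() for l in fh if l.strip())
        if options & SKIP_OPTIONS:
            dirnames[:] = []  # also skip all subfolders
            continue
        # Do not descend into hidden folders, do not import hidden files.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if not name.startswith("."):
                yield os.path.join(dirpath, name)
```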

DBI crawler

Show example SELECT statement

  SELECT
      product_name as title
    , 'http://productserver.internal/product/' || cast(products.id as varchar) as url
    , product_description as content
  FROM products
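
How such a statement could feed the indexer, sketched in Python with the standard sqlite3 module standing in for DBI. The column aliases title, url and content become the document fields; the function name and the table layout are assumptions.

```python
import sqlite3

# Same shape as the statement above, in SQLite's dialect.
QUERY = """
SELECT
    product_name AS title
  , 'http://productserver.internal/product/' || CAST(id AS TEXT) AS url
  , product_description AS content
FROM products
"""

def crawl_rows(conn, query):
    """Yield one dict per row, keyed by the column aliases of the query."""
    cur = conn.execute(query)
    columns = [c[0] for c in cur.description]
    for row in cur:
        yield dict(zip(columns, row))
```

Each yielded dict can then be submitted to Elasticsearch as one document.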

Create Dancer-crawler

Skip the HTTP generation process and reuse App::Wallflower for crawling a Dancer website.

Create tree-structure-importer

Both IMAP and file systems are basically trees (directed graphs without cycles) and far easier to crawl than the cyclic graphs of web pages. Abstract out the crawling of a tree into a common module.
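
The common part could be as small as this Python sketch: the walker only needs to know how to list the children of a node and what to do with each node. The names walk_tree, children and visit are hypothetical.

```python
def walk_tree(root, children, visit):
    """Depth-first walk of a tree.

    children(node) returns the child nodes (folders, mailboxes, files),
    visit(node) imports a single node. No "seen" set is needed because
    a tree has no cycles.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        visit(node)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(list(children(node))))
```

A filesystem crawler would pass a directory listing as children, an IMAP crawler a mailbox listing.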

* Turn index-imap and index-filesystem into modules so they become independent of being called from an outside shell.

This also implies they become runnable directly from the web interface without an intermediate shell.

* Add attachment import to the imap crawler

Calendar crawler

CardDAV crawler

To pull in information about people you know

Xing / LinkedIn / Facebook / Google+ crawler

To pull in information about people you know

LDAP crawler

To pull in information about people you know

Metasearch

Implement metasearch across multiple ES instances

Search index structure / data structures

Elasticsearch index

Last-verified field

We want a field to store when we last visited a URL so we don't always reindex files with every run.

Crawl queue(s)

We want to have queues in which we store URLs to be crawled to allow for asynchronous submission of new items. This also allows us to be rate limited and restartable.

This could be an SQLite database, or just a flat text file if we have a way to store the last position within that text file.
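
The flat text file variant, sketched in Python: URLs are appended to the queue file, and the byte offset of the next unread line is kept in a side file so a restarted crawler continues where it stopped. The class name and the ".pos" side file are assumptions; rate limiting is left out.

```python
import os

class FileQueue:
    def __init__(self, path):
        self.path = path
        self.pos_path = path + ".pos"    # hypothetical side file
        open(self.path, "ab").close()    # create the queue if missing

    def submit(self, url):
        """Append a URL; can be called from any process at any time."""
        with open(self.path, "ab") as fh:
            fh.write(url.encode("utf-8") + b"\n")

    def _position(self):
        try:
            with open(self.pos_path) as fh:
                return int(fh.read() or 0)
        except FileNotFoundError:
            return 0

    def next(self):
        """Return the next URL, or None if the queue is drained."""
        with open(self.path, "rb") as fh:
            fh.seek(self._position())
            line = fh.readline()
            if not line.endswith(b"\n"):
                return None              # empty, or a half-written line
            new_pos = fh.tell()
        tmp = self.pos_path + ".tmp"
        with open(tmp, "w") as fh:
            fh.write(str(new_pos))
        os.replace(tmp, self.pos_path)   # atomic update of the position
        return line.decode("utf-8").rstrip("\n")
```

The position is advanced before the URL is crawled, so a crash loses at most that one URL; advancing it after the crawl would retry the URL instead.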

SQL-index into filesystem

Is there any use in reviving FFRIndex?

System integration

Automatically (re)scan resources by using one of the following notification methods to learn about new or changed resources.

Resource modification

Filesystem watchers

RSS scanner

Google Sitemap scanner

Automatic search should be triggered for incoming phone calls. This allows relevant emails to be shown automatically when their sender calls, provided the emails contain the sender's phone information.

Also, the automatic search should be easily triggered by a command line program. This likely needs something like HTTP::ServerEvent to keep a channel open so the server can push new information.

Data portability

Data portability is very important, not least because of seamless index upgrades/rollbacks/backups.

Export

Export index to DBI

Update indices from database

Share indices

Sharing indices would also be nice in the sense of websites or people offering datasets

DBI connectivity

How can we get DBI and Promises to work nicely together?

Schema migration/update via DBI

DBI import queue

New items to be imported into Elasticsearch could be stored in/read from a DBI table. This would allow for a wider distributed set of crawlers feeding through DBI to Elasticsearch.

Index/query quality maintenance

To improve search results, a log of "failed" queries should be kept and the user should be offered manual correction of the failed queries.

top 10 failed queries

If a query had no results at all, the user should/could suggest some synonyms or even documents to use instead
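
A Python sketch of the report, assuming the log stores each query together with its number of hits. Queries are lowercased and trimmed so trivial variants count as one.

```python
from collections import Counter

def top_failed(log, n=10):
    """log: iterable of (query, hit_count) pairs.

    Returns the n most frequent queries without any results,
    as (query, count) pairs.
    """
    failed = Counter(query.strip().lower()
                     for query, hits in log if hits == 0)
    return failed.most_common(n)
```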

top 10 low-score queries

If a query had only low-score results/documents, the results are also a candidate for manual improvement. How can we determine a low score?

top 10 abandoned queries

How will we determine if a query/word was abandoned?