TO DO

Name

  • Come up with a good name

  • YODA (from the image in the talk)?

  • "Answers you seek?"

Pages

HTTP Manifest

To speed up "loading" of the assets even more, the assets should be cached on the client using an HTML application cache manifest.
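
A minimal sketch of such a manifest, with placeholder asset names; the page would reference it via <html manifest="/searchapp.appcache">:

    CACHE MANIFEST
    # v1 - bump this comment to invalidate the cached assets
    CACHE:
    /css/app.css
    /js/app.js
    NETWORK:
    *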

robots.txt

The app should prevent being spidered itself by always providing an automatic robots.txt that allows only /about and maybe / (without any parameters) to be spidered.

This is somewhat ironic as the scrapers currently don't respect any robots.txt yet.
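
A sketch of what that robots.txt could look like (Allow and the $ anchor are extensions, but widely supported):

    User-agent: *
    Allow: /about
    # allowing only the bare start page needs the (non-standard) $ anchor
    Allow: /$
    Disallow: /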

Search page

Autosearch without pressing "Go"

Display entry URL + title in the autocomplete dropdown

Simple (HTML) Results page

* Search images

Result fragment / document rendering

Come up with a concept to render different mime types differently.

Ideally, this would avoid the hardcoding we use for audio/mpeg currently.

This also entails information about things that are not files. Ideally, we can render information about a "person" using a different template as well, even though a "person" does not have a mime type associated with it.

Currently, links to mails are hardcoded to use Thunderlink for Thunderbird. Lotus Notes mails will need different deep links as outlined in http://www.wissel.net/blog/d6plinks/SHWL-7PL67C.

  notes://servername/database/view/documentuniqueid

Basically, this means that for mails, we will need to store more than one "unique" ID or alternatively decide on the ID to store in the crawler.

Maybe we should store a (preferred) rendertype for items and render a subtemplate based on that rendertype. This would allow different URLs for links to Message-ID mails and Lotus Notes mails. It would still mean that we need to store more fields for email entries.
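
A possible sketch of that dispatch, assuming Dancer's template() keyword; the rendertype values and template names are made up:

    use Dancer ':syntax';

    my %template_for = (
        'audio/mpeg'     => 'result/audio',
        'message/rfc822' => 'result/mail',
        'text/x-perl'    => 'result/code',
        'person'         => 'result/person',   # a rendertype, not a MIME type
    );

    sub render_result {
        my( $item ) = @_;
        my $type = $item->{rendertype} || $item->{mime_type} || 'default';
        my $view = $template_for{ $type } || 'result/default';
        return template $view, { item => $item };
    }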

Also, Perl files, for example, should get a Perl syntax highlighter or at least a "code" view. The same should likely hold for all other (text) files whose more refined type we can recognize.

Customization

Auto-session

Refinement using the last search, if the last search happened "recently"

Basically, add the new term to the previous terms instead of doing a new search based only on the new term. Boost the new term, maybe by a factor of 2 over the old terms. Provide a link to search for only the new term instead.
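
A sketch of the boosted query via Search::Elasticsearch ($es; index and field names are made up):

    my( $new_term, $old_terms ) = ( 'invoice', 'project frobnitz 2015' );
    my $results = $es->search(
        index => 'documents',
        body  => {
            query => {
                bool => {
                    should => [
                        { match => { content => { query => $new_term, boost => 2 } } },
                        { match => { content => $old_terms } },
                    ],
                },
            },
        },
    );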

Lock down all pages according to OWASP

Just in case some malicious content gets through our (lame) filters or gets inserted by a script that doesn't properly sanitize the input, make sure we can't get rehosted in a (non-localhost) iframe and we can't run (non-localhost) Javascript.

Also consider reproxying all external resources, thus allowing absolutely no outside links at all on our pages.
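
A minimal sketch of the header lockdown as an inline Plack middleware ($search_app is a placeholder for the search application's PSGI app):

    use Plack::Builder;
    use Plack::Util;

    builder {
        enable sub {
            my $app = shift;
            sub {
                my $res = $app->(shift);   # assumes a plain (non-streaming) arrayref response
                # no framing by other sites
                Plack::Util::header_set( $res->[1], 'X-Frame-Options' => 'SAMEORIGIN' );
                # only run Javascript and load resources from ourselves
                Plack::Util::header_set( $res->[1], 'Content-Security-Policy' => "default-src 'self'" );
                $res;
            };
        };
        $search_app;
    };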

Plack

* Plack-hook/example for /search to tie up the search application into arbitrary websites
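
A sketch of how that could look with Plack::Builder ($search_app and $existing_site are placeholders for the respective PSGI apps):

    use Plack::Builder;

    builder {
        mount '/search' => $search_app;     # the search application
        mount '/'       => $existing_site;  # the surrounding website
    };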

Dancer

* ElasticSearch plugin / configuration through YAML

* Upgrade to Dancer 2

Mojolicious

* ElasticSearch plugin / configuration through YAML

Search multiple indices

The app should recognize which Elasticsearch clusters are currently available and combine their search results. For example, a work cluster should be searched in addition to the local cluster if the work network is available.

This calls for using the asynchronous API not only for searching but also for progressively enhancing the results page as new results become available.
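
A synchronous sketch of combining results from whatever clusters are reachable (host names, index and field names are made up); the progressive-enhancement variant would use Search::Elasticsearch::Async instead:

    use Search::Elasticsearch;

    my %clusters = (
        local => Search::Elasticsearch->new( nodes => ['localhost:9200'] ),
        work  => Search::Elasticsearch->new( nodes => ['es.work.example:9200'] ),
    );

    my $query = 'quarterly report';
    my @hits;
    for my $name (sort keys %clusters) {
        my $res = eval {
            $clusters{$name}->search(
                index => 'documents',
                body  => { query => { match => { content => $query } } },
            );
        } or next;    # a cluster may simply be unreachable right now
        push @hits, map { +{ cluster => $name, %$_ } } @{ $res->{hits}{hits} };
    }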

Recognizing new versions of old documents

How can we/Elasticsearch recognize similarity between two documents?

If two documents live in the same directory, the newest one should take precedence and fold the similar documents below it.
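
Elasticsearch's more_like_this query is one candidate for this; a sketch (field names are made up, and the exact parameter set varies between Elasticsearch versions):

    my $similar = $es->search(
        index => 'documents',
        body  => {
            query => {
                more_like_this => {
                    fields        => ['content'],
                    like          => [ { _index => 'documents', _id => $doc_id } ],
                    min_term_freq => 1,
                },
            },
        },
    );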

Java ES plugins

Currently better written in Perl

ES Analyzers

FS scanner

* Don't rescan/reanalyze elements that already exist in Elasticsearch

* Delete entries that don't exist in the filesystem anymore

Video data

Which module provides interesting video metadata?

Use Video::Subtitle::SRT for reading subtitle files
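
A sketch of the subtitle import; the callback field names are assumptions, check the Video::Subtitle::SRT documentation:

    use Video::Subtitle::SRT;

    my @text;
    my $srt = Video::Subtitle::SRT->new( sub {
        my( $entry ) = @_;
        # assumed keys: start_time, end_time, text
        push @text, "$entry->{start_time} $entry->{text}";
    });
    open my $fh, '<', 'movie.srt' or die $!;
    $srt->parse( $fh );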

How can we find where / on what line search results were found? If we include a magic marker (HTML comment?) at the end/start of a line, we could hide it when displaying the results to the user while still using it to orient ourselves in the document.

Audio data

* MP3s get imported but could use a nicer body rendering.

* Playback duration should be calculated (see the sketch after this list)

* Also import audio lyrics - how could these be linked to their mp3s?
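
For the playback duration, MP3::Info can provide the length; a minimal sketch (module choice is a suggestion):

    use MP3::Info;

    my $info = get_mp3info( 'song.mp3' );
    printf "%d:%02d\n", $info->{MM}, $info->{SS};   # e.g. "3:41"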

Playlist data

Playlists should get custom rendering (album art etc.)

Playlists should ideally also hotlink their contents

Test data

Consider importing a Wikipedia dump

Some other larger, mixed corpus, like http://eur-lex.europa.eu/

Use the Enron mail corpus?

Synonyms

Find out which one(s) we want:

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

At first glance, we might want Simple Expansion, but Genre Expansion also seems interesting.

We want to treat some synonyms as identical though, like 'MMSR' and its German translation 'Geldmarktstatistik'.
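
A sketch of index settings that make 'MMSR' and 'Geldmarktstatistik' interchangeable (simple expansion); the filter and analyzer names are made up:

    $es->indices->create(
        index => 'documents',
        body  => {
            settings => {
                analysis => {
                    filter => {
                        searchapp_synonyms => {
                            type     => 'synonym',
                            synonyms => [ 'mmsr, geldmarktstatistik' ],
                        },
                    },
                    analyzer => {
                        searchapp_text => {
                            tokenizer => 'standard',
                            filter    => [ 'lowercase', 'searchapp_synonyms' ],
                        },
                    },
                },
            },
        },
    );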

User Introduction

Videos

Create screencasts using http://www.openshot.org/videos/

First Start Experience

The first start should be as configuration-free as possible.

Site walk through

Use one of the fancy Javascript walk-through implementations to offer an optional walk-through of the search page and results page.

Code structure

Crawlers

Single URL submitter

Submit HTML and a URL into the index

    submit-url --url 'https://example.com' --html '<html><body>Hello World</body></html>'

    # Remind ourselves when we search for "user list" where it lives:
    submit-url --file '/etc/passwd' --html '<html><pre>machine user list password</pre></html>'

    submit-url --json '{ "url": "", "content": "", ... }'

This allows for custom handling of single entries

Detect "genre" of web page (forum, product, social, blog, ...)

Detect porn page by using the list of word pairs at https://github.com/searchdaimon/adult-words

File system crawler

Don't import hidden files by default

Have a file .search or .index that contains options, like no-index or ignore, for this folder and its subfolders.
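
A possible .search file for a folder that should be skipped entirely (the option name is just a suggestion):

    # do not index this folder or anything below it
    ignore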

DBI crawler

Show example SELECT statement

  SELECT
      product_name as title
    , 'http://productserver.internal/product/' || cast(products.id as varchar) as url
    , product_description as content
  FROM products

Lotus Notes Crawler

Repurpose https://perlmonks.org?node_id=449873 (and its replies) for better enterprise integration

Create Dancer-crawler

Skip the HTTP generation process and reuse App::Wallflower for crawling a Dancer website.
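
Something along these lines; the option names are assumptions, check the wallflower documentation:

    wallflower --application app.psgi --destination /tmp/site-snapshot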

Create tree-structure-importer

Both IMAP and file systems are basically trees (directed acyclic graphs) and far easier to crawl than the cyclic link graphs of web pages. Abstract out the crawling of a tree into a common module.

* Turn index-imap and index-filesystem into modules so they become independent of being called from an outside shell.

This also implies they become runnable directly from the web interface without an intermediate shell.

* Add attachment import to the imap crawler

Calendar crawler

CardDAV crawler

To pull in information about people you know

Xing / LinkedIn / Facebook / Google+ crawler

To pull in information about people you know

LDAP crawler

To pull in information about people you know

Metasearch

Implement metasearch across multiple ES instances

Search index structure / data structures

Elasticsearch index

Last-verified field

We want a field to store when we last visited a URL so we don't always reindex files with every run.
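
A sketch of the check during a filesystem run, assuming the field holds epoch seconds and the file path doubles as the document ID:

    my $doc = eval {
        $es->get( index => 'documents', type => 'file', id => $file );
    };
    my $last_verified = $doc ? $doc->{_source}{last_verified} : 0;
    if( (stat $file)[9] > $last_verified ) {
        reindex( $file );    # hypothetical reimport routine
    }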

Index maintenance

Autocompletion

Autocompletion needs to associate keywords with documents. These could come from a local .searchapp file or, better, be stored per-URL / per-document in an SQLite database for easy index reconstruction.

This needs close correlation with synonyms, which also could be (filesystem-) local for a (shared) folder or (user-)global in an SQLite database.
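
A possible SQLite layout for both (table and column names are just a suggestion):

    CREATE TABLE keyword (
        url     TEXT NOT NULL,
        keyword TEXT NOT NULL,
        PRIMARY KEY (url, keyword)
    );

    CREATE TABLE synonym (
        term    TEXT NOT NULL,
        synonym TEXT NOT NULL,
        PRIMARY KEY (term, synonym)
    );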

Crawl queue(s)

We want queues in which we store URLs to be crawled, to allow for asynchronous submission of new items. This also allows crawling to be rate-limited and restartable.

This could be an SQLite database, or just a flat text file if we have a way to store the last position within that text file.
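
A sketch of the SQLite variant (column names are just a suggestion):

    CREATE TABLE crawl_queue (
        url      TEXT PRIMARY KEY,
        added    INTEGER NOT NULL,          -- epoch seconds
        last_try INTEGER,
        status   TEXT DEFAULT 'pending'     -- pending / done / failed
    );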

SQL-index into filesystem

Is there any use in reviving FFRIndex?

System integration

Automatically (re)scan resources by using a notification mechanism like the following to learn about new or changed resources.

Resource modification

Filesystem watchers (see the sketch after this list)

RSS scanner

Google Sitemap scanner
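
For the filesystem watchers, a minimal sketch using File::ChangeNotify (directories, filter and the queue submission are made up):

    use File::ChangeNotify;

    my $watcher = File::ChangeNotify->instantiate_watcher(
        directories => [ "$ENV{HOME}/Documents" ],
        filter      => qr/\.(?:txt|pdf|mp3)$/,
    );

    while ( my @events = $watcher->wait_for_events ) {
        for my $event (@events) {
            enqueue_url( 'file://' . $event->path );   # hypothetical crawl-queue submission
        }
    }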

Hibiscus importer

This would immediately make all money transactions from Hibiscus available for searching.

Can Hibiscus directly show a single transaction from the outside?

Interesting additional datasets

Open movie database http://omdbapi.com/ - has dumps available

Discogs data dumps - http://data.discogs.com/

Automatic search should be triggered for incoming phone calls. This allows us to automatically show relevant emails if the caller has their phone information in their emails.

Also, the automatic search should be easily triggered by a command line program. This likely needs something like HTTP::ServerEvent to keep a channel open so the server can push new information.
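
A sketch of the command line trigger (the endpoint and parameter are assumptions):

    # notify the running app about an incoming call; the app then pushes
    # the search to connected browsers
    curl 'http://localhost:5000/trigger-search?q=%2B49%20555%20123456'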

Data portability

Data portability is very important, not least because of seamless index upgrades/rollbacks/backups.

Export

Export index to DBI

Update indices from database

Share indices

Sharing indices would also be nice, in the sense of websites or people offering ready-made datasets.

DBI connectivity

How can we get DBI and Promises to work nicely together?

Schema migration/update via DBI

DBI import queue

New items to be imported into Elasticsearch could be stored in and read from a DBI table. This would allow a more widely distributed set of crawlers to feed Elasticsearch through DBI.
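
A sketch of draining such a queue into Elasticsearch; table and column names are assumptions:

    use DBI;
    use Search::Elasticsearch;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=import-queue.sqlite', '', '',
                            { RaiseError => 1 } );
    my $es  = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

    my $rows = $dbh->selectall_arrayref(
        'SELECT id, url, title, content FROM import_queue WHERE status = ?',
        { Slice => {} },
        'pending',
    );

    for my $row ( @$rows ) {
        $es->index(
            index => 'documents',
            type  => 'file',
            id    => $row->{url},
            body  => { title => $row->{title}, content => $row->{content} },
        );
        $dbh->do( 'UPDATE import_queue SET status = ? WHERE id = ?',
                  undef, 'done', $row->{id} );
    }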

Index/query quality maintenance

To improve search results, a log of "failed" queries should be kept and the user should be offered manual correction of the failed queries.

top 10 failed queries

If a query had no results at all, the user should/could suggest some synonyms or even documents to use instead

top 10 low-score queries

If a query had only low-score results/documents, the results are also a candidate for manual improvement. How can we determine a low score?

top 10 abandoned queries

How will we determine if a query/word was abandoned?

Keep track of clickthrough

We should keep (server-side) track of click-throughs to actually find out which files/documents are viewed and rank those higher

Also, we should have an "unrank this" link to give the user an easy way to make the engine forget misclicked "ranked" items in the results.