Come up with a good name
YODA (from the image in the talk)?
"Answers you seek?"
To speed up "loading" of the assets even more, the assets should be stored on the client as an HTTP / HTML manifest
robots.txt
The app should keep itself from being spidered by always serving an automatically generated robots.txt that allows only /about, and maybe / without any parameters, to be spidered.
This is somewhat ironic as our own scrapers don't respect any robots.txt yet.
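A minimal sketch of such a generated robots.txt, written in Python only for illustration (the application itself is a Perl/Dancer app). The function name `robots_txt` is hypothetical. The `/$` pattern relies on the end-of-path anchor from RFC 9309, which not every crawler honours, so treat it as an assumption.

```python
def robots_txt(allowed=("/about", "/$")):
    """Return a robots.txt body that allows only the given paths.

    "/$" is meant to match the bare root URL without any parameters.
    """
    lines = ["User-agent: *"]
    lines += [f"Allow: {path}" for path in allowed]
    lines.append("Disallow: /")  # everything else stays off limits
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(robots_txt(), end="")
```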
Autosearch without pressing "Go"
Display entry URL + title in the autocomplete dropdown
* Search images
Come up with a concept to render different mime types differently.
Ideally, this would avoid the hardcoding we use for audio/mpeg currently.
This also entails information about things that are not files. Ideally, we can render information about a "person" using a different template as well, even though a "person" does not have a mime type associated with it.
Currently, links to mails are hardcoded to use Thunderlink for Thunderbird. Lotus Notes mails will need different deep links as outlined in http://www.wissel.net/blog/d6plinks/SHWL-7PL67C.
notes://servername/database/view/documentuniqueid
Basically, this means that for mails, we will need to store more than one "unique" ID or alternatively decide on the ID to store in the crawler.
Maybe we should store a (preferred) rendertype for items and render a subtemplate based on that rendertype. This would allow different URLs for links to Message-ID mails and Lotus Notes mails. It would still mean that we need to store more fields for email entries.
Also, for example Perl files should get a Perl syntax highlighter or at least a "code" view. The same should likely hold for all other (text) files whose more refined type we can recognize.
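One possible shape for the rendertype idea, sketched in Python for illustration only. The template names, the `rendertype` and `mime_type` field names, and the `text/*` wildcard convention are all assumptions, not something the application defines today.

```python
# Hypothetical mapping from rendertype or mime type to a subtemplate.
TEMPLATES = {
    "audio/mpeg": "audio.tt",
    "text/x-perl": "code.tt",
    "text/*": "text.tt",     # fallback for any other text type
    "person": "person.tt",   # not a mime type, selected via rendertype
}


def choose_template(item, templates=TEMPLATES, default="default.tt"):
    """Pick a subtemplate: explicit rendertype first, then the exact
    mime type, then the major-type wildcard, then the default."""
    rendertype = item.get("rendertype")
    if rendertype in templates:
        return templates[rendertype]
    mime = item.get("mime_type") or ""
    if mime in templates:
        return templates[mime]
    major = mime.split("/", 1)[0]
    return templates.get(f"{major}/*", default)
```

With a table like this, adding a Lotus Notes mail view would mean adding one rendertype entry and one subtemplate instead of another hardcoded branch.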
Refinement using the last search, if the last search was "recently"
Basically, add the new term to the last terms instead of doing a new search based only on the new term. By default, boost the new term over the old terms, maybe by a factor of 2. Provide a link to search only for the new term instead.
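A small sketch of the refinement rule in Python, for illustration only. It builds a query string using the Lucene `^boost` syntax that Elasticsearch's query_string query accepts. The function name, the 300-second meaning of "recently" and the default boost of 2 are assumptions.

```python
import time


def refine_query(new_term, last_terms, last_search_at,
                 now=None, window=300, boost=2):
    """Combine the new term with the previous terms if the previous
    search happened within `window` seconds, boosting the new term."""
    now = time.time() if now is None else now
    if not last_terms or now - last_search_at > window:
        return new_term  # too old: plain new search
    return " ".join([f"({new_term})^{boost}"] + list(last_terms))
```

The unrefined query is simply `new_term`, which is what the "search only for the new term" link would submit.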
Just in case some malicious content gets through our (lame) filters or gets inserted by a script that doesn't properly sanitize its input, make sure we can't get rehosted in a (non-localhost) iframe and we can't run (non-localhost) JavaScript.
Also consider reproxying all external resources, thus allowing absolutely no outside links at all on our pages.
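One way to get there is with response headers. The sketch below, in Python for illustration only, returns a header set that uses the standard `X-Frame-Options` and `Content-Security-Policy` headers. The function name is hypothetical, and the assumption is that the app is served from localhost, so "same origin" and "localhost" coincide.

```python
def security_headers():
    """Headers that forbid framing by other origins and restrict
    scripts and all other resources to our own origin."""
    return {
        # Older browsers: only allow framing by our own origin.
        "X-Frame-Options": "SAMEORIGIN",
        # Current browsers: same restriction, plus same-origin-only
        # scripts and resources (which presumes reproxying).
        "Content-Security-Policy":
            "default-src 'self'; script-src 'self'; frame-ancestors 'self'",
    }
```

Note that `default-src 'self'` only works once external resources are reproxied, otherwise remote images and stylesheets in results would be blocked.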
* Plack-hook/example for /search to tie the search application into arbitrary websites
* Elasticsearch plugin / configuration through YAML
* Upgrade to Dancer 2
Having different Elasticsearch clusters available (or not) should be recognized and the search results should be combined. For example, a work cluster should be searched in addition to the local cluster, if the work network is available.
This calls for using the asynchronous API not only for searching but also for progressively enhancing the results page as new results become available.
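The flow could look roughly like the following asyncio sketch, in Python for illustration only. Each cluster is represented by a hypothetical async callable that returns a list of hits with a `score` key; unreachable or slow clusters are treated as empty. Merging by raw score assumes scores are comparable across clusters, which is not guaranteed.

```python
import asyncio


async def search_all(query, clusters, on_results, timeout=2.0):
    """Query all clusters concurrently and call on_results(name, hits)
    with the merged hits each time another cluster has answered."""

    async def run(name, search):
        try:
            hits = await asyncio.wait_for(search(query), timeout)
        except (asyncio.TimeoutError, OSError):
            return name, []  # cluster not available: contribute nothing
        return name, hits

    combined = []
    tasks = [run(name, search) for name, search in clusters.items()]
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        combined.extend(hits)
        combined.sort(key=lambda hit: hit["score"], reverse=True)
        on_results(name, list(combined))  # progressive page update
    return combined
```

In the web application, `on_results` would be the place to push an updated result list to the browser.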
How can we/Elasticsearch recognize similarity between two documents?
If two documents live in the same directory, the newest one should take precedence and fold the similar documents below it.
Currently better written in Perl
* Don't rescan/reanalyze elements that already exist in Elasticsearch
* Delete entries that don't exist in the filesystem anymore
Which module provides interesting video metadata?
Use Video::Subtitle::SRT for reading subtitle files
How can we find where / on what line search results were found? If we include a magic marker (HTML comment?) at the end/start of a line, we could hide it when displaying the results to the user while still using it to orient ourselves in the document.
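A sketch of the marker idea in Python, for illustration only. The marker format `<!--L42-->` and all function names are assumptions. Whether such markers survive analysis and highlighting in Elasticsearch unchanged would still need to be checked.

```python
import re

MARKER = re.compile(r"<!--L(\d+)-->")


def add_markers(text):
    """Prefix every line with an HTML comment carrying its line number."""
    return "\n".join(f"<!--L{number}-->{line}"
                     for number, line in enumerate(text.split("\n"), 1))


def strip_markers(text):
    """Remove the markers again before showing text to the user."""
    return MARKER.sub("", text)


def line_of(marked, offset):
    """Line number that contains the character at `offset` in the
    marked text, or None if the offset precedes the first marker."""
    last = None
    for match in MARKER.finditer(marked):
        if match.start() > offset:
            break
        last = int(match.group(1))
    return last
```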
* MP3s get imported but could use a nicer body rendering.
* Playback duration should be calculated
* Also import audio lyrics - how could these be linked to their mp3s?
Playlists should get custom rendering (album art etc.)
Playlists should ideally also hotlink their contents
Consider importing a Wikipedia dump
Some other larger, mixed corpus, like http://eur-lex.europa.eu/
Use the Enron mail corpus?
Find out which one(s) we want:
https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html
From first glance, we might want Simple Expansion, but Genre Expansion also seems interesting.
We want to treat some synonyms as identical though, like 'MMSR' and its German translation 'Geldmarktstatistik'.
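For reference, simple expansion with identical treatment of both terms could be configured roughly as below. This is a sketch in Python that only builds the index settings as a dictionary; the filter name `local_synonyms` and the analyzer name `with_synonyms` are made up. The `synonym` token filter type and its comma-separated rule format are the ones described in the Elasticsearch guide linked above.

```python
import json


def synonym_settings(groups):
    """Build index settings in which every group of words is treated
    as interchangeable (simple expansion)."""
    # Rules are lowercased because the lowercase filter runs first.
    rules = [", ".join(word.lower() for word in group) for group in groups]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "local_synonyms": {"type": "synonym", "synonyms": rules},
                },
                "analyzer": {
                    "with_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "local_synonyms"],
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(synonym_settings([["MMSR", "Geldmarktstatistik"]]), indent=2))
```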
Create screencasts using http://www.openshot.org/videos/
The first start should be as configuration-free as possible.
Use one of the fancy JavaScript walk-through implementations to offer an optional walk-through of the search page and results page.
Submit HTML and a URL into the index
    submit-url --url 'https://example.com' --html '<html><body>Hello World</body></html>'

    # Remind ourselves when we search for "user list" where it lives:
    submit-url --file '/etc/passwd' --html '<html><pre>machine user list password</pre></html>'

    submit-url --json '{ "url": "", "content" : "", ... }'
This allows for custom handling of single entries
Detect "genre" of web page (forum, product, social, blog, ...)
Detect porn page by using the list of word pairs at https://github.com/searchdaimon/adult-words
Don't import hidden files by default
Have a file .search or .index which contains options, like no-index or ignore for this folder and its subfolders.
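A sketch of how a crawler could honour these rules, in Python for illustration only. The file format (one option per line) and the function names are assumptions; the option names `no-index` and `ignore` are the ones proposed above and are treated as equivalent here.

```python
from pathlib import Path


def folder_options(folder, names=(".search", ".index")):
    """Collect options from control files in `folder` and all of its
    parent folders, so options apply to subfolders as well."""
    options = set()
    folder = Path(folder)
    for candidate in [folder, *folder.parents]:
        for name in names:
            control = candidate / name
            if control.is_file():
                options.update(line.strip()
                               for line in control.read_text().splitlines()
                               if line.strip())
    return options


def should_index(path):
    """Skip hidden files and anything below a folder that opted out."""
    path = Path(path)
    if path.name.startswith("."):
        return False  # hidden files are not imported by default
    return not ({"no-index", "ignore"} & folder_options(path.parent))
```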
Show example SELECT statement
    SELECT
          product_name as title
        , 'http://productserver.internal/product/' || convert(varchar,products.id) as url
        , product_description as content
    FROM products
Repurpose https://perlmonks.org?node_id=449873 (and its replies) for better enterprise integration
Skip the HTTP generation process and reuse App::Wallflower for crawling a Dancer website.
Both IMAP folders and file systems are basically trees (acyclic directed graphs) and far easier to crawl than the cyclic graphs of web pages. Abstract out the crawling of a tree into a common module.
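The common part is small. A sketch in Python, for illustration only: the crawler knows nothing about IMAP or files and only needs a hypothetical `children` callback that lists the items below a node, and a `visit` callback that indexes one node.

```python
def crawl(root, children, visit):
    """Depth-first walk over a tree. No visited-set is kept, which is
    only safe because a tree has no cycles."""
    stack = [root]
    while stack:
        node = stack.pop()
        visit(node)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(list(children(node))))
```

For a file system, `children` could be `lambda p: p.iterdir() if p.is_dir() else []` on `pathlib.Path` objects. Symlinks can reintroduce cycles, so a real implementation would have to skip or track them.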
* Turn index-imap and index-filesystem into modules so they become independent of being called from an outside shell.
This also implies they become runnable directly from the web interface without an intermediate shell.
* Add attachment import to the imap crawler
To pull in information about people you know
Implement metasearch across multiple ES instances
We want a field to store when we last visited a URL so we don't reindex every file with every run.
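The check itself is tiny. A Python sketch for illustration only, where `last_indexed` is a hypothetical field name holding a Unix timestamp and `entry` is the stored document, or None if the URL is not in the index yet:

```python
def needs_reindex(entry, modified_at):
    """True if the resource was never indexed or changed since then."""
    last = entry.get("last_indexed") if entry else None
    return last is None or modified_at > last
```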
Autocompletion needs to associate keywords with documents. These could come from a local .searchapp file or better be stored per-URL / per-document in an SQLite database for easy index reconstruction.
This needs close correlation with synonyms, which also could be (filesystem-) local for a (shared) folder or (user-)global in an SQLite database.
We want to have queues in which we store URLs to be crawled to allow for asynchronous submission of new items. This also allows us to be rate limited and restartable.
This could be an SQLite database, or just a flat text file if we have a way to store the last position within that text file.
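The flat-file variant could look like the following Python sketch, for illustration only. The class name and the `.pos` sidecar file that stores the byte offset of the next unread line are assumptions. It is single-consumer and does no locking.

```python
import os


class FileQueue:
    """Append-only text file of URLs plus a sidecar file that stores
    the byte offset of the next unprocessed line."""

    def __init__(self, path):
        self.path = path
        self.pos_path = path + ".pos"

    def submit(self, url):
        with open(self.path, "a", encoding="utf-8", newline="\n") as fh:
            fh.write(url + "\n")

    def _position(self):
        try:
            with open(self.pos_path) as fh:
                return int(fh.read() or 0)
        except FileNotFoundError:
            return 0

    def next(self):
        """Return the next URL, or None if the queue is drained."""
        if not os.path.exists(self.path):
            return None
        with open(self.path, "rb") as fh:  # binary: offsets are bytes
            fh.seek(self._position())
            line = fh.readline()
            if not line.endswith(b"\n"):
                return None  # nothing left, or a half-written line
            position = fh.tell()
        with open(self.pos_path, "w") as fh:
            fh.write(str(position))
        return line.decode("utf-8").rstrip("\n")
```

Because the position is persisted after every item, a restarted crawler picks up where the previous run stopped. Rate limiting would sit in the loop that calls `next()`.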
Is there any use in reviving FFRIndex?
Automatically (re)scan resources by using a notification method like the following to be notified about new or changed resources.
This would immediately make all money transactions from Hibiscus available for searching.
Can Hibiscus directly show a single transaction from the outside?
Open movie database http://omdbapi.com/ - has dumps available
Discogs data dumps - http://data.discogs.com/
Automatic search should be triggered for incoming phone calls. This allows relevant emails to be shown automatically when the caller is a sender whose emails contain their phone number.
Also, the automatic search should be easily triggered by a command line program. This likely needs something like HTTP::ServerEvent to keep a channel open so the server can push new information.
Data portability is very important, not least because of seamless index upgrades/rollbacks/backups.
Sharing indices would also be nice in the sense of websites or people offering datasets
How can we get DBI and Promises to work nicely together?
New items to be imported into Elasticsearch could be stored/read from a DBI table. This would allow for a wider distributed set of crawlers feeding through DBI to Elasticsearch.
To improve search results, a log of "failed" queries should be kept and the user should be offered manual correction of the failed queries.
If a query had no results at all, the user should/could suggest some synonyms or even documents to use instead
If a query had only low-score results/documents, the results are also a candidate for manual improvement. How can we determine a low score?
How will we determine if a query/word was abandoned?
We should keep (server-side) track of click-throughs to actually find out which files/documents are viewed and rank those higher
Also, we should have an "unrank this" link to give the user a way to make the engine forget misclicked "ranked" items easily from the results.