Added by Christiaan Fluit, last edited by Christiaan Fluit on 2008-04-09

AutoFocus Future Work

This page will be extended and refined over time. For now it is a rough list of functionality that we would like to add to AutoFocus.

Note that this only includes large changes to AutoFocus that often require some research on precise functionality and technical obstacles before we can implement them. Simple ideas for improvements that are planned on a shorter time scale are typically entered as issues in the AutoFocus Issue Tracker.

Background Scanning

Besides manually refreshing sources, users should be able to specify an automatic refresh schedule.

Also, a refresh, once issued, should run in the background so that users can continue querying.

How this is best achieved depends on a number of factors. Most Windows users, for example, will expect to enable and edit schedules from a preferences window within the application, which requires the application to be able to register itself as a service. Power users and users on platforms such as Linux may prefer a separate scan application with command-line options that can be integrated with cron. Of course, these options are not mutually exclusive.
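As a rough illustration of the in-application variant, the following minimal sketch runs a refresh task periodically on a background thread using the standard java.util.concurrent scheduler. The class name BackgroundScanScheduler and the Runnable standing in for a source refresh are hypothetical; nothing here reflects existing AutoFocus code, and registration as an OS service is not covered.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: runs a source refresh periodically on a background thread,
// so the thread that serves user queries is never blocked by scanning.
class BackgroundScanScheduler {

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "background-scan");
                thread.setDaemon(true); // do not keep the JVM alive just for scanning
                return thread;
            });

    // refreshTask is an assumed stand-in for "rescan one source".
    ScheduledFuture<?> schedule(Runnable refreshTask, long initialDelay, long interval, TimeUnit unit) {
        // Fixed delay: the interval is measured from the end of one scan to the
        // start of the next, so long scans never overlap or pile up.
        return executor.scheduleWithFixedDelay(() -> {
            try {
                refreshTask.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would suppress all later runs.
                System.err.println("Refresh failed: " + e);
            }
        }, initialDelay, interval, unit);
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

A preferences window could call schedule(...) with the interval chosen by the user, while a command-line scan application driven by cron would simply run the same refresh task once and exit.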

Use OS' Indexing Service

A feature that would make background scanning partially unnecessary would be to use the index and search service provided by the OS. Windows Vista and Mac OS X (Spotlight) already provide such services; Windows XP provides one optionally.

These services have the advantage that their indexes are kept reasonably up to date without the AutoFocus user having to worry about it.

Potential disadvantages are that you are limited to the set of documents that are being indexed (e.g. the default settings may skip a lot of useful documents without the user realizing it) as well as limited by the quality of the full-text and metadata extraction provided by the platform.

Shell integration

Full integration with the OS's shell, e.g. a right-click menu in the Details panel (both table and list) or Cluster Map that allows people to copy a document, create shortcuts, open the containing folder, etc.

Challenges include getting this to work in Java (can we control Windows Explorer's file context menu through JDIC?), handling resources that are not in the local file system, deciding what to do with nested objects once we have archive support, and deciding whether to allow Cut and Delete, which would leave the metadata outdated.

Archive Support

Be able to search inside ZIP files and other archives. This requires some work in Aperture. The required component looks like an odd combination of the Crawler and Extractor interfaces, e.g. a Crawler-like component operating on a DataObject. Processing of mail contents should also fall under this API, so that the same code can be used to interpret the contents of a mail from an IMAP server and the contents of a locally stored .eml file. This code is now duplicated in ImapCrawler and MailExtractor, the latter being able to determine only the mail body.
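To make the "Crawler-like component operating on a DataObject" idea more concrete, here is a minimal sketch that walks the entries of a ZIP stream and hands each one to a callback. It uses only java.util.zip; the names ArchiveSubCrawler and EntryHandler are hypothetical and are not part of the Aperture API, and the input stream merely stands in for the content of a DataObject.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: crawls the entries nested inside one archive stream.
class ArchiveSubCrawler {

    // Assumed callback; a real design would probably produce nested data objects.
    interface EntryHandler {
        // The handler may read the content but must not close it.
        void handle(String entryName, InputStream content) throws IOException;
    }

    // Visits every non-directory entry and returns how many were handled.
    static int crawl(InputStream archiveStream, EntryHandler handler) throws IOException {
        int handled = 0;
        try (ZipInputStream zip = new ZipInputStream(archiveStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // While an entry is open, reading from zip yields that entry's bytes only.
                    handler.handle(entry.getName(), zip);
                    handled++;
                }
                zip.closeEntry();
            }
        }
        return handled;
    }
}
```

Because the handler receives an ordinary InputStream, it could in principle pass an entry on to an extractor or, for a nested archive, back into crawl itself. Whether that fits the eventual Aperture design is an open question.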

Quick Search

Add an AutoFocus system tray icon with a "Quick Search" option in its popup menu that lets the user enter a keyword search. This is handy for incidental searchers who don't want to start a full-blown application before being able to do a simple search. It may not be that difficult to realize once we have implemented background scanning, as we will then have an application running continuously in the background, probably with a tray icon to manage it.

Semantic Zooming

Be able to show search result details within the Cluster Map, when the user somehow zooms in on the results (slider, focus+context distortion, etc). An individual result may look like this:

Try out this jar file to see some interaction (save to disk and double-click it, requires Java 5+).

Sharing Sources

Source definitions (and perhaps also crawl results) may be shared among users. This could be from one AutoFocus user to another, but also from an AutoFocus installation to a Metadata Server or vice versa. The same applies between Metadata Server installations.

For administration purposes this could be very helpful. For example, AutoFocus is a very handy tool for checking whether you have correctly defined your source, after which you could upload the defined source to a Metadata Server for deployment in an organization.

An interesting technology to look at is Bonjour (http://www.apple.com/macosx/features/bonjour/). It is used, for example, in iTunes: based on your own settings and those of other users, you can see the other iTunes users in your network and browse and play their music collections. We could enhance AutoFocus so that you are able to search your colleagues' files (of selected sources) and optionally even retrieve those files. This turns AutoFocus into groupware and makes it possible to share information without the need for a central AutoFocus Server.

Preview Pane / Keyword Highlighting

In order to read the contents of a search result, one now needs to open an external application. This is often slow and disruptive. Instead, a preview pane could be offered that allows for a quick glimpse of the result's contents.

This pane can additionally be used to highlight keyword occurrences.

Another way to realize keyword highlighting is to use the ability of certain native viewers to highlight words. See e.g. http://www.pdfbox.org/userguide/highlighting.html.

Configurable Document Root

The paths of indexed documents are currently hard-coded in the Lucene and Sesame indices. This means that when a document folder tree is moved to a different location or becomes available under a different name (e.g. different drive letter, different Windows share, ...), the entire tree needs to be re-indexed. Also, these hard-coded paths make distribution of a set of documents and accompanying index on e.g. a CD, USB key or other type of removable medium hard to realize.

We could alter the indexing process, so that the document identifiers in our index start with a symbolic variable (e.g. "$FOLDERROOT") which is then configured to contain some path. A similar mechanism is used in Gnowsis.
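A minimal sketch of the symbolic-variable idea follows. It assumes that paths are normalized to forward slashes and that the configured root has no trailing slash; the class DocumentRootResolver and its methods are hypothetical and do not describe how Gnowsis or AutoFocus actually implement this.

```java
// Hypothetical sketch: stores document identifiers relative to a symbolic root
// and resolves them against whatever root is configured at query time.
class DocumentRootResolver {

    static final String VARIABLE = "$FOLDERROOT";

    // Index time: replace the configured root prefix with the symbolic variable.
    // Assumes forward slashes and a root without a trailing slash.
    static String toSymbolic(String absolutePath, String root) {
        boolean underRoot = absolutePath.startsWith(root)
                && (absolutePath.length() == root.length()
                    || absolutePath.charAt(root.length()) == '/');
        return underRoot ? VARIABLE + absolutePath.substring(root.length()) : absolutePath;
    }

    // Query time: substitute the currently configured root for the variable.
    static String resolve(String storedId, String root) {
        return storedId.startsWith(VARIABLE)
                ? root + storedId.substring(VARIABLE.length())
                : storedId;
    }
}
```

For example, a document indexed as D:/docs/reports/q1.pdf with root D:/docs would be stored as $FOLDERROOT/reports/q1.pdf, and would resolve to E:/usb/docs/reports/q1.pdf once the root is reconfigured to E:/usb/docs.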

Alternatively, we could use the URI transformation mechanism that is already used when you define a Metadata Server source. These URI transformers apply string replacements or regular expressions to the ID of a document to derive a URL through which the document can be accessed. We could easily enable this functionality for all source types. The benefit of this approach, besides greater flexibility and availability of code, is that it only needs to be configured once you start to move documents around; there is no need to define a document root upfront. Still, this extra flexibility comes at a price: not all users will understand how to use these transformers properly.
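The following sketch shows the general shape of such a transformer: an ordered list of regular-expression replacements applied to a document ID. It is an illustration built on java.util.regex only; the class UriTransformerChain and the example URLs are hypothetical and do not reproduce the existing Metadata Server transformer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: derives an access URL from a stored document ID by
// applying regular-expression replacements in the order they were added.
class UriTransformerChain {

    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // The replacement uses java.util.regex syntax, so $1 refers to a capture group.
    UriTransformerChain add(String regex, String replacement) {
        rules.add(new Rule(Pattern.compile(regex), replacement));
        return this;
    }

    String transform(String documentId) {
        String result = documentId;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }
}
```

A user who moved a folder tree from a local drive to an intranet share might add a single rule such as add("^file:/D:/docs/", "http://intranet.example.org/docs/"); IDs that do not match any rule pass through unchanged.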

Result Ranking

AutoFocus does not show any ranking of query results. Below the surface we are able to retrieve the relevance of a query result (Lucene supports it and the Sesame framework that surrounds it can pass it through) but we decided not to show it... even though several people have asked for it.

The problem is that AutoFocus is more than a single-query-single-result-set tool. Using the Cluster Map visualization, we show how several query result sets overlap. This means that in order to add query relevance to this setup, we would have to combine the relevance scores of documents matching multiple queries. First, this cannot be done with Lucene relevance scores: they can only be used to order the results within a single result set and are not comparable between result sets. Second, only keyword search supports relevance scores; the other types of search are purely binary, which adds a further complication.

Still, I can imagine that we can adapt AutoFocus to at least show the relevance when you select a single keyword search result set in the Cluster Map, and hide/disable it when you select a particular cluster, a combination of result sets or a non-keyword search result set.

Configurable Facets

The list of facets shown in AutoFocus is hardwired in the software. It would be attractive to make this configurable somehow, so people can leave out facets that are not of interest or even add new facets based on e.g. enterprise taxonomies.

The mechanism for sharing facets could be shared with Aduna Spectacle.

Support for non-Latin languages

It seems that, although AutoFocus can display file names containing e.g. Chinese, Korean and Japanese characters correctly, keyword search for these languages is broken. This is because the rules for tokenizing these texts are different. We should take a look at Lucene's CJKAnalyzer; see e.g. this thread for background info: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg08223.html.
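To show why tokenization differs, the sketch below illustrates the overlapping-bigram approach commonly used for CJK text, where words are not separated by spaces. It is a simplified stand-alone illustration in plain Java, not Lucene's CJKAnalyzer, and its behavior may differ from that analyzer in details; the class name CjkBigramSketch is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: emits overlapping two-character tokens for runs of CJK
// characters and whole lower-cased words for other letters and digits.
class CjkBigramSketch {

    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA || script == UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isCjk(cps[i])) {
                int start = i;
                while (i < cps.length && isCjk(cps[i])) i++;
                if (i - start == 1) {
                    tokens.add(new String(cps, start, 1)); // lone character: keep as is
                }
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(new String(cps, j, 2)); // overlapping bigram
                }
            } else if (Character.isLetterOrDigit(cps[i])) {
                int start = i;
                while (i < cps.length && Character.isLetterOrDigit(cps[i]) && !isCjk(cps[i])) i++;
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }
}
```

With this scheme a run of four CJK characters yields three overlapping two-character tokens, so a two-character query term can match anywhere inside the run, which whitespace-based tokenization cannot achieve.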

Eliminating Duplicates in People Facet

The branches in the People facet often show duplicates because of name variants (in the Creators branch) or the same person or entity using different email addresses (in the Senders and Receivers branches). Some of these are relatively easy to catch: if two email addresses have the exact same name associated with them, they can safely be assumed to be the same person/entity. More subtle differences are harder to catch.
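The easy case described above, identical names with different email addresses, can be sketched as a simple grouping step. The class PeopleDeduplicator and the representation of each sender as a {name, address} pair are hypothetical; lower-casing the addresses is an added assumption, and the harder cases of name variants are deliberately not addressed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: merges email addresses that carry exactly the same
// display name into a single person or entity.
class PeopleDeduplicator {

    // Each element is assumed to be a two-element array: {displayName, emailAddress}.
    static Map<String, Set<String>> groupByExactName(List<String[]> nameAndAddress) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String[] pair : nameAndAddress) {
            String name = pair[0];
            String address = pair[1];
            if (name == null || name.isEmpty()) {
                continue; // without a name there is nothing safe to merge on
            }
            // Assumption: addresses are compared case-insensitively.
            groups.computeIfAbsent(name, key -> new TreeSet<>())
                  .add(address.toLowerCase(Locale.ROOT));
        }
        return groups;
    }
}
```

Each key of the resulting map would become one node in the Senders or Receivers branch, with all of its addresses attached to it.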

Techniques and approaches to inspect:

Shortcuts for facet search

The keyword search facet has a number of fields you can specifically search on, such as text, title, path, etc. See the part about Field-specific Search in the User Manual.

It would be nice if such shortcuts were available for all facets, so that you can for example search on "language:en".

There are some technical challenges with this. The current set of five fields maps directly onto fields in the Lucene index. All other information is stored in a Sesame RDF repository. This means that either these values also need to be stored in Lucene (meaning duplication), or the query parsing needs to be extended. In the latter case, use of e.g. the language field in a keyword query would mean that this clause is removed from the Lucene query after parsing and added as a clause to the surrounding RDF query.
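The second option, splitting a parsed query into a Lucene part and an RDF part, might look roughly like the sketch below. It works on whitespace-separated terms only and ignores quoted phrases, boolean operators and values that themselves contain a colon. The class FacetShortcutSplitter is hypothetical, and the set of Lucene-backed field names is passed in as a parameter rather than assumed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: separates field:value clauses that Lucene can answer
// from clauses that must become constraints in the surrounding RDF query.
class FacetShortcutSplitter {

    final String luceneQuery;                   // what is left for Lucene
    final Map<String, String> facetConstraints; // field -> value, for the RDF query

    private FacetShortcutSplitter(String luceneQuery, Map<String, String> facetConstraints) {
        this.luceneQuery = luceneQuery;
        this.facetConstraints = facetConstraints;
    }

    static FacetShortcutSplitter split(String query, Set<String> luceneFields) {
        List<String> luceneTerms = new ArrayList<>();
        Map<String, String> facets = new LinkedHashMap<>();
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            int colon = term.indexOf(':');
            boolean fielded = colon > 0 && colon < term.length() - 1;
            if (fielded && !luceneFields.contains(term.substring(0, colon))) {
                // Not a Lucene field: move the clause to the RDF side.
                facets.put(term.substring(0, colon), term.substring(colon + 1));
            } else {
                luceneTerms.add(term);
            }
        }
        return new FacetShortcutSplitter(String.join(" ", luceneTerms), facets);
    }
}
```

Given the Lucene fields text, title and path, the query "report language:en title:budget" would leave "report title:budget" for Lucene and produce one facet constraint, language = en, for the RDF query.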

Perhaps this is all solved once we replace the AutoFocus-specific LuceneSail with the generic LuceneSail?
