Algorithmic Indexing of Difficult Document Data

This developing research project aims to integrate open source text processing tools to assist with the analysis, navigation, and reading of large document-based data sets. Built around existing open source software (for details see the software section), the project is as much about gluing together existing tools as it is about creating new ones.

The overarching aim is to keep manual editing of documents in the database to a minimum. Improvements to analysis and navigability should therefore come from improvements in the algorithms, not from manual tweaks to documents and their metadata. Ultimately, the aim is to produce a tool that enhances the readability of difficult databases of text documents.

The project will use a range of machine learning and natural language processing techniques to extract themes, topics, and keywords from text. These will then be used to build enhanced search functionality. Machine learning will also be used to build multi-layered networks showing the different ways the documents in the data sets can be linked together, allowing interested parties to navigate through the data. Algorithms will take the individual documents, generate metadata, process the documents, and insert them into both a document database and a network database for visualisation and analysis.
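
As a rough illustration of the flow described above, the following minimal sketch takes raw documents, generates keyword metadata, and fills two stores: one for documents and one for the links between them. It is not the project's implementation. The TF-IDF scoring, the stop-word list, the function names (extract_keywords, build_network, index_documents), and the plain dictionaries standing in for the document database and the network database are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens, minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def extract_keywords(docs, top_n=3):
    """Return {doc_id: [keyword, ...]} ranked by a simple TF-IDF score."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    n_docs = len(docs)
    keywords = {}
    for doc_id, toks in tokens.items():
        term_counts = Counter(toks)
        scores = {
            term: (count / len(toks)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_counts.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[doc_id] = ranked[:top_n]
    return keywords


def build_network(keywords):
    """Link every pair of documents that share at least one keyword."""
    edges = {}
    for a, b in combinations(sorted(keywords), 2):
        shared = set(keywords[a]) & set(keywords[b])
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges


def index_documents(docs, top_n=3):
    """Generate metadata and fill the two in-memory stand-in stores."""
    keywords = extract_keywords(docs, top_n)
    document_store = {
        doc_id: {"text": text, "keywords": keywords[doc_id]}
        for doc_id, text in docs.items()
    }
    network_store = build_network(keywords)
    return document_store, network_store


if __name__ == "__main__":
    # Invented sample documents, not taken from any project corpus.
    sample = {
        "d1": "Treaty negotiations on fishing quotas",
        "d2": "Fishing quotas dispute",
        "d3": "Budget report",
    }
    document_store, network_store = index_documents(sample)
    for doc_id, record in document_store.items():
        print(doc_id, record["keywords"])
    for pair, shared in network_store.items():
        print(pair, shared)
```

Keeping the keyword step separate from the linking step means a different extraction method, such as topic modelling, could replace extract_keywords without touching the network construction. Each additional linking rule would produce one more layer of the multi-layered network.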


Navigate

Corpora: