Searching <10,000 documents to find <150 relevant articles

This is the search where the Iris.ai tools are especially helpful. Traditionally, at this phase of your search you have had two choices: spend an awful lot of time building advanced keyword queries, or spend an awful lot of time reading headlines. A third option is to trust recommendation engines based on a random selection of “other people are also interested in” suggestions, citation counts and other popularity counts, which means you will miss out on a lot of highly relevant content that other people might not know of yet.

A note on the quality of results: the accuracy of machine results, when measured properly, is expressed as Precision and Recall.

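  • Precision = the share of the documents the machine identified as relevant that really are relevant (true positives out of everything returned); it reflects how free of noise the result list is.
  • Recall (sensitivity) = the share of all relevant documents that the machine actually found (true positives out of everything relevant); it reflects how complete our collection of relevant information is, i.e. how little the machine has missed.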
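
As a minimal illustration of the two measures, the sketch below computes both from two sets of document IDs. The function name `precision_recall` and the example IDs are hypothetical and only serve the illustration; they are not part of the Iris.ai tools.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. truly relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: how much of what was returned is relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: how much of what is relevant was returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents are returned, 3 of them are relevant,
# and 6 relevant documents exist in the whole collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d3", "d7", "d8", "d9"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

Here the list is fairly clean (three out of four results are relevant), yet half of the relevant documents are still missing from it.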

Be mindful of this: the higher the precision you are looking for (no noise or irrelevant papers in the final dataset), the higher the chance that a relevant paper is missing from the final dataset. It is a great idea to decide in advance what matters most to you: not missing out on anything, or having a list of only relevant documents without any noise. It is not possible to achieve perfection in both (neither by machine nor by human); there is always a tradeoff in one direction or the other.

STEP 1

Create your content collection of >10,000 documents. This can be done by (1) uploading documents, (2) filtering with broad terms and topics on, for example, the Open Access dataset, USPTO or any content your organization has integrated, or (3) creating around 15-20 Explore maps and merging the explored data sets.

STEP 2

There is a range of different search strategies available. We suggest you try out a few to familiarize yourself with them, so that you can always select the one best suited to your current process.

Strategy 1 - Increasingly focused analysis

Apply the Analyze tool to the data set and go through both the main concepts of the word analysis and the topics of the topic analysis. Choose clearly relevant topics for inclusion and clearly irrelevant topics for exclusion, then do the same for terms: clearly relevant terms for inclusion, clearly irrelevant terms for exclusion.

Do a manual ‘sanity check’ of the ensuing results. You should have reduced your result list considerably (topic dependent, but perhaps by 50%). Your current reading list should contain a good share of relevant articles, although there is no need for a nuanced evaluation yet.

Now, save this dataset as a new, smaller dataset and redo the Analysis on this smaller dataset. You should notice how the topics and concepts are now even more focused, and again you can select them for inclusion and exclusion.

This can be repeated as many times as you like, until you have a reading list that you are satisfied with and that is dense enough to review manually, deleting any articles that are not relevant.
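
The narrowing loop described above can be pictured with the following sketch. It is purely illustrative and says nothing about how the Analyze tool works internally: the `apply_filters` function, the topic labels and the tiny collection are all hypothetical, and each document is simply assumed to carry a set of topics.

```python
def apply_filters(docs, include=(), exclude=()):
    """Keep documents with at least one included topic and no excluded topic."""
    include, exclude = set(include), set(exclude)
    kept = {}
    for doc_id, topics in docs.items():
        if include and not (topics & include):
            continue  # none of the wanted topics is present
        if topics & exclude:
            continue  # an unwanted topic is present
        kept[doc_id] = topics
    return kept


# Hypothetical collection: document ID -> topics assigned to the document.
collection = {
    "a": {"batteries", "lithium", "recycling"},
    "b": {"batteries", "automotive"},
    "c": {"solar", "photovoltaics"},
    "d": {"batteries", "lithium", "mining"},
    "e": {"lithium", "pharmacology"},
}

# Round 1: broad inclusion and exclusion, saved as a new, smaller dataset.
round1 = apply_filters(collection,
                       include={"batteries", "lithium"},
                       exclude={"pharmacology"})

# Round 2: analyze the smaller dataset again with more focused choices.
round2 = apply_filters(round1,
                       include={"recycling", "mining"},
                       exclude={"automotive"})

print(sorted(round1))  # ['a', 'b', 'd']
print(sorted(round2))  # ['a', 'd']
```

Each round works only on the output of the previous one, which is why the topics and concepts offered in later rounds become more focused.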

Strategy 2 - A Venn diagram of contexts

RSpace™ has a very powerful and unusual filter that offers enormous potential, especially in scenarios where what you are looking for is not easily put into one key term. This filter takes your own description of what you are looking for and does text similarity matching against all of the documents in your data set to find contextually similar documents. This means that even if you use entirely different vocabulary from the author of an article, you will still find the article as long as the context is the same.

Context filters should be 50-100 words long and quite specific in order to work well. We suggest an iterative approach to adding each context filter:

  1. Write and apply the context filter.
  2. Sort the list based on the context filter score.
  3. Determine where the results start becoming less relevant, set the context filter to that % score and reapply.
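
The three steps above can be sketched in code. Note the assumptions: the scoring below is a plain bag-of-words cosine similarity used as a stand-in, not the matching RSpace™ actually performs (unlike the real filter, this stand-in only rewards shared words). The names `context_filter`, `bag_of_words` and `cosine_similarity`, the documents and the 30% cut-off are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def context_filter(description, docs, min_percent=0.0):
    """Score every document against the description, sort by score,
    and keep those at or above the cut-off (scores are in percent)."""
    query = bag_of_words(description)
    scored = [(100 * cosine_similarity(query, bag_of_words(text)), doc_id)
              for doc_id, text in docs.items()]
    scored.sort(reverse=True)  # step 2: highest score first
    return [(doc_id, score) for score, doc_id in scored if score >= min_percent]


docs = {
    "d1": "recycling of lithium ion batteries to recover cobalt",
    "d2": "lithium mining and water use",
    "d3": "deep learning for image classification",
}
description = "recovering cobalt from spent lithium ion batteries by recycling"

# Step 1 and 2: apply the filter without a cut-off and inspect the ranking.
ranked = context_filter(description, docs)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.0f}%")

# Step 3: results get less relevant after d1, so reapply with a cut-off.
focused = context_filter(description, docs, min_percent=30)
print([doc_id for doc_id, _ in focused])  # ['d1']
```

Raising the cut-off trades recall for precision: fewer irrelevant documents remain, but borderline relevant ones may drop out as well.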

It is possible to apply multiple context filters to create what is, in essence, a Venn diagram of contexts. We suggest removing the first context filter, applying a second one and fine-tuning it with the same process, and then applying both to see which articles are good matches for both of your descriptions.
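
Conceptually, applying two tuned context filters together keeps the overlap of the two result sets. The sketch below illustrates that idea only: the per-document scores, the cut-offs and the `passing` helper are hypothetical and do not reflect RSpace™ internals.

```python
def passing(scores, cutoff):
    """Return the IDs of documents scoring at or above the cut-off."""
    return {doc_id for doc_id, score in scores.items() if score >= cutoff}


# Hypothetical context filter scores (in %) for the same four documents.
scores_filter_a = {"d1": 82, "d2": 64, "d3": 35, "d4": 71}
scores_filter_b = {"d1": 77, "d2": 28, "d3": 69, "d4": 58}

# Each filter was tuned on its own, so each has its own cut-off.
matches_a = passing(scores_filter_a, cutoff=60)
matches_b = passing(scores_filter_b, cutoff=55)

# The overlap of the Venn diagram: documents matching both descriptions.
both = matches_a & matches_b
print(sorted(both))  # ['d1', 'd4']
```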

Strategy 3 - Mix and match

This third strategy perhaps requires less of an explanation. The smart filters can be mixed and matched in whatever way your heart desires. You can create a new, smaller dataset based on a mix of Analysis and/or Context filters and analyze it again. You can also delete articles from your data set at any time when doing manual reviews.

RSpace™ is a new way of handling scientific knowledge, and we are excited to see the ways our users interact with the tools. If you discover any new ways of using the tools, we’d love to hear from you, so we can recommend them to others!