
Searching millions of documents to find <10,000 relevant articles

Whether you start with PubMed, all of Open Access, USPTO or any other million-entry source, the process is largely the same.

STEP 1

Create a dataset in your Iris.ai account from one of the default or integrated large sources. Your first goal is to get the list below 10,000 articles by applying some broad filters. We suggest combining any of the following:

  • Limit your results to the relevant year range of interest.
  • Filter by repository (for example, include PubMed within the Open Access repository if you are interested in medical results, or exclude it if you are not).
  • Use the analysis tool to get access to Topic and Word analysis.
    • Select to include or exclude some of the main topics.
      • These topics are broad and every document belongs to multiple topics, so do not aim for perfection or worry about going through them all; just Include the 1-2 that you believe will be relevant, and Exclude the top 1-2 that seem highly irrelevant.
    • Use the word analysis tool by searching in the box for the 1-2 main concepts you know you are looking for.
      • Experiment with the taxonomy level of the word. As an illustrative example, four different levels would be Animal > Mammal > Cat > Maine Coon. Try this for your field of interest. Results will of course vary based on how broadly your topic has been researched. Don’t worry about finding the perfect level - the goal is just to get under 10,000.
      • Please note that in the word analysis you can choose between keywords and concepts, depending on your needs. A concept covers more than the keyword itself: it is a bag-of-words related to your term.
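The filters above are applied in the Iris.ai interface, but the narrowing logic can be sketched in a few lines of Python. This is only an illustration on a local list of records: the `Article` fields, the `narrow` function and the sample data are all hypothetical and are not part of any Iris.ai API.

```python
from dataclasses import dataclass, field

LIMIT = 10_000  # target ceiling from Step 1


@dataclass
class Article:
    # Hypothetical record layout, for illustration only.
    year: int
    repository: str
    topics: set = field(default_factory=set)    # topic ids the document belongs to
    concepts: set = field(default_factory=set)  # words/concepts found in the document


def narrow(articles, year_range=None, repositories=None,
           include_topics=(), exclude_topics=(), concept=None):
    """Apply the broad filters one after another; every filter is optional."""
    result = list(articles)
    if year_range is not None:
        first, last = year_range
        result = [a for a in result if first <= a.year <= last]
    if repositories is not None:
        result = [a for a in result if a.repository in repositories]
    if include_topics:
        wanted = set(include_topics)
        result = [a for a in result if a.topics & wanted]
    if exclude_topics:
        unwanted = set(exclude_topics)
        result = [a for a in result if not (a.topics & unwanted)]
    if concept is not None:
        result = [a for a in result if concept in a.concepts]
    return result


# Made-up sample data.
sample = [
    Article(2019, "PubMed", {31, 4}, {"cirrhosis", "liver"}),
    Article(2005, "PubMed", {31}, {"liver"}),
    Article(2020, "arXiv", {7}, {"graph"}),
    Article(2021, "PubMed", {31, 12}, {"cirrhosis"}),
]

kept = narrow(sample, year_range=(2015, 2022), repositories={"PubMed"},
              include_topics=[31], exclude_topics=[12], concept="cirrhosis")
small_enough = len(kept) < LIMIT
print(len(kept), small_enough)
```

Each filter only ever shrinks the list, so the order in which you apply them does not change the final result, only how quickly you get below the limit.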

STEP 2

When your dataset is below 10,000 articles, save it as a new dataset and switch to that new dataset for the next stage. Why? Because with a new, smaller dataset, you can run a new Analysis on it and get much more granular results.

And, with your field-of-interest specific results, you’ve now moved over to another way of using the tools: searching in <10,000 articles to find <150 articles. Go there for the next steps!

EXAMPLE

Loading in the Open Access repository and filtering 1) to only see PubMed results, 2) by topic 31, which contains more than 1.7 million articles, and 3) by the term “cirrhosis”, I get a results list of 4,363 documents, which is a great starting set. Note that choosing “liver” instead of “cirrhosis” would give more than 100,000 results (too broad). Being more specific than cirrhosis will not work either: a term such as “Alpha-1 antitrypsin deficiency” is not yet known to this broad word analysis of >50 million documents, and will only become available once we have made a smaller dataset and analyzed it.

including repository example.png

topic analysis example.png

concept analysis example.png
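The choice of term in the example (too broad, about right, too specific to be recognised) can be sketched as a small Python helper. This is a minimal illustration under stated assumptions: `pick_term` is a hypothetical function, and the counts are invented for the Animal > Mammal > Cat > Maine Coon chain from Step 1; in practice you would read the counts off the word analysis tool.

```python
LIMIT = 10_000  # target ceiling from Step 1


def pick_term(chain, counts, limit=LIMIT):
    """Walk a chain ordered from broad to specific and return the first term
    whose result count is non-zero and below the limit, or None if none fits.

    A term missing from `counts` is treated as unknown to the broad analysis.
    """
    for term in chain:
        n = counts.get(term, 0)
        if 0 < n < limit:
            return term
    return None


chain = ["Animal", "Mammal", "Cat", "Maine Coon"]
# Invented counts; "Maine Coon" is deliberately absent (not recognised yet).
counts = {"Animal": 900_000, "Mammal": 120_000, "Cat": 6_500}

chosen = pick_term(chain, counts)
print(chosen)
```

With these invented numbers the helper stops at "Cat": the two broader terms exceed the limit, and the most specific one is not known until a smaller dataset has been analyzed.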