
How Keenious Curates Its Index

The Keenious index is a curated subset of OpenAlex: what each of the three filtering layers keeps and removes, and how citation metrics are recomputed within the index.

Overview

The Keenious index is built from OpenAlex, an open catalog of the global research system maintained by OurResearch. OpenAlex captures the scholarly record at its broadest (roughly 507 million records), and three layers of filtering narrow it to the ~188 million works searchable in Keenious. The index is also enriched with the Norwegian Scientific Index, an independent register of peer-reviewed publication channels, and citation-based metrics are recomputed so they reflect the curated index rather than the full catalog.

This page describes what each layer keeps and removes. All figures are current as of May 2026; the shape of the filtering is stable, while the absolute counts rise as OpenAlex grows.

Three filtering layers narrow OpenAlex's ~507 million records to the ~188 million works in the Keenious index
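
As a rough illustration of the arithmetic behind these figures, the sketch below computes how much each layer removes, using the approximate totals quoted on this page (in millions). The function and variable names are illustrative only, and the results inherit the rounding of the published figures.

```python
# Approximate totals quoted on this page, in millions of records (May 2026).
STAGES = [
    ("OpenAlex catalog", 507),
    ("after Layer 1: document types", 301),
    ("after Layer 2: source review", 244),
    ("after Layer 3: record completeness", 188),
]

def funnel(stages):
    """Return (stage name, millions removed, share of previous stage kept) per layer."""
    rows = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rows.append((name, before - after, after / before))
    return rows

rows = funnel(STAGES)
for name, removed, kept_share in rows:
    print(f"{name}: ~{removed}M removed, {kept_share:.0%} of the previous stage kept")

# Share of the full catalog that reaches the final index.
overall_share = STAGES[-1][1] / STAGES[0][1]
print(f"Overall: {overall_share:.0%} of OpenAlex records reach the index")
```

On these rounded figures, Layer 1 removes the most (about 206 million records), and a little over a third of the catalog reaches the final index.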

Layer 1: Document Types

OpenAlex catalogs many record types beyond publications. Five document types are kept:

  • Journal articles and conference papers (OpenAlex type article): full-length research articles published in journals or conference proceedings.

  • Review articles (review): literature reviews, systematic reviews, and meta-analyses.

  • Preprints (preprint): from curated preprint servers such as arXiv, bioRxiv, and SSRN.

  • Books and monographs (book): complete books indexed as a single record. Individual chapters are not indexed as standalone works.

  • Doctoral dissertations (dissertation): OpenAlex's dissertation type covers doctoral work only; master's and bachelor's theses are classified by OpenAlex as other and are not in the index.

Everything else (book chapters, datasets, editorials, letters, errata, reports, grants, peer-review records, paratext) is removed at this layer, narrowing ~507 million records to ~301 million.
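
The type filter described above can be sketched as a simple allow-list. This is a minimal illustration, not the actual Keenious pipeline: the records are made-up dictionaries carrying only an id and the OpenAlex type field, and the function name is hypothetical.

```python
# OpenAlex `type` values kept by Layer 1, as listed in this section.
KEPT_TYPES = {"article", "review", "preprint", "book", "dissertation"}

def passes_layer1(record: dict) -> bool:
    """Return True if the record's OpenAlex type is one of the five kept types."""
    return record.get("type") in KEPT_TYPES

# Made-up sample records for illustration.
records = [
    {"id": "W1", "type": "article"},
    {"id": "W2", "type": "book-chapter"},   # chapters are not indexed on their own
    {"id": "W3", "type": "dissertation"},
    {"id": "W4", "type": "other"},          # e.g. master's and bachelor's theses
]

kept = [r["id"] for r in records if passes_layer1(r)]
print(kept)
```

An allow-list keeps the rule conservative: any record type that is not explicitly kept, including one with a missing type, falls out at this layer.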

Layer 2: Source Review

Not all of the remaining works come from academic venues. OpenAlex indexes more than 280,000 distinct sources (journals, conference series, repositories, archives), and many serve purposes other than publishing research. Every source is reviewed by examining a sample of the works it contains: the review looks at what the source actually publishes, not at its name, reputation, or country. A small regional journal publishing original research is treated the same as a large international one.

Sources are not simply approved or rejected: they are classified along a quality spectrum, and only sources that fall outside the scholarly record are removed entirely. Two facts shape the review:

  • Metadata gaps are not quality problems. OpenAlex coverage is uneven: Japanese, Korean, Chinese, Russian, Arabic, Indonesian, and many other non-English sources often lack abstracts and DOIs because of data-pipeline differences. Sources are evaluated on the substance of what they publish, not on metadata completeness.

  • Repositories face a higher bar than journals. A repository that only catalogues publication titles imported from CV systems (no abstracts, no full text) adds nothing searchable and can be removed even when the underlying authors are legitimate. Repositories hosting actual research content, such as theses, preprints, and working papers with abstracts, are kept.

Roughly 46,000 sources are removed, in several broad categories:

  • Digitized archives: scanned historical materials valuable for preservation but not academic literature, such as yearbooks, photographs, and newspaper archives.

  • Open upload platforms: repositories where anyone can deposit content without editorial oversight, producing a mix of legitimate work and unvetted content. This does not apply to curated preprint servers or institutional repositories.

  • News and trade publications: magazines, newsletters, and trade journals that serve practitioners but do not publish original research.

  • Meeting-abstract collections: paragraph-length conference summaries. Short conference papers are included; bare abstract collections are not.

  • Document mills: pay-to-publish outlets that accept submissions without meaningful peer review, identifiable by negligible citation rates and inconsistent topical scope.

  • Data repositories and reference works: specimen catalogs, sensor databases, encyclopedias, dictionaries.

  • CV-style metadata catalogues: institutional systems listing publication titles without abstracts or full text.

  • Data-quality failures: sources where OpenAlex has systematically attributed works to the wrong venue.

Source review narrows ~301 million records to ~244 million.

Layer 3: Record Completeness

The final layer operates on individual records. A work has to meet four requirements:

  • Linkable: the record has a DOI or a landing-page URL where the work can be reached.

  • Searchable: the record has a meaningful title or abstract with enough real text for matching by meaning and by keyword. Placeholder titles, boilerplate phrases, and metadata stubs with no content do not qualify. The minimum text length is language-aware: a few characters of Chinese, Japanese, or Korean carry as much information as a longer English phrase.

  • A whole work: the record describes a complete paper, book, or thesis. OpenAlex sometimes assigns separate records to appendices, supplementary files, tables of contents, and similar fragments; these are removed so they do not surface as standalone results.

  • Unique: one canonical record per paper. The same work is often deposited across several repositories and aggregators, producing many near-identical records; one canonical version is kept, plus any independently cited copies.

Retracted and withdrawn works are also removed at this layer, and titles and abstracts are cleaned of artifacts that interfere with search, such as scraped website navigation, cookie-banner text, and stray HTML markup. The result is the final index of ~188 million works.
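
To make the linkable and searchable requirements more concrete, here is a minimal sketch of how such checks might look. It is an illustration under stated assumptions, not the Keenious implementation: the character thresholds, the placeholder list, the CJK detection rule, and the record field names are all hypothetical, and the whole-work and uniqueness checks are left out.

```python
import re

# Hypothetical thresholds: the page says only that the minimum is language-aware.
MIN_CHARS_CJK = 4
MIN_CHARS_OTHER = 15

# Hypothetical examples of placeholder titles that carry no searchable content.
PLACEHOLDER_TITLES = {"untitled", "no title", "title page", "table of contents"}

# CJK ideographs, Hiragana and Katakana, and Hangul syllables.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def is_linkable(record: dict) -> bool:
    """A record needs a DOI or a landing-page URL."""
    return bool(record.get("doi") or record.get("landing_page_url"))

def has_enough_text(text) -> bool:
    """Apply a shorter minimum length when the text is mostly CJK characters."""
    text = (text or "").strip()
    if not text or text.lower() in PLACEHOLDER_TITLES:
        return False
    cjk_chars = len(_CJK.findall(text))
    if cjk_chars * 2 >= len(text):  # at least half the characters are CJK
        return cjk_chars >= MIN_CHARS_CJK
    return len(text) >= MIN_CHARS_OTHER

def is_searchable(record: dict) -> bool:
    """A meaningful title or a meaningful abstract is enough."""
    return has_enough_text(record.get("title")) or has_enough_text(record.get("abstract"))

def passes_layer3(record: dict) -> bool:
    """Linkable, searchable, and not retracted (a subset of the Layer 3 rules)."""
    return (
        is_linkable(record)
        and is_searchable(record)
        and not record.get("is_retracted", False)
    )
```

With these assumed thresholds, a six-character Chinese title passes while a five-letter English title such as "Notes" does not, which is the kind of asymmetry a language-aware minimum is meant to allow.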

The Norwegian Scientific Index

Alongside source review, the index integrates the Norwegian Scientific Index (NSI), a register of peer-reviewed academic publication channels maintained by the Norwegian Directorate for Higher Education and Skills (HK-dir). NSI is curated by national disciplinary boards rather than derived from citation counts, which makes it an independent signal of peer-review status.

For each venue covered by NSI, Keenious surfaces:

  • A quality level: Level 1 (recognized peer-reviewed channel) or Level 2 (leading channel within its discipline)

  • The venue's principal language

  • The publisher's country

NSI is enrichment, not a filter: venues outside NSI are not removed. In search ranking, NSI-listed venues receive a boost and unlisted venues receive no penalty (see How Search Works in Keenious). The register is particularly useful as a peer-review signal for Nordic and non-English venues, which citation-based metrics underrepresent.
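
The "boost without penalty" behavior can be pictured with a small sketch. Everything here is an assumption made for illustration: the multiplier value, the choice of a multiplicative boost, and the function name are hypothetical, and the actual ranking formula is not documented on this page.

```python
from typing import Optional

# Hypothetical multiplier for venues listed in NSI (Level 1 or Level 2).
NSI_BOOST = 1.1

def apply_nsi_boost(base_score: float, nsi_level: Optional[int]) -> float:
    """Boost the score for NSI-listed venues; leave unlisted venues unchanged."""
    if nsi_level in (1, 2):
        return base_score * NSI_BOOST
    return base_score  # unlisted: no boost, but also no penalty

listed = apply_nsi_boost(2.0, 1)
unlisted = apply_nsi_boost(2.0, None)
print(listed, unlisted)
```

The point of the design is the asymmetry: a venue missing from NSI keeps exactly the score it would have had without the register.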

Recomputed Metrics

Citation counts and work counts are recomputed to count only works within the curated index. A paper's citation count in Keenious can be lower than in OpenAlex or Google Scholar because citations from excluded records (meeting abstracts, digitized archives, document mills) are not counted. An author's work count likewise covers only their publications in the index. The numbers are internally consistent, in that every citing work and every counted publication is one that can be found in Keenious, but they will not always match databases that count against a broader corpus.

The same applies to authors, venues, and institutions: only entities connected to works in the curated index appear, and their counts reflect it.
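
A minimal sketch of the recounting idea, assuming citations are available as (citing work, cited work) pairs and the curated index as a set of work IDs. The data layout and function name are hypothetical.

```python
def recompute_citation_counts(citations, index_ids):
    """Count a citation only when both the citing and the cited work are in the index."""
    counts = {work_id: 0 for work_id in index_ids}
    for citing, cited in citations:
        if citing in index_ids and cited in index_ids:
            counts[cited] += 1
    return counts

# Made-up example: X and Y stand for records excluded from the index.
index_ids = {"A", "B", "C"}
citations = [
    ("B", "A"),
    ("C", "A"),
    ("X", "A"),  # citing work is excluded, so this citation is not counted
    ("A", "B"),
    ("A", "Y"),  # cited work is excluded, so it has no count at all
]

counts = recompute_citation_counts(citations, index_ids)
print(counts)
```

In this example work A has three citations in the full data but two in the recomputed counts, which mirrors why a Keenious count can be lower than the figure shown elsewhere.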

Field-Weighted Citation Impact

Field-weighted citation impact (FWCI) is a normalized citation measure: 1.0 means a paper is cited at the average rate for comparable work (same field, year, and document type), and 2.0 means twice that rate. It allows a paper from a low-citation field to be compared with one from a high-citation field, and it is one of the ranking signals in search.

FWCI is recomputed within the curated index. The citations and the expected-citation baselines come from works that are themselves in the index, so excluded material does not distort the metric. The baseline calculation is also outlier-resistant: citation distributions are heavily skewed by a small number of landmark papers, and a simple field average would let a single blockbuster inflate the expected rate for its entire field, depressing every other paper's FWCI.

The resulting values will not always match OpenAlex, Scopus, or SciVal, because they are computed against a different corpus, but they are consistent for comparing works within Keenious.
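
The sketch below shows one way an outlier-resistant baseline could work. This page does not say which method Keenious uses, so the trimmed mean here is an assumption chosen for illustration, as are the trim percentage, the record fields, and the function names.

```python
from collections import defaultdict
from statistics import mean

def trimmed_mean(values, trim_percent=5):
    """Mean after dropping the top trim_percent of values (assumed method)."""
    ordered = sorted(values)
    drop = len(ordered) * trim_percent // 100
    kept = ordered[: len(ordered) - drop] if drop else ordered
    return mean(kept)

def compute_fwci(works, trim_percent=5):
    """FWCI = a work's citations / expected citations for its field, year, and type."""
    groups = defaultdict(list)
    for w in works:
        groups[(w["field"], w["year"], w["type"])].append(w["citations"])
    baselines = {key: trimmed_mean(vals, trim_percent) for key, vals in groups.items()}

    fwci = {}
    for w in works:
        expected = baselines[(w["field"], w["year"], w["type"])]
        # Leave FWCI undefined when the expected rate is zero.
        fwci[w["id"]] = w["citations"] / expected if expected > 0 else None
    return fwci

# Made-up group: 19 typical papers with 2 citations each and one blockbuster.
works = [
    {"id": f"w{i}", "field": "history", "year": 2020, "type": "article", "citations": 2}
    for i in range(19)
]
works.append(
    {"id": "hit", "field": "history", "year": 2020, "type": "article", "citations": 1000}
)

fwci = compute_fwci(works)
naive_baseline = mean(w["citations"] for w in works)
print(fwci["w0"], fwci["hit"], naive_baseline)
```

In this made-up group a plain average gives an expected rate of 51.9, so a typical paper would score about 0.04. With the blockbuster trimmed from the baseline, the expected rate is 2 and the typical paper scores 1.0, while the blockbuster itself still scores far above average.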

Multi-Language Coverage

The index contains works in more than 100 languages. Curation does not penalize incomplete metadata: source review evaluates non-English sources on their content (see Layer 2), and the text requirements in Layer 3 are language-aware. Searching across languages is covered in Cross Language Search.

Staying Current

The index is synchronized with OpenAlex regularly, picking up new works, metadata corrections, and new sources. A paper published in the last few days may not be indexed yet. Source classifications and the Norwegian Scientific Index integration are periodically re-evaluated.

For tasks that require the full scholarly record, such as bibliometric analysis or citation studies across everything published, OpenAlex itself provides open API and bulk access to its complete catalog.
