Information Retrieval: A Health and Biomedical Perspective, Third Edition, by William Hersh, M.D. |
Section |
Topic |
Reference |
---|---|---|
4.3 |
Recent publications describe updates to the National Center
for Biomedical Ontology (NCBO, https://www.bioontology.org/)
and its main project, BioPortal (https://bioportal.bioontology.org/),
a repository of biomedical ontologies
(Whetzel et al., 2011; Whetzel, 2013). One important ontology, the Human Phenotype Ontology, aims to model all human phenotypes (Köhler et al., 2017). |
Whetzel, PL, Noy, NF, et al. (2011). BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research. 39: W541-W545.
Whetzel, PL (2013). NCBO technology: powering semantically aware applications. Journal of Biomedical Semantics. 4(Suppl 1): S8. http://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-S1-S8.
Köhler, S, Vasilevsky, NA, et al. (2017). The Human Phenotype Ontology in 2017. Nucleic Acids Research. 45: D865-D876. |
Gene naming is still a challenge, especially
when gene lists pass through Microsoft Excel, whose automated
conversion of gene symbols into dates and floating-point
numbers has introduced errors into about one-fifth of papers
with supplementary Excel gene lists (Ziemann et al., 2016). |
Ziemann, M, Eren, Y, et al. (2016). Gene name
errors are widespread in the scientific literature. Genome
Biology. 17: 177. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7. |
|
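The kind of damage described by Ziemann et al. can be screened for with a simple pattern check. The sketch below is illustrative only: the regular expressions and the function name `looks_excel_mangled` are assumptions of mine, not the method used in the cited study, and they cover only the common date forms (such as `2-Sep` for SEPT2) and scientific-notation forms (such as `2.31E+13` for 2310009E13).

```python
import re

# Date-like strings that Excel produces from symbols such as SEPT2 or MARCH1,
# e.g. "2-Sep", "1-Mar", or "Sep-02". (Patterns are illustrative assumptions.)
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({_MONTHS})|({_MONTHS})-\d{{1,2}})$", re.IGNORECASE
)
# Scientific notation produced from identifiers such as 2310009E13.
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+?\d+$", re.IGNORECASE)


def looks_excel_mangled(value: str) -> bool:
    """Return True if a cell value resembles an Excel-converted gene name."""
    value = value.strip()
    return bool(DATE_LIKE.match(value) or FLOAT_LIKE.match(value))


if __name__ == "__main__":
    # Hypothetical column of gene names exported from a spreadsheet.
    column = ["TP53", "2-Sep", "1-Mar", "BRCA1", "2.31E+13"]
    print([v for v in column if looks_excel_mangled(v)])
```

A check like this only flags suspects; recovering the original symbol still requires going back to the source data, since a converted date can map to more than one gene.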
4.3 |
The MeSH vocabulary continues to evolve. According to the latest fact sheet, the 2016 version of MeSH had 27,883 descriptors (headings) with 87,883 entry terms. MeSH also contains more than 232,000 Supplementary Concept Records (SCRs) that consist of additional chemicals, diseases, and drug protocols. In 2006, MeSH was expanded from 15 to 16 trees, with the Publication Characteristics (V) tree added to account for the growing number of characteristics of publications (Nahin, 2005). (I missed this when writing the third edition of the book in 2009.) The 16 trees are now:
|
Nahin, AM (2005). PubMed® Notes for 2006. NLM
Technical Bulletin. https://www.nlm.nih.gov/pubs/techbull/nd05/nd05_pm_notes.html. https://www.nlm.nih.gov/pubs/factsheets/mesh.html |
4.3 |
The Gene Ontology (GO) continues
to grow, with over 40,000 concepts in its three
ontologies: molecular function, biological process, and cellular component.
|
http://www.geneontology.org/ |
4.3 |
A questionnaire about the use of
the UMLS by informatics researchers had responses from 70
users (Chen et al., 2007). The two major intended uses were
access to source terminologies (75%) and mapping among
source terminologies (44%). The most common reported uses
were:
|
Chen, Y., Perl, Y., et al. (2007). Analysis of a study of the users, uses, and future agenda of the UMLS. Journal of the American Medical Informatics Association, 14: 221-231. |
4.3 |
The NCI Thesaurus Web site (http://nciterms.nci.nih.gov)
allows downloading, searching, and browsing. There is also
an NCI Metathesaurus that includes about 75 other
cancer-related terminologies. |
|
4.3 |
Another controlled terminology
system is Schema.org (http://schema.org/), an ontology
intended for simple manual indexing of Web content that was
developed by the major search engine companies (Google,
Microsoft, and Yahoo). Schema.org is supported by the
Schema.org Community Group
(https://www.w3.org/community/schemaorg/).
The schemas are designed to be "microdata" that can be
used to index digital content, such as Web pages. They can
even be included in the HTML of the Web page itself. The
schemas consist of a collection of "types," each of which
is associated with a set of "properties." The types are
arranged in a hierarchy. The core vocabulary currently
consists of nearly 600 types, over 800 properties, and
over 100 enumeration values for the properties. The
community has also developed a process for "extensions" to
the basic schemas, which can be "hosted" as part of the
Schema.org project or "external" and be maintained by
outside organizations. One important example of the latter
is MedicalEntity (http://schema.org/MedicalEntity),
which is related to health and the practice of medicine.
As noted on its Web page, this schema "is not intended to
define or codify a new controlled medical vocabulary, but
instead to complement existing vocabularies and
ontologies. As a schema, its focus is on surfacing the
existence of and relationships between entities described
in content; the specific convention(s) used to name and/or
code entities are outside of the scope of this schema. The
schema does provide a way to annotate entities with codes
that refer to existing controlled medical vocabularies
(such as MeSH, SNOMED, ICD, RxNorm, UMLS, etc.) when they
are available." |
|
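To make the quoted point about annotating entities with existing codes concrete, here is a minimal sketch that builds such an annotation. It makes several assumptions that are not in the text above: it uses the JSON-LD syntax rather than inline microdata attributes, the helper `medical_condition` is a hypothetical name, and the example condition and codes are chosen only for illustration. The Schema.org names it relies on (`MedicalCondition`, `code`, `MedicalCode`, `codingSystem`, `codeValue`) are part of the published vocabulary.

```python
import json


def medical_condition(name: str, codes: list[tuple[str, str]]) -> dict:
    """Build a Schema.org MedicalCondition annotated with external codes.

    MedicalCondition is a subtype of MedicalEntity; each MedicalCode points
    to an existing vocabulary instead of defining a new term.
    """
    return {
        "@context": "https://schema.org",
        "@type": "MedicalCondition",
        "name": name,
        "code": [
            {"@type": "MedicalCode", "codingSystem": system, "codeValue": value}
            for system, value in codes
        ],
    }


# Illustrative example: one condition, coded in two vocabularies.
annotation = medical_condition(
    "Type 2 diabetes mellitus",
    [("MeSH", "D003924"), ("ICD-10", "E11")],
)

if __name__ == "__main__":
    # The output could be embedded in a page inside
    # <script type="application/ld+json"> ... </script>.
    print(json.dumps(annotation, indent=2))
```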
4.4 |
Author names continue to be a
challenge for bibliographic and other databases,
especially as others establish linkages and metrics based
on them. The problem is growing with the
increasing number and productivity of Chinese authors, who
tend to have short and simple names (Qiu, 2008). A number
of different systems have been proposed for author
identifiers (Bourne and Fink, 2008; Enserink, 2009;
Fenner, 2011), but the emerging standard is the Open
Researcher and Contributor ID (ORCID, http://orcid.org). Over three million scientific authors have signed up, and many journals now require authors to supply their ORCID when submitting papers for publication. My ORCID is orcid.org/0000-0002-4114-5148, which can be used in a URL that links to a Web page listing publications and other information (http://orcid.org/0000-0002-4114-5148). Of course, the computability and probably the reproducibility of science would be enhanced by unique identifiers for all resources that are used and described in research (Vasilevsky et al., 2013). |
Qiu, J. (2008). Scientific publishing: identity crisis. Nature, 451: 766-767.
Bourne, P. and Fink, J. (2008). I am not a scientist, I am a number. PLoS Computational Biology, 4(12): e1000247. http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000247.
Enserink, M. (2009). Scientific publishing. Are you ready to become a number? Science, 323: 1662-1664.
Fenner, M (2011). Author identifier overview. Libreas Library Ideas. 18. http://libreas.eu/ausgabe18/texte/03fenner.htm.
Vasilevsky, NA, Brush, MH, et al. (2013). On the reproducibility of science: unique identification of research resources in the biomedical literature. PeerJ. 1: e148. https://peerj.com/articles/148/. |
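One property of ORCID identifiers not mentioned above is that the final character is a check character (ISO 7064 MOD 11-2), which lets software catch most typing errors before a lookup is attempted. The following is a minimal sketch of that check; the function names are my own, and it validates only the format and checksum, not whether the identifier is actually registered.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Validate an identifier written as 0000-0000-0000-0000 (last char may be X)."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


def orcid_url(orcid: str) -> str:
    """Form the URL of the public record page for an identifier."""
    return f"https://orcid.org/{orcid}"


if __name__ == "__main__":
    my_orcid = "0000-0002-4114-5148"
    print(is_valid_orcid(my_orcid), orcid_url(my_orcid))
```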
4.4 |
While the 15-element Dublin Core Metadata set
continues to have wide influence and use, the Dublin Core
Metadata Initiative (DCMI, http://dublincore.org/)
has expanded its efforts into developing application profiles. A
Dublin Core Application Profile (DCAP) specifies and
describes the metadata used in a particular application. To
accomplish this, a profile (quoted from http://dublincore.org/documents/profile-guidelines/):
|
|
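As a rough illustration of what an application profile constrains, the sketch below checks a metadata record against the 15 Dublin Core elements plus a set of required elements. This is a toy under stated assumptions: a real DCAP is a much richer document than a list of required fields, and the function name, the required set, and the sample record are all hypothetical.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def check_against_profile(record: dict, required: set) -> list:
    """Return a list of problems; an empty list means the record fits."""
    problems = [
        f"unknown element: {key}" for key in sorted(record) if key not in DC_ELEMENTS
    ]
    problems += [
        f"missing required element: {key}" for key in sorted(required - set(record))
    ]
    return problems


if __name__ == "__main__":
    # Hypothetical profile that requires three elements.
    required = {"title", "creator", "date"}
    record = {
        "title": "Information Retrieval: A Health and Biomedical Perspective",
        "creator": "William Hersh",
        "language": "en",
    }
    print(check_against_profile(record, required))
```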
4.4 |
An overview of collaborative filtering and related recommender systems is provided by Terveen and Hill (2001), and a survey of more recent collaborative filtering techniques by Su and Khoshgoftaar (2009). These approaches underlie what are now called recommender systems, which are in common use on many commercial Web sites, such as Amazon and Netflix. Caplan and Rosenthal (2013) have proposed collaborative filtering approaches for identifying similar clinical cases ("clinical doppelgängers"). Likewise, Wiesner and Pfeifer (2014) note that the growing amount of clinical data captured in electronic health records (EHRs) and other sources could lead to recommender systems for patients. | Terveen, L and Hill, W (2001).
Beyond Recommender Systems: Helping People Help Each
Other. Human-Computer Interaction in the New
Millennium. J. Carroll. Reading, MA, Addison-Wesley. Su, X and Khoshgoftaar, TM (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence. 2009: 421425. http://www.hindawi.com/journals/aai/2009/421425/. Caplan, E and Rosenthal, N (2013). Collaborative Filtering: An Interim Approach To Identifying Clinical Doppelgängers. Health Affairs Blog. http://healthaffairs.org/blog/2013/06/17/collaborative-filtering-an-interim-approach-to-identifying-clinical-doppelgangers/. Wiesner, M and Pfeifer, D (2014). Health recommender systems: concepts, requirements, technical basics and challenges. International Journal of Environmental Research and Public Health. 11: 2580-2607. |
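The core idea of collaborative filtering can be shown in a few lines. The sketch below is one simple user-based variant (cosine similarity between users, then similarity-weighted averaging of their ratings); it is not taken from any of the cited papers, and the names and the tiny ratings table are made up for illustration.

```python
from math import sqrt


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two users' rating vectors (0.0 if no overlap)."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings: dict, user: str, top_n: int = 3) -> list:
    """Rank items the user has not rated by similarity-weighted mean rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = [(item, scores[item] / weights[item]) for item in scores]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up ratings on a 1-5 scale.
    ratings = {
        "ann": {"A": 5, "B": 4},
        "bob": {"A": 5, "B": 4, "C": 5},
        "cat": {"A": 1, "C": 2, "D": 5},
    }
    print(recommend(ratings, "ann"))
```

Production recommender systems add much more (rating normalization, neighborhood selection, matrix factorization), but the weighting of other users' preferences by their similarity to the target user is the common thread.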
4.4 |
Internet advertising is here to
stay (Yuan et al., 2012), so the selection of words and
terms to promote content based on willingness to pay needs
to be considered a form of manual indexing. It is also
becoming prominent in social media, such as
Facebook and Twitter. |
Yuan, S, Abidin, AZ, et al. (2012). Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users, arXiv 2012. http://arxiv.org/pdf/1206.1754v2.pdf. |
4.4 |
The Open Directory Project is now defunct,
but an online forum persists for those who were involved (https://www.resource-zone.com/). |
|
4.4 |
The Health Education Assets Library (HEAL)
URL has changed to http://library.med.utah.edu/heal/. |
|
4.5 |
A New York Times article described
Google's "schooling" of its search algorithms to keep
up with those trying to game them (Lohr, 2011). The
challenges of large-scale Web indexing have given rise to
new approaches to handling petabyte quantities of data per
day. This led Google to develop MapReduce, which is
designed to process such data in parallel, even when portions
are not immediately available (Dean and Ghemawat, 2008;
Lin and Dyer, 2010). An open-source implementation of this
approach is Hadoop (http://hadoop.apache.org/). |
Lohr, S. (2011). Google
Schools Its Algorithm. New
York Times. March 5, 2011. http://www.nytimes.com/2011/03/06/weekinreview/06lohr.html. Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1): 107-113. Lin, J. and Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. San Rafael, CA. Morgan & Claypool Publishers. |
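The MapReduce programming model is easiest to see on an indexing task. The sketch below builds a small inverted index using the three conceptual phases (map, shuffle, reduce) in a single Python process. It illustrates the model only, not Google's or Hadoop's implementation: there is no distribution across machines and no fault tolerance, which are what the real frameworks provide. All function names and the sample documents are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id: str, text: str):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs) -> dict:
    """Shuffle: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term: str, doc_ids: list) -> tuple:
    """Reduce: turn the grouped values into a sorted postings list."""
    return term, sorted(doc_ids)


def build_index(docs: dict) -> dict:
    """Run map, shuffle, and reduce in sequence over a document collection."""
    emitted = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    docs = {"d1": "aspirin reduces fever", "d2": "aspirin and fever"}
    for term, postings in sorted(build_index(docs).items()):
        print(term, postings)
```

Because each map call depends only on its own document and each reduce call only on its own key, both phases can be spread across many machines, which is the property that makes the model suit Web-scale indexing.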
4.5 |
Another challenge to Web indexing is that we are in the era of "adversarial" IR, where we may want to explicitly not retrieve some content (Castillo and Davison, 2011). | Castillo, C and Davison, BD (2011). Adversarial Web Search. Delft, Netherlands, now Publishers. |
4.5 |
While not indexing per se, an interesting approach to summarizing the content of one or more documents is Wordle (http://www.wordle.net/). An app for the SciVerse system has been created that generates a Wordle from one's scientific publications in its comprehensive bibliographic database. The Wordle of my publications is not terribly surprising. | |
4.6 |
A more recent approach to learning object
metadata (and much simpler than others) is the Learning
Resource Metadata Initiative (LRMI), which extends Schema.org
through a collection of properties that describe educational
resources. LRMI is now maintained by DCMI (http://lrmi.dublincore.net/).
LRMI predominantly uses the properties of resources of type
schema.org/CreativeWork, which were proposed to Schema.org
by LRMI to describe the educational characteristics of
learning resources. For some of the properties, it uses the
LRMI-created types schema.org/AlignmentObject and
schema.org/EducationalAudience. Version 1.1 of the LRMI
specification has been accepted into Schema.org. MedBiquitous still provides indexing of health professional education learning objects but is also involved in many other standards, such as tracking of continuing medical education (CME) and management of learning competencies. |
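The LRMI properties described above can be seen in a short example. The sketch below describes a learning resource as a `CreativeWork` with an `EducationalAudience` and an `AlignmentObject`, serialized as JSON-LD. The resource, the framework name, and the alignment target are hypothetical and chosen only for illustration; the property names are those published in Schema.org.

```python
import json

learning_resource = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": "Introduction to MeSH indexing",  # hypothetical resource
    "learningResourceType": "tutorial",
    "timeRequired": "PT30M",  # ISO 8601 duration: 30 minutes
    "audience": {
        "@type": "EducationalAudience",
        "educationalRole": "student",
    },
    "educationalAlignment": {
        "@type": "AlignmentObject",
        "alignmentType": "teaches",
        # Hypothetical framework and target, for illustration only.
        "educationalFramework": "Example Informatics Competencies",
        "targetName": "Assign controlled vocabulary terms to documents",
    },
}

if __name__ == "__main__":
    print(json.dumps(learning_resource, indent=2))
```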