OCMiner® is OntoChem’s high-performance text analysis and data mining tool box. It is designed to meet the specific needs of our clients instead of providing a one-size-fits-all solution. High quality and performance are achieved by straightforward implementation of tailor-made products for information retrieval and display, covering medium to very large data sources and document collections.
OCMiner® is used by small and large life science companies to automatically index, analyze and search internal or external data collections, extracting product related knowledge and supporting the development of novel products by transitive knowledge discovery.
- Fast and scalable processing of large content sources like file collections or databases
- Office documents and many other file formats with extended support for XML and PDF documents
- Document structure, sentence and language recognition
- Annotation of named entities
- Small or very large controlled vocabularies, taxonomies, multi-faceted ontologies, meta-ontologies in any format – e.g. OBO, OWL, SKOS, CSV, …
- Specialized, unique ontologies for domains such as chemistry, proteins or genes
- Resolution of abbreviations, acronyms, homonyms and anaphora
- Intelligent treatment of word forms and special characters
- Relationship extraction using syntax rule based shallow or deep parsing
- Annotated content, search results or extracted knowledge as files or databases
- Browser based search and display interfaces
- Data analysis and graphical representation of complex relationships
- API - local or web-based
OCMiner® is a modular processing pipeline for unstructured information based on the Apache UIMA framework. Custom data mining is implemented by integrating any number of different tool box modules into a pipeline that produces the desired output. The tunable modules to select from comprise a broad range of readers, analysis engines and consumers, which may run in parallel on multiprocessor machines or even distributed across several computers.
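The reader, analysis engine and consumer roles can be illustrated with a minimal sketch in plain Python. It does not use the Apache UIMA API and is not OCMiner code; every name in it (`Document`, `string_reader`, `make_dictionary_engine`, `run_pipeline`) is hypothetical and only shows how such modules could be chained.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

# (annotation type, start offset, end offset, covered text)
Annotation = Tuple[str, int, int, str]


@dataclass
class Document:
    """Standardized container handed from module to module (hypothetical)."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def string_reader(texts: Iterable[str]) -> Iterable[Document]:
    """Reader: turns raw input into standardized documents."""
    for text in texts:
        yield Document(text=text)


def make_dictionary_engine(terms: Dict[str, str]) -> Callable[[Document], Document]:
    """Analysis engine: tags dictionary terms. Keys of `terms` must be lowercase."""
    # Longest terms first so that longer names win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    def engine(doc: Document) -> Document:
        for match in pattern.finditer(doc.text):
            label = terms[match.group(0).lower()]
            doc.annotations.append((label, match.start(), match.end(), match.group(0)))
        return doc

    return engine


def run_pipeline(reader, engines, consumer) -> None:
    """Push every document through all engines, then hand it to the consumer."""
    for doc in reader:
        for engine in engines:
            doc = engine(doc)
        consumer(doc)


# Consumer: here it simply collects all annotations into a list.
rows: List[Annotation] = []
run_pipeline(
    string_reader(["Aspirin inhibits COX-1."]),
    [make_dictionary_engine({"aspirin": "compound", "cox-1": "protein"})],
    lambda doc: rows.extend(doc.annotations),
)
print(rows)
```

In this sketch the modules run sequentially in one process; parallel or distributed execution, as described above, would need a scheduler around `run_pipeline`.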
Readers read data from a variety of sources and standardize the input for further analysis:
- Document readers for office documents and many other file formats
- Extended support for XML and PDF documents
- Database readers allow direct access to relational databases, ontologies or document management systems (DMS)
Analysis engines work on the standardized information and add further data:
- Recognition of document structure such as headlines, paragraphs, sentences, as well as specific document section types, for example title, abstract, authors, keywords, abbreviation lists and references section.
- Dictionary-based named entity (NE) recognition is a high-performance dictionary look-up technology with support for very large dictionaries (> 100 million entries). It implements specific language- and dictionary-dependent treatment options such as:
  - Adaptable to recognize spelling variations
  - Spaces/hyphens (e.g. “HIV-1” or “HIV1” or “HIV-I”)
  - Umlaut or other diacritic character handling (e.g. “Sjögrens disease” → “Sjoegrens disease”)
  - British / American English (e.g. “behaviour” → “behavior”)
  - Greek letters (e.g. “α-amino acid” → “alpha-amino acid”)
  - Plural forms
  - Apostrophe s (e.g. “Sjoegrens disease” or “Sjoegren’s disease” or “Sjoegren disease”)
  - Conditional black- and white-lists
- Homonym resolution is provided by context-sensitive ontological similarity. For example, whether “monitor” refers to a computer screen or a lizard species (e.g. the Savanna monitor, Varanus exanthematicus) is decided from the related NEs used in the nearby context. This is especially useful for very short NEs, as is often the case for protein or chemistry names.
- Case-sensitive handling of homonyms, for example distinguishing “aids” from “AIDS”
- Resolving document-specific abbreviations and acronyms, for example
  - Abbreviations: kb → kilobase(s)
  - Acronyms: TAT → Tyrosine aminotransferase
- Expansion of shortened word list forms like
  - “vitamin A, B and C” → vitamin A + vitamin B + vitamin C
  - “white and gray matter” → white matter + gray matter
- Specific ontologies and tools are available to annotate chemistry in text documents:
  - Validated chemistry dictionaries with chemical structures
  - High-performance recognition of chemistry by name-to-structure conversion; identified compounds are stored in a chemistry database
  - For each recognized compound, a connection table and the respective InChI are generated and looked up to check for novelty
  - Annotation of documents with our compound classes using our chemistry ontology, generated by OntoChem’s chemistry ontology editor SODIAC
- Anaphora resolution recognizes underdetermined NE and searches for their more precise meaning throughout the complete document
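Some of the spelling-variation options listed for dictionary-based NE recognition can be approximated by mapping both dictionary entries and text mentions to a shared normalized key before look-up. The following is a minimal sketch under that assumption, not OCMiner's actual algorithm: the tiny mapping tables, the `normalize` and `lookup` helpers and the concept identifiers are all hypothetical, and only hyphen/space, diacritic, British/American, Greek letter and apostrophe-s handling are covered.

```python
import re
import unicodedata
from typing import Optional

# Tiny illustrative tables; a real system would need far larger ones.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
BRITISH_TO_AMERICAN = {"behaviour": "behavior", "tumour": "tumor"}


def normalize(term: str) -> str:
    """Map spelling variants of a term to one look-up key (crude sketch)."""
    s = term.lower()
    # Spell out Greek letters and transliterate German umlauts.
    for src, dst in {**GREEK, **UMLAUTS}.items():
        s = s.replace(src, dst)
    # Strip any remaining diacritics (e.g. "é" -> "e").
    s = "".join(
        c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
    )
    # Drop a possessive "'s" (straight or typographic apostrophe).
    s = re.sub(r"[’']s\b", "", s)
    # Word-level British -> American replacement.
    s = " ".join(BRITISH_TO_AMERICAN.get(w, w) for w in s.split())
    # Ignore spaces and hyphens entirely ("HIV-1" == "HIV 1" == "HIV1").
    return re.sub(r"[\s\-]+", "", s)


# Hypothetical dictionary: normalized key -> made-up concept identifier.
DICTIONARY = {
    normalize(name): concept_id
    for name, concept_id in [
        ("alpha-amino acid", "CHEM:0001"),
        ("Sjögren's disease", "DIS:0001"),
        ("HIV-1", "ORG:0001"),
    ]
}


def lookup(mention: str) -> Optional[str]:
    """Return the concept identifier for a mention, or None if unknown."""
    return DICTIONARY.get(normalize(mention))


print(lookup("α-amino acid"), lookup("Sjoegren disease"), lookup("HIV1"))
```

Removing all spaces and hyphens is deliberately aggressive and would over-merge some terms; that is where options such as conditional black- and white-lists become relevant.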
Consumers may work independently and in parallel, utilizing the data provided by the analysis engines. They provide the final output to the search and display applications of our clients.
- Text tagging and annotations, e.g. for annotating scientific publications for printing houses or extracting compounds from patents into custom databases
- Web-based search engines, for example together with our PDF-to-HTML converter
- Thematic searches or document ranking based on ontology terms, delivering instant knowledge-based results (from pre-calculated relations) even for very large data collections
- Document similarity based on concepts rather than words allows finding more relevant, related documents. For example, we may search for documents that deal with similar compounds used to treat related diseases.
- Relationship extraction ranges from simple co-occurrence detection up to sophisticated semantic relationship analysis based on specific syntax knowledge. This technology is based on OntoChem’s unique relationship ontology and syntax analysis software.
- Knowledge mining, by analyzing extracted data further – for example answering complex questions such as “What is the distribution of different compound classes that are found in different plant families?” Implicit or transitive knowledge that is distributed over heterogeneous data sets can be searched, enabling discovery of knowledge that is not mentioned explicitly in a document. This feature allows for generation of new intellectual property.
- Structure searching: We are the leader in integrating text and powerful chemistry searching, providing a unique feature not present in other search engines. We have implemented chemical identity and stereoisomer searches in a straightforward way. The whole range of chemical searches, such as substructure and similarity searching, is available via the integration of ChemAxon’s JChem libraries if needed.
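The plant family question from the knowledge mining item can be read as a join over facts that no single document states together: which compound occurs in which plant, which class a compound belongs to, and which family a plant belongs to. The sketch below illustrates that idea with a handful of hand-entered facts. It is an assumption about how such a query could be expressed, not OCMiner's implementation, and the function name `class_distribution_by_family` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hand-entered toy facts standing in for relations extracted from different sources.
COMPOUND_IN_PLANT = [
    ("caffeine", "Coffea arabica"),
    ("nicotine", "Nicotiana tabacum"),
    ("solanine", "Solanum tuberosum"),
    ("menthol", "Mentha piperita"),
]
COMPOUND_CLASS = {
    "caffeine": "alkaloid",
    "nicotine": "alkaloid",
    "solanine": "alkaloid",
    "menthol": "terpenoid",
}
PLANT_FAMILY = {
    "Coffea arabica": "Rubiaceae",
    "Nicotiana tabacum": "Solanaceae",
    "Solanum tuberosum": "Solanaceae",
    "Mentha piperita": "Lamiaceae",
}


def class_distribution_by_family(
    occurrences: Iterable[Tuple[str, str]],
    classes: Dict[str, str],
    families: Dict[str, str],
) -> Dict[str, Dict[str, int]]:
    """Count compound classes per plant family by chaining three relations."""
    distribution = defaultdict(Counter)
    for compound, plant in occurrences:
        family = families.get(plant)
        compound_class = classes.get(compound)
        if family is None or compound_class is None:
            continue  # the transitive chain is incomplete, so skip this fact
        distribution[family][compound_class] += 1
    return {family: dict(counts) for family, counts in distribution.items()}


result = class_distribution_by_family(COMPOUND_IN_PLANT, COMPOUND_CLASS, PLANT_FAMILY)
print(result)
```

None of the input facts mentions a compound class together with a plant family; that link only appears once the three relations are combined.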
With OCMiner® we can achieve better precision and recall than with competing technology – we would be glad to demonstrate this to you! For example, with protein names and using our protein ontology we guarantee precision rates > 95% and recall rates > 85%.
However, the meaning of some sentences cannot be resolved even by human readers. We have therefore introduced confidence-value-based annotations: each annotation is assigned a confidence value by a proprietary algorithm. Custom applications may use this value to extract only highly certain facts, or a broader range of facts that also includes those with low confidence.
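How a custom application might use such confidence values can be sketched as a simple threshold filter. The scoring algorithm itself is proprietary and not shown here; the `Annotation` class, the assumed value range of 0 to 1 and the example scores are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """One annotated mention with its confidence (hypothetical structure)."""
    text: str
    concept: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def filter_by_confidence(annotations: List[Annotation], threshold: float) -> List[Annotation]:
    """Keep only annotations whose confidence reaches the threshold."""
    return [a for a in annotations if a.confidence >= threshold]


# Made-up example scores, not output of any real system.
annotations = [
    Annotation("TAT", "Tyrosine aminotransferase", 0.97),
    Annotation("monitor", "Varanus exanthematicus", 0.42),
    Annotation("AIDS", "acquired immunodeficiency syndrome", 0.88),
]

certain_facts = filter_by_confidence(annotations, 0.85)  # high precision
broad_facts = filter_by_confidence(annotations, 0.30)    # higher recall

print([a.text for a in certain_facts])
print([a.text for a in broad_facts])
```

A strict threshold favors precision, while a low threshold favors recall; the right setting depends on the application.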
OCMiner technology in practice
Give it a try!