Business Insights Workbench (BIW)

Innovation Matters


Business Insights Workbench (BIW) applies a set of text analytics and mining techniques developed by IBM Research to extract business value from large volumes of structured and unstructured information. To date, the technology has demonstrated its value in helpdesk and call center problem ticket analysis, customer relationship management, intellectual property and patent portfolio analysis, healthcare and life sciences, intelligence applications such as market intelligence, and the analysis of business communications such as blogs, discussion forums, and online chats. Such analytics not only uncover hidden business issues but also provide guidance for potential future growth.

Leveraging the ever-growing volume of internal and external information available to enterprises for business advantage has become increasingly difficult. Traditional data mining technologies that work well for structured data fall short when coping with large amounts of unstructured text. Further, conventional text mining technologies that ignore human expertise in knowledge generation often fail to provide meaningful insights into data. To address these challenges, BIW combines technologies from both the Business Intelligence and Knowledge Management communities to allow deep analysis of many different types of data.

In particular, BIW builds scalable data and document warehouses by identifying and extracting data from a wide range of data sources and storing it in dimensional tables. It can also leverage technology such as IBM WebSphere® Information Integrator OmniFind for full-text search and web content extraction. This enables BIW to work with both structured and unstructured data. To date, the BIW data warehouse has collected a significant fraction of the U.S. Patent and Trademark Office database, which contains the full text and drawings of more than 7.9 million U.S. patents dating from 1976 onward. It also holds nearly 15 million citations and abstracts in the life sciences and biomedical fields, extracted from Medline, an indexing service for research in medicine and related fields provided by the U.S. National Library of Medicine.
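
As a rough illustration of what "dimensional tables" can look like, the following minimal sketch builds a tiny star schema in an in-memory SQLite database. The table names, columns, and sample rows are hypothetical and are not taken from BIW; the point is only that documents sit in a fact table whose keys point at small dimension tables, so structured attributes and text can be queried together.

```python
import sqlite3

# Hypothetical star schema: one fact table of patents, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_assignee (assignee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_patent (
    patent_id TEXT PRIMARY KEY,
    assignee_id INTEGER REFERENCES dim_assignee(assignee_id),
    year_id INTEGER REFERENCES dim_year(year_id),
    abstract TEXT
);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO dim_assignee VALUES (?, ?)",
                 [(1, "Company A"), (2, "Company B")])
conn.executemany("INSERT INTO dim_year VALUES (?, ?)", [(1, 2001), (2, 2002)])
conn.executemany("INSERT INTO fact_patent VALUES (?, ?, ?, ?)", [
    ("P1", 1, 1, "magnetic field sensor"),
    ("P2", 1, 2, "image sensor array"),
    ("P3", 2, 2, "smoke detector"),
])

# Count patents along two dimensions by joining the fact table to them.
rows = conn.execute("""
    SELECT a.name, y.year, COUNT(*)
    FROM fact_patent f
    JOIN dim_assignee a ON a.assignee_id = f.assignee_id
    JOIN dim_year y ON y.year_id = f.year_id
    GROUP BY a.name, y.year
    ORDER BY a.name, y.year
""").fetchall()
print(rows)
```

Keeping the unstructured text (here, `abstract`) alongside keys into the dimension tables is one way to let the same warehouse serve both text mining and conventional roll-up queries.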

To search and analyze structured and unstructured information in such data warehouses, BIW combines several capabilities into a single workbench: “explore”, “understand”, and “analyze”, as shown in Figure 1. The “explore” capabilities allow users to search and retrieve documents from multiple data sources using a simple search interface. Figure 1 shows three data sources: WebFountain, which contains data extracted from Web sites; Patent DB, which contains all of the U.S. patents; and Medline. The “understand” capabilities enable users to create natural classifications (taxonomies) of the data objects using several clustering algorithms, including K-Means, EM, and query-based clustering. The “analyze” capabilities allow users to discover hidden trends and patterns in the data sets using multi-dimensional affinity analysis, automatically identify non-obvious relationships with scatter charts, automatically detect “knowledge gaps” in a set of text and suggest areas for further analysis, and prioritize solution suggestions according to return on investment and strategic value.


Figure 1. BIW key capabilities, i.e., explore, understand, and analyze.
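
The article does not spell out how the affinity analysis in the “analyze” step is computed. One common way to surface such patterns, shown here purely as an assumption, is to compare how often a taxonomy category co-occurs with a value of some other dimension against the count expected if the two were independent. The function name `affinity`, the company names, and the sample records below are all hypothetical.

```python
from collections import Counter

def affinity(records):
    """Return the observed/expected co-occurrence ratio for every
    (category, dimension value) pair. A ratio above 1 means the pair
    occurs more often than independence would predict."""
    n = len(records)
    cat_counts = Counter(cat for cat, _ in records)
    dim_counts = Counter(dim for _, dim in records)
    pair_counts = Counter(records)
    return {
        (cat, dim): pair_counts[(cat, dim)] / (cat_counts[cat] * dim_counts[dim] / n)
        for cat in cat_counts
        for dim in dim_counts
    }

# Made-up (category, assignee) pairs, one per patent.
records = [
    ("image", "Company A"), ("image", "Company A"),
    ("image", "Company A"), ("image", "Company B"),
    ("magnetic", "Company B"), ("magnetic", "Company B"),
    ("magnetic", "Company B"), ("magnetic", "Company A"),
]

scores = affinity(records)
for pair, ratio in sorted(scores.items()):
    print(pair, round(ratio, 2))
```

In this toy data, “image” patents are over-represented for Company A and under-represented for Company B, which is the kind of non-obvious relationship an analyst might then investigate.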

In contrast to fully automated text analytics techniques, which tend to generate either oversimplified or hard-to-understand observations, BIW keeps human knowledge in the analysis process by allowing expert intervention at every stage. Such interactive methods empower knowledge users. For instance, users can choose from multiple clustering algorithms and add, delete, split, or merge categories to suit the situation. Furthermore, users can drill down to minute levels of detail, analyze specific parts of a document, and visualize data categories over time. All intermediate results can be saved, so an analysis can be resumed later without starting from the beginning.

We use a patent portfolio analysis as an example to illustrate how the BIW analytics process works. Suppose that one would like to analyze patents related to “sensors” or “detectors”. One can first extract the related patents by simply typing in a search query, as shown in Figure 2. The resulting patent data set can be stored in a new object called “detector sensor”, which can then be used for subsequent processing.




Figure 2. Search query for “Sensors” and “Detectors” and the resulting data set.
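
The search step described above can be sketched in a few lines. This is a minimal stand-in that assumes a simple OR query over word prefixes; it is not BIW's search engine, and the corpus, the `search` function, and the `data_sets` dictionary are hypothetical names chosen for illustration.

```python
import re

# Hypothetical toy corpus; the real system queries a patent warehouse.
PATENTS = [
    {"id": "P1", "title": "Magnetic field sensor",
     "abstract": "A sensor that measures magnetic flux."},
    {"id": "P2", "title": "Ink cartridge",
     "abstract": "A cartridge for supplying ink to a printer."},
    {"id": "P3", "title": "Smoke detector housing",
     "abstract": "A housing for smoke detectors."},
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_any(doc, stems):
    """Return True if any token in the title or abstract starts with a stem."""
    tokens = tokenize(doc["title"] + " " + doc["abstract"])
    return any(tok.startswith(stem) for tok in tokens for stem in stems)

def search(corpus, stems):
    """Return the documents that match at least one stem (an OR query)."""
    return [doc for doc in corpus if matches_any(doc, stems)]

# Store the result set under a name so that later steps can reuse it.
data_sets = {}
data_sets["detector sensor"] = search(PATENTS, ["sensor", "detector"])
print([doc["id"] for doc in data_sets["detector sensor"]])
```

Saving the result under a name mirrors the “detector sensor” object in the example: the taxonomy and trend steps operate on that stored set rather than rerunning the query.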


Once the data set is determined, the next step is to understand what it contains. This is usually accomplished via the “understand” step, which automatically builds a taxonomy for the patent data set. Figure 3 shows the taxonomy built by BIW for the sensor and detector patents. The initial taxonomy may not be perfect; it is only a rough classification of the patents. For instance, the “sheet, printing, adjacent” category may be too specific, while others such as “magnetic” and “voltage” are likely to be helpful, since they name particular kinds of sensors and detectors and so help users understand what exists in this technology area. Nevertheless, such an initial classification is a good starting point.



Figure 3. Initial taxonomy generated for the “sensor detector” patent data set.
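
To make the taxonomy-building step more concrete, here is a minimal K-Means sketch over bag-of-words vectors. It assumes cosine similarity on unit-length term-count vectors and fixed seed documents so that the result is reproducible; the article names K-Means as one of BIW's algorithms but does not describe its weighting, similarity measure, or initialization, so all of those choices, along with the sample documents, are assumptions.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Bag-of-words term counts, normalized to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both have unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    """Sum of sparse vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its most similar centroid, then recompute centroids."""
    centroids = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    return assignment, centroids

def top_terms(vec, n=3):
    """The n highest-weighted terms, usable as a rough category label."""
    return [t for t, _ in sorted(vec.items(), key=lambda kv: -kv[1])[:n]]

# Made-up patent snippets, for illustration only.
DOCS = [
    "magnetic field sensor coil",
    "magnetic sensor flux coil",
    "voltage detector circuit",
    "voltage sensing circuit threshold",
]
vectors = [tf_vector(d) for d in DOCS]
assignment, centroids = kmeans(vectors, seeds=[0, 2])
print(assignment)
for c in centroids:
    print(", ".join(top_terms(c)))
```

Labeling each cluster with its heaviest terms is one plausible way to arrive at comma-separated category names like those in Figure 3, though the article does not say how BIW derives its labels.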

As indicated earlier, the power of BIW comes from its ability to incorporate human expertise. Hence, once the initial taxonomy is generated, users typically refine it. BIW provides helpful statistics and information for performing such refinements. For example, the “cohesion” column shown in Figure 3 indicates how tightly the individual documents in a given cluster are grouped together. Low cohesion may imply that a category should be divided. The “distinctness” column shown in Figure 3 indicates how distinct a cluster is with respect to other clusters. Clusters that have low distinctness might need to be merged with others. To help users make such merge and split decisions, BIW also enables one to dive deeper into individual categories via a “class view”, as shown in Figure 4. In this example, the class “sheet, printing, adjacent” is chosen for examination. The class view shows typical examples of patents in this category on the right-hand side, where “typical” is determined statistically. Such examples give users a quick sense of what kinds of documents are in the class. On the lower left-hand side, bar graphs show the most common terms in the class. This information indicates how BIW generated the category. The “class components” view allows users to split a class into subclasses based on keywords or other rules.


Figure 4. The class view on the “sheet, printing, adjacent” category.
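
The cohesion and distinctness statistics discussed above can be approximated as follows. The article does not give formulas, so this sketch assumes one plausible reading: cohesion is the average cosine similarity between each document and its cluster centroid, and distinctness is one minus the cosine similarity to the nearest other centroid. The term-count vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(members):
    """Average cosine similarity between each member and the cluster centroid."""
    c = centroid(members)
    return sum(cosine(v, c) for v in members) / len(members)

def distinctness(name, clusters):
    """One minus the highest cosine similarity to any other cluster's centroid."""
    own = centroid(clusters[name])
    others = [centroid(vs) for other, vs in clusters.items() if other != name]
    return 1.0 - max(cosine(own, c) for c in others)

# Hypothetical term counts over the vocabulary (magnetic, voltage, sheet, printing).
clusters = {
    "magnetic": [[3, 0, 0, 0], [2, 1, 0, 0]],
    "voltage": [[0, 3, 0, 0], [1, 2, 0, 0]],
    "sheet, printing": [[0, 0, 3, 0], [0, 0, 0, 3]],
}
for name in clusters:
    print(name,
          round(cohesion(clusters[name]), 2),
          round(distinctness(name, clusters), 2))
```

In this toy data the “sheet, printing” cluster has the lowest cohesion, because its two documents share no terms, which would make it a candidate for splitting. The “magnetic” and “voltage” clusters overlap in vocabulary and so score lower on distinctness, which would make them candidates for a merge.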

Once a desired taxonomy is created, a number of analyses can be carried out using BIW. Figure 5 shows a trend analysis on a set of patents. For each category, BIW generates a graph showing how the number of patents in that category changes over time. For instance, the “image” category is trending upward, which indicates a more recent focus on image sensors. On the other hand, “fluid & flow” is trending downward, which indicates some potential loss of interest in this area.




Figure 5. Trend analysis of patents.
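
A trend of this kind can be summarized numerically by counting patents per category per year and fitting a straight line to the counts. The sketch below uses an ordinary least-squares slope as the trend indicator; that choice, and the sample counts, are assumptions made for illustration rather than a description of how BIW draws its trend graphs.

```python
from collections import Counter

def yearly_counts(records, category):
    """Count the patents in each year for one category."""
    return Counter(year for cat, year in records if cat == category)

def slope(counts, years):
    """Least-squares slope of count against year (patents per year)."""
    xs = list(years)
    ys = [counts.get(y, 0) for y in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Made-up (category, year) pairs, one per patent.
records = (
    [("image", 2000)] * 1 + [("image", 2001)] * 2 + [("image", 2002)] * 4
    + [("fluid & flow", 2000)] * 4 + [("fluid & flow", 2001)] * 3
    + [("fluid & flow", 2002)] * 1
)
years = range(2000, 2003)

for category in ("image", "fluid & flow"):
    s = slope(yearly_counts(records, category), years)
    direction = "up" if s > 0 else "down" if s < 0 else "flat"
    print(category, direction)
```

Years with no patents are counted as zero rather than skipped, so a category that goes quiet still pulls its slope downward.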

In summary, BIW combines data warehousing technologies and a wide range of text mining capabilities into a single platform. It empowers users through its interactivity and flexibility.

Related Publications  

Cody W. F., Kreulen J. T., Krishna V. and Spangler W. S. The integration of business intelligence and knowledge management. IBM Systems Journal, 2002.

Spangler S., Kreulen J. and Lessler J. Generating and Browsing Multiple Taxonomies over a Document Collection. Journal of Management Information Systems Vol. 19:4:191-212, 2003.

Spangler S. and Kreulen J. Interactive Methods for Taxonomy Editing and Validation. ACM - CIKM. 2002.



Innovator's corner  

Scott Spangler, Researcher

What is the most exciting potential future use for the work you're doing?
The ability to assist strategic business planning based on combined analysis of Intellectual Property information (patents), web data (such as blogs) and internal customer support information (helpdesk logs).

What is the most interesting part of your research?
Finding new ways to visualize, understand, and communicate the insights buried in large amounts of unstructured data.

What inspired you to go into this field?
It combines many different threads in my education background and career: mathematics, computer science, knowledge engineering, machine learning, statistical analysis, data mining, data visualization, and text analysis. The variety and scope of problems are endless and I always learn something new with each application.

What is your favorite invention of all time?
Space travel

Research team  

Ying Chen

Jeffrey Kreulen

Ana Lelescu

Larry Proctor

James Rhodes

Scott Spangler

Related Research  

Disciplines: Computer Science
Research Areas: Services Computing
Research Labs: Almaden Research Center