Library Plan for a corporate digital library

Executive Summary

IT Plan

“In just 5 years, the US’ share of the digital universe will be bigger than the entire digital universe of 2012… technology tools will be necessary but not sufficient for the taming of the US’ digital universe. It will take new management practices, user education, and savvy policies. This is where technologists must rely on support from business units, government, and consumers, and is likely an area with bigger challenges than the technological realm”


(Gantz & Reinsel, 2013) ​1​

Social Media Plan

The use of big data will become a key basis of competition and growth for individual firms. From the standpoint of competitiveness and the potential capture of value, all companies need to take big data seriously. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information. Indeed, we found early examples of such use of data in every sector we examined (Manyika et al., 2011) ​2​.

Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also structure workflows and incentives to optimize the use of big data. Access to data is critical—companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this (Manyika et al., 2011)​2​.

Overview

In addition to my position as a Competitive Intelligence manager for a high-tech firm in Silicon Valley, I also serve as the primary curator for a digital library used to conduct data analyses and investigations. The library curates data for research and allows researchers to fuse datasets in order to address various business questions, such as “What is the impact of serviceability and customer support issues on follow-on sales?”, “What is the value of information to a sales team?”, and “Which competitive measures are working best?”.

Library Background

The Information Technology sector is a very large marketplace, with about $3 trillion spent annually on technology (Lovelock et al., 2017) ​3​. In terms of company size, over 40% of overall IT spending will come from very large businesses (more than 1,000 employees), while the small office category (the 70-plus million small businesses with 1-9 employees) will provide roughly one-quarter of all IT spending throughout the forecast period (Goepfert, Minton, & Shirer, 2016) ​4​. My company is increasingly enamored with the idea of competing on analytical information derived from as many sources as it can reasonably obtain. Historically, the primary customer for the library has been field salespeople.

The typical field sales team that focuses on Fortune 2000 companies may consist of several distinct sales specialists, several technical sales engineers, and other personnel, often referred to as “overlay sales”. My company focuses on Information Technology (IT) and has a vast portfolio of products with which to address this $3 trillion market. Because of the plethora of products and solutions that the company provides and that customers demand, no single salesperson can possibly address or manage the entire set of offerings needed for a given customer, which gives rise to the idea of “overlay sales”. There is one account manager responsible for the customer account, but they must rely on the overlay sales teams to provide the specialty expertise needed to address the customer’s needs. This can be likened to a primary care physician, who uses specialty doctors to assist in providing care beyond their own skill set.

Mission and Goals

While not chartered as a ‘library’ per se, the mission and goals of the library will be as follows:

First, to help staff use and leverage the collection of curated digital data

  • To provide an Analytics service to staff and internal organizations
    • Similar to library docents; assist staff with their research questions and projects
    • Maintain the collection by continually tracking the need for new digital datasets
    • Foster community of experts – data scientists, team champions, curators, and researchers
  • The Service will:
    • Create and maintain a digital library of curated data for research questions
    • Catalog, tag, and maintain metadata about the datasets to facilitate rapid discovery
    • Assist staff in finding and using data to answer their research needs
    • Connect users with a community of skilled personnel, such as data scientists, to assist with analytics
    • Assist internal organizations in developing visualizations which allow them to run their business better

Second, drive initiatives via analytics

  • Data Library team will be exposed to many research project ideas and teams
    • Data Library team will also be widely networked
    • Opportunity to refine ideas into competitive initiatives
  • Reserve 20% team bandwidth for driving research prototypes into initiatives
    • Data Library can either drive the initiative, or
    • Work with customer team to complete initiative (generally preferred)
  • Research initiatives must address the question:
    • “What business behavior changes if we answer this question with data?” (Coblentz, 2017a) ​5​

Primary Information Services

An example team that provides actionable information to its primary constituency is the Competitive Intelligence team. To support the primary customer – salespeople – the CI team provides a wide variety of types of information. Recent analysis shows that an informed salesperson can increase their probability of a sales ‘win’ by 10 percentage points over their less-informed counterparts, along with a shorter sales cycle and a higher average deal size. It is this observation that drives much of the value proposition of the digital library.

Much of the information currently sought by other teams (product management, product marketing, etc.) looking to make a ‘data-driven decision’ includes the following:

  • Market Trend Analysis, which includes industry analyst information and is distilled into a prognostication of the types of emerging problems that customers are starting to become concerned about and on which they will spend money;
  • Buyer Pattern Analysis, the creation of a “persona” that typifies the set of personalities involved in a corporate purchasing decision for our technology and allows the sales teams to develop specific ‘approach strategies’ for these people; and
  • Pricing Recommendations, which reflect the historical trends of our sales discounting practices compared with any changes from the competition – such as new pricing models or patterns, or the bundling programs of new and emerging products and vendors trying to disrupt the marketplace – and result in changes to our list prices, our licensing models, and possibly our revenue recognition strategies.

Beyond the effort to create these information products, we often respond to requests for information which either force an update to an existing information product or a review of a given datum. Sales teams often request information about the competition and their likely sales strategies. Typical questions may be posed, as shown in the list below:

  • What are the typical pricing tactics used by the competition and what is their approach to discounting?
  • What are the sales messages typically communicated or positioned about the competitor’s product (in effect, their claims as to why they are better)?
  • What is the competitor messaging about our products (in effect, what are the competition’s claims as to why my company and product are not the optimal choice)?
  • What are the predispositions of the various members of the customer team?
  • What do they believe? Who are their favorites? Why might they think that way?

The answers to these questions allow the salesperson to begin to understand what perceptions may exist within the buyer’s mind and to develop the thoughtful responses needed to challenge those perceptions and to form a persuasive argument.

The technical sales engineering teams often request information regarding specific technical details of a competitor’s product which they need to provide a solid foundation for their analytical assumptions. Sales analyses typically consist of some form of financial justification, either a cost-benefit analysis or a risk-mitigation analysis. A competitive intelligence team provides the technical and go-to-market details regarding the competition which then form the basis for any economic assumptions of cost or calculations of risk impact.

Social Media Strategy

Executive Summary

This section describes the use of internal social media applications, executive announcements, and key internal innovators to increase overall awareness of the data library; to drive business value across the division’s product portfolio using big data analyses and techniques; to build and curate the resultant “Big Data Library”; and to assist and encourage employees to use it.

Description and Background

The annual R&D budget of my former company is over $1B, yet the research data resulting from this considerable investment are seldom accessible to the employees who might be able to leverage it best. The senior leaders have stated over and over that they want to become “data-driven” yet reserving the insights that can be obtained from the data to a select few will hamper organizational performance. Thus, the Data Library is intended to be a virtual library where employees can access known documented datasets along with a social network of scientific staff who can assist in answering a data science question.

To achieve this aim of having known documented datasets and allowing employees to enjoy the full benefit of the research data that is produced, the Data Library must put in place a repeatable digital curation process and effective curation lifecycle management. This will help to ensure that important digital research data is adequately safeguarded for future (re)use. By learning how to preserve and share digital materials so others can effectively reuse them, we can maximize the impact of our research – and inspire confidence among the leadership team in the value and the accuracy of the insights obtained from it.

Audience Profile

A digital library must be able to serve as a central information point for its customers. It will also act as a services bureau, able to connect a person with a query (a business question) with someone who could help them answer that question with data. The primary activity here is not casual “what if” questions but to help someone seeking to develop insights using the data, which will help change the way they do business. Once the business person is able to answer “what they will do with the answer”, a library docent/data specialist is able to better help them find and visualize their answers. Thus, the primary audience will be employees and staff of the corporation; often referred to as knowledge workers, these people tend to execute the functions of the business according to established processes and protocols. Improving the way they do business or ceasing an action that adds little to no value is the object of these ‘questioners’.

Challenges and opportunities

This section will describe the problem occurring within companies attempting to leverage “Big Data” and describe some of the issues and pitfalls with attempting to make this a self-service capability for all employees.

Current Challenges

In the high-tech, fast-paced world of Silicon Valley, engineers are taught that collaboration with their colleagues is an excellent technique for solving problems. From an early point in their careers, particularly if they are involved in software development, they are exposed to the Agile methodology, which teaches teams of developers to ‘swarm’ on a problem and resolve it quickly, and to hold a ‘daily standup’ where each person presents a short, one-minute summary of their intended work for the day and any blockers or help needed, and any immediate conflicts in the project are resolved. Developmental code is often stored in a code repository such as GitHub, and documentation is written alongside it. The daily activities and results may be documented in a wiki for remote workers. Yet at the end of the project (a subsection may be called a ‘sprint’), little work is done for long-term preservation of the activity and development, save for the work product itself. It is a continual cycle of development, test, and delivery. This culture pervades the valley and extends to the rest of the corporation’s staff, including the Data Scientist teams. Data Scientists are increasingly engaged and involved in application development, as the next round of application improvement is set to be delivered on a never-ending treadmill of continual software delivery. This is no different from the expectation that researchers in academia publish (rather than document and archive) the data and the workflow used to derive the resulting output (Koltay, 2017) ​6​.

Current Opportunities

‘“There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data,” Kaggle founder and CEO Anthony Goldbloom told The Verge over email. “In reality, it really varies. But data cleaning is a much higher proportion of data science than an outsider would expect. Actually training models is typically a relatively small proportion (less than 10 percent) of what a machine learner or data scientist does”’

(Vincent, 2017) ​7​

Creating a digital library based on self-service and reference-ability will enable the company to save potentially significant amounts of time and effort in arriving at new applications and insights. More importantly, the time saved in building applications by avoiding redundant datasets, together with the improved applicability of data insights because the datasets are already validated, could add up to revenue implications in the hundreds of millions. By creating a digital library of curated data, combined with a process of validating and augmenting that body of data in a mechanism akin to academic and scientific citations, the company can create a reusable system of reference data. Coupling those data with an easy-to-use visualization system (see Critical Technologies, below) will enable casual business users to work with datasets to begin to formulate and explore answers to the business questions that they understand best. The self-service mechanism of the library means that nearly all staff can become ‘citizen data scientists’, expanding the bandwidth of data exploration that can be accomplished without a commensurate increase in data scientist staffing.

Client Profile

The Data Library will be a central place for business staff to access known datasets that have been curated by scientific staff for various purposes. The Library should act as a primary curation center for new datasets which are created during the course of a project. Once a project completes or reaches sufficient maturity that it can go into production, then the datasets, the techniques used to create the data, the staff and the intended usage are submitted to the Data Library for review by a curation team. The Data Library thus acts as a data archive for projects, along with the corresponding science analysis that created the datasets in the first place.

Goals and Objectives

The primary goal of the social media activity will be to increase the usage of the Digital Library. That will be measured using visitor metrics, particularly returning-visitor metrics. Also, while we hope to reduce the number of questions asked directly of the staff by providing a FAQ-style information portal, accesses of the various questions will serve as an indicator of question interest; accessing a question link will therefore also count as a question asked. Visits to the FAQ, access to the various questions, and usage of the links from the FAQ should increase over time, even among returning users. Section 5.1 below identifies some of the library’s main metrics for operation and valuation. Since social media is ultimately intended to drive awareness of the library and its value proposition, the metrics for any specific social media channel will focus on users’ activities:

  • The time between a user’s initial visit and their first return visit.

  • The increase in return visit frequency, if any. That is, a user should return to the library more and more often.
  • The number of “touches” achieved through social media – visits from a referring site, such as an internal wiki or the employee intranet sites.
  • Social media announcement campaigns, including announcements by executives at employee meetings describing the achievements by employees when using the library.
  • The number of incoming referral links from other employee intranet sites over time.

In connection with the metrics listed in Section 5.1 below, it should be possible to ascribe value in revenue, profit, and operational performance due to the usage of the library by employees.
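
As an illustration of how the return-visit metrics above might be computed, the following sketch assumes a simple per-visit log exported from the web analytics tooling; the file name and columns (user_id, visit_date) are hypothetical placeholders rather than an existing report.

```python
import pandas as pd

# Hypothetical export from the web analytics tooling: one row per visit.
visits = pd.read_csv("library_visits.csv", parse_dates=["visit_date"])
visits = visits.sort_values(["user_id", "visit_date"])

# Days between consecutive visits for each user (NaN for a user's first visit).
visits["gap_days"] = visits.groupby("user_id")["visit_date"].diff().dt.days

# Time between a user's initial visit and their first return visit.
days_to_first_return = (visits.dropna(subset=["gap_days"])
                              .groupby("user_id")["gap_days"]
                              .first())

# Return-visit frequency: average gap between visits, per user.
avg_gap = visits.groupby("user_id")["gap_days"].mean()

print("Median days to first return:", days_to_first_return.median())
print("Median average gap between visits:", avg_gap.median())
```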

Assumptions

In separate literature, a study on the propensity of sales personnel to adopt new technology was conducted as part of a cross-sectional survey (Robinson Jr, 2005) ​8​. The paper points out that sales personnel adopt – or do not adopt – new technology according to two main factors: their attitude towards technology and their perception as to whether the technology will help them in their job. Additionally, these sales personnel will continue to use the technology to a degree related to their perception of how easy it is to use the tool.

Sales personnel are generally considered to be highly motivated, driven individuals who have a clear sense of their objective – to make money. It seems reasonable to assume that data scientists are similarly driven and have a positive attitude towards technology. Accordingly, this leads to the following hypotheses for data scientists and dataset curators in corporate enterprises:

H1: The culture within corporate enterprises of quickly delivering results outweighs the priority of documenting and archiving the big datasets used in building the results. This leads to little time spent on reflection and preparing the datasets for reuse.

H2: Corporate Enterprises lack formal custodians, tools, training, and repositories for curating and cataloging their big data.

H3: Technology Tools can help address some of the issues in managing the datasets and scientists that contribute to the library.

H4: An incentive plan providing for ‘royalties’ for curated datasets which have been submitted to the library for permanent retention, curation, and management will foster increased participation by Data Scientists and other library users in leveraging the library’s assets for their research and in submitting finished and documented datasets for subsequent management. As outlined by one comment on p. 15, researchers understand that it is in their best interest to manage and tag their data but typically feel that it is just too much trouble. Creating an incentivization plan which encourages a more formal approach to the data will benefit follow-on research efforts increasingly over time.

This plan will focus on hypothesis 3; hypotheses 1 and 2 have been addressed in a previously submitted paper (Coblentz, 2017b) ​9​. Technology can help to address some of these issues, as outlined in this plan by allowing scientists to record their methodologies, to support official reviews of the queries and to document the datasets, both in terms of intended use and a data dictionary. The specific technologies suggested can be found in Section 4, Information technologies and/or services for the library, below.

Information technologies and/or services for the library 

As noted by Gantz, “…technology tools will be necessary but not sufficient…” for managing the library. The main objectives, coupled with the primary threats, are outlined in the following paragraphs. The section Mission & Goals, above, outlined two mission goals for the library:

  • Create and maintain a digital library of curated data for research questions
  • Catalog, tag, and maintain metadata about the datasets to facilitate rapid discovery

Information Technologies Needed

Critical Technologies

Critical technologies are those necessary for the successful, continued operation of the library. Three critical technological areas support the library: Analytics, where the ‘citizen scientists’ conduct research and answer questions; control of the curated datasets, to make them firmly foundational; and documentation of the library structures overall.

Cloud-based Analytics

A platform for analyzing the data created by and for data scientists must be available. Ideally, the platform should also provide compute, storage, and networking capabilities needed to work with “big datasets”; a means to query, combine, and select the data as well as to visualize the results must be part of the system. This is, after all, where the “citizen data scientists” will do their work. One such system available to the company is Domo. Domo is “a cloud-based interactive dashboard platform aimed at senior executives and line-of-business users. Domo enables rapid deployment by leveraging its native cloud architecture, an extensive set of data connectors and prebuilt content, and an intuitive, modern user experience. Domo is primarily used by business people for management-style dashboards and is often deployed in lines of business with little or no support from IT” (Sallam et al., 2017) ​10​. Domo is currently in use at the company. Additional seats will need to be licensed to satisfy the objectives of a ‘citizen analytics platform’.

Security is often a corollary of big data analytics – where will the data be accessed, how will it be stored, and what mechanisms are available to control user authentication and access? To address this, we will need to include Single Sign-on mechanisms to manage user authentication and LDAP for user access control, and to insist on periodic reviews by corporate IT/security teams to ensure that the system keeps up with the required threat analyses.

SQL Control

The SQL that generates a dataset is just as important as the dataset itself. The SQL that generates a dataset necessarily becomes part of the dataset’s provenance when reviewing the data and deciding if it is suited for the intended analysis. And when the analysis project is finished, if the results are moved into a production system upon which business decisions are based, then the SQL that generates that data must be controlled and verified. Most companies have a mechanism for managing software already – Source Code Management (SCM). Git is a common enterprise tool for SCM. Git is an open-source version control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source code management in software development (Wikipedia, 2017) ​11​.

“Version control systems keep these revisions straight, storing the modifications in a central repository. This allows developers to easily collaborate, as they can download a new version of the software, make changes, and upload the newest revision. Every developer can see these new changes, download them, and contribute. Similarly, people who have nothing to do with the development of a project can still download the files and use them” (Brown, 2016) ​12​. This “repository” style approach will be extremely suited to a digital library that needs to track and manage the SQL (Structured Query Language) code that generates datasets. Not only will this support review, approval, and “authentication” of the resulting data, but the documentation of the SQL techniques of a given query can be reviewed by other users curious about query techniques they could adapt for their own research purposes.
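
One lightweight way to support that review-and-approval step is a check, run before SQL is accepted into the repository, that each dataset-generating query carries a minimal documentation header. The sketch below is a hypothetical pre-commit style script, not an existing tool; the required header tags are assumptions about what the curation team might mandate.

```python
import sys
from pathlib import Path

# Header comments we might require at the top of every dataset-generating SQL
# file before it is accepted into the repository (tag names are illustrative).
REQUIRED_TAGS = ("-- dataset:", "-- owner:", "-- intended_use:")

def missing_tags(path: Path) -> list[str]:
    """Return the required header tags that are absent from one SQL file."""
    header = path.read_text(encoding="utf-8")[:2000]  # only scan the top of the file
    return [tag for tag in REQUIRED_TAGS if tag not in header]

def main(paths: list[str]) -> int:
    failures = 0
    for name in paths:
        missing = missing_tags(Path(name))
        if missing:
            failures += 1
            print(f"{name}: missing {', '.join(missing)}")
    return 1 if failures else 0

if __name__ == "__main__":
    # Typically wired up as a Git pre-commit hook over the staged *.sql files.
    sys.exit(main(sys.argv[1:]))
```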

“Project revisions can be discussed publicly, so a mass of experts can contribute knowledge and collaborate to advance a project forward… Additionally, changelogs will be used to help people track changes to the SQL… When multiple people collaborate on a project, it’s hard to keep track of revisions—who changed what, when, and where those files are stored. GitHub takes care of this problem by keeping track of all the changes that have been pushed to the repository” (Brown, 2016) ​12​.

Wiki

Citizen scientists will need a mechanism for looking up datasets, their provenance, and the schema of the data so that they can choose which data will be appropriate for addressing their questions. Documenting the history, the construction, and the schema of a dataset is likely best suited to a reference system in a wiki-style format. New data can be submitted to the Library for incorporation as official, curated data, under a policy requiring that the dataset be properly documented and that the SQL that creates it be ‘checked in’ to the SCM repository for long-term management. Any interesting data techniques used in the SQL would also be documented in the wiki pages for the dataset(s). In this manner, the wiki becomes a teaching environment for the citizen scientists, allowing them to build on a foundation of curated data and database query techniques.

Most database systems would manage the schema using a data dictionary and this library will do so as well – see Data Dictionary Applications, below for more discussion – and depending on the technology used for the data dictionary, either simple descriptions of the data fields will be in the wiki or full schema references will be used. Either way, the intent is to provide the wiki visitor with sufficient information about the fields, the naming convention, the primary keys to be used for joining data, and the table itself so they can choose their own data.

Portal – Klue / various UIs / OAKS

Any digital library needs a primary consumption UI. The digital library will support multiple portals in the form of websites, various Collaboration 2.0 tools (wikis, blogs, etc.), and download mechanisms for file-based content. While my company does not have an internal FTP site per se, the external website and associated publication mechanisms are being retooled for internal publishing purposes. (See the paragraph below on OAKS).

Klue – The primary publishing mechanism for analyzed content will be Klue. Klue is a SaaS-based, competitive intelligence portal primarily intended for field personnel to use to inform themselves regarding the state of the competition – vendors, products, pricing, etc. Klue hosts web content along with reference linkages to other content – presentations, word documents, in-depth reporting and such – outside the corporate firewall. Security is maintained via Single Sign-on and an IT-hosted “Partner Portal” where authentication is managed.

The value proposition of any content delivery system should be measured according to its intended usage and desired effect. Klue ‘cards’, which are akin to 3×5 index cards, are intended to hold a single ‘nugget’ of information. The form factor of the ‘card’ continuously challenges content curators to keep their content focused, relevant, and concise. Curators can rearrange the consumer’s experience by moving the cards around. It is believed that competitive information can dramatically improve the organizational performance of a large high-tech enterprise like mine; internal studies have shown significant improvement (>10%) for smaller, division-sized units.

OAKS – Project Oaks is the internal project to retool the external publishing mechanisms, which maintain the content and corporate web properties, to support internal use cases. The Digital Library will use the OAKS system to host file-based content that Klue simply cannot support. The team will post an OAKS-based URL into the Klue ‘cards’ for consumers and provide an SSO-handoff mechanism so that authentication is relatively seamless. It will be incumbent on the Digital Library admin team to harmonize the authorization systems between the two so that content is neither inadvertently exposed nor hidden from users.

Wiki – The internal Data Library wiki is still very nascent. Intended to be a source of information and insight about the datasets, the methodology used to create them, and other techniques and items of interest, it remains a daunting prospect to actually sit down and write the information down. It is a real-life reminder of Polanyi’s views on tacit knowledge: “we know more than we can tell” (Polanyi, 1967) ​13​. Currently the wiki houses several key sections:

  • The Data Library Home;
  • About The Library;
  • Internal Projects;
  • Dataset Descriptions;
  • Dataflows and Transforms;
  • Metadata standards;
  • Topic Areas;
  • Training / Education;
  • Miscellaneous;
  • Security items;
  • Index;
  • List of candidate data sources for the CI DOMO dashboard; and
  • Klue Analytics

with more sections and pages being added all the time.

Corporate intranet – where employees can create and manage their own blogs, team sites, and various information distribution systems. However, it is limited in the customization ability afforded to employees, which makes it unsuitable for describing the necessary technical items for the Library. It can, however, host file-based content that needs to be kept internal to employees only; it is an alternative to Project Oaks, which is open to the partner community. Some items are simply too sensitive or irrelevant for the partner community (an example would be the payouts for corporate sales staff if they sell a particular combination of products from the portfolio, and other sensitive information).

Telemetry

Project Oaks is instrumented to collect data about user activities on web properties. We are using that technology (Adobe Analytics aka Omniture) to track the file downloads executed by the Klue users. We receive a report on the downloads once per month. Data provided includes the link used to access the content, the number of times the asset was downloaded, the type of content asset and the subject. The intended usage for this data is to inform the staff curating the content and maintaining the competitive information regarding the subject to see if it is downloaded at a rate commensurate with the overall competitor access rates or if something needs to be adjusted.

Klue uses a different technology since we do not ‘own’ the web property. In this instance, we use Google Analytics to collect and report statistics daily, and we use a weekly report generated by the Klue site on the number of users we have, their account creation date, their most recent login, and their email address. If a user is an employee, their badge number is also recorded in the access logs collected by Klue; we provide the badge number as part of the SSO assertion, which allows for easier analysis of user behaviors later on. The various SFDC instances that our user base accesses also track activity by badge number, so we can correlate access rates for a particular competitor with the sales activity conducted in the field. Current metrics extracted from the reports and statistics include:

  • New users added this week
  • Rate at which users return – achieved by grouping the ‘date last logged in’ into 15-day ‘buckets’; the intent is to move as many users as possible into the 0-15 day bucket
  • Partners or employees by date, so we can trend adoption rates across demographics
  • Employee maturity (time in a sales role) and access rates, grouped into 0-6 mos, 6-12 mos, 1-3 years, and over-3-year buckets; this allows us to see whether senior staff are adopting the technology, and we can subsequently trend performance improvement rates via the SFDC data

Data such as this allows us to adjust our training and ‘Call to Action’ messaging.
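
A rough sketch of how the 15-day recency buckets and the maturity buckets described above might be computed from the weekly user report is shown below; the file and column names (last_login, role_start_date, and so on) are illustrative assumptions rather than the actual report schema.

```python
import pandas as pd

# Hypothetical weekly user report (column names are illustrative).
users = pd.read_csv("klue_users.csv", parse_dates=["last_login", "account_created"])
today = pd.Timestamp.today().normalize()

# Recency: days since last login, grouped into 15-day buckets (0-15, 15-30, ...).
days_since_login = (today - users["last_login"]).dt.days
users["recency_bucket"] = pd.cut(days_since_login, bins=range(0, 181, 15),
                                 include_lowest=True)

# Maturity: time in a sales role, grouped into the buckets described above.
# Assumes a hypothetical role_start_date column sourced from HR data.
tenure_days = (today - pd.to_datetime(users["role_start_date"])).dt.days
users["maturity_bucket"] = pd.cut(tenure_days,
                                  bins=[0, 182, 365, 3 * 365, 100 * 365],
                                  labels=["0-6 mos", "6-12 mos", "1-3 yrs", "3+ yrs"])

# Cross-tab the buckets to see which cohorts are (or are not) returning.
print(users.groupby(["maturity_bucket", "recency_bucket"], observed=True).size())
```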

Innovative Technologies

To extend the capabilities of the digital library, several technologies are suggested. Most have an obvious value proposition (using web analytics to track usage and adoption rates, for example); others derive from the assumptions previously listed in section 3.3 above. Assumption H4, which deals with incentivizing staff to document and contribute their work to the library, is the biggest driver of these additional technologies; most are already implemented and available within the company, but none have been implemented in a manner that supports this objective.

Predictive analytics

“…predictive analytics is the systematic use of data, machine learning techniques and a host of statistical algorithms to identify patterns that forecast the likelihood of future outcomes based on huge chunks of historical data…Predictive analytics slightly differs from other forms of big data analytics in that it is the only form that gives futuristic forecasts. Others such as prescriptive analytics gives directions on what actions should be taken to remedy various corporate issues; diagnostic analytics determines what happened and shows us why while descriptive analytics tells us what is currently happening”.

(Farooq, 2016) ​14​

To enable citizen scientists to refine their ideas into competitive initiatives (as mentioned above, in the library objectives; section 1.5), tooling that enables forecasting and future predictions will be needed to assist in creating ROI estimates which will be used to justify projects and bound application development effort. Tools which can visualize and forecast behavior based on existing data can help users evaluate their ideas and approaches to implementing problem solutions. Such systems typically allow users to vary one or more parameters to measure the impact of changes on the solution.
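
A minimal sketch of this kind of what-if exercise is shown below, using ordinary linear regression from scikit-learn on made-up historical values; the driver (library sessions per rep) and the outcome (win rate) are purely illustrative assumptions, not actual company data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative history: a driver we can influence vs. an outcome we care about.
sessions = np.array([[2], [4], [5], [7], [9], [12]])        # library sessions per rep
win_rate = np.array([0.18, 0.21, 0.22, 0.25, 0.27, 0.31])   # observed win rates

model = LinearRegression().fit(sessions, win_rate)

# Vary the parameter to produce a rough forecast for ROI discussions.
for x in (15, 20):
    predicted = model.predict(np.array([[x]]))[0]
    print(f"sessions={x}: predicted win rate ~ {predicted:.2f}")
```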

Given that the analytics platform described in the Telemetry section above is cloud-based, it seems likely that the analytical tools will also be cloud-based. Two such tools have been identified, and more are being developed every day.

The wiki will describe the predictive analytics tools available, those that have been researched, and which of those seem most promising. This reference material will enable others to rapidly assess their needs against a known set of tools.

Data dictionary applications to track and identify useful columns.  

In this document, the terms data dictionary and data repository indicate a more general, manually generated listing or catalog of the organization of the various datasets in the library. The data dictionary is a data structure that stores metadata, i.e., (structured) data about information (the datasets and their provenance). It is mainly used by the designers, users, and administrators of a computer system for information resource management. The catalog/data dictionary will maintain information on software configuration, documentation, applications, and users, as well as other information relevant to information reference administration (Wikipedia, 2004) ​15​. Library users and application developers will benefit from an authoritative data dictionary document that catalogs the organization, contents, and conventions of one or more databases. This typically includes the names and descriptions of the various tables (records or entities) and their contents (fields), plus additional details such as the type and length of each data element. Another important piece of information that a data dictionary can provide is the relationships between tables. Since Domo does not have active data dictionary support, the data dictionary must be maintained manually as part of the dataset check-in process.
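
Because the dictionary will be maintained manually, even a simple structured record per dataset would meet the need. The sketch below shows one possible shape for such an entry (all dataset, table, and field names are hypothetical); a record like this could be rendered into the wiki page for the dataset or exported alongside it at check-in time.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ColumnEntry:
    name: str
    dtype: str
    length: int | None
    description: str

@dataclass
class DatasetEntry:
    dataset: str
    description: str
    owner: str
    primary_key: list[str]
    related_tables: list[str]                      # relationships to other tables
    columns: list[ColumnEntry] = field(default_factory=list)

# Illustrative entry only; names and fields are assumptions, not an existing schema.
wins = DatasetEntry(
    dataset="sales_wins",
    description="Closed-won opportunities joined with competitor presence flags.",
    owner="Competitive Intelligence",
    primary_key=["opportunity_id"],
    related_tables=["accounts", "competitor_mentions"],
    columns=[
        ColumnEntry("opportunity_id", "varchar", 18, "SFDC opportunity identifier"),
        ColumnEntry("close_date", "date", None, "Date the opportunity closed"),
        ColumnEntry("amount_usd", "numeric", None, "Booked amount in US dollars"),
    ],
)
```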

Citation and reuse tracking for data scientists. 

Significant challenges still exist in the provenance of big data systems (Wang, Crawl, Purawat, Nguyen, & Altintas, 2015) ​15​. A reference architecture proposed by Wang, et al., would seem to have significant advantages for the library’s needs in managing data provenance. However, this will have to be studied and prototyped as the library develops. There does not seem to be a specific tool that provides a provenance capability (something pointed out as an opportunity in the paper), so initially, provenance will have to be managed via descriptions and required documentation of the datasets and analyses provided to the library via a formal submission process. This can be managed under the auspices of the company’s Data Governance Team.
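
Until a dedicated provenance tool is selected or prototyped, a minimal provenance record attached to each formal submission could capture the essentials. The sketch below is only one possible shape for such a record; the field names, repository URL, and identifiers are assumptions for illustration, not an adopted standard.

```python
import json
from datetime import date

# Minimal provenance record kept alongside a submitted dataset (illustrative fields).
provenance = {
    "dataset": "sales_wins_v3",
    "derived_from": ["sales_wins_v2", "competitor_mentions"],   # parent datasets
    "sql_repo": "git@github.example.com:data-library/sql.git",  # hypothetical repository
    "sql_commit": "abc1234",                                    # commit that produced the data
    "created_by": "jdoe",
    "created_on": date.today().isoformat(),
    "intended_use": "Win/loss analysis by competitor",
    "reviewed_by": "data-governance-team",
}

with open("sales_wins_v3.provenance.json", "w", encoding="utf-8") as fh:
    json.dump(provenance, fh, indent=2)
```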

Derived requirements

The IT systems in the following sections are required to support the systems previously called out. Single Sign-on, for example, is needed to manage user access and authentication to the data analytics platform. Most of these systems are already in place.

Single sign-on for authentication 

“Single sign-on (SSO) is a session and user authentication service that permits a user to use one set of login credentials (e.g., name and password) to access multiple applications. The service authenticates the end-user for all the applications the user has been given rights to and eliminates further prompts when the user switches applications during the same session. On the back end, SSO is helpful for logging user activities as well as monitoring user accounts.

In a basic web SSO service, an agent module on the application server retrieves the specific authentication credentials for an individual user from a dedicated SSO policy server, while authenticating the user against a user repository such as a lightweight directory access protocol (LDAP) directory” (Rouse, 2014) ​16​. It is this requirement that drives the need for an LDAP system, see LDAP, below.

LDAP group management for access control

LDAP (Lightweight Directory Access Protocol) (Rouse, November 2008) ​17​ is a software protocol for enabling anyone to locate organizations, individuals, and other resources such as files and devices in a network, whether on the public Internet or on a corporate intranet. LDAP is a “lightweight” (smaller amount of code) version of Directory Access Protocol (DAP), which is part of X.500, a standard for directory services in a network.

In a network, a directory tells you where in the network something is located. On TCP/IP networks (including the Internet), the domain name system (DNS) is the directory system used to relate the domain name to a specific network address (a unique location on the network). However, you may not know the domain name. LDAP allows you to search for an individual without knowing where they’re located (although additional information will help with the search).

An LDAP directory is organized in a simple “tree” hierarchy consisting of the following levels:

  • The root directory (the starting place or the source of the tree), which branches out to…
  • Countries, each of which branches out to…
  • Organizations, which branch out to…
  • Organizational units (divisions, departments, and so forth), which branches out to (includes an entry for)…
  • Individuals (which includes people, files, and shared resources such as printers)
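
For the library’s purposes, the main LDAP operation is checking whether an authenticated user belongs to a group that grants access to curated content. A minimal sketch using the open-source ldap3 Python library is shown below; the host, bind account, base DN, and group name are placeholders, not the company’s actual directory layout.

```python
from ldap3 import Server, Connection, ALL, SUBTREE

# Placeholders only: host, bind account, and DNs are illustrative.
server = Server("ldap.corp.example.com", get_info=ALL)
conn = Connection(server,
                  user="cn=library-svc,ou=service,dc=example,dc=com",
                  password="********",
                  auto_bind=True)

# Is this user a member of the group that grants access to curated datasets?
conn.search(search_base="dc=example,dc=com",
            search_filter="(&(uid=jdoe)"
                          "(memberOf=cn=data-library-users,ou=groups,dc=example,dc=com))",
            search_scope=SUBTREE,
            attributes=["cn", "mail"])

print("access granted" if conn.entries else "access denied")
```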

Web Analytics

Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage. However, Web analytics is not just a process for measuring web traffic but can be used as a tool for business and market research, and to assess and improve the effectiveness of a website. Web analytics applications can also help companies measure the results of traditional print or broadcast advertising campaigns. It helps one to estimate how traffic to a website changes after the launch of a new advertising campaign. Web analytics provides information about the number of visitors to a website and the number of page views. It helps gauge traffic and popularity trends which is useful for market research.

(Wikipedia, 2005) ​18​

As part of the overall effort to measure the effectiveness of the digital library initiative, it will be necessary to measure visitor traffic, dataset accesses, and overall usage. To do this requires a set of website analytics tools. While not essential for the operation of the library per se, such tooling will be essential for the measurement and evaluation of key performance indicators. See Evaluation Plan, below.

ETL mechanisms across databases

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of source systems and placing it into a data warehouse (datawarehouse4u, 2017) ​19​. The company uses a variety of databases, and much of the data for citizen scientists is kept in a Big Data Lake maintained under Greenplum. Greenplum is based on Postgres 8, so standard SQL techniques work with the technologies already described. As a rule, JSON-formatted data can easily be exported from source systems and loaded into the digital library, and most commercial vendors support the JSON format. Using JSON will ease the burden of maintaining specific connections to source systems. Transformations will be done within the digital library analytics platform so that data retains its provenance and security is maintained at all times.
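
A minimal sketch of this JSON-based load path is shown below, using the psycopg2 driver against the Postgres-compatible Greenplum lake; the connection details, staging table, and file name are placeholders. The raw payload is landed as-is so that transformations can happen inside the library platform, as described above.

```python
import json
import psycopg2

# Illustrative only: connection details, table, and file names are placeholders.
with open("export_from_source.json", encoding="utf-8") as fh:
    rows = json.load(fh)   # e.g., a list of {"opportunity_id": ..., "amount_usd": ...}

# Greenplum is Postgres-compatible, so a standard psycopg2 connection applies.
conn = psycopg2.connect(host="greenplum.example.com", dbname="datalake",
                        user="library_etl", password="********")
with conn, conn.cursor() as cur:
    for row in rows:
        # Land the raw JSON first; transformations happen inside the library
        # platform so the data keeps its provenance.
        cur.execute("INSERT INTO staging.raw_opportunities (payload) VALUES (%s)",
                    (json.dumps(row),))
conn.close()
```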

Team collaboration software to hold reports and analyses, policies and proper use citations. 

The company has already implemented a collaboration platform – Jive – for all employees, which will be used to hold results and analyses and to conduct various social/collaborative discussions on the reports and insights derived from the data. Hyperlink pointers and embedded iframe cards can be used to post visualizations from the analytics platform into the Jive system and in the other direction, as well as to GitHub and the documentation wiki.

Value Propositions

The main value proposition of the digital library and its curated assets is the ability to shorten the time it takes for a team of researchers to develop a clean, specific set of data to use in their analysis. Roughly 60-80% of project time is spent in data collection and preparation (Press, 2016) ​20​. By ‘minimizing the monkey work’, the library can shorten time to value and increase its overall return on investment.

Implementation Plan 

To the maximum extent possible, any new technology must integrate with existing corporate systems (user authentication, etc.) to minimize the effort of reviewing any new security threat vector and to ensure that the processes and systems are in step with corporate requirements. Consequently, most of the implementation effort will be in reviewing the new technologies and in integrating them with the corporate standards already in place. The library will have to create policies and procedures for the management of the data like those in place for the ‘Big Data Lakes’ already in use and this is expected to consume most of the work effort in getting the library formally accepted.

Evaluation Plan

To determine if the library is adding value to the corporation overall, it will need to be evaluated on a continuing basis. Since this is a digital library, it is reasonable to instrument the library according to the value propositions previously articulated and to create dashboards visualizing the library trends. The value propositions set forth in this plan drive the key metric definitions:

  1. Increased reuse of datasets within the library for data science projects, including partial dataset reuse. “Forking” of a dataset with subsequent joins to new data would be considered a value-add dataset.
  2. Increased dataset submission to the digital library
  3. Increases (over time) of visitors/return visits
  4. Increased data visualizations that result in business change. An example of this would be encouraging the sales force across two divisions to prefer to work together and cross-sell more of the company’s portfolio (sometimes referred to as, “increasing product drag”).
  5. Decreases in the time a typical data science project spends on data collection and cleaning due to curated data from the library
  6. An increase (over time) of the questions submitted to the library

Key Metrics

Primary and secondary metrics for the library are those that address the questions from the evaluation list just described. An example of metrics for item 1 would be:

  1. Increased reuse of datasets within the library for data science projects, including partial dataset reuse. “Forking” of a dataset with subsequent joins to new data would be considered a value-add dataset.
  • Number of new, raw datasets submitted to the library; trended over time
  • Number of new datasets with additional use cases added; that is, new data derived from existing data but used for a new purpose; trended over time
  • Number of existing datasets joined with new raw data to create a curated dataset for a new use case; trended over time
  • Number of new use cases added to the library
  • Number of projects finalized using library data
    • Number of projects ‘extended’ / reopened using library data

The number of questions could be nearly endless, so it is incumbent on the library directors to decide which questions drive value to the company at large and thus affect funding and resource budgets. To the degree that the questions help the library engender a positive return on investment, then the questions should be investigated and charted. Questions that do not help the library should be ignored.

Summary

Neither technology nor our use of it will stand still. Therefore, this plan, the technologies incorporated by reference and their features and functionality, and the corresponding processes which leverage them should be reviewed annually. Most of the technologies described in this document are SaaS-based, and feature/functionality improvements occur frequently. Projects will be proposed and evaluated over time. The Data Governance team for the Digital Library will review and prioritize the projects; some may never be authorized as the priorities of the business are adjusted. The corporate Data Governance team will utilize (and adjust) their own mechanisms for managing data; the Digital Library will have to adjust to accommodate those changes.

A key aspect to consider – one which this plan does not (yet) address – is changes to the various data schemas, along with concurrent procedural changes. It may become necessary to maintain a mapping of historical schema fields and values to current ones so that historical perspectives can be managed and addressed. Another aspect will be the incorporation of survey technologies – such as SurveyMonkey – to elicit both objective and subjective feedback from users, along with documenting the nature of the schema and/or procedural changes.

A derived aspect will be the need to manage and maintain the contractual/financial vendor relationships that support the plan. To exemplify this, consider the current contract with Domo, which provides for unlimited quantities of storage (data stored on disk) and unlimited usage of the computing resources needed to manage the ETL (Extract, Transform, and Load) work for datasets numbering in the tens of millions of rows and gigabytes in size. This is much preferred to a contractual limitation based on storage and usage, which would inhibit adoption overall.

Lastly, incenting staff to document their work, which will accelerate the work of others to come, needs to be discussed and worked out with management. The needs of the business will drive the creation and adoption of the library, and we will need to measure its (positive) impact on the business so we can compare against the primary alternative: doing nothing to improve things and letting IT progress as usual through inertia.

Bibliography

  1. Gantz J, Reinsel D. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East – United States. EMC. https://www.emc.com/collateral/analyst-reports/idc-digital-universe-united-states.pdf. Published February 2013. Accessed 2017.
  2. Manyika J, Chui M, Brown B, et al. Big data: The next frontier for innovation, competition, and productivity. McKinsey and Co. https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/big-data-the-next-frontier-for-innovation. Published May 2011. Accessed November 22, 2017.
  3. Lovelock J-D, Hahn WL, Atwal R, et al. Forecast Alert: IT Spending, Worldwide, 3Q17 Update. Gartner; 2017:9. https://www.gartner.com/doc/3810569?ref=unauthreader. Accessed December 3, 2017.
  4. Goepfert J, Minton S, Shirer M. Worldwide IT Spending Will Reach $2.8 Trillion in 2019 with the Strongest Growth Coming from the Healthcare Industry, According to IDC. IDC.com; 2016:9. https://www.idc.com/getdoc.jsp?containerId=prUS41006516. Accessed November 20, 2016.
  5. Coblentz M. Analytics as a Service. 2017.
  6. Koltay T. Data literacy for researchers and data librarians. Journal of Librarianship and Information Science. July 2016:3-14. doi:10.1177/0961000615616450
  7. Vincent J. The biggest headache in machine learning? Cleaning dirty data off the spreadsheets. The Verge. https://www.theverge.com/2017/11/1/16589246/machine-learning-data-science-dirty-data-kaggle-survey-2017. Published 2017. Accessed December 6, 2017.
  8. Robinson L, Marshall GW, Stamps MB. An empirical investigation of technology acceptance in a field sales force setting. Technology and the Sales Force. 2005;34(4):407-415. doi:10.1016/j.indmarman.2004.09.019
  9. Coblentz M. Research Topic: What are the information organizing principles for big data within a large enterprise? 2017:43.
  10. Sallam RL, Howson C, Idoine CJ, Oestreich TW, Richardson JL, Tapadinhas J. Magic Quadrant for Business Intelligence and Analytics Platforms. Gartner; 2017. https://www.gartner.com/doc/reprints?id=1-3RTAT4N&ct=170124&st=sb.
  11. Wikipedia. Git. Wikipedia. https://en.wikipedia.org/wiki/Git. Published 2017. Accessed December 8, 2017.
  12. Brown K. What Is GitHub, and What Is It Used For? HowToGeek. https://www.howtogeek.com/180167/htg-explains-what-is-github-and-what-do-geeks-use-it-for/. Published 2016. Accessed December 9, 2017.
  13. Polanyi M. The Tacit Dimension. London: Routledge & K. Paul; 1967:xi, 108 p.
  14. Farooq M. Applications of Predictive Analytics in various industries. Big Data Made Simple. http://bigdata-madesimple.com/applications-of-predictive-analytics-in-various-industries-2/. Published 2016. Accessed December 9, 2017.
  15. Wang J, Crawl D, Purawat S, Nguyen M, Altintas I. Big data provenance: Challenges, state of the art and opportunities. In: Santa Clara, CA; 2015.
  16. Rouse M. Single sign-on (SSO). TechTarget. http://searchsecurity.techtarget.com/definition/single-sign-on. Published 2014. Accessed December 4, 2017.
  17. Rouse M. LDAP (Lightweight Directory Access Protocol). TechTarget. http://searchmobilecomputing.techtarget.com/definition/LDAP. Published November 2008. Accessed December 3, 2017.
  18. Wikipedia. Web Analytics. Wikipedia. https://en.wikipedia.org/wiki/Web_analytics. Published 2017. Accessed December 9, 2017.
  19. datawarehouse4u. ETL Process. Datawarehouse4u. https://www.datawarehouse4u.info/ETL-process.html. Published 2017. Accessed December 9, 2017.
  20. Press G. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes. 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#44ca7da36f63. Accessed December 9, 2017.