Building the Unstructured Data Warehouse: Architecture, Analysis, and Design

Learn essential techniques from data warehouse legend Bill Inmon on how to build the reporting environment your business needs now!

Answers for many valuable business questions hide in text. How well can your existing reporting environment extract the necessary text from email, spreadsheets, and documents, and put it in a useful format for analytics and reporting? Transforming the traditional data warehouse into an efficient unstructured data warehouse requires additional skills from the analyst, architect, designer, and developer. This book will prepare you to successfully implement an unstructured data warehouse and, through clear explanations, examples, and case studies, you will learn new techniques and tips to successfully obtain and analyze text.

Master these ten objectives:

  • Build an unstructured data warehouse using the 11-step approach
  • Integrate text and describe it in terms of homogeneity, relevance, medium, volume, and structure
  • Overcome challenges including blather, the Tower of Babel, and lack of natural relationships
  • Avoid the Data Junkyard and combat the Spider's Web
  • Reuse techniques perfected in the traditional data warehouse and Data Warehouse 2.0, including iterative development
  • Apply essential techniques for textual Extract, Transform, and Load (ETL) such as phrase recognition, stop word filtering, and synonym replacement
  • Design the Document Inventory system and link unstructured text to structured data
  • Leverage indexes for efficient text analysis and taxonomies for useful external categorization
  • Manage large volumes of data using advanced techniques such as backward pointers
  • Evaluate technology choices suitable for unstructured data processing, such as data warehouse appliances
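To make the textual ETL objective above more concrete, here is a minimal Python sketch of phrase recognition, stop word filtering, and synonym replacement. It is not taken from the book: the function name `textual_etl` and the three small word lists are assumptions made up for illustration, and a real project would curate such lists per domain.

```python
import re

# Hypothetical word lists, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "was"}
SYNONYMS = {"automobile": "car", "vehicle": "car", "client": "customer"}
PHRASES = {("data", "warehouse"): "data_warehouse"}


def textual_etl(text):
    """Return a list of normalized tokens for one piece of raw text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    # Phrase recognition: merge known two-word phrases into one token.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # Stop word filtering, then synonym replacement on what remains.
    kept = [t for t in merged if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]


if __name__ == "__main__":
    print(textual_etl("The client moved the automobile to a data warehouse."))
```

Phrases are merged before stop words are dropped so that a phrase containing a stop word could still be recognized; that ordering is a design choice of this sketch, not a rule stated in the text.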

The following outline briefly describes each chapter's content:

  • Chapter 1 defines unstructured data and explains why text is the main focus of this book.
  • Chapter 2 addresses the challenges one faces when managing unstructured data.
  • Chapter 3 discusses the DW 2.0 architecture, which leads into the role of the unstructured data warehouse. The unstructured data warehouse is defined and benefits are given. There are several features of the conventional data warehouse that can be leveraged for the unstructured data warehouse, including ETL processing, textual integration, and iterative development.
  • Chapter 4 focuses on the heart of the unstructured data warehouse: Textual Extract, Transform, and Load (ETL).
  • Chapter 5 describes the 11 steps required to develop the unstructured data warehouse.
  • Chapter 6 describes how to inventory documents for maximum analysis value, as well as link the unstructured text to structured data for even greater value.
  • Chapter 7 goes through each of the different types of indexes necessary to make text analysis efficient. Indexes range from simple indexes, which are fast to create and are good if the analyst really knows what needs to be analyzed before the indexing process begins, to complex combined indexes, which can be made up of any and all of the other kinds of indexes.
  • Chapter 8 explains taxonomies and how they can be used within the unstructured data warehouse.
  • Chapter 9 explains ways of coping with large amounts of unstructured data. Techniques such as keeping the unstructured data at its source and using backward pointers are discussed. The chapter explains why iterative development is so important.
  • Chapter 10 focuses on challenges and some technology choices that are suitable for unstructured data processing. In addition, the data warehouse appliance is discussed.
  • Chapters 11, 12, and 13 put all of the previously discussed techniques and approaches in context through three case studies.
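As a rough illustration of the simple indexes of Chapter 7 and the backward pointers of Chapter 9, the Python sketch below builds a word index that stores only a pointer back to each occurrence (source location plus character offset), leaving the text itself at its source. This is one possible reading of the idea, not the book's own design; `build_index` and the sample locations are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to a list of (source location, character offset).

    `documents` maps a source location (for example a file path) to its
    text. Only pointers are stored in the index, not the text itself.
    """
    index = defaultdict(list)
    for location, text in documents.items():
        for match in re.finditer(r"[A-Za-z0-9']+", text):
            index[match.group().lower()].append((location, match.start()))
    return dict(index)


if __name__ == "__main__":
    # Hypothetical sample sources.
    docs = {
        "mail/001.txt": "Contract renewal due",
        "docs/a.txt": "Renewal notice",
    }
    for word, pointers in sorted(build_index(docs).items()):
        print(word, pointers)
```

An analyst can follow a pointer back to the original document when the surrounding context is needed, which keeps the index small compared with copying every document into the warehouse.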
Time: 2024-12-21 12:28:43