Instagram uses Solr instead of PostgreSQL GIS

For an overview of Instagram's technology stack, see:

http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of

According to the article, they use Solr for geo-search.

A dedicated tool for a dedicated job, so to speak.

For our geo-search API, we used PostgreSQL for many months, but once our Media entries were sharded, moved over to using Apache Solr. It has a simple JSON interface, so as far as our application is concerned, it’s just another API to consume.
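
As a rough illustration of what "just another API to consume" can look like, the sketch below builds a Solr geo-filter query URL. The host, the core name `media`, and the field name `location` are hypothetical. The `{!geofilt}` filter with `sfield`, `pt`, and `d` (distance in kilometres) is Solr's standard spatial query syntax, not necessarily the exact query Instagram ran.

```python
from urllib.parse import urlencode


def build_geo_query_url(base_url, lat, lon, radius_km, rows=10):
    """Build a Solr select URL for documents within radius_km of a point."""
    params = {
        "q": "*:*",                # match everything, then filter spatially
        "fq": "{!geofilt}",        # spatial filter: circle around `pt`
        "sfield": "location",      # hypothetical spatial field name
        "pt": f"{lat},{lon}",      # centre point as "lat,lon"
        "d": radius_km,            # radius in kilometres
        "sort": "geodist() asc",   # nearest documents first
        "rows": rows,
        "wt": "json",              # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named "media".
url = build_geo_query_url("http://localhost:8983/solr/media", 37.7749, -122.4194, 5)
print(url)
```

The application only needs an HTTP client and a JSON parser to consume the result, which is what makes the move from a database query to a search service fairly unobtrusive.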

Apache Solr

Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

See the complete feature list for more details.

For more information about Solr, please see the Solr wiki.

Solr™ Features

Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces - XML, JSON and HTTP
  • Comprehensive HTML Administration Interfaces
  • Server statistics exposed over JMX for monitoring
  • Linearly scalable, auto index replication, auto failover and recovery
  • Near Real-time indexing
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture
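
To make the "put documents in via JSON over HTTP" idea concrete, here is a minimal sketch that prepares an indexing request without sending it. The host, core name, and document fields are hypothetical. It assumes a Solr version that accepts JSON on the `/update` handler when the `Content-Type` is `application/json`; older releases exposed a separate JSON update path.

```python
import json
from urllib.request import Request


def build_index_request(base_url, docs):
    """Prepare an HTTP POST that would index `docs` and commit them."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{base_url}/update?commit=true",   # commit so the docs become searchable
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; "location" holds a "lat,lon" string.
docs = [
    {"id": "media-1", "caption": "Golden Gate at dusk", "location": "37.8199,-122.4783"},
    {"id": "media-2", "caption": "Ferry Building", "location": "37.7955,-122.3937"},
]
req = build_index_request("http://localhost:8983/solr/media", docs)
# urllib.request.urlopen(req) would send it to a running Solr instance.
```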

Solr Uses the Lucene™ Search Library and Extends it!

  • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
  • Powerful Extensions to the Lucene Query Language
  • Faceted Search and Filtering
  • Geospatial Search with support for multiple points per document and geo polygons
  • Advanced, Configurable Text Analysis
  • Highly Configurable and User Extensible Caching
  • Performance Optimizations
  • External Configuration via XML
  • An AJAX based administration interface
  • Monitorable Logging
  • Fast near real-time incremental indexing and index replication
  • Highly Scalable Distributed search with sharded index across multiple hosts
  • JSON, XML, CSV/delimited-text, and binary update formats
  • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
  • Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
  • Apache UIMA integration for configurable metadata extraction
  • Multiple search indices

Detailed Features

Schema

  • Defines the field types and fields of documents
  • Can drive more intelligent processing
  • Declarative Lucene Analyzer specification
  • Dynamic Fields enables on-the-fly addition of new fields
  • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
  • Explicit types eliminates the need for guessing types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
  • Many additional text analysis components including word splitting, regex and sounds-like filters
  • Pluggable similarity model per field

Query

  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
  • Sort by any number of fields, and by complex functions of numeric fields
  • Advanced DisMax query parser for high relevancy results from user-entered queries
  • Highlighted context snippets
  • Faceted Searching based on unique field values, explicit queries, date ranges, numeric ranges or pivot
  • Multi-Select Faceting by tagging and selectively excluding filters
  • Spelling suggestions for user queries
  • More Like This suggestions for given document
  • Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.
  • Range filter over Function Query results
  • Date Math - specify dates relative to "NOW" in queries and updates
  • Dynamic search results clustering using Carrot2
  • Numeric field statistics such as min, max, average, standard deviation
  • Combine queries derived from different syntaxes
  • Auto-suggest functionality for completing user queries
  • Allow configuration of top results for a query, overriding normal scoring and sorting
  • Simple join capability between two document types
  • Performance Optimizations
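
Several of the query features above are driven purely by request parameters. The sketch below combines a DisMax query, a Date Math filter, sorting, faceting, and highlighting in one query string. The field names (`caption`, `tags`, `created_at`) are hypothetical; the parameter names are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_search_params(user_query):
    """Return an encoded query string exercising several Solr query features."""
    params = [
        ("q", user_query),
        ("defType", "dismax"),                      # DisMax parser for user-entered text
        ("qf", "caption tags"),                     # fields searched by DisMax
        ("fq", "created_at:[NOW-7DAYS TO NOW]"),    # Date Math relative to NOW
        ("sort", "score desc, created_at desc"),    # sort by more than one criterion
        ("facet", "true"),
        ("facet.field", "tags"),                    # facet on unique field values
        ("hl", "true"),
        ("hl.fl", "caption"),                       # highlighted context snippets
        ("wt", "json"),
    ]
    return urlencode(params)


query_string = build_search_params("sunset beach")
print(query_string)
```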

Core

  • Dynamically create and delete document collections without restarting
  • Pluggable query handlers and extensible XML data format
  • Pluggable user functions for Function Query
  • Customizable component based request handler with distributed search support
  • Document uniqueness enforcement based on unique key field
  • Duplicate document detection, including fuzzy near duplicates
  • Custom index processing chains, allowing document manipulation before indexing
  • User configurable commands triggered on index changes
  • Ability to control where docs with the sort field missing will be placed
  • "Luke" request handler for corpus information

Caching

  • Configurable Query Result, Filter, and Document cache instances
  • Pluggable Cache implementations, including a lock free, high concurrency implementation
  • Cache warming in background: when a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
  • Autowarming in background: the most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
  • Fast/small filter implementation
  • User level caching with autowarming support
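
The autowarming idea can be sketched with a small LRU cache: the keys most recently used in the old cache are re-evaluated and stored in the new one. This is a toy model under stated assumptions, not Solr's implementation; `LRUCache`, `autowarm`, and the `regenerate` callback are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache; the newest entries sit at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)       # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def most_recent_keys(self, count):
        if count <= 0:
            return []
        return list(self._items)[-count:]    # oldest to newest among the last `count`


def autowarm(old_cache, new_cache, regenerate, count):
    """Re-populate new_cache with the `count` most recently used keys of old_cache."""
    for key in old_cache.most_recent_keys(count):
        # Values are recomputed, since the new searcher sees a changed index.
        new_cache.put(key, regenerate(key))


old = LRUCache(3)
for k in ("a", "b", "c"):
    old.put(k, k)
old.get("a")                                 # recency order is now b, c, a
new = LRUCache(3)
autowarm(old, new, regenerate=lambda k: k.upper(), count=2)
```

Here only `c` and `a` are carried over, and their values come from `regenerate`, mirroring the point that the current searcher keeps serving requests while the new one is being warmed.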

SolrCloud

  • Centralized Apache ZooKeeper based configuration
  • Automated distributed indexing/sharding - send documents to any node and it will be forwarded to correct shard
  • Near Real-Time indexing with immediate push-based replication (also support for slower pull-based replication)
  • Transaction log ensures no updates are lost even if the documents are not yet indexed to disk
  • Automated query failover, index leader election and recovery in case of failure
  • No single point of failure
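
The "send documents to any node and it will be forwarded to the correct shard" behaviour rests on a deterministic mapping from document ID to shard. The toy sketch below uses CRC32 modulo the shard count as a stand-in hash; SolrCloud's own router works differently in detail, and `shard_for` and `route` are hypothetical helper names.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document ID to a shard index deterministically (stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(docs, num_shards):
    """Group documents by the shard that owns them, as a receiving node might."""
    shards = {i: [] for i in range(num_shards)}
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


incoming = [{"id": f"media-{n}"} for n in range(10)]
by_shard = route(incoming, num_shards=4)
```

Because every node computes the same mapping, any node can accept a document and forward it without consulting a central coordinator.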

Admin Interface

  • Comprehensive statistics on cache utilization, updates, and queries
  • Interactive schema browser that includes index statistics
  • Replication monitoring
  • SolrCloud dashboard with graphical cluster node status
  • Full logging control
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web Query Interface w/ debugging output
  • Parsed query output
  • Lucene explain() document score detailing
  • Explain score for documents outside of the requested range to debug why a given document wasn't ranked higher.

The Apache Software Foundation

The Apache Software Foundation provides support for the Apache community of open-source software projects. The Apache projects are defined by collaborative consensus based processes, an open, pragmatic software license and a desire to create high quality software that leads the way in its field. Apache Lucene, Apache Solr, Apache PyLucene, Apache Open Relevance Project and their respective logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

[References]

1. http://lucene.apache.org/solr/

2. http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of

Time: 2024-10-03 12:45:14
