awesome-nlp

 


A curated list of resources dedicated to Natural Language Processing

Maintainers - Keon Kim, Martin Park

Please read the contribution guidelines before contributing.

Please feel free to submit pull requests, or email Martin Park (sp3005@nyu.edu) or Keon Kim (keon.kim@nyu.edu) to add links.


Tutorials and Courses

  • TensorFlow Tutorial on Seq2Seq Models
  • Natural Language Understanding with Distributed Representation, Lecture Notes by Cho

Videos

Deep Learning for NLP

Stanford CS 224D: Deep Learning for NLP class
Class by Richard Socher. The 2016 content was updated to use TensorFlow. Lecture slides and reading materials for the 2016 class here. Videos for the 2016 class here. Note that some lecture videos are missing for 2016 (lecture 9, and lectures 12 onwards). All videos for the 2015 class here.

Udacity Deep Learning
Deep Learning course on Udacity (using TensorFlow), which includes a section on deep learning for NLP tasks. That section covers how to implement Word2Vec, RNNs, and LSTMs.

A Primer on Neural Network Models for Natural Language Processing
Yoav Goldberg. October 2015. No new results; a 75-page summary of the state of the art.

Packages

Implementations

Libraries

  • TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
  • Node.js and JavaScript - Node.js Libraries for NLP
    • Twitter-text - A JavaScript implementation of Twitter's text processing library
    • Knwl.js - A Natural Language Processor in JS
    • Retext - Extensible system for analyzing and manipulating natural language
    • NLP Compromise - Natural Language processing in the browser
    • Natural - general natural language facilities for node
  • Python - Python NLP Libraries
    • Scikit-learn: Machine learning in Python
    • Natural Language Toolkit (NLTK)
    • Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others.
    • TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
    • YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
    • jieba - Chinese Words Segmentation Utilities.
    • SnowNLP - A library for processing Chinese text.
    • KoNLPy - A Python package for Korean natural language processing.
    • Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
    • BLLIP Parser - Python bindings for the BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
    • PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.
    • python-ucto - Python binding to ucto (a unicode-aware rule-based tokenizer for various languages)
    • python-frog - Python binding to Frog, an NLP suite for Dutch. (pos tagging, lemmatisation, dependency parsing, NER)
    • python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English.
    • colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
    • spaCy - Industrial strength NLP with Python and Cython.
    • PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies.
  • C++ - C++ Libraries
    • MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction
    • CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
    • CRFsuite - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
    • BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
    • colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
    • ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
    • libfolia - C++ library for the FoLiA format
    • frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
    • MeTA - MeTA: ModErn Text Analysis is a C++ data sciences toolkit that facilitates mining big text data.
    • Mecab (Japanese)
    • Mecab (Korean)
    • Moses
  • Java - Java NLP Libraries
  • Clojure
    • Clojure-openNLP - Natural Language Processing in Clojure (opennlp)
    • Inflections-clj - Rails-like inflection library for Clojure and ClojureScript
  • Ruby
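
Several of the libraries above (colibri-core, for example) are described as working with n-grams and skipgrams. As a rough illustration of what those constructions are, here is a minimal pure-Python sketch. The function names `ngrams` and `skipgrams` are hypothetical, and the skipgram definition used here (n tokens in order, with at most k tokens skipped in total) is one common convention that may differ from how any particular library defines it.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """Return n-token skipgrams that skip at most k tokens in total.

    Assumption: this is the "k-skip-n-gram" convention; a given library
    may define skipgrams differently (for example with explicit gaps).
    """
    result = []
    for start in range(len(tokens)):
        # The remaining n-1 tokens must come from this window so that
        # no more than k tokens are skipped overall.
        window = range(start + 1, min(start + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            result.append((tokens[start],) + tuple(tokens[i] for i in rest))
    return result


tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
skip_bigrams = skipgrams(tokens, 2, 1)
```

With `k=0` the skipgrams reduce to ordinary n-grams; larger `k` adds pairs such as `("the", "brown")` that span a gap.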

Services

  • Wit-ai - Natural Language Interface for apps and devices.

Articles

Review Articles

Word Vectors

Resources about word vectors, aka word embeddings, and distributed representations for words.
Word vectors are numeric representations of words that are often used as input to deep learning systems. Learning these vectors ahead of time, before training the downstream model, is sometimes called pretraining.
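
To make "numeric representations of words" concrete, the sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are hand-written, hypothetical values chosen only for illustration; real word vectors are learned from a corpus and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, not trained on any data).
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])
```

In this toy setup `sim_king_queen` comes out higher than `sim_king_apple`, which is the kind of relationship a word similarity evaluation measures.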

Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. 2013.
Generates word and phrase vectors. Performs well on word similarity and analogy tasks, and includes Word2Vec source code. Subsamples frequent words (i.e., frequent words like "the" are skipped periodically to speed up training and improve the vectors for less frequently used words).
Word2Vec tutorial in TensorFlow
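
The subsampling of frequent words mentioned above can be sketched as follows. This assumes the discard probability given in the Mikolov et al. paper, P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a threshold; the function names and the fixed random seed are hypothetical, and released implementations may use a slightly different formula.

```python
import math
import random
from collections import Counter


def discard_probabilities(tokens, threshold=1e-5):
    """Map each word to its probability of being dropped from training."""
    counts = Counter(tokens)
    total = len(tokens)
    probs = {}
    for word, count in counts.items():
        freq = count / total
        # Words rarer than the threshold get probability 0 (never dropped).
        probs[word] = max(0.0, 1.0 - math.sqrt(threshold / freq))
    return probs


def subsample(tokens, threshold=1e-5, seed=0):
    """Randomly drop occurrences of frequent words (seed is an assumption)."""
    rng = random.Random(seed)
    probs = discard_probabilities(tokens, threshold)
    return [w for w in tokens if rng.random() >= probs[w]]


# A tiny toy corpus; the large threshold is only so the effect is visible.
corpus = ["the"] * 90 + ["cat"] * 9 + ["sat"]
probs = discard_probabilities(corpus, threshold=0.05)
kept = subsample(corpus, threshold=0.05)
```

On a real corpus the threshold is typically much smaller (the paper suggests around 1e-5), so only very frequent words such as "the" are dropped often.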

Deep Learning, NLP, and Representations
Chris Olah. 2014. Blog post explaining word2vec.

GloVe: Global vectors for word representation
Pennington, Socher, Manning. 2014. Creates word vectors and relates word2vec to matrix factorizations. The evaluation section drew criticism from Yoav Goldberg.
GloVe source code and training data
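
The matrix that GloVe factorizes is built from word co-occurrence counts. The sketch below shows one way to collect such counts and the weighting function from the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, with the paper's suggested x_max = 100 and alpha = 0.75. The function names are hypothetical, and this covers only the counting and weighting steps, not the training of the vectors.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Symmetric co-occurrence counts; a pair d words apart adds 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight of a co-occurrence count in the GloVe loss."""
    return (x / x_max) ** alpha if x < x_max else 1.0


counts = cooccurrence_counts("a b a b".split(), window=1)
```

The weighting caps the influence of very frequent pairs while still giving rare pairs a small, non-zero contribution.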

Thought Vectors

Thought vectors are numeric representations for sentences, paragraphs, and documents. The following papers are listed in order of publication date; each one replaces the last as the state of the art in sentiment analysis.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Socher et al. 2013. Introduces Recursive Neural Tensor Network. Uses a parse tree.

Distributed Representations of Sentences and Documents
Le, Mikolov. 2014. Introduces Paragraph Vector. Concatenates and averages pretrained, fixed word vectors to create vectors for sentences, paragraphs and documents. Also known as paragraph2vec. Doesn't use a parse tree.
Implemented in gensim. See doc2vec tutorial
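
As a very rough illustration of turning word vectors into a vector for a longer text, the sketch below simply averages the vectors of the known words in a sentence. This is only a simplified baseline under stated assumptions, not the Paragraph Vector training procedure and not the gensim API; the function name `average_vectors` and the toy vectors are hypothetical.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of all tokens that have a known vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional word vectors (assumed values, not trained).
word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
sentence_vector = average_vectors(["good", "movie", "unknown"], word_vectors)
```

Tokens without a vector are skipped here; other choices, such as a shared "unknown" vector, are equally plausible.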

Deep Recursive Neural Networks for Compositionality in Language
Irsoy & Cardie. 2014. Uses Deep Recursive Neural Networks. Uses a parse tree.

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Tai et al. 2015. Introduces Tree LSTM. Uses a parse tree.

Semi-supervised Sequence Learning
Dai, Le. 2015. "With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups."

Machine Translation

Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Cho 2014. "comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation." Implements an attention mechanism.
English to French Demo
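
The core idea of an attention mechanism can be sketched in a few lines: score each encoder state against the current decoder state, normalize the scores with a softmax, and take the weighted sum as a context vector. Note the swap: the paper above uses a small learned network to compute the scores, whereas this sketch uses a plain dot product to stay short. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return attention weights and the weighted-sum context vector.

    Assumption: dot-product scoring, used here in place of the learned
    scoring network described in the paper.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy decoder state and three toy encoder states (assumed values).
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

The weights show which source positions the decoder "looks at" for the current output word; here the first and third states receive more weight than the second.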

Sequence to Sequence Learning with Neural Networks
Sutskever, Vinyals, Le 2014. (NIPS presentation). Uses LSTM RNNs to generate translations. "Our main result is that on an English to French translation task from the WMT’14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8"
seq2seq tutorial in

Single Exchange Dialogs

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni 2015. Generates responses to tweets.
Uses the Recurrent Neural Network Language Model (RLM) architecture of Mikolov et al. (2010). Source code: RNNLM Toolkit

Neural Responding Machine for Short-Text Conversation
Shang et al. 2015. Uses Neural Responding Machine. Trained on a Weibo dataset. Achieves one-round conversations with 75% appropriate responses.

A Neural Conversation Model
Vinyals, Le 2015. Uses LSTM RNNs to generate conversational responses. Uses the seq2seq framework. Seq2Seq was originally designed for machine translation; here it "translates" a single sentence, up to around 79 words, into a single-sentence response, and has no memory of previous dialog exchanges. Used in the Google Smart Reply feature for Inbox.

Memory and Attention Models (from DL4NLP)

Reasoning, Attention and Memory (RAM) workshop at NIPS 2015. Slides included.

Memory Networks, Weston et al. 2014, and End-To-End Memory Networks, Sukhbaatar et al. 2015.
Memory networks are implemented in MemNN. They attempt to solve tasks that require reasoning, attention, and memory.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Weston 2015. Classifies QA tasks such as single factoid, yes/no, etc. Extends memory networks.
Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems
Dodge et al. 2015. Tests Memory Networks on 4 tasks, including a Reddit dialog task.
See Jason Weston lecture on MemNN

Neural Turing Machines
Graves et al. 2014.

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Mikolov 2015. Stack RNN source code and blog post

General Natural Language Processing

Named Entity Recognition

Neural Network

Supplementary Materials

Blogs

Credits

part of the lists are from

 
