In Think and Grow Rich, Napoleon Hill recounts the story of Darby, who dug for gold for years and then gave up just one step short of the vein, letting the treasure slip away.
Douban Reading link to the Chinese edition of Think and Grow Rich:
http://read.douban.com/reader/ebook/10954762/
(Adapted from the content of the book.)
Now, I don't know whether this story is true, but I do know there are plenty of "data Darbys" around me. These people understand the purpose and mechanics of machine learning, yet they apply only two or three algorithms to every problem they study. They never update themselves with better algorithms and techniques, either because they are too stubborn or because they are simply putting in time without seeking progress.
Like Darby, these people miss the opportunity just as they near the finish line. In the end they give up on machine learning, citing excuses such as heavy computation, high difficulty, or the inability to set suitable thresholds for tuning a model. What is the point of that? Have you come across such people?
Today's cheat sheet aims to change the attitude of these "data Darbys" toward machine learning and turn them into hands-on advocates. It collects the 10 most commonly used machine learning algorithms, together with Python and R code for each.
Given that machine learning methods are being used ever more widely in modeling, the cheat sheet below can serve as a code reference to help you put these algorithms to work. Good luck!
For the truly lazy data Darbys, we will make life even easier: you can download a PDF version of the cheat sheet here and copy-paste the code directly.
Machine Learning Algorithms

Types:

- Supervised learning: Decision Tree, K-Nearest Neighbors, Random Forest, Logistic Regression
- Unsupervised learning: Apriori algorithm, K-Means, Hierarchical Clustering
- Reinforcement learning: Markov Decision Process, Q-Learning
Linear Regression
Python code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must
#be numeric and numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
#Create linear regression object
linear = linear_model.LinearRegression()
#Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)
R code:

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
#Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)
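To see this recipe run end to end, here is a minimal self-contained sketch of the Python version above. The synthetic dataset (make_regression) and the train/test split are illustrative assumptions added here, not part of the original cheat sheet.

#A minimal, runnable version of the linear regression recipe,
#using synthetic data in place of the placeholder variables
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

#Synthetic regression problem: 100 samples, 3 numeric features
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
print('Train R^2:', linear.score(x_train, y_train))
print('Coefficient:', linear.coef_)
print('Intercept:', linear.intercept_)
predicted = linear.predict(x_test)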
Logistic Regression
Python code:

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create logistic regression object
model = LogisticRegression()
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)
R code:

x <- cbind(x_train, y_train)
#Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output
predicted <- predict(logistic, x_test)
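As a quick sanity check, here is a minimal runnable sketch of the Python recipe; the synthetic binary classification data (make_classification) and the split are assumptions added for illustration.

#A minimal, runnable version of the logistic regression recipe
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Synthetic binary classification problem standing in for X / y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
print('Coefficient:', model.coef_)
print('Intercept:', model.intercept_)
predicted = model.predict(x_test)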
Decision Tree
Python code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
#for classification; the criterion can be set to gini or
#entropy (information gain), by default it is gini
#model = tree.DecisionTreeRegressor() for regression
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library
library(rpart)
x <- cbind(x_train, y_train)
#grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
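For a concrete run of the Python recipe, here is a minimal sketch; the iris dataset and the train_test_split call are illustrative assumptions, not part of the original cheat sheet.

#A minimal, runnable version of the decision tree recipe
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

#Iris stands in for the placeholder training/test data
iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = tree.DecisionTreeClassifier(criterion='gini')  #or criterion='entropy'
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)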
Support Vector Machine (SVM)
Python code:

#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create SVM classification object
model = svm.SVC()
#there are various options associated with it; this is
#a simple setup for classification
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library
library(e1071)
x <- cbind(x_train, y_train)
#Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
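Here is a minimal runnable sketch of the Python recipe above; using iris and a train/test split is an assumption added for illustration.

#A minimal, runnable version of the SVM recipe
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = svm.SVC()  #default RBF kernel; kernel, C and gamma are tunable
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)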
Naive Bayes
Python code:

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create Naive Bayes classification object
model = GaussianNB()
#there are other distributions for multinomial classes,
#like Bernoulli Naive Bayes
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library
library(e1071)
x <- cbind(x_train, y_train)
#Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
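A minimal sketch of the Python recipe, again assuming iris in place of the placeholder data:

#A minimal, runnable version of the Naive Bayes recipe
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = GaussianNB()  #Gaussian variant suits continuous features
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)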
K-Nearest Neighbors (KNN)
Python code:

#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)
#default value for n_neighbors is 5
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library (the knn() function lives in the 'class' package)
library(class)
#Fitting model: class::knn trains and predicts in one call,
#returning the predicted labels for x_test directly
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
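To run the Python recipe end to end, here is a minimal sketch; the iris dataset and the split are illustrative assumptions.

#A minimal, runnable version of the KNN recipe
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = KNeighborsClassifier(n_neighbors=6)  #default is 5 neighbors
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)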
K-Means Clustering
Python code:

#Import Library
from sklearn.cluster import KMeans
#Assumed you have X (attributes) for the training data set
#and x_test (attributes) of the test dataset
#Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
#Train the model using the training sets
k_means.fit(X)
#Predict Output
predicted = k_means.predict(x_test)
R code:

#Import Library
library(cluster)
fit <- kmeans(X, 3) #3 cluster solution
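For a concrete run of the Python recipe, here is a minimal sketch clustering synthetic data; make_blobs and the choice of three clusters are assumptions added for illustration.

#A minimal, runnable version of the K-Means recipe
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

#Three well-separated synthetic blobs stand in for X
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(X)
print('Cluster centers:\n', k_means.cluster_centers_)
predicted = k_means.predict(X)  #cluster index for each sample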
Random Forest
Python code:

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create Random Forest object
model = RandomForestClassifier()
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library
library(randomForest)
x <- cbind(x_train, y_train)
#Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
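Here is a minimal runnable sketch of the Python recipe above, with iris standing in for the placeholder data (an assumption for illustration):

#A minimal, runnable version of the random forest recipe
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)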
Dimensionality Reduction
Python code:

#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
#Create PCA object
pca = decomposition.PCA(n_components=k)
#default value of k = min(n_sample, n_features)
#For Factor analysis
#fa = decomposition.FactorAnalysis()
#Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
#Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
R code:

#Import Library
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
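As a concrete illustration of the Python recipe, here is a minimal sketch projecting the iris features onto two principal components; the dataset and the choice n_components=2 are illustrative assumptions.

#A minimal, runnable version of the PCA recipe
from sklearn import decomposition
from sklearn.datasets import load_iris

train = load_iris().data  #4 numeric features per sample
pca = decomposition.PCA(n_components=2)  #keep the top 2 components
train_reduced = pca.fit_transform(train)
print('Explained variance ratio:', pca.explained_variance_ratio_)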
Gradient Boosting (GBDT)
Python code:

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and Y (target) for the
#training data set and x_test (predictor) of the test dataset
#Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100,
    learning_rate=1.0, max_depth=1, random_state=0)
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:

#Import Library
library(caret)
x <- cbind(x_train, y_train)
#Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm",
             trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[,2]
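Finally, a minimal runnable sketch of the Python recipe, using iris and a train/test split as illustrative assumptions:

#A minimal, runnable version of the gradient boosting recipe
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = GradientBoostingClassifier(n_estimators=100,
    learning_rate=1.0, max_depth=1, random_state=0)
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
predicted = model.predict(x_test)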
Originally published: 2015-12-02