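Alibaba Cloud Function Compute recently added a Java 8 environment, so I wrote a Function Compute example in Java. The example simulates a website image crawler: it fetches every image on a given page of a given website and saves the images to Object Storage Service (OSS). I drew a simple architecture diagram, shown below:
Walkthrough:
The user supplies a website address and deploys the crawler to Function Compute. When the function runs, it downloads the site's images to local storage and then uploads them to OSS over the internal network. Two pieces of code are involved: one crawls the images from the site, and the other uploads the images to OSS (omitted here). Following the diagram above, let's look at how the code is put together.
- Code executed on Function Compute: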
/*
 * Created on 2017-9-16
 *
 * @author fuhw
 */
package com.aliyun.function.crawler;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import com.aliyun.fc.runtime.Context;
import com.aliyun.fc.runtime.StreamRequestHandler;
public class index implements StreamRequestHandler {
    // Page to crawl; the same address that catchImg.main uses for local runs.
    private static final String URL = "http://www.csdn.net";

    @Override
    public void handleRequest(InputStream inputStream,
            OutputStream outputStream, Context context) throws IOException {
        String result;
        try {
            catchImg cm = new catchImg();
            String HTML = cm.getHTML(URL);                 // fetch the page source
            List<String> imgUrl = cm.getImageUrl(HTML);    // collect the <img> tags
            List<String> imgSrc = cm.getImageSrc(imgUrl);  // extract the src addresses
            cm.Download(imgSrc);                           // save the images locally
            result = "download image is OK!";
        } catch (Exception e) {
            // Report the failure instead of always answering "OK".
            System.out.println("fail download image! " + e);
            result = "fail download image!";
        }
        outputStream.write(result.getBytes());
    }
}
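Before uploading the jar, it can help to exercise a stream-style handler locally with in-memory streams. The following is only a minimal sketch under stated assumptions: `StreamHandler` is a hypothetical stand-in for the input/output part of `StreamRequestHandler` (so the example compiles without fc-java-core), and the `"{}"` event payload is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LocalInvokeSketch {
    /** Hypothetical stand-in for the (input, output) part of StreamRequestHandler. */
    interface StreamHandler {
        void handle(InputStream in, OutputStream out) throws IOException;
    }

    /** Feeds an event payload to the handler and returns what the handler wrote. */
    static String invoke(StreamHandler handler, String event) {
        ByteArrayInputStream in =
                new ByteArrayInputStream(event.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            handler.handle(in, out);
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder handler that only writes a fixed response.
        StreamHandler handler =
                (in, out) -> out.write("download image is OK!".getBytes("UTF-8"));
        System.out.println(invoke(handler, "{}"));
    }
}
```

With fc-java-core on the classpath, the same idea should apply to the real class, for example by calling `new index().handleRequest(in, out, null)`; passing `null` for the context is an assumption that only holds because this handler never reads it.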
- Crawler code:
package com.aliyun.function.crawler;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class catchImg {
    // Address of the page to crawl
    private static final String URL = "http://www.csdn.net";
    // Character encoding
    private static final String ECODING = "UTF-8";
    // Regex that matches <img> tags
    private static final String IMGURL_REG = "<img.*src=(.*?)[^>]*?>";
    // Regex that matches the src path (only addresses that start with "http:")
    private static final String IMGSRC_REG = "http:\"?(.*?)(\"|>|\\s+.(gif|png|jpg|bmp|jpeg|tif|tiff))";

    public static void main(String[] args) throws Exception {
        catchImg cm = new catchImg();
        // Fetch the HTML text
        String HTML = cm.getHTML(URL);
        // Collect the image tags
        List<String> imgUrl = cm.getImageUrl(HTML);
        // Extract the image src addresses
        List<String> imgSrc = cm.getImageSrc(imgUrl);
        // Download the images
        cm.Download(imgSrc);
    }
    /**
     * Fetch the HTML content.
     *
     * @param url address of the page
     * @return the page source as text
     * @throws Exception if the page cannot be read
     */
    public String getHTML(String url) throws Exception {
        URL uri = new URL(url);
        URLConnection connection = uri.openConnection();
        InputStream in = connection.getInputStream();
        // Collect the raw bytes first and decode once at the end, so that only the
        // bytes actually read are used and multi-byte characters are not split.
        java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
        try {
            byte[] buf = new byte[1024];
            int length;
            while ((length = in.read(buf, 0, buf.length)) != -1) {
                bytes.write(buf, 0, length);
            }
        } finally {
            in.close();
        }
        return bytes.toString(ECODING);
    }
    /**
     * Collect the image tags.
     *
     * @param HTML page source
     * @return every <img> tag found in the page
     */
    public List<String> getImageUrl(String HTML) {
        Matcher matcher = Pattern.compile(IMGURL_REG).matcher(HTML);
        List<String> listImgUrl = new ArrayList<String>();
        while (matcher.find()) {
            listImgUrl.add(matcher.group());
        }
        return listImgUrl;
    }

    /**
     * Extract the image src addresses.
     *
     * @param listImageUrl image tags returned by getImageUrl
     * @return the src address of each tag
     */
    public List<String> getImageSrc(List<String> listImageUrl) {
        List<String> listImgSrc = new ArrayList<String>();
        for (String image : listImageUrl) {
            Matcher matcher = Pattern.compile(IMGSRC_REG).matcher(image);
            while (matcher.find()) {
                // Drop the last matched character (the closing quote or '>').
                String str = matcher.group().substring(0,
                        matcher.group().length() - 1);
                listImgSrc.add(str);
            }
        }
        return listImgSrc;
    }
    /**
     * Download the images.
     *
     * @param listImgSrc image addresses to download
     */
    public void Download(List<String> listImgSrc) {
        for (String url : listImgSrc) {
            String imageName = url.substring(url.lastIndexOf("/") + 1);
            System.out.println("Start : " + url);
            // Writes to the working directory; for a run where that directory is
            // not writable, use the /tmp/ variant instead:
            // new FileOutputStream("/tmp/" + imageName)
            try (InputStream in = new URL(url).openStream();
                    FileOutputStream fo = new FileOutputStream(new File(imageName))) {
                byte[] buf = new byte[1024];
                int length;
                while ((length = in.read(buf, 0, buf.length)) != -1) {
                    fo.write(buf, 0, length);
                }
            } catch (Exception e) {
                // Report the failure and continue with the next image.
                System.out.println("fail download " + url + ": " + e);
            }
        }
    }
}
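The two regular expressions above only pick up addresses that start with `http:`, and the file name is taken as everything after the last `/`. As a point of comparison, here is a minimal sketch of a single-pass alternative. It is not the code used in this post: the pattern, the class name `ImgSrcSketch`, the helper names, and the example.com addresses are all assumptions made for illustration. It captures the quoted `src` value directly, accepts both `http://` and `https://`, and strips a query string before deriving the file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcSketch {
    // Hypothetical pattern: captures the quoted src value of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http/https image addresses found in the HTML. */
    static List<String> extractImageSrc(String html) {
        List<String> result = new ArrayList<String>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1);
            // Relative addresses are skipped; resolving them is out of scope here.
            if (src.startsWith("http://") || src.startsWith("https://")) {
                result.add(src);
            }
        }
        return result;
    }

    /** Derives a file name from the address, ignoring any query string. */
    static String fileNameOf(String url) {
        int q = url.indexOf('?');
        String path = q >= 0 ? url.substring(0, q) : url;
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String html = "<p><img class=\"a\" src=\"http://example.com/img/logo.png\" alt=\"x\">"
                + "<img src='https://example.com/pic/banner.jpg?v=2'>"
                + "<img src=\"/relative/skip.gif\"></p>";
        for (String src : extractImageSrc(html)) {
            System.out.println(src + " -> " + fileNameOf(src));
        }
    }
}
```

Regular expressions remain a rough tool for HTML: unquoted attributes or unusual markup would still be missed, so an HTML parser may be the safer choice for anything beyond a demo.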
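- Notes:
1. When debugging the code in a local Java environment, the project needs two packages:
1) aliyun-java-sdk-fc: http://search.maven.org/#search%7Cga%7C1%7Caliyun-java-sdk-fc
2) fc-java-core: http://search.maven.org/#search%7Cga%7C1%7Cfc-java-core
2. For the code that uploads the images to OSS, see: https://help.aliyun.com/document_detail/32013.html
3. In the console, the function handler is written as com.aliyun.function.crawler.index::handleRequest. The format is: package name + entry class name::entry method name
4. Java is a compiled language, so the code has to be compiled locally, packaged as a jar, and uploaded through the Function Compute console. There are two ways to build the jar. One is through the Eclipse UI: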
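The other is the jar command line, run from the root of the compiled class output so that the package directories are kept inside the jar: jar -cvf fc.jar com/aliyun/function/crawler/catchImg.class com/aliyun/function/crawler/index.class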
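5. Two things need attention when writing the function: first, Java code cannot be written and compiled online in the console; second, the handler name must be written correctly, as annotated in the figure below:
- Run it and check the result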
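To make the handler format from note 3 and the jar layout from note 4 concrete, the sketch below splits a handler string at `::` and shows which entry a JVM class loader would look for inside the jar. It is only an illustration: `HandlerNameSketch`, `parseHandler`, and `jarEntryFor` are hypothetical names, and this is not how Function Compute itself parses the setting.

```java
class HandlerNameSketch {
    /** Splits "package.Class::method" into {className, methodName}. */
    static String[] parseHandler(String handler) {
        int sep = handler.indexOf("::");
        if (sep <= 0 || sep + 2 >= handler.length()) {
            throw new IllegalArgumentException(
                    "expected <package>.<class>::<method>, got: " + handler);
        }
        return new String[] { handler.substring(0, sep), handler.substring(sep + 2) };
    }

    /** Path at which a class with this fully qualified name is stored in a jar. */
    static String jarEntryFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        String[] parts = parseHandler("com.aliyun.function.crawler.index::handleRequest");
        System.out.println("class     = " + parts[0]);
        System.out.println("method    = " + parts[1]);
        System.out.println("jar entry = " + jarEntryFor(parts[0]));
    }
}
```

The jar entry path is the reason the class files need to sit under com/aliyun/function/crawler/ inside the jar rather than at its root.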