Comparing the similarity of two files' contents in Ruby

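1. Definition of similarity

Let T1 and T2 be the collections of strings obtained from the two texts, with |T1| = n and |T2| = m. Let C be the strings they have in common, with |C| = s. The similarity is then p = s / max(n, m), where p lies between 0 and 1.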

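2. Design of the similarity detection algorithm

Algorithm design:

Define a string as 4 characters and split T1 and T2 into such strings; if fewer than 4 characters remain at the end, pad them with spaces. Count the strings after splitting, recording |T1| = n and |T2| = m, and set s = 0. Take the first string from T1 and check whether it occurs in T2. If it does, add 1 to s and delete the matching string from T2, then check T2 again, repeating until T2 no longer contains that string. Then go back to T1, take the next string, and check it against T2 in the same way. Continue until every string in T1 has been checked or every string in T2 has been deleted, then stop and record s. Dividing s by the larger of n and m gives the similarity of T1 and T2. Run the check first with T1 as the template and then with T2 as the template, and take the smaller of the two similarity values.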
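The steps above can be traced on two short in-memory strings. This is a minimal sketch rather than the author's code: the helper names `chunks` and `similarity` are made up for illustration, and treating an empty input as similarity 0.0 is an assumption the description does not cover.

```ruby
# Hypothetical helpers for illustration; the names are not from the original post.

# Split text into 4-character strings, padding the last one with spaces.
def chunks(text, width = 4)
  text.chars.each_slice(width).map { |slice| slice.join.ljust(width) }
end

# One-directional similarity with t1 as the template.
def similarity(t1, t2)
  a1, a2 = chunks(t1), chunks(t2)
  n, m = a1.size, a2.size
  return 0.0 if n.zero? || m.zero? # assumption: empty input means no similarity
  s = 0
  a1.each do |chunk|
    before = a2.size
    a2.delete(chunk)               # removes every element equal to chunk
    s += before - a2.size
    break if a2.empty?
  end
  s / [n, m].max.to_f
end

p chunks("abcdefghij")                   # => ["abcd", "efgh", "ij  "]
p similarity("abcdefgh", "abcdXXXXefgh") # 2 matches out of max(2, 3)
```

Here T1 splits into 2 strings and T2 into 3, both strings of T1 are found in T2, so s = 2 and the similarity is 2/3.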

The Ruby implementation is as follows:

# Pad str with spaces so that its length is a multiple of i.
def fill_str(str, i = 4)
  return str if str.size % i == 0
  str + " " * (i - str.size % i)
end

# Similarity of two text files, using f0 as the template.
# Returns a Float, or nil if an error occurred.
def txt_cmp(f0, f1)
  a0 = fill_str(File.read(f0)).scan(/.{4}/m)
  a1 = fill_str(File.read(f1)).scan(/.{4}/m)
  n, m, s = a0.size, a1.size, 0
  a0.each do |txt|
    # Count every string in a1 equal to txt, then remove those strings.
    size = a1.size
    a1.reject! { |item| item == txt }
    s += size - a1.size
    break if a1.empty?
  end
  s / [n, m].max.to_f
rescue => e
  puts "error : #{e.message}\n" << e.backtrace[0..2].join("\n")
  nil
end

(puts "you must cmp 2 txt file"; exit) if ARGV.size != 2
f0, f1 = ARGV
r = txt_cmp(f0, f1)
exit 1 if r.nil? # the error has already been printed by txt_cmp
puts "#{f0} and #{f1} semblance is #{r * 100}%"
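The algorithm description calls for running the check in both directions and keeping the smaller value, while the listing above compares in one direction only. The following sketch shows one way that last step might look. It works on arrays that have already been split into 4-character strings, and `one_way` and `symmetric_similarity` are hypothetical names, not part of the original script.

```ruby
# Hypothetical sketch; the function names are illustrative only.

# One-directional score: how many strings in `other` equal some string in `template`.
def one_way(template, other)
  return 0.0 if template.empty? || other.empty? # assumption for empty input
  rest = other.dup                              # leave the caller's array intact
  matched = 0
  template.each do |chunk|
    before = rest.size
    rest.delete(chunk)                          # removes all equal elements
    matched += before - rest.size
    break if rest.empty?
  end
  matched / [template.size, other.size].max.to_f
end

# Run the check in both directions and keep the smaller value.
def symmetric_similarity(a, b)
  [one_way(a, b), one_way(b, a)].min
end

a = %w[AAAA AAAA AAAA abcd]
b = %w[AAAA abcd wxyz]
p one_way(a, b)              # 2 matches out of max(4, 3) => 0.5
p one_way(b, a)              # 4 matches out of max(3, 4) => 1.0
p symmetric_similarity(a, b) # => 0.5
```

The two directions can differ when one input repeats a string many times, which is why taking the minimum gives the more conservative figure.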

Below are four files, 1.txt, 2.txt, a.txt, and b.txt, with the following contents:

1.txt

NFC East rival quarterbacks Tony Romo(notes) of the Dallas Cowboys and Eli Manning(notes) of the New York Giants now have something else in common ḂẂ they've used the same wedding planner to help them tie the knot. Todd Fiscus, the man with the plan, set
up what he called "man food" at Dallas' Arlington Hall on Saturday, when Romo married former Miss Missouri Candace Crawford. "I have a lot of football players to feed," said Fiscus, who had pizza and short ribs on the menu.

However, Romo apparently put all the tunes together. "Tony picked out every song, and when it plays, and what the keynote things are," Fiscus said.

Sounds like a very orderly occasion, but there was one wild card ḂẂ whether Cowboys owner Jerry Jones would be able to attend. With the continued lockout, owners and players are not supposed to have any contact away from the negotiating table. But Jones received
special dispensation from the NFL to attend, just as the Green Bay Packers recently were informed that they will, in fact, receive their Super Bowl rings in a June 16 ceremony no matter what the labor situation is at that time. Jones was there along with virtually
all of Romo's teammates.

It is unknown whether Jones and Romo actually discussed any labor issues at the wedding ḂẂ we're guessing this was more of a "friendly", though Jones is one of the most powerful owners on the NFL's side of things and Romo's marquee value gives him a lot of
play on the other side.

"I've gotten special permission," Jones recently told ESPN's Ed Werder. "But more than anything, (I got the) right ticket from him and his fianceẀḊ ḂẂ Romo's wife-to-be. (It's) one of prettiest invitations I've ever seen.

"So, yes, I will be there and (I'm) proud for him. He's got the best end of this deal."

Romo, who had been linked romantically before with Jessica Simpson and Carrie Underwood, proposed to Crawford last December. Crawford's brother Chace is known for his role on the TV show "Gossip Girl' and has also been linked romantically with Underwood.

According to the new Mrs. Romo, the lockout may play a part in the couple's plans for a honeymoon; usually around this time of year, her husband would be participating in minicamps and other off-season workouts.

"This lockout has been quite a dent in the honeymoon idea," she told WFAA-TV. "We'll see. We haven't really gotten there yet. We're taking a day at a time with the lockout. We (are not) even sure if we're gonna get to go (on) one."

2.txt

Officially, Memorial Day, observed on the last Monday of May (this year it's May 30), honors the war dead. Unofficially, the day honors the start of summer. (More on that in a moment.)

The upcoming three-day weekend has prompted searches on Yahoo! for "when is memorial day," "what is memorial day," and "memorial day history." The day was originally known as "Decoration Day" because the day was dedicated to the Civil War dead, when mourners
would decorate gravesites as a remembrance.

The holiday was first widely observed on May 30, 1868, when 5,000 people helped decorate the gravesites of 20,000 Union and Confederate soldiers buried at Arlington National Cemetery. (Some parts of the South still remember members of the Confederate Army with
Confederate Memorial Day.)

After World War I, the observances were widened to honor the fallen from all American wars--and in 1971, Congress declared Memorial Day a national holiday.

Towns across the country now honor military personnel with services, parades, and fireworks. A national moment of remembrance takes place at 3 p.m. At Arlington National Cemetery, headstones are graced with small American flags.

This day is not to be confused with Veterans Day, which is observed on November 11 to honor military veterans, both alive and dead.

However, confusion abounds anyway, with the weekend marking for many the kickoff of summer, and it is reserved for weekend getaways, picnics, and sales. Searches on "memorial day sales," "memorial day recipes," and "memorial day weekend" are just some of the
lookups related to the festivities.

a.txt

23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf

b.txt

23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf

The test runs are as follows:

ruby -EISO-8859-14 txtcmp.rb 1.txt 2.txt
1.txt and 2.txt semblance is 8.653846153846153%

ruby txtcmp.rb a.txt b.txt

a.txt and b.txt semblance is 79.54545454545455%

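Because 1.txt contains non-UTF-8 characters, a comparison with the default encoding raises an error, so the external encoding is specified explicitly for that run.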
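As an alternative to passing `-E` on the command line, the encoding can be chosen when the file is read. The sketch below assumes the caller knows or can guess the file's encoding. `read_text` is a hypothetical helper, not part of the script above. When the bytes are invalid in the chosen encoding, it replaces them with `?` so that the regex scan no longer raises. Note that replaced bytes change the resulting strings and can therefore shift the similarity score.

```ruby
require "tempfile"

# Hypothetical helper: read a file in the given encoding and replace any byte
# sequences that are invalid in that encoding.
def read_text(path, encoding = "UTF-8")
  text = File.read(path, encoding: encoding)
  text.valid_encoding? ? text : text.scrub("?")
end

# Demo with a temporary file containing a byte that is not valid UTF-8.
Tempfile.create("sample") do |f|
  f.binmode
  f.write("caf\xE9 ok!".b)
  f.flush
  p read_text(f.path).scan(/.{4}/m)               # invalid byte replaced with "?"
  p read_text(f.path, "ISO-8859-1").scan(/.{4}/m) # byte read as a Latin-1 character
end
```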

Time: 2024-08-18 02:17:21
