Wayback – CAT EARS

Changing Wayback's calendar display format

ICT, tips, Wayback

July 18, 2012

If you set the parser's earliestTimestamp to a date in the future, only 12 months are displayed.
Wayback – Administrators Manual

[xml title="wayback.xml" mark="3"] [/xml]

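Setting earliestTimestamp to a future value such as 2030 has a drawback: at the beginning of a year, the previous month is not displayed.
Changing the captureJsp property of org.archive.wayback.query.Renderer lets you switch the calendar that is used.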

[xml title="wayback.xml" mark="3"] [/xml]

Running with wayback.urlprefix=http://localhost:8080/wayback/

ICT, tips, Wayback

July 13, 2012

Set the name of the bean with class="org.archive.wayback.webapp.AccessPoint" to "8080" and it works.
To run on port 80, set it to "80" instead.

SourceForge.net: Web Archive Access Utilities: archive-access-discuss

Date and time display is forced to UTC

ICT, tips, Wayback

July 13, 2012

UTC turned out to be forced. It is hard-coded... It could be changed by editing the JSPs, though...

StringFormatter (Wayback 1.7.0 API)

Note that date formatting done through this class forces all times to the UTC timezone – at the moment it appears too confusing to attempt to localize times in any other way..
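As an illustration of what localizing in a JSP would involve, here is a minimal sketch using java.time. It assumes the 14-digit yyyyMMddHHmmss timestamp form that appears in Wayback URLs and index keys, interpreted as UTC. The class and method names are hypothetical and are not part of the Wayback API.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Wayback.
class TimestampZoneDemo {
    // Assumption: capture timestamps are 14 digits (yyyyMMddHHmmss) in UTC.
    private static final DateTimeFormatter WAYBACK_TS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    // "xxx" prints the numeric offset, so the output does not depend on the locale.
    private static final DateTimeFormatter DISPLAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss xxx");

    // Parse the timestamp as UTC, then render the same instant in the given zone.
    static String format(String ts14, ZoneId zone) {
        ZonedDateTime utc = LocalDateTime.parse(ts14, WAYBACK_TS).atZone(ZoneOffset.UTC);
        return utc.withZoneSameInstant(zone).format(DISPLAY);
    }

    public static void main(String[] args) {
        String ts = "20111108054107";
        System.out.println(format(ts, ZoneOffset.UTC));           // 2011-11-08 05:41:07 +00:00
        System.out.println(format(ts, ZoneId.of("Asia/Tokyo")));  // 2011-11-08 14:41:07 +09:00
    }
}
```

The same instant is shown nine hours later in Asia/Tokyo, which is the kind of conversion StringFormatter deliberately avoids.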

Disabling the Wayback toolbar

ICT, tips, Wayback

July 13, 2012

This can be done by removing the Toolbar.jsp include from DisclaimChooser.jsp.

[xml title="DisclaimChooser.jsp" mark="13"]
<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8" %><%@ page import="org.archive.wayback.core.UIResults" %><%@ page import="org.archive.wayback.core.WaybackRequest" %><% UIResults results = UIResults.extractReplay(request); WaybackRequest wbr = results.getWbRequest(); if(wbr.isLiveWebRequest()) { %><% } else { %><% }%>
[/xml]

When no data exists for the specified timestamp, Wayback looks up another capture instead

ICT, tips, Wayback

July 13, 2012

This appears to be by design, so I gave up.

Wayback – Administrators Manual

Archival URL mode allows replay of all versions captured of a particular URL, by modifying the Timestamp. When an Archival URL Replay request is received for a URL, the Wayback Machine will replay the closest version in time to the Timestamp requested of the particular URL.
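To make the "closest version in time" rule concrete, here is a minimal sketch that picks the nearest capture from a list of 14-digit timestamps. It is an illustration only, not Wayback's actual implementation. The class and method names are hypothetical, and tie-breaking between equally distant captures is an arbitrary choice here.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Wayback code.
class ClosestCaptureDemo {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Returns the capture whose timestamp is nearest to the requested one
    // (the first such capture wins on a tie), or null if the list is empty.
    static String closest(List<String> captures, String requested) {
        LocalDateTime want = LocalDateTime.parse(requested, TS);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (String capture : captures) {
            LocalDateTime have = LocalDateTime.parse(capture, TS);
            long distance = Math.abs(Duration.between(want, have).getSeconds());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = capture;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> captures = Arrays.asList("20111108054107", "20111128074440");
        // No capture exists for 2012-01-01, so the nearest one is replayed.
        System.out.println(closest(captures, "20120101000000")); // 20111128074440
    }
}
```

This is why a request for a timestamp with no data still replays something: an exact match is not required.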

When harvest results contain both 401 and 200 status records, Wayback displays the 401 result

ICT, tips, Wayback, Web Curator Tool

July 13, 2012

This can be avoided by configuring the Web Curator Tool profile so that 401 records are not written out.

Example workaround:

Add the following under
WebCuratorTool > Management > profile > Edit > Writers > org.archive.crawler.writer.ARCWriterProcessor > Archiver#decide-rules

class org.archive.crawler.deciderules.FetchStatusDecideRule
decision REJECT
target-status 401

Harvesting a site that requires BASIC authentication fails

ICT, tips, Wayback

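July 13, 2012

Because max-retries defaults to 3, the crawl ends after these three attempts:

robots.txt returns 401
robots.txt returns 404
seed returns 401

Setting it to 4 solves the problem.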
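The arithmetic can be sketched with a toy model. It assumes that every failed step listed above is charged against the same retry budget as the seed, and that the crawler gives up once the count reaches max-retries. This is a simplification for illustration, not Heritrix code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the retry budget, not Heritrix code.
class RetryBudgetDemo {
    // Returns true if the crawler still has an attempt left after all the
    // failed steps, i.e. it can go on to fetch the seed with credentials.
    static boolean seedFetched(List<String> failedSteps, int maxRetries) {
        int attempts = 0;
        for (String step : failedSteps) {
            attempts++;
            if (attempts >= maxRetries) {
                return false; // budget exhausted at this step
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> failedSteps = Arrays.asList(
                "robots.txt 401", "robots.txt 404", "seed 401");
        System.out.println(seedFetched(failedSteps, 3)); // false
        System.out.println(seedFetched(failedSteps, 4)); // true
    }
}
```

With three failed steps, a budget of 3 is used up before the authenticated request is made, while a budget of 4 leaves exactly one attempt for it.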

[#HER-1376] login/auth/credential functionality overly sensitive to ‘max-retries’; improve robustness/error-reporting – IA Webteam JIRA

Thanks for the details – I think the real culprit is this setting, for a non-intuitive reason:

[xml]3[/xml]

Indeed, if I lower my max-retries to 3 I can reproduce the problem.

So a quick workaround: increase your max-retries to 4. You’ll still be running a bit close to the edge – a momentary problem affecting DNS/robots/URI fetching might still push it over the limit – but in a usual situation you’ll succeed.

Another workaround: any other URI against the same site scheduled first would trigger the DNS and robots tries – so when the authentication-needing URI comes up, it would have all its tries left.

How to delete previously harvested data

BarkleyDB, ICT, Java, tips, Wayback

July 13, 2012

Data is merged in the order FileStore → LocationDB → ResourceIndex.

Wayback – Resource Store Configuration

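By default, the ResourceIndex uses Berkeley DB Java Edition (BDB).
Editing it directly is somewhat laborious.

Entries can be deleted by using the Wayback API (the jars in the lib folder).

Use ArcIndexer to extract CaptureSearchResult objects from the arc file
Convert each CaptureSearchResult to a BDBRecord and get its key
Use BDBRecordSet to delete the index entry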

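[java title="Example (note: no transaction is configured)"]
// Package names are assumed from the Wayback 1.7.0 jars; verify them against your version.
import org.archive.wayback.UrlCanonicalizer;
import org.archive.wayback.core.CaptureSearchResult;
import org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter;
import org.archive.wayback.resourcestore.indexer.ArcIndexer;
import org.archive.wayback.util.CloseableIterator;
import org.archive.wayback.util.bdb.BDBRecord;
import org.archive.wayback.util.bdb.BDBRecordSet;
import org.archive.wayback.util.url.AggressiveUrlCanonicalizer;

public class DeleteIndexEntries {
    public static void main(String[] args) throws Exception {
        BDBRecordSet rs = new BDBRecordSet();
        try {
            // Open the BDB directory and the database that backs the ResourceIndex.
            rs.initializeDB("D:\\DL\\wct\\bdb", "DB1");
            //System.out.println(rs.get("example.com/css/disastercenter.css 20111108054107 1314431 20111108054041-00000.ver1.arc"));

            // Iterate over every capture recorded in the arc file.
            CloseableIterator<CaptureSearchResult> itr = new ArcIndexer().iterator("D:\\DL\\wct\\20111128074440-00000.ver1.arc");
            UrlCanonicalizer canonicalizer = new AggressiveUrlCanonicalizer();
            SearchResultToBDBRecordAdapter adapter = new SearchResultToBDBRecordAdapter(canonicalizer);
            while (itr.hasNext()) {
                CaptureSearchResult result = itr.next();
                // Rebuild the index key for this capture, then delete that entry.
                BDBRecord r = adapter.adapt(result);
                String key = BDBRecordSet.bytesToString(r.getKey().getData());
                System.out.println(key);
                rs.delete(key);
            }
        } catch (Throwable t) {
            t.printStackTrace();
        } finally {
            rs.shutdownDB();
        }
    }
}
[/java]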
