July 18, 2012
 
    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
        <property name="earliestTimestamp" value="2030" />
  • Setting earliestTimestamp to a future year such as 2030 has a drawback: at the start of a year, the previous month's captures are not displayed.

  • Changing the captureJsp property of org.archive.wayback.query.Renderer switches which calendar view is used.

    <property name="query">
      <bean class="org.archive.wayback.query.Renderer">
        <property name="captureJsp" value="/WEB-INF/query/OldCalendarResults.jsp" />
July 13, 2012
 

This can be handled by no longer including Toolbar.jsp from DisclaimChooser.jsp.

<%@
 page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"
%><%@
 page import="org.archive.wayback.core.UIResults"
%><%@
 page import="org.archive.wayback.core.WaybackRequest"
%><%
UIResults results = UIResults.extractReplay(request);
WaybackRequest wbr = results.getWbRequest();
if(wbr.isLiveWebRequest()) {
    %><jsp:include page="/WEB-INF/replay/LiveWebDisclaimer.jsp" flush="true" /><%
} else {
    %><jsp:include page="/WEB-INF/replay/Toolbar.jsp" flush="true" /><%
}%>
July 13, 2012
 

This appears to be by design, so I gave up on changing it.

Wayback – Administrators Manual

Archival URL mode allows replay of all versions captured of a particular URL, by modifying the Timestamp. When an Archival URL Replay request is received for a URL, the Wayback Machine will replay the closest version in time to the Timestamp requested of the particular URL.
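The "closest version in time" selection described in the manual can be illustrated with a small sketch. This is not Wayback's own code: the class and method names are made up for illustration, timestamps are assumed to be full 14-digit yyyyMMddHHmmss strings, and ties simply go to the first capture in the list.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Wayback: picks the capture timestamp
// nearest to the requested one.
public class ClosestCapture {
    // Assumes full 14-digit timestamps such as 20111128074440.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    public static String closest(List<String> captures, String requested) {
        LocalDateTime target = LocalDateTime.parse(requested, FMT);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (String ts : captures) {
            LocalDateTime captured = LocalDateTime.parse(ts, FMT);
            // Distance in seconds, regardless of direction (before or after).
            long distance = Math.abs(Duration.between(captured, target).getSeconds());
            if (distance < bestDistance) {
                best = ts;
                bestDistance = distance;
            }
        }
        return best; // null when the list is empty
    }

    public static void main(String[] args) {
        List<String> captures = Arrays.asList(
                "20111108054107", "20111128074440", "20120713000000");
        System.out.println(closest(captures, "20111201000000"));
    }
}
```

With these sample captures, a request for 20111201000000 resolves to 20111128074440, the capture about three days earlier.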

July 13, 2012
 

This can be avoided by configuring the WebCuratorTool profile so that 401 records are not written out.

Example workaround:

Add the following under
WebCuratorTool > Management > profile > Edit > Writers > org.archive.crawler.writer.ARCWriterProcessor > Archiver#decide-rules:

  • class org.archive.crawler.deciderules.FetchStatusDecideRule
  • decision REJECT
  • target-status 401
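
The effect of the three settings above can be pictured as a simple status check. The sketch below is only an illustration with made-up names (StatusRejectSketch, Decision, decide); it is not Heritrix's FetchStatusDecideRule source, and it assumes that any status other than the target falls through as PASS.

```java
// Hypothetical illustration of a "reject when the fetch status equals the
// target status" rule. Names are invented for this sketch.
public class StatusRejectSketch {
    public enum Decision { PASS, REJECT }

    // Returns REJECT only when the fetch status matches the configured
    // target status (401 in the profile above); otherwise PASS is assumed.
    public static Decision decide(int fetchStatus, int targetStatus) {
        return fetchStatus == targetStatus ? Decision.REJECT : Decision.PASS;
    }

    public static void main(String[] args) {
        System.out.println(decide(401, 401)); // record would not be written
        System.out.println(decide(200, 401)); // record is left to the writer
    }
}
```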
July 13, 2012
 

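Because max-retries defaults to 3, the crawl ends after the following three attempts:

  1. 401 on robots.txt
  2. 404 on robots.txt
  3. 401 on the seed

Setting max-retries to 4 solves the problem.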
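
The counting can be sketched as a tiny budget check. This is only an illustration of the arithmetic, not Heritrix's retry logic: the class and method names are made up, and it assumes each listed failure consumes exactly one try from the seed's budget.

```java
// Hypothetical illustration: every failed attempt uses up one try, and the
// retry with credentials only happens if a try is still left afterwards.
public class RetryBudgetSketch {
    public static boolean canRetryWithCredentials(int maxRetries, String... failedAttempts) {
        int used = failedAttempts.length;
        return used < maxRetries;
    }

    public static void main(String[] args) {
        String[] failures = {
            "401 on robots.txt", "404 on robots.txt", "401 on the seed"
        };
        System.out.println(canRetryWithCredentials(3, failures)); // budget exhausted
        System.out.println(canRetryWithCredentials(4, failures)); // one try left
    }
}
```

With three failures, a budget of 3 leaves nothing for the authenticated retry, while a budget of 4 leaves exactly one try.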

[#HER-1376] login/auth/credential functionality overly sensitive to ‘max-retries’; improve robustness/error-reporting – IA Webteam JIRA


Thanks for the details – I think the real culprit is this setting, for a non-intuitive reason:

<integer name="max-retries">3</integer>

Indeed, if I lower my max-retries to 3 I can reproduce the problem.

So a quick workaround: increase your max-retries to 4. You’ll still be running a bit close to the edge – a momentary problem affecting DNS/robots/URI fetching might still push it over the limit – but in a usual situation you’ll succeed.

Another workaround: any other URI against the same site scheduled first would trigger the DNS and robots tries – so when the authentication-needing URI comes up, it would have all its tries left.

July 13, 2012
 

Data is merged in the order FileStore → LocationDB → ResourceIndex.

Wayback – Resource Store Configuration

By default, the ResourceIndex uses BerkeleyDB Java Edition (BDB).
Editing it directly is somewhat laborious.

Entries can be deleted by using the Wayback API (the jars in the lib folder).

  1. Use ArcIndexer to extract CaptureSearchResult objects from the ARC file
  2. Convert each CaptureSearchResult to a BDBRecord and get its key
  3. Use BDBRecordSet to delete the index entry
BDBRecordSet rs = new BDBRecordSet();
try{
    // Open the BDB directory and database that back the ResourceIndex.
    rs.initializeDB("D:\\DL\\wct\\bdb", "DB1");
    //System.out.println(rs.get("example.com/css/disastercenter.css 20111108054107 1314431 20111108054041-00000.ver1.arc"));

    // Step 1: extract one CaptureSearchResult per record in the ARC file.
    CloseableIterator<CaptureSearchResult> itr = new ArcIndexer().iterator("D:\\DL\\wct\\20111128074440-00000.ver1.arc");
    UrlCanonicalizer canonicalizer = new AggressiveUrlCanonicalizer();
    SearchResultToBDBRecordAdapter adapter = new SearchResultToBDBRecordAdapter(canonicalizer);
    while(itr.hasNext()){
        CaptureSearchResult result = itr.next();
        // Step 2: convert the result to a BDBRecord and take its key.
        BDBRecord r = adapter.adapt(result);
        String key = BDBRecordSet.bytesToString(r.getKey().getData());
        System.out.println(key);

        // Step 3: delete the index entry stored under that key.
        rs.delete(key);
    }
}catch(Throwable t){
    t.printStackTrace();
}finally{
    // Always close the BDB environment, even if indexing or deletion failed.
    rs.shutdownDB();
}