How to Archive Open Source Materials - Bellingcat

2021-11-24 02:47:50 By : Ms. Amy Deng

Aric Toler started volunteering for Bellingcat in 2014 and has been employed there since 2015. He currently leads Bellingcat's training and research work, with a focus on Eurasia/Eastern Europe.


When conducting open source investigations, a constant problem is how to archive the material you are researching. For example, users may delete social media posts after you publish an investigation, or YouTube videos that show sensitive information (such as war crimes in Syria) may be removed under censorship policies set by YouTube.

There are two main reasons to archive all of the digital evidence you use in an investigation: to preserve it in case it is deleted from its original source, and to prove to your audience that the material, if it has since been deleted, did exist as you presented it. Screenshots are easy to fake, so it is important to preserve the material in a way that shows you had no opportunity to modify the content.

For most content, including social media posts, news reports, and other web pages, two main services are available: Archive.today and Archive.org. Both sites store copies of web pages on their own servers, where anyone with the URL can access them. Even better, both sites keep successive snapshots of a page, so each time you archive it you can see what changed, such as before and after a news article is edited. We generally recommend saving material on both sites to maximize the amount of archived content. We summarize below how each site works and how well it captures pages on many of the most popular social networking sites. In general, Archive.today is more versatile in saving social network pages because it saves pages through accounts created for these sites, while Archive.org can only see fully public pages that do not require an account.

Of the two main archive sites, Archive.today is the more versatile and the friendlier to social networking sites. However, it has not existed nearly as long as Archive.org, and because it is a much smaller operation, it should be considered less stable. In addition, because extremist content is sometimes shared through Archive.today links, the site has been banned in a number of countries. The site has alternate URLs (Archive.is, Archive.li, Archive.ch...) that let you bypass the censorship in some (but not all) countries, such as Russia, China, and Finland.

Pages on Archive.today are saved only in response to user requests; the site does not crawl pages automatically as Archive.org does. To save a page on this site, just enter the URL you want to save in the red box.

You can also archive pages from a bookmark in your browser's bookmarks bar, which gives you a one-click way to save a snapshot of the page you are currently on. To set this up, create a new bookmark in the bookmarks (or favorites) bar with the following as its URL:

javascript:void(open('https://archive.today/?run=1&url='+encodeURIComponent(document.location)))

Then click the newly created bookmark to save whichever page is open in the current browser tab.

Alternatively, drag the button on the Archive.today homepage to your bookmarks bar; there is no need to create the bookmark manually.
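The bookmarklet works by opening Archive.today's submission URL with the current page's address percent-encoded into the `url` parameter. As a minimal sketch, the same URL can be built outside the browser. The function name `archiveTodaySaveUrl` is hypothetical, and the `?run=1&url=` pattern is taken from the bookmarklet above rather than from any documented API, so it may change without notice.

```javascript
// Build the Archive.today submission URL that the bookmarklet opens.
// Assumption: the "?run=1&url=" pattern is copied from the bookmarklet;
// it is not a documented API.
function archiveTodaySaveUrl(pageUrl) {
  // encodeURIComponent escapes ":", "/", "?" and "&" so the target URL
  // survives as a single query-string value.
  return "https://archive.today/?run=1&url=" + encodeURIComponent(pageUrl);
}

console.log(archiveTodaySaveUrl("https://www.bellingcat.com/news/mena/"));
// https://archive.today/?run=1&url=https%3A%2F%2Fwww.bellingcat.com%2Fnews%2Fmena%2F
```

Opening the printed URL in a browser should have the same effect as clicking the bookmark while viewing that page.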

To check whether a URL has already been saved, enter it in the blue box on the same page.

If you are not sure of the exact URL, there are more advanced ways to search for saved pages. For example, to find all archived Bellingcat news articles tagged MENA (Middle East and North Africa), search for the Bellingcat URL path "/news/mena" followed by an asterisk.

The asterisk at the end of the URL matches every article on the Bellingcat site whose URL starts with "/news/mena", which covers all articles in the MENA section of our site.
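To make the wildcard concrete, the sketch below shows what a trailing asterisk means: any saved URL that begins with the text before the asterisk is a match. It only illustrates prefix matching locally. How Archive.today evaluates the search on its own servers is not described in this guide, and the function name `matchesWildcard`, the `example.org` host, and the list of saved URLs are all hypothetical.

```javascript
// Illustrates a trailing "*" in an archive search: every URL that starts
// with the text before the asterisk matches.
function matchesWildcard(pattern, url) {
  if (pattern.endsWith("*")) {
    return url.startsWith(pattern.slice(0, -1));
  }
  return url === pattern; // no asterisk: exact match only
}

// Hypothetical list of saved URLs, for illustration only.
const saved = [
  "example.org/news/mena/2018/01/01/story-a/",
  "example.org/news/uk-and-europe/2018/02/02/story-b/",
  "example.org/news/mena/",
];

// Keeps the first and third entries; the second is in a different section.
console.log(saved.filter((u) => matchesWildcard("example.org/news/mena*", u)));
```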

The results are a mix of articles saved manually by users who entered the URL and pages cross-referenced from the Archive.org database of saved pages. In some cases you can view multiple versions of the same page, since an article may change over time.

Another useful feature of Archive.today is the ability to save an entire page as an image, even if the page is very long. However, this should not replace the archive link itself, as a screenshot can be modified after it is saved.

Archive.today is relatively competent at archiving social media pages, but it is far from perfect. Sample archived pages from various social networks are shown below. As a general rule of thumb, if a social media page sits behind a privacy setting (such as "Only friends of friends can see" on Facebook), it is almost impossible to save it to a third-party archive site such as Archive.today or Archive.org.

In the following examples, the hyperlink for each social network leads to the page as saved on Archive.today.

It works quite well, but with limitations on photos and videos embedded in the post.

It works well, but with limitations on embedded content (such as photos, videos, and links) in tweets.

It works well, but with limitations on embedded photos and videos.

It works well, but with limitations on embedded photos and videos.

Only metadata and text can be saved, not the actual video.

The Internet Archive was founded in 1996. It has kept web page snapshots for more than two decades and has a considerable budget, giving it a stability that we cannot necessarily expect from Archive.today. Although Archive.org runs many fascinating projects, we are most interested in its Wayback Machine (web.archive.org), which lets users archive specific web pages and view snapshots taken by others.

As with Archive.today, the process of finding and saving web pages is simple. Search for a URL at the top of the page to see its saved snapshots, and enter a URL you want to save in the lower right corner.
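Both actions on the Wayback Machine homepage correspond to plain URLs, which is handy when you want to archive or look up many pages. The sketch below builds them under the assumption that the `/save/` and `/web/*/` path patterns on web.archive.org keep working as they do at the time of writing. The function names are hypothetical.

```javascript
const WAYBACK = "https://web.archive.org";

// URL that asks the Wayback Machine to capture a page now.
// Assumption: the "/save/<url>" pattern continues to be supported.
function waybackSaveUrl(pageUrl) {
  return `${WAYBACK}/save/${pageUrl}`;
}

// URL of the calendar view that lists every saved snapshot of a page.
// Assumption: the "/web/*/<url>" pattern continues to be supported.
function waybackHistoryUrl(pageUrl) {
  return `${WAYBACK}/web/*/${pageUrl}`;
}

console.log(waybackSaveUrl("https://www.bellingcat.com/"));
console.log(waybackHistoryUrl("https://www.bellingcat.com/"));
```

Unlike the Archive.today bookmarklet, the target address is appended as-is rather than percent-encoded, because it forms part of the path instead of a query parameter.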

Whereas Archive.today relies on users to submit pages, Archive.org saves pages both on user request and automatically through scripts. For example, since the domain was first registered in May 2014, Bellingcat's homepage has been saved more than 800 times, and probably only a small fraction of those saves came from user requests.

For saving ordinary web pages and news articles, Archive.org is usually better than Archive.today because it lets you click through to other archived pages. For example, using the Wayback Machine, you can browse most of the Bellingcat website as it was in 2014, with all pages preserved as they were nearly four years ago. On Archive.today, clicking through an archived page in this way is far less reliable.
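Browsing a site as it looked at a given moment relies on the timestamp embedded in Wayback Machine snapshot URLs. As a rough sketch, assuming the 14-digit `YYYYMMDDhhmmss` format in UTC and assuming the service redirects to the nearest available snapshot when no capture exists at that exact moment, a snapshot URL can be built like this (function names are hypothetical):

```javascript
// Format a Date as the 14-digit UTC timestamp (YYYYMMDDhhmmss) that appears
// in Wayback Machine snapshot URLs.
function waybackTimestamp(date) {
  // toISOString() yields e.g. "2014-12-31T00:00:00.000Z"; keep the first
  // 19 characters and strip the separators.
  return date.toISOString().slice(0, 19).replace(/[-:T]/g, "");
}

// Build a snapshot URL for a page at (or near) the given moment.
function waybackSnapshotUrl(pageUrl, date) {
  return `https://web.archive.org/web/${waybackTimestamp(date)}/${pageUrl}`;
}

// Month index 11 is December, so this is 31 December 2014, 00:00 UTC.
const endOf2014 = new Date(Date.UTC(2014, 11, 31));
console.log(waybackSnapshotUrl("https://www.bellingcat.com/", endOf2014));
// https://web.archive.org/web/20141231000000/https://www.bellingcat.com/
```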

Archive.org struggles with social networking sites a bit more than Archive.today, but it still has its uses.

It is suitable for completely public pages, but unlike Archive.today, it cannot access pages that require a Facebook account.

For Twitter, it works well, but with limitations on embedded content (such as photos, videos, and links) in tweets.

It is suitable for completely public pages, but unlike Archive.today, it cannot access pages that require a VK account.

It is suitable for completely public pages, but unlike Archive.today, it cannot access pages that require an OK account.

YouTube does not work well on the main Wayback Machine site, which struggles to save even a video's metadata and text.

However, Archive.org has a separate project called YouTube Crawl, which archives videos from YouTube with complete metadata. Details on how to participate in the project are available from Archive.org, but the process is more complicated than the simple one-click solutions on web.archive.org and archive.today.

As the previous sections make clear, neither Archive.org nor Archive.today can save photos and videos from Instagram and YouTube, and both have trouble saving photos from Facebook, VK, and other websites. For these sites, it is much harder to rely on a "neutral" third-party platform to host the media. Instead, we need to download the materials ourselves and then provide supplementary evidence (such as screenshots showing metadata, mirrored versions of the materials, etc.) to show that the images and videos are authentic.

There are many websites that can extract videos from YouTube, such as KeepVid and Y2Mate. Archiving videos from YouTube is not difficult, as long as you have enough hard drive or cloud space to store them. Make sure to take a screenshot of the metadata and save the page on Archive.today to preserve the title, upload date, and description, even though the video itself is not saved with the page.
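One way to keep those details together with a downloaded file is to write a small record next to it. The sketch below is only an illustration of that bookkeeping and does not replace the screenshot or the archived page, since a local record proves nothing to a third party. The function name, every field name, and the sample values are hypothetical.

```javascript
// Build a plain record of the details worth preserving alongside a
// downloaded video. All field names are hypothetical; adapt them freely.
function buildVideoRecord({ sourceUrl, title, uploadDate, description, archiveUrl }) {
  return {
    sourceUrl,
    title,
    uploadDate,
    description,
    archiveUrl: archiveUrl ?? null, // link to the saved page snapshot, if any
    recordedAt: new Date().toISOString(), // when this record was created
  };
}

// Placeholder values for illustration only.
const record = buildVideoRecord({
  sourceUrl: "https://www.youtube.com/watch?v=EXAMPLE",
  title: "Example title",
  uploadDate: "2018-01-01",
  description: "Example description",
});

console.log(JSON.stringify(record, null, 2));
```

Saving the printed JSON in the same folder as the video keeps the original address and upload details from being separated from the file.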

Unfortunately, archiving Instagram pages is very difficult. Generally, the best we can do is hope that the post has been mirrored on another website (there are many disreputable websites that "borrow" Instagram content and host it themselves) and manually save the image at full resolution.

To access photos on Instagram in full resolution, please use the following method:

To save videos from Instagram, you can use many sites similar to KeepVid, such as Gramblast and DreDown.

Downloading photos in high resolution is much easier on Facebook than on Instagram because the option is built into the website's user interface. Just click "Options" and then "Download" on the photo to pull it from Facebook's servers. The image may not be at the camera's original resolution, but it is the best resolution you can extract from Facebook itself.

Extracting videos from Facebook is a bit trickier, but still relatively easy. While watching a video, right-click it and select "Show Video URL" to get a direct link, which you can copy and paste into a third-party website to download the video.

As with YouTube and Instagram, you can use a number of third-party websites to pull videos from Facebook's servers in case the user who uploaded the material deletes it. FBDown.net works well, with almost no ads or pop-up windows. After pasting in the video URL copied from the original source, you can download the video at the highest available quality from the download link the site provides.

Saving a photo in full resolution on Vkontakte is very simple: just select "View Original" on the photo and you will be able to access it at the maximum available resolution. In fact, even if the user deletes the photo from their page, the VK URL hosting the full-resolution image remains live indefinitely.

Downloading videos from VK is a bit trickier than from YouTube, but there are many free (and paid) tools available. For example, GetVideo.org lets you download videos uploaded to VK in their original resolution. To get the video URL, right-click on the video and select "Copy Video Link".

Note that on GetVideo you should not click "Best Quality" but instead choose the highest specific resolution (e.g. 720p). Also be aware that downloads from this site are very slow.

On Odnoklassniki, the best way to capture photos at full or near-full resolution is to select "Full Screen" and then save or screenshot the image.

Fewer sites can extract videos from Odnoklassniki than from other social networks, but it is possible with services such as Video-Download.co.

In some cases, you cannot use the services discussed earlier to download a web page or video, either because it is restricted by privacy settings (which block access by sites such as Archive.today) or because it uses an obscure video playback platform that sites like KeepVid cannot extract from. All the solutions mentioned so far in this guide are free; however, there are other services, requiring paid or trial software, that can make your life easier. We are not telling you how to spend your money, but Bellingcat researchers have used (or, in one case, developed) the following solutions with some success.

Some software can extract videos from most video sites, even those that do not use YouTube or another popular platform. Although it is paid software, Apowersoft's Video Download Capture works with almost all embedded videos, including (in some cases) live streams. The software detects the video being played in the browser and can then (usually) download it from its original source. If you are trying to download a particular video and cannot find any other solution, the software's free trial may be worth a try. If you are unable to use the free trial or do not want to purchase the software, please contact the author of this article via Twitter (@AricToler) for help downloading specific videos.

For web pages behind privacy settings, it is difficult to find any way to create a trusted third-party archived copy. Saving pages directly in HTML format is notoriously messy and creates many subfolders on your hard drive. Another option is to save the page as a PDF file by printing it to PDF (File -> Print -> Print as PDF), or to use Adobe Create to convert the web page to PDF.

That said, the contents of these pages can still be modified in the PDF. Currently, the most trusted (though still imperfect) way to show the contents of a privacy-locked page may be to record your screen while you visit the page; a number of simple screen-recording solutions are available.

Finally, if you do a lot of online research and want an automated way to retrace your steps, consider Hunch.ly, developed by Bellingcat contributor and Python wizard Justin Seitz. Once activated, this plugin automatically stores every web page you visit while conducting an investigation. If one of these pages is later deleted and you did not archive it, Hunch.ly will still have your saved copy.

Do you know of any other websites or resources for archiving web pages, images, and videos? Tell us in the comments so that we can add them to this guide.
