Extracting Image Metadata from the Wayback Machine - Bellingcat

2021-11-24 02:58:29 By : Mr. Wu Justin

Justin Seitz is a Canadian security consultant and the author of two computer hacking books published by No Starch Press. He blogs on AutomatingOSINT.com and can be found on Twitter @jms_dot_py.

This article was originally published on the AutomatingOSINT.com blog.

Not long ago, I became very interested in the Oct282011.com Internet mystery (if you haven't heard of it, check out this podcast). Some friends on the Hunchly mailing list and I set off on a short journey to see whether we could find any additional clues or, of course, solve the mystery. One of the main sources of information for the investigation was the Wayback Machine, which is a popular resource for many investigations.

For this particular investigation, there were many strange images scattered around as clues. I wanted to know whether it was possible to retrieve these images from the Wayback Machine and then check their EXIF data for the author's details or other tasty nuggets of information. Of course, I wasn't going to do this manually, so it seemed like a great opportunity to build a new tool to do it for me.

We will use a couple of great tools to pull this off. The first is a Python module written by Jeremy Singer-Vine called waybackpack. Although you can use waybackpack as a standalone tool on the command line, in this blog post we will simply import it and use parts of it to interact with the Wayback Machine. The second tool is Phil Harvey's ExifTool. When it comes to extracting EXIF information from photos, this little beauty is the gold standard and is trusted the world over.

Our goal is to pull down all the images for a specific URL on the Wayback Machine, extract any EXIF data, and then output all the information to a spreadsheet that we can review.

This post has a few moving parts, so let's get the boring stuff out of the way first.

On Ubuntu-based Linux, you can do the following:

Mac OSX users can use Phil's installer here.

For people on Windows, you must do the following:

Install the necessary Python libraries

Now we are ready to install the various Python libraries we need:

pip install bs4 requests pandas pyexifinfo waybackpack
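Before going further, it can save some head-scratching to confirm that everything installed cleanly. The following is a minimal sketch and not part of the original script: it assumes the ExifTool executable is named exiftool and sits on your PATH, and missing_requirements is a name made up for this example.

```python
import importlib.util
import shutil

# Module names follow the pip command above. "exiftool" is the usual name of
# the ExifTool executable (an assumption about your install).
REQUIRED_MODULES = ("bs4", "requests", "pandas", "pyexifinfo", "waybackpack")
REQUIRED_TOOLS = ("exiftool",)


def missing_requirements(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return the names of modules and executables that cannot be found."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    missing += [name for name in tools if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("[!] Missing: " + ", ".join(problems))
    else:
        print("[*] Everything needed is installed")
```

If anything is reported missing, revisit the matching install step before moving on.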

Okay, let's get started.

Now open a new Python file, name it waybackimages.py (download the source code here) and start typing (with both hands) the following:

Nothing is too surprising here. We just import all the necessary modules, set the target URL, and then create a directory for all the images to be stored.
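The original listing is not reproduced on this page, so here is a minimal sketch of what that setup step might look like. The target URL is the site discussed in this article; the folder name waybackimages is an assumption borrowed from the script's filename.

```python
import os

# The third-party modules from the pip command above would be imported here
# as well: bs4, pandas, pyexifinfo, requests and waybackpack.

# The site under investigation in this article.
url = "http://www.oct282011.com/"

# Assumed folder name for the downloaded images.
image_dir = "waybackimages"

# Create the folder if it does not exist yet.
os.makedirs(image_dir, exist_ok=True)
```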

Now let's implement the first function, which will be responsible for querying the Wayback Machine to obtain all unique snapshots of the target URL:
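As a rough sketch of how such a search function could look with waybackpack: this assumes that waybackpack.search accepts a uniques_only flag and returns one dictionary per capture with a "timestamp" key, and that waybackpack.Pack accepts a list of timestamps. Please check both against the version you have installed. The helper unique_timestamps is a name invented for this example.

```python
def unique_timestamps(snapshots):
    """Return the snapshot timestamps in order, dropping any repeats."""
    seen = set()
    timestamps = []
    for snapshot in snapshots:
        timestamp = snapshot["timestamp"]
        if timestamp not in seen:
            seen.add(timestamp)
            timestamps.append(timestamp)
    return timestamps


def search_archive(url):
    """Ask the Wayback Machine for the unique captures of a URL."""
    # Third-party module, imported here so the helper above can be used
    # and tested without it.
    import waybackpack

    # Assumption: search() returns one dict per capture with a "timestamp" key.
    snapshots = waybackpack.search(url, uniques_only=True)

    # Assumption: Pack builds one retrievable asset per timestamp.
    return waybackpack.Pack(url, timestamps=unique_timestamps(snapshots))
```

Calling search_archive("http://www.oct282011.com/") would then hand back an object describing every unique capture, ready for the next step.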

Now that our search function has been implemented, we need to process the results, retrieve each captured page, and then extract all the image paths found in the HTML. Let's do that now.
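Here is one way that step could be sketched. To keep the example free of dependencies, it uses the standard library's html.parser rather than BeautifulSoup, which is a swap on my part. It also assumes that the object returned by the search step has an assets list whose items offer get_archive_url() and fetch(), as waybackpack's assets are believed to. The names ImageSrcParser and resolve_image_url are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def resolve_image_url(archive_url, src):
    """Turn an <img src> value into an absolute URL.

    urljoin is avoided here because, in Python 3, it can collapse the "//"
    that appears inside archive paths such as /web/<timestamp>/http://...
    """
    if src.startswith(("http://", "https://")):
        return src
    parts = urlsplit(archive_url)
    if src.startswith("//"):
        return f"{parts.scheme}:{src}"
    if src.startswith("/"):
        return f"{parts.scheme}://{parts.netloc}{src}"
    # Relative path: replace whatever follows the last "/" of the page URL.
    return archive_url.rsplit("/", 1)[0] + "/" + src


def get_image_paths(packed_results):
    """Fetch every captured page and return the unique image URLs found."""
    images = []
    total = len(packed_results.assets)
    for count, asset in enumerate(packed_results.assets, start=1):
        archive_url = asset.get_archive_url()
        print(f"[*] Retrieving {archive_url} ({count} of {total})")
        html = asset.fetch()
        if not html:
            continue  # nothing came back for this capture
        if isinstance(html, bytes):
            html = html.decode("utf-8", errors="replace")
        parser = ImageSrcParser()
        parser.feed(html)
        for src in parser.sources:
            image_url = resolve_image_url(archive_url, src)
            if image_url not in images:
                print(f"[+] Adding new image: {image_url}")
                images.append(image_url)
    return images
```

Keeping the URL handling in its own small function makes it easy to adjust if the archived pages turn out to reference images in some other way.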

All right! Now that we have extracted every image URL we can, we need to download the images and process them to obtain their EXIF data. Let's implement that now.
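A hedged sketch of the download-and-extract step follows. It makes several choices of its own: it calls the exiftool command line directly with its -json option instead of going through pyexifinfo (which wraps the same tool), it downloads with the standard library rather than requests, and it skips duplicate files by comparing SHA-1 hashes. The function names, the ImageHash and ArchiveURL keys, and the waybackimages folder are assumptions made for this example.

```python
import hashlib
import json
import os
import subprocess
import urllib.request


def store_image(content, image_url, image_dir, seen_hashes):
    """Write image bytes to disk once per unique SHA-1; return the path or None."""
    sha1 = hashlib.sha1(content).hexdigest()
    if sha1 in seen_hashes:
        return None  # the same file was already saved from another capture
    seen_hashes.add(sha1)
    filename = image_url.rstrip("/").split("/")[-1]
    image_path = os.path.join(image_dir, f"{sha1}-{filename}")
    with open(image_path, "wb") as fd:
        fd.write(content)
    return image_path


def read_exif(image_path):
    """Run ExifTool on one file and return its metadata as a dictionary."""
    completed = subprocess.run(
        ["exiftool", "-json", image_path],
        capture_output=True, text=True, check=True,
    )
    # exiftool -json prints a JSON array with one object per input file.
    return json.loads(completed.stdout)[0]


def download_images(image_list, target_url, image_dir="waybackimages"):
    """Download each image from the target site and collect its metadata."""
    os.makedirs(image_dir, exist_ok=True)
    seen_hashes = set()
    results = []
    for image_url in image_list:
        if target_url not in image_url:
            continue  # skip images that are not from the target site
        try:
            with urllib.request.urlopen(image_url, timeout=30) as response:
                content_type = response.headers.get("Content-Type", "")
                content = response.read()
        except OSError:
            continue  # network or HTTP error: move on to the next image
        if "image" not in content_type:
            continue
        image_path = store_image(content, image_url, image_dir, seen_hashes)
        if image_path is None:
            continue
        print(f"[v] Downloaded {image_url}")
        info = read_exif(image_path)
        info["ImageHash"] = os.path.basename(image_path).split("-", 1)[0]
        info["ArchiveURL"] = image_url
        results.append(info)
    return results
```

Hashing the content means an image that appears unchanged in dozens of captures is only stored and analysed once, while an image that was quietly swapped under the same filename is kept as a separate file.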

Let's break this code down a bit:

We are almost done. Now we just need to tie all these functions together and produce some output in CSV format so that we can easily review all the EXIF data we found. It's time to put the finishing touches on this script!
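The CSV step could be sketched roughly as follows. pandas was installed earlier and would do this in a line or two, but the sketch below sticks to the standard library's csv module, which is a substitution on my part. The function name write_csv, the default filename results.csv, and the choice of leading columns are assumptions for this example.

```python
import csv


def write_csv(records, csv_path="results.csv",
              first_columns=("SourceFile", "ImageHash")):
    """Write a list of metadata dictionaries to a CSV file.

    Images rarely share the same set of EXIF tags, so the header is the
    union of all keys, with a few preferred columns placed first.
    """
    columns = [c for c in first_columns if any(c in r for r in records)]
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as fd:
        # restval fills in an empty cell when an image lacks a given tag.
        writer = csv.DictWriter(fd, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(records)

    print(f"[*] Finished writing CSV to {csv_path}")
    return columns
```

With the three earlier steps in place, the script would end by running the search, collecting the image URLs, downloading the images, and passing the resulting list of dictionaries to this function.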

Let's break down the last piece of code:

Okay, now for the fun part. Set the URL you are interested in, and then run the script from the command line or your favorite Python IDE. You should see output like this:

[*] Retrieved 41 possible storage URLs
[*] Retrieved https://web.archive.org/web/20110823161411/http://www.oct282011.com/ (1 of 41)
[*] Search https://web.archive.org/web/20110830211214/http://www.oct282011.com/ (2 of 41)
[] Add new image: https://web.archive.org/web/20110830211214/http://www.oct282011.com/st.jpg

[*] Save https://web.archive.org/web/20111016032412/http://www.oct282011.com/material_same_habits.png
[v] Download https://web.archive.org/web/20111018162204/http://www.oct282011.com/ignoring.png
[v] Download https://web.archive.org/web/20111018162204/http://www.oct282011.com/material_same_habits.png
[v] Download https://web.archive.org/web/20111023153511/http://www.oct282011.com/ignoring.png
[v] Download https://web.archive.org/web/20111023153511/http://www.oct282011.com/material_same_habits.png
[v] Download https://web.archive.org/web/20111024101059/http://www.oct282011.com/ignoring.png
[v] Download https://web.archive.org/web/20111024101059/http://www.oct282011.com/material_same_habits.png
[*] CSV writing to results.csv completed
