Content from Setup
Last updated on 2024-05-13
Estimated time: 20 minutes
Here’s how to set up BitCurator.
Instructions
For this lab, the main goal is to get started with BitCurator. To do this, you will need to install BitCurator on your local computer, and you will also need to have a small removable storage device, such as a USB drive or thumb drive. (Your removable storage media should be modest in capacity in order to reduce file sizes.)
The overall steps in this task are as follows:
- Install BitCurator locally (on your laptop, or on a desktop that you use and have access to). This requires installing VirtualBox (virtualization software from Oracle), downloading a BitCurator “image” to import into VirtualBox, and then starting up the BitCurator environment on your computer.
- Launch BitCurator in the VirtualBox environment, and spend some time getting familiar with the tools in the BitCurator environment.
- Acquire a disk image of some sample of digital materials (from a USB or small storage device) using GuyMager, a disk imaging tool that is part of the BitCurator environment.
For reference, these are the tools you will be focused on:
- BitCurator - as of January 2023, the current version is 4.4.1
- GuyMager (note that this is already part of the above, so there is no need to install it separately)
Installation Tips
- To guide your BitCurator installation and imaging process, you should consult BitCurator’s Quick Start Guide, available here: https://github.com/BitCurator/bitcurator-distro/wiki/BitCurator-Quick-Start-Guide.
- Note that for this activity, you will be using the “Virtual Appliance” option, not the option to install as a direct boot or partition. So, you do not need to create a bootable USB drive prior to installation (ignore instructions about that).
- You can find the instructions for installing the BitCurator Virtual Appliance in VirtualBox here: https://github.com/BitCurator/bitcurator-distro/wiki/BitCurator-Quick-Start-Guide#install-option-2-import-a-bitcurator-virtual-appliance-in-virtualbox
- Be sure to set up a Shared Folder as outlined in the instructions, so that you can save files to your local computer (outside the VM).
- If you are ever asked while using BitCurator to enter an administrator password, use bcadmin.
Sample Data
- Sample disk images from Bentley floppy disks are available in this folder: Bentley_Code4Lib_Samples.
Content from Digital Forensics for Archives
Last updated on 2024-05-02
Estimated time: 8 minutes
Overview
Questions
- Why might archivists, librarians, and other cultural heritage workers want to use digital forensics techniques and tools?
- What are some known use cases of digital forensics usage amongst archivists, librarians, or others?
- What is BitCurator? What is the BitCurator environment? Is it the same as “digital forensics”?
Objectives
- Become familiar with digital forensics techniques and their application in cultural heritage and digital curation
- Identify and understand various types of magnetic disk removable media, which might be encountered in collections
- Describe and recommend tools and techniques for extracting content from legacy media, including use of write blockers and creation of disk images
- Understand various types of metadata that can be generated for born-digital content extracted from legacy media
- Become familiar with BitCurator and its toolset
Digital Forensics
Digital forensics refers to a suite of activities and tools to preserve the original context of digital materials (e.g., the system timestamps and OS structure) and extract content at the bitstream level from damaged or deleted digital content.
Archivists + Digital Forensics: Why
What are some use cases for digital forensics with legacy born digital materials?
Enter BitCurator Environment (BCE)
To address these needs, a group of archivists and researchers developed the BitCurator Environment, or BCE. The BCE is a suite of open-source digital forensics software tools that are particularly useful to archivists in tracking creation metadata, structure, file identification, and documenting provenance. It even contains some built-in write blockers and other tools to preserve original order and chain of custody. BitCurator tools are grouped within an Ubuntu-based Linux environment that can be run virtually or installed directly as the main OS of a workstation; together this is all known as the BCE. We will discuss the BCE more in the next episode.
Resources
There are many resources that explain how to use the BCE and other digital forensics tools. Given that this lesson focuses on BCE, most of the resources are geared toward this software environment, but the list also includes a few more general resources:
- BitCurator Docs - overview and explanation of all current BCE tools and functionalities
- Corinne Rogers, “From time theft to time stamps: mapping the development of digital forensics from law enforcement to archival authority,” International Journal of Digital Humanities (2019): 13–28. DOI: 10.1007/s42803-019-00002-y
- Julianna Barrera-Gomez and Ricky Erway, “Walk This Way: Detailed Steps for Transferring Born-Digital Content from Media You Can Read In-house,” 2013, OCLC Research. Available at https://www.oclc.org/content/dam/research/publications/library/2013/2013-02.pdf
- Christopher A. Lee, et al., “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions,” whitepaper for BitCurator, September 2013. Available at https://bitcurator.net/wp-content/uploads/sites/1099/2018/08/bitstreams-to-heritage.pdf
- Sam Meister and Alexandra Chassanoff, “Integrating Digital Forensics Techniques into Curatorial Tasks: A Case Study,” International Journal of Digital Curation 9 (2014): 6-16. DOI: 10.2218/ijdc.v9i2.325
Key Points
- Digital forensics covers a range of activities that aim to extract and preserve contextual information about digital content on devices such as laptops, servers, and drives, including legacy and removable media like floppy disks and USB drives
- Digital forensics tools and techniques can help digital preservation work, particularly in maintaining information about original order, provenance, and chain of custody for digital objects
- Digital preservation workers, particularly archivists, have used digital forensics techniques and tools to record information about, process, and preserve digital content, and particularly to address content stored on legacy digital devices
Content from Getting Started with BitCurator
Last updated on 2024-04-04
Estimated time: 8 minutes
Overview
Questions
- How do I install and use digital forensics tools that may be useful for digital curation activities?
- What is a disk image and how can I create one?
- What tools may be used to acquire born-digital materials from removable storage media (and other locations), which ensure the integrity of the data, create useful information about the source and the resulting materials, and can help to preserve the context of the original materials?
- What sorts of digital media are most well suited to this sort of activity? Are there some that are not?
Objectives
- Test and evaluate tools for use in the identification, transfer, and preservation of born-digital materials.
- Install and become familiar with the tools in the BitCurator environment.
- Identify appropriate tools for acquiring born-digital content from removable media and scan for potentially sensitive information stored in that media.
- Use the Guymager disk imaging software to acquire the contents of a storage device and its associated metadata.
Activities
Getting around: Answers will vary, depending on what you choose to look at. At minimum, you should look at the various “Applications” (menu up at the top), use the right click option to look at file information, checksums, and look around to find other interesting things.
Key Points
- Use BitCurator as a helpful way to bundle together and run many tools useful to digital forensics that are appropriate to digital curation. That is, tools that assist in creating trustworthy digital copies, provenance information, contextual data, and chain of custody information.
- You can use GuyMager to make disk images.
- BitCurator has things set up so you can use GuyMager as well as other tools that will document your transfer and copying processes.
Content from Disk Imaging
Last updated on 2024-04-24
Estimated time: 22 minutes
Overview
Questions
- What is a disk image?
- When would you disk image media?
Objectives
- Use the Guymager disk imaging software to acquire the contents of a storage device and its associated metadata.
- Learn to evaluate when to image a disk based on individualized criteria.
What is a disk image?
A disk image is a bit-perfect copy of the sequence of all the bits on a particular physical device; in other words, a complete bitstream (as defined by the physical limits of a storage device).
- You may have seen .dmg or .iso files. These are disk images of media such as a thumb drive, CD, or diskette.
- We will work with “forensic images,” specifically the “Expert Witness” format (aka .E01 or EWF), which captures the complete bitstream of a physical drive and does not allow any modifications.
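To make the idea of a "complete bitstream" concrete, here is a minimal Python sketch that reads a stream of bytes in chunks and computes checksums over every bit of it. It is only an illustration: the function name `hash_bitstream` is hypothetical, and an in-memory buffer stands in for a real device or image file.

```python
import hashlib
import io

def hash_bitstream(stream, chunk_size=1024 * 1024):
    """Return (total_bytes, md5_hex, sha256_hex) for a binary stream.

    The stream is read in fixed-size chunks so that large images
    never need to fit in memory all at once.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        sha256.update(chunk)
        total += len(chunk)
    return total, md5.hexdigest(), sha256.hexdigest()

# Stand-in "device": 1,474,560 zero bytes, the capacity of a 1.44 MB floppy.
fake_device = io.BytesIO(bytes(1474560))
size, md5_hex, sha256_hex = hash_bitstream(fake_device)
print(size, md5_hex, sha256_hex)
```

If two copies of a bitstream produce the same digests, they can be treated as identical for practical purposes, which is the basis of the verification step that imaging tools report.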
To image or not to image?
Considerations for disk imaging
There is no right or wrong answer to whether or not you should image a disk!
- Are you choosing between extracting files and/or chunks of content?
- Collection considerations:
  - What is your collecting purpose?
  - What is the role of the device(s)?
  - What are your storage concerns?
- Device considerations:
  - Which devices have an OS (that means lots of redundant and proprietary files)?
  - If it’s a storage device, it may hold deleted or unintended files (these are captured by forensic imaging approaches).
  - What is on the device? Sometimes the device contents will determine if a disk image is required, such as executable files or other software.
- What are the preservation needs?
Creating a disk image
For this walkthrough we will be using 3.5 inch floppy disks. Similar concepts apply to other storage media, but the exact steps may differ.
Using Guymager
The following instructions are modified from the BitCurator Quick Start Guide.
Mounting the device is not required to create an image of it. If you wish to mount the device, click on the Files icon in the dock, and select the name of the indicated volume on the device to mount. If you are not using a hardware write blocker, or if the USB device read-only policy is not enabled, your device is now mounted and writable.
Click on the Applications menu in the top left of the screen, then navigate to the Imaging and Recovery submenu. Then click on Guymager. Guymager requires elevated privileges for access to physical devices; you will be prompted for your password to enable this. Once Guymager has loaded, its main interface lists the attached devices. In this example, the 3.5 inch floppy disk drive is selected.
Next, right-click on the selected device (in this example, a 3.5 inch USB floppy drive listed as MITSUMI_USB_FDD) and select Acquire Image from the context menu.
A new dialog prompt will appear. This disk image will be acquired using the Expert Witness Format (the second option at the top). Guymager will split EWF images into 2048MiB segments by default. If you do not wish to split the image, set the Split size to something very large (2 EiB, for example).
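As a rough illustration of what the split size means in practice, the sketch below estimates how many segment files an image of a given size would need at the 2048 MiB default. It is an approximation under stated assumptions: it ignores compression and any format overhead, so real EWF output may use a different number of segments. The function name `estimated_segments` is hypothetical.

```python
MIB = 2**20
DEFAULT_SPLIT_MIB = 2048  # default split size described in the text above

def estimated_segments(image_bytes, split_mib=DEFAULT_SPLIT_MIB):
    """Estimate the number of segment files for image_bytes of data.

    Assumes no compression and no per-file overhead (a simplification).
    """
    if image_bytes <= 0:
        return 0
    segment_bytes = split_mib * MIB
    # Integer ceiling division avoids floating-point rounding.
    return (image_bytes + segment_bytes - 1) // segment_bytes

print(estimated_segments(1474560))      # a 1.44 MB floppy
print(estimated_segments(16 * 10**9))   # a nominal 16 GB USB drive
```

For floppy disks the split size never comes into play, which is why the default is fine for this walkthrough.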
The five metadata fields starting with Batch number are optional, but can be useful for tracking and metadata purposes. Under Destination select the image directory you would like the disk image to be saved to. In this case, we have simply chosen to write the image to a folder on the Desktop. Finally, provide a name for the image. Then click Start.
You will see the main dialog state change to Running. When the acquisition finishes, you will see a Finished - Verified & Ok message in the State column.
Some disk image formats you may see
- RAW and Split RAW (RAW stored across multiple files)
- Advanced Forensics Format (AFF) [no longer recommended]
- EnCase Evidence File (.E01)
- ISO (for CD-ROM)
- IMG (floppy or sometimes CD-ROM)
RAW format (dd)
- Copies of the raw media data. Often split into smaller chunks to make them more manageable and so that the resulting images can fit onto limited file systems and media such as FAT or DVD/CD-ROM.
- Advantages:
  - Very simple; simple tools can be used to manipulate the image.
  - Image can be easily split for storage and transport on removable media.
  - Output can be piped to other applications for immediate processing.
- Disadvantages:
  - Can be very large (no compression). Zipped raw images cannot be operated on directly with regular tools, because compressed data does not support efficient arbitrary seeks.
  - Often too large to store on FAT-formatted media.
  - No metadata other than file names, no hashes.
  - No checksumming on files, so not robust.
  - Missing segments (for example, from a scratched CD/DVD) can sometimes be overwritten with 0’s.
  - Overwritten data is unrecoverable (no checksums on small blocks in the file).
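The splitting behaviour described above can be sketched in a few lines of Python. This is a toy illustration on in-memory bytes, not a replacement for an imaging tool; the function name `split_raw` is hypothetical. It also shows why raw images need an externally stored hash: the segments themselves carry no integrity information.

```python
import hashlib

def split_raw(data: bytes, segment_size: int) -> list[bytes]:
    """Cut a raw bitstream into fixed-size segments (the last may be shorter)."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]

raw_image = bytes(range(256)) * 10      # 2,560 bytes of stand-in data
recorded_hash = hashlib.sha256(raw_image).hexdigest()  # must be kept separately

segments = split_raw(raw_image, 1000)
print([len(s) for s in segments])       # [1000, 1000, 560]

# Rejoining the segments in order must reproduce the original bitstream.
rejoined = b"".join(segments)
print(hashlib.sha256(rejoined).hexdigest() == recorded_hash)  # True
```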
Expert Witness Format (EnCase)
- Evidence file consists (in order) of: Acquisition information, Data Block, CRC (cyclic redundancy check), acquisition hash (MD5)
- Can be split for storage, transport
- CRC computed for every 32K block; balance between integrity and speed, also makes it very difficult to tamper with the evidence file (1 in 4 billion chance of collision)
- Cannot be manipulated with simple (open source UNIX) tools; support reverse engineered in libewf
- Previously limited to 2GB size
- Largely proprietary
- Has been reverse engineered by Joachim Metz in libewf (used in open source tools that read EWF)
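The value of per-block checksums can be shown with a generic sketch. Note the assumptions: this uses Python's `zlib.crc32` as a stand-in 32-bit checksum and does not reproduce the actual EWF file layout or its exact checksum algorithm; the function names are hypothetical. The point is that a block-level check localizes damage to a specific block, and that a 32-bit checksum has 2^32 possible values, which is where the "1 in 4 billion" figure comes from.

```python
import zlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, the block size mentioned above

def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Compute one 32-bit checksum per block of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def damaged_blocks(data: bytes, expected: list[int],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks whose checksum no longer matches."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

original = bytes(100 * 1024)            # 100 KiB of zeros, i.e. 4 blocks
expected = block_checksums(original)

corrupted = bytearray(original)
corrupted[40000] = 0xFF                 # change one byte inside block 1

print(damaged_blocks(bytes(corrupted), expected))  # [1]
print(2**32)                                       # 4294967296
```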
Accessing disk images
- Virtualization and emulation
- Mounting the original filesystem
- Accessing (but not mounting) disk images using forensics software
- For end user access:
- Remote, dynamic access to disk image contents (via server, virtual environment)
- Cross-drive analysis
Mount disk image: Using BitCurator Mounter
The following instructions are modified from the BitCurator Quick Start Guide.
In the file manager dialog, right click on any of the sample images you have created, select Scripts, and then select Disk Image Mount. This script serves as a wrapper for libewf and some mounting tools to attempt to automatically mount any identified file systems. If such a filesystem is found, you will see it appear as a mountable device in the list on the left.
Note: This mount is read-only. You cannot alter the content of a filesystem mounted from an E01 file (modifying, adding new files, or deleting) from this desktop interface.
Once you have finished examining the content, click the eject indicator next to the filesystem name in the file dialog. You will get a prompt for your user password in order to complete this step.
Activities
Challenge
Split into two groups.
Group 1: Create your own disk images from the supplied 3.5 inch floppy disks.
Group 2: Mount one of the sample disk images available in the GitHub repo. What information do you see? Is there anything that sticks out to you?
Once you’ve completed one group’s activity, switch to the other group’s activity!
Content from Reporting
Last updated on 2024-05-13
Estimated time: 8 minutes
Overview
Questions
- What tools are available in the BCE for analyzing disk images or directories of data transferred from legacy media?
- How do you use them?
- Specifically, how can librarians and archivists capture basic system characteristics and metadata?
- How can they generate reports to help them triage and organize files for digital archiving processes?
- How can they scan for potentially sensitive information to help them make decisions about access?
Objectives
- Gain basic experience with:
  - Brunnhilde, a reporting tool for directories and disk images;
  - Bulk Extractor and Bulk Reviewer, which scan for credit card numbers, emails, etc.; and
  - fiwalk, to print filesystem statistics
- Learn more about reporting functionality in the BitCurator Environment, in general, and where to learn more.
Reporting in BitCurator is essentially a method of generating technical and preservation metadata about a disk image or directory of data.
At a high level, you will build a workflow that pieces together:
- a “map” of the disk image, which records relationships, integrity (checksums), names, timestamps, etc. (this is in DFXML);
- a summary of the file types, duplicates, and other relationship information;
- tools for assessing Personally Identifiable Information (PII) and sensitive content; and
- summaries of sensitive content, if discovered.
Note: If you haven’t yet created a disk image and don’t otherwise have a directory of data to work with, you can use the Bentley Code4Lib Samples or download sample data from BitCurator’s GitHub site and work with that: bcc-dfa-sample-data.
One possible structure to group content and metadata (the one we’ll be using for this workshop):
c4l24_bicuratorintro_group0X_image0XX/ <-- parent directory (sample name)
│
├── reports/ <-- subdirectory for detailed metadata (use mkdir)
│ ├── beout/ <-- bulk extractor reports (generated by bulk_extractor)
│ ├── brunn_output/ <-- brunnhilde reports (generated by brunnhilde.py)
│ └── mappedfeatures/ <-- sensitive info (generated by identify_filenames.py)
│
├── c4l24_bicuratorintro_group0X_image0XX_dfxml.xml <-- DFXML (E01 “map” generated by fiwalk)
├── c4l24_bicuratorintro_group0X_image0XX.E01 <-- disk image (generated by Guymager)
└── c4l24_bicuratorintro_group0X_image0XX.info <-- disk image metadata (from Guymager)
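As a sketch of how this layout might be prepared for each sample, the Python snippet below creates the parent and reports/ directories and computes the expected paths for everything else. The assumptions: the sample name is a hypothetical filled-in example of the naming pattern, the function name `plan_sample_layout` is made up for illustration, and the tool-generated subdirectories are deliberately not created in advance, since the tree above marks them as generated by the tools themselves.

```python
import tempfile
from pathlib import Path

def plan_sample_layout(base_dir, sample_name):
    """Create <sample>/reports/ and return the expected paths for the workflow."""
    parent = Path(base_dir) / sample_name
    reports = parent / "reports"
    reports.mkdir(parents=True, exist_ok=True)  # the "use mkdir" step
    return {
        "parent": parent,
        "reports": reports,
        "image": parent / f"{sample_name}.E01",        # written by Guymager
        "info": parent / f"{sample_name}.info",        # written by Guymager
        "dfxml": parent / f"{sample_name}_dfxml.xml",  # written by fiwalk
        "beout": reports / "beout",                    # left for the tool to create
        "brunn_output": reports / "brunn_output",      # left for the tool to create
        "mappedfeatures": reports / "mappedfeatures",  # left for the tool to create
    }

# Hypothetical sample name following the pattern shown above.
SAMPLE = "c4l24_bicuratorintro_group01_image001"

with tempfile.TemporaryDirectory() as tmp:
    layout = plan_sample_layout(tmp, SAMPLE)
    for key, path in layout.items():
        print(f"{key:15} {path.relative_to(tmp)}")
```

Keeping the paths in one place makes it easier to pass consistent filenames to each command later in the episode.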
First Things First
Today we’ll be using a number of command line tools in the BCE, including:
fiwalk
brunnhilde.py
bulk_extractor
identify_filenames.py
All of these are “pre-loaded” in the BCE. A simple way to get usage instructions for any of them is to type its name in the terminal and press Enter (e.g., brunnhilde.py), which shows usage information much like brunnhilde.py -h or brunnhilde.py --help. This is standard for CLI tools, but we hope it helps illustrate how what we’re doing today is only the “tip of the iceberg” for any of these individual tools or the BCE in general.
Reporting
BitCurator includes a variety of tools to analyze and report on disk images and the filesystems they contain.
Map Your Image AKA How to Create DFXML (with fiwalk)
Your first goal is to create a Digital Forensics XML (DFXML) “map” of the disk image. DFXML is used to automate digital forensics processing; it includes all filesystem data and checksums for integrity, and it explains the relationships between elements of the disk image. We’ll do this using fiwalk, a program that processes a disk image using the SleuthKit library (a library and collection of command line tools that allow you to investigate disk images for various file systems) and outputs its results in DFXML. This map will be used later in other tools.
Tool: fiwalk
To run: Use fiwalk in the terminal.
Command syntax:
fiwalk -f -X <output filename_dfxml.xml> <input image file.E01>
This command tells the terminal to run fiwalk, run the “file” command on each file that it finds (-f), write the results to an XML file with the specified filename (-X <output filename_dfxml.xml>), and use the disk image named last as the source of the analysis (<input image file.E01>).
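Once the DFXML exists, it can be read with ordinary XML tools. The sketch below lists filenames, sizes, and MD5 values from a small, hand-written DFXML-like fragment. Treat the details as assumptions: the fragment is a simplified, hypothetical example, and the element names used here (`fileobject`, `filename`, `filesize`, `hashdigest`) should be checked against the file fiwalk actually produces for your image.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment for illustration only.
SAMPLE_DFXML = """<dfxml xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML">
  <volume offset="0">
    <fileobject>
      <filename>LETTER.DOC</filename>
      <filesize>19968</filesize>
      <hashdigest type="md5">0123456789abcdef0123456789abcdef</hashdigest>
    </fileobject>
    <fileobject>
      <filename>NOTES.TXT</filename>
      <filesize>512</filesize>
      <hashdigest type="md5">fedcba9876543210fedcba9876543210</hashdigest>
    </fileobject>
  </volume>
</dfxml>
"""

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def list_files(dfxml_text):
    """Return one dict per fileobject with filename, filesize, and md5."""
    root = ET.fromstring(dfxml_text)
    files = []
    for elem in root.iter():
        if local_name(elem.tag) != "fileobject":
            continue
        record = {"filename": None, "filesize": None, "md5": None}
        for child in elem:
            name = local_name(child.tag)
            if name == "filename":
                record["filename"] = child.text
            elif name == "filesize":
                record["filesize"] = int(child.text)
            elif name == "hashdigest" and child.get("type", "").lower() == "md5":
                record["md5"] = child.text
        files.append(record)
    return files

for entry in list_files(SAMPLE_DFXML):
    print(entry["filename"], entry["filesize"], entry["md5"])
```

Matching on the local element name keeps the sketch working whether or not the file declares an XML namespace.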
Generate File Summaries and Reports AKA How to Run brunnhilde to Report on the Disk Image
Your next goal is to create a summary of file types, duplicates, and any hard to identify files using Brunnhilde. Brunnhilde runs Siegfried, a signature-based file format identification tool, against a specified directory or disk image, loads the results into a sqlite3 database, and queries the database to generate reports to aid in assessment: triage, arrangement, and description of digital archives. The program will also check for viruses unless specified otherwise, and will optionally run bulk_extractor against the given source.
Tool: brunnhilde
To run: Use brunnhilde in the terminal.
Command syntax:
brunnhilde.py -d -b --tsk_fstype fat --tsk_imgtype ewf <image input file.E01> <output destination/reports/brunn_output>
This command tells the terminal to run brunnhilde, treat the input as a disk image (-d), generate a bulk extractor report (-b), analyze the disk image as a FAT filesystem (--tsk_fstype fat), and analyze the disk image as an expert witness file (--tsk_imgtype ewf). Then, the command provides the location of the source disk image (<image input file.E01>) and the destination for reports (<output destination/reports/brunn_output>).
Outputs include:
- report.html: Includes some provenance information on the scan itself, aggregate statistics for the material as a whole (number of files, begin and end dates, number of unique vs. duplicate files, etc.), and detailed reports on content found (file formats, file format versions, MIME types, last modified dates by year, unidentified files, Siegfried warnings/errors, duplicate files, and, optionally, Social Security Numbers found by bulk_extractor).
- csv_reports folder: Contains CSV results queried from database on file formats, file format versions, MIME types, last modified dates by year, unidentified files, Siegfried warnings and errors, and duplicate files.
- siegfried.csv: Full CSV output from Siegfried
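Because these outputs are plain CSV, they are easy to summarize further. The sketch below counts files per format from CSV text. The column names (`filename`, `filesize`, `format`, `mime`) and the sample rows are assumptions made for illustration; check the header row of your own siegfried.csv, which contains more columns than shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical, trimmed-down rows for illustration only.
SAMPLE_CSV = """filename,filesize,format,mime
/disk/LETTER.DOC,19968,Microsoft Word Document,application/msword
/disk/NOTES.TXT,512,Plain Text File,text/plain
/disk/README.TXT,204,Plain Text File,text/plain
/disk/MYSTERY.BIN,4096,,
"""

def count_formats(csv_text, column="format"):
    """Count rows per value of `column`, labelling blanks as UNIDENTIFIED."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] or "UNIDENTIFIED" for row in reader)

for fmt, count in count_formats(SAMPLE_CSV).most_common():
    print(f"{count:3}  {fmt}")
```

To run this against a real report, the file contents could be read with `open(path, newline="", encoding="utf-8").read()` and passed to the same function.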
Identify Sensitive Information AKA How to Identify Features (with bulk_extractor)
Your next goal is to create reports that identify potentially sensitive information, like SSNs, emails, etc. To do this, we’ll use Bulk Extractor, which rapidly scans any kind of input (disk images, files, directories of files, etc) and extracts structured information such as email addresses, credit card numbers, JPEGs and JSON snippets without parsing the file system or file system structures.
Tool: bulk_extractor
To run: Use bulk_extractor in the terminal AND/OR use Bulk Reviewer.
Command syntax:
bulk_extractor -o <output destination/reports/beout> <input target disk image file.E01>
This command tells the terminal to run the bulk_extractor tool, output its reports to the specified directory (-o <output destination/reports/beout>), and analyze the specified target file (<input target disk image file.E01>).
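The reports bulk_extractor writes are text "feature files" that can be read line by line. The sketch below parses feature-file text under these assumptions: comment lines begin with `#`, and each remaining line is tab-separated into an offset, the feature found, and some surrounding context. The sample lines and addresses are invented for illustration, and the function name `parse_feature_lines` is hypothetical.

```python
from collections import Counter

# Invented sample content; real files are written to the output directory.
SAMPLE_FEATURES = (
    "# Feature-Recorder: email\n"
    "1024\tterry@example.org\tcontact terry@example.org for details\n"
    "20480\tinfo@example.org\tmailto:info@example.org\n"
    "40960\tterry@example.org\tFrom: terry@example.org\n"
)

def parse_feature_lines(text):
    """Return a list of (offset, feature, context) tuples, skipping comments."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        offset = parts[0]
        feature = parts[1] if len(parts) > 1 else ""
        context = parts[2] if len(parts) > 2 else ""
        features.append((offset, feature, context))
    return features

hits = parse_feature_lines(SAMPLE_FEATURES)
for feature, count in Counter(f for _, f, _ in hits).most_common():
    print(count, feature)
```

Offsets are kept as strings here because they are positions within the image rather than filenames, and their exact form can vary.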
Note: Bulk Reviewer is a GUI alternative: an Electron desktop application that uses bulk_extractor to scan directories and disk images for personally identifiable information (PII) and other sensitive information, and that aids in the identification, review, and removal of sensitive files. To use it, click Applications (top left) > Forensics and Reporting > bulk-reviewer. Click “Scan new directory or disk image.” Select the “Type” (“Directory” or “Image”), create a “Name” for the report, “Browse” to the directory or disk image, select any “Options,” and then click “Start Scan.” Once it’s finished, you can view the report and save or export the results.
The desktop application then enables users to:
- Review features found by type and by file in a user-friendly dashboard that supports annotation and dismissing features as false positives
- Generate CSV reports of features found
- Export sets of files:
  - Cleared: Files free of PII
  - Private: Files with PII that should be restricted or run through redaction software
Note: If the directories or disk images you’re working with do not produce any “hits,” the “terry-work-usb-2009-12-11.E01” disk image in the sample data from BitCurator’s GitHub site produces a number of them, including social security numbers, phone numbers, and email addresses.
Summarize Sensitive Information Reports AKA How to Summarize Identified Features (with identify_filenames.py)
Your final goal is to summarize the reports on sensitive information, show the main types of features, and note which files contain the features. To do this, we’ll use identify_filenames.py, which identifies filenames from bulk_extractor output and uses the DFXML to map the hits discovered earlier to the files on the disk image (rather than to byte offsets).
Tool: identify_filenames.py
To run: Use identify_filenames in the terminal.
Command syntax:
identify_filenames.py --all --image_filename <input disk image.E01> --xmlfile <DFXML of the image_dfxml.xml> <bulk extractor reports location/reports/brunn_output/bulk_extractor> <destination for summary report/reports/mappedfeatures>
This command tells the terminal to run the identify_filenames.py script, look at all of the feature files (--all), use the specified source image (--image_filename <input disk image.E01>), use the specified DFXML file (--xmlfile <DFXML of the image_dfxml.xml>), read the bulk extractor output from the given location (<bulk extractor reports location>; use the one in <image directory/reports/brunn_output/bulk_extractor>), and write the analysis to the specified destination (<image directory/reports/mappedfeatures>).
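The core idea of mapping a hit from a byte offset to a file can be sketched as a range lookup. This is a conceptual illustration only, not how identify_filenames.py is implemented: it assumes each file occupies one or more byte runs (a start offset within the image plus a length), and the run values and the function name `file_for_offset` are hypothetical.

```python
def file_for_offset(offset, byte_runs):
    """Return the filename whose byte run contains `offset`, or None.

    byte_runs is a list of (filename, image_offset, length) tuples.
    """
    for filename, start, length in byte_runs:
        if start <= offset < start + length:
            return filename
    return None  # e.g. the hit falls in unallocated space

# Hypothetical byte runs, as might be derived from a DFXML file.
BYTE_RUNS = [
    ("LETTER.DOC", 16896, 19968),
    ("NOTES.TXT", 36864, 512),
]

for hit_offset in (17000, 36900, 100):
    print(hit_offset, file_for_offset(hit_offset, BYTE_RUNS))
```

Hits that map to no file are still worth noting, since they may come from deleted content that a forensic image preserves.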
So What?
What is the utility in creating all these reports? Reports create technical and preservation metadata about directories or disk images that can accompany them into the future and aid in later appraisal and processing for preservation and access.
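If you need to produce these reports for many images, it can help to assemble the four commands from this episode programmatically. The sketch below only builds and prints the command lines, using the flags shown in the command syntax above; it does not run anything. The function name `build_commands` and the example paths are hypothetical.

```python
import shlex

def build_commands(image, dfxml, reports_dir, fstype="fat"):
    """Return the four reporting commands as argument lists, in workflow order."""
    brunn_out = f"{reports_dir}/brunn_output"
    return [
        ["fiwalk", "-f", "-X", dfxml, image],
        ["brunnhilde.py", "-d", "-b", "--tsk_fstype", fstype,
         "--tsk_imgtype", "ewf", image, brunn_out],
        ["bulk_extractor", "-o", f"{reports_dir}/beout", image],
        ["identify_filenames.py", "--all", "--image_filename", image,
         "--xmlfile", dfxml, f"{brunn_out}/bulk_extractor",
         f"{reports_dir}/mappedfeatures"],
    ]

# Hypothetical paths for illustration.
commands = build_commands(
    image="sample/sample.E01",
    dfxml="sample/sample_dfxml.xml",
    reports_dir="sample/reports",
)
for argv in commands:
    print(shlex.join(argv))
```

Printing the commands first lets you review them before running anything in the terminal; passing each list to `subprocess.run` would be one possible next step.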
Key Points
- Some reports may be needed for contextualizing and using the disk images in other programs (DFXML).
- Some reports may be more for risk management and analyzing PII.
- Some may be more for preservation planning (file types).
- Some may be for general description (dates of creation, titles/names of files, users, or other topical information).
How you interpret any of these reports depends on the report and on what you want to get out of it. Some reports, like the bulk_extractor reports, are easier to read through. The DFXML, while “harder” to read, gives you all the checksums and a listing of what’s on a disk image, which can be good for checking fixity and can also help you determine whether you want to extract the files from the disk image.
Additional resources
- BitCurator Quick Start Guide, which includes sections on:
- Jesse’s BitCurator Workstation Guide created for SI 667: Foundations of Digital Curation.
- Brunnhilde - Siegfried-based characterization tool for directories and disk images, for more information on Tessa Walsh’s Brunnhilde tool.