Fast and easy Hash function?

I want to keep a list of files of the SD card in RAM. So a kind of dictionary. The files can be in an arbitrary folders (or subfolders) in the SD card. In the code I want to quickly check if a given file name (with path) exists in the SD card.

Instead of storing a list of all file names and paths of the SD card in RAM, I was thinking of storing only a hash of each filepath to save RAM.

Do you know a fast and simple algorithm to convert a file path to a hash? The resulted hash id should be max 32-bit if possible.

Combine a 32-bit xorshift:

With a generic byte hashing algorithm:

Xorshift32 is guaranteed to have a full 32-bit period if used as the state transition function of a PRNG.
The generic algorithm has no guarantees, best measure it to make sure you’re happy with it.

(I used this hashing/checksum algorithm to detect when the EEPROM data had been corrupted.)

1 Like

Excellent! Looks like what I need.

1 Like

There may be some ways to speed it up if it’s not fast enough,
particularly if you can guarantee that the input size is a multiple of 4 bytes.

I was wondering if there was an easy-ish way to convert a const char* string to a 32-bit hash at compile time using constexpr.

I figured the almighty @Pharap would know a good way to do it. The idea is to be able to do the following with my DataPack library:

DataPack dataFile;
//the DataPack::getPackedFile would ideally take a 32-bit hash
//or convert the string at compile-time and call the appropriate overloaded function
DataPack::PackedFile packedFile = dataFile.getPackedFile("/path/to/file");
//I would also like to do the following
const uint32_t dataFiles[] =
{
  DataPack::getHash("/path/to/file1"),
  DataPack::getHash("/path/to/file2")
  //...
};

The idea is to be able to call a function passing a path as a string constant and have it converted to a hash at compile time to save space. Also to store an array of files (so they can be referenced via an index) in flash space to save on RAM.

2 Likes

Yes, @Pharap s function works well for me :+1:

2 Likes

As @Hanski hinted at, you could just constexpr the xorshift I posted earlier:

Here’s a version more specific to your use case:

namespace DataPack
{
	constexpr uint32_t xorshift32(uint32_t value)
	{
		value ^= (value << 13);
		value ^= (value >> 17);
		value ^= (value << 5);
		return value;
	}

	template<std::size_t size>
	constexpr std::uint32_t hash(const char (& string)[size])
	{
		return hash(string, size);
	}

	constexpr std::uint32_t hash(const char block[], std::size_t size)
	{
		constexpr std::size_t resultSize = sizeof(std::uint32_t);

		std::uint32_t sum = static_cast<std::uint32_t>(size);

		for (std::size_t i = 0; i < size; i += resultSize)
		{
			std::uint32_t value = 0;
			
			for (std::size_t j = 0; j < resultSize; ++j)
			{
				value <<= 8;
				
				const size_t index = (i + j);

				if(index < size)
					value |= block[index];
			}
			
			sum = xorshift32(sum + value);
		}

		return sum;
	}
}

Edit: I’ve just realised this actually ignores some of the tail bytes, I’m currently fixing that.

Edit 2: Disregard that, I’m being silly. The loop is correct, it’s just being clever instead of being clear about what’s going on. :P

Which you’d then use as:

constexpr uint32_t dataFiles[]
{
  DataPack::hash("/path/to/file1"),
  DataPack::hash("/path/to/file2"),
  // ...
};

If you’re feeling bold you could also provide a user-defined literal suffix like so:

namespace DataPack
{
	constexpr std::uint32_t operator ""_hash(const char * string, std::size_t size)
	{
		return DataPack::hash(string, size);
	}
}

After which a user could do:

using DataPack::operator ""_hash;

constexpr uint32_t dataFiles[]
{
  "/path/to/file1"_hash,
  "/path/to/file2"_hash,
  // ...
};

(Or using namespace DataPack; if they really wanted to bring in everything, including the kitchen sink. :P)

xorshift32 is guaranteed to be good for avoiding clashes because George Marsaglia says so and he knows more about that kind of stuff than I do. :P

The byte combining algorithm doesn’t come with that guarantee because I wrote it and I’m not a mathematician so you may want to run some tests, but generally it ought to be decent.

Be warned that no algorithm is going to guarantee no clashes simply due to the pigeon hole principle.

You’ll probably need to think of a way to handle clashes eventually.

One possibility, although not necessarily an easy one, is a compile-time hash table.

Sounds like a job for a file allocation table to me.
I.e. an array of data offsets/pointers.

1 Like

This is really great, and pretty much exactly what I was looking for.

That is much appreciated as I’ll be able to more easily incorporate that into the library with less work.

Clashes will be easily handled by the data packing tool (the tool that creates the pack). The idea is the tool will create the hashes and either error out or at least warn of clashes. Then on the Pokitto side the library just needs to know the path and the hash will be calculated at compile time.

The alternative I have now is something like this for a start (the packer either embeds the file table in the file header, or spits out a c++ header with the following for now):

//datapackname.h
#pragma once

#define FILE_filename1 START_ADDRESS, END_ADDRESS
#define FILE_filename2 START_ADDRESS, END_ADDRESS

While arguably all files could be just coded with a resource ID (even using uin8_t for extra savings if there’s fewer than 256 files), but that gets harder to program and even harder to read (ie. dataPack.getPackedFile(idNumber) which is provided, but mostly for iteration convenience).

Ideally I would like the files to be able to be repacked (like if some changes were made) without having to recompile the code (thus the need for storing the file offsets inside the file itself). While a compile-time hashing function might not be considered “trivial” for some it makes the user-end portion far simpler. Also there’s the benefit of more readable code, like dataPack.getPackedFile("/path/to/file/as/string") could be a constexpr function that calls the overload dataPack.getPackedFile(hashValue) using the hash function. IMHO this would make the user’s code really easy to read/write and put all the complexity behind the curtains (fully accessible to the curious, but otherwise hidden from view).

Anyway, thanks a bunch for the help, I’ll have to implement this soon and get the DataPack library and packing tool on github ASAP.

Just to throw an idea at the wall, you could export a header for the user to use like this:

#pragma once

#include <cstdint>

namespace MyGameData
{
	constexpr std::uint8_t dataFile1 = /*File table offset*/;
	constexpr std::uint8_t dataFile2 = /*File table offset*/;
}

Then users could do dataPack.getPackedFile(MyGameData::dataFile1).

Effectively you’d be replacing URLs encoded in string literals with name constants, which has the added advantages of:

  • You’ll know you’re getting them wrong at compile time when the editor tells you “this identifier doesn’t exist”
  • The various identifiers will show up in an editor’s autocomplete

If each constant represents an index into an allocation table stored in the file then there would be no need to recompile the code.

Also if you really want them readable you could produce nested namespaces such that the user writes something like dataPack.getPackedFile(files::path::to::file::as::string);.

The big difference here is that rather than getting a pseudorandom value (i.e. a hash) from a fixed value and being at the mercy of the hash, you get to dictate what the value is.

Of if you’re feeling particularly crazy you can overload the / operator and do some type magic so you end up writing dataPack.getPackedFile(files/path/to/file/as/string).
(If you want to see this kind of madness in action, have a peek at this, especially the examples.)
It’s nicer to look at but more cumbersome to write the boilerplate for. :P

(If you’re interested in any of these alternatives feel free to PM me -
I don’t want to clog up the hash function thread too much.)

1 Like

Thank you @Pharap for all the info and help. I now have a fully functional v0.1 of my DataPack utility, just need to create an example demo project and put it up on github for others to use. Will create a dedicated topic for it once I’ve got it on github.

I went with this format for referencing files (snippet from my PreludeToADream game):

const uint32_t backdrops[] =
{
	DataPack::hash("backdrops/field.gfx"),
	DataPack::hash("backdrops/forest.gfx"),
	DataPack::hash("backdrops/shipdeck.gfx"),
	DataPack::hash("backdrops/swamp.gfx")
};

const uint32_t battlers[] =
{
	DataPack::hash("battlers/giantPokitto.gfx"),
	DataPack::hash("battlers/pirate.gfx"),
	DataPack::hash("battlers/snake.gfx"),
	DataPack::hash("battlers/wolf.gfx")
};

static const uint32_t maps[][3] =
{
	{DataPack::hash("maps/overworld/data.dat"),DataPack::hash("maps/overworld/passability.dat"),DataPack::hash("maps/overworld/tilemap.dat")}
};

static const uint32_t music[]
{
	DataPack::hash("music/JRPG_battleBoss_loop.raw"),
	DataPack::hash("music/JRPG_battleFinal.raw"),
	DataPack::hash("music/JRPG_battle_loop.raw"),
	DataPack::hash("music/JRPG_characterRevived.raw"),
	DataPack::hash("music/JRPG_discovery.raw"),
	DataPack::hash("music/JRPG_docks_loop.raw"),
	DataPack::hash("music/JRPG_dungeon_loop.raw"),
	DataPack::hash("music/JRPG_fields_loop.raw"),
	DataPack::hash("music/JRPG_gameOver.raw"),
	DataPack::hash("music/JRPG_goodMorning.raw"),
	DataPack::hash("music/JRPG_goodNight.raw"),
	DataPack::hash("music/JRPG_inn_loop.raw"),
	DataPack::hash("music/JRPG_joinParty.raw"),
	DataPack::hash("music/JRPG_labyrinth_loop.raw"),
	DataPack::hash("music/JRPG_levelUp.raw"),
	DataPack::hash("music/JRPG_machineRoom_loop.raw"),
	DataPack::hash("music/JRPG_mainTheme.raw"),
	DataPack::hash("music/JRPG_mysticIsle.raw"),
	DataPack::hash("music/JRPG_princess.raw"),
	DataPack::hash("music/JRPG_royalCourt_loop.raw"),
	DataPack::hash("music/JRPG_shop_loop.raw"),
	DataPack::hash("music/JRPG_tavern.raw"),
	DataPack::hash("music/JRPG_temple.raw"),
	DataPack::hash("music/JRPG_town_loop.raw"),
	DataPack::hash("music/JRPG_winBattleBig.raw"),
	DataPack::hash("music/JRPG_winBattleBoss.raw"),
	DataPack::hash("music/JRPG_winBattleFinal.raw"),
	DataPack::hash("music/JRPG_winBattle.raw")
};

static const uint32_t sprites[]
{
};

static const uint32_t tilesets[]
{
	DataPack::hash("tilesets/overworld.gfx")
};

With this method it’s easy enough to create an array of all files, or separate arrays like I’ve done for individual types of resources (music, tilesets, etc.).

I don’t know if I’m crazy enough to go to that extent :P

Also clashes are handled by the packing utility, but I’ll explain a little more detail on using the packing tool, and using the DataPack library in the readme on github once I get it ready.

1 Like

Just to check, are you aware of what static actually means (in terms of storage and linkage) when used with global variables?

(I ask only because I know a lot of people see it used in code and then just copy it without really understanding what the static is for.)

That’s a shame, because the standards committee are. :P

https://en.cppreference.com/w/cpp/filesystem/path/operator_slash