Curtis Lassam – Hash Functions and You: Partners in Freedom – PyCon 2015


ABOUT HASH FUNCTIONS AND YOU, PARTNERS IN
FREEDOM.>>HELLO, I’M CURTIS LASSAM, CLASSAM ON TWITTER.
WE’LL TALK ABOUT HASH FUNCTIONS. I’M GOING TO SHOW YOU WHY THIS FUNDAMENTAL
TECHNIQUE BELONGS IN YOUR TOOL KIT AND YOU’LL SEE
I’M NOT A HANDYMAN. WE’RE GOING TO TALK ABOUT HASH FUNCTIONS,
#FILTER, BLOOM FILTERS AND HASHES INSECURITY. WHAT IS A HASH FUNCTION?
THE VERY BEGINNING. HERE IS AN EXAMPLE.
IT TAKES A STRING, SPLITS ITS INTO CHARACTERS, CONVERTS EVERY CHARACTER INTO AN INTEGER,
ADDS THE INTEGERS TOGETHER AND MODELS THE RESULT
BY 100.
THIS IS A HASH IF YOU THINK. NOT EVERY HASH FUNCTION WORKS THIS WAY BUT
THIS, LIKE MANY OTHER FUNCTIONS, IS AN EXAMPLE OF
A HASH FUNCTION.
IT HAS A NUMBER OF USEFUL PROPERTIES. REGARDLESS OF WHAT SEQUENCE OF CHARACTERS
YOU PUT IN, YOU’RE GUARANTEED THAT THE OUTPUT WILL
ALWAYS BE BETWEEN 0 AND 99.
THE OUTPUT FOR A STRING WILL ALWAYS BE THE SAME.
AND MANY DIFFERENT INPUTS CAN PRODUCE THE SAME
OUTPUT. ALSO, GIVEN THE OUTPUT, IT’S IMPOSSIBLE TO
GUESS THE INPUT THAT PRODUCED IT.
I KNOW INVOKING THE WIKIPEDIA DEFINITION OF SOMETHING IS THE PRESENTATION EQUIVALENT TO
OPENING WITH WEBSTER DEFINITIONS DEFINES BUT THE
WIKIPEDIA HAS A GOOD DEFINITION OF A HASH FUNCTION.
A HASH FUNCTION IS ANY ALGORITHM THAT MAPS DATA
OF ARBITRARY LENGTH TO DATA OF A FIXED LENGTH. SO WHATEVER YOU PUT IN, YOU GET DATA OF A
CERTAIN LENGTH.
THAT’S HASH FUNCTION. LET’S TALK ABOUT HASH TABLES.
THEY’RE A DATA CONCEPT USED TO STORE KEYS AND
VALUES. KEYS AND VALUES ARE IMPORTANT, YOU DEAL WITH
THEM WHEN YOU WORK WITH THINGS LIKE MUNGO DB OR
PYTHON DICT, AND IF YOU’RE DEALING WITH KEYS AND
VALUES, CHANCES ARE BEHIND THE SCENES SOMETHING CLEAR
WITH HASH TABLES IS HAPPENING. IT COULD BE A TREE OR A TRI, BUT WE’RE NOT
TALKING ABOUT THOSE SO WE’LL PRETEND THEY DON’T
EXIST. SO LET’S START WITH A BIG BLOCK OF MEMORY,
AN ARRAY WITH 100 ELEMENTS.
WE WANT TO STORE THE VALUE, HASH 215D29 AGAINST THE KEY COLOR SO, WE RUN A HASH FUNCTION ON
THE KEY, COLOR, THIS PRODUCES A NUMBER WHICH WE
MOD BY THE LENGTH OF OUR TABLE AND THIS GIVES
US AN INDEX.
AND THEN WE CAN STORE OUR VALUE AT THE LOCATION POINTED TO BY THE INDEX, THAT’S A HASH TABLE.
EASY PEASY. BUT IT IS NOT SO EASY NOR QUITE SO PEASY.
AS THE ARRAY FILLS UP WITH VALUES, IT GETS MORE
AND MORE LIKELY THAT WE’LL HAVE A HASH VALUE POINT TO A SPOT IN THE ARRAY THAT’S ALREADY
FULL. THIS IS CALLED A COLLISION AS YOU MIGHT IMAGINE.
WHAT DO WE DO WHEN THERE’S ALREADY A VALUE IN THE
SPOT WHERE WE WANT TO STORE A VALUE? WE CAN KEEP WALKING FORWARD IN THE TABLE UNTIL
WE FIND AN AVAILABLE SPOT, THIS IS CALLED LINEAR
PROBING. ALTERNATIVELY, WE COULD LINK TO EVERY SPACE
IN THE ARRAY, THAT WAY OUR HASH TABLES CAN TAKE
AS MANY VARIABLES AS WE CAN THROW AT IT.
OR ROOT A TREE AT EVERY SPACE IN THE TABLE. THIS, ROOTING A SEPARATE DATA STRUCTURE AT
EVERY SPOT IN THE TABLE IS CALLED A CHAINED HASH
TABLE. THERE’S ONE PROBLEM WITH A CHAINED HASH TABLE,
HERE THERE’S A BUNCH OF DIFFERENT VALUES STORED WHERE WE’RE LOOKING FOR THE KEY.
GEEZ, HOW DO WE KNOW WHICH IS THE RIGHT ONE. WE NEED TO STORE THE KEY WITH THE VALUE IN
THE SECOND DATA STRUCTURE SO WE CAN MAKE SURE
WE’RE RETRIEVERING THE RIGHT THING.
THIS IS GOING TO BE THE CASE ANY TIME WE HAVE TO
DEAL WITH COLLISION. IF WE CAN RETRIEVE MULTIPLE POTENTIAL VALUES
FROM AUER HASH TABLE, WE NEEDS TO BE ABLE TO TELL
WHICH IS THE CORRECT VALUE. SO EVEN WITH SOME STRATEGY FOR COLLISION
DETECTION IN PLACE, IT’S POSSIBLE FOR THE TABLE
TO GET SO FULL THAT IT PERFORMS VERY SLUGGISHLY. A CROWDED CHAINED HASH TABLE IS ONLY A LITTLE
BIT BETTER THAN A LINKED LIST.
OR IN THE CASE OF HASHING STRATEGIES THAT JUST
SHUFFLE ADDRESSES AROUND, IT’S POSSIBLE FOR THE
TABLE TO BECOME COMPLETELY FULL. WHEN THIS HAPPENS, IT’S TIME TO REBUILD THE
HASH? THIS IS THE TIME-CONSUMING PROCESS OF ADDRESSING
AN EVEN BIGGER WHACK OF MEMORY, TAKING ALL THE
KEYS OUT OF THE ARRAY, REHASHING THEM AND PUTTING
THEM IN THE SECONDS ARRAY. THERE’S ONE LANGUAGE I CAN THINK OF WHOSE
DEFAULT HASH TABLE IMPLEMENTATION CAN PERFORM THIS
COMPUTATIONALLY IMPORTANT STEP UNEXPECTEDLY ANY
TIME AN INSERT PUSHES THE LOAD ABOVE THE LOAD LIMIT.
FOR THE SAKE OF POLITENESS, I’M NOT GOING TO
MENTION WHICH LANGUAGE. OF COURSE, LINEAR PROBING AND CHAINED HASHING
ARE NOT THE ONLY MANAGEMENT STRATEGIES, THERE
ARE MANY WAYS TO MANAGE A HASH TABLE, LIKE ROBIN
HOOD HASHING WHICH STEALS DATA FROM THE RICHER
TABLES AND PUT ITS INTO THE POORER TABLES, OR HOPSCOTCH
WHERE YOU ARRAY IT ON THE PAVEMENT. YOU PROBABLY SHOULDN’T FACT-CHECK ME ON THOSE
LAST TWO. SO A HASH TABLE USES THE HASH FUNCTION AS
A LOCATION TO STORE THE DATA IN MEMORY.
INSERT HAPPENS IN CONSTANT TIME. DELETE HAPPENS IN CONSTANT TIME.
LOOK UP, HAPPENS IN CONSTANT TIME. AND YOU SHOULDN’T LINEAR SEARCH THROUGH A
HASH TABLE.
SO THAT FIRST FEW MINUTES OF THE PRESENTATION WAS
THERE TO GET YOU UP TO SPEED ON THE BASIC. NOW LET’S GET TO SOME OF MEAT OF THE
PRESENTATION. BLOOM FILTERS, THEY’RE FINANCE.
A BLOOM FILTER IS A GREAT WAY TO KEEP BALLOONS OUT OF YOUR FACE.
A BLOOM FILTER IS A DATA STRUCTURE THAT’S FAST
AND SPACE EFFICIENT, USED TO TEST FOR MEMBERSHIP IN A SET.
THAT’S IMPORTANT, IT TESTS FOR SET MEMBERSHIP, IT
DOESN’T STORE ANY DATA, IT CAN TELL YOU IF SOMETHING IS IN A SET BUT YOU CAN’T RETRIEVE
AN ITEM FROM THE SET.
LIKE IF WE HAVE THREE OBJECTS, BANANA, APPLE, BOWLING BALL AND A BLOOM FILTER REPRESENTING
THE SET OF FRUIT, WE CAN USE THE SET TO DETERMINE
THE BANANA AND APPLE ARE PROBABLY FRUIT AND THAT
BOWLING BALL IS DEFINITELY NOT. BUT WE CAN’T GIVE YOU A HIS OF ALL THE FRUIT
THAT WE’RE USED TO POPULATE THE SET.
WE DON’T KNOW THEM. SO A LOT OF THE TIME BLOOM FILTERS ARE USED
TO ANSWER QUESTIONS LIKE, IS CHUMP’S A REAL WORLD?
NO, CHUPPIES IS NOT A REAL WORLD. IS EVIL.YOU A REAL WEBSITE, I DON’T KNOW.
AND IS MAIN.CSS IN THE CACHE? YES, IT IS.
IS THIS A BAND IMAGE? SO LET’S LOOK AT THE PROBLEM OF BANNED IMAGES
A LITTLE BIT.
LET’S IMAGINE WE’RE RUNNING A FORUM AND OUR ITINERANT USERS KEEP POSTING BANNED IMAGES.
HOW DO WE KNOW, LOOKING AT IMAGE DATA IF AN IMAGE
HAS BEEN BANNED? LET’S SAY WE RUN A HASH FUNCTION ON THE IMAGE.
YOU CAN HASH IMAGES, HASH JUST ABOUT ANYTHING THAT’S DATA.
MOD THE RESULT BY THE LENGTH OF AN ARRAY, MOVE TO
THAT INDEX LOCATION IN THE ARRAY, AND SAVE A LINK
TO THE IMAGE THERE. THEN WHEN WE’RE CHECKING A NEW IMAGE, WE CAN
HASH IT AND SEE IF IT’S IN OUR ARRAY.
OF COURSE, THIS IS JUST A BUG STANDARD HASH TABLE
AND I’M SUPPOSED TO BE TALKING ABOUT BLOOM FILTERS THIS.
WORKS FINE, THE STRATEGY I’VE USED SO FAR, BUT
TAKES UP A LOT OF SPACE AND I PROMISED YOU SPACE
EFFICIENCY. LET’S IMAGINE THERE ARE 5,000 IMAGES WE WANT
TO KEEP OUT OF OR FORUMS AND THEY TAKE UP 100
KILOBYTES EACH. THAT MEANS ABOUT A 500 MEGABYTE TABLE OF
DUPLICATE IMAGES. WHILE NOT A TON OF SPACE, IT’S ENOUGH THAT
YOU PROBABLY WON’T WANT TO KEEP THAT IN RAM OR
SEND IT TO A CLIENT.
REMEMBER, THOUGH, WE DON’T NEED STORAGE FROM THE
DATA STRUCTURE, WE’RE ONLY INTERESTED IN WHETHER OR NOT THIS IMAGE EXISTS IN OUR DATA STRUCTURE.
LET’S IMAGINE WE STORE ZEROS IN EVERY SPACE IN
OUR TABLE AND WE STORE ONES WHERE THE HASH FUNCTIONS LANDS.
THAT WAY WE CAN CHECK IF AN IMAGE IS BENT BY
CHECKING IT AND SPOTTING IF IT HAS A ONE IN IT.
IF THERE’S NO ONE THERE, WE CAN’T POSSIBLY HAVE
SEEN THAT IMAGE BEFORE. THERE’S ONLY ONE SLIGHT PROBLEM WITH THIS
TECHNIQUE, WHAT HAPPENS WHEN WE HAVE A DIFFERENT IMAGE THAT ACCIDENTALLY COLLIDES WITH AN IMAGE
THAT WE SET EARLIER. THIS CREATES A FALSE-POSITIVE AND UNFAIRLY
TAKES PICTURES OF NICHOLAS CAGE OUT OF CIRCULATION.
SO HOW DO WE STOP COLLISIONS LIKE THIS FROM OCCURRING?
WELL, WE COULD USE A HASH FUNCTION THAT GUARANTEES AN ASTRONOMICALLY LOW PROBABILITY
OF COLLISION.
MD5, FOR EXAMPLE. IT’S A HASH FUNCTION THAT GUARANTEES AN
ASTRONOMICALLY PROBABILITY OF COLLISION BUT WITH
THAT LOW PROBABILITY OF COLLISION, YOU GET AXIOMATICALLY A HIGH NUMBER OF OUTPUTS WITH
ONE BIT FOR EVERY OUTPUT, YOU’RE SUDDENLY SADDLED
WITH FOUR TIMES TEN TO THE 25 TERABYTES IN YOUR
ARRAY, WHICH IS ALSO A LITTLE BIT UNREALISTIC TO
CRAM INTO YOUR RAM. BUT IT DOES HIT ON ONE OF THE TWO STRATEGIES
WE COULD USE TO REDUCE COLLISIONS IN OUR BIT
ARRAY, MORE SPACE.
LET’S LOOK AT THE LESS OBVIOUS STRATEGY FOR REDUCING COLLISIONS.
MORE HASH FUNCTIONS. INSTEAD OF HASHING OUR IMAGE JUST ONCE, LET’S
HASH IT TWICE, WITH TWO DIFFERENT HASH FUNCTIONS. THIS GIVES US TWO DIFFERENT LOCATIONS IN THE
TABLE, AND WE CAN STORE ONE AT BOTH OF THEM. ANOTHER IMAGE MIGHT SHARE THE RESULT OF ONE
OF THE HASH FUNCTIONS, BUT IT’S NOT AS LIKELY
TO SHARE THE RESULTS OF ALL OF THEM.
AND IF ANY OF THE HASH FUNCTIONS POINT TO A
LOCATION WITH A ZERO IN IT, WE KNOW THAT THIS OBJECT CAN NEVER HAVE BEEN ENTERED INTO THE
BLOOM FILTER.
SO THIS IS A BLOOM FILTER, A BIT FIELD AND MULTIPLE HASH FUNCTIONS TO SET THE BITS IN
THAT FIELD.
WE PUT ITEMS IN BY HASHING THEM MULTIPLE TIMES AND SETTING THE BITS AT ALL OF THOSE LOCATIONS
AND WE CHECK IF ITEMS ARE IN THE SET BY HASHING ITEMS MULTIPLE TIMES AND CHECKING IF THE BITS
ARE SETS AT ALL OF THOSE LOCATIONS.
THERE ARE SOME DOWNSIDES TO THIS. BECAUSE EACH ITEM WE PUT INTO THE BLOOM FILTER
HAS MULTIPLE HASH FUNCTIONS, THERE CAN BE SOME
OVERLAP, WHICH MEANS WE CAN NEVER DELETE ANYTHING FROM THE BLOOM FILTER.
IF WE DO, WE RUN THE RISK OF ACCIDENTALLY DELETING SOMETHING ELSE FROM THE FILL —
ACCIDENTALLY DELETING SOMETHING ELSE FROM THE
FILTER. ON TOP OF THAT, WE’VE REDUCED THE CHANCE OF
COLLISION BUT IT’S IMPRACTICAL TO REDUCE THAT CHANCE TO EFFECTIVELY ZERO SO ALWAYS SOME
CHANCE OF A FALSE POSITIVE.
HOWEVER, THE PROBABILITY OF A FALSE POSITIVE IS A
NUMBER THAT WE HAVE SOME CONTROL OVER. IF WE KNOW THE DESIRED PROBABILITY OF A
COLLISION, IN THE CASE OF OUR IMAGE FILTER, LET’S
SAY 0.1% AND THE NUMBER OF THINGS THAT WE WANT TO
PUT 2349 FILTER, WE MENTIONED BEFORE, 5,000 IMAGES, WE CAN USE HAND-WAVEY MATH TO DETERMINE
HOW MUCH SPACE WE NEED AND WHAT WE NEED TO GET
THE OPTIMAL SOLUTION. WAVING MY HANDS, YOU DON’T TO HAVE MEMORIZE
THIS OR WRITE IT DOWN, IT’S EASY TO FIND ON THE
INTERNET. SO DOING THIS MATH WITH OUR NUMBERS, WE NEED
AN ARRAY OF 71,888 BITS, WHICH IS 8.8 KILOBYTES
AND THAT’S SMALL ENOUGH TO KEEP IT IN RAM.
WE COULD SEND IT TO THE CLIENT SIDE IF WE WANTED
TO. IT’S EASIER TO WORK WITH.
AND WE NEED TEN DIFFERENT HASH FUNCTIONS. THAT IS TO SAY THE OPTIMAL PACKING REQUIRES
THAT EACH ITEM WE PLACE IN THE TABLE SET TEN DIFFERENT
BITS IN THE BLOOM FILTER. A LOT OF THE TIME BLOOM FILTERS ARE PAIRED
UP WITH OTHER DATA STRUCTURE THAT HANDLE THE
ACTUAL STORAGE.
THERE’S SUPER USEFUL WHEN PAIRED WITH DATA STRUCTURES THAT EXHIBIT WORST-CASE PERFORMANCE.
IN A LINKED LIST OR UNSORTED LARGE ARRAY, FOR
EXAMPLE, YOU GET THE WORST POSSIBLE CASE WHEN YOU’RE SEARCHING FOR AN ITEM THAT JUST ISN’T
THERE. THE BLOOM FILTER CAN CHECK BEFORE YOU HIT
THE DATA STRUCTURE IN THE DATA IS IN THE DATA
STRUCTURE. THEY’RE ALSO VERY USEFUL WHEN THE RETRIEVAL
STEP FOR DATA TAKES A LONG TIME.
FOR EXAMPLE, WHEN A NETWORK CALL NEEDS TO BE MADE
TO A FAR-AWAY DATABASE, A LOCAL BLOOM FILTER IS A
WONDERFUL WAY TO KNOW IF YOU’RE WASTE CRINGE EVERYBODY’S TIME WITH A REQUEST FOR DATA THAT
JUST DOESN’T EXIST. OR DATA WITH A VERY LOW HIT RATE.
IF YOU’RE DEALING WITH THE SORT OF DATA WHERE YOU
GET TEN MISSES FOR EVERY HIT, IS THIS A SUSPICIOUS WEBSITE, FOR EXAMPLE, YOU CAN CATCH
THE MISSES WITH A BLOOM FILTER. SO IN SUMMARY, THIS IS A VERY SMALL SUMMARY
FOR THE PEOPLE IN THE BACK SO I’LL TRY TO BE LOUD,
BLOOM FILTERS ARE FAST, COMPRESSED, STORAGE-FREE DATA STRUCTURE USED TO CHECK FOR SET MEMBERSHIP.
IT’S IMPLEMENTED AS A SET OF HASH FUNCTIONS, POINTING TO A BIT ARRAY, NO RETRIEVAL IS ALLOWED,
AND NO REMOVAL IS ALLOWED. SO NOW LET’S TALK ABOUT HOW TO PICK WHICH
HASH FUNCTIONS TO USE WHEN YOU’RE BUILDING DATA
STRUCTURES. I SHOWED YOU MY AWESOME CHEESE HASH FUNCTION
EARLIER BUT IT’S TERRIBLE AND ON TOP OF THAT, REALLY ONLY WORKS ON STRINGS.
NOW, THE NUMBER OF DIFFERENT HASH FUNCTIONS ARE
COUNTLESS, EACH ONE IS A UNIQUE SNOWFLAKE, ONLY
THREE OF THE ONES ON THIS BOARD ARE MADE UP AND
IF YOU CAN PICK THEM OUT, I MIGHT GET A CHANCE TO
ASK YOU WHICH ONES THEY ARE AT THE END OF THE
PRESENTATION. I’LL COME BACK TO THAT SLIDE.
IF WE NEED TO. WHICH ONES DO WE PICK FOR OUR DATA STRUCTURES?
FOR DATA STRUCTURES, THE PROPERTIES WE’RE LOOKING
FOR ARE THAT THE HASH FUNCTION SHOULD BE FAST AND
WELL DISTRIBUTED. WHEN WE SAY FAST, WHAT WE MEAN IS NOT
CRYPTOGRAPHIC. CRYPT CRASH HASHES ARE AWESOME, THEY HAVE
A BUNCH OF — THE MOST IMPORTANT ONE, THEY’RE COLLISION
RESISTANT, IT’S ASTRONOMICALLY UNLIKELY THAT YOU’LL EVER BE ABLE TO FIND TWO DIFFERENT
ITEMS THAT HAVE THE SAME HASH VALUE.
BUT THE CRYPTOGRAPHIC FEATURES ALSO MAKE THEM MORE PROCESSER HUNGRY, SO WE SHOULD AVOID
SHAH OR MD5 WHEN WORKING ON DATA STRUCTURE STUFF WE
DON’T NEED THE SECURITY.
SO NON-CRYPTOGRAPHIC. AND WELL-DISTRIBUTED WHICH MEANS THAT NO MATTER
HOW SIMILAR YOUR DATA IS GOING INTO THE DATA STRUCTURE, THE OUTPUT SHOULD APPEAR ALL OVER
THE SPECTRUM.
HASH FUNCTIONS THAT EXHIBIT THIS QUALITY ARE KNOWN AS AVALANCHING HASHES, BECAUSE SMALL
CHANGES IN THE INPUT LEAD TO LARGE CHANGES IN THE
OUTPUT. OF COURSE, THIS IS ALSO A DESIRABLE PROPERTY
IN CRYPTOGRAPHIC HASHES BUT THIS ONE WE’RE WILLING
TO BLOW OUR TIME ON BECAUSE IT’S IMPORTANT FOR
DATA STRUCTURE THAT HAVE HASH FUNCTIONS TO HAVE
WELL-DISTRIBUTED HASH FUNCTIONS. A COMMON HASH USE FOR THIS PURPOSE IS THE
NON-CRYPTO GRAPHIC, WELL AVALANCHING PUBLIC DOMAIN MURMUR THREE, IMPLEMENTATIONS EXIST
FOR MOST MODERN LANGUAGES, INCLUDING PYTHON AND
WHICH HAS APPEARED IN MOST OPEN SOURCE PRODUCTS.
IT ALSO HAS A SEED VALUE SO YOU CAN CREATE DOZENS
OF DIFFERENT HASH FUNCTIONS OUT OF THE MURMUR 3
BASE HASH FUNCTION, JUST BY CHANGING THE SEED VALUE.
THERE ARE SOME REASONS, THOUGH, THAT YOU MIGHT ACTUALLY WANT CRYPTOGRAPHIC HASHES IN YOUR
DATA STRUCTURES.
A CLEVER ATTACKER, FOR EXAMPLE, MIGHT TAKE ADVANTAGE OF YOUR DATA STRUCTURE AND ONLY
INSERT DATA THAT MATCHES A CERTAIN SMALL SET OF HASHES,
FORCING COLLISION AFTER COLLISION, UNTIL YOUR WHOLE APPLICATION FALLS OVER.
IT’S BEEN DEMONSTRATED THAT PYTHON IS VULNERABLE TO THIS SORT OF ATTACK, ALTHOUGH THE BDFL
CALL THIS SOME SECURITY RESEARCHERS DRUMMING UP
BUSINESS. IN PEP 456, CHRISTIAN HEIMS LAYS OUT THE REASON
BEHIND CHOOSING A CRYPTOGRAPHIC HASH FUNCTION AS
THE DEFAULT FOR BITS AND BYTES, COMPARING THE —
AS WELL AS THE PRIVATE CPYTHON HASH IMPLEMENTATION.
THIS PIP WAS ACCEPTED AND PYTHON 3.4, IT’S SIP
HASH. ONE OTHER THING.
WE HASHED AN IMAGE TO PUTS IN OUR DATA STRUCTURE EARLIER.
WHAT SORT OF HASH FUNCTION WORKS ON AN IMAGE? MOST OF THEM, REALLY, BUT A PERCEPTUAL HASH,
OR P-HASH IS DESIGNED 20 CLUSTER SIMILAR IMAGES
TOGETHER IN THE HASH OUTPUT. FOR EXAMPLE, IF IT’S THE SAME IMAGE BUT SLICED
LARGER OR SKEWED TO THE LEFT, IT SHOULD END UP
WITH A HASH FUNCTION VERY CLOSE TO THE ORIGINAL IMAGE.
OF COURSE, BY NATURE, THIS HASH FUNCTION WON’T BE
DISTRIBUTED IN THE WAY THAT WE WOULD NEED FOR AN
OPTIMAL GENERAL PURPOSE DATA STRUCTURE. IT’S THE OPPOSITE OF AN AFTER LAP HAVING HASH,
SMALL CHANGES IN THE OUTPUT LEAD TO ALMOST NO
CHANGES IN THE OUTPUT. I SAID OUTPUT TWICE IN THAT LAST SENTENCE.
PAY NO ATTENTION. BUT WE CAN PAY — WE CAN ABUSE THAT PROPERTY
SO THAT FALSE POSITIVE ARE UNFAIRLY CLUSTERED
ON IMAGES THAT LOOK VERY SIMILAR TO OUR BAND
IMAGES, WHICH WOULD ACTUALLY PROBABLY BE A GOOD THING
IN THE CASE OF TRYING TO FIND BANNED IMAGES,
SO THAT CONCLUDES THE DATA STRUCTURE PORTIONS OF THE
TALK. NOW LET’S TACK ABOUT HASH FUNCTIONS CONTRIBUTE
TO THE SECURITY OF YOUR APPLICATION.
SO YOU’RE RUNNING A WEB APPLICATION AND THE WORST
CASE SCENARIO HAPPENS. SOME HACKER MAKES OFF WITH THE USER TABLE
FROM YOUR DATABASE.
HOW? I DON’T KNOW, FOR THE SAKE OF ARGUMENT, LET’S
SAY SQL INJECTION.
AT THIS POINT, YOU ARE YOUR USER ARE SHIT OUT OF
WILL BE, SOME BRIGGAND MADE OFF WITH THEIR PASSWORDS.
NO, SOMEBODY SECURED THEM BEFORE SAVING THEM. HOPEFULLY YOU ARE A AWARE OF THIS.
IN ORDER TO HIDES THE PASSWORDS, WHEN THE USER
SAVERS THE PASSWORDS, WE DON’T SAVE THE PASSWORDS ITSELF, WE SAVE THE RESULT OF THE HASH FUNCTION.
LATER, WHEN THE USER TRIES TO LOG IN, THE THEY
TRY TO — THEY PROVIDE A PASSWORD AND WE HASH ITS.
IF THE TWO ARE CORRECT, THE USER HAS PRIDES THE
CORRECT PASSWORD. IF THE TWO DIFFERENT PASSWORDS EVER COLLIDES,
IF TWO HASH STREAMS GO TO THE SAME VALUE, IT
WOULD BE IMPOSSIBLE TO LOG INTO OUR SITE WITH THE
WRONG PASSWORD.
THAT’S BAD. DO YOU REMEMBER EARLIER WHEN I SAID THAT
CRYPTOGRAPHIC HASHES LIKE MD 5, FOR EXAMPLE, HAVE
A FEATURE CALLED COLLISION RESISTANCE WHICH MEANS
THAT TWO INPUTS ARE ASTRONOMICALLY UNLIKELY TO
COLLIDE? HERE IS WHERE THAT IS SUPER IMPORTANT.
NOW OUR PASSWORDS ARE PROTECTED. YEP.
TOTALLY PROTECTED. NOTHING CAN POSSIBLY GO WRONG.
OKAY, SO LET’S TALK ABOUT DICTIONARY ATTACKS, THE
WAY TO BREAK HASHED PASSWORDS. SO WE’VE STOLEN A WHOLE DATABASE FULL OF USER
NAMES AND HASH PASSWORDS AND WE WANT TO GET AT
THE RAW PASSWORDS SO THAT HE CAN TRY THEM OUT ON
BANKING SITES AND STUFF. SO WHAT WE CAN DO IS CREATE A LIST OF EVERYTHING
THAT WE CAN THINK OF AS A POSSIBLE PASSWORD, AN
ENORMOUS COMPREHENSIVE LIST OF PASSWORDS. THEN WE USE THE SAME HASH THAT THE PROGRAMMER
USED TO HASH THE ORIGINAL PASSWORDS AND WE RUN
THAT ON EVERY SINGLE ITEM IN OUR GIGANTIC PASSWORD LIST.
NOW WE HAVE A GIANT COLLECTION OF HASHED TO PASSWORDS PAIRS WHICH WE CAN COMPARE AGAINST
THE ORIGINAL DATA.
ANY TIME WE FIND A MATCH WITH ONE OF THE HASHES IN OUR SET, WE KNOW WHAT THE PASSWORD MUST
HAVE BEEN.
THIS CHECKING OF EVERY HASH IN THE SET AGAINST EVERY HASH IN THE DATABASE IS N SQUARED BUT
IT’S INHERENTLY VERY PARALYZABLE SO WITH A BIT
OF OPTIMIZATION, THIS CAN RUN VERY FAST AND SOME
PEOPLE HAVE MANAGED TO USE GRAPHIC SOFTWARE TO
RUN THEM VERY QUICKLY. SO THIS IS CALLED A PRE-COMPUTED DICTIONARY
ATTACK. IT’S IMPORTANT TO NOTE THAT THIS ISN’T A RAINBOW
TABLE, A RAINBOW TABLE IS A DIFFERENT THING, INVOLVE LASH CHAINS THAT ACCOMPLISHES THE
SAME TASK BUT USING A LOT LESS STORAGE SPACE.
AND I’M NOT GOING TO DESCRIBE THEM TODAY BECAUSE I’M NOT SURE IF I HAVE ENOUGH TIME AND I DON’T
UNDERSTAND THEM SUPER WELL. SO THE MD 5 HASH FUNCTION IS SO COMMON THAT
JUSO SOLONAN RELEASED A UTILITY CALLED BOZO CRACK,
THAT TAKES A PASSWORD, SEARCHING FOR IT ON GOOGLE
AND MD5 HASHING EVERYTHING THAT COMES BACK IN
GOOGLE RESULTS UNTIL IT FINDS A MATCH. NOT AS COMPREHENSIVE AS MD5 DICTIONARY BUT
STILL MANAGES TO BE DEPRESSINGLY EFFECTIVE, 85%
EFFECTIVE, SCARY. PARTS OF THE REASON THESE ATTACKS ARE SO
EFFECTIVE IS BECAUSE EVERY USER’S PASSWORD HAS
BEEN HASHED WITH THE SAME HASH FUNCTION. IT’S POSSIBLE FOR US TO TEST PASSWORDS AGAINST
THE ENTIRE TABLE ALL AT ONCE. WHILE IT’S UNLIKELY WE’LL NEVER CRACK ALL
THE PASSWORDS, WE’RE ABLE TO FIND OUT THE USERS
WHO HAVE SIMPLE PASSWORDS VERY, VERY EASILY.
JUST TESTING THE WHOLE DATABASE AGAINST THE THOUSAND MOST COMMON PASSWORDS SHOULDN’T TAKE
MORE THAN AN HOUR AND WILL PROVIDE US WITH A
WEALTH OF POTENTIAL USEFUL DATA. SO WHAT WE WANT TO DO TO REDUCE THE EFFECTIVENESS
OF THIS KIND OF ATTACK IS TO USE A DIFFERENT HASH
FUNCTION FOR EVERY SINGLE PERSON IN THE DATABASE. THIS DOESN’T MEAN WE NEED TO ENLIST THOUSANDS
OF DIFFERENT IMPLEMENTATIONS, WHAT IT MEANS WE
NEED TO USE ONE WHICH WE SEED WITH A DIFFERENT
VALUE FOR EACH USER IN THE TABLE.
IT HAS TO BE A VALUE THAT WE HAVE ACCESS TO. BECAUSE WE NEED TO BE ABLE TO RECREATE THIS
CUSTOM HASH FUNCTION EVERY TIME WE CHECK THE USER’S PASSWORD.
IT’S QUITE COMMON TO USE THE USER NAME FOR THIS
VALUE. DOESN’T MEAN YOU SHOULD USE THE USER NAME,
CRIP TO GO FEARS REQUIRE THAT YOU USE A LARGE RANDOM
BLOCK TO USE AS A SEED. THIS HASH SEEDING VALUE IS CALLED ASSAULT.
SO ASSAULT IS RANDOM DATA USED AS ADDITIONAL INPUT TO A HASH FUNCTION TO PROTECT AGAINST
RAINBOW TABLE AND DICTIONARY STYLE ATTACKS. OKAY, SO THAT COVERS THE VERY, VERY BASICS.
AND WHEN I SAY BASIC, I MEAN THE VERY MOST BASICS.
THIS ISN’T EXACTLY STATE OF THE ART. UNIX CRYPT FUNCTIONS FIRST USED SALT HASHES
IN 1976, A FULL DECADE BEFORE I WAS BORN.
SO NOW LET’S TALK ABOUT NOT USING MD5, WHICH IS
AN IMPORTANT TOPIC AND HASHES ALL TO ITSELF. MD5 IS WELL KNOWN AND WELL UNDERSTOOD AND
BEEN IN USE FOR A VERY LONG TIME.
UNFORTUNATELY, IN THAT VERY LONG TIME, SOME CRACKS HAVE BEGUN TO SHOW IN THIS VENERABLE
HASHING ALGORITHM. THE BIGGEST REASON TO AVOID IT IS IT’S TOO
FAST. IT’S POSSIBLE TO MD5 HASH MILLIONS OF VARIABLES
PER SECOND. THIS MAKES BRUTE FORCE ATTACKS AGAINST PASSWORDS
VERY, VERY EASY. THIS HASHING PYTHON SCRIPT I WROTE RUNS ON
A CHEAP LITTLE VIRTUAL MACHINE, ONE MILLION
MD5 HASHES IN 1.5 SECONDS.
AND THIS IS JUST LIKE A VAGRANT 512, YOU KNOW, POP-UP VIRTUAL MACHINE ON MY COMPUTER, NOT
REALLY SOMETHING THAT IS BEEFY WITH A MUCH FASTER
COMPUTER, YOU CAN RUN MANY MORE HASHES MUCH MORE
QUICKLY. SO INSTEAD OF MD5, I RECOMMEND BCRYPT OR SCRYPT
OR PDKF2 BCRYPT IS DESIGNED TO BE SLOW. USING THE DEFAULT SITTING ON THE SAME VIRTUAL
MACHINE, THIS RUN WHERE I USED BCRYPT INSTEAD OF
MD5 WOULD TAKE ME 3.4 DAYS. THE LAST ONE WAS 1.5 SECONDS, THIS TOOK ME
3.4 DAYS.
THIS COMBINED WITH ASSAULT FOR EACH INDIVIDUAL USER MEANS THAT BRUTE FORCING PASSWORDS OUT
OF A DATABASE COULD TAKE DAYS OR MONTHS PER USER
RATHER THAN AN HOUR OR TWO. IT ALSO COMES WITH A WORK VARIABLE THAT YOU
CAN CRANK UP TO 11.
WHEN I TURN IT UP TO 15, MY MILLION HASH SCRIPT GOES FROM 3.4 DAYS TO RUN TO 26 DAYS TO RUN.
I DIDN’T ACTUALLY LET IT RUN FOR 26 DAYS, I
ACTUALLY LETS IT RUN FOR A MUCH SMALLER AMOUNT OF
TIMES AND MULTIPLIED THE TIME. SO AT THIS POINT LOGGING A SINGLE USER INTO
PLY SITE COULD TAKE A COUPLE OF SECONDS, SO I’M
NOT SURE THEY’RE WILLING TO WAITS THAT LONG BUT
NICE TO HAVE THE VARIABLE THERE THAT I CAN CRANK
UP. SO FOR PASSWORD SECURITY, YOU SHOULD BE USING
SOMETHING THAT’S SPECIFICALLY DESIGNED FOR PASSWORDS SECURITY.
LIKE BCRYPT OR PBKDF 2. BUT EVEN FOR GENERAL PURPOSES, IF YOU’RE LOOKING
FOR A SECURE HASH ALL GO RHYTHM, MD5 IS KINDS OF
OLD AND BROKEN, IT’S BEEN SHOWN THAT YOU CAN FORCE COLLISIONS IN MD5, BY RESEARCHERS BUT
IT COULD HAPPEN.
SO INSTEAD TRY SECURE HASH ALL GO RHYTHM. SO SECURE HASH ALGORITHM, OR SHAR 51123 HAS
BECOME A NEW STANDARD FOR GRAPHING, ONE THAT’S MUCH MORE SECURE.
AND THAT’S BEEN HASH FUNCTIONS, HASH TABLES, BLOOM FILTERS, CHOOSING HASH FUNCTIONS AND
HASHES AND SECURITY, WHICH CONCLUDES THE BODY OF
THE PRESENTATION.
NOW, I BELIEVE THAT LEAVES US A LITTLE BIT OF
TIME FOR Q AND A. BEFORE WE START Q AND A, THOUGH, I’M GOING
TO BRING UP MY LIST OF FAKE HASH FUNCTIONS AND
I’M GOING TO SEE IF YOU CAN GUESS WHICH ONES ARE
THE FAKES.
[ Audio Indiscernible ]>>NO, NOT THAT ONE.
SOME 41 WAS A FAKE HASH FUNCTION. COME ON HASH FUNCTION SLIGHT, I KNOW YOU’RE
IN HERE SOMEWHERE.
OH, IT WAS IN DATA STRUCTURES, IT WASN’T IN SECURITY.
THAT’S WHY. HERE WE GO.
SO WE’VE SUCCESSFULLY IDENTIFIED SOME 41 AS A
FAKE HASH. [ Audio Indiscernible ]
THE LESTER B. PEARSON HASH IS A FAKE HASH. THE PIERSON HASH IS A REAL HASH, THOUGH.
LET’S SEE HERE. MD5 IS REAL.
ANY MORE GUESSES? [ Audio Indiscernible ]
THE PERFECT HASH IS A REAL HASH. THE PERFECT HASH IS A HASH FUNCTION THAT HAS
BEEN — IT’S BEEN CALCULATED IN ADVANCE SO THAT
THE SET OF INPUTS YOU HAVE PERFECTLY MAP TO A SET
OF OUTPUTS. A REALLY INTERESTING HASH.
AND — THE FAKE HASH HERE IS THE COUNTRY HASH. CITY HASH IS REAL, COUNTRY HASH IS FAKE.
PIERSON HASH IS FAKE, LESTER B. PIERSON IS FAKE.
SOME SOME 42 IS REAL, SOME 41 IS BANNED. SO HAVING CONCLUDED THAT, IF YOU HAVE ANY
QUESTIONS, NOW WOULD BE THE TIME TO ASK THEM. [ Applause ]
>>THANK YOU VERY MUCH, AND JUST A REMINDER TO
PLEASE ASK A QUESTION AND TRY TO BE CONCISE. THANKS.
>>AUDIENCE: JUST A QUICK QUESTION. IS THERE A GOOD BLOOM FILTER IMPLEMENTATION
FOR PYTHON ON PYPY OR SOMETHING?
>>I HAVE LOOKED AROUND. I REMEMBER LOOKING AND EVENTUALLY CREATING
MY OWN BUT I’M SURE THERE’S ONE ALREADY IN PYTHON
SOMEWHERE. IT’S A REALLY COMMON DATA STRUCTURE AND MUST
EXIST ALREADY, AND I THINK THE REASON I CREATED ONE WAS JUST BECAUSE I WANTED TO DO RESEARCH
FOR THIS PRESENTATION, SO…
>>AUDIENCE: THANK YOU.>>THAT’S ALL THE QUESTIONS.
>>THANK YOU VERY MUCH.>>THANKS, EVERYONE.
[ Applause ]

5 thoughts on “Curtis Lassam – Hash Functions and You: Partners in Freedom – PyCon 2015

Leave a Reply

Your email address will not be published. Required fields are marked *