Joe Regenstein, CPA, FPAC

The Challenge of Distinguishing Reality: An Insight into Real vs. AI-Generated Images

Photo by Steve Johnson / Unsplash

In today's digital age, where technology is rapidly evolving and artificial intelligence is creating content that rivals human-made productions, the question arises: Can we genuinely distinguish between real and machine-generated images? My son Ryan embarked on a journey to find out. Through an experiment, he sought to understand how well individuals can differentiate between real photographs and those generated by AI algorithms.

In this post we will explore the results, which occasionally get into statistical analysis. I am a Six Sigma Black Belt and have studied statistical methods, but the statistics summarized below were produced by ChatGPT using its data analysis feature. I simply prompted the AI assistant to look at the data in an Excel file and asked questions: How many participants accurately identified the images? Did age or gender sway the outcomes and, if so, were the differences statistically significant? Can you plot these results? Could the high scores simply be random? Throughout my commentary I will include the snippets of Python code ChatGPT created to manipulate the data so it could answer these questions.

Interestingly, ChatGPT made mistakes in its code along the way. When it did, it analyzed the error message, rewrote the code, and tried again without being prompted.

At the end of the post, we will see the correct answers and the percentage of those who answered correctly.

Overall Results

The findings were intriguing. Out of all the participants:

    import pandas as pd
    # Load the Excel file into a DataFrame
    df = pd.read_excel("/mnt/data/Real or AI-Generated Survey (Responses).xlsx")
    # Display the first few rows to get an overview of the data
    print(df.head())
    # Extract the column names that indicate whether the image was real or AI-generated
    image_labels = df.columns[4:]
    # Create a dictionary to store the correct answers for each image
    correct_answers = {}
    for label in image_labels:
        # Extract the real label from the column name (either "Real" or "AI-generated")
        real_label = label.split(".")[0]
        correct_answers[label] = real_label
    # Calculate the number of correct identifications by users for each image
    correct_counts = {}
    for label, real_label in correct_answers.items():
        correct_counts[label] = (df[label] == real_label).sum()
    # Calculate the percentage of correct identifications for real and AI-generated images
    real_columns = [col for col in image_labels if "Real" in col]
    ai_columns = [col for col in image_labels if "AI-generated" in col]
    real_correct_percentage = sum(correct_counts[col] for col in real_columns) / (len(df) * len(real_columns)) * 100
    ai_correct_percentage = sum(correct_counts[col] for col in ai_columns) / (len(df) * len(ai_columns)) * 100
    real_correct_percentage, ai_correct_percentage

These numbers underscore the remarkable capabilities of modern AI in creating lifelike visuals. The participants all faced challenges distinguishing between real and AI-generated images, highlighting the blurry line between reality and AI-produced content.
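
As a rough cross-check, the overall accuracy for each category can be recovered from the per-image results listed at the end of this post. The sketch below simply averages those percentages. It assumes every participant answered all ten images, so each image carries equal weight, and the variable and function names are mine rather than ChatGPT's.

```python
# Per-image accuracy (percent correct), copied from the image results
# at the end of the post and keyed by image number
real_image_accuracy = {4: 54.61, 1: 48.68, 7: 40.79, 3: 36.18, 10: 17.76}
ai_image_accuracy = {6: 47.37, 8: 36.18, 2: 28.29, 9: 25.66, 5: 18.42}


def mean_accuracy(per_image):
    """Average the per-image percentages (assumes equal responses per image)."""
    return sum(per_image.values()) / len(per_image)


real_overall = mean_accuracy(real_image_accuracy)
ai_overall = mean_accuracy(ai_image_accuracy)

print(f"Real images identified correctly: {real_overall:.1f}%")
print(f"AI-generated images identified correctly: {ai_overall:.1f}%")
```

Under that assumption, both categories land below 50%: roughly 39.6% for real images and 31.2% for AI-generated ones.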

Does Age Matter?

When diving deeper into the data, we observed variations across age groups:

    # Calculate the percentage of correct identifications by age group for both real and AI-generated images
    # Function to calculate the correct percentage for a given subset of data
    def calculate_correct_percentage(subset_df, image_type_columns):
        return sum((subset_df[col] == correct_answers[col]).sum() for col in image_type_columns) / (len(subset_df) * len(image_type_columns)) * 100
    # Calculate percentages by age group
    age_groups = df['Age'].unique()
    age_correct_percentages = {
        "Real": {},
        "AI-generated": {}
    }
    for age_group in age_groups:
        subset_df = df[df['Age'] == age_group]
        age_correct_percentages["Real"][age_group] = calculate_correct_percentage(subset_df, real_columns)
        age_correct_percentages["AI-generated"][age_group] = calculate_correct_percentage(subset_df, ai_columns)
    age_correct_percentages

However, were these differences statistically significant? The Chi-Square test ChatGPT ran gave a clear answer. A p-value below 0.05 would indicate a statistically significant association between age group and answering correctly. In this case, the p-values were well above 0.05: 0.39 for real images and 0.29 for AI-generated images. The variations in accuracy among age groups were therefore not statistically significant, so we have no evidence that age affects the ability to answer correctly.

    from scipy.stats import chi2_contingency
    # Create a contingency table for real images based on age groups
    contingency_real_age = []
    for age_group in age_groups:
        subset_df = df[df['Age'] == age_group]
        correct_count = sum((subset_df[col] == correct_answers[col]).sum() for col in real_columns)
        incorrect_count = len(subset_df) * len(real_columns) - correct_count
        contingency_real_age.append([correct_count, incorrect_count])
    # Create a contingency table for AI-generated images based on age groups
    contingency_ai_age = []
    for age_group in age_groups:
        subset_df = df[df['Age'] == age_group]
        correct_count = sum((subset_df[col] == correct_answers[col]).sum() for col in ai_columns)
        incorrect_count = len(subset_df) * len(ai_columns) - correct_count
        contingency_ai_age.append([correct_count, incorrect_count])
    # Conduct Chi-Square tests for age groups
    chi2_stat_real_age, p_val_real_age, _, _ = chi2_contingency(contingency_real_age)
    chi2_stat_ai_age, p_val_ai_age, _, _ = chi2_contingency(contingency_ai_age)
    p_val_real_age, p_val_ai_age

This surprised me; I was hoping to see whether life experience made a difference in the results. When reviewing the images at the end of the post, we will see that AI makes identifiable mistakes.

Does Gender Matter?

We also asked participants to list their gender:

Yet, when tested for statistical significance using a Chi-Square test, these differences also did not prove statistically significant, meaning we found no evidence that gender affects accuracy. This did not surprise me; I can't imagine how gender would make a difference in identifying real content.
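
The code for the gender test isn't reproduced in this post, but it would presumably mirror the age test. Here is a minimal sketch of a reusable version, run on a tiny made-up table so it works on its own. The function name `accuracy_chi_square`, the `Gender` column name, and the toy answers are all assumptions for illustration, not the survey's actual data.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def accuracy_chi_square(df, group_col, image_cols, answer_key):
    """Chi-Square test of correct vs. incorrect answer counts across groups."""
    table = []
    for _, subset in df.groupby(group_col):
        # Count answers in this group that match the correct label
        correct = sum((subset[col] == answer_key[col]).sum() for col in image_cols)
        incorrect = len(subset) * len(image_cols) - correct
        table.append([int(correct), int(incorrect)])
    chi2_stat, p_value, dof, _ = chi2_contingency(table)
    return chi2_stat, p_value, dof


# Toy data (hypothetical): four respondents, one real and one AI-generated image
toy_df = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Real": ["Real", "AI-generated", "Real", "Real"],
    "AI-generated": ["AI-generated", "Real", "AI-generated", "Real"],
})
toy_key = {"Real": "Real", "AI-generated": "AI-generated"}

chi2_stat, p_value, dof = accuracy_chi_square(
    toy_df, "Gender", ["Real", "AI-generated"], toy_key
)
print(f"chi2={chi2_stat:.3f}, p={p_value:.3f}, dof={dof}")
```

The same function could be pointed at the `Age` column to reproduce the age test, since only the grouping column changes.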

Diving into the Distribution: Understanding User Performance

Another aspect to consider is how individual users perform. The survey results were anonymous so we used a unique timestamp as a placeholder for a user. I asked ChatGPT to visualize the distribution using a histogram, box plot, and density plot. The latter was to compare actual results to what the distribution would look like if everyone just randomly guessed.
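
To make the comparison concrete, here is a minimal sketch of how per-user scores and a random-guessing baseline could be computed. The names `user_correct_percentages` and `simulated_scores` match the variables used in the density-plot snippet later in the post, but the helper functions, the toy responses, and the figure of 152 simulated users (a count consistent with the per-image percentages at the end of the post) are my assumptions, not ChatGPT's actual code.

```python
import numpy as np
import pandas as pd


def score_users(responses, answer_key):
    """Return each respondent's percentage of answers that match the key."""
    key = pd.Series(answer_key)
    # Compare every answer column against its correct label, row by row
    matches = responses[key.index].eq(key, axis="columns")
    return matches.mean(axis=1) * 100


def simulate_random_scores(n_users, n_images, seed=0):
    """Simulate users who guess each image with a 50% chance of being right."""
    rng = np.random.default_rng(seed)
    n_correct = rng.binomial(n=n_images, p=0.5, size=n_users)
    return n_correct * 100 / n_images


# Toy responses (hypothetical): two respondents, two images
toy_responses = pd.DataFrame({
    "Real": ["Real", "AI-generated"],
    "AI-generated": ["AI-generated", "AI-generated"],
})
toy_key = {"Real": "Real", "AI-generated": "AI-generated"}

user_correct_percentages = score_users(toy_responses, toy_key)
simulated_scores = simulate_random_scores(n_users=152, n_images=10)
print(user_correct_percentages.tolist())
print(simulated_scores[:5])
```

With only ten images, every score is a multiple of 10%, which is why both distributions look lumpy rather than smooth.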

A Peek at the Histogram

When we visualize the data using a histogram, a pattern emerges. A peak is visible in the 20-30% range, meaning a chunk of our participants correctly identified roughly a quarter of the images. This suggests distinguishing between real and AI-generated images is no walk in the park. It also aligns with the comments participants left on LinkedIn, where Ryan recruited them.

"I am not a good AI detector" - JT

"Even though it was just a survey, I still feel like I failed. 😂" - Sean A.

"The funny part about the survey is that I am not even able to spot any difference amongst the images." - Adewale A.

The Box Plot

For a more summarized view, the box plot shown above helps us see where participants fell relative to the median and which quartile they landed in. The median score (the dark blue line inside the box) lies below 40%, meaning half of our participants scored at or below the median and the other half at or above it. The box, representing the interquartile range, stretches from approximately 20% to 60%. This tells us that the middle 50% of the scores (from the 25th to the 75th percentile) fell between 20% and 60%.

Interestingly, the box plot's whiskers (indicating the data's general spread) show that most users scored between 10% and 70%. The few scores outside this range, on the higher end, appear as outliers. The highest score was 90%. This person may have a keen eye, but it could be simple randomness. Here's a visualization comparing the distribution of actual scores (in blue) against simulated scores (in red), representing random guessing.

The Density Distribution With Random Guessing

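    import matplotlib.pyplot as plt
    import seaborn as sns
    # simulated_scores and user_correct_percentages are assumed to be defined earlier (not shown in this snippet)
    # Create a combined plot of actual scores vs. simulated scores
    plt.figure(figsize=(12, 7))
    # Plot density of simulated scores (fill=True replaces the deprecated shade=True)
    sns.kdeplot(simulated_scores, fill=True, label="Simulated (Random Guessing)", color="#e74c3c")
    # Plot density of actual scores
    sns.kdeplot(user_correct_percentages, fill=True, label="Actual Scores", color="#3498db")
    plt.title("Distribution of Actual vs. Simulated Scores")
    plt.xlabel("Percentage Correct")
    plt.ylabel("Density")
    # Show the legend so the two curves can be told apart
    plt.legend()
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()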

This suggests the highest score could have been achieved by chance. To settle this, we would need to increase the number of images so that a high score is harder to achieve by guessing randomly. If you get 90% correct with 20 images, it is far more likely to be skill than chance.
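
The arithmetic behind that claim can be sketched with the binomial distribution, treating each guess as a coin flip. The function name `prob_at_least` is mine, and the group size of 152 is an assumption: it is the participant count consistent with the per-image percentages at the end of the post, not a figure stated in the survey results.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more correct out of n when each guess is right with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_10 = prob_at_least(9, 10)   # 90% or better on 10 images
p_20 = prob_at_least(18, 20)  # 90% or better on 20 images

# Assumed group size (see the note above); chance that at least one
# pure guesser in a group this large reaches 90% or better
n_participants = 152
p_someone_10 = 1 - (1 - p_10) ** n_participants
p_someone_20 = 1 - (1 - p_20) ** n_participants

print(f"One guesser, 10 images: {p_10:.4%}")
print(f"One guesser, 20 images: {p_20:.4%}")
print(f"Any of {n_participants} guessers, 10 images: {p_someone_10:.1%}")
print(f"Any of {n_participants} guessers, 20 images: {p_someone_20:.1%}")
```

A single guesser reaches 90% on 10 images about 1.1% of the time, so in a group of this size it is more likely than not that someone gets there by luck. With 20 images the per-person chance drops to about 0.02%, and the chance that anyone in the group manages it falls to roughly 3%.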

Inference from the Distribution

What does this distribution tell us? For one, it underscores the complexity and effectiveness of modern AI in generating images. Many participants gravitated towards the lower score ranges, emphasizing the difficulty. Yet, some participants seemed to have a sharp eye, scoring remarkably well and thus pushing the boundaries of the distribution.

As AI continues to evolve, its creations will become even harder to distinguish from reality than they are today. Let's look at the images from the most distinguishable to the least. There were 5 real and 5 AI-generated images. To add variety, 2 of the 5 in each category were cats.

Image #4 | Real | 54.61% Correct

This image, along with #10, is a professional headshot.

Image #1 | Real | 48.68% Correct

This cat made two appearances in the survey.

Image #6 | AI | 47.37% Correct

As with the other AI-generated humans, the skin imperfections make this one difficult. I struggled with it but think the eye socket on their left side is strange.

Image #7 | Real | 40.79% Correct

This is a professional who was featured on LinkedIn. It could be the professional aspects of this image that make it difficult to tell if it's real or not.

Image #3 | Real | 36.18% Correct

This is the same cat as in image #1, but it scored much lower.

Image #8 | AI | 36.18% Correct

AI can make mistakes; in this case, the legs seem off.

Image #2 | AI | 28.29% Correct

As in the prior image, the legs seem strange, though less noticeably than in #8.

Image #9 | AI | 25.66% Correct

In this image, AI made a few minor mistakes. The hair on their right cheek seems strange and the eyebrow on their left looks incomplete. However, like the previous AI-generated person, it shows a level of imperfection we would expect.

Image #5 | AI | 18.42% Correct

Like the previous image, there are small giveaways. The person is looking off to the side, but their hair seems to come all the way to their glasses on their left side. If you look closely, the hair actually appears over the glasses.

Image #10 | Real | 17.76% Correct

The same professional shot this image along with #1. However, this one scored much lower.


Thank you to all who participated and helped bring this experiment to life. Ryan's experiment offered a compelling glimpse into the challenges presented by AI advancements. As technology continues to push boundaries, identifying what is real will become more difficult. We may need new tools to navigate the digital realm with a discerning eye, since our own eyes may be unable to tell the difference.

#AI #ChatGPT