Damir Kahvedžić, Author at Prosearch https://www.prosearch.com/author/damir/ Enterprise eDiscovery and legal data analytics solutions. Thu, 28 Aug 2025 18:31:46 +0000 en-US hourly 1 https://wordpress.org/?v=6.8.3 Face to Face with GenAI Hype https://www.prosearch.com/face-to-face-with-genai-hype/?utm_source=rss&utm_medium=rss&utm_campaign=face-to-face-with-genai-hype Tue, 13 May 2025 17:32:16 +0000 https://www.prosearch.com/?p=7099 Author: Damir Kahvedžić, Ph.D. My wife came to me a couple of weeks ago needing to write a last-minute presentation for her work. She asked “Can we use CoPilot? I hear it can generate a presentation for me.” “Sure” I said “Your company has licensed it on your laptop. Enter a prompt [...]

The post Face to Face with GenAI Hype appeared first on Prosearch.


Author: Damir Kahvedžić, Ph.D.

My wife came to me a couple of weeks ago needing to write a last-minute presentation for her work. She asked, “Can we use Copilot? I hear it can generate a presentation for me.” “Sure,” I said. “Your company has licensed it on your laptop. Enter a prompt and it will make slides in PowerPoint.” “Perfect!” she said. “Let’s do that.” We ran to the laptop, typed in our prompts, and Copilot dutifully created the presentation and its content. It was immediate and impressive. It had the correct length, perfect formatting, a beginning, middle and end, and no typos. It was perfect.

My wife hated it. The content was generic and bland, not at all what she expected. “Where is the detail? It’s missing statistics!” she said. The reality hit her: Copilot had helped her get started, but she, as the subject matter expert, had to finish it. She had run head-first into Amara’s Law[1].

WE TEND TO OVERESTIMATE TECHNOLOGY IN THE SHORT TERM BUT UNDERESTIMATE ITS EFFECT IN THE LONG TERM

You can’t really blame her excitement though. This is where we were a few years ago. We were promised that we could type in our criteria and LLMs would extract ALL privileged documents for us. It would speed up or even end review, lower costs, upend our workflows. But survey after survey has shown that even though the excitement for AI is palpable and teams want to use it, there are a lot of AI adoption challenges ahead. Costs, data governance, hallucinations! Not to mention that we are still waiting for that killer app. How can we foster adoption? How can we bridge the gap between the expectations of end users, like my wife, and the reality that AI service providers like ProSearch can deliver? Where in the adoption process are we?

Gartner Hype Cycle

The technology adoption cycle is a well-known process. Gartner calls it the Hype Cycle. It is divided into distinct phases that most disruptive technologies must travel through as a rite of passage. It starts with some sort of innovation trigger: a new vendor, a news report, an unveiling. Think of Steve Jobs holding up the original iPhone. It is exciting, new, promising the world. It’s the whole internet in your hand! Our excitement grows and reaches the Peak of Inflated Expectations.

We get the iPhone. Once it’s in our hands, though, reality sets in and we realise that the technology has drawbacks. It is expensive, it has limitations (the original iPhone didn’t have the App Store[2]), and that internet we were promised? The original iPhone could not browse the Flash websites we were all using in the 2000s[3]. Our expectations wane. Gradually, though, these concerns are addressed: costs are lowered, the App Store is released, we adjust our habits and stop going to Flash websites. We eventually incorporate the technology into our day-to-day lives and enter the Plateau of Productivity.

Some technologies don’t make it through the phases. Apple’s other device, the Vision Pro[4], similarly garnered lots of excitement and is an engineering marvel, but it could not be integrated into our daily lives as intended. The technology has slowly withered away.

Where is AI in this graph? Is it successfully traversing the stages or will it be another Vision Pro?

Gartner AI Hype Cycle

According to Gartner, LLMs are placed past the Peak of Inflated Expectations[5]. They have been available to the public long enough for users to understand their limitations. Hallucinations are the famous example, but there are also issues with cost of ownership and data governance. These roadblocks of reality are dampening the excitement generated when LLMs were first unveiled a few years ago.

On the other hand, Gartner places Generative AI applications on the ascendancy. They are still generating excitement in the field. These applications, such as chatbots, document authoring software and translation software, utilise LLMs in a very narrow way for a specific use case. In so doing, these implementations can focus on specific issues introduced by LLMs and minimise their impact. As a result, there is still an expectation that the implementation of LLM software will solve for any inherent LLM problems.

However, by definition, Generative AI applications are based on LLMs. In order to know what features Gen AI applications can implement, we need to know the capabilities of the LLMs they use.

The Growth of LLMs

LLM releases have come thick and fast. An informal survey shows that 24 major Large Language Models were released in 2024, covering multiple geographies, vendors and use cases. OpenAI alone released GPT-4o, GPT-4o mini, o1, o3 and o1-pro in 2024. With the deluge of LLM releases, patterns have started to emerge:

  • Reasoning LLMs (the GPT o1 and o3 models) have become popular. These LLMs mimic human reasoning. Rather than being a black box returning a potentially hallucinated answer, they explain how they came to a conclusion and illustrate any logical assumptions they have made. In so doing they provide more trustworthy and transparent behaviour.
  • Open Models (the likes of DeepSeek and Llama) allow you to download the weights and run the LLMs yourself, should you want to. They attempt to illustrate how decisions are made and therefore allow users to interrogate their construction.

Above all though, LLMs are getting smarter and smaller. The ‘mini’ LLMs, that is, those trained on mere billions of parameters, are achieving the same results as full-scale LLMs from a few years ago, such as GPT-3.5[6]. Training is therefore becoming more efficient. As a result, smaller models tend to be trained more quickly, consume less power and, above all, are cheaper than their full-scale alternatives.

The Cost of AI

This is the crux of AI adoption in the legal industry: cost. As much as legal departments want to use AI, it is cost that is proving the biggest roadblock. Recent surveys[7] have shown that most legal departments want to use Generative AI but are hampered by budgetary constraints and cost of ownership. The return on investment for AI is difficult to determine and difficult to justify when compared to cheaper non-AI alternatives.

Traditional processes such as CAL and TAR are well understood and provide results just as good as current Gen AI models. Keyword-based searching mechanisms are free and do not incur any per-document cost at all. Until the question of the return on investment for Gen AI software is answered, adoption of such software will remain low and limited to niche edge matters.

The Future

My wife did end up using the AI-generated presentation. She may have spent a little more time than she wanted adding more information, but it was ultimately a very useful and fruitful experience. She and I learned a lot. Looking at the rest of the Gartner AI Hype Cycle, there are several initiatives and technologies sprouting up on the back of the LLMs aimed at addressing the challenges that my wife faced. We already discussed hallucinations being addressed by reasoning models and the issue of cost addressed by mini models, but there are quite a few more technologies worth keeping an eye on for the future.

  • Domain-Specific Models: models built on the knowledge of a specific field are being developed for just the scenario that my wife encountered. Her topic was biochemistry and European road traffic laws. Copilot simply did not know any specifics about this field. An LLM based on the PubMed library or some other scientific corpus would have produced much more accurate information.
  • Agentic AI: this technology is so new it was not part of the AI Hype Cycle in 2024. Agentic AIs are LLMs that have agency. Not only will they return an answer to a prompt, but they may generate some test data and run simulations to gather the results.

The Gen AI field is as exciting and vibrant as ever. Though I do not think it will suffer the fate of the Vision Pro, it does have adoption challenges it will have to meet before it is fully adopted. This is a totally normal and natural process. These difficult questions must be asked in order for the technology to be fully integrated into our workflows. I have no doubt that in two or three years we will stop talking about AI software and just refer to it as software. We will have moved out of the Trough of Disillusionment and entered the Plateau of Productivity.

[1] What is Amara’s Law and why it’s more important now than ever – The Virtulab

[2] A Look Back at the Original Apple App Store – Apple Gazette

[3] Does the iPhone Support Flash?: EveryiPhone.com

[4] Apple Vision Pro – Apple

[5] Gartner AI Hype Cycle 2024

[6] Stanford

[7] EY Law General Counsel Study, April 2025

Supercharging Slack Data Handling with Workstream https://www.prosearch.com/supercharging-slack-data-handling-with-workstream/?utm_source=rss&utm_medium=rss&utm_campaign=supercharging-slack-data-handling-with-workstream Thu, 30 Jan 2025 18:26:14 +0000 https://www.prosearch.com/?p=6960 Slack is an extremely popular chat messaging service that is often found in ProSearch’s matters. The format in which the data is exported from the service is extremely challenging to handle from an eDiscovery perspective, specifically how to handle attachments. This article illustrates ProSearch’s home grown solution, WorkStream, and how the latest update has addressed [...]

The post Supercharging Slack Data Handling with Workstream appeared first on Prosearch.


Slack is an extremely popular chat messaging service that often appears in ProSearch’s matters. The format in which data is exported from the service is extremely challenging to handle from an eDiscovery perspective, specifically with regard to attachments. This article illustrates ProSearch’s home-grown solution, WorkStream, and how its latest update has addressed key client concerns and drastically cut down the time needed to process Slack data fully.

The Problem

Slack has very limited export options in its native interface. Admins can apply a date range and, if the tenant has a Slack Enterprise license, target a specific Slack user. But that’s it. Slack then gathers all chats across all channels for that person and exports them as a set of JSON files.

Files shared in chats are not included in the Slack delivery. They appear only as URL links within the JSONs, which keeps the export deceptively small. During processing, we must follow each link and download a copy of the file. It is not unusual for a delivery to take days to complete and balloon in size ten-fold due to all the downloaded attachments.
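To give a flavour of what processing software must do, here is a minimal Python sketch that walks a Slack export and gathers the attachment links left behind in the message JSONs. The "files" and "url_private" field names reflect the common Slack export layout, but any real pipeline should verify them against the delivery in hand:

```python
import json
from pathlib import Path

def collect_file_links(export_dir):
    """Walk a Slack export and gather attachment URLs from the message JSONs.

    Slack exports messages as per-day JSON arrays; shared files appear only
    as metadata carrying a "url_private" link, not as the file content itself.
    """
    links = []
    for json_file in Path(export_dir).rglob("*.json"):
        messages = json.loads(json_file.read_text(encoding="utf-8"))
        if not isinstance(messages, list):
            continue  # skip any non-message manifest files
        for msg in messages:
            for f in msg.get("files", []):
                # prefer the direct-download variant when present
                url = f.get("url_private_download") or f.get("url_private")
                if url:
                    links.append((f.get("id"), f.get("name"), url))
    return links
```

Deferring the actual downloads then becomes a matter of holding this link list until the review scope is settled, rather than fetching every file up front.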

Typical discoveries may involve multiple users, many of whom may have been part of the same channels or conversations. Data for each of those users must be collected separately, so the same messages, attachments and data may be found multiple times across all custodians. Software would have to process all chats and all attachments before any sort of deduplication can take place.

This leaves clients with a problem. Can they get visibility into their datasets before committing to processing a potentially large number of files? Can they estimate how long the process will take? And finally, can they tailor the processing of their Slack deliveries so that work is not needlessly duplicated across multiple deliveries?

WorkStream

ProSearch has developed WorkStream, a software platform that natively supports the handling of Slack datasets. It accounts for these issues and ensures that data is processed accurately and quickly. Attachment download has been identified as the main time sink in WorkStream processing. The latest WorkStream release adds the ability to defer the downloading of attachments until later in the workflow. This deferment makes WorkStream very flexible in how it handles data. It makes possible the following scenarios:

  1. Clients can abstain from downloading attachments completely if they are only interested in messages.
  2. Clients can defer downloading of attachments until all custodians’ datasets have been processed and deduplicated. Only once a deduplicated Review Set has been agreed upon do they download the attachments of the affected messages. This has the potential to save massive amounts of work and bandwidth.
  3. Clients who are interested in only specific types of attachments may defer attachment processing until an initial report on the attachment types present in the dataset has been generated. They can then choose which file types to proceed with and which to leave behind.
  4. Security-conscious clients may defer the downloading of attachments until the relevant IP addresses and URLs have been whitelisted.

This ensures that time is not wasted processing and downloading duplicate or unnecessary attachments. Users can focus on seeing the chats and conversations quickly and ultimately build a workflow tailored to their needs. This is especially noticeable when a large number of custodians share the same channels and conversations. The more custodians there are, the more crucial this system becomes.
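As a rough illustration of the deduplication step that makes deferred downloads worthwhile, the following Python sketch merges per-custodian message lists before any attachment is fetched. It assumes each message carries a "channel" and a Slack "ts" value (which is unique within a channel); this is a simplification, not a description of WorkStream’s internals:

```python
def deduplicate_messages(custodian_exports):
    """Merge per-custodian Slack message lists, keeping one copy per message.

    A Slack message is uniquely identified within a channel by its "ts"
    value, so (channel, ts) serves as a dedup key across custodians who
    shared the same conversations.
    """
    seen = {}
    for custodian, messages in custodian_exports.items():
        for msg in messages:
            key = (msg["channel"], msg["ts"])
            if key not in seen:
                # first sighting: keep the message, note who held it
                seen[key] = dict(msg, custodians=[custodian])
            else:
                # duplicate: just record the additional custodian
                seen[key]["custodians"].append(custodian)
    return list(seen.values())
```

Only the messages that survive this merge (and fall within the agreed review set) would then have their attachment links followed.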

For more information contact us here.

How Does Copilot Work? https://www.prosearch.com/how-does-copilot-work/?utm_source=rss&utm_medium=rss&utm_campaign=how-does-copilot-work Wed, 27 Nov 2024 17:18:59 +0000 https://www.prosearch.com/?p=5870 Microsoft 365 Copilot is an advanced AI-driven assistant integrated within most aspects of the Microsoft infrastructure. It is designed to enhance user productivity and streamline workflows whether it be in Word, Outlook or PowerPoint. It is the latest intelligent virtual assistant that will introduce the power of GenAI and literally put a ChatGPT like [...]

The post How Does Copilot Work? appeared first on Prosearch.


Microsoft 365 Copilot is an advanced AI-driven assistant integrated within most aspects of the Microsoft infrastructure. It is designed to enhance user productivity and streamline workflows, whether in Word, Outlook or PowerPoint. It is the latest intelligent virtual assistant that introduces the power of GenAI and literally puts a ChatGPT-like prompt in your Microsoft Office documents. It is a key service that is being heavily promoted by Microsoft.

But how can we discover Copilot interactions? What do they look like, and are there any key differences we need to know about? This article discusses how Copilot data is preserved by Microsoft Purview, the data format, and best practices for collecting it.

This work has been done with the help of ProSearch’s Microsoft 365 Advisory Services team, who created the datasets needed for testing.

How does Copilot work?

Copilot is an additional service available in Microsoft 365. Once licensed, it is represented in every Microsoft 365 application (Word, Excel, PowerPoint, Teams and Outlook) as a contextual menu beside the cursor or as part of the application’s Ribbon interface. Clicking the Copilot icon opens a pop-up window or a document sidebar. Both contain an input box allowing prompts to be made in the same way you would in ChatGPT. In Teams, the prompt window is literally a conversation with the Copilot bot.

Copilot in Word, as seen as a Contextual Prompt and side bar

Both prompt entry points work in the same way. They allow the input of AI prompts and supporting files, with the results being written into the document or made as a reply in a conversation in the sidebar. Answers may come from the general LLM or fetch information from the local Microsoft 365 tenant. A prompt to Copilot in Word like “Summarise my last email conversation” will result in Copilot fetching that email from Outlook and summarising it appropriately. A prompt like “What is WorkStream” in PowerPoint will result in Copilot fetching that data from, and linking to, documents found in SharePoint.

Copilot in Word pulling in documents from ProSearch Way to answer about WorkStream

A conversation with Copilot, therefore, may have more than just text associated with it. It may have associated linked files or URLs. Prompts may generate entire PowerPoint presentations, images or other content. There is much more to this, and interested parties can read more on the Microsoft Copilot site.


Exports

Copilot interactions and outputs are preserved and can be found via Purview eDiscovery Standard or Premium. Copilot interactions are stored in the mailbox of the user that interacted with the system, regardless of whether the interaction was made in Outlook, Excel, PowerPoint or Word. As such, collecting the data follows a very similar process to that of Teams or email collections. The only thing that changes is the datatype category one needs to select when creating the collection.

To direct a collection to extract only Copilot data, simply select Type equals any of “Copilot interactions” during the collection process. Conversely, if Copilot interactions are out of scope for the collection, then the Type condition can be inverted with the NOT operator.

Format

Even though Copilot has its own collections category (Copilot interactions), the underlying data looks very similar to Teams data. Purview treats interactions as conversations between the user and the AI system. Much like Teams data, if the tenant has an E3 discovery license then each interaction will be exported as a separate MSG and bundled in a PST mailbox. If the tenant has an E5 license, then the interactions can be exported as HTML transcripts. At first glance, Copilot data looks like a regular Teams dataset; however, tests have shown that there are important differences between the two. The following is a small, non-exhaustive list of differences that we have identified so far.

MSG Format

If exports are done in a PST, then Copilot interactions are stored as individual MSGs in the TeamsMessageData PST folder. This is the same location in which Teams chats are stored. However, the Copilot MSGs are not in the same format as their Teams counterparts. Unlike Teams data, Copilot responses are stored as HTML attachments within the MSGs, not in the email body as is the case for normal Teams data. The message body of a Copilot MSG is largely blank.

An example of this situation is shown below. The images show a user prompt and a Copilot response, each stored as an individual email. As can be seen, the Copilot response contains an attachment, Microsoft 365 chat.html. The actual Copilot result is found in this file and is illustrated in the final image.

HTML Format

In E5 environments, Copilot data is exported as HTML files with an associated CSV of metadata that describes the data in the delivery. On the face of it, the Copilot HTMLs look exactly like Teams transcripts, with interactions cascading linearly in an analogous way to how they were generated. Below is the HTML transcript of the same conversation as in the section above. Note that the HTML supports Unicode characters.

The underlying HTML format of the Copilot messages is, however, different to that of their Teams counterparts. Key HTML blocks, like the hidden CDATA tags that provide vital metadata, are not present for Copilot messages but are found for user-created messages.

Another difference is how the Copilot HTMLs are described in the accompanying CSV. Copilot HTMLs have the Item_class value IPM.SkypeTeams.Message.Copilot.<appName>, where appName is the application in which the interaction happened. Examples include IPM.SkypeTeams.Message.Copilot.PowerPoint, IPM.SkypeTeams.Message.Copilot.Word and, in the case of Teams, IPM.SkypeTeams.Message.Copilot.BizChat. A comprehensive list can be found on the Microsoft site.
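A small Python sketch shows how a processing script might tally Copilot interactions per host application from the delivery CSV. The column name "Item_class" and the class-string pattern follow the description above; the column headers of a real export should be confirmed before relying on this:

```python
import csv
import io
import re

# pattern for Copilot item classes, e.g. IPM.SkypeTeams.Message.Copilot.Word
COPILOT_CLASS = re.compile(r"^IPM\.SkypeTeams\.Message\.Copilot\.(\w+)$")

def copilot_apps(csv_text):
    """Read a Purview export's metadata CSV and tally Copilot host apps.

    The "Item_class" column name mirrors the field discussed in the
    article; adjust it to match the CSV actually delivered.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        m = COPILOT_CLASS.match(row.get("Item_class", ""))
        if m:
            app = m.group(1)  # PowerPoint, Word, BizChat, ...
            counts[app] = counts.get(app, 0) + 1
    return counts
```

A quick tally like this gives an early view of which applications generated the interactions before any deeper processing begins.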

Response Times

The Copilot participant in HTML conversations is named according to which Copilot application was used. A participant will be called Copilot in PowerPoint, for example, if Copilot was used in PowerPoint. Conversations with Copilot in Teams will have Microsoft 365 Chat as a participant, and so on.

Copilot response times are often immediate and may have the EXACT same timestamp as the prompt. Processing tools that parse and convert the data to other formats such as RSMF, as is done by ProSearch, may be presented with an ordering problem. One may see an RSMF with a Copilot answer appearing before the prompt that produced it, because both carry the same date and time. Care must be taken to ensure that the second and millisecond values are taken into account to preserve the order of the messages.
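One simple safeguard is to break timestamp ties by role, so a prompt always sorts ahead of its response. The sketch below assumes illustrative "timestamp" and "role" fields rather than any fixed export schema:

```python
from datetime import datetime

def order_interactions(messages):
    """Sort Copilot interactions, keeping a prompt ahead of its response
    even when both carry an identical timestamp.

    Assumes each message dict has an ISO-8601 "timestamp" (with
    milliseconds where available) and a "role" of "user" or "copilot";
    the field names are illustrative, not a fixed export schema.
    """
    role_rank = {"user": 0, "copilot": 1}  # tie-break: prompt before answer
    return sorted(
        messages,
        key=lambda m: (datetime.fromisoformat(m["timestamp"]),
                       role_rank.get(m["role"], 2)),
    )
```

Because Python's sort is stable, messages with distinct timestamps keep their chronological order and only true ties fall back to the role rank.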


Conclusions

Copilot is a flagship Microsoft service being deployed across almost all areas of the Microsoft infrastructure. The popularity of GenAI and LLMs means it will inevitably be used day to day, and as such it will be found in eDiscovery deliveries in the near future. The ability to handle that data accurately and efficiently is an important challenge for the eDiscovery community.

Tests show that Copilot data, although it has the appearance of a Teams dataset, has very important distinctions that set it apart from the Teams data format. It is recommended that clients collect Copilot data separately from their other collections to allow the processing team to treat it accordingly. Interested parties should contact the M365 Advisory Services team for more information.

Dublin Autumn Conference Round Up https://www.prosearch.com/dublin-autumn-conference-round-up/?utm_source=rss&utm_medium=rss&utm_campaign=dublin-autumn-conference-round-up Fri, 15 Nov 2024 18:46:38 +0000 https://www.prosearch.com/?p=5850 Last month, the Dublin team participated in three notable conferences which discussed advancements in eDiscovery, GenAI and cybersecurity. These meet ups provided a fascinating insight in how other clients and vendors are dealing with new technology and handling the complexities of our field. This article reviews the key takeaways and highlights from each of [...]

The post Dublin Autumn Conference Round Up appeared first on Prosearch.


Last month, the Dublin team participated in three notable conferences discussing advancements in eDiscovery, GenAI and cybersecurity. These meetups provided fascinating insight into how other clients and vendors are dealing with new technology and handling the complexities of our field. This article reviews the key takeaways and highlights from each of them.

Women in eDiscovery: Autumn Meetup

The Dublin chapter of Women in eDiscovery held its second meetup of the year. It was organised by our new Dublin Engagement Manager, Rachel McAdams, and featured Katherine Gillespie (Forensic Accountant, KPMG), Magdalena Wojnowska (Senior Operational and Project Manager, A&L Goodbody), and Clare Longworth (Community Enablement Architect, Relativity). The discussion offered a blend of perspectives on current challenges in eDiscovery from investigative, regulatory, project management, and technological viewpoints.

A highlight of the talk was Katherine’s discussion of investigating alleged malpractice and fraudulent activity. She spoke about how data from various sources often needs to be woven together to tell the story of potential business malpractice. Once data is successfully integrated from sources such as bookkeeping database systems, financial records and mobile chat data, light work can be made of the investigation.

Technology plays a crucial role in addressing these challenges. The panel highlighted case metrics, clustering, and potential use cases for Relativity aiR. A significant pain point, however, was the inability to effectively filter and drill into chat datasets. It’s a subject that we have reviewed before in our CIRCLE meetings and one that WorkStream has been specifically designed to address.


Johnson Hanna: Future of eDiscovery

Johnson Hanna held an informal event on the future of eDiscovery at their premises in Dublin on 23rd October. It featured Eugene O'Neill (Executive Vice President, Reveal), Andrew Harbison (previously Head of Legal Technology at Grant Thornton Dublin, now a freelance expert witness) and Tom O'Halloran (Johnson Hanna). The discussion centred on GenAI, how it can be used, and whether it truly is revolutionary for our field.

Andrew Harbison is an author of several academic papers, including "Unbiased Validation of Technology-Assisted Review for eDiscovery", in which he evaluated TAR/CAL methodologies, and he is positioning himself as an expert in the topic. His perspective centred on the pragmatic application of GenAI and addressing any pitfalls or shortcomings in the technology. Eugene, conversely, was representing Reveal and was more enthusiastic about GenAI and what Reveal is implementing specifically. The two provided contrasting opinions, with Andy advocating a more cautious implementation of the technology. Naive applications of GenAI have already caused embarrassing mistakes.

GenAI, however, is seen as key to addressing the fundamental problem of too much data. It is considered a 'life raft' that will allow teams to keep up with technology. It was agreed that it can be used reliably for translations, summaries or looking up concepts, but that it is not ready to be used for review classification or privilege determination. TAR was approved for use because it was possible to explain and validate the algorithms in court. Currently, it is not possible to do this for GenAI. OpenAI has said that it doesn't need to know exactly how ChatGPT works in order to release its product. Such a black-box experience is not suitable for litigation, and legal tech LLM providers need to be more transparent about the way their systems work. Guardrails need to be provided to mitigate any errors. Human intervention is still paramount.

PwC: Global Cyber Security Summit

PwC held a Global Cyber Security Summit at the end of October. It was a 24-hour virtual event designed for the C-suite, covering major trends in cybersecurity and emerging technologies. The byline of the conference was "Bridging the Gap", a refrain that may be familiar to our recent CIRCLE attendees.

The talk "Deploying GenAI for cybersecurity programs", in particular, offered insights into how PwC and their clients evaluated and deployed GenAI. In their Digital Insights Survey, they revealed that 78% of respondents had increased investment in GenAI in the last year, but that 40% of them lack trust in it. So, to bridge the gap, so to speak, the participants discussed strategies for onboarding GenAI technology effectively. Even though ProSearch is in a field adjacent to cybersecurity, we are also exploring ways to integrate GenAI, and the lessons discussed are just as applicable.

First, they advise that the organisation recognise areas where high-value individuals do manual, repetitive analysis that could benefit from automation. Second, identify personas and roles, and recognise which tasks in their descriptions can be helped by GenAI. Together, these steps may help focus GenAI deployment.

Alternatively, some clients have a 'use case' driven approach whereby they list ten things GenAI can do and then test them individually in the organisation. Others still have stood up a GenAI platform and allowed the wider organisation to use it as a playground to test and R&D use cases.

The general consensus, however, is that GenAI can make life easier but cannot replace some manual interventions. Pitfalls include lack of understanding and naive usage. They recommend deep testing to see how it responds. It is exactly the same warning that Andy Harbison gave in the Johnson Hanna talk and is a more sober assessment of GenAI. After all, the perfect blend for success is People, Technology and Process. No one part can survive without the others.

Productivity vs Privacy: The Debate on Microsoft’s Total Recall https://www.prosearch.com/productivity-vs-privacy-the-debate-on-microsofts-total-recall/?utm_source=rss&utm_medium=rss&utm_campaign=productivity-vs-privacy-the-debate-on-microsofts-total-recall Thu, 30 May 2024 16:27:05 +0000 https://www.prosearch.com/?p=5338 Microsoft has unveiled its latest AI innovation. Copilot+PC is a PC-based generative AI solution that will allow the Windows user to take advantage of AI locally without having to access the cloud. You'll be able to ask  natural language questions, generate AI art and much more. Microsoft calls the Copilot+PC the fastest, most intelligent Windows PC [...]

The post Productivity vs Privacy: The Debate on Microsoft’s Total Recall appeared first on Prosearch.


Microsoft has unveiled its latest AI innovation. Copilot+ PC is a PC-based generative AI solution that will allow Windows users to take advantage of AI locally without having to access the cloud. You’ll be able to ask natural language questions, generate AI art and much more. Microsoft calls the Copilot+ PC the fastest, most intelligent Windows PC ever built.

Instant Recall

As part of this system, it introduced the Recall program, an AI-powered tool that Microsoft claims will solve the problem of users forgetting what they did on their PCs. The system will allow them to search for anything they have seen on their screens, even if they did not bookmark or save the content.

To do this, the Recall system will automatically log everything a user does, including app activity, browsing history and more. It will make transcriptions of live meetings and videos, take screenshots of the user’s display every five seconds, and recognize the objects and text in the images. All this information will be aggregated by a local Large Language Model (LLM) and made searchable to the user for complete recall.

Your Recall Timeline, Visualized


If you’re thinking this sounds like spyware, then you are not alone. The reaction has been swift, with the BBC calling it a ‘privacy nightmare’, cybersecurity researchers being appalled, and comparisons being made to that creepy eye gizmo episode in the suspense series Black Mirror. The UK has already opened an investigation into the system, citing confidentiality and consent worries. If somebody gained access to your account, they could very accurately and easily trace back your activity timeline and replay the exact things you were doing, searching or viewing on your PC, all with related screenshots.

Forensics

Looking at it objectively, this is nothing new. Windows has always recorded user activity. One group of people who may like this is digital forensics investigators, such as those on our forensic team. In deep-dive investigations, they are often asked to create a timeline of user activity to uncover what a user did. They look in a number of locations to create these timelines, and the process is technically challenging. Recall may just make this easier. PC users may not be aware of how much activity is already recorded.

The Windows registry, for example, is a database of options and configurations at the heart of the Windows OS. One artefact, called a ShellBag, stores the name, path, window coordinates, and time of folders that were opened on the system. The record persists even if the folder is deleted.

There are also numerous “Most Recently Used” lists that record the files that a user opened, software that they have run, websites they have visited and in which order the actions occurred.

Windows Volume Shadow Copy, also known as Volume Snapshot Service (VSS), is a Windows feature that creates backup copies of files and volumes even when they are in use. Volume Shadow Copies are made automatically and can reveal deleted files, recent changes and other information a user thought was long gone.

Recently, the Microsoft Edge browser included an option to store screenshots of your browsing history in addition to the webpage address record that it already logs. This feature is off by default.

Forensic investigators take all this information and create timelines of user behavior. The Recall system may just make it easier for them to do so.
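At its core, that timeline-building step is a merge-and-sort over heterogeneous artifact records. A minimal Python sketch, using invented sample records rather than real registry or snapshot output:

```python
from datetime import datetime

# Invented sample records standing in for parsed forensic artifacts
SOURCES = {
    "ShellBags": [("2024-03-01T09:15:00", "Folder opened: C:\\Projects\\Q1")],
    "MRU lists": [("2024-03-01T09:16:30", "File opened: budget.xlsx")],
    "Shadow copy": [("2024-02-28T23:00:00", "Snapshot retained deleted draft.docx")],
}

def build_timeline(sources):
    """Merge records from every artifact source into one chronological timeline."""
    events = [
        (datetime.fromisoformat(ts), name, description)
        for name, records in sources.items()
        for ts, description in records
    ]
    return sorted(events)  # sorts by timestamp first

for when, source, what in build_timeline(SOURCES):
    print(when.isoformat(), f"[{source}]", what)
```

The hard part of real investigations is not this merge but parsing each artifact's binary format and reconciling its timestamp semantics; Recall would hand investigators a single pre-built source instead.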

Microsoft’s Response

Microsoft, to its credit, seems to be aware of the privacy concerns and has tried to allay fears. All information is stored locally in an encrypted format; the GenAI model is based on the local dataset and does not transfer any data out to Microsoft servers. Users will also be able to turn the system off or pause it if needed. They can exclude logging of specific apps and the private browsing activity of supported web browsers. It's also still undergoing further testing, so Microsoft may change it further.

Nevertheless, it seems to be an opt-out system rather than opt-in, which, in a controversial system like this, is not desirable. Recall will record your banking details and anything else you may be doing in normal browsing modes unless you remember to turn it off.

The system comes with a significant hardware cost. Copilot+ PC systems need to be powered by Qualcomm's Snapdragon X Elite chips, which include the necessary neural processing unit (NPU), and at least 25 GB of space for Recall storage (about three months' worth of screenshots). If you really don't like Recall, just avoid PCs with this type of setup when shopping for new hardware.
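A quick back-of-the-envelope calculation, using my own assumed figures rather than anything Microsoft has published, shows why 25 GB plausibly covers about three months of captures:

```python
# Assumed figures for illustration only
capture_interval_s = 5
active_hours_per_day = 8       # assumed screen-on time per day
days = 90                      # ~3 months
budget_bytes = 25 * 1024**3    # 25 GB allocation

shots = days * active_hours_per_day * 3600 // capture_interval_s
per_shot_kb = budget_bytes / shots / 1024
print(f"{shots} screenshots -> ~{per_shot_kb:.0f} KB each")
```

That works out to roughly 50 KB per screenshot, which is comfortably achievable with modern image compression, so the three-month figure is believable for a typical working day of screen time.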

Microsoft has also confirmed that the system won’t be available on older systems. So, this is not a concern for organizations using existing Windows PCs.

The ability to scan and rewind your timeline may be a fun concept, and useful in very narrow circumstances, but it may not be worth the privacy and confidentiality questions it raises. The screenshot feature in particular is coming under great scrutiny.

Our digital forensics investigators may have a new source of information in the future, but I would not be surprised if this system is severely curtailed once it hits the mainstream users’ PCs.

ProSearchers continually watch new tech developments, analyzing the potential impact on data and risk management.

For more on the AI features in CoPilot software, read Ryan Hemmel’s post on Using AI Every Day.

The post Productivity vs Privacy: The Debate on Microsoft’s Total Recall appeared first on Prosearch.

]]>
Memeing AI https://www.prosearch.com/memeing-ai/?utm_source=rss&utm_medium=rss&utm_campaign=memeing-ai Tue, 31 Oct 2023 18:26:40 +0000 https://www.prosearch.com/?p=4995 There is a famous episode of Star Trek: The Next Generation called Darmok where Captain Picard is abducted by a captain of an alien species and brought to a barren planet that has a Predator like monster on it. The alien captain wants Picard to help him hunt the monster, while Picard just wants [...]

The post Memeing AI appeared first on Prosearch.

]]>

There is a famous episode of Star Trek: The Next Generation called Darmok, in which Captain Picard is abducted by the captain of an alien species and brought to a barren planet that has a Predator-like monster on it. The alien captain wants Picard to help him hunt the monster, while Picard just wants to make peace with his opposite. Trouble is, he can't understand what the alien captain is saying. The Universal Translator, that magic device in Star Trek that turns all alien language into English, seems to be translating the alien captain's words correctly. Phrases the alien captain says, like "Darmok and Jalad at Tenagra" and "Shaka, when the walls fell", are grammatically correct but can't be understood by Picard.

Eventually Picard figures it out. Rather than using language to communicate concepts, the aliens use metaphors and allegory to communicate meaning. In the alien’s culture Darmok and Jalad were mythical heroes who met on an island of Tenagra, fought and defeated a monster together and left as friends. Saying “Darmok and Jalad at Tenagra” makes perfect sense to the alien captain. It was a peace offering. But to Picard it was gibberish.

The episode got considerable attention because we use memes, GIFs and emoji in a similar way today; we use metaphors instead of written text. We are increasingly receiving chat data that contains these types of images. We can review this data manually, but as we look to our AI-enhanced future, we ask: can AI handle this type of data? We have already seen Gen AI summarising documents and creating helpful suggestions for our reviewers. What can it do with memes, GIFs and emoji? Can it understand their meaning, or will it be confused like Picard? And will it be able to learn?

Memes and GIFS in eDiscovery

Memes and GIFs are not new. They have been used in social media and informal chats for as long as the Internet has been alive. GIFs are widespread, available natively in Teams, Slack, WhatsApp and most other chat platforms. They are a popular shorthand for communication in a conversation and are increasingly found in eDiscovery matters. Have a look at your own Teams chats and check how many GIFs there are. I'm guessing there is at least one GIF-happy person on your team.

We will talk about emoji and AI at some other time. Lots of articles have been written on how to handle this type of data, and ProSearch's product manager Jessica Lee has discussed emojis in depth.

A meme, also known as an image macro, is an idea or behavior that spreads within a culture, often carrying symbolic meaning that represents a particular phenomenon or theme. Memes are often humorous or sarcastic and are used instead of the written word. They can even comfortably replace entire conversations. The meaning is not direct; it is implied. While you may think memes are too informal for corporate use, these images are still found in eDiscovery datasets. And let's not forget that memes seem to be the communication of choice for some tech billionaires.

There are even communities churning out memes for eDiscovery.

‘Disaster Girl’ meme

‘Mother Ignoring Kid Drowning In A Pool’ meme

Reviewing Images with AI

Collecting and reviewing Memes and GIFs is not that different to reviewing any other documents. They are simply images in .JPG or .GIF format. They are displayed and reviewed in Relativity just fine. What is significant is the effect they have on AI.

ChatGPT, Bard and ProSearch's nascent GenAI offering work on text only. Image-specific models are being developed by the major tech companies, such as Google's Vision AI, Microsoft's Azure AI Vision and Amazon Rekognition. They can identify faces, scenes and even expressions, all in an effort to classify and search for images more easily. Reverse image search from Google and TinEye has been around for years. ProSearch has developed PrivacySuite, an AI system that classifies Personally Identifiable Information and in part has a computer vision component to recognise ID cards.

Tests on these systems with memes have had mixed results. The descriptions are predictably robotic and suffer from 'hallucinations', much like their chat counterparts. Below is the 'Disaster Girl' meme as described by one of the larger computer vision AI models, Astica. The description is robotic, surprisingly accurate in some parts but completely inaccurate in others. It also completely misses the point of what this image is. Informal tests by others have also concluded that GPT is impressive but still has far to go.

https://www.astica.org/vision/describe/

More specific work on memes has been sparse. Facebook had a project called Rosetta that analysed more than a billion memes and GIFs posted on its social network. The project appears to have only been an advanced OCR system that identifies text for the purpose of screening hate speech and other questionable material. That was in 2018, an eon ago in terms of AI development.

More recently, a paper by Priyadarshini combined text with emotion identification, and Sharma described MEMEX, a proof-of-concept system that is trained to recognise the meaning of memes. The latter processes the image and associated documentation (such as the surrounding chat conversation) to generate meaning and context. Both studies were much more limited than Facebook's work but point to a new area of AI research: multimodal AI. This is a type of artificial intelligence that can process and understand information from multiple sources, such as text, images, audio and video; exactly what memes are. This should allow AI systems to make more accurate and informed decisions than systems that can only process information from a single source.
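To make the multimodal idea concrete, here is a hypothetical sketch of how such a pipeline might assemble the different signals (text found in the image, detected scene objects, and the surrounding chat) into a single model prompt. Every function name is invented and the model call is stubbed out; it illustrates the MEMEX-style approach, not any specific product:

```python
def describe_meme(ocr_text, scene_labels, chat_context,
                  ask_llm=lambda prompt: "(model reply)"):
    """Combine every modality into one prompt, then ask a language model (stubbed)."""
    prompt = "\n".join([
        "You are interpreting a meme shared in a chat.",
        f"Text found in the image: {ocr_text!r}",
        f"Objects/scene detected: {', '.join(scene_labels)}",
        "Surrounding conversation:",
        *chat_context,
        "What is the intended meaning, including any sarcasm or humour?",
    ])
    return ask_llm(prompt)

reply = describe_meme(
    ocr_text="This is fine",
    scene_labels=["dog", "fire", "coffee mug"],
    chat_context=["A: deployment failed again", "B: [meme]"],
)
print(reply)
```

The design point is that no single modality is enough: the OCR text, the scene and the conversation each carry part of the joke, and only their combination gives the model a chance at the intended meaning.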

ChatGPT-4 is multimodal. A preview of its image processing is available through the Bing Chat bot. Below is Bing's description of the second eDiscovery meme. Although it gets some details wrong, it does recognise it as a joke. This feature should be made available soon through the CoPilot service. Informal tests have shown that it has promise identifying humour. It's not perfect, but it is improving.

Summary

With the Gen AI buzz still in high fervor, it may be tempting to think that AI understanding memes and GIFs is only a matter of time. However, these data types require understanding multiple facets of the data: the context, the surrounding conversation, the text in the image, as well as the scene or movie from which the meme originates. It is sarcasm, humour and cultural reference all rolled into one. It's the ultimate CAPTCHA. Multimodal AI has promise and can describe elements of the images accurately, but it misses the one thing that bots can't have: human experience. The memes above are funny because they are relatable to the real world.

Captain Picard recognised the "Darmok and Jalad at Tenagra" phrase as a peace offering because he understood the significance of the mythological characters' actions. Sure, he needed AI to explain to him who the characters were and what they did, but he inferred the rest. As is the case for ChatGPT text summarisations, image AI descriptions could boost productivity by helping human reviewers understand the basic elements of an image, but it would still be up to the human reviewer to connect the final dots.

The post Memeing AI appeared first on Prosearch.

]]>
eDiscovery in the Metaverse: ESI Strategies https://www.prosearch.com/ediscovery-in-the-metaverse-esi-strategies/?utm_source=rss&utm_medium=rss&utm_campaign=ediscovery-in-the-metaverse-esi-strategies Wed, 06 Sep 2023 21:26:50 +0000 https://www.prosearch.com/?p=4946 This is part II of a two-part post. Read part I here. The metaverse is a vision of what many in the tech industry believe is the next iteration of the internet: a single, shared, immersive, persistent, 3D virtual space where humans experience life in ways they could not in the physical world. The metaverse [...]

The post eDiscovery in the Metaverse: ESI Strategies appeared first on Prosearch.

]]>

This is part II of a two-part post. Read part I here.

The metaverse is a vision of what many in the tech industry believe is the next iteration of the internet: a single, shared, immersive, persistent, 3D virtual space where humans experience life in ways they could not in the physical world.

The metaverse is transforming how people interact and create content, introducing new dimensions of challenges and opportunities for eDiscovery processes. As this virtual realm evolves, so must the techniques and strategies used to collect, preserve, and review data for legal purposes. Let’s delve into the intricacies of navigating eDiscovery within the dynamic metaverse landscape.

Understanding Metaverse ESI

The electronically stored information within the metaverse can be categorized into three primary types: content, actions, and feelings.

  1. Content: Metaverse environments, referred to as “worlds,” are created by users and offer diverse spaces for socializing, gaming, and more. These worlds can contain chat logs, direct messages, avatars, and multimedia. Additionally, administrators manage access and may control entry to authorized individuals, ensuring context and relevance.
  2. Actions: Users’ behaviors and interactions within the metaverse are essential to consider in eDiscovery. However, since interactions can be ephemeral, capturing these actions requires specific mechanisms. Some metaverse providers have addressed this issue by continually recording the last few minutes of what a user sees. This video is constantly overwritten but can be produced as evidence of the most recent actions a user has taken.

Users can also proactively record a video of an interaction or take “photographs” of the world and the users they are interacting with as well as maintain chat-based conversations that can also be discovered.

  3. Feelings: A unique aspect of the metaverse is its immersive quality, evoking emotions and sensations in users. This emotional layer adds complexity to eDiscovery. Haptic feedback technology, which brings physical sensations to virtual experiences, contributes to the immersive environment and may have implications for legal review.

Collection Challenges and Strategies

Collecting ESI from the metaverse presents distinct challenges due to its decentralized and cloud-based format.

  1. Cloud-based Data: Data within the metaverse is often streamed from cloud servers, rather than stored locally on devices. This requires cooperation from metaverse providers to access and extract relevant information. Collaboration is essential to ensure effective data collection.
  2. Lack of Export Features: Unlike established social media and cloud platforms, metaverse software providers generally lack features for exporting data. This hinders efficient data collection for eDiscovery purposes.
  3. Proactive Review: Passive collection of metaverse ESI may not suffice for thorough investigations. Proactive reviews within the live metaverse environment, under controlled conditions, might be necessary to obtain a comprehensive understanding of user behaviors and interactions.

Meeting the Challenges: Evolving Review Strategies

Traditional eDiscovery review methods must adapt to accommodate the unique qualities of metaverse ESI.

  1. Immersive Review: The immersive nature of the metaverse demands innovative review strategies. Unlike traditional document review, some cases might require reviewers to engage with 3D content in virtual headsets to fully grasp the emotional context and significance of user actions.
  2. Emotional Perspective: Since the metaverse environment can evoke strong emotions, understanding the emotional impact of interactions is vital. Reviewing incidents involving emotional elements may necessitate immersion in the metaverse to gain a comprehensive perspective.
  3. Balancing Privacy and Investigation: While proactive reviews offer deeper insights, they must be conducted with respect for users’ privacy and nonrelevant data. Striking this balance is crucial for maintaining ethical and legal standards.

In the evolving landscape of the metaverse, eDiscovery practitioners must collaborate with legal experts, technology developers, and metaverse providers to establish effective methodologies for collecting, preserving, and reviewing data. As the metaverse continues to shape the digital realm, the exploration of novel eDiscovery processes remains a dynamic and ongoing endeavor.

If you find your organization is facing legal requirements requiring forensic collection, analysis, or review of data from any source – from network stores, email, cloud applications, or the metaverse – contact ProSearch.

The post eDiscovery in the Metaverse: ESI Strategies appeared first on Prosearch.

]]>
eDiscovery in the Metaverse: Opportunities and Challenges https://www.prosearch.com/ediscovery-in-the-metaverse-opportunities-and-challenges/?utm_source=rss&utm_medium=rss&utm_campaign=ediscovery-in-the-metaverse-opportunities-and-challenges Tue, 29 Aug 2023 21:42:36 +0000 https://www.prosearch.com/?p=4941 The metaverse, once a concept confined to science fiction, is now becoming a reality with the advent of consumer devices that immerse users in 3D environments. Bill Gates predicted the transition from traditional 2D video meetings to 3D virtual spaces, and although we’re not quite there yet, progress has been rapid. The metaverse offers [...]

The post eDiscovery in the Metaverse: Opportunities and Challenges appeared first on Prosearch.

]]>

The metaverse, once a concept confined to science fiction, is now becoming a reality with the advent of consumer devices that immerse users in 3D environments. Bill Gates predicted the transition from traditional 2D video meetings to 3D virtual spaces, and although we’re not quite there yet, progress has been rapid. The metaverse offers exciting possibilities for gaming, content creation, and social interactions, but it also presents a unique set of challenges and issues that need to be addressed.

The Evolving Metaverse Landscape: Opportunities and Challenges

In the short time since its inception, the metaverse has evolved into a space where users can interact with computer-generated environments and each other. Consumer devices like Meta’s Quest and Apple’s Vision Pro are bringing these 3D environments to life, enabling users to engage with a whole new digital dimension. However, while the potential for creative exploration and collaboration is immense, several pressing challenges need attention.

Navigating Metaverse Challenges

  1. Toxic Behavior and Inappropriate Content: The metaverse, like any online space, has experienced issues related to toxic behavior, inappropriate content, and even criminal activities. Reports of sexual harassment, grooming of underage users, and other harmful behaviors have surfaced, raising concerns about user safety and content moderation.
  2. Privacy and Data Protection: With personal avatars and interactions, users generate a wealth of personal data within the metaverse. This data raises privacy concerns, especially as it’s often stored on remote servers and accessed through VR devices. This dynamic calls for robust data protection measures and the consideration of legal frameworks like GDPR.
  3. Legal and Ethical Implications: As the metaverse blurs the lines between real and virtual worlds, it introduces complex legal and ethical questions. How should virtual property, copyright, and user-generated content be treated? How can law enforcement navigate this new realm to address criminal activities?
  4. eDiscovery in the Metaverse: For litigation purposes, the challenges of collecting, preserving, and reviewing data from the metaverse are substantial. The unique nature of 3D interactions, user behaviors, and immersive experiences demands a reimagining of traditional eDiscovery processes.

As the metaverse continues to evolve, stakeholders must collaborate to address these challenges effectively. This involves developing guidelines, regulations, and technological solutions to ensure a safe and productive virtual environment for all users.

If you find your organization is facing legal requirements requiring forensic collection, analysis, or review of data from any source – from network stores, email, cloud applications, or the metaverse – contact ProSearch.

The post eDiscovery in the Metaverse: Opportunities and Challenges appeared first on Prosearch.

]]>
Microsoft Loop https://www.prosearch.com/microsoft-loop-2/?utm_source=rss&utm_medium=rss&utm_campaign=microsoft-loop-2 Fri, 23 Jun 2023 16:14:11 +0000 https://www.prosearch.com/?p=4873 Portable components that stay in sync and move freely across Microsoft 365 apps - Damir Kahvedžić, Ph.D. and Ryan Hemmel - Ever since being released in 2016, Microsoft Teams has become the de facto standard in corporate collaborative chat software. It has subsumed Microsoft Classrooms1 and Skype for Business2 and is a serious challenger [...]

The post Microsoft Loop appeared first on Prosearch.

]]>

Portable components that stay in sync and move freely across Microsoft 365 apps

– Damir Kahvedžić, Ph.D. and Ryan Hemmel –

Ever since being released in 2016, Microsoft Teams has become the de facto standard in corporate collaborative chat software. It has subsumed Microsoft Classrooms[1] and Skype for Business[2] and is a serious challenger to Slack and Zoom[3]. In the aftermath of COVID-19 and subsequent lockdowns around the globe, Teams usage has further skyrocketed[4]. Microsoft has invested significant energy into developing Teams, and new features have been integrated at a brisk pace. Not only has the pace of change given us some hilarious Teams usage fails as users come to grips with the new features, but it has also introduced several compliance and eDiscovery challenges. How can we keep up with these changes and present a comprehensive and accurate data set to our clients? Sometimes not even Microsoft has an answer.

Exhibit A: Microsoft Loop components. Loop components were introduced to Microsoft Teams last year, allowing chat participants to collaboratively edit a single message in real time across multiple Microsoft 365 tools[5]. The problem is that Loop components are saved as custom independent files in OneDrive, and once collected, there is no way to review the contents. There are no tools within Microsoft or otherwise that can accurately display the content within a Loop component. Microsoft’s own advice is to turn the feature off if this presents a problem for discovery or compliance. Given that Loop is enabled by default, this guidance may come too late for clients deep into their litigations when their users have already generated and shared Loop content.

Let’s explore Loop components and the challenges they present.

What are Loop Components?

Loop components are a new Microsoft data type allowing users to create sharable content that can be modified collaboratively by multiple users across many Microsoft 365 applications, including Teams, Outlook, and Word. Any user with access to the chat, email, or file where a Loop component is shared can collectively edit the contents of that Loop. Think of them as much smaller shared Microsoft 365 documents. Where a document is a large object containing lots of different content, a Loop is a small object (a list, a table, a small piece of text) that can be embedded within other applications, in our case a Teams chat. In both cases, documents and Loop components, when the resource is edited in one location, the changes are propagated in real time to all locations where the file has been shared.

Figure 1. Loop components in the Teams interface with a sample list Loop

Currently, six types of Loop components can be created in Teams: Bulleted List, Checklist, Numbered List, Paragraph, Table, and Task List. Regardless of type, the Loop component is stored as a file with a .fluid extension in the OneDrive of the person who created it. It is simply presented as a link in any target document that displays it. They are essentially treated as a new type of modern attachment.

What’s the Problem?

A new feature is not necessarily good news for eDiscovery or compliance. Here are just a few issues that we have identified:

Collection Is Difficult

The Loop component is a new binary file type viewable only in Microsoft 365 applications. The .fluid file in which the data is stored cannot be previewed or processed using Microsoft’s own eDiscovery platform, Purview, or with other processing tools. The file is not indexed and thus will not be found when performing a keyword search to look for content within the Loop component.

Instead, the file can only be collected as part of a general sweep of documents in a custodian’s OneDrive or, since it is a modern attachment, as part of a collection targeting data types where that Loop is shared, such as a Teams chat. Microsoft Purview eDiscovery Premium is needed to collect modern attachments.

No Native Review

Microsoft 365 Purview exports a native fluid file and a text file of the extracted text. There is no offline viewer for the native, and the corresponding text file is an empty 0KB file.

Microsoft is currently working on an “offline consumable export format” to address this issue, but there is no date for when this will become available. The only solution for viewing the file is for users to upload the file into their own Microsoft 365 instance and view it there. This is obviously not an appropriate solution in controlled eDiscovery environments. Relativity viewer integration is even further off.

Audit and History Tracking is Tricky

The main purpose of Loop components is continuous collaborative editing of content. If we could review the fluid files, we would also like to know who edited what part of the component and when. A typical scenario is an old Loop component being edited now. When doing collections within a certain time range, we would want to see the version of the Loop during the specified period, not the most current one. True, this is an issue in any modern attachments, such as a Word or Excel file, but it is doubly important for Loop components since collaborative editing is the entire point of their existence.

The good news is that Microsoft is planning to roll out a feature that will allow the collection of a modern attachment’s version as it was when the document was shared. It will be interesting to see how this is handled with fluid files since their content is much more (dare I say it) fluid than the file analogues.

Figure 2: Versions of a sample Loop

Handling Loop Components

Relativity Review

It is possible that Loop files form parts of document deliveries already. These files would have a .fluid extension and can be easily found by searching for that extension. Relativity can’t process the binary file and would flag the documents as Relativity Native Type: Unknown format.

Interestingly, Relativity does extract some text from the native fluid file. Even though the file is a binary file, some elements of the Loop are found in clear text and are therefore searchable. For example, the below extracted text was found for the Loop in the screenshot at the top of the page. The user-added content is found as text within the Loop binary. Of course, this just finds text. It does not show the formatting, audit, images, or structure of the Loop. But in the absence of anything else, it is something.

Figure 3: Extracted text of a sample Loop
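That behaviour is consistent with a classic strings-style extraction: scanning the binary for runs of printable characters. A minimal Python sketch of the idea (not Relativity's actual pipeline; the sample blob is invented):

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII of at least min_len characters."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

# A stand-in binary blob with user content embedded in clear text,
# mimicking what is observed inside .fluid files
blob = b"\x00\x01fluid\xff\xfe" + b"Quarterly action items" + b"\x00\x03\x9c"
print(extract_strings(blob))  # -> ['fluid', 'Quarterly action items']
```

As the figure above shows, this kind of extraction recovers the words but none of the formatting, audit trail, images or structure, which is exactly its limitation.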

Turning Off Loop

Given the issues that Loop components present for eDiscovery and compliance, some organizations may wish to disable them within their Microsoft 365 tenant. To do so, Microsoft 365 administrators can use the latest version of the SharePoint PowerShell module. After connecting to the module, run the command Get-SPOTenant | Select-Object -Property IsFluidEnabled to verify whether the value for IsFluidEnabled is set to true. To disable Loop components, run the command Set-SPOTenant -IsFluidEnabled $false.

Figure 4: Disabling Loop via PowerShell

Per Microsoft, this change can take a short time to apply across your organization’s Microsoft 365 tenant. If your organization has multiple regions and/or Microsoft 365 tenants, you will need to disable Loop components in each of those regions/tenants.

Once disabled, the Loop button will no longer appear for end users within the Teams client. Previously shared Loop components will no longer render within Teams but will instead be displayed with the text “Loop component,” which links to the component within SharePoint/OneDrive.

Figure 5: Behavior in Teams after disabling Loop

Is This Really a Problem?

New software, new data types, and new files are constantly being developed. We can’t expect our favorite eDiscovery platforms to support every type of file out there. So what’s the problem? Given that 270 million people use Teams in their day-to-day working lives and that remote work is here to stay, it is safe to say that Loop components will feature in many a data set in the future[6].

The Loop feature is on by default in all Microsoft 365 tenants. Loop components have been recognized as a significant Teams addition by Microsoft, and Microsoft has announced plans to further expand Loop capabilities. At ProSearch, we have already seen data sets with Loop components in them. They will quickly become an unavoidable data type within discovery collections.

By the time clients realize they are a problem for litigation, it may be too late. There may be relevant content in Loop components that litigation teams will want to review. ProSearch’s Microsoft 365 Advisory Services team has a finger on the pulse of the latest developments and keeps clients informed of them so that they can identify issues before they become a problem. For more information on Loop components, or any other Microsoft 365 features, contact Damir Kahvedžić, or Ryan Hemmel.

[1] https://support.microsoft.com/en-us/topic/microsoft-teams-5aa4431a-8a3c-4aa5-87a6-b6401abea114?redirectSourcePath=%252farticle%252faea2bae4-40d3-4a10-bd69-ea8fc7313795

[2] https://learn.microsoft.com/en-us/microsoftteams/skype-for-business-online-retirement

[3] https://zapier.com/blog/zoom-vs-teams/

[4] https://www.windowscentral.com/microsoft-teams-now-has-more-270-million-monthly-active-users

[5] https://learn.microsoft.com/en-us/microsoftteams/loop-components-in-teams

[6] https://www.businessofapps.com/data/microsoft-teams-statistics/

View Original PDF

The post Microsoft Loop appeared first on Prosearch.

]]>
Microsoft Build 2023 https://www.prosearch.com/microsoft-build-2023/?utm_source=rss&utm_medium=rss&utm_campaign=microsoft-build-2023 Wed, 14 Jun 2023 20:37:01 +0000 https://www.prosearch.com/?p=4836 Microsoft Build 2023 Last month Microsoft held its annual developers’ conference: Microsoft Build 2023. Over three days Microsoft covered everything from new tools and new features to workshops and even some metaphysical questions. It was useful if you are a programmer, but also for finding out what Microsoft is encouraging its developers to make. AI [...]

The post Microsoft Build 2023 appeared first on Prosearch.

]]>

Microsoft Build 2023

Last month Microsoft held its annual developers’ conference: Microsoft Build 2023. Over three days Microsoft covered everything from new tools and new features to workshops and even some metaphysical questions. It was useful if you are a programmer, but also for finding out what Microsoft is encouraging its developers to make.

AI . . . AI Everywhere

Not surprisingly, Build 2023 was all about AI and how Microsoft is making good on its investment in OpenAI. Bing is the new default search engine in ChatGPT and will be used to keep it as current as possible. Development tools have been enhanced to more easily create AI models, as well as to use AI models to make coding itself easier.

Developers can employ Microsoft tools to add custom data sets for domain-specific AI bots. You can generate a legal expert bot, for example, by simply gathering legal domain knowledge, building a generative AI model based on ChatGPT or other sources, and then packing it all into a chatbot for use.

One of the more interesting features of Microsoft’s AI development stack is the focus on AI knowledge provenance. Developers can take advantage of this feature to explicitly state the AI model’s reference material. It only works with images and media for now, but it’s a great step toward addressing content creators’ rights. Maybe in the future it will help with the hallucination problem as well.

Copilots

Copilots are Microsoft’s latest digital assistants, and these will use large language models to answer questions in a way familiar to ChatGPT users. There will be many of them: Copilot in Azure, Copilot in Teams, and now Copilot in Windows 11 itself. The assistant will function much like ChatGPT and be able to carry out some common Windows tasks. It will summarize documents, answer general questions, and essentially be everything Cortana should have been.

Copilots will use integrations provided by third-party plug-ins to handle more advanced requests. Windows 11 Copilot integrates with Teams to automatically send messages and chats, for example. Microsoft is encouraging other third parties to create more plug-ins as well.
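
The dispatch idea underneath those plug-ins can be pictured as a registry the assistant consults per request. This is a minimal hypothetical sketch, not Microsoft's real plug-in interface: the `plugin` decorator, the intent names, and the stubbed Teams handler are all invented for illustration.

```python
# Illustrative sketch of assistant-to-plug-in routing: each plug-in
# registers the intent it handles, and the assistant dispatches to it.

from typing import Callable

PLUGINS: dict[str, Callable[[str], str]] = {}

def plugin(intent: str):
    """Register a handler function for a given user intent."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[intent] = fn
        return fn
    return wrap

@plugin("send_teams_message")
def send_teams_message(payload: str) -> str:
    # A real plug-in would call the Teams API here.
    return f"Teams message queued: {payload!r}"

def copilot_dispatch(intent: str, payload: str) -> str:
    """Route a recognized intent to its plug-in, if one is registered."""
    handler = PLUGINS.get(intent)
    return handler(payload) if handler else "No plug-in handles this request."

print(copilot_dispatch("send_teams_message", "Running late, start without me"))
```

The appeal of this shape is that third parties only write handlers; the assistant decides which intent a natural-language request maps to.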

Watch it in action here: https://youtu.be/FCfwc-NNo30.

Immersive Teams

Speaking of Teams, the long-awaited avatars are going to be available soon for general use. So if you are hesitant to show your face on camera in Teams calls, you can use your 3D cartoon avatar instead. Some attendees used them during the Build streaming events. It’s all to make calls more immersive, and they will also be usable in 3D worlds like Mesh or the general metaverse, whenever that takes off.

Metaverse Metaphysical Questions

The metaverse took a back seat to all the news about AI. (Microsoft has drastically downsized metaverse development.) But it did show up in updates to the Mesh metaverse as well as Teams integration. It also figured into the most metaphysical of all discussions.

Today, AI like ElevenLabs can accurately deepfake the human voice; MetaHuman can re-create very realistic 3D avatars; and ChatGPT can answer questions in the style of famous people. Just ask ChatGPT to answer queries as Shakespeare or Edgar Allan Poe to see what I mean.

So what happens if somebody trains a chat model on a living person’s knowledge using their social media, photos, and other personal documents as a data set and then merges it with that person’s physical and vocal AI models as well? You now have a complete digital twin that can sound like you, look like you, and answer questions the way you would.

Identity issues aside, what happens to the aggregate AI model once that human passes on? Will it be deleted, archived, stored in a digital museum? Do you want it to keep going and learning, forever maintaining your presence in the world, or to be cast out into a digital cemetery?! Yeah, deep questions.

That was the focus of Build’s metaverse discussion – the merging of AI and the metaverse. It was far too big a topic to address in 30 minutes, but if you ask me, all we need to do is watch Black Mirror’s Be Right Back episode to see where this leads. The show explored this situation with eerie accuracy 10 long years ago.

Roundup

It’s clear from Build 2023 that Microsoft wants to be the face of AI, both in the consumer space with Copilot and in the developer segment, where it is encouraging the development of domain-specific AIs. It’s likely that most developers will incorporate at least some form of AI in their development stacks. AI will be ubiquitous to the point that it won’t need to be described as such. Just as smartphones are now just called phones, AI-infused or AI-created software will soon just be better and more useful software. And if Microsoft has its way, it will all be built with Microsoft-owned tools and services.

The post Microsoft Build 2023 appeared first on Prosearch.

]]>